Skip to content

Dataset.jl and Corresponding Tests#3

Open
etontackett wants to merge 19 commits intomainfrom
feature/datasets
Open

Dataset.jl and Corresponding Tests#3
etontackett wants to merge 19 commits intomainfrom
feature/datasets

Conversation

@etontackett
Copy link
Copy Markdown
Collaborator

@etontackett etontackett commented Mar 5, 2026

Add Dataset struct, CSV loader, one-hot encoding, and corresponding tests.

Description

This pull request introduces dataset.jl, which defines the Dataset struct and supports loading and preprocessing datasets used in Ridge Regression Experiments. The Dataset type provides a representation consisting of a matrix, response vector, and the dataset name. The file also implements a function (load_csv_dataset) for loading datasets from CSV files or URLs and removing rows with missing values. Additionally, categorical variables are converted into numeric form using the one_hot_encoding function. Tests were added to verify the correctness of the dataset constructor, the one-hot encoding process, and the CSV dataset loading functionality.

Issues resolved include improving code documentation by adding comments that describe code flow and functionality. Removed any print statements. The unit tests were updated and expanded to better improve code coverage. Any dependencies were added to the .toml file. Doc strings were also revised to follow standard formatting and documentation style, including the use of admonitions.

Motivation and Context

Many datasets contain categorical or missing variables, which must be handled prior to applying ridge regression methods. This change standardizes datasets and preprocessing workflows to ensure that we have numeric features, no missing data, and converting categorical variables to one-hot encoded features. This in turn provides consistent experimental units for the ridge regression framework.

Types of changes

  • CI
  • Docs
  • Feature
  • Fix
  • Performance
  • Refactor
  • Style
  • Test
  • Other (use sparingly):

Checklists:

Code and Comments
If this PR includes modifications to the code base, please select all that apply.

  • My code follows the code style of this project.
  • I have updated all package dependencies (if any).
  • I have included all relevant files to realize the functionality of the PR.
  • I have exported relevant functionality (if any).

API Documentation

  • For every exported function (if any), I have included a detailed docstring.
  • I have checked the spelling and grammar of all docstring updates through an external tool.
  • I have checked that the docstring's function signature is correctly formatted and has all arguments.
  • I have checked that the docstring's list of arguments, fields, or return values match the function.
  • I have compiled the docs locally and read through all docstring updates to check for errors.

Manual Documentation

  • I have checked the spelling and grammar of all manual updates through an external tool.
  • Any code included in the docstring is tested using doc tests to ensure consistency.
  • I have compiled the docs locally and read through all manual updates to check for errors.

Testing

  • I have added unit tests to cover my changes. (For Macros, be sure to check
    @code_lowered and
    @code_typed)
  • All new and existing tests passed.
  • I have achieved sufficient code coverage.

@vp314 vp314 self-requested a review March 10, 2026 16:49
Comment thread docs/src/design.md
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an update to the pages of the documentation (i.e., the manual). So you should go through and check that you are doing all the things required for manual pages updates. Also, PRs should be more focused.

Comment thread src/dataset.jl Outdated
Comment on lines +1 to +5
using CSV
using DataFrames
using Downloads

export Dataset, csv_dataset
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Julia, we put using/import statements in the main source file. We do the same for export statements.

Comment thread src/dataset.jl
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and correct environment for the package.

Comment thread src/dataset.jl Outdated
Comment on lines +95 to +106
# Arguments
- `path_or_url::String`
Local file path or web URL that has CSV data.

- `target_col`
Column index OR column name containing the response variable.

- `name::String`
Dataset name.

# Returns
`Dataset`
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to abide by the style guide as you have done above.

Comment thread src/dataset.jl Outdated
lv = unique(scol)
ind = scol .== permutedims(lv)

println("Variable: $name")
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have print statements inside of code.

Comment thread src/dataset.jl Outdated
Comment thread test/dataset_tests.jl
Comment thread src/dataset.jl Outdated
# Throws
- `ArgumentError`: If rows in `X` does not equal length of `y`.

# Notes
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes should be admonitions. See documenter.jl's documentation on admonitions.

@etontackett etontackett changed the title Feature/datasets Dataset.jl and Corresponding Tests Mar 17, 2026
Comment thread src/dataset.jl
Comment on lines +1 to +12
"""
Dataset(name, X, y)

Contains datasets for ridge regression experiments.

# Fields
- `name::String`: Name of dataset
- `X::Matrix{Float64}`: Matrix of variables/features
- `y::Vector{Float64}`: Target vector

# Throws
- `ArgumentError`: If rows in `X` does not equal length of `y`.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be documentation for the struct being created and then there should be documentation for the constructor in the same docstring.

Comment thread src/dataset.jl Outdated
size(X, 1) == length(y) ||
throw(ArgumentError("X and y must have same number of rows"))

new(name, Matrix{Float64}(X), Vector{Float64}(y))
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are interested in looking at sparse design matrices, this functionality precludes that as any matrix would be converted to Matrix{Float64} type which is dense. You can fix this by considering parametric types or Union types for the fields.

Comment thread src/dataset.jl Outdated
# Returns
- `Dataset`: A dataset containing the encoded feature matrix `X`, response vector `y`, and dataset name.
"""
function csv_dataset(path_or_url::String;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not follow BlueStyle

Comment thread src/dataset.jl Outdated
# Returns
- `Matrix{Float64}`: A numeric matrix containing the encoded feature.
"""
function one_hot_encode(Xdf::DataFrame; drop_first::Bool = true)::Matrix{Float64}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this function should focus on one-hot encoding a specific column provided to the function rather than an entire data frame as we do not always know which columns should be one-hot encoded just from their type. Think of categorical data that is saved in the data set as integers rather than as words.

Comment thread src/dataset.jl Outdated

end
"""
csv_dataset(path_or_url; target_col, name="csv_dataset")
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad function name.

Comment thread src/RidgeRegression.jl Outdated
Comment on lines +6 to +8
export Dataset, csv_dataset, one_hot_encode

include("dataset.jl")
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should include before you export generally, but if it works this is fine too.

Copy link
Copy Markdown
Owner

@vp314 vp314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple edits to improve the presentation, style and logic of the PR

Comment thread src/dataset.jl
Comment thread src/dataset.jl
Comment on lines +4 to +9
A dataset for Ridge Regression experiements.

# Description

A `Dataset` object stores the design matrix ``X`` and response vector ``y``
for a regression problem. These datasets serve as the experimental units for ridge regression experiments, allowing us to evaluate the performance of ridge regression models on various datasets.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should consolidate this into a few sentences that describes the nature of what is going on as concisely as possible.

Comment thread src/dataset.jl
# Keyword Arguments
- `cols_to_encode`: A collection of column names or indices to one-hot encode.
- `drop_first::Bool=true`: If `true`, drop the first dummy column for
each categorical feature to avoid multicollinearity.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indenting here is not consistent with the indent level in line 51. Please check the bluestyle guide to understand how much indentation is needed.

Comment thread src/dataset.jl
each categorical feature to avoid multicollinearity.

# Returns
- `Matrix{Float64}`: A numeric matrix containing the encoded feature.
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `Matrix{Float64}`: A numeric matrix containing the encoded feature.
- `::Matrix{Float64}`: A numeric matrix containing the encoded feature.

Comment thread src/dataset.jl


for name in names(Xdf) #Selecting columns that aren't the target variable and pushing them to the columns.
col = Xdf[!, name]
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this inside the first if statement on line 75

Comment thread src/dataset.jl
Comment thread src/dataset.jl
Comment thread src/dataset.jl
Comment thread test/dataset_tests.jl
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Individual test files should be wrapped as their own modules.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you test for missing values?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants