Dataset.jl and Corresponding Tests by etontackett · Pull Request #3 · vp314/experiments-RidgeRegression

etontackett · 2026-03-05T14:29:45Z

Add Dataset struct, CSV loader, one-hot encoding, and corresponding tests.

Description

This pull request introduces dataset.jl, which defines the Dataset struct and supports loading and preprocessing the datasets used in the ridge regression experiments. The Dataset type stores a feature matrix, a response vector, and the dataset name. The file also implements a function (load_csv_dataset) that loads datasets from CSV files or URLs and removes rows with missing values. Categorical variables are converted to numeric form by the one_hot_encoding function. Tests were added to verify the dataset constructor, the one-hot encoding process, and the CSV dataset loading functionality.

This revision addresses the earlier review comments: code comments now describe code flow and functionality, all print statements were removed, the unit tests were updated and expanded to improve code coverage, dependencies were added to the Project.toml file, and docstrings were revised to follow standard formatting and documentation style, including the use of admonitions.

Motivation and Context

Many datasets contain categorical variables or missing values, which must be handled before ridge regression methods can be applied. This change standardizes dataset preprocessing so that every dataset has numeric features, no missing data, and one-hot encoded categorical variables. This in turn provides consistent experimental units for the ridge regression framework.

Types of changes

Checklists:

Code and Comments
If this PR includes modifications to the code base, please select all that apply.

My code follows the code style of this project.
I have updated all package dependencies (if any).
I have included all relevant files to realize the functionality of the PR.
I have exported relevant functionality (if any).

API Documentation

For every exported function (if any), I have included a detailed docstring.
I have checked the spelling and grammar of all docstring updates through an external tool.
I have checked that the docstring's function signature is correctly formatted and has all arguments.
I have checked that the docstring's list of arguments, fields, or return values match the function.
I have compiled the docs locally and read through all docstring updates to check for errors.

Manual Documentation

I have checked the spelling and grammar of all manual updates through an external tool.
Any code included in the docstring is tested using doc tests to ensure consistency.
I have compiled the docs locally and read through all manual updates to check for errors.

Testing

I have added unit tests to cover my changes. (For macros, be sure to check @code_lowered and @code_typed.)
All new and existing tests passed.
I have achieved sufficient code coverage.

vp314 · 2026-03-10T16:52:51Z

This is an update to the pages of the documentation (i.e., the manual). So you should go through and check that you are doing all the things required for manual pages updates. Also, PRs should be more focused.

vp314 · 2026-03-10T16:53:45Z

+using CSV
+using DataFrames
+using Downloads
+
+export Dataset, csv_dataset


In Julia, we put using/import statements in the main source file. We do the same for export statements.
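The layout described here could look roughly like the sketch below. The module name `RidgeSketch` and the helper `row_norms` are hypothetical, and `LinearAlgebra` stands in for the PR's actual dependencies so that the snippet runs with the standard library alone.

```julia
module RidgeSketch  # hypothetical module name, for illustration only

# All `using`/`import` statements sit at the top of the main source file.
using LinearAlgebra

# In the real package this would be `include("dataset.jl")`; the definition
# is written inline here so that the sketch is self-contained.
row_norms(X::AbstractMatrix) = [norm(row) for row in eachrow(X)]

# Exports also live in the main source file, not in the included files.
export row_norms

end # module
```

With this arrangement, the included files contain only definitions, and a reader can see every dependency and every exported name in one place.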

vp314 · 2026-03-10T16:55:05Z

All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and correct environment for the package.
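As a rough sketch of that workflow, assuming the commands are run from the package root and that the dependencies are the three packages named in the diff above:

```julia
using Pkg

# Activate the package's own environment so that changes are written to its
# Project.toml rather than to the default global environment.
Pkg.activate(".")

# Record the dependencies under [deps] in Project.toml.
# The package list is taken from the `using` lines in this PR.
Pkg.add(["CSV", "DataFrames", "Downloads"])
```

The same steps can be done interactively from the Pkg REPL mode with `activate .` followed by `add CSV DataFrames Downloads`.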

vp314 · 2026-03-10T16:55:58Z

+# Arguments
+- `path_or_url::String`
+    Local file path or web URL that has CSV data.
+
+- `target_col`
+    Column index OR column name containing the response variable.
+
+- `name::String`
+    Dataset name.
+
+# Returns
+`Dataset`


Need to abide by the style guide as you have done above.

vp314 · 2026-03-10T16:56:48Z

+        lv = unique(scol)
+        ind = scol .== permutedims(lv)
+
+        println("Variable: $name")


We should not have print statements inside of code.

vp314 · 2026-03-10T17:01:10Z

+# Throws
+- `ArgumentError`: If rows in `X` does not equal length of `y`.
+
+# Notes


Notes should be admonitions. See documenter.jl's documentation on admonitions.
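For illustration, a docstring that uses a note admonition instead of a `# Notes` section might look like this. The function `same_rows` is a hypothetical example, not part of the PR.

```julia
"""
    same_rows(X, y)

Return `true` when `X` has as many rows as `y` has entries.

!!! note
    Only the first dimension of `X` is compared with `length(y)`;
    the number of columns is not checked.
"""
same_rows(X::AbstractMatrix, y::AbstractVector) = size(X, 1) == length(y)
```

The admonition body is indented four spaces under the `!!! note` line, which is how Julia's Markdown parser recognizes it.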

vp314 · 2026-03-17T16:55:07Z

+"""
+    Dataset(name, X, y)
+
+Contains datasets for ridge regression experiments.
+
+# Fields
+- `name::String`: Name of dataset
+- `X::Matrix{Float64}`: Matrix of variables/features
+- `y::Vector{Float64}`: Target vector
+
+# Throws
+- `ArgumentError`: If rows in `X` does not equal length of `y`.


There should be documentation for the struct being created and then there should be documentation for the constructor in the same docstring.
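One possible shape for such a combined docstring is sketched below. The section headings and the name `DatasetSketch` are assumptions made for illustration; the project's style guide decides the exact layout.

```julia
"""
    DatasetSketch

Design matrix and response vector for one regression problem.

# Fields
- `name::String`: Name of the dataset.
- `X::Matrix{Float64}`: Design matrix.
- `y::Vector{Float64}`: Response vector.

# Constructors

    DatasetSketch(name, X, y)

Build a dataset after checking that `X` has as many rows as `y` has entries.

## Throws
- `ArgumentError`: If `size(X, 1) != length(y)`.
"""
struct DatasetSketch
    name::String
    X::Matrix{Float64}
    y::Vector{Float64}

    function DatasetSketch(name, X, y)
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))
        return new(name, Matrix{Float64}(X), Vector{Float64}(y))
    end
end
```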

vp314 · 2026-03-17T16:57:58Z

+        size(X, 1) == length(y) ||
+            throw(ArgumentError("X and y must have same number of rows"))
+
+        new(name, Matrix{Float64}(X), Vector{Float64}(y))


If you are interested in looking at sparse design matrices, this functionality precludes that as any matrix would be converted to Matrix{Float64} type which is dense. You can fix this by considering parametric types or Union types for the fields.
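A minimal sketch of the parametric-type option, assuming the struct keeps the same three fields. The name `ParametricDataset` is hypothetical, and requiring `X` and `y` to share an element type is a design choice made here for simplicity.

```julia
using SparseArrays

# The matrix and vector types are parameters, so the stored fields keep
# whatever storage the caller passes in (dense, sparse, views, ...).
struct ParametricDataset{T<:Real,M<:AbstractMatrix{T},V<:AbstractVector{T}}
    name::String
    X::M
    y::V

    function ParametricDataset(
        name::AbstractString, X::AbstractMatrix{T}, y::AbstractVector{T}
    ) where {T<:Real}
        size(X, 1) == length(y) ||
            throw(ArgumentError("X and y must have same number of rows"))
        # No conversion to Matrix{Float64}: the input types are preserved.
        return new{T,typeof(X),typeof(y)}(String(name), X, y)
    end
end

# Usage: a sparse design matrix stays sparse.
Xs = sparse([1.0 0.0; 0.0 2.0])
ds = ParametricDataset("toy", Xs, [1.0, 2.0])
```

Compared with a `Union` of field types, the parametric form keeps the fields concretely typed for each instance, which generally lets the compiler specialize code that works on them.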

vp314 · 2026-03-17T17:00:17Z

+# Returns
+- `Dataset`: A dataset containing the encoded feature matrix `X`, response vector `y`, and dataset name.
+"""
+function csv_dataset(path_or_url::String;


This does not follow BlueStyle
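As one reading of the BlueStyle guidance on long signatures, each argument goes on its own line with a trailing comma, the closing parenthesis sits on its own line, keyword defaults are written without spaces around `=`, and the function ends in an explicit `return`. The function name and body below are placeholders, and the argument types are assumptions.

```julia
# Hypothetical stand-in for the loader; only the signature layout matters here.
function load_sketch(
    path_or_url::AbstractString;
    target_col::Union{Integer,Symbol,AbstractString},
    name::AbstractString="csv_dataset",
)
    return (path=String(path_or_url), target=target_col, name=String(name))
end
```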

vp314 · 2026-03-17T17:01:21Z

+# Returns
+- `Matrix{Float64}`: A numeric matrix containing the encoded feature.
+"""
+function one_hot_encode(Xdf::DataFrame; drop_first::Bool = true)::Matrix{Float64}


Maybe this function should focus on one-hot encoding a specific column provided to the function rather than an entire data frame as we do not always know which columns should be one-hot encoded just from their type. Think of categorical data that is saved in the data set as integers rather than as words.
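A column-level encoder along these lines could be sketched as follows. The name `one_hot_column` and the choice to sort the levels are assumptions; the broadcast against `permutedims` mirrors the approach already used in the PR.

```julia
# Encode a single column. The caller decides which columns are categorical,
# so integer-coded categories work the same way as string labels.
function one_hot_column(col::AbstractVector; drop_first::Bool=true)
    levels = sort(unique(col))
    if drop_first
        levels = levels[2:end]  # drop the first level as the baseline
    end
    # Compare each entry (rows) against each remaining level (columns).
    return Float64.(col .== permutedims(levels))
end

# Categories stored as integers rather than as words.
encoded = one_hot_column([3, 1, 2, 1])
```

Here `encoded` has one row per observation and one column per retained level (levels `2` and `3`, since level `1` is dropped as the baseline).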

vp314 · 2026-03-17T17:02:27Z

+
+end
+"""
+    csv_dataset(path_or_url; target_col, name="csv_dataset")


This is a bad function name.

vp314 · 2026-03-17T17:03:26Z

+export Dataset, csv_dataset, one_hot_encode
+
+include("dataset.jl")


You should include before you export generally, but if it works this is fine too.

vp314

Multiple edits to improve the presentation, style and logic of the PR

vp314 · 2026-04-07T17:12:31Z

+A dataset for Ridge Regression experiements.
+
+# Description
+
+A `Dataset` object stores the design matrix ``X`` and response vector ``y``
+for a regression problem. These datasets serve as the experimental units for ridge regression experiments, allowing us to evaluate the performance of ridge regression models on various datasets.


You should consolidate this into a few sentences that describes the nature of what is going on as concisely as possible.

vp314 · 2026-04-07T17:15:30Z

+# Keyword Arguments
+- `cols_to_encode`: A collection of column names or indices to one-hot encode.
+- `drop_first::Bool=true`: If `true`, drop the first dummy column for
+  each categorical feature to avoid multicollinearity.


The indentation here is not consistent with the indent level in line 51. Please check the BlueStyle guide to understand how much indentation is needed.

vp314 · 2026-04-07T17:15:46Z

+  each categorical feature to avoid multicollinearity.
+
+# Returns
+- `Matrix{Float64}`: A numeric matrix containing the encoded feature.


Suggested change

- `Matrix{Float64}`: A numeric matrix containing the encoded feature.

- `::Matrix{Float64}`: A numeric matrix containing the encoded feature.

vp314 · 2026-04-07T17:17:50Z

+
+
+    for name in names(Xdf) #Selecting columns that aren't the target variable and pushing them to the columns.
+        col = Xdf[!, name]


Maybe move this inside the first if statement on line 75

vp314 · 2026-04-07T17:23:17Z

Individual test files should be wrapped as their own modules.
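A test file wrapped in its own module might look like the sketch below. The module name is hypothetical, and a real test file would also load the package under test with `using`; that line is left as a comment because the package name is not stated here.

```julia
module DatasetConstructorTests  # hypothetical module name

using Test
# using <PackageUnderTest>  # assumption: the real file loads the package here

@testset "design matrix and response have matching rows" begin
    X = [1.0 2.0; 3.0 4.0]
    y = [1.0, 2.0]
    @test size(X, 1) == length(y)
end

end # module
```

Wrapping each file this way keeps helper names and imports from one test file from leaking into another.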

vp314 · 2026-04-07T17:25:52Z

Where do you test for missing values?
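Such a test could be sketched as follows. The helper `drop_incomplete_rows` is a hypothetical stand-in for the loader's row filtering, written against a plain matrix so that the snippet needs only the standard library.

```julia
using Test

# Hypothetical stand-in: keep only the rows that contain no `missing` entries.
function drop_incomplete_rows(X::AbstractMatrix)
    keep = [!any(ismissing, row) for row in eachrow(X)]
    return X[keep, :]
end

@testset "rows with missing values are removed" begin
    raw = [1.0 2.0; missing 3.0; 4.0 5.0]
    cleaned = drop_incomplete_rows(raw)
    @test size(cleaned) == (2, 2)
    @test !any(ismissing, cleaned)
    @test Float64.(cleaned) == [1.0 2.0; 4.0 5.0]
end
```

A test against the actual loader would follow the same pattern: feed it a small CSV with a known missing cell and assert on the number of rows that survive.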

EtonT471 added 3 commits March 2, 2026 15:32

Add dataset utilities and tests

7dc2968

Adding dataset_tests.jl

47fc954

Small changes to design.md

787a88d

vp314 self-requested a review March 10, 2026 16:49

vp314 requested changes Mar 10, 2026

EtonT471 added 2 commits March 16, 2026 22:28

March 16 Updates

2dd2295

Ridge Regression file

173cdd1

etontackett changed the title ~~Feature/datasets~~ Dataset.jl and Corresponding Tests Mar 17, 2026

EtonT471 added 3 commits March 16, 2026 23:12

dataset.jl small update

042d5d6

Updated Experimental Units and Treatments Sections

70cfa67

Small changes

9b8c48c

vp314 requested changes Mar 17, 2026

EtonT471 added 11 commits March 20, 2026 18:45

Changes

bf9707b

Ridge Regreesion jl changes

214e74d

Fix test environment and dataset tests

a87c905

Remove design.md from this PR (separate design branch)

9e1ad08

3/22 Updates

13fc5bf

New Edits

5cbeb0b

Design added and tests

121fbc0

Fixing project.toml

049bacc

fixing root project.toml

fa3a24a

Adding Linear Algebra

e1b0106

Adding csv

17a8d48

vp314 requested changes Apr 7, 2026

3 participants