M sstats big/work/20260513 arrow chunking followups by Rudhik1904 · Pull Request #17 · Vitek-Lab/MSstatsBig

Rudhik1904 · 2026-05-18T21:53:34Z

Motivation and Context

reduceBigSpectronaut() previously streamed Spectronaut CSV exports through readr::read_delim_chunked, which holds a string-interning pool that grows across chunks and pushed peak memory well above one batch's working set. This branch ports the reader to Arrow (which releases per-batch state) and follows up with the three remaining items from TODO-arrow_chunking_followups.md:

The Arrow block_size was Arrow's 256 KiB default — too small for wide Spectronaut rows, causing Invalid: straddling object straddles two block boundaries errors and excessive parser overhead.
End users had no documented escape hatch when the straddling error did fire on extra-wide exports.
cleanSpectronautChunk() ran every batch through ~13 sequential dplyr verbs on a data.frame, producing repeated transient column allocations and fragmenting R's allocator. The rest of the MSstatsConvert family is data.table-native.

Solution: replace readr with Arrow's CsvReadOptions / Scanner / ToRecordBatchReader, raise the default block_size to 16 MiB and expose it as a user parameter, document the straddling-object workaround at the parameter, and rewrite cleanSpectronautChunk() in pure data.table.

Changes

Arrow reader replaces readr::read_delim_chunked in reduceBigSpectronaut(). Uses arrow::open_dataset() + Scanner + ToRecordBatchReader to stream record batches; preserves the existing comma/tab/semicolon delimiter switch via CsvParseOptions$delimiter. Per-batch progress logging every 1,000 batches.
New block_size parameter on both reduceBigSpectronaut() and the user-facing bigSpectronauttoMSstatsFormat(). Default 16L * 1024L * 1024L (16 MiB) replaces Arrow's 256 KiB default. Coerced to integer and validated (length 1, non-NA, positive).
Roxygen @param block_size documents the exact error string (Invalid: straddling object straddles two block boundaries) and the recommended override (64L * 1024L * 1024L) so users hitting the straddling error on pathological rows have a self-service fix.
cleanSpectronautChunk() rewritten in data.table. setDT(input) at entry; column selection via dt[, cols, with = FALSE]; two-step rename via setnames(..., skip_absent = TRUE) matching the MSstatsConvert family convention; in-place column updates via :=; conditional NA assignment via mask form dt[cond, Intensity := NA_real_]. Function shrank from ~88 lines to ~64.
NA-q-value semantics preserved. Q-value filters use is.na(EGQvalue) | EGQvalue >= cutoff so rows with missing q-values still get Intensity = NA, matching the previous dplyr::if_else behavior (a naive data.table translation would silently change this).
Dead code removed. The dplyr::collect(head(dplyr::select(...))) pattern at the old lines 140/144 was a no-op residue from an earlier Arrow-Table refactor and is gone.
data.table added to Imports in DESCRIPTION and imported via @importFrom data.table := .SD setDT setnames so the package is cedta()-aware. Regenerated NAMESPACE and man/bigSpectronauttoMSstatsFormat.Rd via devtools::document().
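The streaming pattern described in the list above can be sketched as follows. This is a minimal illustration that assumes the arrow package is installed with dataset support; the temporary CSV, its column names, and the row counter are placeholders for illustration and are not taken from the package code.

```r
library(arrow)

# Hypothetical tiny CSV standing in for a Spectronaut export.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(R.FileName = c("run1", "run2", "run3"),
                     F.PeakArea = c(10.5, 20.0, 3.2)),
          csv_path, row.names = FALSE)

block_size <- 16L * 1024L * 1024L  # 16 MiB, the default chosen in this PR

ds <- open_dataset(
  csv_path,
  format        = "csv",
  parse_options = CsvParseOptions$create(delimiter = ","),
  read_options  = CsvReadOptions$create(block_size = block_size)
)
reader <- Scanner$create(ds)$ToRecordBatchReader()

rows_seen <- 0L
repeat {
  batch <- reader$read_next_batch()
  if (is.null(batch)) break        # NULL signals that the reader is exhausted
  chunk <- as.data.frame(batch)    # only one batch is materialized at a time
  rows_seen <- rows_seen + nrow(chunk)
}
```

In the real reducer each `chunk` would be handed to cleanSpectronautChunk() instead of being counted.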

Testing

All tests run via devtools::test(): 51 PASS, 0 FAIL, 0 WARN, 0 SKIP.

New tests added in tests/testthat/test-converters.R:

reduceBigSpectronaut validates block_size: rejects negative, zero, NA, length-2 vector, and unparseable string inputs.
bigSpectronauttoMSstatsFormat plumbs block_size through: spies on reduceBigSpectronaut via mockery::stub, asserts the default forwards 16L * 1024L * 1024L and an explicit override forwards the user's value.
cleanSpectronautChunk schema smoke test: synthetic minimal Spectronaut-shaped input, asserts output column set and basic values.
cleanSpectronautChunk filter_by_excluded: rows with Excluded == "True" get Intensity = NA.
cleanSpectronautChunk filter_by_identified: rows with Identified == "False" get Intensity = NA.
cleanSpectronautChunk filter_by_qvalue (incl. NA case): rows below cutoff are kept, above cutoff become NA, and NA q-values become NA — the explicit semantic guarantee from the rewrite.
cleanSpectronautChunk drops rows where F.FrgLossType != "noloss".
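The NA case in the q-value test is easiest to see on a three-row table. The sketch below assumes data.table is installed; the cutoff and the values are made up for illustration. It contrasts a naive translation with the NA-aware form, and uses base `ifelse()` as a stand-in for the missing-value behavior of the old `dplyr::if_else` code.

```r
library(data.table)

cutoff <- 0.01
make_dt <- function() {
  data.table(EGQvalue  = c(0.001, 0.5, NA),
             Intensity = c(100, 200, 300))
}

# Naive translation: a logical NA in `i` is treated as FALSE, so the
# row with a missing q-value keeps its Intensity.
naive <- make_dt()
naive[EGQvalue >= cutoff, Intensity := NA_real_]

# NA-aware form: rows with missing q-values are masked as well.
na_aware <- make_dt()
na_aware[is.na(EGQvalue) | EGQvalue >= cutoff, Intensity := NA_real_]

# Reference: a missing condition yields NA, as with the old if_else code.
reference <- with(make_dt(), ifelse(EGQvalue < cutoff, Intensity, NA_real_))

naive$Intensity     # 100  NA 300
na_aware$Intensity  # 100  NA  NA
```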

Before this branch cleanSpectronautChunk had no direct test coverage (existing tests stubbed reduceBigSpectronaut out entirely).

Checklist Before Requesting a Review

I have read the MSstats contributing guidelines
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules

Motivation & Context

Spectronaut CSV imports previously used readr::read_delim_chunked() which accumulates string-interning state across chunks, causing unbounded memory growth on very large exports. Certain extra-wide Spectronaut exports also triggered Arrow-style “straddling object” errors when rows spanned block boundaries. Additionally, cleanSpectronautChunk() used dplyr pipelines which are less optimal for large, chunked in-place processing.

Solution: replace the readr chunked reader with an Arrow-based streaming CSV reader that processes single record-batches (bounding peak memory) and expose a tunable block_size for block-boundary issues. Rewrite cleanSpectronautChunk() in data.table for in-place, memory-efficient transformations. Add progress logging and parameter plumbing so callers can tune block_size.

Detailed changes

Package metadata and imports
- DESCRIPTION: added data.table to Imports.
- NAMESPACE: added importFrom(data.table, ":="), importFrom(data.table, ".SD"), importFrom(data.table, setDT), importFrom(data.table, setnames).
- Roxygen importFrom directive added for data.table symbols.
R/clean_spectronaut.R
- reduceBigSpectronaut()
  - Replaced readr chunked reader with Arrow streaming reader using arrow::open_dataset() and arrow::Scanner$create(...)$ToRecordBatchReader().
  - Detects CSV delimiter (comma/tab/semicolon) from filename and configures CsvParseOptions$create(delimiter = delim).
  - Uses CsvReadOptions$create(block_size = block_size) and CsvConvertOptions to avoid materializing unrelated columns.
  - Defines needed_cols and passes convert_options$include_columns (via open_dataset) so only required Spectronaut columns are parsed.
  - Iterates batch-by-batch: reader$read_next_batch() → as.data.frame(batch) → cleanSpectronautChunk(...).
  - Tracks position (pos), batch index, elapsed time, and logs progress every 1,000 batches; emits final summary when done.
  - Adds new argument block_size with default 16L * 1024L * 1024L, coerced to integer and validated (length 1, non-NA, > 0).
  - Roxygen/documentation updated to explain block_size and recommend larger values (e.g. 64 MiB) for pathological exports.
- cleanSpectronautChunk()
  - Rewritten to use data.table for in-place, efficient chunk processing.
  - setDT(input) to convert input frames in-place.
  - Selects only present expected columns (intersect) and errors if none found.
  - Two-step renaming: first standardize column names with MSstatsConvert:::.standardizeColnames, then map standardized → MSstats names via setnames(..., skip_absent = TRUE).
  - Intensity coerced to numeric; Excluded and Identified converted from character "True"/"False" to logicals when necessary.
  - Applies filters:
    - filter_by_excluded: Excluded == TRUE → Intensity := NA_real_
    - filter_by_identified: Identified == FALSE → Intensity := NA_real_
    - filter_by_qvalue: preserves NA semantics (is.na(EGQvalue) | EGQvalue >= cutoff → Intensity := NA_real_; same for PGQvalue)
  - Drops rows with FFrgLossType != "noloss".
  - Derives IsotopeLabelType from LabeledSequence (L vs H) or defaults to "L".
  - Selects final output columns (and anomalyModelFeatures when requested) and writes chunk via .writeChunkToFile().
  - Reduced code size and removed dead no-op code; behavior and semantics preserved.
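The data.table idioms listed above for cleanSpectronautChunk() can be illustrated on a toy chunk. This is a sketch under assumptions: the input columns and the one-step rename map below are simplified and hypothetical, whereas the actual function uses a two-step rename through MSstatsConvert's standardized names.

```r
library(data.table)

# Hypothetical chunk with Spectronaut-style column names.
chunk <- data.frame(
  R.FileName = c("run1", "run1", "run2"),
  F.PeakArea = c("10.5", "20.0", "30.0"),
  F.ExcludedFromQuantification = c("False", "True", "False"),
  Unused.Column = 1:3,
  stringsAsFactors = FALSE
)

expected <- c("R.FileName", "F.PeakArea",
              "F.ExcludedFromQuantification", "EG.Qvalue")

setDT(chunk)                                    # convert in place, no copy
keep  <- intersect(expected, names(chunk))      # only columns actually present
chunk <- chunk[, keep, with = FALSE]

# skip_absent = TRUE tolerates names that are missing from this chunk.
setnames(chunk,
         old = expected,
         new = c("Run", "Intensity", "Excluded", "EGQvalue"),
         skip_absent = TRUE)

chunk[, Intensity := as.numeric(Intensity)]     # in-place type coercion
chunk[, Excluded := Excluded == "True"]         # "True"/"False" to logical
chunk[Excluded == TRUE, Intensity := NA_real_]  # mask-form NA assignment
```

Every step after `setDT()` either selects columns once or updates by reference, which is what avoids the repeated transient column copies of a verb-by-verb pipeline.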
R/converters.R
- bigSpectronauttoMSstatsFormat()
  - New argument block_size = 16L * 1024L * 1024L added (after connection = NULL).
  - Forwards block_size to reduceBigSpectronaut(...).
  - Roxygen usage and argument docs updated accordingly.
man pages
- man/bigSpectronauttoMSstatsFormat.Rd updated:
  - Usage signature now shows block_size and documents default 16 MiB and advice to increase (e.g. 64 MiB) when encountering "Invalid: straddling object straddles two block boundaries".
- New man/dot-prefixedPath.Rd added for internal .prefixedPath(prefix, path).
Tests (tests/testthat/test-converters.R)
- Added helper make_spectronaut_input() to generate minimal Spectronaut chunk inputs.
- New/expanded tests covering:
  - cleanSpectronautChunk: verifies MSstats schema, filter_by_excluded behavior, filter_by_identified behavior, q-value NA-aware semantics matching dplyr::if_else, and dropping rows where F.FrgLossType != "noloss".
  - reduceBigSpectronaut: block_size argument validation rejects negative, zero, NA, vector length >1, and non-numeric string inputs.
  - bigSpectronauttoMSstatsFormat: verifies block_size is forwarded to reduceBigSpectronaut (default and explicit override).
- Existing converter tests retained; overall test run: 51 PASS, 0 FAIL, 0 WARN, 0 SKIP (per PR description).

Unit tests added or modified

tests/testthat/test-converters.R
- Added make_spectronaut_input() helper for constructing minimal Spectronaut chunks.
- Tests for cleanSpectronautChunk:
  - "cleanSpectronautChunk produces the expected MSstats schema"
  - "cleanSpectronautChunk filter_by_excluded sets Intensity to NA on excluded rows"
  - "cleanSpectronautChunk filter_by_identified sets Intensity to NA on unidentified rows"
  - "cleanSpectronautChunk filter_by_qvalue NA-aware semantics match dplyr::if_else"
  - "cleanSpectronautChunk drops rows where FFrgLossType != noloss"
- Tests for reduceBigSpectronaut and bigSpectronauttoMSstatsFormat:
  - "reduceBigSpectronaut rejects invalid block_size values" (negative, zero, NA, vector, non-numeric string)
  - "bigSpectronauttoMSstatsFormat plumbs block_size through to reduceBigSpectronaut" (captures forwarded value; checks default and explicit override)
- Tests use temporary files and clean up with on.exit/unlink.

Coding guidelines / violations

No coding guideline violations identified. The changes:

Add explicit Imports and NAMESPACE entries for data.table.
Use stopifnot() parameter validation for block_size.
Update roxygen and man pages to document new parameter and behavior.

- R/clean_spectronaut.R:9-12: added block_size parameter (default 16L * 1024L * 1024L) with coerce + validation.
- R/clean_spectronaut.R:44: CsvReadOptions$create now uses the parameter.
- R/converters.R:120-125: new @param block_size roxygen with the straddling-object workaround note.
- R/converters.R:148-156: bigSpectronauttoMSstatsFormat gains block_size, plumbed to reduceBigSpectronaut.
- tests/testthat/test-converters.R:97-163: validation tests (rejects negative/zero/NA/vector/string) + plumbing tests (default forwards 16 MiB, override forwards user's value).
- man/bigSpectronauttoMSstatsFormat.Rd: regenerated from roxygen.

…setnames so the package is data.table-aware (cedta()).
- R/clean_spectronaut.R:103-187: rewrote cleanSpectronautChunk in data.table: setDT(input) at entry; subsequent operations modify in place via :=. Two-step rename (setnames for standardize, then setnames with skip_absent = TRUE to map standardized→MSstats) matches the MSstatsConvert family pattern. Conditional NA assignment uses mask form dt[cond, Intensity := NA_real_]. Q-value filters preserve dplyr::if_else NA semantics via explicit is.na(EGQvalue) | EGQvalue >= cutoff. Dropped the leftover dplyr::collect(head(dplyr::select(...))) pattern, which was a no-op residue from a prior refactor. Function shrank from ~88 lines to ~64.
- DESCRIPTION:20: added data.table to Imports.
- NAMESPACE: regenerated, now imports :=, .SD, setDT, setnames from data.table.
- tests/testthat/test-converters.R:97-211: 5 new tests: schema smoke test, filter_by_excluded, filter_by_identified, filter_by_qvalue (covering the NA-q-value case), and FFrgLossType row drop.

coderabbitai · 2026-05-18T21:53:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 49662a26-bd47-48cd-ac0c-0369b825f8e4

📥 Commits

Reviewing files that changed from the base of the PR and between c8c835e and 8188f4e.

📒 Files selected for processing (2)

R/clean_spectronaut.R
tests/testthat/test-converters.R

📝 Walkthrough

This PR refactors Spectronaut CSV processing to use Apache Arrow record-batch streaming instead of readr chunking, converts chunk processing from dplyr to data.table operations, and introduces a configurable block_size parameter exposed in the public API, documented, and validated with tests.

Changes

Spectronaut Arrow Streaming & Data.Table Refactor

Layer / File(s)	Summary
data.table dependency setup `DESCRIPTION`, `NAMESPACE`	Added `data.table` to package Imports and declared required data.table symbols (`:=`, `.SD`, `setDT`, `setnames`) for chunk processing.
Arrow-based CSV streaming in reduceBigSpectronaut `R/clean_spectronaut.R`	Replaced `readr::read_delim_chunked()` with Apache Arrow dataset scanning using record-batch iteration, added `block_size` to Arrow read options, converted batches to data frames, invoked `cleanSpectronautChunk()` per batch, and added progress/throughput logging.
Data.table refactor in cleanSpectronautChunk `R/clean_spectronaut.R`	Rewrote chunk transformation from dplyr to data.table: validate and subset present columns, standardized→final renaming including anomaly features, coerce types (`Intensity`, `Excluded`, `Identified`), apply exclusion/identified/qvalue and `FFrgLossType == "noloss"` filtering, derive `IsotopeLabelType`, and write finalized chunk.
Public API parameter threading for block_size `R/converters.R`	Extended `bigSpectronauttoMSstatsFormat()` signature with `block_size = 16L * 1024L * 1024L`, added roxygen `@param block_size`, and forwarded `block_size` into `reduceBigSpectronaut()`.
User and internal function documentation `man/bigSpectronauttoMSstatsFormat.Rd`, `man/dot-prefixedPath.Rd`	Updated usage signature and `\arguments{}` docs for `block_size` with guidance for wide exports; added new man page documenting internal `.prefixedPath()` helper.
block_size validation and forwarding tests `tests/testthat/test-converters.R`	Added helper to build minimal Spectronaut inputs; tests for cleanSpectronautChunk behaviors (column set, excluded/identified/qvalue semantics, loss-type dropping); tests validating `block_size` input rejection and correct forwarding through the API.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant bigSpectronauttoMSstatsFormat
  participant reduceBigSpectronaut
  participant ArrowDataset
  User->>bigSpectronauttoMSstatsFormat: call (optional block_size)
  bigSpectronauttoMSstatsFormat->>reduceBigSpectronaut: forward block_size
  reduceBigSpectronaut->>ArrowDataset: configure CSV reader with block_size
  ArrowDataset->>reduceBigSpectronaut: return record batches
  reduceBigSpectronaut->>cleanSpectronautChunk: process each batch
  cleanSpectronautChunk->>reduceBigSpectronaut: write processed chunk

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Vitek-Lab/MSstatsBig#10: Both PRs touch Spectronaut chunk-writing and the shared internal chunk-writing helper .writeChunkToFile().
Vitek-Lab/MSstatsBig#6: Both PRs modify R/clean_spectronaut.R to thread anomaly feature flags and adjust handled columns; this PR additionally adds Arrow streaming and data.table refactors.

Suggested reviewers

tonywu1999

Poem

🐰 Arrow threads the CSV stream so fine,
data.table hops in, columns align,
block_size set, batches flow with cheer,
Chunks get cleaned and written, far and near,
A rabbit nods — the pipeline's clear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is vague and uses abbreviated/coded language ('M sstats big/work/20260513') that does not clearly convey the main change to someone scanning commit history.	Use a clearer title like 'Replace readr chunking with Arrow CSV reader and rewrite cleanSpectronautChunk in data.table' to better describe the primary changes.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description fully adheres to the template with complete Motivation/Context, detailed Changes, comprehensive Testing coverage, and a fully checked Checklist.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_spectronaut.R`:
- Around line 155-167: The current filter block assumes columns Excluded,
Identified, EGQvalue, PGQvalue, and FFrgLossType always exist and will error if
any are missing; update the logic in clean_spectronaut.R to guard each filter by
checking column presence (e.g., use "if ('Excluded' %in% names(input))" before
the Excluded filter, similarly for Identified, EGQvalue and PGQvalue before
applying qvalue-based NA assignment, and check for FFrgLossType before
subsetting), so each conditional only runs when its target column exists and the
behavior remains unchanged otherwise.
- Around line 25-58: The computed needed_cols is never passed to Arrow, so the
CSV reader still parses all columns; update the CsvConvertOptions usage to
include the projection by calling
arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the
include_columns field on convert_opts after creation) so that convert_opts
includes needed_cols before calling arrow::open_dataset/Scanner$create;
reference the symbols needed_cols and convert_opts (and the call
arrow::CsvConvertOptions$create) and ensure this happens prior to creating
ds/reader.

In `@tests/testthat/test-converters.R`:
- Around line 106-113: The test's expect_error calls are too broad—change them
to assert the error message mentions "block_size" so they only pass when
block_size validation fails; update each expect_error(reduceBigSpectronaut(...),
...) in tests/testthat/test-converters.R to include a second argument (string or
regex) that matches "block_size" (e.g., "block_size" or "block_size.*invalid")
for the negative values and vector cases, and similarly for the
suppressed-warning call so all invalid-block_size cases are checked by message
content rather than any error.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ebfd981-78cf-43a3-b93a-fd55ccfaeccc

📥 Commits

Reviewing files that changed from the base of the PR and between a43b90b and c8c835e.

📒 Files selected for processing (7)

DESCRIPTION
NAMESPACE
R/clean_spectronaut.R
R/converters.R
man/bigSpectronauttoMSstatsFormat.Rd
man/dot-prefixedPath.Rd
tests/testthat/test-converters.R

coderabbitai · 2026-05-18T21:57:25Z

+  # Columns cleanSpectronautChunk actually consumes; Arrow's
+  # convert_options$include_columns drops everything else at parse time so
+  # we never materialize the ~35 unused columns Spectronaut exports.
+  needed_cols <- c("R.FileName", "R.Condition", "R.Replicate",
+                   "PG.ProteinAccessions", "EG.ModifiedSequence",
+                   "FG.LabeledSequence", "FG.Charge",
+                   "F.FrgIon", "F.Charge",
+                   "EG.Identified", "F.ExcludedFromQuantification",
+                   "F.FrgLossType", "PG.Qvalue", "EG.Qvalue",
+                   intensity)
+  if (calculateAnomalyScores) {
+    needed_cols <- c(needed_cols, anomalyModelFeatures)
+  }
+
+  # Arrow's CSV reader replaces readr::read_delim_chunked.  Arrow releases
+  # per-batch state as soon as a batch is consumed, so peak memory is
+  # bounded by one record batch instead of growing with the dataset (readr
+  # keeps a string-interning pool that accumulates across chunks).  The
+  # `delim` switch above already covers comma / tab / semicolon variants;
+  # Arrow's CSV reader handles all three the same way through
+  # CsvParseOptions$delimiter.
+  parse_opts   <- arrow::CsvParseOptions$create(delimiter = delim)
+  convert_opts <- arrow::CsvConvertOptions$create()
+  read_opts    <- arrow::CsvReadOptions$create(block_size = block_size)
+
+  ds <- arrow::open_dataset(
+    input_file,
+    format          = "csv",
+    parse_options   = parse_opts,
+    convert_options = convert_opts,
+    read_options    = read_opts
+  )
+
+  reader <- arrow::Scanner$create(ds)$ToRecordBatchReader()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether column projection is actually wired in the Arrow path.
rg -n -C2 'needed_cols|CsvConvertOptions\\$create|include_columns|Scanner\\$create|open_dataset\\(' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 289

🏁 Script executed:

#!/bin/bash
# First, check for needed_cols usage in the file
echo "=== Searching for needed_cols usage ==="
rg -n 'needed_cols' R/clean_spectronaut.R
echo ""
echo "=== Searching for CsvConvertOptions and include_columns ==="
rg -n 'CsvConvertOptions\|include_columns' R/clean_spectronaut.R
echo ""
echo "=== Checking context around CsvConvertOptions create call ==="
rg -n -B3 -A3 'CsvConvertOptions\$create' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 688

🏁 Script executed:

#!/bin/bash
# Check what happens after reader creation
echo "=== Code from line 50 onwards ==="
sed -n '50,80p' R/clean_spectronaut.R
echo ""
echo "=== Complete function to see full context ==="
wc -l R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1118

🏁 Script executed:

#!/bin/bash
# Search for include_columns usage in the entire codebase
echo "=== Searching for include_columns in codebase ==="
rg -n 'include_columns' .
echo ""
echo "=== Searching for CsvConvertOptions usage patterns ==="
rg -n 'CsvConvertOptions' . -A2
echo ""
echo "=== Check the comment context more carefully ==="
sed -n '25,50p' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1892

Wire needed_cols into Arrow's column projection.

needed_cols is computed but never applied to CsvConvertOptions, so the reader still parses all CSV columns. This defeats the intended memory/performance reduction on wide Spectronaut exports.

💡 Proposed fix

-  convert_opts <- arrow::CsvConvertOptions$create()
+  convert_opts <- arrow::CsvConvertOptions$create(
+    include_columns = needed_cols
+  )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/clean_spectronaut.R` around lines 25 - 58, The computed needed_cols is never passed to Arrow, so the CSV reader still parses all columns; update the CsvConvertOptions usage to include the projection by calling arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the include_columns field on convert_opts after creation) so that convert_opts includes needed_cols before calling arrow::open_dataset/Scanner$create; reference the symbols needed_cols and convert_opts (and the call arrow::CsvConvertOptions$create) and ensure this happens prior to creating ds/reader.

coderabbitai · 2026-05-18T21:57:26Z

  if (filter_by_excluded) {
-    input <- dplyr::mutate(
-      input, Intensity = dplyr::if_else(Excluded, NA_real_, Intensity))
-
+    input[Excluded == TRUE, Intensity := NA_real_]
  }
-
  if (filter_by_identified) {
-    input <- dplyr::mutate(
-      input, Intensity = dplyr::if_else(Identified, Intensity, NA_real_))
+    input[Identified == FALSE, Intensity := NA_real_]
  }
-
  if (filter_by_qvalue) {
-    input <- dplyr::mutate(
-      input,
-      Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
-    input <- dplyr::mutate(
-      input, 
-      Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
+    # Preserve dplyr::if_else semantics: rows with NA q-values become NA.
+    input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
  }
-
-  input <- dplyr::filter(input, FFrgLossType == "noloss")
-  if (is.element("LabeledSequence", colnames(input))) {
-    input <- dplyr::mutate(input, IsLabeled = grepl("Lys8", LabeledSequence) | grepl("Arg10", LabeledSequence))
-    input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))
+
+  input <- input[FFrgLossType == "noloss"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard filter steps when selected columns are absent.

After present_orig subsetting, downstream filters still assume Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType exist. Missing columns here will raise runtime errors.

🛡️ Proposed fix

-  if (filter_by_excluded) {
+  if (filter_by_excluded && "Excluded" %in% colnames(input)) {
     input[Excluded == TRUE, Intensity := NA_real_]
   }
-  if (filter_by_identified) {
+  if (filter_by_identified && "Identified" %in% colnames(input)) {
     input[Identified == FALSE, Intensity := NA_real_]
   }
   if (filter_by_qvalue) {
-    input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
-    input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    if ("EGQvalue" %in% colnames(input)) {
+      input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    }
+    if ("PGQvalue" %in% colnames(input)) {
+      input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    }
   }
-
-  input <- input[FFrgLossType == "noloss"]
+  if ("FFrgLossType" %in% colnames(input)) {
+    input <- input[FFrgLossType == "noloss"]
+  }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  if (filter_by_excluded && "Excluded" %in% colnames(input)) {
    input[Excluded == TRUE, Intensity := NA_real_]
  }
  if (filter_by_identified && "Identified" %in% colnames(input)) {
    input[Identified == FALSE, Intensity := NA_real_]
  }
  if (filter_by_qvalue) {
    # Preserve dplyr::if_else semantics: rows with NA q-values become NA.
    if ("EGQvalue" %in% colnames(input)) {
      input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
    }
    if ("PGQvalue" %in% colnames(input)) {
      input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
    }
  }
  if ("FFrgLossType" %in% colnames(input)) {
    input <- input[FFrgLossType == "noloss"]
  }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/clean_spectronaut.R` around lines 155 - 167, The current filter block assumes columns Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType always exist and will error if any are missing; update the logic in clean_spectronaut.R to guard each filter by checking column presence (e.g., use "if ('Excluded' %in% names(input))" before the Excluded filter, similarly for Identified, EGQvalue and PGQvalue before applying qvalue-based NA assignment, and check for FFrgLossType before subsetting), so each conditional only runs when its target column exists and the behavior remains unchanged otherwise.

coderabbitai · 2026-05-18T21:57:26Z

+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
+  expect_error(suppressWarnings(
+    reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
+  ))
+})


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen invalid block_size expectations to avoid false positives.

These assertions currently pass on any error. If reducer internals fail for another reason, this test can still pass even when block_size validation regresses. Constrain the expected error to include block_size in the message.

Suggested tightening

-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)), regexp = "block_size")
   expect_error(suppressWarnings(
     reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
-  ))
+  ), regexp = "block_size")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/testthat/test-converters.R` around lines 106 - 113, The test's expect_error calls are too broad—change them to assert the error message mentions "block_size" so they only pass when block_size validation fails; update each expect_error(reduceBigSpectronaut(...), ...) in tests/testthat/test-converters.R to include a second argument (string or regex) that matches "block_size" (e.g., "block_size" or "block_size.*invalid") for the negative values and vector cases, and similarly for the suppressed-warning call so all invalid-block_size cases are checked by message content rather than any error.
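The false-positive risk raised in this comment can be reproduced in base R without testthat. The sketch below is illustrative only: `validate_block_size()`, `reduce_sketch()`, and `error_message()` are hypothetical helpers, and the error wording is an assumption rather than the package's actual message.

```r
# Hypothetical validator mirroring the checks described for block_size.
validate_block_size <- function(block_size) {
  block_size <- suppressWarnings(as.integer(block_size))
  if (length(block_size) != 1L || is.na(block_size) || block_size <= 0L) {
    stop("block_size must be a single positive integer", call. = FALSE)
  }
  block_size
}

# Stand-in for the reducer: it can also fail for an unrelated reason.
reduce_sketch <- function(input_file, block_size) {
  block_size <- validate_block_size(block_size)
  if (!file.exists(input_file)) stop("cannot open input file", call. = FALSE)
  invisible(block_size)
}

# Returns the error message, or NA if no error was raised.
error_message <- function(expr) {
  tryCatch({ expr; NA_character_ }, error = function(e) conditionMessage(e))
}

missing_file  <- tempfile(fileext = ".csv")  # path that does not exist
msg_invalid   <- error_message(reduce_sketch(missing_file, block_size = -1L))
msg_unrelated <- error_message(reduce_sketch(missing_file, block_size = 1024L))

# Both calls raise an error, so a bare "did it error?" check cannot tell
# them apart; matching on the message can.
grepl("block_size", msg_invalid)    # TRUE
grepl("block_size", msg_unrelated)  # FALSE
```

In testthat the same distinction is what `expect_error(..., regexp = "block_size")` enforces.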

tonywu1999 · 2026-05-19T23:08:56Z

-                                 anomalyModelFeatures=c()) {
+                                 calculateAnomalyScores=FALSE,
+                                 anomalyModelFeatures=c(),
+                                 block_size = 16L * 1024L * 1024L) {


For people who may want to increase their block size to speed up processing, add a recommendation on how to estimate an adequate block size that maximizes speed while reducing the risk of crashing the system.

Make it clearer in the MSstatsBig / MSstatsConvert documentation which Spectronaut columns we actually need from users.
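One possible starting point for the block-size recommendation requested above is sketched here. It is a heuristic and an assumption, not project guidance: `estimate_block_size()` is a hypothetical helper, and the `headroom` multiplier is arbitrary. It covers only the lower bound (a block should comfortably hold the widest row, which is the condition the straddling-object error points at). The upper bound depends on available memory and would have to be measured on the target machine.

```r
# Heuristic sketch: sample the start of the file, measure the widest row in
# bytes, and pick the smallest power-of-two block size with ample headroom.
estimate_block_size <- function(path, n_lines = 1000L, headroom = 64,
                                minimum = 16 * 1024^2) {
  lines  <- readLines(path, n = n_lines, warn = FALSE)
  widest <- max(nchar(lines, type = "bytes"), 0L)   # 0 for an empty file
  target <- max(minimum, widest * headroom)
  # Cap at 2^30 so the result always fits in an R integer.
  as.integer(min(2^ceiling(log2(target)), 2^30))
}

# Usage on a small placeholder file: narrow rows fall back to the minimum.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("R.FileName,F.PeakArea", "run1,10.5", "run2,20.0"), demo_path)
demo_block_size <- estimate_block_size(demo_path)
```

Sampling only the first lines can miss unusually wide rows later in the file, so a larger `n_lines` or an explicit override remains the fallback when the error still appears.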

tonywu1999 and others added 10 commits May 7, 2026 14:46

filter columns for readr initially

20dc07e

use col_names parameter

5129c77

fix col_names input

08b0db2

reduce chunk size

5a03986

try arrow csv reader delimted reader

f844758

fix column selection

53f7a78

add progress tracking

a08a65b

add more tests

fdd7476

Rudhik1904 requested a review from tonywu1999 May 18, 2026 21:53

coderabbitai Bot reviewed May 18, 2026

tonywu1999 reviewed May 19, 2026

temp Commit so I can get the list of col

8188f4e

Rudhik1904 wants to merge 11 commits into devel from MSstatsBig/work/20260513_arrow_chunking_followups
