M sstats big/work/20260513 arrow chunking followups#17

Open
Rudhik1904 wants to merge 11 commits into devel from MSstatsBig/work/20260513_arrow_chunking_followups

Conversation

Rudhik1904 commented May 18, 2026 (Contributor)

Motivation and Context

reduceBigSpectronaut() previously streamed Spectronaut CSV exports through readr::read_delim_chunked, which holds a string-interning pool that grows across chunks and pushed peak memory well above one batch's working set. This branch ports the reader to Arrow (which releases per-batch state) and follows up with the three remaining items from TODO-arrow_chunking_followups.md:

  1. The Arrow block_size was Arrow's 256 KiB default — too small for wide Spectronaut rows, causing Invalid: straddling object straddles two block boundaries errors and excessive parser overhead.
  2. End users had no documented escape hatch when the straddling error did fire on extra-wide exports.
  3. cleanSpectronautChunk() ran every batch through ~13 sequential dplyr verbs on a data.frame, producing repeated transient column allocations and fragmenting R's allocator. The rest of the MSstatsConvert family is data.table-native.

Solution: replace readr with Arrow's CsvReadOptions / Scanner / ToRecordBatchReader, raise the default block_size to 16 MiB and expose it as a user parameter, document the straddling-object workaround at the parameter, and rewrite cleanSpectronautChunk() in pure data.table.

Changes

  • Arrow reader replaces readr::read_delim_chunked in reduceBigSpectronaut(). Uses arrow::open_dataset() + Scanner + ToRecordBatchReader to stream record batches; preserves the existing comma/tab/semicolon delimiter switch via CsvParseOptions$delimiter. Per-batch progress logging every 1,000 batches.
  • New block_size parameter on both reduceBigSpectronaut() and the user-facing bigSpectronauttoMSstatsFormat(). Default 16L * 1024L * 1024L (16 MiB) replaces Arrow's 256 KiB default. Coerced to integer and validated (length 1, non-NA, positive).
  • Roxygen @param block_size documents the exact error string (Invalid: straddling object straddles two block boundaries) and the recommended override (64L * 1024L * 1024L) so users hitting the straddling error on pathological rows have a self-service fix.
  • cleanSpectronautChunk() rewritten in data.table. setDT(input) at entry; column selection via dt[, cols, with = FALSE]; two-step rename via setnames(..., skip_absent = TRUE) matching the MSstatsConvert family convention; in-place column updates via :=; conditional NA assignment via mask form dt[cond, Intensity := NA_real_]. Function shrank from ~88 lines to ~64.
  • NA-q-value semantics preserved. Q-value filters use is.na(EGQvalue) | EGQvalue >= cutoff so rows with missing q-values still get Intensity = NA, matching the previous dplyr::if_else behavior (a naive data.table translation would silently change this).
  • Dead code removed. The dplyr::collect(head(dplyr::select(...))) pattern at the old lines 140/144 was a no-op residue from an earlier Arrow-Table refactor and is gone.
  • data.table added to Imports in DESCRIPTION and imported via @importFrom data.table := .SD setDT setnames so the package is cedta()-aware. Regenerated NAMESPACE and man/bigSpectronauttoMSstatsFormat.Rd via devtools::document().
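
As a rough illustration of the block_size handling described in the list above (coerce to integer, then require length 1, non-NA, positive), a minimal sketch might look like the following. The helper name validate_block_size is hypothetical; the PR performs this check inline in reduceBigSpectronaut() and its exact error wording is not shown here.

```r
# Hypothetical helper illustrating the validation rules listed in the PR.
# The real check lives inline in reduceBigSpectronaut().
validate_block_size <- function(block_size) {
  # as.integer() turns an unparseable string such as "16MB" into NA
  # (with a warning, suppressed here so the explicit error below is what fires).
  block_size <- suppressWarnings(as.integer(block_size))
  if (length(block_size) != 1L || is.na(block_size) || block_size <= 0L) {
    stop("block_size must be a single positive integer (bytes)", call. = FALSE)
  }
  block_size
}

default_size  <- validate_block_size(16L * 1024L * 1024L)  # 16 MiB default
override_size <- validate_block_size(64L * 1024L * 1024L)  # 64 MiB override
```

Both values fit comfortably in R's 32-bit integer range, which is why the `L` suffix arithmetic is safe here.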

Testing

All tests run via devtools::test(): 51 PASS, 0 FAIL, 0 WARN, 0 SKIP.

New tests added in tests/testthat/test-converters.R:

  • reduceBigSpectronaut validates block_size: rejects negative, zero, NA, length-2 vector, and unparseable string inputs.
  • bigSpectronauttoMSstatsFormat plumbs block_size through: spies on reduceBigSpectronaut via mockery::stub, asserts the default forwards 16L * 1024L * 1024L and an explicit override forwards the user's value.
  • cleanSpectronautChunk schema smoke test: synthetic minimal Spectronaut-shaped input, asserts output column set and basic values.
  • cleanSpectronautChunk filter_by_excluded: rows with Excluded == "True" get Intensity = NA.
  • cleanSpectronautChunk filter_by_identified: rows with Identified == "False" get Intensity = NA.
  • cleanSpectronautChunk filter_by_qvalue (incl. NA case): rows below cutoff are kept, above cutoff become NA, and NA q-values become NA — the explicit semantic guarantee from the rewrite.
  • cleanSpectronautChunk drops rows where F.FrgLossType != "noloss".

Before this branch cleanSpectronautChunk had no direct test coverage (existing tests stubbed reduceBigSpectronaut out entirely).
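
The NA-q-value guarantee covered by the filter_by_qvalue test can be illustrated with a small, self-contained sketch. The three-row table and the 0.01 cutoff are made up for illustration; only the mask expression mirrors the one quoted in the PR. data.table treats NA in a logical `i` as FALSE, so a direct translation of the old dplyr::if_else call would leave the NA row untouched.

```r
library(data.table)

cutoff <- 0.01  # illustrative cutoff, not taken from the PR

# Hypothetical three-row chunk: passing, failing, and missing q-value.
make_chunk <- function() {
  data.table(EGQvalue = c(0.001, 0.5, NA), Intensity = c(100, 200, 300))
}

# Naive translation: the NA row is skipped because NA in `i` counts as FALSE.
naive <- make_chunk()
naive[EGQvalue >= cutoff, Intensity := NA_real_]

# NA-aware mask as described in the PR: missing q-values also blank Intensity.
na_aware <- make_chunk()
na_aware[is.na(EGQvalue) | EGQvalue >= cutoff, Intensity := NA_real_]
```

After running this, `naive$Intensity` still holds 300 in the third row, while `na_aware$Intensity` is NA there, matching the previous dplyr::if_else behavior.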

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Motivation & Context

Spectronaut CSV imports previously used readr::read_delim_chunked() which accumulates string-interning state across chunks, causing unbounded memory growth on very large exports. Certain extra-wide Spectronaut exports also triggered Arrow-style “straddling object” errors when rows spanned block boundaries. Additionally, cleanSpectronautChunk() used dplyr pipelines which are less optimal for large, chunked in-place processing.

Solution: replace the readr chunked reader with an Arrow-based streaming CSV reader that processes single record-batches (bounding peak memory) and expose a tunable block_size for block-boundary issues. Rewrite cleanSpectronautChunk() in data.table for in-place, memory-efficient transformations. Add progress logging and parameter plumbing so callers can tune block_size.

Detailed changes

  • Package metadata and imports

    • DESCRIPTION: added data.table to Imports.
    • NAMESPACE: added importFrom(data.table, ":="), importFrom(data.table, ".SD"), importFrom(data.table, setDT), importFrom(data.table, setnames).
    • Roxygen importFrom directive added for data.table symbols.
  • R/clean_spectronaut.R

    • reduceBigSpectronaut()
      • Replaced readr chunked reader with Arrow streaming reader using arrow::open_dataset() and arrow::Scanner$create(...)$ToRecordBatchReader().
      • Detects the CSV delimiter (comma, tab, or semicolon) from the filename and configures CsvParseOptions$create(delimiter = delim).
      • Uses CsvReadOptions$create(block_size = block_size) and CsvConvertOptions to avoid materializing unrelated columns.
      • Defines needed_cols and passes convert_options$include_columns (via open_dataset) so only required Spectronaut columns are parsed.
      • Iterates batch-by-batch: reader$read_next_batch() → as.data.frame(batch) → cleanSpectronautChunk(...).
      • Tracks position (pos), batch index, elapsed time, and logs progress every 1,000 batches; emits final summary when done.
      • Adds new argument block_size with default 16L * 1024L * 1024L, coerced to integer and validated (length 1, non-NA, > 0).
      • Roxygen/documentation updated to explain block_size and recommend larger values (e.g. 64 MiB) for pathological exports.
    • cleanSpectronautChunk()
      • Rewritten to use data.table for in-place, efficient chunk processing.
      • setDT(input) to convert input frames in-place.
      • Selects only present expected columns (intersect) and errors if none found.
      • Two-step renaming: first standardize column names with MSstatsConvert:::.standardizeColnames, then map standardized → MSstats names via setnames(..., skip_absent = TRUE).
      • Intensity coerced to numeric; Excluded and Identified converted from character "True"/"False" to logicals when necessary.
      • Applies filters:
        • filter_by_excluded: Excluded == TRUE → Intensity := NA_real_
        • filter_by_identified: Identified == FALSE → Intensity := NA_real_
        • filter_by_qvalue: preserves NA semantics (is.na(EGQvalue) | EGQvalue >= cutoff → Intensity := NA_real_; same for PGQvalue)
      • Drops rows with FFrgLossType != "noloss".
      • Derives IsotopeLabelType from LabeledSequence (L vs H) or defaults to "L".
      • Selects final output columns (and anomalyModelFeatures when requested) and writes chunk via .writeChunkToFile().
      • Reduced code size and removed dead no-op code; behavior and semantics preserved.
  • R/converters.R

    • bigSpectronauttoMSstatsFormat()
      • New argument block_size = 16L * 1024L * 1024L added (after connection = NULL).
      • Forwards block_size to reduceBigSpectronaut(...).
      • Roxygen usage and argument docs updated accordingly.
  • man pages

    • man/bigSpectronauttoMSstatsFormat.Rd updated:
      • Usage signature now shows block_size and documents default 16 MiB and advice to increase (e.g. 64 MiB) when encountering "Invalid: straddling object straddles two block boundaries".
    • New man/dot-prefixedPath.Rd added for internal .prefixedPath(prefix, path).
  • Tests (tests/testthat/test-converters.R)

    • Added helper make_spectronaut_input() to generate minimal Spectronaut chunk inputs.
    • New/expanded tests covering:
      • cleanSpectronautChunk: verifies MSstats schema, filter_by_excluded behavior, filter_by_identified behavior, q-value NA-aware semantics matching dplyr::if_else, and dropping rows where F.FrgLossType != "noloss".
      • reduceBigSpectronaut: block_size argument validation rejects negative, zero, NA, vector length >1, and non-numeric string inputs.
      • bigSpectronauttoMSstatsFormat: verifies block_size is forwarded to reduceBigSpectronaut (default and explicit override).
    • Existing converter tests retained; overall test run: 51 PASS, 0 FAIL, 0 WARN, 0 SKIP (per PR description).
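
The batch-by-batch reading pattern summarized in the list above can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: the temporary CSV, its two columns, and the row counter are assumptions for illustration, and the per-batch step only counts rows where the real function calls cleanSpectronautChunk().

```r
library(arrow)

# Small synthetic CSV standing in for a Spectronaut export (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(R.FileName = paste0("run", 1:5), F.PeakArea = c(10, 20, 30, 40, 50)),
  csv_path, row.names = FALSE
)

# Reader options: explicit delimiter and a 16 MiB block size, as in the PR.
parse_opts <- CsvParseOptions$create(delimiter = ",")
read_opts  <- CsvReadOptions$create(block_size = 16L * 1024L * 1024L)

ds <- open_dataset(csv_path, format = "csv",
                   parse_options = parse_opts, read_options = read_opts)
reader <- Scanner$create(ds)$ToRecordBatchReader()

rows_seen <- 0L
repeat {
  batch <- reader$read_next_batch()   # NULL once the stream is exhausted
  if (is.null(batch)) break
  chunk <- as.data.frame(batch)       # only one batch is held as a data.frame
  rows_seen <- rows_seen + nrow(chunk)
}
unlink(csv_path)
```

Because each batch is converted and discarded inside the loop, the working set stays near one batch regardless of file size, which is the property the PR relies on.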

Unit tests added or modified

  • tests/testthat/test-converters.R
    • Added make_spectronaut_input() helper for constructing minimal Spectronaut chunks.
    • Tests for cleanSpectronautChunk:
      • "cleanSpectronautChunk produces the expected MSstats schema"
      • "cleanSpectronautChunk filter_by_excluded sets Intensity to NA on excluded rows"
      • "cleanSpectronautChunk filter_by_identified sets Intensity to NA on unidentified rows"
      • "cleanSpectronautChunk filter_by_qvalue NA-aware semantics match dplyr::if_else"
      • "cleanSpectronautChunk drops rows where FFrgLossType != noloss"
    • Tests for reduceBigSpectronaut and bigSpectronauttoMSstatsFormat:
      • "reduceBigSpectronaut rejects invalid block_size values" (negative, zero, NA, vector, non-numeric string)
      • "bigSpectronauttoMSstatsFormat plumbs block_size through to reduceBigSpectronaut" (captures forwarded value; checks default and explicit override)
    • Tests use temporary files and clean up with on.exit/unlink.

Coding guidelines / violations

No coding guideline violations identified. The changes:

  • Add explicit Imports and NAMESPACE entries for data.table.
  • Use stopifnot() parameter validation for block_size.
  • Update roxygen and man pages to document new parameter and behavior.

tonywu1999 and others added 10 commits May 7, 2026 14:46
R/clean_spectronaut.R:9-12: added block_size parameter (default 16L * 1024L * 1024L) with coerce + validation.
R/clean_spectronaut.R:44: CsvReadOptions$create now uses the parameter.
R/converters.R:120-125: new @param block_size roxygen with the straddling-object workaround note.
R/converters.R:148-156: bigSpectronauttoMSstatsFormat gains block_size, plumbed to reduceBigSpectronaut.
tests/testthat/test-converters.R:97-163: validation tests (rejects negative/zero/NA/vector/string) + plumbing tests (default forwards 16 MiB, override forwards user's value).
man/bigSpectronauttoMSstatsFormat.Rd: regenerated from roxygen.
…setnames so the package is data.table-aware (cedta()).

R/clean_spectronaut.R:103-187: rewrote cleanSpectronautChunk in data.table:
setDT(input) at entry; subsequent operations modify in place via :=.
Two-step rename (setnames for standardize, then setnames with skip_absent = TRUE to map standardized→MSstats) matches the MSstatsConvert family pattern.
Conditional NA assignment uses mask form dt[cond, Intensity := NA_real_].
Q-value filters preserve dplyr::if_else NA semantics via explicit is.na(EGQvalue) | EGQvalue >= cutoff.
Dropped the leftover dplyr::collect(head(dplyr::select(...))) pattern — was a no-op residue from a prior refactor.
Function shrank from ~88 lines to ~64.
DESCRIPTION:20: added data.table to Imports.
NAMESPACE: regenerated, now imports :=, .SD, setDT, setnames from data.table.
tests/testthat/test-converters.R:97-211: 5 new tests — schema smoke test, filter_by_excluded, filter_by_identified, filter_by_qvalue (covering the NA-q-value case), and FFrgLossType row drop.

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 49662a26-bd47-48cd-ac0c-0369b825f8e4

📥 Commits

Reviewing files that changed from the base of the PR and between c8c835e and 8188f4e.

📒 Files selected for processing (2)
  • R/clean_spectronaut.R
  • tests/testthat/test-converters.R

📝 Walkthrough

This PR refactors Spectronaut CSV processing to use Apache Arrow record-batch streaming instead of readr chunking, converts chunk processing from dplyr to data.table operations, and introduces a configurable block_size parameter exposed in the public API, documented, and validated with tests.

Changes

Spectronaut Arrow Streaming & Data.Table Refactor

Layer (File(s)): Summary

  • data.table dependency setup (DESCRIPTION, NAMESPACE): Added data.table to package Imports and declared required data.table symbols (:=, .SD, setDT, setnames) for chunk processing.
  • Arrow-based CSV streaming in reduceBigSpectronaut (R/clean_spectronaut.R): Replaced readr::read_delim_chunked() with Apache Arrow dataset scanning using record-batch iteration, added block_size to Arrow read options, converted batches to data frames, invoked cleanSpectronautChunk() per batch, and added progress/throughput logging.
  • Data.table refactor in cleanSpectronautChunk (R/clean_spectronaut.R): Rewrote chunk transformation from dplyr to data.table: validate and subset present columns, standardized→final renaming including anomaly features, coerce types (Intensity, Excluded, Identified), apply exclusion/identified/qvalue and FFrgLossType == "noloss" filtering, derive IsotopeLabelType, and write finalized chunk.
  • Public API parameter threading for block_size (R/converters.R): Extended bigSpectronauttoMSstatsFormat() signature with block_size = 16L * 1024L * 1024L, added roxygen @param block_size, and forwarded block_size into reduceBigSpectronaut().
  • User and internal function documentation (man/bigSpectronauttoMSstatsFormat.Rd, man/dot-prefixedPath.Rd): Updated usage signature and \arguments{} docs for block_size with guidance for wide exports; added new man page documenting internal .prefixedPath() helper.
  • block_size validation and forwarding tests (tests/testthat/test-converters.R): Added helper to build minimal Spectronaut inputs; tests for cleanSpectronautChunk behaviors (column set, excluded/identified/qvalue semantics, loss-type dropping); tests validating block_size input rejection and correct forwarding through the API.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant bigSpectronauttoMSstatsFormat
  participant reduceBigSpectronaut
  participant ArrowDataset
  User->>bigSpectronauttoMSstatsFormat: call (optional block_size)
  bigSpectronauttoMSstatsFormat->>reduceBigSpectronaut: forward block_size
  reduceBigSpectronaut->>ArrowDataset: configure CSV reader with block_size
  ArrowDataset->>reduceBigSpectronaut: return record batches
  reduceBigSpectronaut->>cleanSpectronautChunk: process each batch
  cleanSpectronautChunk->>reduceBigSpectronaut: write processed chunk

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Vitek-Lab/MSstatsBig#10: Both PRs touch Spectronaut chunk-writing and the shared internal chunk-writing helper .writeChunkToFile().
  • Vitek-Lab/MSstatsBig#6: Both PRs modify R/clean_spectronaut.R to thread anomaly feature flags and adjust handled columns; this PR additionally adds Arrow streaming and data.table refactors.

Suggested reviewers

  • tonywu1999

Poem

🐰 Arrow threads the CSV stream so fine,
data.table hops in, columns align,
block_size set, batches flow with cheer,
Chunks get cleaned and written, far and near,
A rabbit nods — the pipeline's clear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check: ❓ Inconclusive. Explanation: The title is vague and uses abbreviated/coded language ('M sstats big/work/20260513') that does not clearly convey the main change to someone scanning commit history. Resolution: Use a clearer title like 'Replace readr chunking with Arrow CSV reader and rewrite cleanSpectronautChunk in data.table' to better describe the primary changes.

✅ Passed checks (4 passed)

  • Description check: ✅ Passed. The description fully adheres to the template with complete Motivation/Context, detailed Changes, comprehensive Testing coverage, and a fully checked Checklist.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

Rudhik1904 requested a review from tonywu1999 May 18, 2026 21:53

coderabbitai Bot left a comment
Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_spectronaut.R`:
- Around line 155-167: The current filter block assumes columns Excluded,
Identified, EGQvalue, PGQvalue, and FFrgLossType always exist and will error if
any are missing; update the logic in clean_spectronaut.R to guard each filter by
checking column presence (e.g., use "if ('Excluded' %in% names(input))" before
the Excluded filter, similarly for Identified, EGQvalue and PGQvalue before
applying qvalue-based NA assignment, and check for FFrgLossType before
subsetting), so each conditional only runs when its target column exists and the
behavior remains unchanged otherwise.
- Around line 25-58: The computed needed_cols is never passed to Arrow, so the
CSV reader still parses all columns; update the CsvConvertOptions usage to
include the projection by calling
arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the
include_columns field on convert_opts after creation) so that convert_opts
includes needed_cols before calling arrow::open_dataset/Scanner$create;
reference the symbols needed_cols and convert_opts (and the call
arrow::CsvConvertOptions$create) and ensure this happens prior to creating
ds/reader.

In `@tests/testthat/test-converters.R`:
- Around line 106-113: The test's expect_error calls are too broad—change them
to assert the error message mentions "block_size" so they only pass when
block_size validation fails; update each expect_error(reduceBigSpectronaut(...),
...) in tests/testthat/test-converters.R to include a second argument (string or
regex) that matches "block_size" (e.g., "block_size" or "block_size.*invalid")
for the negative values and vector cases, and similarly for the
suppressed-warning call so all invalid-block_size cases are checked by message
content rather than any error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ebfd981-78cf-43a3-b93a-fd55ccfaeccc

📥 Commits

Reviewing files that changed from the base of the PR and between a43b90b and c8c835e.

📒 Files selected for processing (7)
  • DESCRIPTION
  • NAMESPACE
  • R/clean_spectronaut.R
  • R/converters.R
  • man/bigSpectronauttoMSstatsFormat.Rd
  • man/dot-prefixedPath.Rd
  • tests/testthat/test-converters.R

Comment thread R/clean_spectronaut.R
Comment on lines +25 to +58
  # Columns cleanSpectronautChunk actually consumes; Arrow's
  # convert_options$include_columns drops everything else at parse time so
  # we never materialize the ~35 unused columns Spectronaut exports.
  needed_cols <- c("R.FileName", "R.Condition", "R.Replicate",
                   "PG.ProteinAccessions", "EG.ModifiedSequence",
                   "FG.LabeledSequence", "FG.Charge",
                   "F.FrgIon", "F.Charge",
                   "EG.Identified", "F.ExcludedFromQuantification",
                   "F.FrgLossType", "PG.Qvalue", "EG.Qvalue",
                   intensity)
  if (calculateAnomalyScores) {
    needed_cols <- c(needed_cols, anomalyModelFeatures)
  }

  # Arrow's CSV reader replaces readr::read_delim_chunked. Arrow releases
  # per-batch state as soon as a batch is consumed, so peak memory is
  # bounded by one record batch instead of growing with the dataset (readr
  # keeps a string-interning pool that accumulates across chunks). The
  # `delim` switch above already covers comma / tab / semicolon variants;
  # Arrow's CSV reader handles all three the same way through
  # CsvParseOptions$delimiter.
  parse_opts <- arrow::CsvParseOptions$create(delimiter = delim)
  convert_opts <- arrow::CsvConvertOptions$create()
  read_opts <- arrow::CsvReadOptions$create(block_size = block_size)

  ds <- arrow::open_dataset(
    input_file,
    format = "csv",
    parse_options = parse_opts,
    convert_options = convert_opts,
    read_options = read_opts
  )

  reader <- arrow::Scanner$create(ds)$ToRecordBatchReader()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether column projection is actually wired in the Arrow path.
rg -n -C2 'needed_cols|CsvConvertOptions\\$create|include_columns|Scanner\\$create|open_dataset\\(' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 289


🏁 Script executed:

#!/bin/bash
# First, check for needed_cols usage in the file
echo "=== Searching for needed_cols usage ==="
rg -n 'needed_cols' R/clean_spectronaut.R

echo ""
echo "=== Searching for CsvConvertOptions and include_columns ==="
rg -n 'CsvConvertOptions\|include_columns' R/clean_spectronaut.R

echo ""
echo "=== Checking context around CsvConvertOptions create call ==="
rg -n -B3 -A3 'CsvConvertOptions\$create' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 688


🏁 Script executed:

#!/bin/bash
# Check what happens after reader creation
echo "=== Code from line 50 onwards ==="
sed -n '50,80p' R/clean_spectronaut.R

echo ""
echo "=== Complete function to see full context ==="
wc -l R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1118


🏁 Script executed:

#!/bin/bash
# Search for include_columns usage in the entire codebase
echo "=== Searching for include_columns in codebase ==="
rg -n 'include_columns' .

echo ""
echo "=== Searching for CsvConvertOptions usage patterns ==="
rg -n 'CsvConvertOptions' . -A2

echo ""
echo "=== Check the comment context more carefully ==="
sed -n '25,50p' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1892


Wire needed_cols into Arrow's column projection.

needed_cols is computed but never applied to CsvConvertOptions, so the reader still parses all CSV columns. This defeats the intended memory/performance reduction on wide Spectronaut exports.

💡 Proposed fix
-  convert_opts <- arrow::CsvConvertOptions$create()
+  convert_opts <- arrow::CsvConvertOptions$create(
+    include_columns = needed_cols
+  )

Comment thread R/clean_spectronaut.R
Comment on lines 155 to +167
   if (filter_by_excluded) {
-    input <- dplyr::mutate(
-      input, Intensity = dplyr::if_else(Excluded, NA_real_, Intensity))
+    input[Excluded == TRUE, Intensity := NA_real_]
   }

   if (filter_by_identified) {
-    input <- dplyr::mutate(
-      input, Intensity = dplyr::if_else(Identified, Intensity, NA_real_))
+    input[Identified == FALSE, Intensity := NA_real_]
   }

   if (filter_by_qvalue) {
-    input <- dplyr::mutate(
-      input,
-      Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
-    input <- dplyr::mutate(
-      input,
-      Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
+    # Preserve dplyr::if_else semantics: rows with NA q-values become NA.
+    input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
   }

-  input <- dplyr::filter(input, FFrgLossType == "noloss")
-  if (is.element("LabeledSequence", colnames(input))) {
-    input <- dplyr::mutate(input, IsLabeled = grepl("Lys8", LabeledSequence) | grepl("Arg10", LabeledSequence))
-    input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))
+  input <- input[FFrgLossType == "noloss"]

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard filter steps when selected columns are absent.

After present_orig subsetting, downstream filters still assume Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType exist. Missing columns here will raise runtime errors.

🛡️ Proposed fix
-  if (filter_by_excluded) {
+  if (filter_by_excluded && "Excluded" %in% colnames(input)) {
     input[Excluded == TRUE, Intensity := NA_real_]
   }
-  if (filter_by_identified) {
+  if (filter_by_identified && "Identified" %in% colnames(input)) {
     input[Identified == FALSE, Intensity := NA_real_]
   }
   if (filter_by_qvalue) {
-    input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
-    input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    if ("EGQvalue" %in% colnames(input)) {
+      input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    }
+    if ("PGQvalue" %in% colnames(input)) {
+      input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    }
   }
-
-  input <- input[FFrgLossType == "noloss"]
+  if ("FFrgLossType" %in% colnames(input)) {
+    input <- input[FFrgLossType == "noloss"]
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  if (filter_by_excluded && "Excluded" %in% colnames(input)) {
    input[Excluded == TRUE, Intensity := NA_real_]
  }
  if (filter_by_identified && "Identified" %in% colnames(input)) {
    input[Identified == FALSE, Intensity := NA_real_]
  }
  if (filter_by_qvalue) {
    # Preserve dplyr::if_else semantics: rows with NA q-values become NA.
    if ("EGQvalue" %in% colnames(input)) {
      input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
    }
    if ("PGQvalue" %in% colnames(input)) {
      input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
    }
  }
  if ("FFrgLossType" %in% colnames(input)) {
    input <- input[FFrgLossType == "noloss"]
  }

Comment thread tests/testthat/test-converters.R
Comment on lines +106 to +113
  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
  expect_error(suppressWarnings(
    reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
  ))
})

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen invalid block_size expectations to avoid false positives.

These assertions currently pass on any error. If reducer internals fail for another reason, this test can still pass even when block_size validation regresses. Constrain the expected error to include block_size in the message.

Suggested tightening
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)), regexp = "block_size")
   expect_error(suppressWarnings(
     reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
-  ))
+  ), regexp = "block_size")
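
To illustrate the false-positive risk described above, here is a small sketch using a hypothetical stand-in function, broken_reducer, which fails for a reason unrelated to block_size. It is not part of MSstatsBig and exists only to show how the two forms of expect_error() differ.

```r
library(testthat)

# Hypothetical stand-in: fails before block_size is ever inspected.
broken_reducer <- function(input_file, block_size) {
  stop("cannot open file '", input_file, "'")
}

# A bare expect_error() is satisfied by any error, so this passes even though
# no block_size validation took place.
bare_ok <- tryCatch({
  expect_error(broken_reducer("missing.csv", block_size = -1L))
  TRUE
}, error = function(e) FALSE)

# With regexp = "block_size" the unrelated failure is no longer accepted;
# the expectation signals an error, which is captured here.
strict_result <- tryCatch(
  expect_error(broken_reducer("missing.csv", block_size = -1L),
               regexp = "block_size"),
  error = function(e) e
)
```

In this sketch `bare_ok` ends up TRUE while `strict_result` holds an error condition, which is the distinction the suggested tightening relies on.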

Comment thread R/clean_spectronaut.R
anomalyModelFeatures=c()) {
calculateAnomalyScores=FALSE,
anomalyModelFeatures=c(),
block_size = 16L * 1024L * 1024L) {

Contributor

  1. For users who want to raise block_size to speed up processing, add a recommendation on how to estimate a block size that maximizes speed while limiting the risk of crashing the system.
  2. Make it clearer in the MSstatsBig / MSstatsConvert documentation which Spectronaut columns users actually need to provide.

2 participants