Add SIRIUS annotations and mz_rt fallback to MZMine converter (three-tier ProteinName)#138
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughImplements three-tier ProteinName assignment: MZMine compound_name (tier 1), optional SIRIUS name via ChangesThree-tier ProteinName assignment with SIRIUS enrichment
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
R/clean_MZMine.R (1)
79-85: 💤 Low valueSIRIUS deduplication is order-dependent when duplicates exist.
Unlike tier 1 which sorts by
id, -scoreto pick the highest-scoring compound, the SIRIUS deduplication sorts only bymappingFeatureId. Ifsirius_annotationscontains multiple rows for the same feature (e.g., multiple structure candidates), the chosennamedepends on the input row order, which may vary between runs or SIRIUS versions.While the documentation notes that scores are "ignored in this release", consider adding a deterministic tiebreaker (e.g., alphabetical by
name) to ensure reproducible results:🔧 Suggested stabilization
sirius_dt[, mappingFeatureId := as.character(mappingFeatureId)] - data.table::setorder(sirius_dt, mappingFeatureId) + data.table::setorder(sirius_dt, mappingFeatureId, name) sirius_dt <- unique(sirius_dt, by = "mappingFeatureId")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/clean_MZMine.R` around lines 79 - 85, sirius deduplication is nondeterministic because sirius_dt is only ordered by mappingFeatureId before calling unique(), so when multiple rows share a mappingFeatureId the kept row depends on input order; to fix, make the ordering deterministic by additionally ordering by name (with NA pushed last) before deduplication: in the sirius_dt pipeline (working with sirius_annotations, sirius_dt, mappingFeatureId, name) create a stable sort key or replace NA names with a sentinel that sorts after real names, then call setorder(sirius_dt, mappingFeatureId, name_sort_key) and finally unique(sirius_dt, by = "mappingFeatureId") so the chosen name is reproducible (alphabetical tiebreaker).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@vignettes/msstats_data_format.Rmd`:
- Around line 381-385: Update the note about removeProtein_with1Feature and
tier-3 mz_rt fallback IDs to avoid asserting they are always singletons: clarify
that mz_rt fallback ProteinName values are derived from rounded m/z and RT so
collisions can occur and some fallback IDs may group multiple features, and
advise users that using removeProtein_with1Feature = TRUE can therefore remove
grouped features unexpectedly; reference the terms mz_rt,
removeProtein_with1Feature, ProteinName, and "tier-3" in the revised sentence.
---
Nitpick comments:
In `@R/clean_MZMine.R`:
- Around line 79-85: sirius deduplication is nondeterministic because sirius_dt
is only ordered by mappingFeatureId before calling unique(), so when multiple
rows share a mappingFeatureId the kept row depends on input order; to fix, make
the ordering deterministic by additionally ordering by name (with NA pushed
last) before deduplication: in the sirius_dt pipeline (working with
sirius_annotations, sirius_dt, mappingFeatureId, name) create a stable sort key
or replace NA names with a sentinel that sorts after real names, then call
setorder(sirius_dt, mappingFeatureId, name_sort_key) and finally
unique(sirius_dt, by = "mappingFeatureId") so the chosen name is reproducible
(alphabetical tiebreaker).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3e37b941-7631-498d-ad89-2f525f3cea15
⛔ Files ignored due to path filters (1)
inst/tinytest/raw_data/MZMine/structure_identifications.tsvis excluded by!**/*.tsv
📒 Files selected for processing (7)
R/clean_MZMine.RR/converters_MZMinetoMSstatsFormat.Rinst/tinytest/test_converters_MZMinetoMSstatsFormat.Rman/MSstatsClean.Rdman/MZMinetoMSstatsFormat.Rdman/dot-cleanRawMZMine.Rdvignettes/msstats_data_format.Rmd
- Retain features lacking MZMine compound names, assign ProteinName via MZMine/SIRIUS/mz_rt tiers, restore required mz/RT metadata, and add validation, tests, docs, and tier-level logging.
Inherit sirius_annotations docs via @inheritParams, replace tier terminology with MSI levels and plain-language source descriptions, and remove the removeProtein_with1Feature parameter (hard-coded FALSE internally). Switch the SIRIUS fill to in-place data.table updates with a deterministic dedup tiebreaker.
f40d381 to
66b378c
Compare
| #' @param mzmine_annotations `data.frame` of MZMine spectral-library | ||
| #' annotations with columns `id`, `compound_name`, `score`. Required: | ||
| #' the highest-scoring `compound_name` per feature is used as | ||
| #' `ProteinName`, and features in the quant table with no matching | ||
| #' annotation row are dropped from the output. | ||
| #' the highest-scoring `compound_name` per feature (MSI Level 2 | ||
| #' putative identification via MS/MS spectral matching) is used as | ||
| #' `ProteinName`. |
There was a problem hiding this comment.
I think you can remove mzmine_annotations in the docs here if you're using inheritParams now from .cleanRawMZMine
…from .cleanRawMZMine alongside sirius_annotations
Motivation and Context
Please include relevant motivation and context of the problem along with a short summary of the solution.
Changes
Please provide a detailed bullet point list of your changes.
Testing
Please describe any unit tests you added or modified to verify your changes.
Checklist Before Requesting a Review
Motivation and Context
The PR adds support for SIRIUS structural annotations and implements a fallback mechanism for the MZMine converter in MSstatsConvert. Previously, the converter would drop features lacking MZMine spectral-library annotations (MSI Level 2). This enhancement introduces a three-tier
ProteinNameassignment strategy: (1) MZMine compound names (highest priority), (2) SIRIUS structure identifications when available (MSI Level 3, in-silico structure prediction), and (3) a synthesized m/z and retention time identifier as a final fallback. This approach retains all features while maintaining annotation precedence and improving metabolomics data coverage.Detailed Changes
Implementation Changes
R/clean_MZMine.R(.cleanRawMZMinefunction)sirius_annotationsparameter (defaultNULL)ProteinNameassignment from a single inner join (which discarded unmatched features) to a three-tier left-join strategy:compound_namebyrowID→id(features with MZMine annotations receive highest-scoring compound name)sirius_annotationsprovided, fill remainingNAvalues usingmappingFeatureId→rowIDjoin with SIRIUSname(non-empty names only)NAfeatures, synthesizeProteinNamefrom m/z and retention time (format:mz_rt, e.g.,489.334_7.89constructed asround(mz, 4)_round(rt, 2))rowID,rowmz,rowretentiontimeR/converters_MZMinetoMSstatsFormat.R(public wrapper function)sirius_annotationsparameter (defaultNULL) to function signaturesirius_annotations(when non-NULL) contains required columnsmappingFeatureIdandname; stops with error message if missing required columnsmzmine_annotationsis provided (not NULL or missing); raises error with instruction messagesirius_annotationsintoMSstatsConvert::MSstatsClean(was previously called without it)remove_single_feature_proteinsbehavior: now always set toFALSErather than driven byremoveProtein_with1Featureflagstructure_identifications.tsvand calling the converter withsirius_annotations\detailssection to describe the three-tier strategy, MSI level assignments, and explicit clarification that features are retained via m/z-RT fallback rather than filtered outDocumentation Updates
man/MSstatsClean.Rd: Updated S4 method signature to acceptsirius_annotations = NULL; clarified that SIRIUSnamepopulatesProteinNamefor features without MZMine compound names; notes schema validated against SIRIUS 6 output (lines: +13/-6)man/MZMinetoMSstatsFormat.Rd: Documented newsirius_annotationsparameter; rewrote\detailssection to describe three-tier assignment with MSI level context (Level 2 for MZMine, Level 3 for SIRIUS); updated examples with SIRIUS demonstration; clarified features are retained rather than filtered; explained m/z-RT fallback format (lines: +64/-15)man/dot-cleanRawMZMine.Rd: Documentedsirius_annotationsparameter; specified onlymappingFeatureIdandnamecolumns are read; noted score/confidence columns are ignored; clarified schema validation against SIRIUS 6 with guidance for other versions; removed prior statement that features without matching annotations are dropped (lines: +13/-6)vignettes/msstats_data_format.Rmd: Updated vignette section to explain three-tier strategy with precedence, SIRIUS schema expectations (mappingFeatureIdmatches to MZMinerowID), and replaced prior behavior (drop features; no fallback) with new behavior (retain via m/z-RT fallback); extended worked examples to show both baseline conversion and SIRIUS-enriched conversion followed by inspection of uniquePeptideSequence/ProteinNamepairs (lines: +52/-12)Unit Tests Added/Modified
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R(lines: +50/-22)Baseline MZMine annotation test:
ProteinNamevalues (489.334_7.89,555.447_9.1) when absent frommzmine_annotationsSIRIUS annotation test:
compound_name(Caffeine) beats SIRIUS for feature 1 (confirmed both sources available but MZMine wins)"Caffeic acid""555.447_9.1"Error handling tests:
mzmine_annotations = NULLraises "mzmine_annotations is required" error with guidance messagemzmine_annotationsargument (missing) raises same errorsirius_annotations(missing requirednamecolumn) triggersstop()with "missing required column" error messageCoding Guidelines Compliance
No explicit violations of coding guidelines were identified in the provided changes. The implementation demonstrates:
setorder,unique,setfor column deletion) consistent with codebase patternsgetOption("MSstatsLog")andgetOption("MSstatsMsg")