diff --git a/README.md b/README.md
index 79c60fe0..66dd0810 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,15 @@ Multiple results may be returned representing possible conceptual matches, but a
 Note that the results returned by this service have been conflated using both GeneProtein and DrugChemical conflation; you can read more about this at the [Conflation documentation](https://github.com/NCATSTranslator/Babel/blob/master/docs/Conflation.md).
-* See this [Jupyter Notebook](documentation/NameResolution.ipynb) for examples of use.
-* See the [API documentation](documentation/API.md) for information about the NameRes API.
-* See [Scoring](documentation/Scoring.md) for information about the scoring algorithm used by NameRes.
-* See [Deployment](documentation/Deployment.md) for instructions on deploying NameRes.
+## Getting started
+
+The best place to start is the Jupyter Notebook, which walks through the most common use cases with live examples:
+
+* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NCATSTranslator/NameResolution/blob/master/documentation/NameResolution.ipynb) [Jupyter Notebook](documentation/NameResolution.ipynb) — interactive examples covering lookup, filtering, autocomplete, bulk lookup, and synonyms
+
+## Documentation
+
+* [Translator Guide](documentation/TranslatorGuide.md) — what to do when results are unexpected, when to use `/synonyms` vs. NodeNorm, and performance tips
+* [API documentation](documentation/API.md) — full reference for all NameRes endpoints
+* [Scoring](documentation/Scoring.md) — how NameRes scores and ranks results
+* [Deployment](documentation/Deployment.md) — instructions for deploying NameRes
diff --git a/api/resources/openapi.yml b/api/resources/openapi.yml
index 07313170..6a9ac0dc 100644
--- a/api/resources/openapi.yml
+++ b/api/resources/openapi.yml
@@ -13,8 +13,10 @@ info:
 have been correctly normalized using the Node Normalization service.

You can read more about this API on the NameResolution GitHub repository.

- Note that the returned by this service have been conflated using both GeneProtein and DrugChemical conflation; - you can read more about this at the Conflation documentation.' +

Note that the results returned by this service have been conflated using both GeneProtein and DrugChemical + conflation; you can read more about this at the + Conflation documentation. + The active conflations for any deployment can be discovered via the /status endpoint.

'
 license:
   name: MIT
   url: https://opensource.org/licenses/MIT
diff --git a/api/server.py b/api/server.py
index 2a2277b6..e6e3cb5e 100755
--- a/api/server.py
+++ b/api/server.py
@@ -71,6 +71,11 @@ async def status() -> Dict:
     babel_version = os.environ.get("BABEL_VERSION", "unknown")
     babel_version_url = os.environ.get("BABEL_VERSION_URL", "")
+    # Which conflations are active in this deployment? Baked in at data-loading time.
+    conflations_raw = os.environ.get("CONFLATIONS", "GeneProtein,DrugChemical")
+    conflations = [c.strip() for c in conflations_raw.split(",") if c.strip()]
+    conflation_url = "https://github.com/NCATSTranslator/Babel/blob/main/docs/Conflation.md"
+
     # Look up the BIOLINK_MODEL_TAG.
     # Note: this should be a tag from the Biolink Model repo, e.g. "master" or "v4.3.6".
     biolink_model_tag = os.environ.get("BIOLINK_MODEL_TAG", "master")
@@ -101,6 +106,8 @@ async def status() -> Dict:
                 'url': biolink_model_url,
                 'download_url': biolink_model_download_url,
             },
+            'conflations': conflations,
+            'conflation_url': conflation_url,
             'nameres_version': nameres_version,
             'startTime': core['startTime'],
             'numDocs': index.get('numDocs', ''),
@@ -122,6 +129,8 @@ async def status() -> Dict:
                 'url': biolink_model_url,
                 'download_url': biolink_model_download_url,
             },
+            'conflations': conflations,
+            'conflation_url': conflation_url,
             'nameres_version': nameres_version,
         }
diff --git a/documentation/API.md b/documentation/API.md
index 57bcbdea..8db89f68 100644
--- a/documentation/API.md
+++ b/documentation/API.md
@@ -91,13 +91,17 @@ The Name Resolver largely consists of two [search endpoints](#search-endpoints):
 ## Conflation
 Unlike the Node Normalizer, the Name Resolution Service does not currently support on-the-fly conflation. Instead,
-all the [Babel conflations](https://github.com/NCATSTranslator/Babel/blob/master/docs/Conflation.md) are turned on when Solr database is built. 
At the moment, this includes:
-* GeneProtein conflation: protein-encoding genes are conflated with the protein(s) they encode, and the gene identifier
-  is used to identify this concept. Therefore, if you search for ""
-* DrugChemical conflation: drugs are conflated with their active ingredient, and the identifier for the active ingredient
-  is used to identify this concept.
-This means that -- for example -- protein-encoding genes will include the synonyms found
-for the protein they encode, and that no separate entry will be available for those proteins.
+all the [Babel conflations](https://github.com/NCATSTranslator/Babel/blob/main/docs/Conflation.md) are baked in when the Solr database is built. At the moment, this includes:
+* **GeneProtein conflation:** protein-encoding genes are conflated with the protein(s) they encode, and the gene identifier
+  is used to identify this concept. Therefore, if you search for a protein name, you will typically receive the gene
+  identifier (e.g., searching for "dystrophin" returns `NCBIGene:1756` rather than a UniProtKB identifier).
+* **DrugChemical conflation:** drugs are conflated with their active ingredient, and the identifier for the active
+  ingredient is used to identify this concept.
+
+This means that protein-encoding genes include the synonyms found for the protein they encode, and no separate
+entry is available for those proteins in NameRes.
+
+The active conflations for any NameRes deployment can be queried programmatically via the [`/status` endpoint](#status).
 Once you have an identifier from Name Resolver, you can use the [Node Normalizer](https://nodenormalization-sri.renci.org/)
 to look up the equivalent identifiers for that CURIE with and without conflation. Please use the Node Normalizer
@@ -325,6 +329,8 @@ Solr database.
     "url": "https://github.com/biolink/biolink-model/tree/v4.2.6-rc5",
     "download_url": "https://raw.githubusercontent.com/biolink/biolink-model/v4.2.6-rc5/biolink-model.yaml"
   },
+  "conflations": ["GeneProtein", "DrugChemical"],
+  "conflation_url": "https://github.com/NCATSTranslator/Babel/blob/main/docs/Conflation.md",
   "nameres_version": "v1.5.1",
   "startTime": "2025-12-19T11:53:09.638Z",
   "numDocs": 425583391,
diff --git a/documentation/TranslatorGuide.md b/documentation/TranslatorGuide.md
new file mode 100644
index 00000000..763412cf
--- /dev/null
+++ b/documentation/TranslatorGuide.md
@@ -0,0 +1,185 @@
+# NameRes Translator Guide
+
+This guide is aimed at Translator developers and users who are integrating NameRes into their workflows.
+It covers what to do when results are unexpected, how `/synonyms` (reverse-lookup) relates to NodeNorm,
+and tips for improving performance.
+
+## What to do when a name lookup returns unexpected results
+
+NameRes ranks results by a [Solr TF*IDF score](./Scoring.md) — the top result is the best *textual* match,
+not necessarily the biologically intended concept. If the results don't look right, try these steps.
+
+### 1. Use `highlighting` to understand what matched
+
+Set `highlighting=true` on a `/lookup` call to see which label or synonym drove the match:
+
+```
+GET /lookup?string=cold&highlighting=true&limit=5
+```
+
+This tells you which synonym triggered the match, which helps diagnose why an unexpected concept ranked high.
+
+### 2. Filter by Biolink type
+
+Use `biolink_type` to restrict results to the category you expect. Multiple types are combined with OR logic:
+
+```
+GET /lookup?string=cold&biolink_type=Disease&biolink_type=PhenotypicFeature
+```
+
+Common types: `Disease`, `Gene`, `ChemicalEntity`, `PhenotypicFeature`, `BiologicalProcess`, `AnatomicalEntity`.
+Types can be specified with or without the `biolink:` prefix.
+
+### 3. 
Restrict to trusted prefixes
+
+Use `only_prefixes` to limit results to a specific ontology, or `exclude_prefixes` to drop a noisy one.
+Prefixes are pipe-separated and case-sensitive:
+
+```
+# Only MONDO disease identifiers
+GET /lookup?string=diabetes&biolink_type=Disease&only_prefixes=MONDO
+
+# Exclude UMLS (often produces many ambiguous matches)
+GET /lookup?string=NIH&exclude_prefixes=UMLS
+```
+
+Common trusted prefixes by category:
+
+| Category | Recommended prefixes |
+|---|---|
+| Disease | `MONDO`, `OMIM`, `orphanet` |
+| Gene | `NCBIGene`, `HGNC` |
+| Chemical/Drug | `CHEBI`, `DRUGBANK` |
+| Phenotype | `HP`, `MP` |
+| Anatomy | `UBERON`, `CL` |
+
+### 4. Filter by taxon for gene/protein queries
+
+When searching for a gene or protein, results may include entries from multiple species. Use `only_taxa`
+to restrict to a specific organism. The value is a pipe-separated list of NCBI Taxon CURIEs:
+
+```
+# Human genes only
+GET /lookup?string=APOE&biolink_type=Gene&only_taxa=NCBITaxon:9606
+
+# Human and mouse
+GET /lookup?string=APOE&only_taxa=NCBITaxon:9606|NCBITaxon:10090
+```
+
+Common taxa: human `NCBITaxon:9606`, mouse `NCBITaxon:10090`, rat `NCBITaxon:10116`, zebrafish `NCBITaxon:7955`.
+
+### 5. Try autocomplete mode for partial strings
+
+If your search string is a fragment of a name (e.g., typed by a user mid-word), set `autocomplete=true`.
+This expands the final word with a wildcard so that `"diab"` matches `"diabetes"`, `"diabetic"`, etc.:
+
+```
+GET /lookup?string=diab&autocomplete=true&limit=5
+```
+
+Without `autocomplete`, `"diab"` will only match documents that literally contain the token `"diab"`.
+
+### 6. If the correct concept is consistently missing
+
+If your filtering is correct but the expected result never appears, the concept may be missing from the
+Babel data that NameRes is built from. 
Consider filing an issue on:
+- [NameRes GitHub](https://github.com/NCATSTranslator/NameResolution/issues) — for search/ranking problems
+- [Babel GitHub](https://github.com/NCATSTranslator/Babel/issues) — for missing synonyms or identifiers
+
+---
+
+## Using `/synonyms` (reverse-lookup) vs. NodeNorm
+
+These two services answer different questions.
+
+### Use `/synonyms` when you want to inspect synonyms for a known CURIE
+
+The `/synonyms` endpoint returns all names and synonyms that NameRes knows for a given concept, along with
+its Biolink types, taxa, and clique identifier count. This is useful for verifying synonym coverage or
+debugging why a particular name did or did not match.
+
+```
+GET /synonyms?preferred_curies=NCBIGene:1756
+```
+
+**Important:** `/synonyms` requires the *preferred* (normalized) CURIE. If you pass a non-preferred
+identifier (e.g. a UniProtKB accession for a gene), you will get an empty result. Before calling
+`/synonyms`, normalize your CURIE with NodeNorm (see below).
+
+You can look up multiple CURIEs in one request:
+
+```
+GET /synonyms?preferred_curies=MONDO:0005148&preferred_curies=NCBIGene:1756
+```
+
+### Use NodeNorm when you need identifier normalization or equivalent identifiers
+
+The [Node Normalization service](https://nodenormalization-sri.renci.org/) is the right tool when you need to:
+
+- Convert a non-preferred identifier to its preferred CURIE
+- Find all equivalent identifiers for a concept across ontologies
+- Check which Biolink types a CURIE maps to
+- Determine whether two CURIEs refer to the same concept
+
+To normalize a CURIE before passing it to `/synonyms`, call NodeNorm with GeneProtein and DrugChemical
+conflation enabled (to match the conflation used by NameRes):
+
+```
+GET https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=UniProtKB:A0A0S2Z3B5&conflate=true&drug_chemical_conflate=true
+```
+
+The `id.identifier` field in the response is the preferred CURIE you can then pass to `/synonyms`.
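Reviewer sketch (not part of the diff): the normalize-then-`/synonyms` flow described above could look roughly like this in Python. The NameRes base URL, the helper names, and the `drug_chemical_conflate` parameter name are assumptions here and should be checked against the deployed NodeNorm and NameRes OpenAPI documents. The response shape (a dictionary keyed by the input CURIE, with an `id.identifier` field) follows the guide's description.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

NODENORM_URL = "https://nodenormalization-sri.renci.org"  # taken from the guide
NAMERES_URL = "https://name-resolution-sri.renci.org"     # assumption: public NameRes instance


def nodenorm_query_url(curie: str) -> str:
    """Build a NodeNorm request with both conflations enabled, to match NameRes."""
    # Parameter names are assumptions; verify them against the NodeNorm API docs.
    params = {"curie": curie, "conflate": "true", "drug_chemical_conflate": "true"}
    return f"{NODENORM_URL}/get_normalized_nodes?{urlencode(params)}"


def preferred_curie(nodenorm_response: dict, curie: str):
    """Return id.identifier for `curie`, or None if NodeNorm has no entry for it."""
    entry = nodenorm_response.get(curie)
    if not entry:
        return None
    return entry.get("id", {}).get("identifier")


def synonyms_query_url(preferred_curies: list) -> str:
    """Build a /synonyms request with one preferred_curies parameter per CURIE."""
    query = urlencode([("preferred_curies", c) for c in preferred_curies])
    return f"{NAMERES_URL}/synonyms?{query}"


def fetch_json(url: str) -> dict:
    """Perform a GET request and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

A caller would fetch `nodenorm_query_url(curie)`, pass the parsed body to `preferred_curie`, and only call `/synonyms` when the result is not `None`. That avoids the empty-result case the guide warns about.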
+
+### Quick decision guide
+
+| Question | Tool |
+|---|---|
+| What synonyms does NameRes know for this CURIE? | `/synonyms` |
+| What is the preferred identifier for this concept? | NodeNorm |
+| Are these two CURIEs equivalent? | NodeNorm |
+| What Biolink types does this CURIE have? | NodeNorm |
+| Why didn't a particular name match in `/lookup`? | `/synonyms` + `highlighting` |
+| Which conflations are active in this NameRes deployment? | `/status` (`conflations` field) |
+
+---
+
+## Performance tips
+
+### Batch multiple queries with `/bulk-lookup`
+
+Instead of making N separate `/lookup` calls, send them all in one POST request to `/bulk-lookup`.
+It returns a dictionary keyed by input string:
+
+```json
+POST /bulk-lookup
+{
+  "strings": ["diabetes", "hypertension", "asthma"],
+  "limit": 5,
+  "biolink_types": ["Disease"]
+}
+```
+
+This is significantly more efficient than sequential individual requests.
+
+### Add filters before processing results
+
+Apply `biolink_type`, `only_prefixes`, and `only_taxa` at query time rather than filtering the response
+yourself. Server-side filtering reduces the result set before it is serialized and transmitted.
+
+### Set `limit` to what you actually need
+
+The default `limit` is 10 and the maximum is 1000. If you only need the top result, set `limit=1`.
+If you need to page through a large result set, use `offset` for server-side pagination rather than
+requesting a large `limit` and slicing client-side.
+
+### Cache results between Babel data releases
+
+NameRes results are stable between Babel data releases (which happen a few times per year). If your
+application calls NameRes repeatedly for the same input strings, cache the results locally. Check the
+`/status` endpoint to detect when the Babel version changes and invalidate your cache accordingly:
+
+```
+GET /status
+```
+
+The `babel_version` field in the response changes with each data release.
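Reviewer sketch (not part of the diff): one possible shape for the cache invalidation described in the last section, in Python. The class name and the two injected callables are hypothetical; `get_status` is assumed to return the parsed `/status` JSON and `lookup` the parsed `/lookup` results for one string. Injecting them keeps the sketch independent of any particular HTTP client.

```python
from typing import Callable, Dict, Optional


class BabelVersionedCache:
    """Cache lookup results and discard them when `babel_version` changes."""

    def __init__(self, get_status: Callable[[], dict], lookup: Callable[[str], list]):
        self._get_status = get_status  # hypothetical: returns parsed /status JSON
        self._lookup = lookup          # hypothetical: returns parsed /lookup JSON
        self._version: Optional[str] = None
        self._results: Dict[str, list] = {}

    def refresh_version(self) -> bool:
        """Re-read babel_version; clear the cache if it changed. True if cleared."""
        version = self._get_status().get("babel_version")
        if version != self._version:
            self._version = version
            self._results.clear()
            return True
        return False

    def get(self, string: str) -> list:
        """Return cached results for `string`, querying NameRes only on a miss."""
        if string not in self._results:
            self._results[string] = self._lookup(string)
        return self._results[string]
```

Calling `refresh_version()` on a schedule (for example once per process start or once per day) rather than before every lookup keeps the extra `/status` traffic small. How often to check is a deployment decision that the guide does not specify.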