Integrate reusable xCRG package into ARAX#2772

Open
venkataseshtej wants to merge 11 commits into master from xcrg-package-integration

Conversation

@venkataseshtej
Collaborator

This PR integrates the reusable catrax-xcrg package into ARAX for MVP2/xCRG inferred ChemicalEntity-Gene activity/abundance queries.

Main changes

  • Adds catrax-xcrg as a pinned dependency.
  • Detects MVP2/xCRG query shape in ARAX_query_graph_interpreter.
  • Routes matching queries to connect(action=xcrg).
  • Calls the reusable catrax-xcrg package from ARAX_connect.
  • Uses config/env-driven Retriever URL, timeout, and TF batch size:
    • ARAX_XCRG_RETRIEVER_URL
    • ARAX_XCRG_TIMEOUT
    • ARAX_XCRG_TF_BATCH_SIZE
  • Uses ARAX NGD DB path via get_curie_ngd_path().
  • Makes ResultTransformer safely no-op for xCRG responses because xCRG already returns final TRAPI support graph output.
  • Fixes ARAX final result count for responses where connect() replaces the message.
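As a rough illustration of the detection and routing bullets above, a shape check could look like the sketch below. This is not the code in ARAX_query_graph_interpreter; the function names `is_mvp2_xcrg_query` and `choose_operation` are hypothetical, and the exact criteria (one inferred `biolink:affects` edge from a ChemicalEntity to a Gene, qualified on activity_or_abundance with an increased or decreased direction) are an assumption based on the description in this PR.

```python
from typing import Optional


def is_mvp2_xcrg_query(query_graph: dict) -> bool:
    """Hypothetical check for a one-hop inferred ChemicalEntity -> Gene query
    qualified on activity_or_abundance (increased or decreased)."""
    nodes = query_graph.get("nodes", {})
    edges = query_graph.get("edges", {})
    if len(nodes) != 2 or len(edges) != 1:
        return False
    edge = next(iter(edges.values()))
    if edge.get("knowledge_type") != "inferred":
        return False
    if "biolink:affects" not in (edge.get("predicates") or []):
        return False
    subject = nodes.get(edge.get("subject"), {})
    obj = nodes.get(edge.get("object"), {})
    if "biolink:ChemicalEntity" not in (subject.get("categories") or []):
        return False
    if "biolink:Gene" not in (obj.get("categories") or []):
        return False
    # Flatten the qualifier constraints into {qualifier_type_id: value}.
    qualifiers = {}
    for constraint in edge.get("qualifier_constraints") or []:
        for qualifier in constraint.get("qualifier_set") or []:
            qualifiers[qualifier.get("qualifier_type_id")] = qualifier.get("qualifier_value")
    return (
        qualifiers.get("biolink:object_aspect_qualifier") == "activity_or_abundance"
        and qualifiers.get("biolink:object_direction_qualifier") in {"increased", "decreased"}
    )


def choose_operation(query_graph: dict) -> Optional[str]:
    """Return the xCRG connect operation for matching queries, else None."""
    return "connect(action=xcrg)" if is_mvp2_xcrg_query(query_graph) else None
```

Queries that do not match fall through with `None`, so the existing interpreter behavior would be untouched for them.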

Validation

I ran one MVP2/xCRG query through ARAX locally.

The query was recognized as xCRG MVP2 and routed to:

connect(action=xcrg)

It returned real results:

status: OK
message: Normal completion with 1378 results.
xcrg_connect_flag: True

TRAPI validator result on the ARAX-produced response:

CRITICAL: 0
ERRORS: 0
WARNINGS: 0

Extra checks also passed:

missing node binding attributes: 0
missing edge binding attributes: 0
KG edges missing attributes: 0
KG edges missing sources: 0
metatype:Datetime attributes: 0
schema_version: 1.6.0
biolink_version: 4.3.2

Notes

For local testing, I used:

ARAX_XCRG_RETRIEVER_URL=https://retriever.ci.transltr.io/query

because dev.retriever.biothings.io was returning 424 Failed Dependency for this query shape.

The xCRG package does not hardcode the Retriever endpoint; it is config/env-driven.

Local NGD testing used get_curie_ngd_path() and a local symlink to the Shepherd NGD SQLite DB. In CI/prod, ARAX DB manager should provide the NGD DB at that expected path.
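For reference, reading the two numeric overrides could be sketched as follows. The environment variable names come from this PR; the default values, the `XcrgSettings` class, and the `load_xcrg_settings` helper are placeholders for illustration only, since the actual defaults are not stated here.

```python
import os
from dataclasses import dataclass

# Placeholder defaults (assumptions); the real code defaults may differ.
ASSUMED_DEFAULT_TIMEOUT_SECONDS = 120.0
ASSUMED_DEFAULT_TF_BATCH_SIZE = 50


@dataclass(frozen=True)
class XcrgSettings:
    timeout_seconds: float
    tf_batch_size: int


def load_xcrg_settings(env=None) -> XcrgSettings:
    """Read optional overrides from the environment, else use the defaults."""
    env = os.environ if env is None else env
    timeout_raw = env.get("ARAX_XCRG_TIMEOUT")
    batch_raw = env.get("ARAX_XCRG_TF_BATCH_SIZE")
    timeout = float(timeout_raw) if timeout_raw else ASSUMED_DEFAULT_TIMEOUT_SECONDS
    batch = int(batch_raw) if batch_raw else ASSUMED_DEFAULT_TF_BATCH_SIZE
    if timeout <= 0 or batch <= 0:
        raise ValueError("timeout and TF batch size must be positive")
    return XcrgSettings(timeout_seconds=timeout, tf_batch_size=batch)
```

Passing a plain dict as `env` keeps the helper easy to exercise in a unit test without touching the process environment.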

@venkataseshtej venkataseshtej self-assigned this May 18, 2026
Member

@dkoslicki dkoslicki left a comment


Looks good, with just one comment about deployment level. @mohsenht can you review too since this is in your connect part of the code?

Also, moving forward, it is likely best to pin the requirements to a versioned PyPI release rather than a commit SHA.

Comment thread code/ARAX/ARAXQuery/ARAX_connect.py Outdated
XCRG_RETRIEVER_URL_ENV = "ARAX_XCRG_RETRIEVER_URL"
XCRG_TIMEOUT_ENV = "ARAX_XCRG_TIMEOUT"
XCRG_TF_BATCH_SIZE_ENV = "ARAX_XCRG_TF_BATCH_SIZE"
DEFAULT_XCRG_RETRIEVER_URL = "https://retriever.ci.transltr.io/query"
Member


The retriever URL should pull from whatever deployment level ARAX is currently at. For example, if the code is deployed at arax.ci.transltr.io, it should call retriever.ci.transltr.io. If deployed at arax.test.transltr.io, it should call retriever.test.transltr.io. And if at arax.transltr.io, it should call retriever.transltr.io. You should be able to pull this from the configuration file (I forget exactly which).

Collaborator Author


Thanks, that makes sense. I updated xCRG Retriever URL selection to use RTXConfiguration().maturity so ARAX staging/testing/production map to the corresponding Retriever deployment. I kept ARAX_XCRG_RETRIEVER_URL as an explicit local/debug override.

I also agree that once catrax-xcrg has a versioned PyPI release, we should replace the Git commit pin with a pinned PyPI version.

@dkoslicki
Member

Awesome! But I just noticed that all the xCRG tests are skipped due to being marked as slow in the test suite. Can you please:

  1. Run those tests locally (including the slow-marked xCRG tests)
  2. If those tests execute relatively quickly, remove the slow marks in the automated tests
  3. Include new, faster tests for xCRG so we can be testing that component every time the CICD runs

@saramsey
Member

saramsey commented May 18, 2026

From this PR, I see that environment variables are used for xCRG configuration. From the submitter's comments, I gather that the environment variables are intended for debugging purposes, which I presume means on a dev machine or in a "test rig" type environment.

If there is a potential use-case intended for setting one or more of the new xCRG environment variables within the ARAX flask application, how would we be proposing to set them? This is just a reminder that ARAX is (at least, with the current code-base we have) started out of a System V init script, on ITRB systems and on arax.ncats.io. I suppose that init script could be hacked to override xCRG environment variables, if need be? Just wondering what the plan is here. Maybe the plan is that xCRG environment variable overriding will not ever be needed or used for the Flask server ARAX on arax.ncats.io or on ITRB systems?

@venkataseshtej
Collaborator Author

Awesome! But I just noticed that all the xCRG tests are skipped due to being marked as slow in the test suite. Can you please:

  1. Run those tests locally (including the slow-marked xCRG tests)
  2. If those tests execute relatively quickly, remove the slow marks in the automated tests
  3. Include new, faster tests for xCRG so we can be testing that component every time the CICD runs

Thank you, I checked this and pushed an update to the PR.

I ran the existing slow-marked xCRG tests locally with --runslow. Those tests appear to be for the old legacy ARAX_infer.py / creativeCRG path, not the new package-backed connect(action=xcrg) path. They did execute, but 4/5 failed because the old legacy embedding file is missing:

Infer/data/xCRG_data/chemical_gene_embeddings_...npz

So I did not remove the slow marks from those tests, since they are not reliable always-on CI tests for the new xCRG path.

Instead, I added fast always-on tests for the new ARAX xCRG integration path in:

code/ARAX/test/test_ARAX_xcrg_connect.py

These tests cover:

  • deployment-aware Retriever URL mapping,
  • ARAX_XCRG_RETRIEVER_URL override behavior,
  • MVP2 query routing to connect(action=xcrg),
  • non-xCRG queries not routing to connect(action=xcrg),
  • ARAX_connect calling run_xcrg(...) with the expected config,
  • ResultTransformer no-op behavior for xCRG responses.
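To give a flavor of why these tests can run quickly, here is a minimal sketch of the "calls run_xcrg(...) with the expected config" style of test, using a mock in place of the network-backed package call. The `connect_xcrg` wrapper and the keyword arguments `retriever_url` and `timeout` are hypothetical stand-ins, not the real ARAX_connect code or the real `run_xcrg` signature.

```python
from unittest import mock


def connect_xcrg(query_graph, run_xcrg, retriever_url, timeout):
    """Hypothetical stand-in for the xCRG branch of connect()."""
    return run_xcrg(query_graph, retriever_url=retriever_url, timeout=timeout)


def test_connect_passes_config_to_run_xcrg():
    # The mock replaces the package call, so no request reaches Retriever.
    fake_run_xcrg = mock.Mock(return_value={"message": {"results": []}})
    query_graph = {"nodes": {}, "edges": {}}
    url = "https://retriever.ci.transltr.io/query"

    response = connect_xcrg(query_graph, fake_run_xcrg, url, 30)

    fake_run_xcrg.assert_called_once_with(query_graph, retriever_url=url, timeout=30)
    assert response == {"message": {"results": []}}
```

Because nothing here touches Retriever or the NGD database, a test of this kind needs no slow mark.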

Local validation passed:

6 passed in 1.82s
git diff --check: passed
py_compile: passed

I pushed this update to the PR branch.

@saramsey
Member

saramsey commented May 18, 2026

@venkataseshtej thank you for coding up some new faster xCRG unit tests. That was a great idea.

@saramsey
Member

saramsey commented May 18, 2026

@venkataseshtej
In the new xcrg-package-integration branch, it doesn't look like the new xCRG database files are listed in RTX/code/config_dbs.json, are they? Or are there no new database files for xCRG?
https://github.com/RTXteam/RTX/blob/xcrg-package-integration/code/config_dbs.json
In any event, in that file, there seem to be "old" xCRG files that should be updated or removed.

Have you tested this PR by running the ARAX flask application server and the "Example 3" question in it?

I think it would be a good idea, before merging. See
https://github.com/RTXteam/RTX/blob/master/notes/arax-maintenance-sop.md
for some tips on how to run the flask server locally on a dev machine.

Another option is to commandeer a dev-area on arax.ncats.io and test there using the xcrg-package-integration branch code.

Personally, I prefer to test on a developer machine by just running the flask server locally. I find it is easier than doing:

ssh arax.ncats.io
sudo docker exec -it rtx2 bash
su - rt
cd /mnt/data/orangeboard/devarea/RTX
git fetch origin
git stash
git checkout xcrg-package-integration
python3.12 code/ARAX/ARAXQuery/ARAX_database_manager.py -c
exit
# service RTX_OpenAPI_devarea stop
# service RTX_OpenAPI_devarea start
# tail -f /tmp/RTX_OpenAPI_devarea.elog
<Ctrl-C>
su - rt
cd /mnt/data/orangeboard/devarea/RTX
git checkout master
git stash pop
exit
# service RTX_OpenAPI_devarea stop
# service RTX_OpenAPI_devarea start
# tail -f /tmp/RTX_OpenAPI_devarea.elog
<Ctrl-C>

and so forth.

If you would like some help getting the ARAX Flask server working on your developer machine, @hodgesf or @bazarkua can help show you how to do it.

@venkataseshtej
Collaborator Author

From this PR, I see that environment variables are used for xCRG configuration. From the submitter's comments, I gather that the environment variables are intended for debugging purposes, which I presume means on a dev machine or in a "test rig" type environment.

If there is a potential use-case intended for setting one or more of the new xCRG environment variables within the ARAX flask application, how would we be proposing to set them? This is just a reminder that ARAX is (at least, with the current code-base we have) started out of a System V init script, on ITRB systems and on arax.ncats.io. I suppose that init script could be hacked to override xCRG environment variables, if need be? Just wondering what the plan is here. Maybe the plan is that xCRG environment variable overriding will not ever be needed or used for the Flask server ARAX on arax.ncats.io or on ITRB systems?

Thanks, that is a good point. The intention is that the ARAX_XCRG_* environment variables are optional local/debug overrides only, not required deployment configuration for Flask ARAX.

For deployed ARAX, no new environment variables need to be set. The Retriever URL is now selected from RTXConfiguration().maturity, so the default behavior is deployment-aware:

staging     -> https://retriever.ci.transltr.io/query
testing     -> https://retriever.test.transltr.io/query
production  -> https://retriever.transltr.io/query
development -> https://retriever.ci.transltr.io/query

The timeout and TF batch size also have code defaults, so the System V init script should not need to be modified for xCRG.

The env vars are mainly for local testing/debugging, for example if a developer wants to temporarily point xCRG to a specific Retriever deployment or adjust timeout/batch size without changing code. If we later decide these values need to be production-tunable, I agree the cleaner approach would be to add them to the ARAX/RTX configuration rather than relying on System V init-script environment overrides.

I can also add a short code comment to make this explicit.
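For clarity, the selection logic described above amounts to something like the sketch below. The four URLs and the `ARAX_XCRG_RETRIEVER_URL` override come from this thread; the function name `select_retriever_url` is hypothetical, and raising on an unknown maturity value is an assumption rather than the behavior of the PR code. In ARAX the `maturity` argument would come from `RTXConfiguration().maturity`.

```python
import os

RETRIEVER_URL_BY_MATURITY = {
    "staging": "https://retriever.ci.transltr.io/query",
    "testing": "https://retriever.test.transltr.io/query",
    "production": "https://retriever.transltr.io/query",
    "development": "https://retriever.ci.transltr.io/query",
}


def select_retriever_url(maturity: str, env=None) -> str:
    """Pick the Retriever URL: explicit override first, then the maturity map."""
    env = os.environ if env is None else env
    override = env.get("ARAX_XCRG_RETRIEVER_URL")
    if override:
        # Local/debug override only; deployed ARAX does not need to set this.
        return override
    try:
        return RETRIEVER_URL_BY_MATURITY[maturity]
    except KeyError:
        # Assumed behavior: fail loudly rather than guess a deployment.
        raise ValueError(f"unknown maturity: {maturity!r}") from None
```

With no override set, `select_retriever_url("testing", {})` resolves to the test Retriever, so the System V init script needs no changes.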

@venkataseshtej
Collaborator Author

@venkataseshtej have you tested this PR by running the ARAX flask application server? I think it might be a good idea, before merging. See https://github.com/RTXteam/RTX/blob/master/notes/arax-maintenance-sop.md

Thanks, good point. I have tested the PR through the local ARAXQuery().query(...) path and validated the resulting TRAPI response, but I have not yet tested it through the ARAX Flask application server.

I agree that this is worth doing before merge. I will follow the local Flask server portion of the ARAX maintenance SOP, start the Flask server from this PR branch, submit the same MVP2/xCRG query through the HTTP endpoint, and validate the returned response with the TRAPI validator. I will report the result back on the PR before merge.

@saramsey
Member

saramsey commented May 18, 2026

@venkataseshtej here is the most relevant section of the ARAX Maintenance SOP for this situation, I think:
https://github.com/RTXteam/RTX/blob/master/notes/arax-maintenance-sop.md#setup-of-your-local-dev-system

Unfortunately, that section has gotten a little bit out-of-date since we deployed Tier0 ARAX. But it gives the gist of how to set things up, at least.

@dkoslicki
Member

@venkataseshtej , there are no new database files for this new xCRG implementation, correct? @saramsey , do you have any guidance on removing the old xCRG database files? This new method is model-free in comparison to the old, model-based approach.

@saramsey
Member

saramsey commented May 18, 2026

@saramsey , do you have any guidance on removing the old xCRG database files?

If there are database files that are no longer needed, I propose that in the xcrg-package-integration branch, those database file references should be removed from:
https://github.com/RTXteam/RTX/blob/xcrg-package-integration/code/config_dbs.json

and the code corresponding to those file(s) should be removed from the ARAX_database_manager.py script (for consistency, I think that code edit should be done in the same branch as the change to the config_dbs.json file):
https://github.com/RTXteam/RTX/blob/xcrg-package-integration/code/ARAX/ARAXQuery/ARAX_database_manager.py

@bazarkua or @hodgesf can help with the edits to the ARAX_database_manager.py, if that would be useful to the PSU team.

There is also this shell script,
https://github.com/RTXteam/RTX/blob/xcrg-package-integration/code/generate-db-symlinks.sh

which isn't used in an automated way by any of our ARAX systems but it is a convenience script for managing ARAX on a dev machine. I'm happy to edit it in the xcrg-package-integration branch, with permission from @venkataseshtej . I just need to know which database files are being eliminated from config_dbs.json. @venkataseshtej maybe you can comment about which specific database files referenced in config_dbs.json are going away, from ARAX? Or, I guess, I can inspect the commit when someone removes them from config_dbs.json in the branch.

Are all three of these lines going away?

"xcrg_embeddings": "/translator/data/orangeboard/databases/KG2.10.2/chemical_gene_embeddings_v1.0.KG2.10.0_refreshedTo_KG2.10.2.npz",
"xcrg_increase_model": "/translator/data/orangeboard/databases/KG2.10.0/xcrg_increase_model_v1.0.KG2.10.0_new_version.pt",
"xcrg_decrease_model": "/translator/data/orangeboard/databases/KG2.10.0/xcrg_decrease_model_v1.0.KG2.10.0_new_version.pt"

@venkataseshtej
Collaborator Author

@venkataseshtej here is the most relevant section of the ARAX Maintenance SOP for this situation, I think: https://github.com/RTXteam/RTX/blob/master/notes/arax-maintenance-sop.md#setup-of-your-local-dev-system

Unfortunately, that section has gotten a little bit out-of-date since we deployed Tier0 ARAX. But it gives the gist of how to set things up, at least.

Thanks, that makes sense.

For the new package-backed xCRG path, I do not believe there are new xCRG-specific database files that need to be added to code/config_dbs.json. The new path uses Retriever through a configured URL, uses the existing ARAX NGD DB path through get_curie_ngd_path(), and the TF list is bundled inside the catrax-xcrg package.

That said, I will audit code/config_dbs.json and grep the codebase for the old xCRG/creativeCRG references. If those old xCRG entries are only for the legacy ARAX_infer.py / creativeCRG path and are no longer needed for the new package-backed connect(action=xcrg) path, I can either remove/update them in this PR or leave that as a separate cleanup, depending on what you prefer.

I also agree on the Flask server test. So far I tested through the local ARAXQuery().query(...) path and validated the returned TRAPI response, but I have not yet tested through the ARAX Flask application server. I will follow the local Flask server portion of the maintenance SOP, run the server from this PR branch, submit the “Example 3” query through the HTTP endpoint, and validate the returned response. I will report the result back here before merge.

@venkataseshtej
Collaborator Author

@venkataseshtej . I just need to know which database files are being eliminated from config_dbs.json. @venkataseshtej may

Thanks, this makes sense.

For the new package-backed connect(action=xcrg) implementation, there are no new xCRG-specific database/model files that need to be added to code/config_dbs.json.

The new path uses:

  • Retriever through the configured Retriever URL,
  • the existing ARAX NGD DB path via get_curie_ngd_path(),
  • the TF (transcription factors) list bundled inside the catrax-xcrg package.

So yes, these three old model-based xCRG entries are legacy only and are not used by the new package-backed xCRG path:

xcrg_embeddings
xcrg_increase_model
xcrg_decrease_model

I removed those three entries from code/config_dbs.json, removed the corresponding handling from RTXConfiguration.py and ARAX_database_manager.py, and removed the matching dev symlink entries from generate-db-symlinks.sh.

I did not remove the existing NGD DB configuration, since the new xCRG path still uses NGD through the ARAX NGD helper.

Validation:

config_dbs.json JSON syntax: OK
py_compile: OK
git diff --check: OK
fast ARAX xCRG tests: 6 passed in 1.85s
ARAXDatabaseManager check:
  xcrg_embeddings: False
  xcrg_increase_model: False
  xcrg_decrease_model: False
  curie_ngd: True

@venkataseshtej
Collaborator Author

venkataseshtej commented May 18, 2026

@venkataseshtej , there are no new database files for this new xCRG implementation, correct? @saramsey , do you have any guidance on removing the old xCRG database files? This new method is model-free in comparison to the old, model-based approach.

Yes, correct. The new package-backed connect(action=xcrg) implementation does not add any new xCRG-specific database/model files.

It uses:

  • Retriever through the configured Retriever URL,
  • the existing ARAX NGD DB via get_curie_ngd_path(),
  • and the TF list bundled inside the catrax-xcrg package.

So the old model-based xCRG files are no longer needed for this new path:

xcrg_embeddings
xcrg_increase_model
xcrg_decrease_model

I removed those legacy references from config_dbs.json, RTXConfiguration.py, ARAX_database_manager.py, and generate-db-symlinks.sh. I kept curie_ngd, since the new xCRG path still uses the existing ARAX NGD helper.

@venkataseshtej
Collaborator Author

I found two separate issues and pushed follow-up fixes.

First, the CI failure happened because the Docker image used in the Python analysis job clones RTX again inside the container, and that inner clone was using default branch code rather than the PR branch. That is why CI still saw the old xCRG DB entries and tried to rsync the old chemical_gene_embeddings file even after the PR branch removed those entries. I added a small CI/Docker change so the Docker build receives the PR branch and checks it out inside the container.

Second, I found that the legacy creativeCRG.py code still references RTXConfig.xcrg_embeddings_path, RTXConfig.xcrg_increase_model_path, and RTXConfig.xcrg_decrease_model_path. To avoid breaking legacy imports/tests, I kept those as legacy fallback attributes in RTXConfiguration.py, but did not add them back to config_dbs.json, ARAX_database_manager.py, or generate-db-symlinks.sh.

So the database manager should no longer download/manage the old xCRG model files, while the old creativeCRG.py code will not immediately fail with missing config attributes. The new package-backed connect(action=xcrg) path does not use those legacy model files.

Local validation passed:

  • py_compile: OK
  • git diff --check: OK
  • fast xCRG tests: 6 passed
  • DB manager old xCRG keys: False
  • DB manager curie_ngd: True
  • legacy config attrs: True

@dkoslicki @saramsey please let me know if you would prefer the CI/Docker branch checkout fix to be split into a separate PR. I included it here because otherwise this PR’s CI was not actually testing the PR branch inside the Docker container.

@saramsey
Member

saramsey commented May 18, 2026

First, the CI failure happened because the Docker image used in the Python analysis job clones RTX again inside the container, and that inner clone was using default branch code rather than the PR branch.

Ah yes, this is a frustrating limitation of the CICD-Dockerfile. It has tripped me up multiple times before. Some day, we (the ARAX team) should fix it.

@saramsey
Member

@dkoslicki @saramsey please let me know if you would prefer the CI/Docker branch checkout fix to be split into a separate PR. I included it here because otherwise this PR’s CI was not actually testing the PR branch inside the Docker container.

I'm OK with including these fixes in this PR. Thank you for implementing those fixes, @venkataseshtej.

@dkoslicki
Member

@edeutsch to point arax.ncats.io/test to this branch for us to test

@edeutsch
Collaborator

edeutsch commented May 20, 2026

okay @dkoslicki and @chunyuma I have now deployed branch xcrg-package-integration to /test.

A naive first test seems to show it is working well:
https://arax.ncats.io/test/?r=458983

But please test and confirm if we are ready to merge into master and deploy everywhere.
Others are welcome to test, too!

@hodgesf
Collaborator

hodgesf commented May 20, 2026

All tests pass locally with the pytest suite and the flask server. I think this is good to go. Also, example 3 is now lightning fast, compared to before this update. Awesome job!!

@venkataseshtej
Collaborator Author

@venkataseshtej
Have you tested this PR by running the ARAX flask application server and the "Example 3" question in it?

I think it would be a good idea, before merging. See https://github.com/RTXteam/RTX/blob/master/notes/arax-maintenance-sop.md for some tips on how to run the flask server locally on a dev machine.

Another option is to commandeer a dev-area on arax.ncats.io and test there using the xcrg-package-integration branch code.

Personally, I prefer to test on a developer machine by just running the flask server locally. I find it is easier than doing:

I completed the local ARAX Flask server testing from the xcrg-package-integration branch.

I tested through the HTTP endpoint:

POST http://localhost:5001/api/arax/v1.4/query

1. UI Example 3 / xCRG MVP2 increased query

status: Success
operations: connect(action=xcrg)
results: 1083
kg_nodes: 1571
kg_edges: 3107
auxiliary_graphs: 1523
schema_version: 1.6.0
biolink_version: 4.3.2

TRAPI validator:

CRITICAL: 0
ERRORS: 0
WARNINGS: 0

2. Additional xCRG decreased query

status: Success
operations: connect(action=xcrg)
results: 1378
kg_nodes: 1866
kg_edges: 5243
auxiliary_graphs: 2661
schema_version: 1.6.0
biolink_version: 4.3.2

TRAPI validator:

CRITICAL: 0
ERRORS: 0
WARNINGS: 0

Extra checks passed for both responses:

missing node binding attributes: 0
missing edge binding attributes: 0
KG edges missing attributes: 0
KG edges missing sources: 0
metatype:Datetime attributes: 0

So the local Flask HTTP server path is working for the new package-backed connect(action=xcrg) route. This now covers both the direct ARAXQuery().query(...) path and the Flask HTTP endpoint path.

@edeutsch
Collaborator

edeutsch commented May 21, 2026

This seems great, but I'm afraid the TRAPI validator report:

CRITICAL: 0
ERRORS: 0
WARNINGS: 0

is a red flag. This is extremely difficult to achieve and thus seems unlikely. (the validator is very fussy, so there are always warnings)

The thing is that the validator does not run on initial queries (for various reasons including it would slow things down unacceptably). The validator only runs when a previous result is recalled.

I ran the Example 3 query on /test. No errors are visible. But then I recalled it:
https://arax.ncats.io/test/?r=458992

This reveals the errors:
[images: three screenshots, not preserved]

The biggest issue seems to be that the NCBIGene entries have all empty/null properties:

        "NCBIGene:9994": {
          "attributes": [],
          "categories": [],
          "is_set": false,
          "name": null
        },

I'm not entirely sure where these entries come from, but this is definitely invalid.
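A quick scan of a saved response for entries like the one above could be sketched as follows. The function name `find_placeholder_nodes` is made up for illustration; it simply flags knowledge graph nodes that have neither categories nor a name, which is the pattern shown in the snippet.

```python
def find_placeholder_nodes(knowledge_graph: dict) -> list:
    """Return IDs of KG nodes that have no categories and no name."""
    flagged = []
    for node_id, node in knowledge_graph.get("nodes", {}).items():
        # Empty list, missing key, and None all count as "absent" here.
        if not node.get("categories") and not node.get("name"):
            flagged.append(node_id)
    return flagged
```

Run against `response["message"]["knowledge_graph"]`, an empty return list would indicate that no such placeholder nodes remain.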

@dkoslicki
Member

I'm seeing similar issues in the nodes in the support graphs that are missing all their properties/names/etc.:

[image: screenshot not preserved]

@venkataseshtej , the nodes in the support graphs should be as they are returned from retriever (i.e. all of their properties and the like preserved)

@venkataseshtej
Collaborator Author

venkataseshtej commented May 21, 2026

This seems great, but I'm afraid the TRAPI validator report:

CRITICAL: 0
ERRORS: 0
WARNINGS: 0

is a red flag. This is extremely difficult to achieve and thus seems unlikely. (the validator is very fussy, so there are always warnings)

The thing is that the validator does not run on initial queries (for various reasons including it would slow things down unacceptably). The validator only runs when a previous result is recalled.

I ran the Example 3 query on /test. No errors are visible. But then I recalled it: https://arax.ncats.io/test/?r=458992

This reveals the errors: [images: three screenshots, not preserved]
The biggest issue seems to be that the NCBIGene entries have all empty/null properties:

        "NCBIGene:9994": {
          "attributes": [],
          "categories": [],
          "is_set": false,
          "name": null
        },

I'm not entirely sure where these entries come from, but this is definitely invalid.

Thanks @edeutsch for pointing this out. I see the issue now.
It looks like some KG nodes, especially NCBIGene:* support/path nodes, are coming through with empty categories and null/empty node properties. This seems to be a final TRAPI cleanup issue in the 'catrax-xcrg' package rather than an ARAX routing issue.

I will update the package so that before returning the final response, every KG node has non-empty categories, using CURIE-prefix fallback categories where needed, e.g. NCBIGene:* -> biolink:Gene. I will also check for / prune dangling nodes during that cleanup.
I will include the 500 result limit in the same package update, then update the pinned catrax-xcrg commit in this ARAX PR and retest through the local Flask endpoint before asking for test redeployment.

@dkoslicki
Member

dkoslicki commented May 21, 2026

@venkataseshtej Sounds good (our messages were posted simultaneously), but one thing to note: don't go about it by looking up the node categories, nor using any fallback rules like CURIE prefix to infer categories. Just use the nodes as returned by retriever. Since you're getting them from retriever, just preserve and pass through all of these properties. You definitely do not want to be figuring them out yourself.

@venkataseshtej
Collaborator Author

I'm seeing similar issues in the nodes in the support graphs that are missing all their properties/names/etc.:

[image: screenshot not preserved] @venkataseshtej, the nodes in the support graphs should be as they are returned from retriever (i.e. all of their properties and the like preserved)

Got it. For the support graph nodes, I should not just create fallback nodes from CURIE prefixes if Retriever already returned full node objects. The correct fix is to preserve the node records from Retriever’s knowledge graph when copying support/path edges into the final response, including their name, categories, attributes, and other properties.

I’ll update the catrax-xcrg final TRAPI builder so that support graph node IDs are hydrated from the original Retriever KG nodes first. CURIE-prefix category inference will only be used as a fallback when a referenced node is genuinely missing from the Retriever node map.

@venkataseshtej Sounds good (our messages were posted simultaneously), but one thing to note: don't go about it by looking up the node categories, nor using any fallback rules like CURIE prefix to infer categories. Just use the nodes as returned by retriever. Since you're getting them from retriever, just preserve and pass through all of these properties. You definitely do not want to be figuring them out yourself.

Thanks @dkoslicki, that clarification helps. I will avoid adding any CURIE-prefix category inference or other category lookup logic.

I will fix this by preserving the node objects exactly as returned by Retriever. So when xCRG copies support/path edges into the final KG/support graphs, it will also copy the corresponding subject/object node records from the Retriever KG, including their names, categories, attributes, and other properties.

If a support edge references a node that is not present in the Retriever KG node map, I will treat that as an incomplete support path and avoid fabricating node metadata. The final cleanup goal is pass-through preservation from Retriever, not category reconstruction in xCRG.
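Roughly, the pass-through approach I have in mind looks like the sketch below. It is illustrative only and not the catrax-xcrg builder itself; the function name `collect_support_subgraph` and its return shape are hypothetical.

```python
import copy


def collect_support_subgraph(retriever_kg: dict, support_edge_ids: list):
    """Copy support edges and their endpoint nodes verbatim from Retriever's KG.

    Returns (nodes, edges, incomplete_edge_ids). An edge that is missing, or
    whose subject/object is absent from Retriever's node map, is reported as
    incomplete and left out; no node metadata is invented for it.
    """
    source_nodes = retriever_kg.get("nodes", {})
    source_edges = retriever_kg.get("edges", {})
    nodes, edges, incomplete = {}, {}, []
    for edge_id in support_edge_ids:
        edge = source_edges.get(edge_id)
        if edge is None:
            incomplete.append(edge_id)
            continue
        endpoints = (edge.get("subject"), edge.get("object"))
        if any(node_id not in source_nodes for node_id in endpoints):
            incomplete.append(edge_id)
            continue
        edges[edge_id] = copy.deepcopy(edge)
        for node_id in endpoints:
            # Verbatim copy: name, categories, attributes, and anything else.
            nodes[node_id] = copy.deepcopy(source_nodes[node_id])
    return nodes, edges, incomplete
```

The `incomplete` list makes it possible to log which support paths were dropped, so a gap in Retriever's data stays visible instead of being papered over.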

@dkoslicki
Member

dkoslicki commented May 21, 2026

What you wrote later is correct: treat missing node properties as truly missing. Earlier in this message, you have:

CURIE-prefix category inference will only be used as a fallback when a referenced node is genuinely missing from the Retriever node map.

This is what we don't want. If retriever is missing stuff, it's missing and needs to be fixed by them. We don't want to silently be trying to fix their mistakes, but rather pass them through verbatim, that way if someone sees missing information, we can point to the source and say "that's their problem". No fall backs or trying to fix retriever issues.

@venkataseshtej
Collaborator Author

venkataseshtej commented May 21, 2026

What you wrote later is correct: treat missing node properties as truly missing. Earlier in this message, you have:

CURIE-prefix category inference will only be used as a fallback when a referenced node is genuinely missing from the Retriever node map.

This is what we don't want. If retriever is missing stuff, it's missing and needs to be fixed by them. We don't want to silently be trying to fix their mistakes, but rather pass them through verbatim, that way if someone sees missing information, we can point to the source and say "that's their problem". No fall backs or trying to fix retriever issues.

Follow-up update @dkoslicki: I tested the updated xCRG package locally through the ARAX Flask server using an Example 3-style xCRG query through:

POST http://127.0.0.1:5001/api/arax/v1.4/query

The fresh Flask response now looks clean:

HTTP: 200
operations: connect(action=xcrg)
results: 500
kg_nodes: 501
kg_edges: 1191
aux_graphs: 500
empty-category KG nodes: 0
empty/null placeholder KG nodes: 0
missing support_graph references: 0

I also validated the saved Flask response locally with reasoner-validator 6.0.1, matching the validator version that exposed the /test recall issue:

CRITICAL: 0
ERRORS: 0
WARNINGS: 0

The package now preserves Retriever-provided node metadata when available, avoids CURIE category fallback repair, avoids incomplete evidence nodes/support references and caps xCRG results at 500.

I have pushed the updated catrax-xcrg package and updated the ARAX pin in this PR. @edeutsch could you please redeploy xcrg-package-integration to /test so we can validate the /test response before merge?

@edeutsch
Collaborator

Thank you @venkataseshtej this is now looking very good.
I have deployed to /test and performed the Example 3 query again
https://arax.ncats.io/test/?r=460178
Validation passed, although there are 3 warnings.
[image: screenshot not preserved]

The first one is an objection to this:
[image: screenshot not preserved]

I did a little investigating and I suspect these warnings come from Retriever data, not xCRG data, but I'm not certain.

I think we're in good shape.

Is there anything else that needs to be done or evaluated before we merge into master?

@bazarkua
Collaborator

Just wanted to mention that, after commit 9e5f571, the CI test build fails even when using the PR branch:

====== 3 failed, 157 passed, 133 skipped, 1 warning in 1098.52s (0:18:18) ======

@saramsey
Member

Just recapping here my thoughts about xCRG TRAPI logging, that I shared in Slack on Friday:

As of Friday, the new xCRG module was (seemingly) not writing any details to the TRAPI message log from which we could discern what is going wrong. For example, we see no information about the HTTP status code returned from Retriever, whether the TRAPI message in Retriever's response itself contained an error in its TRAPI log, or whether any results (and if so, how many) were returned from Retriever. It would also be useful if xCRG could emit to STDERR, at least in debugging mode, the TRAPI query graph that it is POSTing to Retriever, so that a team member can manually curl the TRAPI message to Retriever's API at the command line to see how it responds.

@venkataseshtej
Collaborator Author

Thanks @saramsey , agreed. This is feasible and I think it will make the xCRG path much easier to debug.

I have already added the first part of this in the latest catrax-xcrg package update: each Retriever call now logs the Retriever HTTP status, Retriever TRAPI status/description, returned result/node/edge counts, and Retriever log messages when the call returns zero results or a non-complete status.

I will also make sure the xCRG call label is clear in the logs (e.g. direct lookup vs. TF-mediated template/batch), and that failures and non-200 responses are surfaced as ARAX/TRAPI warnings or errors rather than silently resulting in zero results.

For the exact Retriever query graph, I agree that we should expose it in debugging mode. I’ll keep the normal TRAPI log compact, but add DEBUG-level logging and/or debug artifact output for the full TRAPI query being posted to Retriever so it can be copied and tested with curl.
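As a rough illustration of the debug-mode output discussed here, the sketch below prints a copy-pasteable curl command to STDERR only when DEBUG logging is enabled. The helper names `curl_command` and `emit_debug_query` are hypothetical and are not part of catrax-xcrg or ARAX; the URL in the usage note is just the Retriever CI endpoint mentioned earlier in this PR.

```python
import json
import logging
import shlex
import sys


def curl_command(url: str, trapi_query: dict) -> str:
    """Build a shell-safe curl command that POSTs the given TRAPI query as JSON."""
    body = json.dumps(trapi_query, separators=(",", ":"))
    return " ".join([
        "curl", "-sS", "-X", "POST",
        "-H", shlex.quote("Content-Type: application/json"),
        "-d", shlex.quote(body),
        shlex.quote(url),
    ])


def emit_debug_query(url: str, trapi_query: dict, logger: logging.Logger) -> None:
    """Write the curl command to STDERR, but only when DEBUG logging is on."""
    if logger.isEnabledFor(logging.DEBUG):
        print(curl_command(url, trapi_query), file=sys.stderr)
```

For example, `emit_debug_query("https://retriever.ci.transltr.io/query", query, logging.getLogger("xcrg"))` would stay silent at the default log level, which keeps the normal TRAPI log compact.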

I will verify this in /test after the latest branch is redeployed so the ARAX TRAPI message log shows enough information to diagnose Retriever behavior directly.

@venkataseshtej
Collaborator Author

venkataseshtej commented May 26, 2026

Status update on the earlier xCRG changes (05/22/26):

I pushed the updated package/ARAX fixes. The current xCRG config now calls Retriever with tiers=[0, 1] instead of only Tier 0 or only Tier 1. This keeps Tier 0 included as intended, while avoiding the previous zero-result behavior when Tier 0 alone returned no results.

I also added Retriever diagnostics in the catrax-xcrg package so the ARAX TRAPI log now reports:

- Retriever HTTP status
- Retriever TRAPI status/description
- returned result/node/edge counts
- Retriever log messages when a lookup returns zero results or a non-complete status
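The diagnostics listed above could be gathered along these lines. This is only a sketch under stated assumptions, not the package's implementation: `summarize_retriever_response` and `OK_STATUSES` are hypothetical names, the set of statuses treated as "complete" is a guess, and the response is assumed to follow the usual TRAPI layout (`status`, `description`, `logs`, `message.results`, `message.knowledge_graph`).

```python
from typing import Any

# Assumption: which TRAPI status strings count as a normal completion.
OK_STATUSES = frozenset({"Success", "Complete"})


def summarize_retriever_response(
    http_status: int,
    body: dict[str, Any],
    ok_statuses: frozenset = OK_STATUSES,
) -> dict[str, Any]:
    """Collect the HTTP status, TRAPI status/description, and result/node/edge counts."""
    message = body.get("message") or {}
    kg = message.get("knowledge_graph") or {}
    summary: dict[str, Any] = {
        "http_status": http_status,
        "trapi_status": body.get("status"),
        "trapi_description": body.get("description"),
        "n_results": len(message.get("results") or []),
        "n_nodes": len(kg.get("nodes") or {}),
        "n_edges": len(kg.get("edges") or {}),
        "retriever_logs": [],
    }
    # Carry Retriever's own log messages along only for zero results
    # or a status outside the "complete" set.
    if summary["n_results"] == 0 or summary["trapi_status"] not in ok_statuses:
        summary["retriever_logs"] = [
            entry.get("message")
            for entry in (body.get("logs") or [])
            if isinstance(entry, dict)
        ]
    return summary
```

Each field of the returned dictionary maps to one item in the list above, so it could be written to the ARAX TRAPI log as a single line per Retriever call.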

Local validation passed:

Example 3 live xCRG smoke against Retriever CI with tiers=[0,1]: 500 results
Second MVP2 query against Retriever CI with tiers=[0,1]: 41 results
TRAPI validator on the second response: 0 critical / 0 errors / 0 warnings
Fast ARAX xCRG tests: 6 passed
xCRG package tests: 9 passed

@saramsey
Member

@venkata SESH TEJ MATTA I have a question about the new xCRG. I would have put this question in the relevant ARAX issue, but there doesn't seem to be an obvious one (or maybe it is the very old issue #2048?), for the new xCRG, only this PR. Anyhow, here's my question. My understanding is that the new xCRG does not put any information into the ARAX UI's "Expansion Progress" screen. I also vaguely recall hearing that it was claimed (I don't recall in what context, sorry) that this is because xCRG doesn't use ARAX-expand. But, on looking at the ARAX code base, the function that updates the information in the ARAX UI's "Expansion Progress" screen is not per se in ARAX-expand (yes, the name of the screen could be understandably misconstrued to mean that it only works with ARAX-expand), but in the ARAX_response.py module's ARAXResponse class, specifically, the update_query_plan method. And since xCRG is clearly streaming results back to the ARAX UI, and since xCRG's creativeCRG.py module's creativeCRG class has an ARAXResponse object, as shown in its initializer here:

def __init__(self, response: ARAXResponse, data_path: str):

I am wondering: why exactly can't xCRG update the ARAX UI's Expansion Progress screen? Can't it just call response.update_query_plan and specify the qedge_key of the query graph edge that connects the "chemical" query node to the "gene" query node? Please forgive my ignorance. I am only asking because this has been (per my understanding) an interface contract that all ARAX modules, including xDTD, have abided by for a long time.
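To make the suggestion concrete, here is a minimal sketch of what such progress reporting might look like. It does not use the real ARAXResponse: `StubResponse` is a stand-in, and the `update_query_plan(qedge_key, kp, status, description)` argument order, the status strings ("Waiting", "Done", "Error"), and the `kp="retriever"` label are all assumptions that would need to be checked against ARAX_response.py.

```python
from typing import Any, Callable


class StubResponse:
    """Stand-in for ARAXResponse; it only mimics the assumed update_query_plan call shape."""

    def __init__(self) -> None:
        self.query_plan: dict[str, Any] = {"qedge_keys": {}}

    def update_query_plan(self, qedge_key: str, kp: str, status: str, description: str) -> None:
        entry = self.query_plan["qedge_keys"].setdefault(qedge_key, {})
        entry[kp] = {"status": status, "description": description}


def run_with_progress(
    response: Any,
    qedge_key: str,
    lookup: Callable[[], list],
    kp: str = "retriever",  # hypothetical label for the knowledge source
) -> list:
    """Report progress for one query edge before and after the lookup runs."""
    response.update_query_plan(qedge_key, kp, "Waiting", "Posting xCRG query to Retriever")
    try:
        results = lookup()
    except Exception as exc:
        response.update_query_plan(qedge_key, kp, "Error", f"Lookup failed: {exc}")
        raise
    response.update_query_plan(qedge_key, kp, "Done", f"Returned {len(results)} results")
    return results
```

Here `qedge_key` would be the key of the query graph edge between the chemical and gene query nodes, and `lookup` would wrap the actual Retriever call.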

@edeutsch
Collaborator

I would encourage following Steve's suggestion on updating the query_plan.
That said, I think it would be good to get the current functionality deployed first rather than holding up deployment for this addition, since Sarah seems eager for the new functionality.

@saramsey
Member

My understanding is that there is a blocking issue with the new xCRG code, in that it is not returning aux graphs. I may be mistaken, but that is the impression I am getting from the thread on Slack in #deployment.

@hodgesf
Collaborator

hodgesf commented May 28, 2026

Dr. Ramsey's comment above has been verified. Example 3 is failing on /test because there are TRAPI validation errors.
