Integrate reusable xCRG package into ARAX#2772
Conversation
dkoslicki
left a comment
There was a problem hiding this comment.
Looks good, with just one comment about deployment level. @mohsenht can you review too since this is in your connect part of the code?
Also, likely best to, moving forward, pin the requirements to a versioned PyPi release rather than a commit SHA
| XCRG_RETRIEVER_URL_ENV = "ARAX_XCRG_RETRIEVER_URL" | ||
| XCRG_TIMEOUT_ENV = "ARAX_XCRG_TIMEOUT" | ||
| XCRG_TF_BATCH_SIZE_ENV = "ARAX_XCRG_TF_BATCH_SIZE" | ||
| DEFAULT_XCRG_RETRIEVER_URL = "https://retriever.ci.transltr.io/query" |
There was a problem hiding this comment.
The retriever URL should pull from whatever deployment level ARAX is currently at. For example, if the code is deployed at arax.ci.transltr.io, it should call retriever.ci.transltr.io. If deployed at arax.test.transltr.io, it should call retirever.test.transltr.io. And if at arax.transltr.io, it should call retriever.transltr.io. You should be able to pull this from configuration file (I forget exactly which).
There was a problem hiding this comment.
Thanks, that makes sense. I updated xCRG Retriever URL selection to use RTXConfiguration().maturity so ARAX staging/testing/production map to the corresponding Retriever deployment. I kept ARAX_XCRG_RETRIEVER_URL as an explicit local/debug override.
I also agree that once catrax-xcrg has a versioned PyPI release, we should replace the Git commit pin with a pinned PyPI version.
|
Awesome! But I just noticed that all the xCRG tests are skipped due to being marked as slow in the test suite. Can you please:
|
|
From this PR, I see that environment variables are used for xCRG configuration. From the submitter's comments, I gather that the environment variables are intended for debugging purposes, which I presume means on a dev machine or in a "test rig" type environment. If there is a potential use-case intended for setting one or more of the new xCRG environment variables within the ARAX flask application, how would we be proposing to set them? This is just a reminder that ARAX is (at least, with the current code-base we have) started out of a System V init script, on ITRB systems and on |
Thank you, I checked this and pushed an update to the PR. I ran the existing slow-marked xCRG tests locally with So I did not remove the slow marks from those tests, since they are not reliable always-on CI tests for the new xCRG path. Instead, I added fast always-on tests for the new ARAX xCRG integration path in: These tests cover:
Local validation passed: I pushed this update to the PR branch. |
|
@venkataseshtej thank you for coding up some new faster xCRG unit tests. That was a great idea. |
|
@venkataseshtej Have you tested this PR by running the ARAX flask application server and the "Example 3" question in it? I think it would be a good idea, before merging. See Another option is to commandeer a dev-area on Personally, I prefer to test on a developer machine by just running the flask server locally. I find it is easier that doing: and so forth. If you would like some help getting the ARAX Flask server working on your developer machine, @hodgesf or @bazarkua can help show you how to do it. |
Thanks, that is a good point. The intention is that the For deployed ARAX, no new environment variables need to be set. The Retriever URL is now selected from The timeout and TF batch size also have code defaults, so the System V init script should not need to be modified for xCRG. The env vars are mainly for local testing/debugging, for example if a developer wants to temporarily point xCRG to a specific Retriever deployment or adjust timeout/batch size without changing code. If we later decide these values need to be production-tunable, I agree the cleaner approach would be to add them to the ARAX/RTX configuration rather than relying on System V init-script environment overrides. I can also add a short code comment to make this explicit. |
Thanks, good point. I have tested the PR through the local I agree that this is worth doing before merge. I will follow the local Flask server portion of the ARAX maintenance SOP, start the Flask server from this PR branch, submit the same MVP2/xCRG query through the HTTP endpoint, and validate the returned response with the TRAPI validator. I will report the result back on the PR before merge. |
|
@venkataseshtej here is the most relevant section of the ARAX Maintenance SOP for this situation, I think: Unfortunately, that section has gotten a little bit out-of-date since we deployed Tier0 ARAX. But it gives the gist of how to set things up, at least. |
|
@venkataseshtej , there are no new database files for this new xCRG implementation, correct? @saramsey , do you have any guidance on removing the old xCRG database files? This new method is model-free in comparison to the old, model-based approach. |
If there are database files that are no longer needed, I propose that in the and the code corresponding to those file(s) should be removed from the @bazarkua or @hodgesf can help with the edits to the There is also this shell script, which isn't used in an automated way by any of our ARAX systems but it is a convenience script for managing ARAX on a dev machine. I'm happy to edit it in the Are all three of these lines going away? |
Thanks, that makes sense. For the new package-backed xCRG path, I do not believe there are new xCRG-specific database files that need to be added to That said, I will audit I also agree on the Flask server test. So far I tested through the local |
Thanks, this makes sense. For the new package-backed The new path uses:
So yes, these three old model-based xCRG entries are legacy only and are not used by the new package backed xCRG path: I removed those three entries from I did not remove the existing NGD DB configuration, since the new xCRG path still uses NGD through the ARAX NGD helper. Validation: |
Yes, correct. The new package-backed It uses:
So the old model-based xCRG files are no longer needed for this new path: I removed those legacy references from |
|
I found two separate issues and pushed follow up fixes. First, the CI failure happened because the Docker image used in the Python analysis job clones RTX again inside the container, and that inner clone was using default branch code rather than the PR branch. That is why CI still saw the old xCRG DB entries and tried to rsync the old Second, I found that the legacy So the database manager should no longer download/manage the old xCRG model files, while the old Local validation passed:
@dkoslicki @saramsey please let me know if you would prefer the CI/Docker branch checkout fix to be split into a separate PR. I included it here because otherwise this PR’s CI was not actually testing the PR branch inside the Docker container . |
Ah yes, this is a frustrating limitation of the CICD-Dockerfile. It has tripped me up multiple times before. Some day, we (the ARAX team) should fix it. |
I'm OK with including these fixes in this PR. Thank you for implementing those fixes, @venkataseshtej. |
|
@edeutsch to point arax.ncats.io/test to this branch for us to test |
|
okay @dkoslicki and @chunyuma I have now deployed branch A naive first test seems to show it is working well: But please test and confirm if we are ready to merge into |
|
All tests pass locally with the pytest suite and the flask server. I think this is good to go. Also, example 3 is now lightening fast, compared to before this update. Awesome job!! |
I completed the local ARAX Flask server testing from the I tested through the HTTP endpoint: 1. UI Example 3 / xCRG MVP2 increased queryTRAPI validator: 2. Additional xCRG decreased queryTRAPI validator: Extra checks passed for both responses: So the local Flask HTTP server path is working for the new package-backed |
|
This seems great, but I'm afraid the TRAPI validator report: is a red flag. This is extremely difficult to achieve and thus seems unlikely. (the validator is very fussy, so there are always warnings) The thing is that the validator does not run on initial queries (for various reasons including it would slow things down unacceptably). The validator only runs when a previous result is recalled. I ran the Example 3 query on /test. No errors are visible. But then I recalled it:
The biggest issue seems to be that the NCBIGene entries have all empty/null properties: I'm not entirely sure where these entries come from, but this is definitely invalid. |
|
I'm seeing similar issues in the nodes in the support graphs that are missing all their properties/names/etc.:
@venkataseshtej , the nodes in the support graphs should be as they are returned from retriever (i.e. all of their properties and the like preserved) |
Thanks @edeutsch for pointing this out. I see the issue now. I will update the package so that before returning the final response, every KG node has non-empty categories, using CURIE-prefix fallback categories where needed, e.g. NCBIGene:* -> biolink:Gene. I will also check for / prune dangling nodes during that cleanup. |
|
@venkataseshtej Sounds good (our messages were posted simultaneously), but one thing to note: don't go about it by looking up the node categories, nor using any fallback rules like CURIE prefix to infer categories. Just use the nodes as returned by retriever. Since you're getting them from retriever, just preserve and pass through all of these properties. You definitely do not want to be figuring them out yourself. |
Got it.For the support graph nodes, I should not just create fallback nodes from CURIE prefixes if Retriever already returned full node objects. The correct fix is to preserve the node records from Retriever’s knowledge graph when copying support/path edges into the final response, including their name, categories, attributes, and other properties. I’ll update the catrax-xcrg final TRAPI builder so that support graph node IDs are hydrated from the original Retriever KG nodes first. CURIE-prefix category inference will only be used as a fallback when a referenced node is genuinely missing from the Retriever node map.
Thanks @dkoslicki, that clarification helps. I will avoid adding any CURIE-prefix category inference or other category lookup logic. I will fix this by preserving the node objects exactly as returned by Retriever. So when xCRG copies support/path edges into the final KG/support graphs, it will also copy the corresponding subject/object node records from the Retriever KG, including their names, categories, attributes, and other properties. If a support edge references a node that is not present in the Retriever KG node map, I will treat that as an incomplete support path and avoid fabricating node metadata. The final clean up goal will be pass through preservation from Retriever, not the category reconstruction in xCRG. |
|
What you wrote later is correct: treat missing node properties as truly missing. Earlier in this message, you have:
This is what we don't want. If retriever is missing stuff, it's missing and needs to be fixed by them. We don't want to silently be trying to fix their mistakes, but rather pass them through verbatim, that way if someone sees missing information, we can point to the source and say "that's their problem". No fall backs or trying to fix retriever issues. |
Follow up update @dkoslicki : I tested the updated xCRG package locally through the ARAX Flask server using an Example 3-style xCRG query through: The fresh Flask response now looks clean: I also validated the saved Flask response locally with The package now preserves Retriever-provided node metadata when available, avoids CURIE category fallback repair, avoids incomplete evidence nodes/support references and caps xCRG results at 500. I have pushed the updated |
|
Thank you @venkataseshtej this is now looking very good. The first one is an objection to this: I did a little investigating and I suspect these warnings come Retriever data, not xCRG data, but I'm not certain. I think we're in good shape. Is there anything else that needs to be done or evaluated before we merge into |
|
Just wanted to mention looks like that after commit 9e5f571 ====== 3 failed, 157 passed, 133 skipped, 1 warning in 1098.52s (0:18:18) ====== |
|
Just recapping here my thoughts about xCRG TRAPI logging, that I shared in Slack on Friday:
|
Thanks @saramsey , agreed. This is feasible and I think it will make the xCRG path much easier to debug. I have already added the first part of this in the latest I will also make sure the xCRG call label is clear in the logs, e.g. direct lookup vs TF-mediated template/batch, and that failures/non 200 responses are surfaced as ARAX/TRAPI warnings or errors rather than silently resulting in zero results. For the exact Retriever query graph, I agree that we should expose it in debugging mode. I’ll keep the normal TRAPI log compact, but add DEBUG-level logging and/or debug artifact output for the full TRAPI query being posted to Retriever so it can be copied and tested with I will verify this in |
|
Status update on the earlier xCRG changes: (05/22/26) I pushed the updated package/ARAX fixes. The current xCRG config now calls Retriever with I also added Retriever diagnostics in the Local validation passed: |
|
@venkata SESH TEJ MATTA I have a question about the new xCRG. I would have put this question in the relevant ARAX issue, but there doesn't seem to be an obvious one (or maybe it is the very old issue #2048?), for the new xCRG, only this PR. Anyhow, here's my question. My understanding is that the new xCRG does not put any information into the ARAX UI's "Expansion Progress" screen. I also vaguely recall hearing that it was claimed (I don't recall in what context, sorry) that this is because xCRG doesn't use ARAX-expand. But, on looking at the ARAX code base, the function that updates the information in the ARAX UI's "Expansion Progress" screen is not per se in ARAX-expand (yes, the name of the screen could be understandably misconstrued to mean that it only works with ARAX-expand), but in the I am wondering, why exactly can't xCRG update to the ARAX UI's Expansion Progress screen? Can't it just call |
|
I would encourage Steve's suggestion on the query_plan updating. |
|
My understanding is that there is a blocking issue with the new xCRG code, in that it is not returning aux graphs. I may be mistaken, but that is the impression I am getting from the thread on Slack in |
|
Dr. Ramsey's comment above has been verified. Example 3 is failing on /test because there are TRAPI validation errors. |










This PR integrates the reusable
catrax-xcrgpackage into ARAX for MVP2/xCRG inferred ChemicalEntity-Gene activity/abundance queries.Main changes
catrax-xcrgas a pinned dependency.ARAX_query_graph_interpreter.connect(action=xcrg).catrax-xcrgpackage fromARAX_connect.ARAX_XCRG_RETRIEVER_URLARAX_XCRG_TIMEOUTARAX_XCRG_TF_BATCH_SIZEget_curie_ngd_path().ResultTransformersafely no-op for xCRG responses because xCRG already returns final TRAPI support graph output.connect()replaces the message.Validation
I ran one MVP2/xCRG query through ARAX locally.
The query was recognized as xCRG MVP2 and routed to:
It returned real results:
TRAPI validator result on the ARAX-produced response:
Extra checks also passed:
Notes
For local testing, I used:
because
dev.retriever.biothings.iowas returning424 Failed Dependencyfor this query shape.The xCRG package does not hardcode the Retriever endpoint; it is config/envdriven.
Local NGD testing used
get_curie_ngd_path()and a local symlink to the Shepherd NGD SQLite DB. In CI/prod, ARAX DB manager should provide the NGD DB at that expected path.