A Python static-analysis toolkit — the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph.
canpy is a static analyzer for Python built on Jedi, with optional
CodeQL-resolved call edges and
Tree-sitter parsing. It produces the canonical CodeLLM-DevKit
(CLDK) analysis.json — a symbol table plus a call graph — and can project that same analysis into a
Neo4j property graph. It is the Python backend behind
CLDK, mirroring its
TypeScript (cants) and
Java siblings.
Every run produces a symbol table and a call graph. Edges come from Jedi's lexical resolution by
default; --codeql resolves additional edges (RPC / third-party / dynamically-dispatched targets)
and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.
- Symbol table: modules, classes, functions, methods, variables, decorators, imports, and docstrings, with precise source spans.
- Call graph: Jedi's lexical resolver by default, with optional CodeQL-resolved edges (--codeql) for RPC / third-party / dynamically-dispatched targets, merged with the Jedi edges; CodeQL also backfills callees Jedi could not resolve.
- Neo4j output: project the analysis into a labeled property graph, either as a self-contained graph.cypher snapshot or as an incremental push to a live database over Bolt.
- Versioned schema: a machine-readable, version-stamped Neo4j schema contract (--emit schema), checked in as schema.neo4j.json and shipped with every release.
- Incremental cache: per-file results are cached under .codeanalyzer; --lazy (default) reuses them, --eager forces a clean rebuild. --ray distributes the work across cores.
- Compact output: canonical analysis.json, or binary analysis.msgpack for smaller artifacts.
- Python 3.10 or newer.
- A C toolchain and the venv / development headers. The analyzer builds an isolated virtual environment per project (via Python's venv) so Jedi can resolve types and imports:
  # Ubuntu / Debian
  sudo apt install python3-venv python3-dev build-essential
  # Fedora / RHEL / CentOS
  sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel
  # macOS
  xcode-select --install
pip install codeanalyzer-python
canpy --help
For the optional live Neo4j push (--emit neo4j --neo4j-uri …), install the neo4j extra:
pip install 'codeanalyzer-python[neo4j]'
Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh
Or install with Homebrew:
brew install codellm-devkit/tap/codeanalyzer-python
The formula depends on uv and installs canpy as an isolated, version-pinned uv tool (the package and its dependencies are resolved and cached on first run).
This project uses uv for dependency management.
git clone https://github.com/codellm-devkit/codeanalyzer-python
cd codeanalyzer-python
uv sync --all-groups
uv run canpy --help
canpy --input /path/to/python/project
With no --output, the analysis is printed to stdout as compact JSON; with --output <dir> it is
written to analysis.json (or graph.cypher for --emit neo4j, or analysis.msgpack with
--format msgpack) in that directory.
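When canpy is driven from a script, it can help to know up front which artifact a given flag combination writes. The sketch below encodes only the rules stated above; the helper names (artifact_name, build_command, analyze_to_dict) are hypothetical and not part of canpy, and analyze_to_dict assumes that nothing other than the JSON document is written to stdout.

```python
import json
import subprocess
from typing import Optional


def artifact_name(emit: str = "json", fmt: str = "json") -> str:
    """File name written into --output for a given --emit / --format pair."""
    if emit == "neo4j":
        return "graph.cypher"
    if emit == "schema":
        return "schema.json"
    if emit == "json":
        return "analysis.msgpack" if fmt == "msgpack" else "analysis.json"
    raise ValueError(f"unknown emit target: {emit!r}")


def build_command(
    input_dir: str,
    output_dir: Optional[str] = None,
    emit: str = "json",
    fmt: str = "json",
) -> list[str]:
    """Assemble a canpy invocation using only flags shown in this README."""
    cmd = ["canpy", "--input", input_dir, "--emit", emit]
    if emit == "json":
        # --format only applies to --emit json.
        cmd += ["--format", fmt]
    if output_dir is not None:
        cmd += ["--output", output_dir]
    return cmd


def analyze_to_dict(input_dir: str) -> dict:
    """Run canpy without --output and parse the JSON printed on stdout.

    Assumption: stdout carries the analysis document and nothing else.
    """
    completed = subprocess.run(
        build_command(input_dir), check=True, capture_output=True, text=True
    )
    return json.loads(completed.stdout)
```

For example, build_command("./my-python-project", "./out", fmt="msgpack") yields a command whose result would be expected at ./out/analysis.msgpack.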
$ canpy --help
Usage: canpy [OPTIONS] COMMAND [ARGS]...
Static Analysis on Python source code using Jedi, PyCG and Tree sitter.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --input -i PATH Path to the │
│ project root │
│ directory (not │
│ required for │
│ --emit schema). │
│ --output -o PATH Output directory │
│ for artifacts. │
│ --format -f [json|msgpack] Output format │
│ for --emit json: │
│ json or msgpack. │
│ [default: json] │
│ --emit [json|neo4j|sche Output target: │
│ ma] json │
│ (analysis.json, │
│ default) | neo4j │
│ (graph.cypher or │
│ live Bolt push) │
│ | schema (the │
│ Neo4j │
│ schema.json │
│ contract). │
│ [default: json] │
│ --app-name TEXT Logical │
│ application name │
│ for the graph │
│ :PyApplication │
│ anchor (default: │
│ input dir name). │
│ --neo4j-uri TEXT Push the graph │
│ to a live Neo4j │
│ over Bolt │
│ (incremental); │
│ omit to write │
│ graph.cypher. │
│ [env var: │
│ NEO4J_URI] │
│ --neo4j-user TEXT Neo4j username. │
│ [env var: │
│ NEO4J_USERNAME] │
│ [default: neo4j] │
│ --neo4j-password TEXT Neo4j password. │
│ Prefer the env │
│ var over the │
│ flag (the flag │
│ is visible in │
│ shell history / │
│ process list). │
│ [env var: │
│ NEO4J_PASSWORD] │
│ [default: neo4j] │
│ --neo4j-database TEXT Neo4j database │
│ name (default: │
│ server default). │
│ [env var: │
│ NEO4J_DATABASE] │
│ --analysis-level -a INTEGER RANGE Analysis depth: │
│ [1<=x<=2] 1=symbol │
│ table+Jedi call │
│ graph, 2=+PyCG │
│ call graph. │
│ [default: 1] │
│ --ray --no-ray Enable Ray for │
│ distributed │
│ analysis. │
│ [default: │
│ no-ray] │
│ --eager --lazy Enable eager or │
│ lazy analysis. │
│ Defaults to │
│ lazy. │
│ [default: lazy] │
│ --skip-tests --include-tests Skip test files │
│ in analysis. │
│ [default: │
│ skip-tests] │
│ --no-venv --venv Skip virtualenv │
│ creation and │
│ dependency │
│ installation; │
│ resolve imports │
│ against the │
│ ambient Python │
│ environment │
│ instead. │
│ [default: venv] │
│ --file-name PATH Analyze only the │
│ specified file │
│ (relative to │
│ input │
│ directory). │
│ --cache-dir -c PATH Directory to │
│ store analysis │
│ cache. Defaults │
│ to │
│ '.codeanalyzer' │
│ in the input │
│ directory. │
│ --clear-cache --keep-cache Clear cache │
│ after analysis. │
│ By default, │
│ cache is │
│ retained. │
│ [default: │
│ keep-cache] │
│ -v INTEGER Increase │
│ verbosity: -v, │
│ -vv, -vvv │
│ [default: 0] │
│ --pycg-shard --no-pycg-shard Shard PyCG │
│ call-graph │
│ analysis by │
│ Python package │
│ (level 2 only). │
│ When the project │
│ exceeds the │
│ 500-file │
│ ceiling, PyCG is │
│ run │
│ independently │
│ per top-level │
│ package with │
│ cross-package │
│ imports treated │
│ as ghost nodes. │
│ Without this │
│ flag, projects │
│ over the ceiling │
│ fall back to │
│ Jedi-only edges. │
│ [default: │
│ no-pycg-shard] │
│ --pycg-shard-cei… INTEGER RANGE Maximum files │
│ [x>=1] per shard when │
│ --pycg-shard is │
│ active (default │
│ 100). Shards │
│ exceeding this │
│ limit are │
│ skipped; their │
│ call edges are │
│ omitted from the │
│ call graph (Jedi │
│ edges for those │
│ packages are │
│ still included). │
│ Lower values are │
│ safer for │
│ packages with │
│ deep class │
│ hierarchies or │
│ heavy import │
│ graphs. │
│ [default: 100] │
│ --pycg-shard-tim… INTEGER RANGE Per-shard │
│ [x>=0] wall-clock │
│ timeout in │
│ seconds when │
│ --pycg-shard is │
│ active (default │
│ 120). A shard │
│ that exceeds │
│ this limit is │
│ skipped │
│ gracefully. │
│ PyCG's fixpoint │
│ is bimodal: it │
│ either converges │
│ quickly or │
│ diverges │
│ indefinitely, so │
│ the timeout acts │
│ as a final │
│ safety net after │
│ the file-count │
│ ceiling. Set to │
│ 0 to disable. │
│ POSIX only │
│ (macOS / Linux); │
│ ignored on │
│ Windows. │
│ [default: 120] │
│ --pycg-shard-str… [jedi|package] How --pycg-shard │
│ groups files │
│ (level 2 only). │
│ 'jedi' (default) │
│ partitions the │
│ Jedi │
│ module-dependen… │
│ graph (SCC + │
│ Louvain) so │
│ tightly-coupled │
│ modules │
│ co-compute and │
│ few call edges │
│ are severed │
│ between shards; │
│ import cycles │
│ are never split. │
│ 'package' uses │
│ the legacy │
│ one-shard-per-p… │
│ grouping. │
│ [default: jedi] │
│ --pycg-max-iter INTEGER RANGE Cap on PyCG's │
│ [x>=-1] fixpoint passes │
│ per │
│ shard/project │
│ (level 2; │
│ default 50). │
│ PyCG iterates │
│ until its │
│ points-to state │
│ stops changing, │
│ but its │
│ access-path │
│ domain has no │
│ convergence │
│ bound, so heavy │
│ metaclass/mixin │
│ code (e.g. an │
│ ORM) can loop │
│ with each pass │
│ costing seconds. │
│ The cap returns │
│ a │
│ sound-but-incom… │
│ call graph │
│ instead of │
│ looping until │
│ the timeout │
│ kills it. Set to │
│ -1 for PyCG's │
│ unbounded │
│ run-to-converge… │
│ behaviour. │
│ [default: 50] │
│ --help Show this │
│ message and │
│ exit. │
╰──────────────────────────────────────────────────────────────────────────────╯
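The help text for --pycg-shard-strategy says that import cycles are never split across shards and that shards over the file ceiling are skipped. The sketch below illustrates that idea only: it finds strongly connected components of a module-dependency graph with Tarjan's algorithm and greedily packs whole components into shards. It is not canpy's implementation, it leaves out the Louvain step entirely, and the function names and the greedy packing rule are assumptions made for illustration.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps a module to the modules it imports."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            # `node` is the root of a component: pop its members off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    nodes = set(graph)
    for deps in graph.values():
        nodes.update(deps)
    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return components


def pack_shards(components, ceiling):
    """Pack whole components into shards of at most `ceiling` modules.

    A component larger than the ceiling cannot be split without breaking
    an import cycle, so it is reported as skipped instead.
    """
    shards, skipped, current = [], [], set()
    for component in components:
        if len(component) > ceiling:
            skipped.append(component)
            continue
        if len(current) + len(component) > ceiling:
            shards.append(current)
            current = set()
        current |= component
    if current:
        shards.append(current)
    return shards, skipped
```

With the graph {"a": ["b"], "b": ["a"], "c": ["a"], "d": []}, modules a and b form one cycle and always land in the same shard, or are skipped together when the ceiling is 1.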
- Basic analysis to stdout, or to a file:
  canpy --input ./my-python-project                  # compact JSON on stdout
  canpy --input ./my-python-project --output ./out   # → ./out/analysis.json
- Binary output (msgpack):
  canpy --input ./my-python-project --output ./out --format msgpack   # → ./out/analysis.msgpack
- Resolve extra call edges with CodeQL:
  canpy --input ./my-python-project --codeql
  By default, edges come from Jedi's lexical analysis. Adding --codeql resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges; CodeQL also backfills callees Jedi could not resolve. CodeQL integration is experimental; the CLI is downloaded into <cache_dir>/codeql/ on first use.
- Emit a Neo4j snapshot, or push to a live database:
  canpy --input ./my-python-project --emit neo4j --output ./out   # → ./out/graph.cypher
  canpy --input ./my-python-project --emit neo4j \
    --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret
- Emit the Neo4j schema contract:
  canpy --emit schema                  # print schema.json to stdout (no project needed)
  canpy --emit schema --output ./out   # → ./out/schema.json
- Force a clean rebuild with a custom cache directory:
  canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache
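The merge of Jedi-derived and CodeQL-resolved edges mentioned in the CodeQL example can be pictured as a union keyed by (source, target) that records which resolver reported each edge. This is a minimal sketch under stated assumptions, not canpy's merge logic: the function name merge_edges and the provenance labels "jedi" and "codeql" are hypothetical, and the weight field is left out because its meaning is not described here.

```python
def merge_edges(jedi_edges, codeql_edges):
    """Union two lists of (source, target) pairs, tracking provenance.

    Edges are keyed by the (source, target) signature pair, so an edge
    reported by both resolvers appears once with both labels attached.
    """
    merged = {}
    for origin, edges in (("jedi", jedi_edges), ("codeql", codeql_edges)):
        for source, target in edges:
            entry = merged.setdefault(
                (source, target),
                {"source": source, "target": target, "provenance": []},
            )
            if origin not in entry["provenance"]:
                entry["provenance"].append(origin)
    return list(merged.values())


# Illustrative input only: signatures are made up for the example.
jedi = [("app.main", "app.helper")]
codeql = [("app.main", "app.helper"), ("app.main", "requests.get")]
for edge in merge_edges(jedi, codeql):
    print(edge["source"], "->", edge["target"], edge["provenance"])
```

Keying on the signature pair keeps the result stable regardless of how many times either resolver reports the same edge.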
canpy builds one analysis in memory and can emit it three ways (--emit):
A PyApplication document — the canonical CLDK contract:
By default this is printed to stdout in JSON; with --output it is written to analysis.json (or
analysis.msgpack with --format msgpack, a more compact binary format).
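A consumer of analysis.json might start with a quick summary like the one sketched here. It relies only on the two top-level keys shown in this README (symbol_table and call_graph) and on the source field of each edge; load_analysis and summarize are hypothetical helper names, and the sample document is invented for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def load_analysis(path):
    """Read an analysis.json file written by `canpy --output <dir>`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(analysis):
    """Count modules and edges, and list the callers with the most edges."""
    edges = analysis.get("call_graph", [])
    fan_out = Counter(edge["source"] for edge in edges)
    return {
        "modules": len(analysis.get("symbol_table", {})),
        "edges": len(edges),
        "busiest_callers": fan_out.most_common(3),
    }


# Invented sample with the documented top-level shape.
sample = {
    "symbol_table": {"app/main.py": {}, "app/helper.py": {}},
    "call_graph": [
        {"source": "app.main", "target": "app.helper"},
        {"source": "app.main", "target": "requests.get"},
        {"source": "app.helper", "target": "logging.info"},
    ],
}
print(summarize(sample))
```

For a real run, summarize(load_analysis("out/analysis.json")) would apply the same summary to the file on disk.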
--emit neo4j projects the same analysis into a labeled property graph. Every node label is
Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS) so multiple
language analyzers can share one database without label or relationship-type collisions. Declarations
are keyed by their signature under a shared :PySymbol label; calls, imports, inheritance,
decorators, and call sites are relationships:
- Without --neo4j-uri: writes a self-contained graph.cypher (constraints + indexes, a scoped wipe, then batched MERGEs). Load it with cypher-shell < graph.cypher. Needs no extra dependencies.
- With --neo4j-uri: pushes to a live Neo4j over Bolt incrementally. Only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the neo4j extra.
Every graph carries a schema_version on its :PyApplication node.
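The incremental push described above boils down to comparing per-module content hashes between two runs. The sketch below shows that comparison in isolation. It assumes SHA-256 as the hash and plain dicts of path to hash as the stored state; neither detail is confirmed by this README, and plan_incremental_push is a hypothetical name.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash a module's source bytes (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental_push(previous, current):
    """Decide which modules to rewrite and which to prune.

    `previous` and `current` map a module path to its content hash.
    Returns (changed, vanished): modules that are new or whose hash
    differs, and modules that existed before but are gone now.
    """
    changed = sorted(p for p, h in current.items() if previous.get(p) != h)
    vanished = sorted(p for p in previous if p not in current)
    return changed, vanished


previous = {
    "a.py": content_hash(b"x = 1\n"),
    "b.py": content_hash(b"y = 2\n"),
    "old.py": content_hash(b"pass\n"),
}
current = {
    "a.py": content_hash(b"x = 1\n"),   # unchanged, left alone
    "b.py": content_hash(b"y = 3\n"),   # edited, rewritten
    "new.py": content_hash(b""),        # added, written
}
changed, vanished = plan_incremental_push(previous, current)
print("rewrite:", changed)
print("prune:", vanished)
```

Unchanged modules never appear in either list, which is what keeps a repeated push cheap.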
Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets)
are materialized as :PyExternal ghost nodes, mirroring the analyzer's own ghost-node behaviour.
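Which endpoints end up as ghost nodes follows from a set difference: any call-graph endpoint whose signature is not declared in the symbol table. A minimal sketch, assuming the declared signatures have already been collected into an iterable (how to extract them from the symbol table is not shown here, and external_endpoints is a hypothetical name):

```python
def external_endpoints(call_graph, declared_signatures):
    """Return endpoints that have no declaration, in first-seen order."""
    declared = set(declared_signatures)
    externals = []
    for edge in call_graph:
        for endpoint in (edge["source"], edge["target"]):
            if endpoint not in declared and endpoint not in externals:
                externals.append(endpoint)
    return externals


# Invented signatures for illustration.
edges = [
    {"source": "app.main", "target": "app.helper"},
    {"source": "app.main", "target": "requests.get"},
    {"source": "app.helper", "target": "requests.get"},
]
print(external_endpoints(edges, ["app.main", "app.helper"]))
```

In this sample only requests.get lacks a declaration, so it is the single endpoint that would be represented as a :PyExternal node.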
The connection options also read from the standard Neo4j environment variables — NEO4J_URI,
NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE — when the corresponding flag is omitted (an
explicit flag wins). Prefer the env var for the password so it doesn't land in shell history or the
process list:
export NEO4J_URI=bolt://localhost:7687
export NEO4J_PASSWORD=secret
canpy -i ./my-project --emit neo4j   # credentials picked up from the environment
--emit schema writes the machine-readable, version-stamped Neo4j schema (schema.json: node labels,
relationships, properties, constraints, and indexes). It needs no project and is checked into the repo
as schema.neo4j.json and bundled in every release as a GitHub Release asset, so a consumer can
validate producer/consumer compatibility without invoking the tool. The shape of the contract matches
the codeanalyzer-typescript backend.
A UML of the analysis.json schema (the PyApplication containment tree) is checked in as
schema-uml.drawio, and the property-graph schema as
neo4j-schema.drawio.
This project uses uv.
uv sync --all-groups
uv run canpy --input /path/to/project # run from source
uv run canpy --emit schema > schema.neo4j.json # regenerate the checked-in schema contract
uv run python scripts/update_readme.py # regenerate the canpy --help block above
uv run pytest # run the test suite
The Neo4j schema-conformance test always runs. The Neo4j Bolt integration test spins up a real Neo4j via Testcontainers and is opt-in: it needs a container runtime (Docker or Podman) and is enabled with an environment variable:
RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s
Apache 2.0; see LICENSE.
Top-level shape of the PyApplication document (analysis.json):
{
  "symbol_table": { /* file path → module (classes, functions, variables, imports, …) */ },
  "call_graph": [ /* CALL_DEP edges: { source, target, weight, provenance } keyed by callable signature */ ]
}