Skip to content

codellm-devkit/codeanalyzer-python

Repository files navigation

CodeLLM-DevKit

codeanalyzer-python (canpy)

A Python static-analysis toolkit — the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph.

PyPI GitHub release Release License


canpy is a static analyzer for Python built on Jedi, with optional CodeQL-resolved call edges and Tree-sitter parsing. It produces the canonical CodeLLM-DevKit (CLDK) analysis.json — a symbol table plus a call graph — and can project that same analysis into a Neo4j property graph. It is the Python backend behind CLDK, mirroring its TypeScript (cants) and Java siblings.

Every run produces a symbol table and a call graph. Edges come from Jedi's lexical resolution by default; --codeql resolves additional edges (RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.

Table of Contents

Features

  • Symbol table — modules, classes, functions, methods, variables, decorators, imports, and docstrings, with precise source spans.
  • Call graph — Jedi's lexical resolver by default, with optional CodeQL-resolved edges (--codeql) for RPC / third-party / dynamically-dispatched targets, merged with the Jedi edges; CodeQL also backfills callees Jedi could not resolve.
  • Neo4j output — project the analysis into a labeled property graph: a self-contained graph.cypher snapshot, or an incremental push to a live database over Bolt.
  • Versioned schema — a machine-readable, version-stamped Neo4j schema contract (--emit schema), checked in as schema.neo4j.json and shipped with every release.
  • Incremental cache — per-file results are cached under .codeanalyzer; --lazy (default) reuses them, --eager forces a clean rebuild. --ray distributes the work across cores.
  • Compact output — canonical analysis.json, or binary analysis.msgpack for smaller artifacts.

Installation

Prerequisites

  • Python 3.10 or newer.

  • A C toolchain and the venv / development headers — the analyzer builds an isolated virtual environment per project (via Python's venv) so Jedi can resolve types and imports:

    # Ubuntu / Debian
    sudo apt install python3-venv python3-dev build-essential
    
    # Fedora / RHEL / CentOS
    sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel
    
    # macOS
    xcode-select --install

Install via pip (PyPI)

pip install codeanalyzer-python
canpy --help

For the optional live Neo4j push (--emit neo4j --neo4j-uri …), install the neo4j extra:

pip install 'codeanalyzer-python[neo4j]'

Install via shell script

Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh

Install via Homebrew

brew install codellm-devkit/tap/codeanalyzer-python

The formula depends on uv and installs canpy as an isolated, version-pinned uv tool (the package and its dependencies are resolved and cached on first run).

Build from source

This project uses uv for dependency management.

git clone https://github.com/codellm-devkit/codeanalyzer-python
cd codeanalyzer-python
uv sync --all-groups
uv run canpy --help

Usage

canpy --input /path/to/python/project

With no --output, the analysis is printed to stdout as compact JSON; with --output <dir> it is written to analysis.json (or graph.cypher for --emit neo4j, or analysis.msgpack with --format msgpack) in that directory.

Options

$ canpy --help

 Usage: canpy [OPTIONS] COMMAND [ARGS]...

 Static Analysis on Python source code using Jedi, PyCG and Tree sitter.

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --input            -i                     PATH              Path to the      │
│                                                             project root     │
│                                                             directory (not   │
│                                                             required for     │
│                                                             --emit schema).  │
│ --output           -o                     PATH              Output directory │
│                                                             for artifacts.   │
│ --format           -f                     [json|msgpack]    Output format    │
│                                                             for --emit json: │
│                                                             json or msgpack. │
│                                                             [default: json]  │
│ --emit                                    [json|neo4j|sche  Output target:   │
│                                           ma]               json             │
│                                                             (analysis.json,  │
│                                                             default) | neo4j │
│                                                             (graph.cypher or │
│                                                             live Bolt push)  │
│                                                             | schema (the    │
│                                                             Neo4j            │
│                                                             schema.json      │
│                                                             contract).       │
│                                                             [default: json]  │
│ --app-name                                TEXT              Logical          │
│                                                             application name │
│                                                             for the graph    │
│                                                             :PyApplication   │
│                                                             anchor (default: │
│                                                             input dir name). │
│ --neo4j-uri                               TEXT              Push the graph   │
│                                                             to a live Neo4j  │
│                                                             over Bolt        │
│                                                             (incremental);   │
│                                                             omit to write    │
│                                                             graph.cypher.    │
│                                                             [env var:        │
│                                                             NEO4J_URI]       │
│ --neo4j-user                              TEXT              Neo4j username.  │
│                                                             [env var:        │
│                                                             NEO4J_USERNAME]  │
│                                                             [default: neo4j] │
│ --neo4j-password                          TEXT              Neo4j password.  │
│                                                             Prefer the env   │
│                                                             var over the     │
│                                                             flag (the flag   │
│                                                             is visible in    │
│                                                             shell history /  │
│                                                             process list).   │
│                                                             [env var:        │
│                                                             NEO4J_PASSWORD]  │
│                                                             [default: neo4j] │
│ --neo4j-database                          TEXT              Neo4j database   │
│                                                             name (default:   │
│                                                             server default). │
│                                                             [env var:        │
│                                                             NEO4J_DATABASE]  │
│ --analysis-level   -a                     INTEGER RANGE     Analysis depth:  │
│                                           [1<=x<=2]         1=symbol         │
│                                                             table+Jedi call  │
│                                                             graph, 2=+PyCG   │
│                                                             call graph.      │
│                                                             [default: 1]     │
│ --ray                  --no-ray                             Enable Ray for   │
│                                                             distributed      │
│                                                             analysis.        │
│                                                             [default:        │
│                                                             no-ray]          │
│ --eager                --lazy                               Enable eager or  │
│                                                             lazy analysis.   │
│                                                             Defaults to      │
│                                                             lazy.            │
│                                                             [default: lazy]  │
│ --skip-tests           --include-tests                      Skip test files  │
│                                                             in analysis.     │
│                                                             [default:        │
│                                                             skip-tests]      │
│ --no-venv              --venv                               Skip virtualenv  │
│                                                             creation and     │
│                                                             dependency       │
│                                                             installation;    │
│                                                             resolve imports  │
│                                                             against the      │
│                                                             ambient Python   │
│                                                             environment      │
│                                                             instead.         │
│                                                             [default: venv]  │
│ --file-name                               PATH              Analyze only the │
│                                                             specified file   │
│                                                             (relative to     │
│                                                             input            │
│                                                             directory).      │
│ --cache-dir        -c                     PATH              Directory to     │
│                                                             store analysis   │
│                                                             cache. Defaults  │
│                                                             to               │
│                                                             '.codeanalyzer'  │
│                                                             in the input     │
│                                                             directory.       │
│ --clear-cache          --keep-cache                         Clear cache      │
│                                                             after analysis.  │
│                                                             By default,      │
│                                                             cache is         │
│                                                             retained.        │
│                                                             [default:        │
│                                                             keep-cache]      │
│                    -v                     INTEGER           Increase         │
│                                                             verbosity: -v,   │
│                                                             -vv, -vvv        │
│                                                             [default: 0]     │
│ --pycg-shard           --no-pycg-shard                      Shard PyCG       │
│                                                             call-graph       │
│                                                             analysis by      │
│                                                             Python package   │
│                                                             (level 2 only).  │
│                                                             When the project │
│                                                             exceeds the      │
│                                                             500-file         │
│                                                             ceiling, PyCG is │
│                                                             run              │
│                                                             independently    │
│                                                             per top-level    │
│                                                             package with     │
│                                                             cross-package    │
│                                                             imports treated  │
│                                                             as ghost nodes.  │
│                                                             Without this     │
│                                                             flag, projects   │
│                                                             over the ceiling │
│                                                             fall back to     │
│                                                             Jedi-only edges. │
│                                                             [default:        │
│                                                             no-pycg-shard]   │
│ --pycg-shard-cei…                         INTEGER RANGE     Maximum files    │
│                                           [x>=1]            per shard when   │
│                                                             --pycg-shard is  │
│                                                             active (default  │
│                                                             100). Shards     │
│                                                             exceeding this   │
│                                                             limit are        │
│                                                             skipped; their   │
│                                                             call edges are   │
│                                                             omitted from the │
│                                                             call graph (Jedi │
│                                                             edges for those  │
│                                                             packages are     │
│                                                             still included). │
│                                                             Lower values are │
│                                                             safer for        │
│                                                             packages with    │
│                                                             deep class       │
│                                                             hierarchies or   │
│                                                             heavy import     │
│                                                             graphs.          │
│                                                             [default: 100]   │
│ --pycg-shard-tim…                         INTEGER RANGE     Per-shard        │
│                                           [x>=0]            wall-clock       │
│                                                             timeout in       │
│                                                             seconds when     │
│                                                             --pycg-shard is  │
│                                                             active (default  │
│                                                             120). A shard    │
│                                                             that exceeds     │
│                                                             this limit is    │
│                                                             skipped          │
│                                                             gracefully.      │
│                                                             PyCG's fixpoint  │
│                                                             is bimodal: it   │
│                                                             either converges │
│                                                             quickly or       │
│                                                             diverges         │
│                                                             indefinitely, so │
│                                                             the timeout acts │
│                                                             as a final       │
│                                                             safety net after │
│                                                             the file-count   │
│                                                             ceiling. Set to  │
│                                                             0 to disable.    │
│                                                             POSIX only       │
│                                                             (macOS / Linux); │
│                                                             ignored on       │
│                                                             Windows.         │
│                                                             [default: 120]   │
│ --pycg-shard-str…                         [jedi|package]    How --pycg-shard │
│                                                             groups files     │
│                                                             (level 2 only).  │
│                                                             'jedi' (default) │
│                                                             partitions the   │
│                                                             Jedi             │
│                                                             module-dependen… │
│                                                             graph (SCC +     │
│                                                             Louvain) so      │
│                                                             tightly-coupled  │
│                                                             modules          │
│                                                             co-compute and   │
│                                                             few call edges   │
│                                                             are severed      │
│                                                             between shards;  │
│                                                             import cycles    │
│                                                             are never split. │
│                                                             'package' uses   │
│                                                             the legacy       │
│                                                             one-shard-per-p… │
│                                                             grouping.        │
│                                                             [default: jedi]  │
│ --pycg-max-iter                           INTEGER RANGE     Cap on PyCG's    │
│                                           [x>=-1]           fixpoint passes  │
│                                                             per              │
│                                                             shard/project    │
│                                                             (level 2;        │
│                                                             default 50).     │
│                                                             PyCG iterates    │
│                                                             until its        │
│                                                             points-to state  │
│                                                             stops changing,  │
│                                                             but its          │
│                                                             access-path      │
│                                                             domain has no    │
│                                                             convergence      │
│                                                             bound, so heavy  │
│                                                             metaclass/mixin  │
│                                                             code (e.g. an    │
│                                                             ORM) can loop    │
│                                                             with each pass   │
│                                                             costing seconds. │
│                                                             The cap returns  │
│                                                             a                │
│                                                             sound-but-incom… │
│                                                             call graph       │
│                                                             instead of       │
│                                                             looping until    │
│                                                             the timeout      │
│                                                             kills it. Set to │
│                                                             -1 for PyCG's    │
│                                                             unbounded        │
│                                                             run-to-converge… │
│                                                             behaviour.       │
│                                                             [default: 50]    │
│ --help                                                      Show this        │
│                                                             message and      │
│                                                             exit.            │
╰──────────────────────────────────────────────────────────────────────────────╯

Examples

  1. Basic analysis to stdout, or to a file:

    canpy --input ./my-python-project                        # compact JSON on stdout
    canpy --input ./my-python-project --output ./out         # → ./out/analysis.json
  2. Binary output (msgpack):

    canpy --input ./my-python-project --output ./out --format msgpack   # → ./out/analysis.msgpack
  3. Resolve extra call edges with CodeQL:

    canpy --input ./my-python-project --codeql

    By default, edges come from Jedi's lexical analysis. Adding --codeql resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges; CodeQL also backfills resolved callees Jedi could not resolve. CodeQL integration is experimental; the CLI is downloaded into <cache_dir>/codeql/ on first use.

  4. Emit a Neo4j snapshot, or push to a live database:

    canpy --input ./my-python-project --emit neo4j --output ./out   # → ./out/graph.cypher
    canpy --input ./my-python-project --emit neo4j \
      --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret
  5. Emit the Neo4j schema contract:

    canpy --emit schema                   # print schema.json to stdout (no project needed)
    canpy --emit schema --output ./out    # → ./out/schema.json
  6. Force a clean rebuild with a custom cache directory:

    canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache

Output targets

canpy builds one analysis in memory and can emit it three ways (--emit):

analysis.json (default)

A PyApplication document — the canonical CLDK contract:

{
  "symbol_table": { /* file path → module (classes, functions, variables, imports, …) */ },
  "call_graph":   [ /* CALL_DEP edges: { source, target, weight, provenance } keyed by callable signature */ ]
}

By default this is printed to stdout in JSON; with --output it is written to analysis.json (or analysis.msgpack with --format msgpack, a more compact binary format).

Neo4j graph

--emit neo4j projects the same analysis into a labeled property graph. Every node label is Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS) so multiple language analyzers can share one database without label or relationship-type collisions. Declarations are keyed by their signature under a shared :PySymbol label; calls, imports, inheritance, decorators, and call sites are relationships:

  • Without --neo4j-uri — writes a self-contained graph.cypher (constraints + indexes, a scoped wipe, then batched MERGEs). Load it with cypher-shell < graph.cypher. Needs no extra dependencies.
  • With --neo4j-uri — pushes to a live Neo4j over Bolt incrementally: only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the neo4j extra. Every graph carries a schema_version on its :PyApplication node.

Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets) are materialized as :PyExternal ghost nodes, mirroring the analyzer's own ghost-node behaviour.

The connection options also read from the standard Neo4j environment variables — NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE — when the corresponding flag is omitted (an explicit flag wins). Prefer the env var for the password so it doesn't land in shell history or the process list:

export NEO4J_URI=bolt://localhost:7687
export NEO4J_PASSWORD=secret
canpy -i ./my-project --emit neo4j     # credentials picked up from the environment

Schema contract

--emit schema writes the machine-readable, version-stamped Neo4j schema (schema.json: node labels, relationships, properties, constraints, and indexes). It needs no project and is checked into the repo as schema.neo4j.json and bundled in every release as a GitHub Release asset, so a consumer can validate producer/consumer compatibility without invoking the tool. The shape of the contract matches the codeanalyzer-typescript backend.

A UML of the analysis.json schema (the PyApplication containment tree) is checked in as schema-uml.drawio, and the property-graph schema as neo4j-schema.drawio.

Development

This project uses uv.

uv sync --all-groups
uv run canpy --input /path/to/project           # run from source
uv run canpy --emit schema > schema.neo4j.json  # regenerate the checked-in schema contract
uv run python scripts/update_readme.py          # regenerate the canpy --help block above
uv run pytest                                   # run the test suite

The Neo4j schema-conformance test always runs. The Neo4j bolt integration test spins up a real Neo4j via Testcontainers and is opt-in — it needs a container runtime (Docker or Podman) and is enabled with an environment variable:

RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s

License

Apache 2.0 — see LICENSE.

About

Python Static Analysis Backend for CLDK

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors