Commit e971175

Merge pull request #165 from codellm-devkit/minor/fix-160-161-162-163-164
Read-only Python Neo4j backend, per-language factory facade & typed backend configs (#160-#164)
2 parents f477ab7 + 21b3e3b commit e971175

49 files changed

Lines changed: 6053 additions & 446 deletions

CHANGELOG.md

Lines changed: 61 additions & 0 deletions
@@ -5,6 +5,67 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- **Per-language factory methods on `CLDK`**: `CLDK.java()`, `CLDK.python()`, `CLDK.typescript()`,
  and `CLDK.c()`, each with an honest signature exposing only the options that apply to that
  language. These are the preferred entry points, replacing the stringly-typed
  `CLDK(language).analysis(...)`.
- **Typed backend-configuration objects** in `cldk.analysis.commons.backend_config`. The backend is
  now selected by the *type* of the `backend=` config passed to a factory: `CodeAnalyzerConfig`
  (default; in-process analyzer) / `PyCodeAnalyzerConfig` (adds `use_codeql`, `use_ray`), or
  `Neo4jConnectionConfig` (read-only Neo4j). `Neo4jConnectionConfig` is hoisted here and re-exported
  from `cldk.analysis.{python,typescript}.neo4j` for backward compatibility.
- **Unified, language-keyed cache directory.** All backends now share a single `cache_dir`
  (default `<project>/.codeanalyzer`) and write their artifacts under a per-language subdirectory
  (`<cache_dir>/java`, `<cache_dir>/python`, `<cache_dir>/typescript`), so a polyglot project
  analyzed under more than one language no longer overwrites a shared `analysis.json`.
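
The type-based backend selection described in the list above can be sketched as follows. This is a minimal, self-contained illustration, not CLDK code: the three dataclasses are stand-ins that only mirror the names and fields listed in this changelog, and `describe_backend` is a hypothetical helper. Treating a missing config as `CodeAnalyzerConfig()` is an assumption based on the changelog calling that config the default.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


# Stand-ins mirroring the config names in the changelog. The real classes are
# said to live in cldk.analysis.commons.backend_config.
@dataclass
class CodeAnalyzerConfig:
    cache_dir: Union[str, Path, None] = None


@dataclass
class PyCodeAnalyzerConfig(CodeAnalyzerConfig):
    use_codeql: bool = True
    use_ray: bool = False


@dataclass
class Neo4jConnectionConfig:
    uri: str
    username: str = "neo4j"
    password: str = "neo4j"
    database: Optional[str] = None
    application_name: Optional[str] = None


Backend = Union[CodeAnalyzerConfig, Neo4jConnectionConfig]


def describe_backend(backend: Optional[Backend] = None) -> str:
    """Hypothetical helper: choose a backend family from the config's type."""
    if backend is None:
        # Assumption: no config means the default in-process analyzer.
        backend = CodeAnalyzerConfig()
    if isinstance(backend, Neo4jConnectionConfig):
        return f"neo4j (read-only) at {backend.uri}"
    # Check the subclass before its base class, otherwise it is never reached.
    if isinstance(backend, PyCodeAnalyzerConfig):
        return (
            "codeanalyzer (in-process, "
            f"use_codeql={backend.use_codeql}, use_ray={backend.use_ray})"
        )
    if isinstance(backend, CodeAnalyzerConfig):
        return "codeanalyzer (in-process)"
    raise TypeError(f"unsupported backend config: {type(backend).__name__}")


print(describe_backend())
print(describe_backend(PyCodeAnalyzerConfig(use_ray=True)))
print(describe_backend(Neo4jConnectionConfig(uri="bolt://localhost:7687")))
```

Dispatching on the config type keeps each option next to the backend it belongs to; for example, `use_ray` cannot be passed alongside a Neo4j connection because `Neo4jConnectionConfig` has no such field.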

### Changed
- **Caching is on by default for Java/TypeScript.** The in-process backend now caches `analysis.json`
  to disk (under the language-keyed `cache_dir`) instead of streaming over a stdout pipe.
- `CLDK(language).analysis(...)` is **deprecated** and retained as a thin compatibility shim that
  forwards to the new factory methods (emits a `DeprecationWarning`).

### Deprecated
- Java `source_code` (single-file) input; pass `project_path` instead.

### Removed
- `analysis_backend_path` from the public interface. The backend binary ships with the packaged
  `codeanalyzer-*` dependency; for TypeScript, `$CODEANALYZER_TS_BIN` remains as the only
  out-of-band override.
- `analysis_json_path` from the public interface; it is folded into the unified `cache_dir`.

### Migration
- The language-keyed cache relocates `analysis.json` from `<cache_dir>/analysis.json` to
  `<cache_dir>/<language>/analysis.json`; existing caches are not found at the new path, so the
  first run after upgrading recomputes the analysis.
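
The relocation described above can be illustrated with a short sketch. `relocate_legacy_cache` is a hypothetical helper, not part of CLDK, and it only demonstrates the old and new path layout by moving a pre-upgrade file into its language-keyed location. Whether an older `analysis.json` is still accepted by the upgraded backend is not stated in this changelog, so letting the first run recompute remains the safe default.

```python
import shutil
from pathlib import Path
from typing import Optional, Union


def relocate_legacy_cache(cache_dir: Union[str, Path], language: str) -> Optional[Path]:
    """Hypothetical helper: move <cache_dir>/analysis.json to
    <cache_dir>/<language>/analysis.json.

    Returns the new path, or None when there is nothing to move or a
    language-keyed cache already exists.
    """
    root = Path(cache_dir).expanduser().resolve()
    legacy = root / "analysis.json"
    target = root / language / "analysis.json"
    if not legacy.is_file() or target.exists():
        return None
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(legacy), str(target))
    return target
```

In a polyglot project the old shared file belongs to whichever language wrote it last, so it should be moved only under that language's key.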

### Added (Neo4j)
- Read-only Neo4j-backed TypeScript analysis backend (`cldk.analysis.typescript.neo4j.TSNeo4jBackend`).
  It is a drop-in alternative to the in-memory `TSCodeanalyzer`: it answers the **same** `get_*`
  query surface (call graph, callers/callees, class hierarchy, call sites, decorators, symbol
  lookups, ...) by running **Cypher over a live Neo4j graph** instead of walking the pydantic /
  NetworkX structures. The graph is the one `codeanalyzer-typescript` emits with `--emit neo4j`
  (schema `schema.neo4j.json`); it is always populated out of band, and the SDK only polls it
  (read-only: never writes, needs no binary or project sources).
- `TypeScriptAnalysis` / `CLDK.analysis(language="typescript")` now accept an optional
  `neo4j_config` (`Neo4jConnectionConfig`) to select the Neo4j backend; without it the in-memory
  backend is used, unchanged.
- Read-only Neo4j-backed **Python** analysis backend (`cldk.analysis.python.neo4j.PyNeo4jBackend`),
  the analog of the TypeScript one. It answers all 21 `PythonAnalysisBackend` queries via Cypher
  over the graph `codeanalyzer-python` (>= 0.2.0) emits with `--emit neo4j`. Verified against a real
  57-module project: every node/edge **present in the graph** reconstructs identically to the
  in-memory `PyCodeanalyzer` (3169/3200 checks; zero weight/provenance mismatches on shared call
  edges). Known gaps are not in the query layer: projection-lossy fields (comments → docstring,
  `PyVariableDeclaration.value`/columns, per-binding import detail), and an **upstream emitter bug**
  where calls to a bare module name that is also imported (e.g. `os`/`re`/`json`) are dropped from
  the emitted call graph. `PythonAnalysis` / `CLDK.analysis(language="python")` accept the same
  optional `neo4j_config`.
- Bumped `codeanalyzer-python` to `0.2.0` (adds the Neo4j graph emitter).
- Optional `neo4j` extra (`pip install cldk[neo4j]`) for the Neo4j Python driver.

## [v1.0.7] - 2026-02-14

### Added
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
################################################################################
# Copyright IBM Corporation 2026
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

"""Backend configuration objects for the CLDK analysis facades.

The CLDK front end selects analysis behavior along two orthogonal axes: the *language* (chosen by
which :class:`~cldk.core.CLDK` factory method is called) and the *backend* (chosen by the **type**
of the configuration object passed as ``backend=``). The dataclasses here are those configuration
objects -- Parameter Objects that the facades ingest and dispatch on.

Two backend families exist:

* :class:`CodeAnalyzerConfig` (and its language-specific subclasses) selects the in-process
  codeanalyzer backend, which runs the packaged ``codeanalyzer-*`` binary and caches its
  ``analysis.json`` under a language-keyed cache directory.
* :class:`Neo4jConnectionConfig` selects the read-only Neo4j/Cypher backend, which answers the
  same queries over a graph populated out of band.

The per-language ``*Backend`` unions below are the discriminated unions the facades match on.
"""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Union

# The canonical sub-directory name each language's artifacts live under inside the shared cache
# root. Keyed so that a polyglot repository analyzed under more than one language does not have its
# backends overwrite a single shared ``analysis.json``.
_CACHE_KEYS = {"java": "java", "python": "python", "typescript": "typescript", "c": "c"}


@dataclass
class CodeAnalyzerConfig:
    """Select the in-process codeanalyzer backend.

    The backend binary is sourced from the packaged ``codeanalyzer-*`` dependency, so the only
    knob is where analysis artifacts are cached.

    Attributes:
        cache_dir: Root directory for analysis artifacts. When ``None`` the facade defaults it to
            ``<project>/.codeanalyzer``. Each backend writes under a language-keyed subdirectory of
            this root (see :func:`cache_subdir`), so the same root can be shared across languages.
    """

    cache_dir: Union[str, Path, None] = None


@dataclass
class PyCodeAnalyzerConfig(CodeAnalyzerConfig):
    """Select the in-process codeanalyzer backend for Python.

    Adds the Python-only call-graph knobs on top of :class:`CodeAnalyzerConfig`.

    Attributes:
        use_codeql: If ``True`` (default), augment Jedi-based call-graph resolution with CodeQL.
        use_ray: If ``True``, enable Ray-based parallel processing for large projects.
    """

    use_codeql: bool = True
    use_ray: bool = False


@dataclass
class Neo4jConnectionConfig:
    """Select the read-only Neo4j-backed analysis backend.

    The graph is always populated out of band (e.g. a job that runs ``codeanalyzer-* --emit
    neo4j``); the SDK only polls it. This config carries the connection details and which
    application to scope queries to.

    Attributes:
        uri: Bolt URI of the Neo4j server (e.g. ``bolt://localhost:7687``).
        username: Neo4j username (read-only credentials are sufficient).
        password: Neo4j password.
        database: Database name (``None`` => server default).
        application_name: The application anchor name to scope queries to. Matches the
            ``--app-name`` the graph was loaded with (defaults to the project directory name).
    """

    uri: str
    username: str = "neo4j"
    password: str = "neo4j"
    database: str | None = None
    application_name: str | None = None


# Per-language discriminated unions the facades match on. Java has no Neo4j backend yet, so its
# only admissible config is the codeanalyzer one.
JavaBackend = CodeAnalyzerConfig
PyBackend = Union[PyCodeAnalyzerConfig, Neo4jConnectionConfig]
TSBackend = Union[CodeAnalyzerConfig, Neo4jConnectionConfig]


def cache_subdir(cache_dir: Union[str, Path, None], project_dir: Union[str, Path, None], language: str) -> Path | None:
    """Resolve the language-keyed cache directory for a backend.

    Args:
        cache_dir: The cache root from the backend config. When ``None``, defaults to
            ``<project_dir>/.codeanalyzer``.
        project_dir: The project directory, used to derive the default root.
        language: The canonical language key (``"java"``, ``"python"``, ``"typescript"``, ``"c"``).

    Returns:
        ``<root>/<language>`` as an absolute path, or ``None`` if no root can be determined
        (no ``cache_dir`` and no ``project_dir``).
    """
    key = _CACHE_KEYS.get(language, language)
    if cache_dir is not None:
        root = Path(cache_dir).expanduser().resolve()
    elif project_dir is not None:
        root = Path(project_dir).expanduser().resolve() / ".codeanalyzer"
    else:
        return None
    return root / key
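
To make the resolution order in `cache_subdir` concrete, the sketch below repeats the function from the diff so that it runs on its own, then exercises its three branches. The `/tmp/...` paths are made-up examples, and `Optional[Path]` replaces `Path | None` only so the snippet does not depend on postponed annotation evaluation.

```python
from pathlib import Path
from typing import Optional, Union

_CACHE_KEYS = {"java": "java", "python": "python", "typescript": "typescript", "c": "c"}


def cache_subdir(
    cache_dir: Union[str, Path, None],
    project_dir: Union[str, Path, None],
    language: str,
) -> Optional[Path]:
    # Unknown language names fall through unchanged as their own key.
    key = _CACHE_KEYS.get(language, language)
    if cache_dir is not None:
        root = Path(cache_dir).expanduser().resolve()
    elif project_dir is not None:
        root = Path(project_dir).expanduser().resolve() / ".codeanalyzer"
    else:
        return None
    return root / key


# 1. An explicit cache root wins over the project directory.
print(cache_subdir("/tmp/shared-cache", "/tmp/proj", "python"))

# 2. Without a cache root, the default is <project_dir>/.codeanalyzer/<language>.
print(cache_subdir(None, "/tmp/proj", "typescript"))

# 3. With neither root available, there is nowhere to cache, so the result is None.
print(cache_subdir(None, None, "java"))
```

Because both roots pass through `resolve()`, the printed paths are absolute and may differ between platforms (for example where `/tmp` is a symlink).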
