docs: add llms.txt, agent skills, and AI assistants guide #1505
Open
timsaucer wants to merge 13 commits into apache:main from …
Conversation
Adds a new `skill` page that embeds the repo-root `SKILL.md` through the
myst `{include}` directive, so the agent-facing guide lives on the
published docs site without duplication. The page is wired into the
User Guide toctree. Implements PR 4a of the plan in apache#1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `docs/source/llms.txt` in llmstxt.org schema: a short description plus categorized links to the agent skill, user guide pages, DataFrame API reference, and example queries. `html_extra_path` in `conf.py` copies it verbatim to the published site root so it resolves at `https://datafusion.apache.org/python/llms.txt`. Implements PR 4b of the plan in apache#1394. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
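The `html_extra_path` wiring described above can be very small. A minimal sketch of the relevant `conf.py` fragment (the real file contains many more settings; only the list value here is assumed):

```python
# docs/source/conf.py (sketch -- the real file has many more settings).
# Files listed in html_extra_path are copied verbatim to the HTML output
# root at build time, so docs/source/llms.txt ends up being served at
# <site-root>/llms.txt with no further plumbing.
html_extra_path = ["llms.txt"]
```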
Adds `.ai/skills/write-dataframe-code/SKILL.md`, a contributor-facing skill for agents working on this repo. It layers on top of the user-facing repo-root SKILL.md with:
- a TPC-H pattern index mapping idiomatic API usages to the query file that demonstrates them,
- an ad-hoc plan-comparison workflow for checking DataFrame translations against a reference SQL query via `optimized_logical_plan()`, and
- the project-specific docstring and aggregate/window documentation conventions that CLAUDE.md already enforces for contributors.

Implements PR 4c of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `.ai/skills/audit-skill-md/SKILL.md`, a contributor skill that cross-references the repo-root `SKILL.md` against the current public Python API (functions module, DataFrame, Expr, SessionContext, and package-root re-exports). Reports two classes of drift:
- new APIs exposed by the Python surface that are not yet covered in the user-facing guide, and
- stale mentions in the guide that no longer exist in the public API.

The skill is diff-only -- it produces a report the user reviews before any edit to `SKILL.md`. Complements `check-upstream/`, which audits in the opposite direction (upstream Rust features not yet exposed).

Implements PR 4d of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the illustrative patterns that apache#1504 removed from the TPC-H examples into the common-operations docs, where they serve as pattern-focused teaching material without cluttering the TPC-H translations:
- expressions.rst gains a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position` + `make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`.
- udf-and-udfa.rst gains a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case).
- aggregations.rst gains a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length`/`array_element` for the single-value-per-group pattern (the Q21 case).
- Adds `examples/array-operations.py`, a runnable end-to-end walkthrough of the membership and array_agg patterns.

Implements PR 4e of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… AGENTS.md

- List the three contributor skills (`check-upstream`, `write-dataframe-code`, `audit-skill-md`) under the Skills section so agents know what tools they have before starting work.
- Document the plan-comparison diagnostic workflow (comparing `ctx.sql(...).optimized_logical_plan()` against a DataFrame's `optimized_logical_plan()` via `LogicalPlan.__eq__`) for translating SQL queries to DataFrame form. Points at the full write-up in the `write-dataframe-code` skill rather than duplicating it.

`CLAUDE.md` is a symlink to `AGENTS.md`, so the change lands in both.

Implements PR 4f of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g state

The "Building per-group arrays" block added in the previous commit reassigned `df` and `ctx` mid-page, which then broke the Grouping Sets examples further down that share the Pokemon `df` binding (`col_type_1` etc. no longer resolved). Rename the demo DataFrame to `orders_df` and drop the redundant `ctx = SessionContext()` so the shared state from the top of the page stays intact.

Verified with `sphinx-build -W --keep-going` against the full docs tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… page
The previous approach embedded the repo-root `SKILL.md` on the docs
site via a myst `{include}`. That file is written for agents -- dense,
skill-formatted, and not suited to a human browsing the User Guide. It
also relied on a fragile `:start-line:` offset to strip YAML
frontmatter.
Replace it with `docs/source/ai-coding-assistants.md`, a short
human-readable page that mirrors the README section added in apache#1503:
what the skill is, how to install it via `npx skills` or a manual
pointer, and what kinds of things it covers. `SKILL.md` stays at the
repo root as the single source of truth; agents fetch the raw GitHub
URL directly.
`llms.txt` is updated to point its Agent Guide entry at
`raw.githubusercontent.com/.../SKILL.md` and to include the new
human-readable page as a secondary link. The User Guide toctree now
references `ai-coding-assistants` in place of the removed `skill`
stub.
Verified with `sphinx-build -W --keep-going` against the full docs
tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The introduction and the "Installing the skill" section both enumerated the same set of supported assistants. Drop the intro copy; the list that matters is next to `npx skills add`, where it answers "what does this command actually configure?"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… title

Every other page in `docs/source/user-guide` and the top-level `docs/source` is written in reStructuredText; the lone `.md` page was an inconsistency. Rewrite in rst so the ASF header matches the rest of the tree, cross-references can use `:py:func:` roles if we ever add any, and myst is no longer required just to render this one page.

Also shorten the page title from "Using DataFusion with AI Coding Assistants" to "Using AI Coding Assistants" -- it already sits under the DataFusion user guide so the product name is redundant.

Verified with `sphinx-build -W --keep-going`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skill as written pushed for every public method to be mentioned in `SKILL.md`, which is the wrong goal. `SKILL.md` is a distilled agent guide of idiomatic patterns and pitfalls, not an API reference -- autoapi-generated docs and module docstrings already provide full per-method coverage. An audit pressing for 100% method coverage would bloat the skill file into a stale copy of that reference.

The two checks with actual value (stale mentions in `SKILL.md`, and drift between `functions.__all__` and the categorized function list) are small enough to be ad-hoc greps at release time and do not warrant a dedicated skill.

Also remove references to the skill from `AGENTS.md` and the `write-dataframe-code` skill's "Related" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A separate PR covers the same contributor-facing material (TPC-H pattern index, plan-comparison workflow, docstring conventions), so this skill is redundant. Remove the skill directory and the corresponding references in `AGENTS.md`, including the plan-comparison section that pointed at it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of the section asserted that a UDF predicate blocks optimizer rewrites but did not show evidence. Replace the two `code-block` examples with an executable walkthrough that writes a small Parquet file, runs the same filter two ways, and prints the physical plan for each. The native-expression plan renders with three annotations on the `DataSourceExec` node that the UDF plan does not have:
- `predicate=brand@1 = A AND qty@2 >= 150` pushed into the scan
- `pruning_predicate=... brand_min@0 <= A AND ... qty_max@4 >= 150` for row-group pruning via Parquet footer min/max stats
- `required_guarantees=[brand in (A)]` for bloom-filter / dictionary skipping

The UDF form keeps only `predicate=brand_qty_filter(...)`: the scan has to materialize every row group and call the Python callback. The disjunctive-OR rewrite (previously the main example) stays at the end as the idiomatic alternative for multi-bucket filters.

Verified with `sphinx-build -W --keep-going`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Part of #1394 (implements PR 4 of the plan).
Rationale for this change
#1394 tracks making datafusion-python legible to AI coding assistants without breaking the experience for humans browsing the docs. Earlier PRs shipped the repo-root `SKILL.md` (#1497), enriched module docstrings and doctests (#1498), added a README section pointing agents at the skill (#1503), and rewrote the TPC-H examples in idiomatic DataFrame form (#1504). This PR fills in the docs-site layer: a machine-readable entry point for LLM tooling, a short human-written page explaining how to wire up an AI assistant, and two contributor-facing skills that agents working on this repo can pick up.

It also relocates the pattern demos that #1504 removed from the TPC-H queries (CASE filtering, array-based membership, UDF-vs-expression predicates, `array_agg` with filter) into the common-operations docs, so those teaching examples still live somewhere concrete.

What changes are included in this PR?
- `docs/source/llms.txt`: an llmstxt.org entry point, copied verbatim to the site root via `html_extra_path`. Categorized links to the skill, user guide, DataFrame API reference, and TPC-H examples.
- `docs/source/ai-coding-assistants.rst`: a short human-written page mirroring the README section added in #1503 (docs: add README section for AI coding assistants). Explains what the skill is, how to install it (`npx skills add apache/datafusion-python` or a manual `AGENTS.md`/`CLAUDE.md` pointer), and what it covers. Wired into the User Guide toctree.
- `.ai/skills/write-dataframe-code/SKILL.md`: a contributor skill layered on top of the repo-root `SKILL.md`. Adds a TPC-H pattern index (which query demonstrates which API), the plan-comparison diagnostic workflow for translating SQL to DataFrame form, and the project-specific docstring conventions.
- `.ai/skills/audit-skill-md/SKILL.md`: a contributor skill that cross-references `SKILL.md` against the current public Python surface (functions module, `DataFrame`, `Expr`, `SessionContext`, package-root re-exports) and reports new APIs needing coverage and stale mentions. Diff-only; does not auto-edit.
- `AGENTS.md` (symlinked as `CLAUDE.md`): lists the three contributor skills and documents the plan-comparison diagnostic workflow.
- `docs/source/user-guide/common-operations/expressions.rst`: adds a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position`/`make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`.
- `docs/source/user-guide/common-operations/udf-and-udfa.rst`: adds a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case).
- `docs/source/user-guide/common-operations/aggregations.rst`: adds a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length` and `array_element` for the single-value-per-group pattern (the Q21 case).
- `examples/array-operations.py`: a runnable end-to-end walkthrough of the membership and `array_agg` patterns. Linked from `examples/README.md`.

Verified with `pre-commit run --all-files` and `sphinx-build -W --keep-going` against the full docs tree.

Are there any user-facing changes?
Yes, docs-only:
- `ai-coding-assistants.html`, reachable from the User Guide sidebar.
- `llms.txt` served at the site root (datafusion.apache.org/python/llms.txt).
- … `array_agg` patterns).
- `examples/array-operations.py`.

No public Python API is added, changed, or removed.
🤖 Generated with Claude Code