docs: add llms.txt, agent skills, and AI assistants guide #1505
Open
timsaucer wants to merge 13 commits into apache:main from …
Conversation
Adds a new `skill` page that embeds the repo-root `SKILL.md` through the
myst `{include}` directive, so the agent-facing guide lives on the
published docs site without duplication. The page is wired into the
User Guide toctree. Implements PR 4a of the plan in apache#1394.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `docs/source/llms.txt` in llmstxt.org schema: a short description plus categorized links to the agent skill, user guide pages, DataFrame API reference, and example queries. `html_extra_path` in `conf.py` copies it verbatim to the published site root so it resolves at `https://datafusion.apache.org/python/llms.txt`. Implements PR 4b of the plan in apache#1394. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
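The `html_extra_path` wiring described above can be very small. A minimal sketch of the relevant `conf.py` fragment (the real file contains many more settings; only the list value here is assumed):

```python
# docs/source/conf.py (sketch -- the real file has many more settings).
# Files listed in html_extra_path are copied verbatim to the HTML output
# root at build time, so docs/source/llms.txt ends up being served at
# <site-root>/llms.txt with no further plumbing.
html_extra_path = ["llms.txt"]
```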
Adds `.ai/skills/write-dataframe-code/SKILL.md`, a contributor-facing skill for agents working on this repo. It layers on top of the user-facing repo-root SKILL.md with:
- a TPC-H pattern index mapping idiomatic API usages to the query file that demonstrates them,
- an ad-hoc plan-comparison workflow for checking DataFrame translations against a reference SQL query via `optimized_logical_plan()`, and
- the project-specific docstring and aggregate/window documentation conventions that CLAUDE.md already enforces for contributors.

Implements PR 4c of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `.ai/skills/audit-skill-md/SKILL.md`, a contributor skill that cross-references the repo-root `SKILL.md` against the current public Python API (functions module, DataFrame, Expr, SessionContext, and package-root re-exports). Reports two classes of drift:
- new APIs exposed by the Python surface that are not yet covered in the user-facing guide, and
- stale mentions in the guide that no longer exist in the public API.

The skill is diff-only -- it produces a report the user reviews before any edit to `SKILL.md`. Complements `check-upstream/`, which audits in the opposite direction (upstream Rust features not yet exposed).

Implements PR 4d of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the illustrative patterns that apache#1504 removed from the TPC-H examples into the common-operations docs, where they serve as pattern-focused teaching material without cluttering the TPC-H translations:
- expressions.rst gains a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position` + `make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`.
- udf-and-udfa.rst gains a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case).
- aggregations.rst gains a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length`/`array_element` for the single-value-per-group pattern (the Q21 case).
- Adds `examples/array-operations.py`, a runnable end-to-end walkthrough of the membership and array_agg patterns.

Implements PR 4e of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… AGENTS.md

- List the three contributor skills (`check-upstream`, `write-dataframe-code`, `audit-skill-md`) under the Skills section so agents know what tools they have before starting work.
- Document the plan-comparison diagnostic workflow (comparing `ctx.sql(...).optimized_logical_plan()` against a DataFrame's `optimized_logical_plan()` via `LogicalPlan.__eq__`) for translating SQL queries to DataFrame form. Points at the full write-up in the `write-dataframe-code` skill rather than duplicating it.

`CLAUDE.md` is a symlink to `AGENTS.md`, so the change lands in both.

Implements PR 4f of the plan in apache#1394.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g state

The "Building per-group arrays" block added in the previous commit reassigned `df` and `ctx` mid-page, which then broke the Grouping Sets examples further down that share the Pokemon `df` binding (`col_type_1` etc. no longer resolved). Rename the demo DataFrame to `orders_df` and drop the redundant `ctx = SessionContext()` so the shared state from the top of the page stays intact.

Verified with `sphinx-build -W --keep-going` against the full docs tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… page
The previous approach embedded the repo-root `SKILL.md` on the docs
site via a myst `{include}`. That file is written for agents -- dense,
skill-formatted, and not suited to a human browsing the User Guide. It
also relied on a fragile `:start-line:` offset to strip YAML
frontmatter.
Replace it with `docs/source/ai-coding-assistants.md`, a short
human-readable page that mirrors the README section added in apache#1503:
what the skill is, how to install it via `npx skills` or a manual
pointer, and what kinds of things it covers. `SKILL.md` stays at the
repo root as the single source of truth; agents fetch the raw GitHub
URL directly.
`llms.txt` is updated to point its Agent Guide entry at
`raw.githubusercontent.com/.../SKILL.md` and to include the new
human-readable page as a secondary link. The User Guide toctree now
references `ai-coding-assistants` in place of the removed `skill`
stub.
Verified with `sphinx-build -W --keep-going` against the full docs
tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The introduction and the "Installing the skill" section both enumerated the same set of supported assistants. Drop the intro copy; the list that matters is next to `npx skills add`, where it answers "what does this command actually configure?"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… title

Every other page in `docs/source/user-guide` and the top-level `docs/source` is written in reStructuredText; the lone `.md` page was an inconsistency. Rewrite in rst so the ASF header matches the rest of the tree, cross-references can use `:py:func:` roles if we ever add any, and myst is no longer required just to render this one page.

Also shorten the page title from "Using DataFusion with AI Coding Assistants" to "Using AI Coding Assistants" -- it already sits under the DataFusion user guide so the product name is redundant.

Verified with `sphinx-build -W --keep-going`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skill as written pushed for every public method to be mentioned in `SKILL.md`, which is the wrong goal. `SKILL.md` is a distilled agent guide of idiomatic patterns and pitfalls, not an API reference -- autoapi-generated docs and module docstrings already provide full per-method coverage. An audit pressing for 100% method coverage would bloat the skill file into a stale copy of that reference.

The two checks with actual value (stale mentions in `SKILL.md`, and drift between `functions.__all__` and the categorized function list) are small enough to be ad-hoc greps at release time and do not warrant a dedicated skill.

Also remove references to the skill from `AGENTS.md` and the `write-dataframe-code` skill's "Related" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A separate PR covers the same contributor-facing material (TPC-H pattern index, plan-comparison workflow, docstring conventions), so this skill is redundant. Remove the skill directory and the corresponding references in `AGENTS.md`, including the plan-comparison section that pointed at it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of the section asserted that a UDF predicate blocks optimizer rewrites but did not show evidence. Replace the two `code-block` examples with an executable walkthrough that writes a small Parquet file, runs the same filter two ways, and prints the physical plan for each. The native-expression plan renders with three annotations on the `DataSourceExec` node that the UDF plan does not have:
- `predicate=brand@1 = A AND qty@2 >= 150` pushed into the scan
- `pruning_predicate=... brand_min@0 <= A AND ... qty_max@4 >= 150` for row-group pruning via Parquet footer min/max stats
- `required_guarantees=[brand in (A)]` for bloom-filter / dictionary skipping

The UDF form keeps only `predicate=brand_qty_filter(...)`: the scan has to materialize every row group and call the Python callback. The disjunctive-OR rewrite (previously the main example) stays at the end as the idiomatic alternative for multi-bucket filters.

Verified with `sphinx-build -W --keep-going`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Part of #1394 (implements PR 4 of the plan).
Rationale for this change
#1394 tracks making datafusion-python legible to AI coding assistants without breaking the experience for humans browsing the docs. Earlier PRs shipped the repo-root `SKILL.md` (#1497), enriched module docstrings and doctests (#1498), added a README section pointing agents at the skill (#1503), and rewrote the TPC-H examples in idiomatic DataFrame form (#1504). This PR fills in the docs-site layer: a machine-readable entry point for LLM tooling, a short human-written page explaining how to wire up an AI assistant, and two contributor-facing skills that agents working on this repo can pick up.

It also relocates the pattern demos that #1504 removed from the TPC-H queries (CASE filtering, array-based membership, UDF-vs-expression predicates, `array_agg` with filter) into the common-operations docs, so those teaching examples still live somewhere concrete.

What changes are included in this PR?
- `docs/source/llms.txt`: an llmstxt.org entry point, copied verbatim to the site root via `html_extra_path`. Categorized links to the skill, user guide, DataFrame API reference, and TPC-H examples.
- `docs/source/ai-coding-assistants.rst`: a short human-written page mirroring the README section added in #1503 (docs: add README section for AI coding assistants). Explains what the skill is, how to install it (`npx skills add apache/datafusion-python` or a manual `AGENTS.md`/`CLAUDE.md` pointer), and what it covers. Wired into the User Guide toctree.
- `.ai/skills/write-dataframe-code/SKILL.md`: a contributor skill layered on top of the repo-root `SKILL.md`. Adds a TPC-H pattern index (which query demonstrates which API), the plan-comparison diagnostic workflow for translating SQL to DataFrame form, and the project-specific docstring conventions.
- `.ai/skills/audit-skill-md/SKILL.md`: a contributor skill that cross-references `SKILL.md` against the current public Python surface (functions module, `DataFrame`, `Expr`, `SessionContext`, package-root re-exports) and reports new APIs needing coverage and stale mentions. Diff-only; does not auto-edit.
- `AGENTS.md` (symlinked as `CLAUDE.md`): lists the three contributor skills and documents the plan-comparison diagnostic workflow.
- `docs/source/user-guide/common-operations/expressions.rst`: adds a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position`/`make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`.
- `docs/source/user-guide/common-operations/udf-and-udfa.rst`: adds a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case).
- `docs/source/user-guide/common-operations/aggregations.rst`: adds a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length` and `array_element` for the single-value-per-group pattern (the Q21 case).
- `examples/array-operations.py`: a runnable end-to-end walkthrough of the membership and `array_agg` patterns. Linked from `examples/README.md`.

Verified with `pre-commit run --all-files` and `sphinx-build -W --keep-going` against the full docs tree.

Are there any user-facing changes?
Yes, docs-only:
- `ai-coding-assistants.html`, reachable from the User Guide sidebar.
- `llms.txt` served at the site root (datafusion.apache.org/python/llms.txt).
- … `array_agg` patterns).
- `examples/array-operations.py`.

No public Python API is added, changed, or removed.
🤖 Generated with Claude Code