
Make it easier for agents to generate datafusion-python code #1394

@timsaucer

Description


Problem

More and more users reach for LLMs to generate DataFusion Python code. Today, agents are excellent at writing SQL but struggle to produce idiomatic DataFrame API code — they either transliterate SQL literally or invent patterns that don't match the library's grain. Nothing the project currently ships reliably surfaces guidance to the agent at the moment it is writing code.

Goals

  1. Establish a single, authoritative guide for writing idiomatic DataFusion Python code.
  2. Make that guide discoverable through every channel agents actually use — not just the channels we wish they used.
  3. Validate the guide against a reference corpus (TPC-H) so it stays honest as the API evolves.
  4. Extend the same pattern across the wider DataFusion family (Ballista, Comet, Ray, etc.) via an upstream llms.txt hub.

Where idiomatic code is defined

Single source of truth: SKILL.md at the repo root.

This one file — kept at the repo root with YAML frontmatter for skill-ecosystem discovery, and included verbatim on the docs site — is the canonical guide for agents. It contains:

  • Core abstractions (SessionContext / DataFrame / Expr / functions) and import conventions.
  • A quick-start example that works end-to-end.
  • SQL-to-DataFrame reference table (for users who think in SQL first).
  • Migration sections for users coming from Spark, Pandas, and Polars — same shape as the SQL table, column-mapping each API's idioms to DataFusion's.
  • Common pitfalls caught in real agent sessions: &/|/~ vs Python and/or/not, lit() wrapping, decimal/float literal interactions, F.substring vs F.substr arity, join-key disambiguation, date-vs-timestamp arithmetic rules, etc.
  • Idiomatic patterns: fluent chaining, window functions in place of correlated subqueries, semi/anti joins in place of EXISTS/NOT EXISTS, aggregate().filter() for HAVING, variable assignment for CTEs.
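To illustrate the first pitfall in that list without depending on the real library, here is a minimal stand-in `Expr` class (not the datafusion API) showing why expression libraries must require `&`/`|`/`~` in place of Python's `and`/`or`/`not` — exactly the trap SKILL.md warns agents about:

```python
# Minimal sketch (NOT the real datafusion API): a stub Expr showing why
# `a and b` cannot work on unevaluated expressions while `a & b` can.
class Expr:
    """Stand-in for an unevaluated column expression."""

    def __init__(self, desc):
        self.desc = desc

    # `a & b` can be overloaded to build a combined expression tree...
    def __and__(self, other):
        return Expr(f"({self.desc} AND {other.desc})")

    def __invert__(self):
        return Expr(f"(NOT {self.desc})")

    # ...but `a and b` cannot: Python first coerces `a` to bool, and an
    # unevaluated expression has no single truth value, so we must raise.
    def __bool__(self):
        raise TypeError("Expr has no truth value; use & | ~ instead of and/or/not")


a, b = Expr("x > 1"), Expr("y < 2")
print((a & ~b).desc)  # builds an expression tree, never evaluates it
try:
    a and b  # Python calls bool(a) before `and` ever sees b
except TypeError as exc:
    print("caught:", exc)
```

This is why the pitfall is structural, not a library quirk: Python gives classes no hook to overload `and`/`or`/`not`, only the bitwise operators.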

The TPC-H example suite (examples/tpch/) is the reference corpus: every query is written as idiomatic DataFrame code, validated by answer-file comparison, and where the optimized logical plan differs from the SQL version, the difference is documented in a comment. This gives the SKILL.md guidance a continuously verified ground truth.
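The answer-file comparison could take roughly the following shape. This is a hedged sketch, not the suite's actual harness: `rows_match` is a hypothetical helper, and in the real suite `actual` would come from the DataFrame version of a TPC-H query and `expected` from its answer file.

```python
# Hypothetical sketch of answer-file validation: compare a query's rows
# against expected TPC-H answers, with a relative tolerance for floats.
import math


def rows_match(actual, expected, rel_tol=1e-6):
    """True if every cell matches; floats compared via math.isclose."""
    if len(actual) != len(expected):
        return False
    for a_row, e_row in zip(actual, expected):
        for a, e in zip(a_row, e_row):
            if isinstance(e, float):
                if not math.isclose(a, e, rel_tol=rel_tol):
                    return False
            elif a != e:
                return False
    return True


# Toy stand-ins for a query result and its answer file.
actual = [("BUILDING", 3, 1234.50), ("MACHINERY", 1, 99.99)]
expected = [("BUILDING", 3, 1234.5000001), ("MACHINERY", 1, 99.99)]
print(rows_match(actual, expected))  # True: within tolerance
```

A tolerance-based comparison matters here because decimal/float literal interactions (one of the SKILL.md pitfalls) can shift low-order digits between the SQL and DataFrame plans.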

For humans, the primary reference is the online user guide at https://datafusion.apache.org/python. SKILL.md is written in a dense, skill-oriented format for agent consumption.
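The issue doesn't pin down the frontmatter schema, but per the skill-ecosystem convention the file would open with YAML frontmatter followed by the dense guide body. A sketch only — the field names and wording below are assumptions, not the landed file:

```markdown
---
name: datafusion-python
description: Write idiomatic DataFusion Python DataFrame API code
---

# DataFusion Python

Core abstractions: SessionContext, DataFrame, Expr, functions (imported as F).
...
```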

How agents discover it

Discovery is layered. Each layer catches agents the prior ones missed, so no single channel is load-bearing.

| Layer | Mechanism | Target audience |
|---|---|---|
| 1 | `npx skills add apache/datafusion-python` — reads SKILL.md at repo root via the skill-ecosystem convention | Agents with skill-registry support (Claude Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, Aider, opencode, +18 others) |
| 2 | Community aggregators auto-scrape repos with a SKILL.md (skillsmp, awesome-claude-skills, claudemarketplaces) | Users browsing skill indexes |
| 3 | https://datafusion.apache.org/python/llms.txt published on the docs site (llmstxt.org convention) | Agents that auto-fetch /llms.txt from documentation sites |
| 4 | Docs site page that `{include}`s SKILL.md | Humans and WebSearch-capable agents browsing the docs |
| 5 | Enriched `datafusion.__doc__` with a pointer to the online SKILL.md URL | Agents that introspect the installed package (`help(datafusion)`, IDE hovers, PyPI rendering) |
| 6 | README section explaining the install paths (`npx skills add` preferred; manual pointer fallback) | Users arriving from PyPI/README before any agent is wired up |
| 7 | https://datafusion.apache.org/llms.txt upstream hub (separate PR to apache/datafusion) pointing at each subproject's llms.txt | Agents that land anywhere in the DataFusion ecosystem |
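For layer 3, the llmstxt.org convention specifies a markdown file with an H1, an optional blockquote summary, and link lists. A hypothetical sketch (the skill-page URL and descriptions are assumptions; only the llms.txt URL itself appears in this issue):

```markdown
# DataFusion Python

> Python bindings for Apache DataFusion, an extensible in-memory query engine.

## Docs

- [SKILL.md](https://datafusion.apache.org/python/...): idiomatic DataFrame
  API patterns for code-generating agents
- [User guide](https://datafusion.apache.org/python): human-oriented reference
```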

Previously the plan relied on shipping the guide inside the wheel so agents that introspect the installed package would find it. In practice no shipping agent walks site-packages/*/AGENTS.md, so layer 5 is narrowed to what the module docstring can carry (via help() / introspection / IDE tooling), and the file itself is distributed at the repo level via layers 1–2.
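What layer 5 can still carry is small but real: anything placed in `datafusion.__doc__` is rendered by `help()` and most IDE hovers. A sketch of the mechanism using a stand-in module (the real change would edit the package's `__init__.py`, and the pointer text is illustrative):

```python
# Sketch of layer 5: an enriched package docstring that introspection
# tooling (help(), IDE hovers) will surface to agents. The module here is a
# stand-in for the real datafusion package.
import types

SKILL_POINTER = (
    "AI agents: see https://datafusion.apache.org/python/llms.txt "
    "for idiomatic DataFrame API guidance (SKILL.md)."
)

mod = types.ModuleType("fake_datafusion")
mod.__doc__ = "DataFusion Python bindings.\n\n" + SKILL_POINTER

# help(mod) renders __doc__, so the pointer reaches any agent that
# introspects the installed package.
print("llms.txt" in mod.__doc__)  # True
```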

Task list

  • PR 1a — SKILL.md at repo root + package docstring entry point (Add SKILL.md and enrich package docstring #1497)
  • PR 1b — Module docstrings + doctest examples
  • PR 1c — README "Using DataFusion with AI coding assistants" section
  • PR 2 — TPC-H reference SQL + plan comparison diagnostic
  • PR 3 — Rewrite TPC-H non-idiomatic queries
  • PR 4 — Docs site ({include} + llms.txt) + AI skills + CLAUDE.md
  • PR 5 — Upstream sync process documentation
  • PR 6 — apache/datafusion llms.txt hub (separate repo)
  • PR 7 — .claude-plugin/plugin.json for Claude Code plugin marketplace (optional)
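For PR 1b, doctest-backed docstrings keep examples executable so they stay honest as the API evolves. A hypothetical shape only — the function below is a stand-in, not a datafusion API; real doctests would live on module functions and run in CI (e.g. via `pytest --doctest-modules`):

```python
# Hypothetical sketch of a doctest-backed docstring and how it is verified.
import doctest


def describe_join(how: str) -> str:
    """Map a SQL pattern to its DataFrame-API replacement.

    >>> describe_join("semi")
    'keep left rows with a match (replaces EXISTS)'
    >>> describe_join("anti")
    'keep left rows without a match (replaces NOT EXISTS)'
    """
    table = {
        "semi": "keep left rows with a match (replaces EXISTS)",
        "anti": "keep left rows without a match (replaces NOT EXISTS)",
    }
    return table[how]


# Run the docstring's examples directly, as a doctest runner would.
runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(describe_join):
    runner.run(test)
print(runner.failures)  # 0 when the examples still match the code
```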

Detailed plan to follow as a comment.

Changes from the original plan

  • PR 1a landed as SKILL.md at the repo root (not python/datafusion/AGENTS.md shipped inside the wheel). Empirical testing showed no mainstream agent walks site-packages for AGENTS.md, so the in-wheel distribution channel was aspirational.
  • PR 1c changed from a datafusion-init console script to a README section. With the skill ecosystem handling project-root pointer writing automatically, a console script's remaining audience (Python-only, no-Node, agent-agnostic users) is narrow enough that a README edit covers it with less surface area.
  • PR 7 added — optional Claude Code plugin marketplace entry for /plugin install datafusion-python UX.

Metadata

Labels: enhancement (New feature or request)
