6 changes: 6 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -33,6 +33,12 @@ Skills follow the [Agent Skills](https://agentskills.io) open standard. Each ski
- `SKILL.md` — The skill definition with YAML frontmatter (name, description, argument-hint) and detailed instructions.
- Additional supporting files as needed.

Currently available skills:

- [`check-upstream`](.ai/skills/check-upstream/SKILL.md) — audit upstream
Apache DataFusion features (functions, DataFrame ops, SessionContext
methods, FFI types) not yet exposed in the Python bindings.

## Pull Requests

Every pull request must follow the template in
3 changes: 2 additions & 1 deletion dev/release/rat_exclude_files.txt
@@ -49,4 +49,5 @@ benchmarks/tpch/create_tables.sql
**/.cargo/config.toml
uv.lock
examples/tpch/answers_sf1/*.tbl
SKILL.md
SKILL.md
docs/source/llms.txt
82 changes: 82 additions & 0 deletions docs/source/ai-coding-assistants.rst
@@ -0,0 +1,82 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

Using AI Coding Assistants
==========================

If you write DataFusion Python code with an AI coding assistant, this
project ships machine-readable guidance so the assistant produces
idiomatic code rather than guessing from its training data.

What is published
-----------------

- `SKILL.md <https://github.com/apache/datafusion-python/blob/main/SKILL.md>`_ —
a dense, skill-oriented reference covering imports, data loading,
DataFrame operations, expression building, SQL-to-DataFrame mappings,
idiomatic patterns, and common pitfalls. Follows the
`Agent Skills <https://agentskills.io>`_ open standard.
- `llms.txt <llms.txt>`_ — an entry point for LLM-based tools following the
`llmstxt.org <https://llmstxt.org>`_ convention. Categorized links to the
skill, user guide, API reference, and examples.

Both files live at stable URLs so an agent can discover them without
cloning the repository.

Installing the skill
--------------------

**Preferred:** run

.. code-block:: shell

npx skills add apache/datafusion-python

This installs the skill in any supported agent on your machine (Claude
Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, and others).
The command writes the pointer into the agent's configuration so that any
project you open that uses DataFusion Python picks up the skill
automatically.

**Manual:** if you are not using the ``skills`` registry, paste this
single line into your project's ``AGENTS.md`` or ``CLAUDE.md``::

For DataFusion Python code, see https://github.com/apache/datafusion-python/blob/main/SKILL.md

Most assistants resolve that pointer the first time they see a
DataFusion-related prompt in the project.

What the skill covers
---------------------

Writing DataFusion Python code has a handful of conventions that are easy
for a model to miss — bitwise ``&`` / ``|`` / ``~`` instead of Python
``and`` / ``or`` / ``not``, the lazy-DataFrame immutability model, how
window functions replace SQL correlated subqueries, the ``case`` /
``when`` builder syntax, and the ``in_list`` / ``array_position`` options
for membership tests. The skill enumerates each of these with short,
copyable examples.

It is *not* a replacement for this user guide. Think of it as a distilled
reference the assistant keeps open while it writes code for you.

If you are an agent author
--------------------------

The skill file and ``llms.txt`` are the two supported integration
points. Both are versioned along with the release and follow open
standards — no project-specific handshake is required.
4 changes: 4 additions & 0 deletions docs/source/conf.py
@@ -129,6 +129,10 @@ def setup(sphinx) -> None:
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

# Copy agent-facing files (llms.txt) verbatim to the site root so they
# resolve at conventional URLs like `https://.../python/llms.txt`.
html_extra_path = ["llms.txt"]

html_logo = "_static/images/2x_bgwhite_original.png"

html_css_files = ["theme_overrides.css"]
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -78,6 +78,7 @@ Example
user-guide/configuration
user-guide/sql
user-guide/upgrade-guides
ai-coding-assistants


.. _toc.contributor_guide:
36 changes: 36 additions & 0 deletions docs/source/llms.txt
@@ -0,0 +1,36 @@
# DataFusion in Python

> Apache DataFusion Python is a Python binding for Apache DataFusion, an in-process, Arrow-native query engine. It exposes a SQL interface and a lazy DataFrame API over PyArrow and any Arrow C Data Interface source. This file points agents and LLM-based tools at the most useful entry points for writing DataFusion Python code.

## Agent Guide

- [SKILL.md (agent skill, raw)](https://raw.githubusercontent.com/apache/datafusion-python/main/SKILL.md): idiomatic DataFrame API patterns, SQL-to-DataFrame mappings, common pitfalls, and the full `functions` catalog. Primary source of truth for writing datafusion-python code.
- [Using DataFusion with AI coding assistants](https://datafusion.apache.org/python/ai-coding-assistants.html): human-readable guide for installing the skill and manual setup pointers.

## User Guide

- [Introduction](https://datafusion.apache.org/python/user-guide/introduction.html): install, the Pokemon quick start, Jupyter tips.
- [Basics](https://datafusion.apache.org/python/user-guide/basics.html): `SessionContext`, `DataFrame`, and `Expr` at a glance.
- [Data sources](https://datafusion.apache.org/python/user-guide/data-sources.html): Parquet, CSV, JSON, Arrow, Pandas, Polars, and Python objects.
- [DataFrame operations](https://datafusion.apache.org/python/user-guide/dataframe/index.html): the lazy query-building interface.
- [Common operations](https://datafusion.apache.org/python/user-guide/common-operations/index.html): select, filter, join, aggregate, window, expressions, and functions.
- [SQL](https://datafusion.apache.org/python/user-guide/sql.html): running SQL against registered tables.
- [Configuration](https://datafusion.apache.org/python/user-guide/configuration.html): session and runtime options.

## DataFrame API reference

- [`datafusion.dataframe.DataFrame`](https://datafusion.apache.org/python/autoapi/datafusion/dataframe/index.html): the lazy DataFrame builder (`select`, `filter`, `aggregate`, `join`, `sort`, `limit`, set operations).
- [`datafusion.expr`](https://datafusion.apache.org/python/autoapi/datafusion/expr/index.html): expression tree nodes (`Expr`, `Window`, `WindowFrame`, `GroupingSet`).
- [`datafusion.functions`](https://datafusion.apache.org/python/autoapi/datafusion/functions/index.html): 290+ scalar, aggregate, and window functions.
- [`datafusion.context.SessionContext`](https://datafusion.apache.org/python/autoapi/datafusion/context/index.html): session entry point, data loading, SQL execution.

## Examples

- [TPC-H queries (GitHub)](https://github.com/apache/datafusion-python/tree/main/examples/tpch): canonical translations of TPC-H Q01–Q22 to idiomatic DataFrame code, each with reference SQL embedded in the module docstring.
- [Other examples (GitHub)](https://github.com/apache/datafusion-python/tree/main/examples): UDF/UDAF/UDWF, Substrait, Pandas/Polars interop, S3 reads.

## Optional

- [Contributor guide](https://datafusion.apache.org/python/contributor-guide/introduction.html): building from source, extending the Python bindings.
- [Upgrade guides](https://datafusion.apache.org/python/user-guide/upgrade-guides.html): migration notes between releases.
- [Upstream Rust `DataFusion`](https://datafusion.apache.org/): the underlying query engine.
53 changes: 53 additions & 0 deletions docs/source/user-guide/common-operations/aggregations.rst
@@ -163,6 +163,59 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


Building per-group arrays
^^^^^^^^^^^^^^^^^^^^^^^^^

:py:func:`~datafusion.functions.array_agg` collects the values within each
group into a list. Combined with ``distinct=True`` and the ``filter``
argument, it lets you ask two questions of the same group in one pass —
"what are all the values?" and "what are the values that satisfy some
condition?".

Suppose each row records a line item with the supplier that fulfilled it and
a flag for whether that supplier met the commit date. We want to identify
orders where exactly one supplier failed, among two or more suppliers in
total:

.. ipython:: python

orders_df = ctx.from_pydict(
{
"order_id": [1, 1, 1, 2, 2, 3],
"supplier_id": [100, 101, 102, 200, 201, 300],
"failed": [False, True, False, False, False, True],
},
)

grouped = orders_df.aggregate(
[col("order_id")],
[
f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
f.array_agg(
col("supplier_id"),
filter=col("failed"),
distinct=True,
).alias("failed_suppliers"),
],
)

grouped.filter(
(f.array_length(col("failed_suppliers")) == lit(1))
& (f.array_length(col("all_suppliers")) > lit(1))
).select(
col("order_id"),
f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"),
)

Two aspects of the pattern are worth calling out:

- ``filter=`` on an aggregate narrows the rows contributing to *that*
aggregate only. Filtering the DataFrame before the aggregate would have
dropped whole groups that no longer had any rows.
- :py:func:`~datafusion.functions.array_length` tests group size without
another aggregate pass, and :py:func:`~datafusion.functions.array_element`
extracts a single value when you have proven the array has length one.

Grouping Sets
-------------

92 changes: 92 additions & 0 deletions docs/source/user-guide/common-operations/expressions.rst
@@ -146,6 +146,98 @@ This function returns a new array with the elements repeated.
In this example, the `repeated_array` column will contain `[[1, 2, 3], [1, 2, 3]]`.


Testing membership in a list
----------------------------

A common need is filtering rows where a column equals *any* of a small set of
values. DataFusion offers three forms; they differ in readability and in how
they scale:

1. A compound boolean using ``|`` across explicit equalities.
2. :py:func:`~datafusion.functions.in_list`, which accepts a list of
expressions and tests equality against all of them in one call.
3. A trick with :py:func:`~datafusion.functions.array_position` and
:py:func:`~datafusion.functions.make_array`, which returns the 1-based
index of the value in a constructed array, or null if it is not present.

.. ipython:: python

from datafusion import SessionContext, col, lit
from datafusion import functions as f

ctx = SessionContext()
df = ctx.from_pydict({"shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"]})

# Option 1: compound boolean. Fine for two values; awkward past three.
df.filter((col("shipmode") == lit("MAIL")) | (col("shipmode") == lit("SHIP")))

# Option 2: in_list. Preferred for readability as the set grows.
df.filter(f.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")]))

# Option 3: array_position / make_array. Useful when you already have the
# set as an array column and want "is in that array" semantics.
df.filter(
~f.array_position(
f.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")
).is_null()
)

Use ``in_list`` as the default. It is explicit, readable, and matches the
semantics users expect from SQL's ``IN (...)``. Reach for the
``array_position`` form only when the membership set is itself an array
column rather than a literal list.

Conditional expressions
-----------------------

DataFusion provides :py:func:`~datafusion.functions.case` for the SQL
``CASE`` expression in both its switched and searched forms, along with
:py:func:`~datafusion.functions.when` as a standalone builder for the
searched form.

**Switched CASE** (one expression compared against several literal values):

.. ipython:: python

df = ctx.from_pydict(
{"priority": ["1-URGENT", "2-HIGH", "3-MEDIUM", "5-LOW"]},
)

df.select(
col("priority"),
f.case(col("priority"))
.when(lit("1-URGENT"), lit(1))
.when(lit("2-HIGH"), lit(1))
.otherwise(lit(0))
.alias("is_high_priority"),
)

**Searched CASE** (an independent boolean predicate per branch). Use this
form whenever a branch tests more than simple equality — for example,
checking whether a joined column is ``NULL`` to gate a computed value:

.. ipython:: python

df = ctx.from_pydict(
{"volume": [10.0, 20.0, 30.0], "supplier_id": [1, None, 2]},
)

df.select(
col("volume"),
col("supplier_id"),
f.when(col("supplier_id").is_not_null(), col("volume"))
.otherwise(lit(0.0))
.alias("attributed_volume"),
)

This searched-CASE pattern is idiomatic for "attribute the measure to the
matching side of a left join, otherwise contribute zero" — a shape that
appears in TPC-H Q08 and similar market-share calculations.

If a switched CASE has only two or three branches that all test equality,
a single :py:func:`~datafusion.functions.when` over
:py:func:`~datafusion.functions.in_list`, closed with ``otherwise``, is
often simpler than the full ``case`` builder.

Structs
-------
