A Query.Farm VGI worker for DuckDB.
vgi-explain · a Query.Farm VGI worker · powered by SHAP
Explain machine-learning predictions in pure SQL. vgi-explain exposes
SHAP to DuckDB as ordinary SQL functions. Point
it at a model that vgi-sklearn
or vgi-xgboost trained, and get
per-row, per-feature contributions — why each prediction came out the way it did —
without leaving your query.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'sklearn' (TYPE vgi, LOCATION 'vgi-sklearn');
ATTACH 'explain' (TYPE vgi, LOCATION 'vgi-explain'); -- 'uv run explain_worker.py' from a checkout
-- train a model with vgi-sklearn (or vgi-xgboost), keep the model BLOB it returns
CREATE TABLE model AS
SELECT model FROM sklearn.fit((SELECT * FROM sklearn.iris()),
estimator := 'random_forest_classifier', target := 'target', id := 'sample_id');
-- a table function takes only one subquery (the feature matrix), so the model
-- travels as a scalar — read it from a session VARIABLE
SET VARIABLE m = (SELECT model FROM model);
-- per-row, per-feature SHAP contributions (long format)
SELECT * FROM explain.shap_values(
(SELECT sample_id AS id, sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm
FROM sklearn.iris()),
model := getvariable('m'), id := 'id')
ORDER BY id, feature
LIMIT 10;vgi-explain does not train models. It consumes the self-contained model
BLOB that vgi-sklearn / vgi-xgboost produce — a single value holding the
fitted estimator plus its metadata (estimator type, ordered feature names,
classes). The BLOB is unpacked with skops safe
loading, restricted to the scikit-learn / numpy / scipy / xgboost namespaces, so
a malformed or hostile value yields a clear error rather than executing code.
Every function follows the same SQL-friendly contract as vgi-sklearn:
- Your input table is the feature matrix, passed as the single subquery
argument. Features are matched to the model by name, so column order does
not matter (a shuffled
SELECTstill works) and extra columns are ignored. - The model is a scalar
modelargument —model := getvariable('m')— not a subquery (DuckDB allows a table function only one subquery argument). idnames a column to exclude from the features and carry through to the output, so you canJOINresults back to the source.
The explainer is chosen automatically by model type:
| Model family | Explainer |
|---|---|
| trees: RandomForest, GradientBoosting, DecisionTree, XGB, LGBM | TreeExplainer |
| linear: LinearRegression, Ridge, Lasso, LogisticRegression | LinearExplainer |
| anything else | KernelExplainer |
Long format: one row per (input row × feature [× class]).
| column | type | meaning |
|---|---|---|
| id | (passthrough) | your id column (or a 0-based row index if omitted) |
feature |
VARCHAR | feature name |
class |
VARCHAR | class label, or NULL for regression / binary |
shap_value |
DOUBLE | signed contribution toward the model output |
Additivity: for any row, sum(shap_value) + base_value ≈ model output (the
raw margin / score the model produces for that row).
The explainer's expected value — the anchor SHAP values add onto. One row, or
one row per class for a multi-class classifier (class, base_value).
Global importance ranked descending. With a non-empty relation it returns
mean(|SHAP|) across rows (method = 'mean_abs_shap'); with an empty
relation it falls back to the model's native importances
(feature_importances_ or |coef_|, method = 'native'). Columns: feature,
importance, rank, method.
For a classifier with K > 2 classes, SHAP produces a contribution per class.
shap_values emits a class column carrying the class label, so every
(row, feature, class) contribution is addressable, and shap_base_value
returns one row per class. Regression and binary classification collapse to a
single output: class is NULL, one contribution per (row, feature).
uv sync --extra dev
uv run --no-sync pytest -q # unit + in-process + real-client tests
make test-sql # DuckDB sqllogictest E2E (haybarn-unittest)
uv run --no-sync ruff check . && uv run --no-sync mypy vgi_explain/
make fixture # regenerate the committed model-BLOB fixturevgi-explain: MIT.- SHAP: MIT.
- scikit-learn: BSD-3-Clause.
- XGBoost: Apache-2.0.
- skops: MIT.
- numpy: BSD-3-Clause.
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
