Skip to content

feat(gooddata-pipelines): support composite key references on parent datasets#1608

Merged
benkeanna merged 1 commit into
masterfrom
aben/composite-refs
May 13, 2026
Merged

feat(gooddata-pipelines): support composite key references on parent datasets#1608
benkeanna merged 1 commit into
masterfrom
aben/composite-refs

Conversation

@benkeanna
Copy link
Copy Markdown
Contributor

@benkeanna benkeanna commented May 12, 2026

Summary

https://gooddata.atlassian.net/browse/MCMIC-2430

Adds a new ParentDatasetReference type and a new optional parent_dataset_references: list[ParentDatasetReference] field on CustomDatasetDefinition, allowing callers to express composite-key joins to the parent dataset (e.g. 2-column foreign key) — a shape the underlying gooddata_sdk already supports via list[CatalogDeclarativeReferenceSource] but the wrapper currently restricts to a single column.

Driven by a real need on the MIC BCA tooling: BCA datasets reference parent dim datasets that can have composite primary keys, and the existing wrapper API couldn't express that.

Backward compatibility

  • Legacy fields parent_dataset_reference_attribute_id, dataset_reference_source_column, dataset_reference_source_column_data_type remain accepted but are now optional and marked deprecated=True (Pydantic emits a DeprecationWarning on access).
  • A check_reference_form_exclusive validator rejects mixing the legacy fields with the new parent_dataset_references — chosen over silent precedence to avoid surprising callers later when one path is ignored.
  • The processor uses parent_dataset_references when set, otherwise falls back to the legacy form. Existing single-column callers see no behavior change.

Test plan

  • pytest tests/ — 189 passing (composite-only branch).
  • ruff check src/ tests/ — clean.
  • Validate behavior end-to-end against a real workspace with a composite-key parent dim.

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.99%. Comparing base (efc6b2a) to head (ff2c33d).
⚠️ Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
...ooddata_pipelines/ldm_extension/input_processor.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1608      +/-   ##
==========================================
+ Coverage   78.83%   78.99%   +0.15%     
==========================================
  Files         230      231       +1     
  Lines       15486    15603     +117     
==========================================
+ Hits        12208    12325     +117     
  Misses       3278     3278              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

parent_dataset_reference_attribute_id: str | None = Field(
default=None,
deprecated=(
"Use `parent_dataset_references` for richer (composite-key) joins. "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave out the "for richer ... joins" part in the deprecation messages.

Comment on lines +128 to +129
"Composite-key reference to the parent dataset. When provided and "
"non-empty, supersedes the legacy single-column reference fields."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go with something like "List of references to parent datasets." or something This phrasing assumes that join on multiple keys is the default use case (llm is going off the conversation context) but for the most users, joining on one field will be more than enough.

Comment on lines +160 to +162
Forcing callers to pick one form prevents silent precedence surprises:
without this check, setting both would quietly use the new list and
ignore the legacy values, which is easy to miss when debugging.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd delete this - it sounds more like AI reasoning than a useful docstring (It tries to justify why the code is present, rather than explain what the code is doing)

Comment on lines +166 to +167
one-element list. Missing legacy fields yield an empty list, which
will be rejected downstream by the GoodData API.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, would it not be better to fail fast instead of waiting for a an API call to fail?

Comment on lines +300 to +301
# `parent_dataset_references` list takes precedence when set and
# non-empty; otherwise fall back to the legacy single-column fields.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just repeating the docstring of the function that is called. This only needs to exist in one place.

@benkeanna benkeanna force-pushed the aben/composite-refs branch 3 times, most recently from 9017191 to a1f565a Compare May 13, 2026 06:54
@benkeanna benkeanna marked this pull request as ready for review May 13, 2026 11:42
Comment on lines +43 to +46
| parent_dataset_reference_attribute_id | string \| None | **Deprecated** — single-column reference to the parent attribute. Use `parent_dataset_references` instead. |
| dataset_reference_source_column | string \| None | **Deprecated** — single-column name on the custom dataset. Use `parent_dataset_references` instead. |
| dataset_reference_source_column_data_type | [ColumnDataType](#columndatatype) \| None | **Deprecated** — column data type for the single-column reference. Use `parent_dataset_references` instead. |
| parent_dataset_references | [ParentDatasetReference](#parentdatasetreference)[] \| None | Composite-key reference to the parent dataset (one entry per join column). When set, supersedes the three legacy single-column fields above. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider using just

**Deprecated** — use `parent_dataset_references` instead.


Either `dataset_source_table` or `dataset_source_sql` must be specified with a truthy value, but not both. An exception will be raised if both parameters are falsy or if both have truthy values.

The parent-dataset reference can be expressed via either the three legacy fields (`parent_dataset_reference_attribute_id`, `dataset_reference_source_column`, `dataset_reference_source_column_data_type`) or the new `parent_dataset_references` list — never both. Mixing the two forms raises a `ValidationError`. New code should prefer `parent_dataset_references`; the legacy fields will be removed in a future release.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just lead the user straight to the new field. No need to mention the legacy fields here

| parent_dataset_reference_attribute_id | string \| None | **Deprecated** — single-column reference to the parent attribute. Use `parent_dataset_references` instead. |
| dataset_reference_source_column | string \| None | **Deprecated** — single-column name on the custom dataset. Use `parent_dataset_references` instead. |
| dataset_reference_source_column_data_type | [ColumnDataType](#columndatatype) \| None | **Deprecated** — column data type for the single-column reference. Use `parent_dataset_references` instead. |
| parent_dataset_references | [ParentDatasetReference](#parentdatasetreference)[] \| None | Composite-key reference to the parent dataset (one entry per join column). When set, supersedes the three legacy single-column fields above. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reference to the composite key is "too much information" here.

Comment on lines +177 to +179
assert definition.parent_dataset_reference_attribute_id is not None
assert definition.dataset_reference_source_column is not None
assert definition.dataset_reference_source_column_data_type is not None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be better to explicitly throw an exception when any of these is None? I know it should be unreachable because of the validators, but still.

@benkeanna benkeanna force-pushed the aben/composite-refs branch from a1f565a to 65f1088 Compare May 13, 2026 12:34
janmatzek
janmatzek previously approved these changes May 13, 2026
@janmatzek janmatzek dismissed their stale review May 13, 2026 12:52

didn't notice the comments on docs are still unresolved, sorry

…datasets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@benkeanna benkeanna force-pushed the aben/composite-refs branch from 65f1088 to ff2c33d Compare May 13, 2026 12:58
@benkeanna benkeanna merged commit 0c299a0 into master May 13, 2026
14 checks passed
@benkeanna benkeanna deleted the aben/composite-refs branch May 13, 2026 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants