[CALCITE-4460] Support custom delimiter when parsing CSV tables#4893
[CALCITE-4460] Support custom delimiter when parsing CSV tables#4893Diveyam-Mishra wants to merge 2 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for configuring a custom field separator when parsing CSV tables in Calcite’s file adapter, enabling non-comma delimited files (e.g., |, \t) via model JSON while preserving comma as the default.
Changes:
- Add
"separator"operand parsing and validation inCsvTableFactory, defaulting toCSVParser.DEFAULT_SEPARATORwhen omitted. - Thread the separator through
CsvTable/CsvTranslatableTableintoCsvEnumerator, including row-type deduction and CSV reader creation. - Add tests and fixtures covering pipe delimiter support, invalid multi-character separators, and default behavior compatibility.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| file/src/main/java/org/apache/calcite/adapter/file/CsvTableFactory.java | Parses/validates optional separator operand and passes it into table construction. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvTable.java | Stores separator and uses it for row-type deduction and field type inference. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvTranslatableTable.java | Passes separator into CsvEnumerator during projection execution. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvEnumerator.java | Adds separator-aware constructors and uses it in openCsv and deduceRowType. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvStreamReader.java | Adds a separator-accepting constructor for streaming reads. |
| file/src/test/java/org/apache/calcite/adapter/file/FileAdapterTest.java | Adds coverage for custom separator, invalid separator, and default behavior. |
| file/src/test/resources/custom-separator.json | New model demonstrating `"separator": " |
| file/src/test/resources/sales-csv/PIPE_DELIMITED.csv | New pipe-delimited CSV fixture for tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
18bf557 to
9165ba4
Compare
|
It would be worth documenting this new feature. |
| DEPTNO:int|NAME:string | ||
| 10|"Sales" | ||
| 20|"Marketing" | ||
| 30|"Accounts" |
There was a problem hiding this comment.
Could you add a test record containing an escape character? I'd like to verify whether the custom delimiter and escape character function correctly in combination. Additionally, I had another thought: regarding support for custom escape characters—I'm not sure if Calcite supports this, though naturally, this doesn't necessarily need to be implemented in the current PR. Overall, this PR looks good.
|
I notice that you added duplicate constructors for You should probably change user documentation. Make sure you test escaping/quoting with the custom separator. +1 to merge when the above issues are fixed. |
9165ba4 to
06d9ccd
Compare
|
I have approved the CI, you can take a look at the detail. |
2435a40 to
14096ae
Compare
|
I think this is done from my end now U can check @xiedeyantu |
| @Override public Enumerator<@Nullable Object[]> enumerator() { | ||
| return new CsvEnumerator<>(source, cancelFlag, false, null, | ||
| CsvEnumerator.arrayConverter(fieldTypes, fields, false)); | ||
| CsvEnumerator.arrayConverter(fieldTypes, fields, false), ','); |
There was a problem hiding this comment.
Can we add a default parameter of ',' without modifying so many constructors, and have the constructors without the "char separator" also call the constructors with the "char separator"?
| * [Elasticsearch adapter](elasticsearch_adapter.html) | ||
| (<a href="{{ site.apiRoot }}/org/apache/calcite/adapter/elasticsearch/package-summary.html">calcite-elasticsearch</a>) | ||
| * [File adapter](file_adapter.html) (<a href="{{ site.apiRoot }}/org/apache/calcite/adapter/file/package-summary.html">calcite-file</a>) | ||
| * [File adapter](file_adapter.html) (<a href="{{ site.apiRoot }}/org/apache/calcite/adapter/file/package-summary.html">calcite-file</a>), including support for custom CSV delimiters |
There was a problem hiding this comment.
I think we can remove ", including support for custom CSV delimiters".
14096ae to
331defd
Compare
|



Jira Link
CALCITE-4460
Changes Proposed
This PR adds support for custom field delimiters (separators) in the Calcite file adapter. Previously, the adapter hardcoded comma (
,) as the only separator, preventing use of other common delimiters like pipes (|) or tabs.What changed
CsvTableFactory: Parses an optional"separator"operand from the model JSON. Validates it is exactly one character; rejects multi-character values withIllegalArgumentException.CsvTable: Addedseparatorfield and a new constructor that accepts it. Passes separator toCsvEnumerator.deduceRowType().CsvTranslatableTable: Added separator constructor; passes separator toCsvEnumeratorinproject().CsvEnumerator: Added separator parameter to constructors,openCsv(), anddeduceRowType(). UsesCSVReader(reader, separator).CsvStreamReader: Added package-private constructorCsvStreamReader(Source, char).All existing constructors default separator to
CSVParser.DEFAULT_SEPARATOR(comma), preserving full backward compatibility.Scope
Per guidance from @julianhyde, changes are limited to the file adapter only. The CSV adapter (example code) is kept simple and unchanged.
Example configuration
{ "version": "1.0", "defaultSchema": "TEST", "schemas": [ { "name": "TEST", "tables": [ { "name": "MY_TABLE", "type": "custom", "factory": "org.apache.calcite.adapter.file.CsvTableFactory", "operand": { "file": "data.psv", "separator": "|" } } ] } ] }Tests added
testCsvCustomSeparatorPipe — reads pipe-delimited data, verifies correct parsing
testCsvCustomSeparatorInvalidMultiChar — verifies multi-character separator is rejected
testCsvDefaultSeparatorBackwardCompat — verifies omitting separator defaults to comma