interregna/JArrow

J language addon for Apache Arrow

Read (and eventually write) Apache Arrow and Parquet files to and from J. Uses the Arrow GLib C API.

Installation and Loading

  1. Ensure that you have installed the Arrow GLib (C) Packages for your OS. Instructions can be found at: arrow.apache.org/install.

  2. From your J session:

   install 'github:interregna/JArrow@main'
   load 'data/arrow'

Usage

   install 'github:interregna/JArrow@main'

   load 'data/arrow'
   readParquetTable '~addons/data/arrow/test/test1.parquet'
┌─┬───────────────┐
│a│0 1 2 3 4 5 6 7│
├─┼───────────────┤
│b│8 7 6 5 4 3 2 1│
└─┴───────────────┘
   readsParquetTable '~addons/data/arrow/test/test2.parquet'
┌────────┬──────────┬────────┬─────────┬───────┬────────┬───────┬───────┬────────┬────────┬────────┬──────────┬──────────┬───────────┬────────────┬─────────┬─────────┬───────┬───────────────┐
│Column 1│Column Two│shortCol│ushortCol│intcCol│uintcCol│int_Col│uintCol│int16Col│int32Col│int64Col│float32Col│float64Col│longlongCol│ulonglongCol│DoubleCol│StringCol│boolCol│datetime64Col  │
├────────┼──────────┼────────┼─────────┼───────┼────────┼───────┼───────┼────────┼────────┼────────┼──────────┼──────────┼───────────┼────────────┼─────────┼─────────┼───────┼───────────────┤
│0100000100100100300500100600700100100100    │This     │1946684800000000│
│188.7511188908826344388531.25613.75888888.75    │ is      │0946771200000000│
│277.522277807722738777462.5527.5777777.5    │all      │0946857600000000│
│366.2533366706619133166393.75441.25666666.25    │ valid   │0946944000000000│
│45544455605515527555325355555555    │text     │1947030400000000│
│543.7555543504311821843256.25268.75434343.75    │         │0947116800000000│
│632.56663240328216232187.5182.5323232.5    │data.    │0947203200000000│
│721.257772130214610621118.7596.25212121.25    │         │0947289600000000│
└────────┴──────────┴────────┴─────────┴───────┴────────┴───────┴───────┴────────┴────────┴────────┴──────────┴──────────┴───────────┴────────────┴─────────┴─────────┴───────┴───────────────┘
   readCSVTable '~addons/data/arrow/test/test1.csv'
┌──┬───────────────────────────...
│ID│1 2 3 4 5 8 10 11 12 14 15 ...
├──┼───────────────────────────...
│y │100.669 100.669 100.669 100...
└──┴───────────────────────────...
   NB. Note: this is JSON Lines format, not regular JSON. See: https://jsonlines.org
   readsJsonTable '~addons/data/arrow/test/test1.json'
┌───────┬──────────┐
│name   │date      │
├───────┼──────────┤
│Gilbert│12-13-2014│
│Alexa  │09-04-1983│
│May    │01-01-1924│
│Deloise│04-25-1894│
└───────┴──────────┘
   readsFeatherTable '~addons/data/arrow/test/test1.feather'
┌────┬───┬──────┐
│team│pos│points│
├────┼───┼──────┤
│A   │G  │17    │
│A   │F  │17    │
│B   │G  │15    │
│B   │F  │ 5    │
│C   │G  │11    │
│C   │F  │10    │
│D   │G  │ 5    │
│D   │F  │14    │
└────┴───┴──────┘

(6!:16) and (6!:17) can be used to convert Arrow datetime64 types to and from ISO 8601 format (e.g. 2000-01-11T22:58:04). fromdate32 can be used to convert Arrow date32 types to YYYY M D tuples.
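
The date32 conversion is plain day arithmetic: Arrow date32 values count days since 1970-01-01. The sketch below illustrates that arithmetic only and is not the addon's fromdate32. The name date32ToYMD is hypothetical, and it assumes the J standard library verbs todayno and todate (day numbers counted from 1800-01-01) are available in the session.

```j
NB. Sketch only: date32ToYMD is a hypothetical name, not part of JArrow.
NB. Assumes the stdlib verbs todayno and todate are loaded.
date32ToYMD=: 3 : 0
  unixDay0=. todayno 1970 1 1    NB. stdlib day number of the Unix epoch
  todate unixDay0 + y            NB. shift to stdlib day numbers, expand to Y M D
)

date32ToYMD 0 10957              NB. one Y M D row per input day count
```

Here 10957 is 946684800 seconds divided by 86400, the day on which the datetime64Col values in the test2 output above begin.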

Notes

readsTable minimizes display time in the UI but uses more space

readTable minimizes space but can take more time to display

Development

  1. In Jqt, identify your path for ~Projects: jpath '~Projects'

  2. Git clone the JArrow repo within ~Projects.

  3. Restart Jqt and open the Arrow project: Project > Open > Projects > jarrow

  4. Re-build the addon: Ctrl + F9

  5. Run the addon: F9 (re-builds the addon scripts, reloads, and runs the tests)

Examples: see test/test1.ijs

TODO
  • Error catching for empty pointers, missing files, and general errors.
  • Dereference / cleanup gobjects and allocated memory
  • Additional data types
    • Dictionaries (need to store lookup tables)
    • Lists
    • Maps
  • Tensors
  • Documentation (see: ~/addons/gui/cobrowser/scriptdoc.ijs)
  • CSV reader
  • JSONL reader
  • Arrow Feather (IPC v1) reader
  • IPC files (".arrow" files) — readArrowTable, writeArrowTable
  • IPC streams (".arrows" files) — readArrowsTable, writeArrowsTable
  • Feather v2 writer — writeFeatherTable (alias of writeArrowTable)
  • Parquet writer — writeParquet
  • Flight client
  • Flight server
  • Non-local filesystems (S3)
  • IPC streaming with event-driven calls

Writers (available with Arrow GLib 24.0.0)

   writeArrowTable    tablePtr;'~out.arrow'      NB. IPC file format
   writeArrowsTable   tablePtr;'~out.arrows'     NB. IPC streaming format
   writeFeatherTable  tablePtr;'~out.feather'    NB. Feather v2 (= IPC file)
   writeParquet       tablePtr;'~out.parquet'    NB. Parquet

Unified surface (updated for Arrow GLib 24.0.0)

A small terse vocabulary giving JArrow a consistent handle-based columnar interface for J users.

NB. handle open / close / schema / read / project (gaps 1, 2, 7, 10)
   h=:    ho '~data.feather'        NB. auto-detects format from extension
   hs h                              NB. Arrow schema
   tbl=: hr h                        NB. read full table
   tbl=: 1000 hr h                   NB. read first 1000 rows
   tbl=: ('price';'qty') hp h        NB. project columns
   hc h                              NB. close

NB. stream table fallback over .arrows files (gap 5)
   sh=: sbo '~stream.arrows'
   b=:  sbn sh
   sbc sh

NB. Arrow C ABI bridge — zero-copy hand-off (gap 3)
   'sa aa'=: ax tbl                  NB. pending per-type J→Arrow builders
   tbl=:    ai sa;aa                 NB. import the same pair

NB. compute kernels via Arrow function registry (gap 6)
   afsum   col                       NB. sum kernel
   afmean  col                       NB. mean kernel
   afmin   col                       NB. min kernel
   afmax   col                       NB. max kernel
   afcount col                       NB. count kernel
   'percentile_99' afk col           NB. any registered kernel by name

NB. CSV / JSONL writers (gap 8)
   tbl wcsv   '~out.csv'
   tbl wjsonl '~out.jsonl'

NB. Custom extension types (gap 9)
   extreg ''                          NB. register quant4 / quant8 / embed_f32
   extq4   codes; scale; zero          NB. decode quant4 to floats

The verb prefixes are the vocabulary:

| prefix | category | gap |
|--------|----------|-----|
| h | handle: open / read / close | 1, 2, 7, 10 |
| sb | stream batches | 5 |
| a | Arrow C ABI bridge | 3 |
| md | schema metadata | 4 |
| af | Arrow function (compute) | 6 |
| wcsv | CSV / JSONL write | 8 |
| ext | Custom extension types | 9 |

Some verbs are skeletons that depend on extra GLib bindings. JArrow now attempts those optional binds at load time and leaves unsupported symbols inactive on older Arrow GLib installations.

sbo/sbn currently use a safe single-table fallback over readArrowsTable. The lower-level GLib read_next iterator crashes this J console build, so the public verb avoids that native path.

ax is intentionally guarded with a controlled assertion until the per-type J table to GArrowArray builders are implemented. This avoids the prior crash path through an unfinished read_next batch-reader bridge.

The IPC file (.arrow) and Feather v2 (.feather) formats preserve random access, so choose them when readers need to seek. The streaming format (.arrows) has no footer, which suits pipes and sockets. Parquet is the format to use for exchange with other ecosystems.

Bug fixes in the Arrow GLib 24.0.0 refresh
  • The IPC reader verbs readArrowTable, readArrowsTable, and readFileBufferTable were defined with =. (local) instead of =: (global), so the public-interface transfers block at the bottom of arrow.ijs silently failed to expose them. This is now fixed, and the verbs are reachable as documented.
  • Added IPC writer wrappers; writer close is used as the flush point before deallocation.
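
The =. versus =: distinction behind the first fix can be seen in isolation. This is a minimal illustration and not code from arrow.ijs; defineBoth, localVerb, and globalVerb are hypothetical names. It relies on two standard J facts: =. inside an explicit definition creates a name private to that execution, and 4!:0 reports a name's class. A script run through load is, as far as I understand it, executed from inside an explicit verb, which would be why top-level =. definitions in a script can disappear the same way.

```j
NB. Illustration only: all three names below are hypothetical.
defineBoth=: 3 : 0
  localVerb=. +:      NB. =. : private to this execution, gone on return
  globalVerb=: +:     NB. =: : written to the current locale, persists
  0
)

defineBoth ''
4!:0 'localVerb';'globalVerb'   NB. name classes: _1 = undefined, 3 = verb
```

After the call, only globalVerb can be used or re-exported from the session.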
