Background
PR #1432 introduces a StorageAdapter ABC + entry-point plugin system for third-party storage protocols. The four built-in protocols (file, s3, gcs, azure) remain hardcoded in storage.py, producing a deliberate two-tier pattern: extension authors go through the ABC; built-ins keep their if self.protocol == "..." branches.
This issue tracks the follow-up work to collapse that two-tier pattern by routing all protocols through StorageAdapter, so built-ins and third-party adapters become first-class citizens of the same code path.
The user-facing API (dj.config.stores) does not change.
Scope
storage.py on master has ~22 protocol-conditional branches: 12 in setup methods (already addressable by the ABC drafted in #1432), and 10 in the I/O methods that are not yet addressable through the current four-method interface. This issue is about the I/O 10.
| Method |
Line |
Branch nature |
Design question |
put_file |
470 |
safe_copy (file) vs fsspec.put_file (clouds) |
Atomicity contract |
put_buffer |
518 |
safe_write (file) vs fsspec.pipe_file (clouds) |
Atomicity contract |
get_file |
497 |
pathlib (file) vs fsspec.get_file (clouds) |
— |
get_buffer |
549 |
open() (file) vs fsspec.cat_file (clouds) |
— |
remove |
604 |
Path.unlink (file) vs fsspec.rm (clouds) |
— |
open |
651 |
pre-write mkdir (file only) |
— |
put_folder |
697 |
shutil.copytree (file) vs fsspec.put recursive (clouds) |
Recursive-op semantics |
remove_folder |
737 |
shutil.rmtree (file) vs fsspec.rm recursive (clouds) |
Recursive-op semantics |
copy_from_url |
796 |
mkdir + copy (file) vs fsspec.pipe (clouds) |
— |
copy_folder_from_url |
847 |
mkdir (file) vs no-op (clouds) |
— |
Gating design questions
These must be resolved before any I/O branch is moved into an adapter. They are not blockers for #1432 itself.
-
Atomic-write semantics. safe_write and safe_copy (utils.py:131,153) write to a .saving/.copying temp file and os.rename to the final path — a documented atomicity guarantee. fsspec.LocalFileSystem.put_file/pipe_file write directly. A LocalFileAdapter must preserve this contract; either by reimplementing the temp-file-then-rename pattern in the adapter, or by deciding (with explicit user-facing notice) that DataJoint no longer guarantees atomicity for the local backend.
-
Recursive-op semantics. shutil.copytree and shutil.rmtree have specific behaviors around symlinks, partial failures, and metadata that fsspec.put recursive / fsspec.rm do not match 1:1. Document the divergences and decide whether to preserve shutil semantics for LocalFileAdapter or accept the fsspec semantics.
-
Interface shape. Open question whether each I/O operation gets its own adapter method, or whether the adapter exposes a smaller surface (e.g. a local_atomic_writer hook that only file-protocol adapters override, with everything else delegating to the fsspec defaults). The 10 branches above are not all equally weighty — open (pre-write mkdir) and copy_folder_from_url (mkdir vs no-op) are nearly trivial; put_file/put_buffer/put_folder are the load-bearing ones.
Proposed phasing
Each phase is a separate, reviewable PR.
- Phase 0 — test scaffolding. Add per-protocol unit tests for the four built-ins covering
_create_filesystem, _full_path, get_url, and each I/O method. This is the prerequisite that makes the rest safe — currently only test_get_store_spec_file_protocol exercises StorageBackend per-protocol logic; everything else relies on integration tests, which don't typically stress atomicity or recursive-op edge cases. Block all subsequent phases on Phase 0.
- Phase 1 — setup-layer migration. Route all four built-ins through
StorageAdapter.create_filesystem, validate_spec, full_path, get_url. Eliminates ~12 setup branches. No I/O changes; design questions above are not yet engaged.
- Phase 2 — design decision on atomicity + recursive ops. Write up the design proposal (a short doc or a discussion in this issue) addressing the gating questions. Get sign-off before coding.
- Phase 3 — I/O migration. One protocol at a time (
file → s3 → gcs → azure), each in its own PR with the Phase-0 test suite as the regression backstop.
Non-goals
- Changing
dj.config.stores user-facing API.
- Deprecating or removing fsspec; the adapters wrap it.
- Adding new built-in protocols beyond the existing four.
Acceptance criteria
storage.py contains zero if self.protocol == "..." branches.
- All four built-in adapters (
FileAdapter, S3Adapter, GCSAdapter, AzureAdapter) are registered at import time and discoverable via get_storage_adapter(protocol).
- Existing integration test suite passes unchanged.
- New per-protocol unit tests exist (delivered in Phase 0) and cover atomicity + recursive-op semantics for the local backend.
- Atomicity contract on the local backend is either preserved or, if changed, documented in release notes.
Tracking
This is the architectural follow-up to #1432. No target release is committed until the Phase 2 design discussion concludes — sequencing is against other work, not a fixed date.
Background
PR #1432 introduces a
StorageAdapterABC + entry-point plugin system for third-party storage protocols. The four built-in protocols (file,s3,gcs,azure) remain hardcoded instorage.py, producing a deliberate two-tier pattern: extension authors go through the ABC; built-ins keep theirif self.protocol == "..."branches.This issue tracks the follow-up work to collapse that two-tier pattern by routing all protocols through
StorageAdapter, so built-ins and third-party adapters become first-class citizens of the same code path.The user-facing API (
dj.config.stores) does not change.Scope
storage.pyon master has ~22 protocol-conditional branches: 12 in setup methods (already addressable by the ABC drafted in #1432), and 10 in the I/O methods that are not yet addressable through the current four-method interface. This issue is about the I/O 10.put_filesafe_copy(file) vsfsspec.put_file(clouds)put_buffersafe_write(file) vsfsspec.pipe_file(clouds)get_filefsspec.get_file(clouds)get_bufferopen()(file) vsfsspec.cat_file(clouds)removePath.unlink(file) vsfsspec.rm(clouds)openmkdir(file only)put_foldershutil.copytree(file) vsfsspec.putrecursive (clouds)remove_foldershutil.rmtree(file) vsfsspec.rmrecursive (clouds)copy_from_urlmkdir+ copy (file) vsfsspec.pipe(clouds)copy_folder_from_urlmkdir(file) vs no-op (clouds)Gating design questions
These must be resolved before any I/O branch is moved into an adapter. They are not blockers for #1432 itself.
Atomic-write semantics.
safe_writeandsafe_copy(utils.py:131,153) write to a.saving/.copyingtemp file andos.renameto the final path — a documented atomicity guarantee.fsspec.LocalFileSystem.put_file/pipe_filewrite directly. ALocalFileAdaptermust preserve this contract; either by reimplementing the temp-file-then-rename pattern in the adapter, or by deciding (with explicit user-facing notice) that DataJoint no longer guarantees atomicity for the local backend.Recursive-op semantics.
shutil.copytreeandshutil.rmtreehave specific behaviors around symlinks, partial failures, and metadata thatfsspec.putrecursive /fsspec.rmdo not match 1:1. Document the divergences and decide whether to preserveshutilsemantics forLocalFileAdapteror accept the fsspec semantics.Interface shape. Open question whether each I/O operation gets its own adapter method, or whether the adapter exposes a smaller surface (e.g. a
local_atomic_writerhook that only file-protocol adapters override, with everything else delegating to the fsspec defaults). The 10 branches above are not all equally weighty —open(pre-write mkdir) andcopy_folder_from_url(mkdir vs no-op) are nearly trivial;put_file/put_buffer/put_folderare the load-bearing ones.Proposed phasing
Each phase is a separate, reviewable PR.
_create_filesystem,_full_path,get_url, and each I/O method. This is the prerequisite that makes the rest safe — currently onlytest_get_store_spec_file_protocolexercises StorageBackend per-protocol logic; everything else relies on integration tests, which don't typically stress atomicity or recursive-op edge cases. Block all subsequent phases on Phase 0.StorageAdapter.create_filesystem,validate_spec,full_path,get_url. Eliminates ~12 setup branches. No I/O changes; design questions above are not yet engaged.file→s3→gcs→azure), each in its own PR with the Phase-0 test suite as the regression backstop.Non-goals
dj.config.storesuser-facing API.Acceptance criteria
storage.pycontains zeroif self.protocol == "..."branches.FileAdapter,S3Adapter,GCSAdapter,AzureAdapter) are registered at import time and discoverable viaget_storage_adapter(protocol).Tracking
This is the architectural follow-up to #1432. No target release is committed until the Phase 2 design discussion concludes — sequencing is against other work, not a fixed date.