Summary
In src/storage/object_storage.rs, the stale-entry cleanup in collect_upload_results currently uses extract_datetime_from_parquet_path_regex to compare the parquet file's data timestamp (extracted from the filename) against a 5-minute threshold, rather than the time the path was added to ACTIVE_OBJECT_STORE_SYNC_FILES.
Problem
- Historical ingestion (with time partition): entries are immediately eligible for cleanup because the data timestamp (from the filename) is old, potentially causing races where a currently-uploading file gets evicted from the tracking set prematurely.
- Near-real-time ingestion: works correctly since data timestamps closely match wall-clock time.
Proposed Improvement
Change the tracking structure from HashSet<PathBuf> to a HashMap<PathBuf, DateTime<Utc>> (or equivalent), recording Utc::now() when each path is inserted. Update the retain logic to compare now - tracked_at >= Duration::minutes(5) instead of parsing the data timestamp from the filename.
This ensures accurate duration-based eviction regardless of whether data is near-real-time or historical.
Context
Summary
In
src/storage/object_storage.rs, the stale-entry cleanup incollect_upload_resultscurrently usesextract_datetime_from_parquet_path_regexto compare the parquet file's data timestamp (extracted from the filename) against a 5-minute threshold, rather than the time the path was added toACTIVE_OBJECT_STORE_SYNC_FILES.Problem
Proposed Improvement
Change the tracking structure from
HashSet<PathBuf>to aHashMap<PathBuf, DateTime<Utc>>(or equivalent), recordingUtc::now()when each path is inserted. Update theretainlogic to comparenow - tracked_at >= Duration::minutes(5)instead of parsing the data timestamp from the filename.This ensures accurate duration-based eviction regardless of whether data is near-real-time or historical.
Context
src/storage/object_storage.rs, around lines 1160–1165