feat: segmented fetch with backoff retry in HttpRangeReader #59
Split 256 MB chunks into 64 MB segments. Each segment fetched independently with exponential backoff (1s/2s/4s, 4 attempts). Mid-chunk failure refetches one 64 MB segment, not 256 MB. Segment cache (LRU, 3 entries = 192 MB) makes backward seeks within the cache window free. curl gets --connect-timeout, --max-time, and --retry flags for connection-level resilience. No API change — HttpRangeReader::new() and ::with_chunk_size() work exactly as before. Internal segmentation is transparent.
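The offset-to-segment mapping the summary describes can be sketched as follows. This is a minimal sketch, not the PR's actual internals: `segment_index`, `segment_range`, and the constant name are illustrative.

```rust
// Illustrative constant/names; the PR's internal fields may differ.
const SEGMENT_SIZE: u64 = 64 * 1024 * 1024;

/// Index of the 64 MB segment containing `offset`.
fn segment_index(offset: u64) -> u64 {
    offset / SEGMENT_SIZE
}

/// Byte range [start, end) of segment `idx`, clipped to `total_size`.
fn segment_range(idx: u64, total_size: u64) -> (u64, u64) {
    let start = idx * SEGMENT_SIZE;
    let end = ((idx + 1) * SEGMENT_SIZE).min(total_size);
    (start, end)
}
```

Under this mapping, a failure at byte 200 MB falls in segment 3, so only bytes 192–256 MB are refetched rather than the whole 256 MB chunk.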
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 360c5c0c38
```rust
/// Create with custom chunk size.
pub fn with_chunk_size(url: String, total_size: u64, chunk_size: usize) -> Self {
    // Segment size = chunk_size / 4, minimum 16 MB
    let seg = (chunk_size / 4).max(16 * 1024 * 1024);
```
Preserve caller chunk size in custom reader constructor
with_chunk_size no longer honors the caller-provided chunk size as the maximum fetch unit: it derives segment_size as chunk_size / 4 with a hard 16 MB minimum, and read() always fetches by segment_size. Callers that intentionally pass small chunk sizes (e.g. 1–8 MB to cap memory or request size) now get much larger network reads than requested, changing runtime memory and network behavior relative to the prior API contract.
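One way to preserve the caller's contract, as the comment suggests, is to clamp the derived segment size so it never exceeds the requested chunk size. A sketch only: `derive_segment_size` and the constant name are hypothetical, not the PR's code.

```rust
// Hypothetical fix sketch: quarter the chunk, apply the 16 MB floor,
// but never exceed the caller's chunk_size as the per-fetch unit.
const MIN_SEGMENT: usize = 16 * 1024 * 1024;

fn derive_segment_size(chunk_size: usize) -> usize {
    (chunk_size / 4).max(MIN_SEGMENT).min(chunk_size)
}
```

With this, a caller passing 4 MB keeps 4 MB fetches instead of being bumped to 16 MB, while a 256 MB chunk still splits into 64 MB segments.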
Problem
curl retry restarts the entire 256 MB chunk from scratch. On flaky connections or HuggingFace rate limits during 801 GB Maverick streaming, a single failure at byte 200M of a 256 MB chunk wastes 200 MB of already-downloaded data.
Solution
Split chunks into 64 MB segments (4 per chunk). Each segment is fetched independently with:

- exponential backoff (1s/2s/4s, 4 attempts)
- `--connect-timeout 30` + `--max-time 300` per segment

Segment cache (LRU, 3 entries = 192 MB): backward seeks within the cache window are free.
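The per-segment fetch loop might look roughly like this, shelling out to curl with the flags listed above. A sketch with simplified error handling; `fetch_segment`, `range_header`, and `backoff_secs` are illustrative names, not the PR's actual functions.

```rust
use std::process::Command;
use std::{thread, time::Duration};

/// Inclusive HTTP Range header value for bytes [start, end).
fn range_header(start: u64, end: u64) -> String {
    format!("{}-{}", start, end - 1)
}

/// Seconds to sleep after a failed attempt: 1s, 2s, 4s.
fn backoff_secs(attempt: u32) -> u64 {
    1 << attempt
}

/// Fetch one segment, retrying up to 4 times with exponential backoff.
fn fetch_segment(url: &str, start: u64, end: u64) -> Result<Vec<u8>, String> {
    let range = range_header(start, end);
    for attempt in 0..4 {
        let out = Command::new("curl")
            .args(["--silent", "--fail", "--location"])
            .args(["--connect-timeout", "30", "--max-time", "300"])
            .arg("--range")
            .arg(&range)
            .arg(url)
            .output()
            .map_err(|e| e.to_string())?;
        if out.status.success() {
            return Ok(out.stdout);
        }
        if attempt < 3 {
            thread::sleep(Duration::from_secs(backoff_secs(attempt)));
        }
    }
    Err(format!("range {range} failed after 4 attempts"))
}
```

A failed attempt sleeps 1s, 2s, then 4s before retrying, so only the failing 64 MB segment is refetched, never the surrounding chunk.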
API
No change.
`HttpRangeReader::new()` and `::with_chunk_size()` work exactly as before. Internal segmentation is transparent to callers.

Impact on Maverick run