Skip to content

perf(engine): content-addressed extraction cache for video frames#446

Open
jrusso1020 wants to merge 3 commits intoperf/v2/hdr-segment-scopefrom
perf/v2/extraction-cache
Open

perf(engine): content-addressed extraction cache for video frames#446
jrusso1020 wants to merge 3 commits intoperf/v2/hdr-segment-scopefrom
perf/v2/extraction-cache

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 commented Apr 23, 2026

What

Adds a content-addressed cache for extracted video frames, keyed on the tuple (path, mtime, size, mediaStart, duration, fps, format). Repeat renders of the same composition (studio edit → re-render, preview → final) skip the ffmpeg extraction entirely.

Why

Video frame extraction is the dominant non-capture phase for video-heavy compositions. Studio iteration workflows extract the same frames over and over — each render burns ffmpeg time that adds no value.

Validated on /tmp/hf-fixtures/cfr-sdr-cache:

Cold (miss): extractMs=69,  videoExtractMs=70,  totalElapsedMs=2052
Warm (hit):  extractMs=1,   videoExtractMs=2,   totalElapsedMs=1964
cacheHits: 0→1, cacheMisses: 1→0

The fixture is tiny (3s CFR SDR @ 30fps), so the wall-clock delta is small; the extraction-time delta (69→1ms, 98%) scales linearly with source length. For heavy-iteration workflows (a user rendering the same composition while tuning encoding params), extraction time goes to zero on every repeat render.

Depends on #444 (instrumentation surface) and #445 (segment-scope HDR preflight — otherwise cache keys would be unstable across renders on mixed-HDR compositions).

How

  • New packages/engine/src/services/extractionCache.ts:
    • SHA-256 key over a stable JSON encoding of (path, mtime_ms, size, mediaStart, duration, fps, format). Infinity duration is normalized to -1 so unresolved natural-duration sources still produce stable keys.
    • Truncates to 16 hex chars in the entry directory name — 64 bits of entropy is plenty at cache scale and keeps ls output short.
    • hfcache-v2- schema prefix — bumping it invalidates old entries (callers own gc policy; the cache owns keys).
    • .hf-complete dotfile sentinel. An entry dir without the sentinel is treated as a miss (covers crash-mid-extract and abandoned writes); the next render re-extracts over the partial frames with -y.
    • FRAME_FILENAME_PREFIX = "frame_" shared with the extractor — future refactors only need to touch one place to rename frames.
  • EngineConfig.extractCacheDir (env: HYPERFRAMES_EXTRACT_CACHE_DIR) gates the feature. Undefined disables caching — extraction runs into the render's workDir and cleanup removes it on render end, preserving the prior behaviour exactly. No default root is chosen by the engine; the caller (CLI, app, studio) owns the location policy.
  • ExtractedFrames.ownedByLookup flag prevents FrameLookupTable.cleanup from rm'ing a shared cache dir at render end. Set to true on both hits and misses (misses own the directory they wrote into, but hand it over to the cache rather than deleting it).
  • Phase 3 extractor flow:
    1. Snapshot (videoPath, mediaStart, start, end) per resolved video BEFORE Phase 2a/2b preflight mutates them — so cache keys are stable across renders that use workDir-local normalized files (those files have fresh mtimes every render).
    2. Compute key, lookupCacheEntry.
    3. On hit: rebuild ExtractedFrames from the cache dir plus the Phase 2-probed VideoMetadata — no re-ffprobe.
    4. On miss: ensureCacheEntryDir, extract with extractVideoFramesRange(..., outputDirOverride), then markCacheEntryComplete (the sentinel write is the last step so a crash leaves the dir un-sentineled).
  • extractVideoFramesRange gains an outputDirOverride parameter so cache-miss writes land directly in the keyed dir (no join(outputDir, videoId) wrapping).

Test plan

  • 19 unit tests in extractionCache.test.ts covering key determinism, mtime/size invalidation, format/fps/mediaStart/duration invalidation, Infinity normalization, sentinel semantics, missing-file tolerance
  • 2 integration tests in videoFrameExtractor.test.ts:
    • "reuses extracted frames on a warm cache hit" — asserts cacheHits=1, extractMs<50ms on second call against a CFR SDR fixture
    • "invalidates the cache when fps changes" — different fps on second call forces a new miss
  • End-to-end validation with HYPERFRAMES_EXTRACT_CACHE_DIR set, two runs of the same fixture
  • Lint + format (oxlint + oxfmt)
  • Typecheck (engine + producer)

Copy link
Copy Markdown
Collaborator Author

jrusso1020 commented Apr 23, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

jrusso1020 added a commit that referenced this pull request Apr 23, 2026
Review feedback on PR #446:

- Move the readdir/filter/sort rebuild of ExtractedFrames on cache hit
  into a new `rehydrateCacheEntry` helper inside the cache module — the
  extractor no longer imports FRAME_FILENAME_PREFIX or assembles frame
  paths by hand, so cache layout stays owned by the cache.
- Extract `resolveSegmentDuration` helper to replace two copies of the
  "requested duration → fallback to source natural" logic in Phase 3.
- Collapse the nested Phase 3 cache logic into a `tryCachedExtract`
  helper that returns `ExtractedFrames | null`. Reads top-down now.
- Move source-file `statSync` out of `canonicalKeyBlob` into a `readKeyStat`
  helper the extractor calls once per video in Phase 1. Eliminates the
  redundant statSync on every `computeCacheKey` / `lookupCacheEntry`.
  `CacheKeyInput` now requires `mtimeMs` / `size` — cleaner contract.
- Drop the redundant `existsSync` in `ensureCacheEntryDir`: `mkdirSync`
  with `recursive:true` is already idempotent.
- Tighten `CacheKeyInput.format` from `string` to `CacheFrameFormat`
  (`"jpg" | "png"`). Catches typos at compile time.
- Add an explicit concurrency caveat on `markCacheEntryComplete` — the
  lookup→populate→mark sequence is non-atomic; document the tradeoff
  in code rather than only in the commit history.
- Stop re-exporting cache internals (SCHEMA_PREFIX, COMPLETE_SENTINEL,
  FRAME_FILENAME_PREFIX, cacheEntryDirName, computeCacheKey, ensureCacheEntryDir,
  lookupCacheEntry, markCacheEntryComplete, CacheEntry, CacheKeyInput,
  CacheLookup) from the engine's public index.ts — they're only used by
  the colocated extractor and its tests, which import relatively.

Behavior unchanged — same cold/warm validation on /tmp/hf-fixtures/cfr-sdr-cache:
  cold: extractMs=71, cacheMisses=1
  warm: extractMs=0,  cacheHits=1   (was 1 → 0 post-refactor)

All 32 cache+extractor tests pass.
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: perf(engine): content-addressed extraction cache

Solid implementation. The opt-in gating (extractCacheDir / HYPERFRAMES_EXTRACT_CACHE_DIR, disabled by default) ensures zero behavioral change when unset. Pre-preflight key snapshot, sentinel-based atomicity, ownedByLookup cleanup separation, schema versioning (hfcache-v2-) — all well-designed. The simplify pass cleaned things up meaningfully (rehydrateCacheEntry, resolveSegmentDuration, flattened Phase 3).

Test quality is good — 19 unit tests cover key determinism, invalidation on each tuple component, Infinity normalization, sentinel semantics, and missing-file tolerance. 2 integration tests validate cold/warm paths with real FFmpeg.

P1 — Important (non-blocking, track as follow-up)

Hash truncation to 64 bits — silent corruption on collision. The 16 hex-char truncation gives ~2^32 birthday bound. Safe at local scale, but a collision silently serves wrong frames with no error. Consider writing the full 64-char hash into the .hf-complete sentinel and verifying on lookup — turns silent corruption into a cache miss. One-line improvement.

No eviction mechanism. Cache grows unbounded. Each segment stores potentially hundreds of extracted frames. Should be tracked as follow-up — a hyperframes cache clean --max-age CLI command or purgeStaleEntries(rootDir, maxAgeMs) utility. The schema prefix makes version-based cleanup straightforward.

P2 — Should fix

Concurrent renders can race on partial frame reads. If render A is writing frame_00050.jpg while render B reads the directory in rehydrateCacheEntry, B could see a truncated file. The sentinel prevents serving an incomplete set of frames, but not a partially-written individual frame. Unlikely in single-user usage, but worth documenting that concurrent use of the same cache dir is unsupported, or using atomic rename(tmp, final) per frame.

readKeyStat returns {mtimeMs: 0, size: 0} for missing files. Two missing files would share the same (0, 0) stat tuple, and the downstream extractor would fail with "file not found" anyway — but the zero-stat key pollutes the cache with dead entries. Returning null and skipping the cache path would prevent orphaned entries.

P3 — Suggestions

  • Integration test cleanup uses inline rmSync rather than afterEach — leaked on assertion failure. The unit tests in extractionCache.test.ts correctly use afterEach.
  • canonicalKeyBlob uses abbreviated property names (p, m, s, ms, d, f, fmt). Since the blob is only hashed (never stored), there's no benefit to compression. Full names would aid debugging. Only change if a v3 bump is planned.
  • rehydrateCacheEntry doesn't validate the %05d pattern in filenames. A stray file matching prefix/suffix would shift indices. Minor since the directory is cache-controlled.
  • mtime reliability on network filesystems — worth a note in the extractCacheDir doc that NFS/SMB mounts may have coarser granularity.

Verdict: Approve with suggestions

The implementation is well-designed and the dependency chain (#444#445#446) is clean. Main follow-ups: hash collision mitigation (one-line sentinel check) and eviction strategy (CLI command or utility). None of the issues are blocking for merge.

Keys extracted frame bundles on (path, mtime, size, mediaStart, duration,
fps, format) so iteration workflows (studio edit → re-render) skip the
ffmpeg call when none of those inputs have changed.

- New extractionCache service: SHA-256 key, `.hf-complete` sentinel,
  `hfcache-v2-` schema prefix, FRAME_FILENAME_PREFIX shared with the
  extractor. Schema bump invalidates older entries (callers gc them).
- `EngineConfig.extractCacheDir` (env: HYPERFRAMES_EXTRACT_CACHE_DIR)
  gates the feature — undefined disables caching, preserving the prior
  per-workDir behaviour exactly.
- Phase 3 per-video flow: compute key on the pre-preflight video
  (so keys are stable across renders that use workDir-local normalized
  files), look up; on hit rebuild ExtractedFrames from the cache dir
  plus Phase 2 metadata — no re-ffprobe. On miss, extract into the
  keyed dir then mark complete.
- `ExtractedFrames.ownedByLookup` prevents `FrameLookupTable.cleanup`
  from deleting shared cache entries when a render finishes.
- Extractor output-dir override lets cache-miss writes land directly in
  the keyed dir instead of the transient workDir.

Validation: /tmp/hf-fixtures/cfr-sdr-cache
  Cold  run: extractMs=69,  videoExtractMs=70,  totalElapsedMs=2052
  Warm  run: extractMs=1,   videoExtractMs=2,   totalElapsedMs=1964
  cacheHits 0→1, cacheMisses 1→0.

Adds 19 unit tests for key/sentinel/format semantics and 2 integration
tests (cache hit, fps change invalidates key) in videoFrameExtractor.test.ts.
Review feedback on PR #446:

- Move the readdir/filter/sort rebuild of ExtractedFrames on cache hit
  into a new `rehydrateCacheEntry` helper inside the cache module — the
  extractor no longer imports FRAME_FILENAME_PREFIX or assembles frame
  paths by hand, so cache layout stays owned by the cache.
- Extract `resolveSegmentDuration` helper to replace two copies of the
  "requested duration → fallback to source natural" logic in Phase 3.
- Collapse the nested Phase 3 cache logic into a `tryCachedExtract`
  helper that returns `ExtractedFrames | null`. Reads top-down now.
- Move source-file `statSync` out of `canonicalKeyBlob` into a `readKeyStat`
  helper the extractor calls once per video in Phase 1. Eliminates the
  redundant statSync on every `computeCacheKey` / `lookupCacheEntry`.
  `CacheKeyInput` now requires `mtimeMs` / `size` — cleaner contract.
- Drop the redundant `existsSync` in `ensureCacheEntryDir`: `mkdirSync`
  with `recursive:true` is already idempotent.
- Tighten `CacheKeyInput.format` from `string` to `CacheFrameFormat`
  (`"jpg" | "png"`). Catches typos at compile time.
- Add an explicit concurrency caveat on `markCacheEntryComplete` — the
  lookup→populate→mark sequence is non-atomic; document the tradeoff
  in code rather than only in the commit history.
- Stop re-exporting cache internals (SCHEMA_PREFIX, COMPLETE_SENTINEL,
  FRAME_FILENAME_PREFIX, cacheEntryDirName, computeCacheKey, ensureCacheEntryDir,
  lookupCacheEntry, markCacheEntryComplete, CacheEntry, CacheKeyInput,
  CacheLookup) from the engine's public index.ts — they're only used by
  the colocated extractor and its tests, which import relatively.

Behavior unchanged — same cold/warm validation on /tmp/hf-fixtures/cfr-sdr-cache:
  cold: extractMs=71, cacheMisses=1
  warm: extractMs=0,  cacheHits=1   (was 1 → 0 post-refactor)

All 32 cache+extractor tests pass.
…che caveats

Address Miguel's review on #446:

- P2: readKeyStat now returns null when the source file is missing instead
  of the {mtimeMs: 0, size: 0} zero-tuple sentinel. Two unrelated missing
  paths would otherwise share the same cache key and pollute the cache
  with orphaned entries. The cacheKeyInputs builder propagates the null
  forward; tryCachedExtract already skips the cache path for null entries
  so the extractor surfaces the real file-not-found error downstream.
- P2: document the single-writer caveat on EngineConfig.extractCacheDir.
  Concurrent renders targeting the same cache dir can race on partial
  frame reads — the .hf-complete sentinel prevents serving an incomplete
  entry but individual frames are written non-atomically. Also flagged
  the coarser-mtime-on-network-filesystems hazard for future ops.
- Updated the existing "tolerates a missing source" test to assert the
  new contract (readKeyStat returns null) rather than the old "zero-stat
  sentinel" behaviour.

All 480 engine tests pass.
@jrusso1020 jrusso1020 force-pushed the perf/v2/extraction-cache branch from 8a31b2c to 3c36324 Compare April 23, 2026 22:10
@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Rebased onto updated #445 and pushed 3c36324. Addressed both P2 items from <https://github.com/heygen-com/hyperframes/pull/446#pullrequestreview-4164956477|your review>:

  1. readKeyStat zero-tuple pollution. Return type is now {mtimeMs: number; size: number} | null. The cacheKeyInputs builder in videoFrameExtractor.ts propagates the null (entries with missing files become null in the array), and tryCachedExtract already had the if (!keyInput) return null; guard — so missing files now cleanly skip the cache path and let the extractor surface the real file-not-found error downstream. No shared (0, 0) tuple, no orphaned entries. The existing "tolerates a missing source" test now asserts the new contract (readKeyStat returns null) rather than exercising the old zero-stat sentinel behaviour.
  2. Concurrent-renders caveat. Documented on EngineConfig.extractCacheDir — single-writer is the intended model, the .hf-complete sentinel prevents serving an incomplete entry but individual frames are written non-atomically so a concurrent reader could observe a truncated frame. Also flagged the coarser-mtime-on-NFS/SMB hazard for ops. Didn't go the per-frame rename(tmp, final) route since the fix surface was bigger than the doc surface and single-writer is already the expected model for the CLI / app / studio callers you enumerated.

On the P1 follow-ups (non-blocking per your review):

  • Hash-collision mitigation via writing the full 64-char hash into .hf-complete and verifying on lookup — totally fair, queued as a follow-up. The current 16-hex-char truncation has the birthday bound at local scale so I'm not losing sleep, but "silent corruption on collision → cache miss" is the right shape.
  • Eviction mechanism — also queued. Thinking hyperframes cache clean --max-age <duration> or a purgeStaleEntries(rootDir, maxAgeMs) utility that walks the schema-prefixed dirs. The hfcache-v2- prefix makes version-based cleanup a bonus trivial case.

On the P3s: test cleanup style, canonicalKeyBlob property-name expansion, frame-pattern validation, all noted. Fold them into the follow-up PR alongside the hash-collision fix so the core semantic work here isn't further churned.

480/480 engine tests green after the refactor.

jrusso1020 added a commit that referenced this pull request Apr 23, 2026
The CLI already ships `render_complete` events to PostHog but only carries
a subset of `RenderPerfSummary` (duration_ms, capture_avg_ms, etc.). With
#444 adding per-phase breakdown and #446 adding cache counters, the data
is sitting in `job.perfSummary.videoExtractBreakdown` and
`job.perfSummary.tmpPeakBytes` but never reaching PostHog.

This forwards the new fields as flat properties (flat is easier to query
with PostHog insights than nested objects):

- tmp_peak_bytes
- stage_compile_ms, stage_video_extract_ms, stage_audio_process_ms,
  stage_capture_ms, stage_encode_ms, stage_assemble_ms
- extract_resolve_ms, extract_hdr_probe_ms, extract_hdr_preflight_ms,
  extract_hdr_preflight_count, extract_vfr_probe_ms, extract_vfr_preflight_ms,
  extract_vfr_preflight_count, extract_phase3_ms, extract_cache_hits,
  extract_cache_misses

`extract_phase3_ms` disambiguates from `stage_video_extract_ms`: the former
is just the parallel ffmpeg extract inside Phase 3, the latter is the full
stage (resolve + probe + preflight + extract).

All fields optional — the Docker subprocess call site (no perfSummary
available) still compiles and ships events without them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants