Undo by gburd · Pull Request #21 · gburd/postgres

gburd · 2026-03-26T19:14:14Z

No description provided.

- Hourly upstream sync from postgres/postgres (24x daily)
- AI-powered PR reviews using AWS Bedrock Claude Sonnet 4.5
- Multi-platform CI via existing Cirrus CI configuration
- Cost tracking and comprehensive documentation

Features:
- Automatic issue creation on sync conflicts
- PostgreSQL-specific code review prompts (C, SQL, docs, build)
- Cost limits: $15/PR, $200/month
- Inline PR comments with security/performance labels
- Skip draft PRs to save costs

Documentation:
- .github/SETUP_SUMMARY.md - Quick setup overview
- .github/QUICKSTART.md - 15-minute setup guide
- .github/PRE_COMMIT_CHECKLIST.md - Verification checklist
- .github/docs/ - Detailed guides for sync, AI review, Bedrock

See .github/README.md for complete overview

Complete Phase 3: Windows builds + fix sync for CI/CD commits

Phase 3: Windows Dependency Build System
- Implement full build workflow (OpenSSL, zlib, libxml2)
- Smart caching by version hash (80% cost reduction)
- Dependency bundling with manifest generation
- Weekly auto-refresh + manual triggers
- PowerShell download helper script
- Comprehensive usage documentation

Sync Workflow Fix:
- Allow .github/ commits (CI/CD config) on master
- Detect and reject code commits outside .github/
- Merge upstream while preserving .github/ changes
- Create issues only for actual pristine violations

Documentation:
- Complete Windows build usage guide
- Update all status docs to 100% complete
- Phase 3 completion summary

All three CI/CD phases complete (100%):
✅ Hourly upstream sync with .github/ preservation
✅ AI-powered PR reviews via Bedrock Claude 4.5
✅ Windows dependency builds with smart caching

Cost: $40-60/month total

See .github/PHASE3_COMPLETE.md for details

Fix sync to allow 'dev setup' commits on master

The sync workflow was failing because the 'dev setup v19' commit modifies files outside .github/. Updated workflows to recognize commits with messages starting with 'dev setup' as allowed on master.

Changes:
- Detect 'dev setup' commits by message pattern (case-insensitive)
- Allow merge if commits are .github/ OR dev setup OR both
- Update merge messages to reflect preserved changes
- Document pristine master policy with examples

This allows personal development environment commits (IDE configs, debugging tools, shell aliases, Nix configs, etc.) on master without violating the pristine mirror policy. Future dev environment updates should start with 'dev setup' in the commit message to be automatically recognized and preserved.

See .github/docs/pristine-master-policy.md for complete policy
See .github/DEV_SETUP_FIX.md for fix summary

Optimize CI/CD costs by skipping builds for pristine commits

Add cost optimization to Windows dependency builds to avoid expensive builds when only pristine commits are pushed (dev setup commits or .github/ configuration changes).

Changes:
- Add check-changes job to detect pristine-only pushes
- Skip Windows builds when all commits are dev setup or .github/ only
- Add comprehensive cost optimization documentation
- Update README with cost savings (~40% reduction)

Expected savings: ~$3-5/month on Windows builds, ~$40-47/month total through combined optimizations. Manual dispatch and scheduled builds always run regardless.
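The "pristine master" rule described above (a commit is allowed if it touches only .github/, or its message starts with 'dev setup', case-insensitively) can be sketched as a small predicate. This is an illustrative Python model, not the workflow's actual script; the function names and the (message, paths) representation of a commit are assumptions made for the example.

```python
def is_allowed_on_master(message: str, changed_paths: list[str]) -> bool:
    """Return True if a local commit may stay on the mirrored master branch.

    Hypothetical helper: a commit is allowed when its message starts with
    'dev setup' (case-insensitive) or when every changed path is under .github/.
    """
    is_dev_setup = message.lower().startswith("dev setup")
    only_github = bool(changed_paths) and all(
        path.startswith(".github/") for path in changed_paths
    )
    return is_dev_setup or only_github


def can_merge_upstream(commits: list[tuple[str, list[str]]]) -> bool:
    """Allow the upstream merge only if every local commit is allowed."""
    return all(is_allowed_on_master(msg, paths) for msg, paths in commits)
```

The same predicate would also serve the cost optimization: a push whose commits all satisfy it is "pristine-only", so the Windows build could be skipped.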

Nix-based development environment for PostgreSQL hacking. Not for merge; staged here so per-commit build/test runs can share a single toolchain.

Changes from v34:
- Drop the .idea/ and .vscode/ editor directories. Editor state is personal and should not live in the repository.
- Re-introduce glibc-no-fortify-warning.patch and the patchedGlibc overlay in shell.nix. Without this, meson's libcurl thread-safety probe fails under our default -D_FORTIFY_SOURCE=1 -O0 -Werror combination because features.h emits a -Wcpp warning that becomes a fatal error. The patch is scoped to the dev shell only and does not leak into system glibc or release builds.

Sparsemap is a memory-efficient data structure for maintaining sparse sets of integers using hierarchical bitmaps. It supports O(1) set/get operations and efficient iteration over set bits while using far less memory than a dense bitmap for sparse populations.

The implementation provides:
- sparsemap_set/get/is_set for individual bit manipulation
- sparsemap_scan for efficient forward iteration
- sparsemap_select for rank-based selection
- Configurable initial capacity with automatic growth

Used by the UNDO subsystem for tracking allocated pages and by RECNO for free-space management within relation forks.

Includes a TAP regression test module (test_sparsemap) exercising all public API operations.
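To make the idea concrete, here is a minimal Python sketch of a sparse bitmap that stores only the 64-bit chunks containing at least one set bit. It is a two-level simplification written for illustration; the class and method names are hypothetical and do not reflect the C API or the on-disk layout of the sparsemap in this commit.

```python
class SparseBitmap:
    """Two-level sparse bitmap: only chunks holding a set bit are stored."""

    CHUNK_BITS = 64

    def __init__(self):
        self._chunks = {}  # chunk index -> integer used as a 64-bit mask

    def set(self, idx, value=True):
        chunk, bit = divmod(idx, self.CHUNK_BITS)
        mask = self._chunks.get(chunk, 0)
        mask = mask | (1 << bit) if value else mask & ~(1 << bit)
        if mask:
            self._chunks[chunk] = mask
        else:
            self._chunks.pop(chunk, None)  # drop chunks that became empty

    def is_set(self, idx):
        chunk, bit = divmod(idx, self.CHUNK_BITS)
        return bool(self._chunks.get(chunk, 0) >> bit & 1)

    def scan(self):
        """Yield set bit positions in ascending order."""
        for chunk in sorted(self._chunks):
            mask = self._chunks[chunk]
            base = chunk * self.CHUNK_BITS
            while mask:
                low = mask & -mask  # isolate the lowest set bit
                yield base + low.bit_length() - 1
                mask ^= low

    def select(self, n):
        """Return the position of the n-th set bit (0-based), or None."""
        for chunk in sorted(self._chunks):
            mask = self._chunks[chunk]
            count = bin(mask).count("1")
            if n < count:
                for _ in range(n):
                    mask &= mask - 1  # clear the lowest set bit
                return chunk * self.CHUNK_BITS + (mask & -mask).bit_length() - 1
            n -= count
        return None
```

Memory use in this model grows with the number of populated chunks rather than with the largest index, which is the property the commit message claims for sparse populations.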

Header-only implementation of a probabilistic skip list providing O(log n) insert, delete, and lookup operations with O(n) space. Compared to rbtree, skip lists offer simpler implementation, better cache locality for sequential scans, and lock-free read potential.

The implementation provides:
- Type-safe macros for defining typed skip lists (DEFINE_SKIPLIST)
- Configurable maximum height (up to 32 levels)
- Forward iteration via SKIPLIST_FOREACH
- Range queries and nearest-neighbor lookup
- Memory allocation via palloc (TopMemoryContext by default)

Used by the UNDO subsystem for maintaining ordered transaction metadata and by RECNO for HLC-ordered page directories.

Includes a TAP regression test module (test_skiplist) exercising insertion, deletion, iteration, and edge cases.
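For readers unfamiliar with the structure, the following Python sketch shows a textbook probabilistic skip list with insert, delete, membership, ordered iteration, and a "first key at or after" lookup. It is not a translation of the header in this commit: the class names, the coin-flip probability of 0.5, and the seeded random generator are assumptions chosen for the example; only the maximum height of 32 comes from the text above.

```python
import random


class _Node:
    __slots__ = ("key", "forward")

    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height  # one forward pointer per level


class SkipList:
    MAX_HEIGHT = 32

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._head = _Node(None, self.MAX_HEIGHT)
        self._height = 1

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and self._rng.random() < 0.5:
            h += 1
        return h

    def _find_predecessors(self, key):
        """Per level, the last node whose key is strictly less than key."""
        update = [self._head] * self.MAX_HEIGHT
        node = self._head
        for level in range(self._height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        return update

    def insert(self, key):
        update = self._find_predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            return False  # duplicate keys are rejected
        height = self._random_height()
        self._height = max(self._height, height)
        node = _Node(key, height)
        for level in range(height):
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node
        return True

    def delete(self, key):
        update = self._find_predecessors(key)
        node = update[0].forward[0]
        if node is None or node.key != key:
            return False
        for level in range(len(node.forward)):
            update[level].forward[level] = node.forward[level]
        return True

    def __contains__(self, key):
        node = self._find_predecessors(key)[0].forward[0]
        return node is not None and node.key == key

    def __iter__(self):
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]

    def first_at_or_after(self, key):
        node = self._find_predecessors(key)[0].forward[0]
        return None if node is None else node.key
```

Iteration walks only the bottom level, which is why sequential scans are cheap; the upper levels exist solely to shorten searches.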

LRLock is a reader-writer lock that admits wait-free readers and a single serialized writer. Readers access a 32-bit version counter through pg_atomic_read_acquire_u32 / pg_atomic_fetch_add_acqrel_u32 without touching any shared cache line that the writer also touches during steady state. The writer holds a short LWLock to serialize against other writers and bumps the version counter twice per write (odd while writing, even when done) so concurrent readers can detect in-flight updates and retry.

This primitive is added now, ahead of any consumer, because the sLog flat hash partitions in the UNDO subsystem (added in a later commit) depend on it for the read path of MVCC visibility checks. LRLock is also useful in its own right and is exercised by the test_lrlock module added here.

Three new atomic primitives are added to support LRLock:
- pg_atomic_fetch_add_acqrel_u32: acquire-release atomic add, lighter than the existing seq-cst variant on aarch64.
- pg_atomic_seq_cst_fence: standalone seq-cst fence separate from an RMW operation.
- pg_atomic_read_acquire_u32: read with acquire semantics, between pg_atomic_read_u32 (no barrier) and pg_atomic_read_membarrier_u32 (full barrier) in cost.

The test_lrlock module covers single-writer correctness, multi-reader consistency under contention, and writer serialization.

Author: Greg Burd <greg@burd.me>
Reviewed-by: TBD
Discussion: https://postgr.es/m/TBD
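The odd/even version protocol described above can be illustrated with a small single-threaded Python model. It shows only the counter logic (a reader rejects its snapshot if the version was odd or changed during the read); it does not model memory ordering, the LWLock, or real concurrency, and the class and method names are hypothetical.

```python
class VersionedCell:
    """Single-threaded model of an odd/even version counter guarding a value."""

    def __init__(self, payload):
        self.version = 0        # even: stable, odd: write in progress
        self.payload = payload

    def write_begin(self):
        # In the real primitive, writer serialization is done by an LWLock.
        assert self.version % 2 == 0, "writers must be serialized"
        self.version += 1       # first bump: version becomes odd

    def write_end(self, payload):
        self.payload = payload
        self.version += 1       # second bump: version becomes even again

    def try_read(self):
        """Return (True, value) for a consistent read, else (False, None)."""
        before = self.version
        if before % 2 == 1:
            return (False, None)        # a write is in flight
        snapshot = self.payload
        if self.version != before:
            return (False, None)        # a write completed during the read
        return (True, snapshot)


def read_with_retry(cell, max_retries=1000):
    """Retry until a consistent snapshot is observed."""
    for _ in range(max_retries):
        ok, value = cell.try_read()
        if ok:
            return value
    raise RuntimeError("no consistent read within retry budget")
```

In C the two version loads would need acquire semantics and the writer's bumps release semantics, which is presumably what the new atomic primitives are for; Python cannot express that, so treat this purely as a picture of the retry rule.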

…pport

Motivation
----------
PostgreSQL's MVCC model leaves dead tuples behind on rollback and relies on VACUUM to reclaim space. Workloads that mix bulk DML with frequent aborts (ETL pipelines that hit a unique-key violation late, application-driven retry loops, partition swap operations that fail on a constraint check) accumulate bloat that VACUUM cannot keep up with, and ROLLBACK latency for large transactions remains acceptable only because no physical work is done. The cost shows up later, when VACUUM runs and must also visit the index entries pointing at dead tuples. Operators want a path where ROLLBACK returns immediately and leaves no dead-tuple debt behind.

This commit adds the cluster-wide UNDO infrastructure that makes the above possible. It does not change heap behavior by default; heap participation is opt-in per relation, and the new RECNO AM (separate commit) is the first AM that always uses it.

Design summary
--------------
UNDO records are written into the regular WAL stream as XLOG_UNDO_BATCH records owned by a new resource manager, RM_UNDO_ID. Each modifying operation on an UNDO-enabled relation appends an inverse record to a per-transaction chain linked through urec_prev. The header is AM-agnostic; payload interpretation is delegated to a per-RM dispatch table registered via RegisterUndoRmgr(). Heap and nbtree register their own callbacks.

Rollback is physical: ApplyUndoChain() walks the chain newest-first and restores the prior page bytes via memcpy under a critical section. Each application emits a Compensation Log Record (XLOG_UNDO_APPLY_RECORD) with REGBUF_FORCE_IMAGE. The CLR's LSN is written back into urec_clr_ptr, so a crash mid-rollback resumes idempotently: records with a valid urec_clr_ptr are skipped on the next pass.

For large transactions (UNDO footprint above undo_instant_abort_threshold, default 64 KB) the backend records the XID in a shared Aborted Transaction Map (ATM) and returns from ROLLBACK in O(1). The logical_revert_worker drains the ATM in the background, applying chains and emitting CLRs without holding the client. This is the Constant Time Recovery design from Antonopoulos et al. (VLDB 2019), with the persistent version store realized as in-WAL UNDO rather than a separate tablespace.

A secondary log (sLog) captures before-images for readers that need old tuple versions; the sLog is partitioned (slog_num_partitions GUC) for write concurrency. UNDO blocks live in shared_buffers via a virtual RelFileLocator (spcOid=1663, dbOid=9, relNumber=logno), so no separate cache exists to size or warm.

Recovery follows ARIES: redo replays all WAL forward including ALLOCATE/DISCARD/EXTEND and any CLRs already produced before the crash; analysis is implicit (XIDs that wrote UNDO but have no commit/abort record remain in the in-memory table); undo applies the chains for those XIDs and emits new CLRs. Temp and unlogged records are skipped during crash recovery.

Relationship to zheap and MariaDB-style undo
--------------------------------------------
zheap stored UNDO in dedicated segment files under $PGDATA/base/undo and used UNDO both for rollback and for reader visibility against in-place updates. MariaDB and InnoDB use a rollback segment in a system tablespace. Both designs require a second durable storage tier with its own retention, archival, and replication story. The earlier iteration of this work followed the same path and ran into the same operational issues: a separate WAL-like stream that had to be checkpointed, archived, and shipped to standbys.

UNDO-in-WAL is complementary, with a different trade-off. Records ride the existing WAL stream, so durability, replication, archival, and standby replay come for free. The cost is that WAL retention is now also gated by undo_discard_horizon: WAL containing UNDO for an in-flight or unresolved transaction cannot be recycled. In exchange there is no second log to administer, no separate checkpoint, and CLRs are ordinary WAL records that hot standbys already replay.

The zheap thread's review feedback (in particular on segment lifecycle, discard correctness, and recovery sequencing) shaped this design directly: the rules in section 9 of src/backend/access/undo/README on inter-transaction UNDO ordering, CLR idempotency, and the temp/unlogged skip during crash recovery were derived from that discussion.

WAL and recovery
----------------
RM_UNDO_ID is assigned in src/include/access/rmgrlist.h. Replay is deterministic: ALLOCATE updates log control structures, DISCARD advances discard_ptr and oldest_xid, EXTEND grows a log, and APPLY_RECORD restores a page from the embedded full page image without re-reading any UNDO data. Hot standbys see CLRs as normal WAL and apply them with the same FPI machinery used elsewhere. Archive impact is proportional to abort volume: each applied UNDO record produces one ~8 KB CLR.

twophase.c, xact.c, and xlog.c gain hooks to attach a transaction's UNDO chain head to commit/abort/prepared state and to drive PerformUndoRecovery() at the end of redo.

Performance
-----------
OLTP overhead with UNDO active is in the low single-digit percent range at realistic concurrency; abort latency is O(1) for transactions above the instant-abort threshold. TPROC-C numbers: TBD; see cover letter section 5.

Caveats and operational hazards
-------------------------------
- WAL retention is pinned by the oldest in-flight transaction that may still abort and by the logical_revert_worker's pending ATM entries. A long-running transaction or a stalled revert worker will hold WAL on disk; max_wal_size must be set with this in mind.
- sLog cleanup runs in the background; before-image pressure under DSA can grow if readers hold old snapshots while writers produce many before-images. Monitor pg_stat_get_undo_logs() and the sLog partitions.
- ATM-full is a hard backpressure point: when the map cannot accept another XID, the backend falls back to synchronous rollback. This is correct but visible as a latency spike; size the ATM for peak concurrent abort volume.
- Logical decoding filters XLOG_UNDO_BATCH; logical replication of UNDO-enabled tables works because the underlying heap or RECNO WAL records still flow.
- wal_level = minimal still produces correct local rollback but standbys will not see CLRs.

Reviewed-by: TBD
Co-authored-by: Robert Haas <robertmhaas@gmail.com>
Co-authored-by: Amit Kapila <amit.kapila16@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/TBD
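The idempotent-rollback rule in the UNDO commit message (walk the chain newest-first, restore the before-image, emit a CLR, and skip any record that already carries a CLR pointer) can be modeled in a few lines of Python. This is a toy model with in-memory "pages" and a list standing in for WAL; the field names mirror urec_prev and urec_clr_ptr from the text, while the function name, the crash_after parameter, and the use of a list length as an LSN are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UndoRecord:
    page: str
    before_image: bytes
    prev: Optional["UndoRecord"]      # models urec_prev (older record)
    clr_lsn: Optional[int] = None     # models urec_clr_ptr; None means not yet applied


def apply_undo_chain(head, pages, wal, crash_after=None):
    """Apply a chain newest-first, skipping records that already have a CLR.

    crash_after is a test hook (hypothetical): raise after that many
    applications to simulate a crash in the middle of rollback.
    """
    applied = 0
    rec = head
    while rec is not None:
        if rec.clr_lsn is None:
            if crash_after is not None and applied == crash_after:
                raise RuntimeError("simulated crash during rollback")
            pages[rec.page] = rec.before_image   # physical restore
            wal.append(("CLR", rec.page))        # compensation log record
            rec.clr_lsn = len(wal)               # stand-in for the CLR's LSN
            applied += 1
        rec = rec.prev
    return applied
```

Running the function again after a simulated crash applies only the records that lack a CLR, so the total number of CLRs equals the number of UNDO records no matter how many times rollback is restarted.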

Motivation
----------
PostgreSQL's existing handling of filesystem side effects from DDL relies on pendingDeletes and the smgr unlink machinery driven from xact_redo. This covers the common cases (CREATE TABLE, DROP TABLE, TRUNCATE) for relation files but does not extend to other filesystem operations that extensions and new subsystems need: renaming a file as part of a logical-replication slot move, creating an auxiliary directory under $PGDATA, writing a configuration sidecar, setting an extended attribute that records provenance, or atomically replacing a file at commit time. When extension code performs these operations directly it has no way to participate in WAL-logged crash recovery; a crash between the catalog change and the filesystem syscall leaves the cluster in an inconsistent state that operators must clean up by hand.

The UNDO and RECNO work in this series both need transactional file operations -- RECNO for its overflow segment files, UNDO for the (now removed) flat-file backing store and for log-segment rotation artifacts that survived the move to UNDO-in-WAL. Rather than open-coding pendingDeletes-style lists in each subsystem, this commit factors the pattern into a single facility and exposes it.

Design summary
--------------
FILEOPS adds a new resource manager, RM_FILEOPS_ID, with one WAL record type per filesystem primitive: CREATE, DELETE, RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK, SETXATTR, REMOVEXATTR. The record-type encoding follows PostgreSQL convention (high 4 bits of the info byte for type, low bits for flags). The set of primitives mirrors the Berkeley DB fileops.src model so that each operation has an independent redo handler and descriptor.

Operations enqueue a PendingFileOp on a per-transaction list, tagged at_commit or at_abort and tracked by subtransaction nesting level. At end-of-transaction the list is drained in registration order. CREATE registers a delete-on-abort entry so an aborted CREATE TABLE leaves no orphan file; DELETE is deferred to commit so a rolled-back DROP leaves the data intact. Subtransaction abort discards the child's pending entries; subtransaction commit re-parents them to the outer level.

WAL is written before the syscall. Replay during crash recovery re-executes each operation idempotently: CREATE creates the file if missing, DELETE unlinks if present, RENAME moves if the source exists, TRUNCATE shortens to the recorded length, and so on. The parent directory is fsync'd after CREATE and DELETE to ensure the directory entry is durable.

Extended attributes are handled through a thin portability shim, src/port/pg_xattr.c, that abstracts Linux setxattr/getxattr, FreeBSD extattr_*, and macOS getxattr/setxattr. Platforms without xattr support compile the SETXATTR/REMOVEXATTR records into no-ops; on those platforms a transaction that attempts an xattr operation receives a clear error rather than silently succeeding.

The UNDO integration lives in fileops_undo.c: each FILEOPS WAL record carries enough information to also serve as an UNDO record for the FILEOPS RM, so a transaction that creates a file inside a larger UNDO-tracked operation rolls the file create back through the same chain that rolls back the data changes.

Relationship to xact_redo and pendingDeletes
--------------------------------------------
The existing pendingDeletes path in catalog/storage.c covers relation-level file removal for the smgr layer: it queues unlinks during CREATE/DROP and processes them at xact end. FILEOPS generalizes that pattern to arbitrary filesystem primitives and replaces ad-hoc unlinking by extension code with a single WAL-logged path.

smgr operations on relation files are unchanged; they continue to flow through smgrcreate/smgrdounlink and the existing XLOG_SMGR_* records. FILEOPS sits alongside, not on top of, smgr. Code that today open-codes a pendingDeletes-style list (some contrib modules and the prior iteration of UNDO segment management) can switch to FILEOPS without rewriting the recovery hooks.

WAL and recovery
----------------
RM_FILEOPS_ID is assigned in rmgrlist.h. Replay is deterministic because each primitive is idempotent on the recorded inputs; the redo path treats "operation already applied" as success. Hot standbys replay FILEOPS records identically to the primary, so a DDL that creates a file on the primary creates it on the standby. Archive impact is proportional to DDL volume; data-path operations are not affected.

Performance
-----------
FILEOPS operations are bounded by syscall latency; in normal OLTP workloads they are not on the hot path. TPROC-C numbers: TBD: see cover letter section 5.

Coverage of pre-existing crash-asymmetric paths
-----------------------------------------------
Beyond exposing a primitive for new code, this commit closes several long-standing crash-asymmetric paths in core that performed filesystem mutations outside any redo or pendingDeletes coverage:

- copydir.c gains a register_for_abort_cleanup parameter. When set, the destination tree is registered with FILEOPS for recursive removal on abort. CREATE DATABASE STRATEGY=FILE_COPY (and the clone path) and ALTER DATABASE SET TABLESPACE both pass true; the WAL replay caller passes false because the redo handler is itself the recovery action. This eliminates the long-standing window where a crash mid-copy left orphan files that had to be cleaned up by hand.
- DROP TABLESPACE in commands/tablespace.c now defers per-database subdirectory removals, the version-directory removal, and the symlink unlink to commit time via FILEOPS. Previously these were issued synchronously inside destroy_tablespace_directories, which made DROP TABLESPACE asymmetric with CREATE TABLESPACE (which already used FileOpsMkdir/FileOpsSymlink). A transaction abort now leaves the tablespace's filesystem state intact even after the catalog row has been processed.
- CREATE TABLESPACE chmod on the user-supplied location is routed through FileOpsChmod outside recovery, so an abort restores the prior mode via the FILEOPS UNDO record. In recovery the chmod is issued directly because the redo handler has no surrounding transaction.

Relationship to pendingDeletes (narrow integration)
---------------------------------------------------
smgrDoPendingDeletes(isCommit=true) now consults FileOpsRemoveXattrsForRelation(rlocator) to clean up FILEOPS-tracked extended attributes attached to a relation's files. Today this is a no-op because no caller stores xattrs on relation files; the hook is wired in now so future commits introducing relation-keyed xattrs (compression markers, encryption key references) do not need to touch catalog/storage.c again. This is the *narrow* coupling between FILEOPS and pendingDeletes; FILEOPS does NOT replace pendingDeletes outright in this release, see Future Work below.

Future Work
-----------
FILEOPS could eventually subsume parts of the catalog/storage.c pendingDeletes machinery, providing a single uniform commit/abort hook for filesystem mutations. This is deferred for the following reasons:

1. xl_xact_commit carries XACT_XINFO_HAS_RELFILELOCATORS plus an embedded RelFileLocator array as the per-commit unlink list. Replacing that with separate XLOG_FILEOPS_DELETE records would change the wire format consumed by logical decoding, pg_basebackup, pg_receivewal, and downstream replication tools that parse commit records. Such a change requires a deprecation cycle and pg_upgrade coordination and is out of scope for the initial submission.
2. smgrunlink is segment-aware: a single relation drop iterates forks (main, fsm, vm, init) and segment files (1 GB segments) and coordinates with the buffer manager via DropRelationsAllBuffers. FILEOPS is path-string oriented and would either need to reimplement segment walking and buffer-pool teardown, or call back into smgr -- at which point the abstraction is no clearer than the current pendingDeletes list.
3. The wal_level=minimal optimization in smgrDoPendingSyncs cross-references pendingDeletes to skip syncing relations about to be dropped. Replacing pendingDeletes requires either preserving that cross-reference or accepting the cost of redundant fsyncs in the minimal case.

A future patch can introduce FileOpsRelationDrop(rlocator) as an opt-in alternative to RelationDropStorage that emits XLOG_FILEOPS_DELETE records per fork and segment, with a corresponding xl_xact_commit format flag indicating "this xact has FILEOPS delete records, do not consult the embedded relfilelocator array." See doc/src/sgml/fileops.sgml for the full discussion.

Caveats and operational hazards
-------------------------------
- xattr portability: the user-visible behavior of SETXATTR and REMOVEXATTR depends on the underlying filesystem. tmpfs on some Linux distributions does not support user xattrs by default; ZFS and APFS handle xattrs through different interfaces. Tests in src/test/recovery/t/054 and 063 skip xattr cases when the kernel reports ENOTSUP.
- A crash between the WAL flush and the syscall is the standard WAL-then-do ordering; recovery re-executes the syscall. A crash between the syscall and the next WAL flush is also safe because the operation is recorded and idempotent on replay.
- DELETE deferral means files belonging to a dropped relation remain on disk until commit; long-running transactions that drop large objects will hold disk space until they end.
- WAL volume scales with DDL frequency. Workloads with high DDL churn (test harnesses, multi-tenant provisioning) should expect a measurable increase in WAL produced per operation.
- The pg_xattr.c shim is best-effort across platforms; review src/test/recovery/t/063_fileops_undo.pl for the matrix that is actually exercised in CI.

Reviewed-by: TBD
Discussion: https://postgr.es/m/TBD
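The deferred-operation pattern in the FILEOPS commit message (CREATE takes effect immediately and registers a delete-on-abort entry, DELETE is deferred to commit, and redo is idempotent) can be sketched with an in-memory model. A Python set stands in for the filesystem and a list for WAL; the class, its methods, and the redo function are hypothetical names, and subtransaction nesting is deliberately left out of the sketch.

```python
def redo(fs, record):
    """Idempotent replay: 'already applied' counts as success."""
    op, path = record
    if op == "CREATE":
        fs.add(path)         # no-op if the file already exists
    elif op == "DELETE":
        fs.discard(path)     # no-op if the file is already gone


class PendingFileOps:
    """Per-transaction list of deferred filesystem actions (toy model)."""

    def __init__(self, fs):
        self.fs = fs          # set of paths standing in for the filesystem
        self.wal = []         # records are appended before the operation runs
        self._pending = []    # (when, op, path), kept in registration order

    def create(self, path):
        self.wal.append(("CREATE", path))               # WAL first
        redo(self.fs, ("CREATE", path))                 # then the "syscall"
        self._pending.append(("abort", "DELETE", path)) # undo on abort

    def delete(self, path):
        # Nothing touches the filesystem until commit.
        self._pending.append(("commit", "DELETE", path))

    def finish(self, committed):
        when = "commit" if committed else "abort"
        for tag, op, path in self._pending:
            if tag == when:
                self.wal.append((op, path))
                redo(self.fs, (op, path))
        self._pending.clear()
```

Because every handler tolerates an operation that has already happened, replaying the recorded WAL any number of times converges on the same filesystem state, which is the property crash recovery and standby replay rely on.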

Motivation
----------
The heap AM allocates a fresh tuple version on every UPDATE and relies on VACUUM to reclaim the old version after no snapshot can see it. For workloads dominated by narrow updates to hot rows -- counters, queue heads, materialized aggregates, time-series tail writes -- this produces churn that the planner, the buffer manager, and the autovacuum machinery all pay for: index bloat, HOT chain walks, write amplification on full_page_writes, and a steady background of VACUUM I/O that competes with the foreground. The operational pattern is well known and the standard advice (lower fillfactor, more autovacuum workers, smaller scale factors) buys headroom at the cost of space.

RECNO is a table access method that updates tuples in place and uses UNDO for rollback and (when fully wired) for reader visibility. An UPDATE writes the new value over the old one in the same slot; the prior version goes to the secondary log (sLog) so concurrent readers under an older snapshot can still see it. Aborted UPDATEs are reversed by physical UNDO application: the slot is restored from the before-image with no dead tuple left behind. No HOT chains, no index bloat from version proliferation, no VACUUM debt from rolled-back work.

Design summary
--------------
RECNO registers RM_RECNO_ID and a tableam handler. The on-disk page layout retains the standard PageHeader and ItemId array so existing buffer-manager, FSM, and visibility-map code paths apply unchanged. Tuples are stored as RecnoTuple, a header laid out for in-place rewrite: t_xmin, t_xmax, t_cid, t_infomask, t_infomask2, and an attribute payload sized to the original allocation. An update that fits in the existing slot is performed in place under the buffer's exclusive content lock; an update that does not fit spills the new tuple to an overflow segment and rewrites the slot with a forwarding pointer.

Every DML emits an UNDO record before mutating the page (INSERT records have no payload; DELETE/UPDATE/INPLACE carry the full before-image). Rollback walks the chain through the standard UNDO machinery (separate commit) and memcpy's the prior bytes back. Because rollback is physical and idempotent via CLRs, the RECNO recovery story is the heap recovery story plus CLR replay.

Visibility uses the same snapshot model as heap. Readers reconstruct the version they need by consulting the sLog when the on-page tuple's xmin is newer than the snapshot. The sLog is partitioned (slog_num_partitions GUC) so writers do not contend on a single hash; readers walk a partition's chain of before-images keyed by (relfilenode, blkno, offset). A clock-sweep style maintenance pass (recno_clock.c) ages out before-images that no live snapshot can need.

Overflow segment files are managed through FILEOPS so that allocating an overflow segment, extending it, and reclaiming it all participate in WAL and crash recovery. Compression (recno_compress.c) is per-page, applied at write-out time, with the ItemId array kept uncompressed so random reads stay O(1). The visibility-map analogue (recno_vm.c) tracks all-visible and all-frozen pages in the same shape as the heap VM, so index-only scans plug in unchanged. The dirty map (recno_dirtymap.c) is a shared bitmap that the revert worker uses to find pages with pending UNDO work without scanning the entire relation.

Relationship to heap; why a separate AM
---------------------------------------
The obvious question is why this is a new AM rather than a set of flags on the existing heap AM. The honest answer: every change RECNO makes is invasive. In-place updates change the locking discipline (writers take buffer exclusive earlier and hold it longer), the visibility rules (a tuple's xmin alone does not tell you what the reader should see; you have to consult the sLog), the rollback path (physical restore from UNDO instead of dead tuple plus CLOG), and the VACUUM contract (there are no dead tuples to reclaim, but there is sLog and overflow space to maintain). Bolting these onto heapam.c would either introduce runtime branches on every hot path or fork the file in practice while pretending not to.

Anticipating the Geoghegan critique: a new AM means duplicated machinery (a parallel VM, a parallel pruning story, a parallel bulk-insert path) and a parallel set of bugs to find and fix. That cost is real. The judgment here is that the alternative -- making heapam.c conditionally do everything heap does today and also do everything RECNO needs -- is worse for long-term maintainability and forecloses experimentation that an isolated AM allows. RECNO is intentionally additive: heap is unchanged, heap regression tests pass unchanged, and operators who do not opt in see no behavior change.

Where RECNO and heap can share, they do. The buffer manager, WAL infrastructure, FSM, VM shape, btree integration, executor slot machinery, and statistics framework are all reused. The tableam handler exposes RECNO through the standard interface.

WAL and recovery
----------------
RM_RECNO_ID is assigned in rmgrlist.h. Record types cover INSERT, DELETE, UPDATE in-place, UPDATE with overflow, PRUNE-equivalent (sLog trim), VM set/clear, and overflow segment extend/truncate. Replay is deterministic: each record carries either a full page image or enough state to reconstruct the page from the prior LSN. Hot standbys replay RECNO WAL identically; the sLog is rebuilt from XLOG_UNDO_BATCH replay so reader visibility on the standby is consistent with the primary. Archive impact tracks UPDATE-heavy workloads: an in-place update of an existing slot produces less WAL than the heap equivalent because there is no new tuple version, but each update also emits the before-image to UNDO, so total WAL is workload dependent.

Performance
-----------
Update-heavy workloads see lower index bloat and reduced VACUUM overhead at the cost of in-place update WAL plus UNDO before-image WAL. TPROC-C numbers: TBD; see cover letter section 5.

Caveats and operational hazards
-------------------------------
- In-place MVCC reader walk: when a snapshot is older than the on-page xmin the reader must walk the sLog chain for that slot. Long-running snapshots with concurrent UPDATE-heavy writers turn this into a measurable per-row cost; monitor pg_stat_get_undo_logs() and snapshot age.
- sLog before-image pressure: the sLog is sized through shared_buffers and DSA; sustained UPDATE bursts can push before-image residency out of memory and onto disk. Tune slog_num_partitions for write concurrency and watch DSA usage.
- Overflow churn: a workload that repeatedly grows a tuple past its slot allocation and back again will allocate, free, and reallocate overflow space. The compressor mitigates this for compressible payloads but not for genuinely growing rows.
- Rollback bandwidth: an aborted bulk UPDATE generates a CLR per applied record (~8 KB FPI each). Aborts that exceed undo_instant_abort_threshold are deferred to the revert worker, but the worker still produces that WAL volume.
- Autovacuum interaction: rolled-back rows leave no dead tuples, so n_dead_tup stays near zero on busy RECNO tables. The default autovacuum trigger will not fire; set autovacuum_vacuum_threshold = 0 at the table level to ensure hint-bit maintenance and index bloat cleanup still run.
- This AM is new code. contrib/pageinspect grows recnofuncs.c for inspection but tooling parity with heap (pg_stat_user_* granularity, repack-style utilities, third-party monitors) will lag heap until the AM has been in tree for a release.

Reviewed-by: TBD
Discussion: https://postgr.es/m/TBD
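The reader path described in the RECNO commit message (an UPDATE overwrites the slot in place and pushes the before-image to the sLog; a reader whose snapshot predates the on-page xmin walks the slot's before-image chain) can be pictured with a small Python model. The snapshot here is reduced to a single XID cutoff and every transaction is assumed committed, which is a strong simplification of PostgreSQL snapshots; the class and method names are hypothetical, and only the (relfilenode, blkno, offset) key comes from the text.

```python
class RecnoModel:
    """Toy model of in-place updates with a before-image log."""

    def __init__(self):
        self.slots = {}   # (relfilenode, blkno, offset) -> (xmin, value)
        self.slog = {}    # same key -> list of (xmin, value), newest first

    def insert(self, key, xid, value):
        self.slots[key] = (xid, value)

    def update(self, key, xid, value):
        # Save the current version as a before-image, then overwrite in place.
        self.slog.setdefault(key, []).insert(0, self.slots[key])
        self.slots[key] = (xid, value)

    def read(self, key, snapshot_xid):
        """Return the newest version written by an XID <= snapshot_xid.

        Simplification: real snapshots track in-progress XIDs and commit
        status; this model treats every writer as committed.
        """
        xmin, value = self.slots[key]
        if xmin <= snapshot_xid:
            return value                      # on-page version is visible
        for old_xmin, old_value in self.slog.get(key, []):
            if old_xmin <= snapshot_xid:
                return old_value              # found in the before-image chain
        return None                           # row did not exist for this snapshot
```

The loop in read is the "reader walk" named in the caveats: its length grows with the number of updates applied to a slot since the snapshot was taken, which is why old snapshots combined with hot rows become a per-row cost.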

github-actions bot force-pushed the master branch 30 times, most recently from 9355586 to 9cbf7e6, on March 30, 2026 18:18

github-actions bot force-pushed the master branch 21 times, most recently from c08b44f to 7c5c2d3, on April 2, 2026 22:10

gburd and others added 9 commits May 28, 2026 15:07

[DO NOT MERGE] Benchmarks: RECNO and UNDO performance test suite

5099675


Undo #21: gburd wants to merge 9 commits into master from undo
