Undo#21

Draft
gburd wants to merge 9 commits into master from undo

@gburd gburd commented Mar 26, 2026

No description provided.

@github-actions github-actions Bot force-pushed the master branch 30 times, most recently from 9355586 to 9cbf7e6 Compare March 30, 2026 18:18
@github-actions github-actions Bot force-pushed the master branch 21 times, most recently from c08b44f to 7c5c2d3 Compare April 2, 2026 22:10
gburd and others added 9 commits May 28, 2026 15:07
  - Hourly upstream sync from postgres/postgres (24x daily)
  - AI-powered PR reviews using AWS Bedrock Claude Sonnet 4.5
  - Multi-platform CI via existing Cirrus CI configuration
  - Cost tracking and comprehensive documentation

  Features:
  - Automatic issue creation on sync conflicts
  - PostgreSQL-specific code review prompts (C, SQL, docs, build)
  - Cost limits: $15/PR, $200/month
  - Inline PR comments with security/performance labels
  - Skip draft PRs to save costs

  Documentation:
  - .github/SETUP_SUMMARY.md - Quick setup overview
  - .github/QUICKSTART.md - 15-minute setup guide
  - .github/PRE_COMMIT_CHECKLIST.md - Verification checklist
  - .github/docs/ - Detailed guides for sync, AI review, Bedrock

  See .github/README.md for complete overview

Complete Phase 3: Windows builds + fix sync for CI/CD commits

Phase 3: Windows Dependency Build System
- Implement full build workflow (OpenSSL, zlib, libxml2)
- Smart caching by version hash (80% cost reduction)
- Dependency bundling with manifest generation
- Weekly auto-refresh + manual triggers
- PowerShell download helper script
- Comprehensive usage documentation

Sync Workflow Fix:
- Allow .github/ commits (CI/CD config) on master
- Detect and reject code commits outside .github/
- Merge upstream while preserving .github/ changes
- Create issues only for actual pristine violations

Documentation:
- Complete Windows build usage guide
- Update all status docs to 100% complete
- Phase 3 completion summary

All three CI/CD phases complete (100%):
✅ Hourly upstream sync with .github/ preservation
✅ AI-powered PR reviews via Bedrock Claude 4.5
✅ Windows dependency builds with smart caching

Cost: $40-60/month total
See .github/PHASE3_COMPLETE.md for details

Fix sync to allow 'dev setup' commits on master

The sync workflow was failing because the 'dev setup v19' commit
modifies files outside .github/. Updated workflows to recognize
commits with messages starting with 'dev setup' as allowed on master.

Changes:
- Detect 'dev setup' commits by message pattern (case-insensitive)
- Allow merge if commits are .github/ OR dev setup OR both
- Update merge messages to reflect preserved changes
- Document pristine master policy with examples
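
The allow rule above can be sketched as a small predicate.  This is
an illustrative C rendering only, since the workflow itself is CI
configuration; the toy_* names are hypothetical, and it assumes a
commit is acceptable when its subject starts with 'dev setup'
(case-insensitive) or when every path it touches is under .github/.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: subject starts with "dev setup", any case. */
static bool
toy_is_dev_setup(const char *subject)
{
    return strncasecmp(subject, "dev setup", strlen("dev setup")) == 0;
}

/* Hypothetical helper: every touched path lives under .github/. */
static bool
toy_only_github_paths(const char *const *paths, int npaths)
{
    for (int i = 0; i < npaths; i++)
        if (strncmp(paths[i], ".github/", strlen(".github/")) != 0)
            return false;
    return true;
}

/* Allowed on master if .github/-only OR dev setup OR both. */
static bool
toy_commit_allowed(const char *subject, const char *const *paths, int npaths)
{
    return toy_is_dev_setup(subject) || toy_only_github_paths(paths, npaths);
}
```

A commit that touches both .github/ and other paths is rejected
unless its subject carries the 'dev setup' prefix.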

This allows personal development environment commits (IDE configs,
debugging tools, shell aliases, Nix configs, etc.) on master without
violating the pristine mirror policy.

Future dev environment updates should start with 'dev setup' in the
commit message to be automatically recognized and preserved.

See .github/docs/pristine-master-policy.md for complete policy
See .github/DEV_SETUP_FIX.md for fix summary

Optimize CI/CD costs by skipping builds for pristine commits

Add cost optimization to Windows dependency builds to avoid expensive
builds when only pristine commits are pushed (dev setup commits or
.github/ configuration changes).

Changes:
- Add check-changes job to detect pristine-only pushes
- Skip Windows builds when all commits are dev setup or .github/ only
- Add comprehensive cost optimization documentation
- Update README with cost savings (~40% reduction)

Expected savings: ~$3-5/month on Windows builds, ~$40-47/month total
through combined optimizations.

Manual dispatch and scheduled builds always run regardless.
Nix-based development environment for PostgreSQL hacking.  Not for
merge; staged here so per-commit build/test runs can share a single
toolchain.

Changes from v34:

- Drop the .idea/ and .vscode/ editor directories.  Editor state is
  personal and should not live in the repository.

- Re-introduce glibc-no-fortify-warning.patch and the patchedGlibc
  overlay in shell.nix.  Without this, meson's libcurl thread-safety
  probe fails under our default -D_FORTIFY_SOURCE=1 -O0 -Werror
  combination because features.h emits a -Wcpp warning that becomes a
  fatal error.  The patch is scoped to the dev shell only and does not
  leak into system glibc or release builds.
Sparsemap is a memory-efficient data structure for maintaining sparse
sets of integers using hierarchical bitmaps.  It supports O(1) set/get
operations and efficient iteration over set bits while using far less
memory than a dense bitmap for sparse populations.

The implementation provides:
  - sparsemap_set/get/is_set for individual bit manipulation
  - sparsemap_scan for efficient forward iteration
  - sparsemap_select for rank-based selection
  - Configurable initial capacity with automatic growth
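
To make the hierarchical-bitmap idea concrete, here is a minimal
two-level sketch.  It is not the sparsemap implementation: the
ToySparseMap type, the toy_* functions, and the fixed 4096-bit
capacity are assumptions chosen for illustration.  A summary word
records which leaf words exist, so absent regions cost no memory
and scans can skip them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: bit i of summary set => leaf[i] is allocated. */
typedef struct ToySparseMap
{
    uint64_t  summary;
    uint64_t *leaf[64];     /* each leaf word covers 64 bit positions */
} ToySparseMap;

/* Set bit idx (0..4095); the leaf is allocated on first use. */
static bool
toy_sparsemap_set(ToySparseMap *m, unsigned idx)
{
    unsigned hi = idx / 64, lo = idx % 64;

    if (idx >= 64 * 64)
        return false;
    if (m->leaf[hi] == NULL)
    {
        m->leaf[hi] = calloc(1, sizeof(uint64_t));
        if (m->leaf[hi] == NULL)
            return false;
        m->summary |= UINT64_C(1) << hi;
    }
    *m->leaf[hi] |= UINT64_C(1) << lo;
    return true;
}

static bool
toy_sparsemap_is_set(const ToySparseMap *m, unsigned idx)
{
    unsigned hi = idx / 64, lo = idx % 64;

    if (idx >= 64 * 64 || m->leaf[hi] == NULL)
        return false;
    return (*m->leaf[hi] >> lo) & 1;
}

/* Return the first set bit at or after 'from', or -1 if there is none. */
static int
toy_sparsemap_scan(const ToySparseMap *m, unsigned from)
{
    for (unsigned i = from; i < 64 * 64; i++)
    {
        unsigned hi = i / 64;

        if (((m->summary >> hi) & 1) == 0)
        {
            i = hi * 64 + 63;   /* skip the whole absent leaf */
            continue;
        }
        if (toy_sparsemap_is_set(m, i))
            return (int) i;
    }
    return -1;
}
```

Setting bits 5 and 4000 allocates only two leaf words; a scan
starting at bit 6 skips the absent leaves in between.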

Used by the UNDO subsystem for tracking allocated pages and by RECNO
for free-space management within relation forks.

Includes a TAP regression test module (test_sparsemap) exercising all
public API operations.
Header-only implementation of a probabilistic skip list providing
expected O(log n) insert, delete, and lookup operations with O(n) space.
Compared to rbtree, skip lists offer simpler implementation, better
cache locality for sequential scans, and lock-free read potential.

The implementation provides:
  - Type-safe macros for defining typed skip lists (DEFINE_SKIPLIST)
  - Configurable maximum height (up to 32 levels)
  - Forward iteration via SKIPLIST_FOREACH
  - Range queries and nearest-neighbor lookup
  - Memory allocation via palloc (TopMemoryContext by default)

Used by the UNDO subsystem for maintaining ordered transaction
metadata and by RECNO for HLC-ordered page directories.

Includes a TAP regression test module (test_skiplist) exercising
insertion, deletion, iteration, and edge cases.
LRLock is a reader-writer lock that admits wait-free readers and a
single serialized writer.  Readers swap a 32-bit version counter
through pg_atomic_read_acquire_u32 / pg_atomic_fetch_add_acqrel_u32
without touching any shared cache line that the writer also touches
during steady state.  The writer holds a short LWLock to serialize
against other writers and bumps the version counter twice per write
(odd while writing, even when done) so concurrent readers can detect
in-flight updates and retry.
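
The odd/even version protocol can be sketched with C11 atomics.
This is a generic version-counter (seqlock-style) illustration, not
the LRLock code: it uses <stdatomic.h> rather than the pg_atomic_*
primitives, the ToyVersioned type and toy_* functions are
hypothetical, and the memory orderings follow the common seqlock
pattern as an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct ToyVersioned
{
    atomic_uint version;    /* odd while a write is in flight */
    atomic_int  a;          /* payload; writers keep b == 2 * a */
    atomic_int  b;
} ToyVersioned;

/* Writer side.  Assumes the caller has already serialized writers. */
static void
toy_write(ToyVersioned *v, int value)
{
    atomic_fetch_add_explicit(&v->version, 1, memory_order_relaxed); /* odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&v->a, value, memory_order_relaxed);
    atomic_store_explicit(&v->b, 2 * value, memory_order_relaxed);
    atomic_fetch_add_explicit(&v->version, 1, memory_order_release); /* even */
}

/* One read attempt; returns false when the caller must retry. */
static bool
toy_try_read(ToyVersioned *v, int *a, int *b)
{
    unsigned before = atomic_load_explicit(&v->version, memory_order_acquire);

    if (before & 1)
        return false;       /* a write is in flight */
    *a = atomic_load_explicit(&v->a, memory_order_relaxed);
    *b = atomic_load_explicit(&v->b, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    /* An unchanged version means no write overlapped the two loads. */
    return atomic_load_explicit(&v->version, memory_order_relaxed) == before;
}
```

A reader would loop on toy_try_read() until it returns true; the
pair it gets back then comes from a single completed write.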

This primitive is added now, ahead of any consumer, because the
sLog flat hash partitions in the UNDO subsystem (added in a later
commit) depend on it for the read path of MVCC visibility checks.
LRLock is also useful in its own right and is exercised by the
test_lrlock module added here.

Three new atomic primitives are added to support LRLock:
- pg_atomic_fetch_add_acqrel_u32: acquire-release atomic add, lighter
  than the existing seq-cst variant on aarch64.
- pg_atomic_seq_cst_fence: standalone seq-cst fence separate from an
  RMW operation.
- pg_atomic_read_acquire_u32: read with acquire semantics, between
  pg_atomic_read_u32 (no barrier) and pg_atomic_read_membarrier_u32
  (full barrier) in cost.

The test_lrlock module covers single-writer correctness, multi-reader
consistency under contention, and writer serialization.

Author: Greg Burd <greg@burd.me>
Reviewed-by: TBD
Discussion: https://postgr.es/m/TBD
…pport

Motivation
----------

PostgreSQL's MVCC model leaves dead tuples behind on rollback and
relies on VACUUM to reclaim space.  Workloads that mix bulk DML with
frequent aborts (ETL pipelines that hit a unique-key violation late,
application-driven retry loops, partition swap operations that fail
on a constraint check) accumulate bloat that VACUUM cannot keep up
with, and ROLLBACK latency for large transactions remains acceptable
only because no physical work is done.  When the user later runs
VACUUM the cost shows up there, plus the index entries pointing at
dead tuples must be visited.  Operators want a path where ROLLBACK
returns immediately and leaves no dead-tuple debt behind.

This commit adds the cluster-wide UNDO infrastructure that makes the
above possible.  It does not change heap behavior by default; heap
participation is opt-in per relation, and the new RECNO AM (separate
commit) is the first AM that always uses it.

Design summary
--------------

UNDO records are written into the regular WAL stream as
XLOG_UNDO_BATCH records owned by a new resource manager,
RM_UNDO_ID.  Each modifying operation on an UNDO-enabled relation
appends an inverse record to a per-transaction chain linked through
urec_prev.  The header is AM-agnostic; payload interpretation is
delegated to a per-RM dispatch table registered via
RegisterUndoRmgr().  Heap and nbtree register their own callbacks.

Rollback is physical: ApplyUndoChain() walks the chain newest-first
and restores the prior page bytes via memcpy under a critical
section.  Each application emits a Compensation Log Record
(XLOG_UNDO_APPLY_RECORD) with REGBUF_FORCE_IMAGE.  The CLR's LSN is
written back into urec_clr_ptr, so a crash mid-rollback resumes
idempotently: records with a valid urec_clr_ptr are skipped on the
next pass.
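
The resume-after-crash behaviour can be illustrated with a toy
model.  Nothing here is the patch's code: ToyUndoRec, clr_lsn
(standing in for urec_clr_ptr), and toy_apply_chain are
hypothetical, and the page is a 16-byte array.  The point shown is
only that a record with a valid CLR pointer is skipped on the next
pass.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE   16
#define TOY_INVALID_LSN 0

typedef struct ToyUndoRec
{
    int      prev;                  /* older record in the chain, -1 ends it */
    uint64_t clr_lsn;               /* TOY_INVALID_LSN until compensated */
    char     before[TOY_PAGE_SIZE]; /* prior page bytes */
} ToyUndoRec;

/*
 * Walk the chain newest-first and restore before-images.  Stops after
 * max_apply applications to simulate a crash mid-rollback.  Returns the
 * number of records applied in this pass.
 */
static int
toy_apply_chain(ToyUndoRec *recs, int newest, char *page,
                uint64_t *next_lsn, int max_apply)
{
    int applied = 0;

    for (int i = newest; i >= 0 && applied < max_apply; i = recs[i].prev)
    {
        if (recs[i].clr_lsn != TOY_INVALID_LSN)
            continue;               /* already compensated: skip */
        memcpy(page, recs[i].before, TOY_PAGE_SIZE);
        recs[i].clr_lsn = (*next_lsn)++;    /* "emit" the CLR */
        applied++;
    }
    return applied;
}
```

Running a pass limited to one record and then a full pass applies
each record exactly once; a third pass applies nothing.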

For large transactions (UNDO footprint above
undo_instant_abort_threshold, default 64 KB) the backend records the
XID in a shared Aborted Transaction Map (ATM) and returns from
ROLLBACK in O(1).  The logical_revert_worker drains the ATM in the
background, applying chains and emitting CLRs without holding the
client.  This is the Constant Time Recovery design from Antonopoulos
et al. (VLDB 2019), with the persistent version store realized as
in-WAL UNDO rather than a separate tablespace.
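
A sketch of the abort-time decision follows, assuming a fixed-size
ATM.  ToyATM, its two-slot capacity, and toy_choose_rollback are
hypothetical; the fallback to synchronous rollback when the map is
full follows the caveat listed later in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ATM_SLOTS 2     /* tiny on purpose, to show the full case */

typedef struct ToyATM
{
    uint32_t xids[TOY_ATM_SLOTS];
    int      used;
} ToyATM;

typedef enum
{
    TOY_ROLLBACK_SYNC,      /* apply the chain before returning */
    TOY_ROLLBACK_DEFERRED   /* record the XID, let the worker apply it */
} ToyRollbackMode;

static ToyRollbackMode
toy_choose_rollback(ToyATM *atm, uint32_t xid,
                    size_t undo_bytes, size_t threshold)
{
    if (undo_bytes <= threshold)
        return TOY_ROLLBACK_SYNC;       /* small: cheap to apply inline */
    if (atm->used >= TOY_ATM_SLOTS)
        return TOY_ROLLBACK_SYNC;       /* map full: backpressure path */
    atm->xids[atm->used++] = xid;
    return TOY_ROLLBACK_DEFERRED;
}
```

With a 64 KB threshold, a 1 KB transaction rolls back inline, and
the third large abort against a two-slot map also falls back to
the synchronous path.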

A secondary log (sLog) captures before-images for readers that need
old tuple versions; the sLog is partitioned (slog_num_partitions
GUC) for write concurrency.  UNDO blocks live in shared_buffers via
a virtual RelFileLocator (spcOid=1663, dbOid=9, relNumber=logno),
so no separate cache exists to size or warm.

Recovery follows ARIES: redo replays all WAL forward including
ALLOCATE/DISCARD/EXTEND and any CLRs already produced before the
crash; analysis is implicit (XIDs that wrote UNDO but have no
commit/abort record remain in the in-memory table); undo applies
the chains for those XIDs and emits new CLRs.  Temp and unlogged
records are skipped during crash recovery.

Relationship to zheap and MariaDB-style undo
--------------------------------------------

zheap stored UNDO in dedicated segment files under
$PGDATA/base/undo and used UNDO both for rollback and for reader
visibility against in-place updates.  MariaDB and InnoDB use a
rollback segment in a system tablespace.  Both designs require a
second durable storage tier with its own retention, archival, and
replication story.  The earlier iteration of this work followed the
same path and ran into the same operational issues: a separate WAL-
like stream that had to be checkpointed, archived, and shipped to
standbys.

UNDO-in-WAL is complementary, with a different trade-off.  Records
ride the existing WAL stream, so durability, replication, archival,
and standby replay come for free.  The cost is that WAL retention
is now also gated by undo_discard_horizon: WAL containing UNDO for
an in-flight or unresolved transaction cannot be recycled.  In
exchange there is no second log to administer, no separate
checkpoint, and CLRs are ordinary WAL records that hot standbys
already replay.  The zheap thread's review feedback (in particular
on segment lifecycle, discard correctness, and recovery
sequencing) shaped this design directly: the rules in section 9 of
src/backend/access/undo/README on inter-transaction UNDO ordering,
CLR idempotency, and the temp/unlogged skip during crash recovery
were derived from that discussion.

WAL and recovery
----------------

RM_UNDO_ID is assigned in src/include/access/rmgrlist.h.  Replay is
deterministic: ALLOCATE updates log control structures, DISCARD
advances discard_ptr and oldest_xid, EXTEND grows a log, and
APPLY_RECORD restores a page from the embedded full page image
without re-reading any UNDO data.  Hot standbys see CLRs as normal
WAL and apply them with the same FPI machinery used elsewhere.
Archive impact is proportional to abort volume: each applied UNDO
record produces one ~8 KB CLR.

twophase.c, xact.c, and xlog.c gain hooks to attach a transaction's
UNDO chain head to commit/abort/prepared state and to drive
PerformUndoRecovery() at the end of redo.

Performance
-----------

OLTP overhead with UNDO active is in the low single-digit percent
range at realistic concurrency; abort latency is O(1) for
transactions above the instant-abort threshold.  TPROC-C numbers are
TBD; see cover letter section 5.

Caveats and operational hazards
-------------------------------

- WAL retention is pinned by the oldest in-flight transaction that
  may still abort and by the logical_revert_worker's pending ATM
  entries.  A long-running transaction or a stalled revert worker
  will hold WAL on disk; max_wal_size must be set with this in
  mind.
- sLog cleanup runs in the background; before-image pressure under
  DSA can grow if readers hold old snapshots while writers produce
  many before-images.  Monitor pg_stat_get_undo_logs() and the
  sLog partitions.
- ATM-full is a hard backpressure point: when the map cannot
  accept another XID, the backend falls back to synchronous
  rollback.  This is correct but visible as a latency spike;
  size the ATM for peak concurrent abort volume.
- Logical decoding filters XLOG_UNDO_BATCH; logical replication of
  UNDO-enabled tables works because the underlying heap or RECNO
  WAL records still flow.
- wal_level = minimal still produces correct local rollback but
  standbys will not see CLRs.

Reviewed-by: TBD
Co-authored-by: Robert Haas <robertmhaas@gmail.com>
Co-authored-by: Amit Kapila <amit.kapila16@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Co-authored-by: Andres Freund <andres@anarazel.de>
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/TBD
Motivation
----------

PostgreSQL's existing handling of filesystem side effects from DDL
relies on pendingDeletes and the smgr unlink machinery driven from
xact_redo.  This covers the common cases (CREATE TABLE, DROP
TABLE, TRUNCATE) for relation files but does not extend to other
filesystem operations that extensions and new subsystems need:
renaming a file as part of a logical-replication slot move,
creating an auxiliary directory under $PGDATA, writing a
configuration sidecar, setting an extended attribute that records
provenance, or atomically replacing a file at commit time.  When
extension code performs these operations directly it has no way to
participate in WAL-logged crash recovery; a crash between the
catalog change and the filesystem syscall leaves the cluster in an
inconsistent state that operators must clean up by hand.

The UNDO and RECNO work in this series both need transactional
file operations -- RECNO for its overflow segment files, UNDO for
the (now removed) flat-file backing store and for log-segment
rotation artifacts that survived the move to UNDO-in-WAL.  Rather
than open-coding pendingDeletes-style lists in each subsystem,
this commit factors the pattern into a single facility and exposes
it.

Design summary
--------------

FILEOPS adds a new resource manager, RM_FILEOPS_ID, with one WAL
record type per filesystem primitive: CREATE, DELETE, RENAME,
WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK,
SETXATTR, REMOVEXATTR.  The record-type encoding follows
PostgreSQL convention (high 4 bits of the info byte for type, low
bits for flags).  The set of primitives mirrors the Berkeley DB
fileops.src model so that each operation has an independent redo
handler and descriptor.
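
The info-byte split can be sketched as follows.  The TOY_* constants
and their numeric values are placeholders for illustration; the
actual XLOG_FILEOPS_* values are not given in this message.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FILEOPS_TYPE_MASK 0xF0  /* high 4 bits: record type */
#define TOY_FILEOPS_FLAG_MASK 0x0F  /* low 4 bits: flags */

/* Placeholder type codes; only the bit positions matter here. */
enum
{
    TOY_FILEOPS_CREATE = 0x00,
    TOY_FILEOPS_DELETE = 0x10,
    TOY_FILEOPS_RENAME = 0x20
};

static uint8_t
toy_make_info(uint8_t type, uint8_t flags)
{
    return (uint8_t) ((type & TOY_FILEOPS_TYPE_MASK) |
                      (flags & TOY_FILEOPS_FLAG_MASK));
}

static uint8_t
toy_info_type(uint8_t info)
{
    return info & TOY_FILEOPS_TYPE_MASK;
}

static uint8_t
toy_info_flags(uint8_t info)
{
    return info & TOY_FILEOPS_FLAG_MASK;
}
```

A redo dispatcher would switch on toy_info_type(info) and pass the
flags through to the handler.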

Operations enqueue a PendingFileOp on a per-transaction list,
tagged at_commit or at_abort and tracked by subtransaction nesting
level.  At end-of-transaction the list is drained in registration
order.  CREATE registers a delete-on-abort entry so an aborted
CREATE TABLE leaves no orphan file; DELETE is deferred to commit
so a rolled-back DROP leaves the data intact.  Subtransaction
abort discards the child's pending entries; subtransaction commit
re-parents them to the outer level.
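
The list discipline described above can be modelled in a few lines.
This is a sketch under assumptions, not the PendingFileOp code: the
ToyPendingOp fields, the fixed-size array, and the toy_* functions
are hypothetical, and "draining" only collects paths instead of
issuing syscalls.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_PENDING 16

typedef struct ToyPendingOp
{
    const char *path;
    bool        at_commit;  /* true: run at commit; false: run at abort */
    int         nest_level; /* subtransaction nesting level */
    bool        live;       /* false once discarded */
} ToyPendingOp;

typedef struct ToyPendingList
{
    ToyPendingOp ops[TOY_MAX_PENDING];
    int          n;
} ToyPendingList;

static bool
toy_enqueue(ToyPendingList *l, const char *path, bool at_commit, int nest_level)
{
    if (l->n >= TOY_MAX_PENDING)
        return false;
    l->ops[l->n++] = (ToyPendingOp) { path, at_commit, nest_level, true };
    return true;
}

/* Subxact abort discards the child's entries; commit re-parents them. */
static void
toy_subxact_end(ToyPendingList *l, int nest_level, bool is_commit)
{
    for (int i = 0; i < l->n; i++)
    {
        ToyPendingOp *op = &l->ops[i];

        if (!op->live || op->nest_level < nest_level)
            continue;
        if (is_commit)
            op->nest_level = nest_level - 1;
        else
            op->live = false;
    }
}

/* Collect, in registration order, the entries due for this outcome. */
static int
toy_drain(const ToyPendingList *l, bool is_commit, const char **out)
{
    int count = 0;

    for (int i = 0; i < l->n; i++)
        if (l->ops[i].live && l->ops[i].at_commit == is_commit)
            out[count++] = l->ops[i].path;
    return count;
}
```

An entry registered in a committed subtransaction survives to the
top-level drain, while one registered in an aborted subtransaction
never runs.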

WAL is written before the syscall.  Replay during crash recovery
re-executes each operation idempotently: CREATE creates the file
if missing, DELETE unlinks if present, RENAME moves if the source
exists, TRUNCATE shortens to the recorded length, and so on.  The
parent directory is fsync'd after CREATE and DELETE to ensure the
directory entry is durable.
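
Idempotent redo of this kind can be sketched with plain POSIX
calls.  The toy_redo_* functions are hypothetical, not the
RM_FILEOPS_ID handlers, and the parent-directory fsync is omitted
for brevity.  Each function treats "already applied" as success.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>      /* rename */
#include <unistd.h>

/* CREATE: create the file if missing; an existing file is fine. */
static bool
toy_redo_create(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return false;
    close(fd);
    return true;
}

/* DELETE: unlink if present; a missing file means already applied. */
static bool
toy_redo_delete(const char *path)
{
    return unlink(path) == 0 || errno == ENOENT;
}

/* RENAME: move if the source exists; a missing source is a no-op. */
static bool
toy_redo_rename(const char *src, const char *dst)
{
    return rename(src, dst) == 0 || errno == ENOENT;
}
```

Calling any of these twice with the same arguments leaves the
filesystem in the same state as calling it once, which is the
property replay relies on.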

Extended attributes are handled through a thin portability shim,
src/port/pg_xattr.c, that abstracts Linux setxattr/getxattr,
FreeBSD extattr_*, and macOS getxattr/setxattr.  On platforms
without xattr support the SETXATTR/REMOVEXATTR redo handlers compile
to no-ops, and a transaction that attempts an xattr operation there
receives a clear error rather than silently succeeding.

The UNDO integration lives in fileops_undo.c: each FILEOPS WAL
record carries enough information to also serve as an UNDO record
for the FILEOPS RM, so a transaction that creates a file inside a
larger UNDO-tracked operation rolls the file create back through
the same chain that rolls back the data changes.

Relationship to xact_redo and pendingDeletes
--------------------------------------------

The existing pendingDeletes path in catalog/storage.c covers
relation-level file removal for the smgr layer: it queues unlinks
during CREATE/DROP and processes them at xact end.  FILEOPS
generalizes that pattern to arbitrary filesystem primitives and
replaces ad-hoc unlinking by extension code with a single
WAL-logged path.  smgr operations on relation files are unchanged;
they continue to flow through smgrcreate/smgrdounlink and the
existing XLOG_SMGR_* records.  FILEOPS sits alongside, not on top
of, smgr.  Code that today open-codes a pendingDeletes-style list
(some contrib modules and the prior iteration of UNDO segment
management) can switch to FILEOPS without rewriting the recovery
hooks.

WAL and recovery
----------------

RM_FILEOPS_ID is assigned in rmgrlist.h.  Replay is deterministic
because each primitive is idempotent on the recorded inputs; the
redo path treats "operation already applied" as success.  Hot
standbys replay FILEOPS records identically to the primary, so a
DDL that creates a file on the primary creates it on the standby.
Archive impact is proportional to DDL volume; data-path
operations are not affected.

Performance
-----------

FILEOPS operations are bounded by syscall latency; in normal OLTP
workloads they are not on the hot path.  TPROC-C numbers are TBD;
see cover letter section 5.

Coverage of pre-existing crash-asymmetric paths
-----------------------------------------------

Beyond exposing a primitive for new code, this commit closes
several long-standing crash-asymmetric paths in core that
performed filesystem mutations outside any redo or pendingDeletes
coverage:

- copydir.c gains a register_for_abort_cleanup parameter.  When
  set, the destination tree is registered with FILEOPS for
  recursive removal on abort.  CREATE DATABASE STRATEGY=FILE_COPY
  (and the clone path) and ALTER DATABASE SET TABLESPACE both pass
  true; the WAL replay caller passes false because the redo
  handler is itself the recovery action.  This eliminates the
  long-standing window where a crash mid-copy left orphan files
  that had to be cleaned up by hand.
- DROP TABLESPACE in commands/tablespace.c now defers per-database
  subdirectory removals, the version-directory removal, and the
  symlink unlink to commit time via FILEOPS.  Previously these
  were issued synchronously inside destroy_tablespace_directories,
  which made DROP TABLESPACE asymmetric with CREATE TABLESPACE
  (which already used FileOpsMkdir/FileOpsSymlink).  A transaction
  abort now leaves the tablespace's filesystem state intact even
  after the catalog row has been processed.
- CREATE TABLESPACE chmod on the user-supplied location is routed
  through FileOpsChmod outside recovery, so an abort restores the
  prior mode via the FILEOPS UNDO record.  In recovery the chmod
  is issued directly because the redo handler has no surrounding
  transaction.

Relationship to pendingDeletes (narrow integration)
---------------------------------------------------

smgrDoPendingDeletes(isCommit=true) now consults
FileOpsRemoveXattrsForRelation(rlocator) to clean up FILEOPS-tracked
extended attributes attached to a relation's files.  Today this is
a no-op because no caller stores xattrs on relation files; the
hook is wired in now so future commits introducing relation-keyed
xattrs (compression markers, encryption key references) do not
need to touch catalog/storage.c again.  This is the *narrow*
coupling between FILEOPS and pendingDeletes; FILEOPS does NOT
replace pendingDeletes outright in this release, see Future Work
below.

Future Work
-----------

FILEOPS could eventually subsume parts of the catalog/storage.c
pendingDeletes machinery, providing a single uniform commit/abort
hook for filesystem mutations.  This is deferred for the
following reasons:

  1. xl_xact_commit carries XACT_XINFO_HAS_RELFILELOCATORS plus an
     embedded RelFileLocator array as the per-commit unlink list.
     Replacing that with separate XLOG_FILEOPS_DELETE records would
     change the wire format consumed by logical decoding,
     pg_basebackup, pg_receivewal, and downstream replication
     tools that parse commit records.  Such a change requires a
     deprecation cycle and pg_upgrade coordination and is out of
     scope for the initial submission.
  2. smgrunlink is segment-aware: a single relation drop iterates
     forks (main, fsm, vm, init) and segment files (1 GB
     segments) and coordinates with the buffer manager via
     DropRelationsAllBuffers.  FILEOPS is path-string oriented and
     would either need to reimplement segment walking and
     buffer-pool teardown, or call back into smgr -- at which
     point the abstraction is no clearer than the current
     pendingDeletes list.
  3. The wal_level=minimal optimization in smgrDoPendingSyncs
     cross-references pendingDeletes to skip syncing relations
     about to be dropped.  Replacing pendingDeletes requires
     either preserving that cross-reference or accepting the cost
     of redundant fsyncs in the minimal case.

A future patch can introduce FileOpsRelationDrop(rlocator) as an
opt-in alternative to RelationDropStorage that emits
XLOG_FILEOPS_DELETE records per fork and segment, with a
corresponding xl_xact_commit format flag indicating "this xact
has FILEOPS delete records, do not consult the embedded
relfilelocator array."  See doc/src/sgml/fileops.sgml for the
full discussion.

Caveats and operational hazards
-------------------------------

- xattr portability: the user-visible behavior of SETXATTR and
  REMOVEXATTR depends on the underlying filesystem.  tmpfs on
  some Linux distributions does not support user xattrs by
  default; ZFS and APFS handle xattrs through different
  interfaces.  Tests in src/test/recovery/t/054 and 063 skip
  xattr cases when the kernel reports ENOTSUP.
- A crash between the WAL flush and the syscall is the standard
  WAL-then-do ordering; recovery re-executes the syscall.  A
  crash between the syscall and the next WAL flush is also safe
  because the operation is recorded and idempotent on replay.
- DELETE deferral means files belonging to a dropped relation
  remain on disk until commit; long-running transactions that
  drop large objects will hold disk space until they end.
- WAL volume scales with DDL frequency.  Workloads with high DDL
  churn (test harnesses, multi-tenant provisioning) should
  expect a measurable increase in WAL produced per operation.
- The pg_xattr.c shim is best-effort across platforms; review
  src/test/recovery/t/063_fileops_undo.pl for the matrix that is
  actually exercised in CI.

Reviewed-by: TBD
Discussion: https://postgr.es/m/TBD
Motivation
----------

The heap AM allocates a fresh tuple version on every UPDATE and
relies on VACUUM to reclaim the old version after no snapshot can
see it.  For workloads dominated by narrow updates to hot rows --
counters, queue heads, materialized aggregates, time-series tail
writes -- this produces churn that the planner, the buffer manager,
and the autovacuum machinery all pay for: index bloat, HOT chain
walks, write amplification on full_page_writes, and a steady
background of VACUUM I/O that competes with the foreground.  The
operational pattern is well known and the standard advice (lower
fillfactor, more autovacuum workers, smaller scale factors) buys
headroom at the cost of space.

RECNO is a table access method that updates tuples in place and
uses UNDO for rollback and (when fully wired) for reader
visibility.  An UPDATE writes the new value over the old one in
the same slot; the prior version goes to the secondary log (sLog)
so concurrent readers under an older snapshot can still see it.
Aborted UPDATEs are reversed by physical UNDO application: the
slot is restored from the before-image with no dead tuple left
behind.  No HOT chains, no index bloat from version proliferation,
no VACUUM debt from rolled-back work.

Design summary
--------------

RECNO registers RM_RECNO_ID and a tableam handler.  The on-disk
page layout retains the standard PageHeader and ItemId array so
existing buffer-manager, FSM, and visibility-map code paths apply
unchanged.  Tuples are stored as RecnoTuple, a header laid out for
in-place rewrite: t_xmin, t_xmax, t_cid, t_infomask, t_infomask2,
and an attribute payload sized to the original allocation.  An
update that fits in the existing slot is performed in place under
the buffer's exclusive content lock; an update that does not fit
spills the new tuple to an overflow segment and rewrites the slot
with a forwarding pointer.
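
The fit-or-spill decision can be sketched as below.  ToySlot and
toy_update are hypothetical and far simpler than RecnoTuple: the
slot is a fixed buffer with a recorded allocation size, and the
overflow path only marks the slot as forwarded instead of writing
an overflow segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    TOY_UPDATE_IN_PLACE,
    TOY_UPDATE_OVERFLOW
} ToyUpdatePath;

typedef struct ToySlot
{
    size_t alloc_len;   /* bytes reserved at the original allocation */
    size_t used_len;    /* bytes currently in use */
    bool   forwarded;   /* true: slot holds a forwarding pointer */
    char   data[32];
} ToySlot;

/* Caller is assumed to hold the buffer's exclusive content lock. */
static ToyUpdatePath
toy_update(ToySlot *slot, const char *newval, size_t len)
{
    if (len <= slot->alloc_len && len <= sizeof(slot->data))
    {
        memcpy(slot->data, newval, len);    /* overwrite in place */
        slot->used_len = len;
        slot->forwarded = false;
        return TOY_UPDATE_IN_PLACE;
    }
    slot->forwarded = true;     /* new tuple would go to overflow */
    return TOY_UPDATE_OVERFLOW;
}
```

A 5-byte value fits an 8-byte slot and is written in place; a
13-byte value takes the overflow path.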

Every DML emits an UNDO record before mutating the page (INSERT
records have no payload; DELETE/UPDATE/INPLACE carry the full
before-image).  Rollback walks the chain through the standard
UNDO machinery (separate commit) and memcpy's the prior bytes
back.  Because rollback is physical and idempotent via CLRs, the
RECNO recovery story is the heap recovery story plus CLR replay.

Visibility uses the same snapshot model as heap.  Readers
reconstruct the version they need by consulting the sLog when the
on-page tuple's xmin is newer than the snapshot.  The sLog is
partitioned (slog_num_partitions GUC) so writer backends do not
contend on a single hash; readers walk a partition's chain of
before-images keyed by (relfilenode, blkno, offset).  A
clock-sweep style maintenance pass (recno_clock.c) ages out
before-images that no live snapshot can need.
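
Partition selection by key can be sketched as a hash modulo the
partition count.  The ToySLogKey layout, the FNV-1a-style mixing,
and the toy_* names are assumptions for illustration; the sLog's
actual hash function is not described in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical key: (relfilenode, blkno, offset) as described above. */
typedef struct ToySLogKey
{
    uint32_t relfilenode;
    uint32_t blkno;
    uint16_t offset;
} ToySLogKey;

/* FNV-1a-style mix over the three key fields (illustrative only). */
static uint32_t
toy_slog_hash(ToySLogKey k)
{
    uint32_t parts[3] = { k.relfilenode, k.blkno, k.offset };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 3; i++)
    {
        h ^= parts[i];
        h *= 16777619u;
    }
    return h;
}

/* Map a key to one of nparts partitions; nparts must be nonzero. */
static uint32_t
toy_slog_partition(ToySLogKey k, uint32_t nparts)
{
    return toy_slog_hash(k) % nparts;
}
```

All before-images for one slot land in the same partition, so a
reader walks a single chain while writers to other slots can
proceed on other partitions.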

Overflow segment files are managed through FILEOPS so that
allocating an overflow segment, extending it, and reclaiming it
all participate in WAL and crash recovery.  Compression
(recno_compress.c) is per-page, applied at write-out time, with
the ItemId array kept uncompressed so random reads stay O(1).

The visibility-map analogue (recno_vm.c) tracks all-visible and
all-frozen pages in the same shape as the heap VM, so index-only
scans plug in unchanged.  The dirty map (recno_dirtymap.c) is a
shared bitmap that the revert worker uses to find pages with
pending UNDO work without scanning the entire relation.

Relationship to heap; why a separate AM
---------------------------------------

The obvious question is why this is a new AM rather than a set of
flags on the existing heap AM.  The honest answer: every change
RECNO makes is invasive.  In-place updates change the locking
discipline (writers take buffer exclusive earlier and hold it
longer), the visibility rules (a tuple's xmin alone does not tell
you what the reader should see; you have to consult the sLog),
the rollback path (physical restore from UNDO instead of dead
tuple plus CLOG), and the VACUUM contract (there are no dead
tuples to reclaim, but there is sLog and overflow space to
maintain).  Bolting these onto heapam.c would either introduce
runtime branches on every hot path or fork the file in practice
while pretending not to.

Anticipating the Geoghegan critique: a new AM means duplicated
machinery (a parallel VM, a parallel pruning story, a parallel
bulk-insert path) and a parallel set of bugs to find and fix.
That cost is real.  The judgment here is that the alternative --
making heapam.c conditionally do everything heap does today and
also do everything RECNO needs -- is worse for long-term
maintainability and forecloses experimentation that an isolated
AM allows.  RECNO is intentionally additive: heap is unchanged,
heap regression tests pass unchanged, and operators who do not
opt in see no behavior change.

Where RECNO and heap can share, they do.  The buffer manager,
WAL infrastructure, FSM, VM shape, btree integration, executor
slot machinery, and statistics framework are all reused.  The
tableam handler exposes RECNO through the standard interface.

WAL and recovery
----------------

RM_RECNO_ID is assigned in rmgrlist.h.  Record types cover
INSERT, DELETE, UPDATE in-place, UPDATE with overflow,
PRUNE-equivalent (sLog trim), VM set/clear, and overflow segment
extend/truncate.  Replay is deterministic: each record carries
either a full page image or enough state to reconstruct the page
from the prior LSN.  Hot standbys replay RECNO WAL identically;
the sLog is rebuilt from XLOG_UNDO_BATCH replay so reader
visibility on the standby is consistent with the primary.
Archive impact tracks UPDATE-heavy workloads: an in-place update
of an existing slot produces less WAL than the heap equivalent
because there is no new tuple version, but each update also
emits the before-image to UNDO, so total WAL is workload
dependent.

Performance
-----------

Update-heavy workloads see lower index bloat and reduced VACUUM
overhead at the cost of in-place update WAL plus UNDO before-image
WAL.  TPROC-C numbers are TBD; see cover letter section 5.

Caveats and operational hazards
-------------------------------

- In-place MVCC reader walk: when a snapshot is older than the
  on-page xmin the reader must walk the sLog chain for that
  slot.  Long-running snapshots with concurrent UPDATE-heavy
  writers turn this into a measurable per-row cost; monitor
  pg_stat_get_undo_logs() and snapshot age.
- sLog before-image pressure: the sLog is sized through
  shared_buffers and DSA; sustained UPDATE bursts can push
  before-image residency out of memory and onto disk.  Tune
  slog_num_partitions for write concurrency and watch DSA usage.
- Overflow churn: a workload that repeatedly grows a tuple past
  its slot allocation and back again will allocate, free, and
  reallocate overflow space.  The compressor mitigates this for
  compressible payloads but not for genuinely growing rows.
- Rollback bandwidth: an aborted bulk UPDATE generates a CLR per
  applied record (~8 KB FPI each).  Aborts that exceed
  undo_instant_abort_threshold are deferred to the revert
  worker, but the worker still produces that WAL volume.
- Autovacuum interaction: rolled-back rows leave no dead tuples,
  so n_dead_tup stays near zero on busy RECNO tables.  The
  default autovacuum trigger will not fire; set
  autovacuum_vacuum_threshold = 0 at the table level to ensure
  hint-bit maintenance and index bloat cleanup still run.
- This AM is new code.  contrib/pageinspect grows recnofuncs.c
  for inspection but tooling parity with heap (pg_stat_user_*
  granularity, repack-style utilities, third-party monitors)
  will lag heap until the AM has been in tree for a release.

Reviewed-by: TBD
Discussion: https://postgr.es/m/TBD