Skip to content

Idle auto-cancel: cancel Pro subs after 2 idle periods, with reversible keep flow#178

Open
btc wants to merge 37 commits into
mainfrom
feat/idle-auto-cancel
Open

Idle auto-cancel: cancel Pro subs after 2 idle periods, with reversible keep flow#178
btc wants to merge 37 commits into
mainfrom
feat/idle-auto-cancel

Conversation

@btc
Copy link
Copy Markdown
Owner

@btc btc commented May 9, 2026

Summary

Auto-cancel a Pro subscription via Stripe cancel_at_period_end=true when the user has been idle for two consecutive billing periods. No settings toggle — this is the default Sabermatic behavior. Reversible via email-link click or any authenticated activity during the cancel window.

Architecture: Event-driven via Stripe invoice.upcoming webhook (~T-7 days before each renewal). Cancel decision happens inside a per-user-locked DB transaction (SELECT ... FOR UPDATE) with a generic stripe_webhook_dedup table for retry-storm protection and period-keyed user_events rows for multi-firing dedup. Reversal paths verify Stripe state before acting (handles partial-failure cache drift). Activity signal = MAX(last_active) from soft-deleted auth_sessions.

Spec: docs/superpowers/specs/2026-05-07-idle-auto-cancel-design.md
Plan: docs/superpowers/plans/2026-05-07-idle-auto-cancel.md

What's in the box

  • 4 migrations (015–018): auth_sessions soft-delete, users sub-state cache columns, stripe_webhook_dedup table, keep_link_token_uses table.
  • internal/feat/idleunsub/ package: HMAC-SHA256 keep-token sign/verify, Service.HandleInvoiceUpcoming (cancel decision in a per-user-locked tx), Service.KeepSubscription (link-click reversal), Service.AutoReverse (auth-middleware reversal with cache-drift correction), email composition + 4 templates.
  • Stripe webhook integration: routes invoice.upcoming and customer.subscription.created through the new service; customer.subscription.updated/deleted sync the cache.
  • /sub/keep public endpoint: token verify → single-use claim → reversal. Token IS the auth.
  • AutoReverse hook in Backend.AuthenticateSession: fire-and-forget goroutine on authed requests with gates set.
  • RPC + frontend: pending_kept_banner field on User proto, AckKeptBanner RPC, <KeptBanner> React component with optimistic dismiss + getMe cache invalidation.
  • Email via River: cancel + kept emails enqueued through SendEmailJob (not synchronous Mailgun on the webhook hot path).
  • Periodic banner cleanup: 24h River job on QueueMaintenance clears banners older than 14 days (gated by MAX(created_at) per user).
  • 7 OTEL counters wired per spec §9.

Test coverage

  • 30+ unit/integration tests for idleunsub (handler, service, token, email, dedup, race).
  • 5 keep-link handler tests (happy, tampered, expired, replay, refuses-manual-cancel, no-signer-degraded-mode).
  • 5 webhook integration tests (routing, sync invariants, deletion clearing).
  • 3 AuthenticateSession tests (projection, fires-when-gates-set, refuses-manual-cancel).
  • E2E integration tests:
    • TestE2E_CancelAndKeepViaLink: webhook → cancel → email → click → reverse → replay-idempotent.
    • TestE2E_AutoReverseOnLogin: signup → seed → login → auth → goroutine completes.
    • TestE2E_ConcurrentInvoiceUpcomingSerializes: two concurrent webhook deliveries, asserts SELECT FOR UPDATE serializes them.
  • Frontend: KeptBanner render test (4 cases) added; vitest 65/65 pass.
  • Three-round Opus + Sonnet branch review ran in parallel; converged clean by round 3.

Test plan

  • make test green locally (last run: backend -race -count=1 all 39 packages OK; frontend tsc/lint/vitest clean; buf lint clean).
  • Manual browser smoke test of <KeptBanner> per CLAUDE.md UI rule:
    1. Sign in as a test user; set users.pending_kept_banner = TRUE via psql.
    2. Reload — banner appears.
    3. Click ×; banner disappears immediately (optimistic).
    4. Reload — banner stays gone.
  • Set KEEP_TOKEN_HMAC_KEY (32-byte secret) in production env before deploy.

Deployment notes

  • New env var: KEEP_TOKEN_HMAC_KEY. Empty key = degraded mode (cancel decisions still fire; keep-link emails skipped; /sub/keep returns invalid for any token — fail closed).
  • Migrations 015–018 are additive; safe on a populated users table (DEFAULT NOW() on idle_eligible_after triggers a one-time table rewrite — fine at current scale).
  • Existing users start immediately eligible (idle_eligible_after = NOW() at migration time); the trigger fires on their next invoice.upcoming if idle.

Out of scope (per spec §10)

  • Multiple subscriptions per user
  • Annual subscriptions (rule generalizes; no special handling shipped)
  • Settings toggle to opt out
  • Cancel-window UI surface beyond the banner
  • Auto-prune of auth_sessions and keep_link_token_uses (boy-scout cleanup, not blocking)
  • Generic webhook dedup adoption by other handlers (invoice.paid, etc.)

btc added 30 commits May 7, 2026 21:52
A small Go library plus a sibling code generator that turn AIP-134 PATCH
RPCs into safe, dynamic Postgres UPDATE statements. The proto message and
FieldMask carry intent on the wire; an aippatch.yaml plus codegen produce
a typed Mapping[T] per resource; a runtime Apply[T] call validates the
mask, builds the SQL, executes it, and returns the post-update proto.

Built first at drill/thirdparty/aippatch/ with UpdateProfile as the v0
target; designed for clean extraction into a standalone module reusable
across Spanda LLC products. Codegen consumes proto FileDescriptorSet plus
parsed sql/migrations via pg_query_go plus aippatch.yaml policy; output
is committed *.gen.go files reviewable in PRs and guarded by a --check
mode in CI.

v0 codec set covers scalars, timestamps, and enums — sufficient to retire
drill's hand-rolled UpdateProfile handler. JSONB, declarative validators,
declarative authz, AIP-193 error mapping, and AIP-154 ETag are roadmap
items; each is backwards-compatible with v0 call sites.
Auto-cancel a Pro subscription via cancel_at_period_end=true when the
user has been idle for two consecutive billing periods. No settings
toggle — this is the default Sabermatic behavior. Establishes a Spanda
tenet: earn money for ongoing value delivered.

Trigger fires at Stripe invoice.upcoming (~T-7) when MAX(last_active)
across auth_sessions falls before the previous period's start, with
launch-day grandfathering via users.idle_eligible_after. Reversal via
email-link click (signed token, single-use) or auto-reverse on any
authenticated activity during the cancel window. sub_cancel_is_auto
distinguishes our auto-cancels from user-initiated portal cancels.

Migrations 015–018: auth_sessions soft-delete, users sub-state cache,
stripe_webhook_dedup, keep_link_token_uses. Feature package home at
internal/feat/idleunsub/.
Both opus and sonnet review agents returned ~30 findings. Major changes:

HIGH
- Acknowledge AIP-134 empty-mask deviation explicitly (Wire conformance
  note); ErrorOnEmpty stays as v0 default.
- Fix s.b.Pool() — it is a method, not a field.
- Drop "matches sqlc's DBTX" claim; document it is a strict subset
  satisfied by *pgxpool.Pool, *pgx.Conn, and pgx.Tx.
- Remove mustValidate panic helper from generated file; show clean
  Mapping literal plus generated InitPatches() error per drill's
  no-panic-at-init rule.
- Replace ub.SQL("RETURNING *") with ub.Returning(boundCols...) —
  first-class API, sends only mapped columns, excludes unmapped sensitive
  columns from the wire.
- Add AutoSet mechanism (yaml auto_set: { updated_at: NOW() }) so PATCH
  bumps updated_at like the existing sqlc UPDATEs do; without this every
  PATCH would leave updated_at stale.
- Use proto.CloneOf instead of proto.Clone(...).(T); add nil op.Message
  guard.
- Add buf build -o buf.binpb to make codegen — current buf generate does
  not emit a descriptor set.

MEDIUM
- Add Transactions subsection: Apply takes any DBTX including pgx.Tx.
- Add audit-logging hooks (returned diff) to v1 roadmap.
- Document CGO requirement for aippatchgen explicitly.
- Add NOT NULL extraction to migration replay; nullable columns deferred
  to v1.
- Define module-extraction trigger as "second project consumes aippatch
  in production" (v0.5).
- Audit existing test wording in rollout plan.
- Split smallint into separate type-table row with int16-to-int32
  widening on decode.
- Add validated bool to Mapping; Apply rejects unvalidated mappings with
  Internal.

LOW
- Document v0 nested-mask-path constraint and emit codegen diagnostic.
- List github.com/google/uuid in runtime imports.
- Add bytes/float/double to deferred-types table.
- State codecs are global and reusable across resources.
- Make algorithm step explicit that unmatched proto fields without
  skip:true are diagnostic.
- Add go get step to rollout plan.
- Note RETURNING uses bound-columns list (privacy benefit).
- Correct codec key description (yaml name, not protoreflect name).
- Add grep-for-callers step before deleting UpdateDisplayName.

NIT
- Use op.Mask.GetPaths() throughout.
- Clarify two-phase ordering (field-number processing, alphabetical emit).
- Spell out "foreign-key constraints".
- Document yaml string-to-Go-enum mapping for empty_mask.
- Diagram label uses thirdparty/aippatch/cmd/aippatchgen.
- Spanda replication wording: runtime/codegen are drill-free, not the yaml.

Spec grew from 690 to ~770 lines.
Round 2 surfaced one structural bug (m.codecs referenced but missing from
Mapping struct), pg_query_go v5/v6 drift, several real correctness issues,
and several smaller items. Fixes:

HIGH
- Add Codecs map[string]EnumCodec plumbing: shared Codecs registry in
  generated init.gen.go, passed into Mapping.Validate(codecs) which copies
  the relevant subset into unexported m.codecs. Apply now references
  m.codecs that actually exists.
- Add EnumCodec type to public API (ProtoEnum, ToText, FromText).
- Scrub pg_query_go to v6 throughout (latest published; v6.2.2).
- Tone down Returning(boundCols) "privacy" claim — bound non-writable
  columns ARE transmitted (e.g. email); only unmapped columns are
  excluded.

MEDIUM
- Fix typed-nil trap: any(op.Message) == nil is false for a nil *T;
  switch to op.Message.ProtoReflect().IsValid().
- Sort op.Where keys before iterating to keep SQL deterministic
  (otherwise pgx prepared-statement cache misses on every call).
- Validate op.Where keys exist in m.bindingsByColumn — prevents
  attacker-controlled identifier injection via the map-key surface.
- Tighten AutoSet conflict check: reject if column is *any* binding
  (writable or not), not just writable.
- Add pg_query_go-based validation for AutoSet SQL literal: parse and
  reject multi-statement / non-expression input.
- Remove faulty test-wording audit step — verified existing tests assert
  via connect.CodeOf(err), not on message text.
- Document soft-deleted user wire change: today's UpdateProfile returns
  Internal; aippatch returns NotFound. Add to Wire conformance note,
  rollout plan, and Risks.
- Fix cmd/server/main.go → cmd/drill/main.go (drill's actual binary
  path).

LOW
- UpdateAllWritable empty-mask path: explicit Internal in v0 (was
  silently emitting AutoSet-only UPDATE).
- Fix uuid.UUID(v).String() → uuid.UUID(v.([16]byte)).String() in type
  table.
- Switch validated bool to atomic.Bool for race safety under -race.
- Replace fictional maskHasPath with slices.Contains in handler example.
- Add field-number-vs-alphabetical ordering explanation to algorithm.
- Add GetPaths nil-safety note.
- Note PK appears in Bindings (no dedup with PK column).
- AutoSet test pattern: SELECT before / Apply / SELECT after — within a
  tx, NOW() is constant, so the framework's separate query advances it.
- Remove dead yaml override comments (free_full_educators_used skip
  entry was never visited).

NIT
- AIP-203 → AIP-134 §Update_Mask citation.
- "stale" → "superseded" sqlc deletion wording.
- Architecture diagram pg_query_go shows version.

Spec grew from ~770 to ~830 lines.
Both opus and sonnet round-3 reviews returned "Approve with minor revisions"
— no HIGH severity findings. Sonnet's "H1" was a redundant nil check that
opus correctly classified NIT (the IsValid check actually works); fixed as
cosmetic.

MEDIUM (both reviewers)
- Validate(codecs) now explicitly fails when a Binding references an
  enum codec name not in the supplied codecs map. Closes the silent
  failure mode where dangling refs would only surface at first Apply.
- Validate doc spells out FromText construction, dangling-codec check,
  and validated-stays-false-on-error.
- Pin huandu/go-sqlbuilder to @v1.36.0 in `go get` step (UpdateBuilder.
  Returning was added in that release).

LOW
- UpdateAllWritable empty-mask path now returns CodeUnimplemented (was
  Internal); AIP-aligned for "feature not implemented".
- SET-clause iteration now walks Mapping.Bindings order (alphabetical)
  filtered by mask membership, instead of client-supplied path order.
  Makes SQL deterministic regardless of how clients order the mask
  paths — pgx prepared-statement cache hits, and golden tests are
  stable.
- Add codegen step rejecting EnumCodec maps with duplicate ToText
  values (would produce a lossy FromText).
- Note `sqlbuilder.PostgreSQL.NewUpdateBuilder()` is required (default
  flavor would emit MySQL-style `?` placeholders).
- Fix AutoSet test pattern — NOW() is constant within a transaction so
  SELECT-before/Apply/SELECT-after inside BeginFunc would see equal
  values. Use clock_timestamp() OR top-level statements OR capture
  pre-Apply NOW() and assert post >= captured.

NIT
- Drop dead `op.Message.ProtoReflect() == nil` clause; ProtoReflect
  never returns nil for protoc-gen-go types.
- Hoist ProtoReflect() to a `src` local instead of calling repeatedly.
- Use `ub.IsNull(m.SoftDelete)` instead of string concatenation.
- Add bullet for ALTER TABLE ADD/DROP CONSTRAINT and other unlisted
  AlterTableCmd kinds being ignored in v0.
- Sort AutoSet alphabetically by Column for stable diff (in addition
  to Bindings).
- Move nested-mask-path rejection out of proto-field-iteration loop
  (proto field names never have dots; the check belongs at yaml
  validation time on writable/overrides keys).
- Document CGO first-build cost (~3 min) and CI cache requirement.
- Align spec on Makefile's existing `generate` target (was claiming a
  nonexistent `codegen` target).

Final spec is implementation-ready. Both reviewers gave "Approve with
minor revisions"; ending the review loop after round 3 per CLAUDE.md
("up to 3 rounds, end early if clean").
15 tasks across 6 phases covering migrations 015–018, the
internal/feat/idleunsub/ feature package (token sign/verify, trigger
evaluation, KeepSubscription, AutoReverse, email composition), Stripe
webhook integration, the public /sub/keep endpoint, the auth-middleware
AutoReverse hook, the user-info RPC + AckKeptBanner, and the React
KeptBanner component. Acceptance criteria + manual smoke test + spec
out-of-scope list at the end.

Spec: docs/superpowers/specs/2026-05-07-idle-auto-cancel-design.md
Round 4: opus + sonnet both rated "Approve with minor revisions". Both
flagged the same M1 (UpdateAllWritable error-code inconsistency across
four sections). Opus added M2 (codegen column-uniqueness) and several
LOWs/NITs; sonnet added decode fd-nil-guard.

MEDIUM
- Reconcile UpdateAllWritable across all four sections that mentioned it:
  Errors section now lists CodeUnimplemented; Roadmap aligns with Apply
  walkthrough's defense-in-depth check; yaml-mapping prose says codegen
  is the primary rejection point.
- Codegen step 10 added: verify column uniqueness across emitted
  bindings to catch overlapping override.column entries that would
  produce duplicate RETURNING and SET clauses Postgres rejects.
- Codegen step 9 added explicitly: yaml empty_mask: update_writable is
  rejected at codegen, not just at runtime.

LOW
- decode() in Apply step 7 now guards fd == nil before passing to decode,
  symmetric with encode() in step 2.
- AutoSet NOW() gotcha explained inline in the canonical yaml example
  (before, only in Testing strategy — readers who scrolled past would
  write a flaky test).
- Op.Where doc strings now state values must be scalar / pgx-bindable;
  list comparators (IN, !=, <) deferred to v1+.
- Generalize "call patches.InitPatches()" error message to "your
  generated InitPatches()" — the runtime is consumer-agnostic.
- PK-without-proto-field case explained: column simply doesn't appear
  in Bindings; WHERE uses m.PK for identity regardless.

NIT
- Decisions row 8 now matches the wire-conformance "permanent per
  resource" claim (was contradictorily saying "relax later").
- "drop" in algorithm step 3.iii reworded to "emit no binding; field
  is registered as intentionally skipped" so the relationship to step 4
  is unambiguous.
- Risks #1 first-build cost wording softened from "~3 minutes" to
  "several minutes (varies by runner)".
- Validate concurrency note added: callers serialize startup; sync.Once
  is left to the caller if needed.
Add revoked_at column to auth_sessions and index on (user_id, last_active DESC).
Replace hard-delete (DeleteAuthSession, DeleteUserAuthSessions) with soft-revoke
(RevokeAuthSession, RevokeUserAuthSessions) so session rows are preserved for the
new GetUserLastActive activity query. Account deletion (DeleteAccount) retains
hard-delete for GDPR data minimization.
Service skeleton plus the cancel-decision flow per spec §4.1/4.2: fetch
sub from Stripe, dedup by webhook event_id, period-key dedup against
prior subscription_kept/auto_canceled rows, evaluate the idle predicate
(MAX(last_active) < periodStart - one interval) under a per-user row
lock, write the audit row + cache flags inside a tx, then call
UpdateSubscriptionCancel(idempotency_key=event.ID) outside the tx.

Tests cover the happy path plus eight early-return shapes (trialing,
already-canceled, grandfathered, retry-storm dedup, post-keep dedup,
no-activity-history, still-active, first-period grace). They use a
fake StripeClient and a per-test fresh Postgres via the package's
testutil.SharedPostgres TestMain.

Adds a GetUserByStripeCustomer query to map Stripe customers to users
on the webhook path (nullable text param matches the column).
- Add per-symbol nolint:unused directives (with Task 8 rationale) for 7
  symbols in metrics.go and metadata.go that are not yet wired up
- Honor Price.Recurring.IntervalCount in subtractInterval (was always
  subtracting exactly 1 interval regardless of billing cycle multiplier)
- Add TODO comment on enqueueCancelEmail stub documenting Task 9's
  required test injection
- Strengthen DedupesRetryStorm test: three-step scenario isolates the
  webhook-dedup gate from the period-dedup gate; documents that evt_2's
  TryClaimWebhookEvent row rolls back with the TX on period-dedup hit
- Happy-path test asserts stripe_subscription_id remains NULL (documents
  spec §4.1 single-population-path invariant)
- Hoist metadata.go's free-floating doc comment to attach to cancelMetadata
- Document empty idempotency key intent on both UpdateSubscriptionCancel
  reversal call sites (KeepSubscription and AutoReverse)
- Wrap KeepSubscription and AutoReverse real-reversal DB writes in a
  single transaction so ClearUserAutoCancelState + InsertSubscriptionKeptEvent
  commit-or-fail atomically
- Add stuck-state comment on StripeSubscriptionID.Valid early-return in AutoReverse
- Add Task 9 email TODO comment in TestKeepSubscription_Idempotent
- Add TestAutoReverse_StripeUpdateFails: Stripe error leaves gates ON for self-heal
- Fix unlambda in NewTokenSigner (gocritic: replace lambda with time.Now directly)
- Wire real TokenSigner (32-byte random key) in setupHappyPath; signer was
  nil, causing enqueueCancelEmail to silently exercise the error branch
- Add recordingMailer to capture sent messages; assert cancel/kept email
  subject, To, and body content in happy-path tests
- Assert require.Empty(fx.Mailer.msgs) in skip/no-op tests
- Rename enqueueKeptEmail's unused subID param to _ (removes _ = subID smell)
- Add sub_id to kept-email error log lines in KeepSubscription and AutoReverse
- Add TestComposeCancelEmail_HTMLEscapes: verifies DisplayName is HTML-escaped
- Add <a href=...> assertion to TestComposeCancelEmail_Renders
- Use .UTC() before .Format() in composeCancelEmail and composeKeptEmail
  (defensive: robust to callers that pass non-UTC time)
Extract the duplicated makeIdleunsubSvc/makeIdleunsubOverride helpers from
auth_test.go and billing_test.go into a single exported NewServiceWithFake
constructor in idleunsubtest. Add concurrency safety docs to FakeStripe.
btc added 7 commits May 9, 2026 17:25
Replaces the synchronous mailer.Send call in HandleInvoiceUpcoming,
KeepSubscription, and AutoReverse with a River-backed enqueue. Mailgun
delivery latency must not block the Stripe webhook (30s timeout) or the
auth-middleware AutoReverse path (5s context). Spec §4.1 step 6, §4.3
step 6, and §9 metric idleunsub.email.enqueue all describe this as an
async enqueue, not a synchronous send.

- Add idleunsub.EmailEnqueuer interface; replace the mailer field on
  Service with an enqueuer. composeCancelEmail/composeKeptEmail still
  produce email.Message, then idleunsub copies fields into a
  jobs.SendEmailArgs and calls EnqueueSendEmail.
- internal/backend/idleunsub_enqueuer.go: production adapter that wraps
  the Backend's River client (via the Jobs interface) and forwards to
  river.Insert with the standard SendEmailInsertOpts.
- backend.go: defer idleunsub.NewService construction until after
  riverClient exists so the enqueuer can be wired with a live client.
- idleunsubtest: add RecordingEnqueuer (mutex-protected, captures every
  jobs.SendEmailArgs). NewServiceWithFake now wires this enqueuer instead
  of NullMailer so the helper continues to compile.
- All idleunsub unit + E2E tests updated to assert on
  enqueuer.Snapshot() rather than mailer.Msgs.
- Drop unused subID arg from enqueueKeptEmail.
Round-1 multi-reviewer fixes (Opus + Sonnet) on the idle-auto-cancel
branch. Eight changes; none alter the externally observable contract.

- ClearStaleKeptBanners worker + daily periodic job. Spec §4.5 mandates
  14-day banner auto-clear; the SQL existed (sql/queries/users.sql:138)
  but had no caller. Wired as a 24h River periodic job alongside the
  existing maintenance jobs in backend.New.
- AutoReverse comment fix: the empty-idempotency-key justification was
  copy-pasted from KeepSubscription and incorrectly referenced
  keep_link_token_uses single-use enforcement. AutoReverse converges via
  the gate re-read at the top of the function instead.
- KeepSubscription: document the deliberate Stripe-before-DB ordering.
  If the post-Stripe DB write fails, AutoReverse on the user's next
  authed request silently corrects via the cache-drift path (spec §4.3)
  — the user keeps their sub; only the banner/email are lost.
- Integration test for spec §8.2 concurrent invoice.upcoming. Two
  goroutines fire HandleStripeWebhook with distinct event IDs; the
  per-user FOR UPDATE lock + period-keyed dedup must serialize them
  to exactly one Stripe Update, one audit row, and one webhook dedup
  row (the loser's row is rolled back inside its TX).
- Boundary unit test: last_active EXACTLY equal to threshold must NOT
  trigger a cancel (spec §8.1 calls out '<' as strict).
- TouchAuthSession error: log the warning instead of dropping. Per
  CLAUDE.md "Error handling over nolint" — failure is non-fatal but
  shouldn't be invisible.
- Cache email templates at package init via template.Must. Embedded
  templates are validated by package tests, so the panic-on-parse
  contract is safe at runtime; per-Send cost drops to two Execute
  calls instead of two ParseFS round-trips.
- subtractInterval default-case slog.Warn so a future Stripe schema
  change surfaces in logs instead of silently falling through to
  monthly arithmetic.
Fix ClearStaleKeptBanners SQL to use MAX(created_at) GROUP BY so a user's
banner is only cleared when their MOST RECENT subscription_kept event is
older than 14 days; the prior subquery matched any stale event, wrongly
clearing a fresh banner when an older event also existed.

Route the ClearStaleKeptBanners periodic job to QueueMaintenance instead of
QueueDefault so it does not compete with user-facing work.

Remove RecordingMailer from idleunsubtest (no callers; production uses
RecordingEnqueuer for email assertions).

Add TestClearStaleKeptBanners_MostRecentEventGates covering the multi-event
edge: stale-only cleared, recent-plus-stale retained, recent-only retained.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant