Skip to content

chore: promote multicloud web frontend to main (1380 commits)#1131

Merged
cristim merged 1380 commits into
mainfrom
feat/multicloud-web-frontend
Jun 9, 2026
Merged

chore: promote multicloud web frontend to main (1380 commits)#1131
cristim merged 1380 commits into
mainfrom
feat/multicloud-web-frontend

Conversation

@cristim

@cristim cristim commented Jun 9, 2026

Copy link
Copy Markdown
Member

Promotes the long-lived integration branch feat/multicloud-web-frontend to main. main is a clean ancestor (0 commits ahead, ~1380 behind), so this is a conflict-free fast-forward preserving full history.

After merge:

  • the 57 open PRs will be retargeted from feat/multicloud-web-frontend to main;
  • docs / CI deploy triggers / contributor guidance update to main as the integration base;
  • feat/multicloud-web-frontend is retired.

Deploy workflows already trigger on main, so deployment continuity is preserved. The migration-failure alarm is gated off by default (#1130) so the deploy does not 403.

Summary by CodeRabbit

Release Notes

  • New Features

    • Multi-cloud deployment workflows for AWS Lambda, AWS Fargate, GCP Cloud Run, and Azure Container Apps
    • Database migration workflow for PostgreSQL schema updates
    • Web dashboard interface (experimental) with browser-based access
    • CSV import/export for recommendations across cloud providers
    • Target coverage sizing mode for recommendations
    • Rollback workflow for quick deployment recovery
  • Enhancements

    • CLI configuration support for Azure and GCP credentials
    • Audit logging for purchase operations and dry-run mode
    • RDS extended support filtering for recommendations
    • Improved CI/CD security scanning and pinned action versions
  • Documentation

    • Contributing guidelines and project architecture documentation
    • Security incident response runbooks
    • Deployment and troubleshooting guides

cristim and others added 30 commits May 22, 2026 12:37
… test does not panic (#663)

When TransitionExecutionStatus returns an error, RecoverStrandedApprovals
calls GetExecutionByID to distinguish a race (the row already left
"approved") from a real store failure. The test
TestManager_RecoverStrandedApprovals_LateCompletionNotClobbered had no
expectation for GetExecutionByID, so testify panicked on the unstubbed
call (mocks_test.go:277).

Add a stub returning Status="completed" so the manager correctly skips
the execution and the assertion (recovered==0) still genuinely proves
that a row which completed between the stale SELECT and the recovery
UPDATE is not clobbered by the sweep.

Closes #657
… rows (#664)

The production queryExecutions scan reads 22 columns; the mock row
definitions in TestPGXMock_GetExecutionByID_Success and
TestPGXMock_GetExecutionByID_WithTimestamps only declared 21, causing
"incorrect argument number 22 for columns 21" at scan time.

Add "approval_token_expires_at" as the 22nd column (after
retry_attempt_n, matching the SELECT and rows.Scan order in
store_postgres.go) and supply sql.NullTime{} as its value in both rows.

Closes #627, Closes #624
…closes #136) (#580)

* feat(settings): per-product Coverage and Enabled controls on SP cards (closes #136)

Add Coverage % and Enabled toggle to each of the four Savings Plans cards
(Compute, EC2 Instance, SageMaker, Database), matching the per-product
settings already available on RI cards.

- SERVICE_FIELDS SP entries extended with coverageId/enabledId
- TRACKED_FIELDS includes the new IDs for dirty-state detection
- loadGlobalSettings populates per-card coverage/enabled from service
  config, falling back to global default_coverage when absent
- saveGlobalSettings reads per-card coverage/enabled from DOM when the
  card exposes the controls; non-SP cards continue to inherit global settings
- index.html: Coverage % number input and Enabled checkbox added to each
  SP card in the settings form
- Tests: DOM fixture updated; 3 new tests cover save-coverage, save-enabled,
  and load-from-service-config paths; HTML structural assertions extended

* fix(settings): apply SP coverage/enabled fallback when service row absent

loadGlobalSettings only populated per-product Savings Plans coverage and
enabled controls for services present in the config response. A SP card
whose service row was missing kept the HTML default (80) instead of
falling back to global.default_coverage, which then persisted an
incorrect value on the next save.

Iterate SERVICE_FIELDS (the source of truth for SP card IDs) and overlay
any matching service-config values, so every SP card gets the global
default when its row is absent. Harden the regression test with a
non-default seeded DOM state and a non-80 global default so the fallback
assignment is actually exercised.

Addresses CodeRabbit review on PR #580.
…story (closes #642, #646) (#650)

* fix(purchases): record partial success and stamp target account on history

Partial-failure runs (some recs commit, others fail) were marked the whole
execution "failed" and skipped the success notification despite real
commitments written to purchase_history with Purchased=true, inviting a
re-approve that double-buys the already-purchased recs (#642). The
single-account path also stamped the ambient AWS host account on history
regardless of the resolved target account or provider (#646).

- #642: introduce a "partially_completed" outcome. When at least one rec
  committed, never mark the row "failed"; record partially_completed, send
  the confirmation for the recs that purchased, and surface the row in
  History (flagged IsAuditGap so its execution-level dollars are excluded -
  the committed dollars come from the per-rec purchase_history rows). Applies
  to both the single-account path (via a partialPurchaseError sentinel that
  finalizeExecution maps) and the multi-account fan-out path.
- #642 frontend: on a partial fan-out failure, name the buckets that created
  pending executions so the still-actionable requests are not silent orphans;
  add a distinct "Partial" history badge.
- #646: resolve the target account's ExternalID from the resolved account and
  stamp that on purchase_history, falling back to the ambient AWS STS identity
  only when no target account can be identified. Fixes assume-role AWS and
  direct-execute Azure/GCP stamping.

Regression tests: single-account + multi-account partial success
(partially_completed status + notification sent), correct account stamping
for AWS and Azure, History row synthesis with dollar-exclusion, and the
fan-out partial toast naming submitted buckets.

Closes #642
Closes #646

* fix(purchases): error when target account is unresolved, never fall back to ambient

When a single-account purchase resolves a concrete cloudAccountID but
GetCloudAccount returns a nil account (the account does not exist),
resolveSingleAccountProvider returned (nil, "", nil). The caller then
treated the empty target ID as "truly ambient" and fell back to the host
AWS STS identity, purchasing and stamping purchase_history against the
wrong account (#646).

Surface a descriptive error instead so the caller's no-ambient-fallback
contract holds once a target account is known. Add a regression test
covering the not-found path.
…n-flight rows (closes #645) (#648)

* fix(purchases): align token cancel path with session path and guard in-flight rows

The token/email cancel path (loadCancelableExecution) only rejected
completed/cancelled, so an email-link holder could cancel an
approved/running/paused/failed/expired execution that the dashboard
session path refuses. Cancelling an approved/running row is unsafe: the
AWS commitment is being or has been created, so the cancel would leave
the DB and the cloud out of sync.

Restrict the token path to pending/notified only, matching
cancelPurchaseViaSession. This restriction is itself the in-flight
guard. Add table-driven tests on both the token (Manager.CancelExecution)
and session (cancelPurchaseViaSession) paths covering every status:
each non-cancelable status is rejected with no store write, and
pending/notified still cancel successfully.

Closes #645

* refactor(purchase): centralize cancelable-status predicate

Extract the pending/notified cancelability rule into a single
PurchaseExecution.IsCancelable method and call it from both cancel
paths (purchase.Manager.loadCancelableExecution and the session-authed
cancelPurchaseViaSession) so the policy can never drift between the
token and session flows. Behavior is unchanged: only pending/notified
rows stay cancelable; approved/running and other in-flight states
remain non-cancelable.

Addresses CodeRabbit nitpick on PR #648 (issue #645).
…isory-only (#649)

These three GCP PurchaseCommitment implementations did not buy a
committed-use discount: they provisioned a brand-new billable resource
(a Cloud SQL instance, an empty GCS bucket, a Redis instance), so a
"purchase" silently spun up infrastructure that kept billing and the
commitment the user believed they bought never existed.

None of these services exposes a safe programmatic commitment-purchase
API: Cloud SQL and Memorystore CUDs are spend-based (bought via the
Cloud Billing console / Cloud Commerce Consumer Procurement API) and GCS
has no CUD purchase API at all. Their recommendations are therefore
advisory only.

Each PurchaseCommitment now returns the new sentinel
common.ErrCommitmentPurchaseNotSupported and never calls any
resource-creation API. The purchase executor already returns that error
to the caller (no success path, no infrastructure created). Regression
tests assert PurchaseCommitment returns the sentinel and never invokes
InsertInstance / bucket Create / CreateInstance.

Closes #640
…) (#653)

* feat(purchases): make Azure reservation purchases idempotent

Extends the #636 IdempotencyToken mechanism to all five Azure reservation
executors (cache, compute, cosmosdb, database, search) so a re-drive (and

Each executor previously minted a fresh reservationOrderID per call
(uuid.New(), or a time.Now() timestamp in search), so a retry created a
second reservation. The Azure Reservations API PUTs to
reservationOrders/{id} and is idempotent on a stable order ID, so the
order ID is now derived deterministically from the IdempotencyToken via
common.IdempotencyGUID: the same token re-PUTs the same order and returns
the existing reservation rather than creating a second. An empty/short
token falls back to the prior non-idempotent ID, preserving behaviour for
non-execution callers.

Adds common.IdempotencyGUID (canonical 8-4-4-4-12 GUID from the first 128
bits of the SHA-256 token) and common.ReservationOrderID, a single helper
that returns the derived GUID or the caller's fallback so each executor
stays a one-statement assignment (keeps gocyclo under threshold and avoids
repeating the empty-token guard across services). Per-provider regression
tests on compute and search assert that a same-token re-drive PUTs to the
identical reservationOrders/{id} URL (no second reservation) while a
distinct token targets a distinct order.

Refs #641.

* fix(azure/idem): reject non-hex IdempotencyGUID input; use uuid.New() fallback

IdempotencyGUID now validates the first 32 characters are valid lowercase
hex before formatting them as a GUID, returning "" for any non-hex input
rather than silently producing a malformed GUID.

Azure search PurchaseCommitment replaces the Unix-second timestamp fallback
with uuid.New().String() so the fallback is collision-resistant at
sub-second granularity rather than guaranteed to collide on burst retries.

Co-authored-by: CodeRabbit CR #653
…647) (#655)

* fix(api/purchases): guard web purchase execution against malformed recs and double-submit

Two related guards on the web executePurchase path:

- #643: validate each client-supplied recommendation's Provider/Service/
  Term/Payment/Count at the API boundary (validatePurchaseRecommendation in
  validation.go), wired into validateExecutePurchaseRequest after the account
  scope check. Per-provider Term (1/3) and Payment whitelists reject malformed
  values (Term:7, Payment:"foo", non-positive Count, empty/all provider) before
  they reach the cloud SDK with a silent default substituted. Scoped to the web
  execute path only; the retry path replays already-validated recs and is not
  re-gated.

- #644: submit-time idempotency. purchaseIdempotencyKey hashes actor + sorted
  rec tuples + capacity; findDuplicatePendingExecution recomputes the key for
  recent web-sourced pending executions (no schema change) and collapses a
  double-click/retry within a 2-minute window onto the existing execution
  instead of minting a second approvable row (double-spend). The frontend
  execute button is now disabled before the confirm dialog and network call in
  both the single and fan-out paths, and re-enabled on cancel.

Updated existing executePurchase / per-account-perms tests to carry valid recs
and the new GetPendingExecutions lookup; added table-driven validation tests,
idempotency-key and duplicate-lookup tests, and frontend button-disable tests.

#647 (server-side capacity_percent consistency) requires plumbing the
pre-scaling count through the frontend and rec type; split into a follow-up to
keep this change focused.

* fix(api/purchases): validate capacity_percent against scaled rec counts (#647)

capacity_percent was a decorative audit-only field: the frontend scaled
rec counts client-side, sent the percent alongside, and the backend stored
it without ever cross-checking that the scaled counts agreed with the
recorded percent. The audit trail for a financial action could claim e.g.
"50% capacity" while the recs summed to 100%.

Fix: stamp the pre-scaling recommended count onto each rec at submit time
(RecommendedCount in config.RecommendationRecord, recommended_count on the
TS Recommendation/LocalRecommendation wire types) and add
validateCapacityConsistency, wired into validateExecutePurchaseRequest after
the per-rec content validation. It rejects any rec where
floor(RecommendedCount*pct/100) != Count with a 400. The check is opt-in
per rec: recs that carry no recommended_count (legacy callers, single-rec
full-capacity purchases, retry replays) are skipped, so the change is
backward-compatible and never produces a false rejection for callers that
predate the field. JSONB persistence means no schema change.

The frontend stamps recommended_count = r.count on the scaled copy in
handleBulkPurchaseClick before the counts are floored, so the value the
backend verifies always describes the recs it accompanies.

Tests: table-driven validateCapacityConsistency cases (full/partial/floor/
mismatch/absent/mixed); a frontend round-trip test that recommended_count
survives into the executePurchase POST body on the single-rec path.

* fix(api/purchases): atomic duplicate-detection via SELECT FOR UPDATE in tx (#643)

Move the pending-execution scan inside the WithTx block so the duplicate
check and the INSERT are a single atomic operation, closing the TOCTOU
race that the pre-tx duplicatePurchaseResponse call could not prevent.

- Add GetPendingExecutionsTx to StoreInterface (SELECT ... FOR UPDATE)
- Extract scanExecutionRows helper to share row-scan logic between the
  pool path and the new tx path
- Extract matchDuplicateInList to keep persistExecutionAndSuppressions
  and executePurchase under the gocyclo threshold
- Add derefStringOrEmpty to eliminate the nil-check branch in executePurchase
- Change validatePurchaseRecommendation to *RecommendationRecord receiver
  so provider/service/payment normalisation is written back to the caller
- Add GetPendingExecutionsTx fallback to all six test mock types

Closes #643
…client (closes #556) (#620)

* feat(providers/azure/managed-redis): add Azure Managed Redis service client (closes #556)

Add providers/azure/services/managedredis as Azure's counterpart to AWS
MemoryDB. Azure Cache for Redis is the closest Azure equivalent: a fully-
managed in-memory caching service with reservation-based pricing via the
Azure Consumption and Retail Prices APIs.

The new client registers under ServiceMemoryDB so it is routed and reported
symmetrically with the AWS MemoryDB client at the provider-dispatch level.
It reuses the armredis/v3 and armconsumption SDK packages already present
in the azure provider module - no new dependencies.

Changes:
- providers/azure/services/managedredis/client.go: new ManagedRedisClient
  implementing the full provider.ServiceClient interface (GetRecommendations,
  GetExistingCommitments, PurchaseCommitment, ValidateOffering,
  GetOfferingDetails, GetValidResourceTypes)
- providers/azure/services/managedredis/client_test.go: 34 tests covering
  all methods, pager injection, HTTP mock, token error, SKU fallback
- providers/azure/services.go: NewManagedRedisClient factory
- providers/azure/provider.go: ServiceMemoryDB in GetSupportedServices and
  newServiceClientForSubscription switch
- providers/azure/services_test.go: TestNewManagedRedisClient
- providers/azure/provider_test.go: ServiceMemoryDB assertions in
  GetSupportedServices and AllServiceTypes tests

* fix(managedredis): address CR findings - error propagation, real order discovery, UUID, recommendation mapping

- Propagate pager-creation and NextPage errors in GetExistingCommitments instead
  of swallowing them (return nil, err rather than empty slice or silent break)
- Replace hardcoded zero-UUID order ID in reservationDetailsPager with the
  subscription-scope NewListPager so all real reservation orders are enumerated;
  update ReservationsDetailsPager interface to ListResponse (not ByReservationOrder)
- Use uuid.New().String() for PurchaseCommitment reservation order ID to avoid
  second-level collisions under concurrent purchases
- Replace static stub in convertRecommendation with recommendations.Extract so
  all Azure recommendation fields (SKU, count, cost, region, term) are mapped
- Update tests: pager mock type, pager-error case now asserts error, new
  convertRecommendation tests cover nil and legacy recommendation shapes

* fix(managedredis): address second CR pass - pagination + nil context

- Follow NextPageLink in getRedisPricing so all pricing pages are
  consumed rather than only the first response
- Replace nil context with context.Background() in
  TestGetValidResourceTypes_Fallback
- Add TestGetOfferingDetails_Paginated to exercise the multi-page
  fetch path end-to-end

* fix(providers/azure/managedredis): address CR third-pass production findings

- Guard NewClientWithHTTP against a nil HTTPClient by substituting a
  default *http.Client with a 30s timeout, avoiding nil-deref panics on
  the first Do call.
- Change collectSKUsFromPager to return (map, error) and have
  GetValidResourceTypes discard partial results and fall back to the
  curated commonSKUs() list on pager error, preventing false validation
  failures for valid SKUs.
- Pass the converter's *float64 RecurringMonthlyCost through directly so
  nil (the documented "provider did not return a monthly breakdown"
  sentinel in pkg/common/types.go) stays distinct from an explicit 0.

* test(providers/azure/managedredis): tighten mock + fixture coverage per CR

- Register h.AssertExpectations(t) via t.Cleanup on every test using
  mockHTTPClient (TestGetOfferingDetails_{1yr,3yr,NoUpfront,APIError,
  NoPricing} and TestPurchaseCommitment_*), so the mocked Do call is
  required to fire instead of being silently optional.
- Add a "3 Years" reservation entry to samplePricingJSON and assert the
  exact recurring/total cost in TestGetOfferingDetails_3yr so the test
  actually exercises 3-year term selection rather than just asserting
  RecurringCost > 0 against the 1-year fixture.
- Replace nil context with context.Background() in
  TestValidateOffering_InvalidSKU for consistency with other tests.

* fix(azure/managed-redis): fail closed on unknown terms and missing reservation price

- Add parseTermYears helper (mirrors synapse) to reject any term string
  outside the explicit "1yr"/"3yr" allowlist instead of silently defaulting
  to 1 year in PurchaseCommitment and GetOfferingDetails
- Remove hardcoded 45% reservation-price fallback in getRedisPricing; return
  an error when the API returns no reservation price for the requested SKU
- Add TestPurchaseCommitment_InvalidTerm and TestGetOfferingDetails_InvalidTerm
  to exercise the new error paths
…578)

The Purchases-tab Savings History chart (modules/savings-history.ts) only
wired its local period dropdown and refresh button, so changing the global
Provider or Account topbar chip did not refresh it (the tab had to be left
and re-entered), and it ignored the account filter entirely.

Mirror the PR #500 / #488 pattern:
- isPurchasesTabActive() guards the reload so a chip change while the user
  is on another tab defers (switchTab reloads on entry).
- queueMicrotask coalesces the account-then-provider subscriber pair fired
  from one user action (the #185 ordering rule), collapsing two reloads
  into one and avoiding stale-overwrite races.
- loadSavingsHistory() now forwards the selected account_id (single-select,
  matching the backend) and the provider chip to getSavingsAnalytics. The
  provider param is honoured by the backend once #502 lands; until then
  account_ids does the filtering.

Adds regression tests covering subscriber wiring, the active-tab guard,
coalescing, account_id passthrough, and two consecutive account changes
producing distinct fetches. Updates app.test.ts's state mock to expose the
new subscribe getters since it exercises the real savings-history module.

Closes #503
GCP Compute CUD creation named commitments cud-<unix-second> and ignored
opts.IdempotencyToken, so a re-drive of the same execution (issue #639's
recovery sweep) would create a SECOND committed-use discount: a financial
double-purchase.

Thread opts.IdempotencyToken into RegionCommitments.Insert via two
deterministic mechanisms so the same token can never buy twice:

- RequestId: GCP's native server-side idempotency key on Insert, which the
  API documents as preventing duplicate commitments. It must be a valid
  non-zero UUID, so we format the SHA-256 token into a canonical UUID with
  the existing common.IdempotencyGUID helper (the same derivation PR #653
  used for the Azure reservationOrderID). The same token always yields the
  same RequestId, so the second Insert is a server-side no-op.
- Name: also derived from the token (cud-<first-32-hex>, RFC1035-valid) as
  defense in depth. Commitment names are unique per project+region, so a
  re-drive that somehow reached Insert collides on the name and GCP rejects
  it with ALREADY_EXISTS rather than creating a duplicate.

An empty token preserves the prior non-idempotent timestamp-based name (the
CLI path, which has no owning execution). The token is masked in logs via
common.MaskToken (never logged verbatim), matching PR #652.

Adds a re-drive regression test mirroring the AWS/Azure ones: the same token
on a second PurchaseCommitment yields an identical RequestId and commitment
name, plus an empty-token test confirming the CLI path stays non-idempotent.

Scope is computeengine only; CloudSQL/Storage/Memorystore were made
advisory-only in #649 and are untouched.

Closes #654
refs #641
… (refs #667) (#668)

The purchase-execution flow had zero Info-level observability between the
approve handler and the final aggregated error toast. When a sync approve
hit a context-deadline timeout on the cloud SDK call (issue #667), the
only signal was the wrapped error returned to the HTTP response. CloudWatch
showed nothing about which rec ran, what details shape was used, how long
the AWS API call took, or whether the SDK reached the purchase call at
all.

This change adds Info/Error log lines at every critical step inside
executeSinglePurchase + processPurchaseRecommendations, every line tagged
with the owning execution UUID so a CloudWatch filter by exec ID surfaces
the full per-rec trace:

  purchase[<exec-id>]: dispatching N recommendation(s) for account=X plan=Y
  purchase[<exec-id>]: starting rec <provider>/<svc>/<region>/<sku> ...
  purchase[<exec-id>]: provider <provider> constructed in <ms>
  purchase[<exec-id>]: service client <provider>/<svc>/<region> ready in <ms>
  purchase[<exec-id>]: details ready for <provider>/<svc> (detailsAbsent=<bool> engine=<engine> idempotencyToken=<token>)
  purchase[<exec-id>]: <rec-tuple> PurchaseCommitment succeeded in <ms> (commitmentID=<id>, cost=<n>)

Failure paths log at Error level with the same exec-ID tag + elapsed time
so a slow SDK call is distinguishable from a fast IAM denial.

To carry the execution ID into executeSinglePurchase without a new
parameter (which would ripple through every caller including the CLI
purchase path that has no owning execution), a new ExecutionID field is
added to common.PurchaseOptions. The web/dashboard path now populates it
from exec.ExecutionID at the processPurchaseRecommendations entry; the
CLI path leaves it empty as documented in the field comment.

The `detailsAbsent=true` signal in the "details ready" line is the key
diagnostic — when set, the rec has no persisted ServiceDetails JSON and
the cloud client falls back to defaults (Platform=Linux/UNIX,
Tenancy=default, Scope=Region), which is the legacy-row code path from
issue #453. Combined with the timing on the PurchaseCommitment line,
operators can now distinguish "AWS SDK retry storm against a fully-
specified rec" from "AWS returned zero offerings because a legacy rec
fell back to defaults".

Updated two existing tests that asserted exact PurchaseOptions equality
to include the new ExecutionID field.

Test results: 161/161 purchase tests pass, full project build clean.

Refs #667
…541, #543) (#573)

* sec(frontend/docs): add SRI + version pin to Go-served swagger-ui docs

The Go /docs handler still loaded swagger-ui-dist@5.32.5 from unpkg with no
integrity/crossorigin, leaving the public docs page open to CDN poisoning
(issue #447). PR #521 only fixed the static frontend/src/docs.html, not the
Go handler (issue #541).

Bring handler_docs.go to parity with docs.html: pin @5.32.6, add the same
sha384 SRI hashes + crossorigin="anonymous" to the css/bundle/standalone-
preset tags, and load the standalone-preset script the bootstrap references
(previously missing). The unpkg host stays in script-src because element-level
SRI does not grant CSP load permission; the godoc explains why. Add a Go
regression test asserting the served HTML keeps the version pin and SRI attrs
and uses the shared hash constants.

Closes #447, #541
Refs #521

* test(frontend/docs): guard docs.html swagger-ui version pin + SRI attrs

PR #521 pinned the swagger-ui-dist version and added SRI integrity/crossorigin
to docs.html but added no test, so a future edit could silently drop the
integrity attribute or revert the @5.32.6 pin to a floating @5 tag (issue #543).

Add a regression test that, for each unpkg/swagger-ui-dist link/script tag in
docs.html, asserts an exact x.y.z version (never a floating major tag), an
integrity="sha384-..." attribute, and crossorigin="anonymous".

Closes #543
Refs #447, #521
…673)

TestGetOfferingDetails_BadPaymentOption asserts that an unknown payment
option ("Quarterly") returns an error containing "unsupported payment
option" and does not silently fall back to no-upfront. Addresses the
CodeRabbit finding from PR #592.
…utines (closes #669) (#670)

Defence-in-depth for the per-account / per-rec fan-out helper. Before this
change, any panic inside fn (nil deref, type assertion failure, slice OOB
in a future refactor) propagated up the unrecovered goroutine and crashed
the entire Lambda process. Consequences:

  - Purchase execution row stays at 'approved' with no transition to
    'failed' (state machine never runs the aggregator that would record
    the failure).
  - Lambda invocation terminates abnormally; CloudWatch may not flush
    the final log lines.
  - User sees a generic Lambda invocation error rather than the actual
    panic detail.

The fan-out helper now installs a deferred recover() in every goroutine
that converts a panic into a structured Err on the per-item Result slot,
logs the goroutine stack at Error level for post-mortem, and lets the
parent aggregator process the failure exactly like a normal fn-returned
error.

Current call sites in internal/purchase/execution.go (per-account
executeForAccount + per-rec processPurchaseRecommendations) have no
obvious panic source today; the recover() is cheap insurance for the
high-impact failure mode and covers future refactors that might
introduce one.

Regression test (TestFanOut_PanicInFn) panics inside fn for one of
three items, asserts the panicking item carries a non-nil Err containing
the panic value AND that the other two items still succeed (i.e. one
goroutine's panic doesn't cascade across the fan-out).

Closes #669
…rove (closes #671) (#674)

* fix(purchases): atomic CAS cancel to close TOCTOU with concurrent approve (closes #671)

Both cancel paths (session-authed dashboard cancel and email-token
CancelExecution) used an out-of-transaction IsCancelable check followed
by an unconditional SavePurchaseExecutionTx upsert. A concurrent approve
that transitioned the row to 'approved' between the check and the upsert
would be silently overwritten, leaving the DB at 'cancelled' while the
AWS/Azure/GCP commitment is real and billable.

Add CancelExecutionAtomic(ctx, tx, executionID, cancelledBy) to the store
layer (interface + PostgresStore + all three mocks). The method issues a
conditional UPDATE WHERE status IN ('pending','notified') inside the
caller-provided tx, returning (false, currentStatus, nil) on zero rows
affected so callers can 409 with the racing status visible. On success it
returns (true, "cancelled", nil).

Update both cancel paths to call CancelExecutionAtomic inside WithTx
(paired with DeleteSuppressionsByExecutionTx on the success branch) and
return 409 / a descriptive error when the CAS loses the race.

Fix the misleading "optimistic-locking guard inside the tx" comment on
handler_purchases.go that claimed a guard existed when it did not.

Regression tests added for both paths:
- TestManager_CancelExecution_RaceWithApprove (token path, purchase pkg)
- TestHandler_cancelPurchase_Session_RaceWithApprove (session path, api pkg)

Update all existing cancel tests to assert CancelExecutionAtomic rather
than SavePurchaseExecution to reflect the new call site.

* fix(cancel): surface tx errors as 5xx, tighten suppression-cleanup assertions

- handler_purchases.go: return fmt.Errorf (5xx) for WithTx failures instead
  of NewClientError(409); only the !cancelled path is a true conflict
- handler_purchases_test.go: add DeleteSuppressionsByExecutionTx expectation
  + AssertCalled in runSessionCancelAllowed to cover the transactional contract
- purchase/mocks_test.go: make DeleteSuppressionsByExecutionTx use m.Called
  so testify tracks calls for AssertNotCalled/AssertExpectations
- approvals_test.go, coverage_extra_test.go: register the expectation on all
  success-path cancel tests; add AssertNotCalled on race test to guard against
  accidental cleanup writes when CAS misses

Addresses CodeRabbit review on PR #674.
 #555) (#622)

* feat(providers/azure/synapse): add data-warehouse service client (closes #555)

Add Azure Synapse Analytics Reserved Capacity service client as the
Azure counterpart to AWS Redshift (issue #473 parity gap).

- New package providers/azure/services/synapse implements the full
  ServiceClient interface: GetRecommendations, GetExistingCommitments,
  PurchaseCommitment, ValidateOffering, GetOfferingDetails,
  GetValidResourceTypes.
- Recommendations sourced from Azure Consumption API
  (armconsumption.ReservationRecommendationsClient) using the
  "SQLDatabaseDTU" resource type filter; supports both Legacy (EA) and
  Modern (MCA) billing account shapes via the shared
  internal/recommendations.Extract converter.
- Existing commitments retrieved via armconsumption.ReservationsDetails,
  filtered to DW*/SCU SKUs that identify Synapse SQL Pool reservations.
- Purchase issues a PUT to Microsoft.Capacity/reservationOrders using
  reservedResourceType "SqlDW"; purchase-automation tag applied when
  PurchaseOptions.Source is non-empty.
- Pricing via Azure Retail Prices API (azure-synapse service name);
  handles "1 Year" / "3 Years" reservation term strings as returned
  by the live API.
- GetValidResourceTypes returns static DW100c-DW30000c list (same
  approach as AWS Redshift client); no armsynapse SDK dependency needed.
- Wire into provider.go (GetSupportedServices + newServiceClientForSubscription)
  and services.go factory; update provider_test.go to assert ServiceDataWarehouse.
- 31 unit tests covering all methods, pager mocks, HTTP mocks, credential
  errors, SKU filtering, pricing extraction, and tag application.

* fix(synapse): address CR findings - error propagation, term validation, input guards, pricing safety

- Propagate pager-creation error in GetExistingCommitments instead of returning empty
- Extract parseReservationTermYears helper that fails closed on unknown terms; replace
  both silent 1yr-coercion blocks in PurchaseCommitment and GetOfferingDetails
- Guard PurchaseCommitment against empty ResourceType and non-positive Count
- Remove synthetic 40% reservation-cost fabrication in getSynapsePricing; return a
  clear "pricing data unavailable" error when the API returns no reservation price
- Add test coverage for all four new code paths

* test(synapse): address remaining CodeRabbit review findings

- Propagate io.ReadAll error in doPurchaseRequest; when the HTTP response
  body read fails, surface a clear error message including the read failure
  rather than silently using an empty/partial body string in the status-error
  message.

- Add common.ServiceDataWarehouse to TestAzureProvider_GetServiceClient_AllServiceTypes
  so both the account and non-account client paths are regression-protected
  for the DataWarehouse service type.

* fix(synapse): filter GetRecommendations to client region

The pager returns recommendations for all regions; downstream
offering/purchase logic is region-bound. Gate each converted
recommendation on strings.EqualFold(converted.Region, c.region)
so callers receive only the region they constructed the client for.

Adds TestGetRecommendations_regionFilter (verifies cross-region rec
is dropped) and TestGetRecommendations_modernShape (Modern/MCA payload
shape coverage requested by CR).

Addresses CR finding on PR #622 (2026-05-22T11:20:18Z).

* test(azure/provider): add ServiceSavingsPlans and ServiceSearch to test matrices

Add the two newly wired services to all three test functions that guard
provider service wiring: TestAzureProvider_GetSupportedServices,
TestAzureProvider_GetServiceClient_AllServiceTypes, and
TestAzureProvider_GetServiceClientForAccount. Also add ServiceMemoryDB
to the GetServiceClientForAccount matrix which was missing.

Guard NewClientWithHTTP against a nil httpClient by substituting
http.DefaultClient, preventing a nil-dereference panic on purchase
and pricing paths that call c.httpClient.Do.

Closes CR findings: test matrix gap (ServiceSavingsPlans/ServiceSearch)
and nil HTTP client guard.

* fix(providers/azure/synapse): tighten SCU SKU filter + case-insensitive offering validation

Addresses CodeRabbit round-5 findings on PR #622:

- L210: Replace `strings.Contains(skuLower, "scu")` with
  `strings.HasPrefix(skuLower, "scu")` so unrelated SKUs that merely
  contain "scu" as a substring (e.g. "rescue_*") are no longer
  misclassified as Synapse reservations.
- L351: Use `strings.EqualFold` with `strings.TrimSpace` in
  `ValidateOffering` so a valid SKU with different casing (e.g.
  "dw1000c") is accepted, matching the rest of the Azure provider's
  case-insensitive comparison pattern.

Adds tests:
- TestValidateOffering_caseInsensitive
- TestValidateOffering_trimsWhitespace
- TestConvertSynapseReservation_scuPrefix
- TestConvertSynapseReservation_scuSubstringNotMatched (regression guard)
* sec(ci): pin all GitHub Actions to commit SHAs (#415)

Replace all `uses: org/action@tag` refs across every workflow file
with `uses: org/action@<sha> # tag` pins. Several workflows referenced
non-existent future versions (v5/v6/v7 for actions still on v4); these
are corrected to the actual latest released tag while being SHA-pinned.

Closes #415

* fix(ci): disable credential persistence on all checkout steps

All 11 actions/checkout uses defaulted to persist-credentials: true,
leaving the GitHub token in git config after checkout. None of these jobs
need persisted credentials for authenticated git operations. Set
persist-credentials: false on every step to reduce the token exposure
window.

Fixes CodeRabbit finding (artipacked) on PR #536.

* fix(ci): disable credential persistence in rollback and database-migration

All 5 actions/checkout steps in rollback.yml and all 4 in
database-migration.yml defaulted to persist-credentials: true, leaving
the GitHub token in git config for the remainder of each job. Neither
workflow has any step that performs authenticated git operations after
checkout (cloud auth is done via dedicated aws-actions/configure-aws-credentials,
google-github-actions/auth, and azure/login actions). Setting
persist-credentials: false on every checkout step closes the token
exposure window.

Fixes remaining CodeRabbit artipacked findings on PR #536 (nitpick
comments on rollback.yml and database-migration.yml).

* fix(ci): pin actions to latest SHA of the in-use major, not a downgraded major

The previous SHA-pin commit resolved each action to a commit in an older
major version than the workflows were using (e.g. checkout pinned to v4.3.1
while the branch used v5, configure-aws-credentials pinned to v4.3.1 while
the branch used v6, etc.). This corrects every pin to the latest release
within the same major that the base branch (feat/multicloud-web-frontend)
was already on. No logic changes; SHA + version comment only.

Actions corrected (old major -> correct major):
  actions/checkout              v4.3.1 -> v5.0.1    (34 occurrences)
  actions/download-artifact     v4.3.0 -> v7.0.0     (4 occurrences)
  actions/upload-artifact       v4.6.2 -> v6.0.0     (9 occurrences)
  actions/setup-go              v5.6.0 -> v6.4.0     (8 occurrences)
  actions/setup-node            v4.4.0 -> v6.4.0     (3 occurrences)
  actions/setup-python          v5.6.0 -> v6.2.0     (1 occurrence)
  aws-actions/configure-aws-credentials  v4.3.1 -> v6.1.1  (10 occurrences; standardises ALL to v6.1.1)
  azure/login                   v2.3.0 -> v3.0.0     (7 occurrences)
  docker/build-push-action      v6.9.0 -> v7.2.0     (1 occurrence)
  docker/setup-buildx-action    v3.9.0 -> v4.0.0     (2 occurrences)
  google-github-actions/auth    v2.1.9 -> v3.0.0     (5 occurrences)
  google-github-actions/setup-gcloud  v2.2.1 -> v3.0.1  (2 occurrences)
  hashicorp/setup-terraform     v3.1.2 -> v4.0.1     (15 occurrences)
  codecov/codecov-action        v5.5.4 -> v6.0.1     (1 occurrence)
  golangci/golangci-lint-action v6.5.2 -> v9.2.1     (1 occurrence)
  snyk/actions/golang           v1.0.0 -> 0.4.0      (1 occurrence; base was on 0.x)
  securego/gosec                v2.22.9 -> v2.26.1   (1 occurrence; updated to latest v2)
  github/codeql-action          v4.35.5 -> v4.36.0   (2 occurrences; updated to latest v4)

All 23 SHAs independently verified against their tags via GitHub API.
No float tags remain. YAML validates on all 15 files.
…) (#540)

* fix(db): add regression test for migration 000027 idempotency

Migration 000027 (savings_snapshots_pk) already contains the
DROP CONSTRAINT IF EXISTS guard that makes it safe on fresh databases
where 000018 has already created savings_snapshots_pkey. This commit
adds the integration test that would have caught the original failure:

- TestMigration_SavingsSnapshotsPK/full_migration_from_scratch: runs
  all migrations from scratch, verifying 000027 does not fail with
  "multiple primary keys for table" even though 000018 already added
  the constraint.
- TestMigration_SavingsSnapshotsPK/rollback_to_000026_then_re-apply:
  rolls back past 000027 (without rolling back 000018) and re-applies
  all migrations, verifying idempotency in the replay scenario.

Closes #246

* fix(db): fix 000040 down SQL column scope and test rollback loop

The ORDER BY CASE k in 000040_split_savingsplans.down.sql referenced
column k from a UNION branch, which is out of scope at the outer
ORDER BY level in PostgreSQL. Restructure the SP-collapse branch to
use a subquery that exposes a numeric priority column, then ORDER BY
that column name inside the subquery.

The savings_snapshots_pk_test.go rollback sub-test computed
stepsToRollback = currentVersion - 26 (= 21 on a 47-migration DB)
and passed it in a single RollbackMigrations call, which hit the
10-step safety cap. Call in a loop of at-most-10-step batches.

* refactor(migrations): extract getMigrationsPath to shared test helper

Move getMigrationsPath() from split_savingsplans_test.go into a new
helpers_test.go shared by all integration tests in the package. This
removes the implicit cross-file dependency that CodeRabbit flagged in
savings_snapshots_pk_test.go: the function was defined in a sibling file
and invisible to a reader of the calling file alone.

Both files are in the same package and build tag (integration), so they
always compiled together. The refactor makes each test file self-evidently
complete without relying on symbol spillover from a sibling.

* fix(db): derive rollback step count from schema_migrations row count

Replace `currentVersion - 26` arithmetic with a direct COUNT query on
schema_migrations for rows with version > 26. This is correct even when
migration numbering has gaps, since Steps(-n) steps back through n
applied migrations (not n version-number units).

Also clarify the comment about migration 000018 being "still in effect"
to make clear that its schema_migrations entry persists but its original
constraint was superseded by 000027.

Addresses CodeRabbit finding on PR #540 (nitpick, lines 68-73 and
90-93 of savings_snapshots_pk_test.go).
closes #678) (#681)

* feat(config): add ListStuckExecutions store method

Selects purchase_executions rows whose status is in a caller-supplied
set AND whose updated_at is older than a caller-supplied interval.
Newest-stuck-first, capped at MaxListLimit per call. Mirrors the
existing GetStaleProcessingExchanges pattern for RI exchanges.

Used by the reaper sweep (issue #678) to find executions stuck in
approved/running because the synchronous executor failed mid-flight
(Lambda timeout, OOM, network hang) without flipping the row to a
terminal state. Local SELECT only — no provider-side mutation.

Adds the method to StoreInterface and to every test-side mock store
that satisfies it (purchase, api, analytics, scheduler, server
health). Adds pgxmock coverage for the happy path, the
empty-statuses short-circuit, and a query-error path.

* feat(purchase): add ReapStuckExecutions sweep + per-row CAS to failed

New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds
executions stuck in approved/running longer than reapAfter and
atomically transitions them to "failed" via the existing
TransitionExecutionStatus CAS. Each successful transition also writes
a canonical, human-readable error string so the History UI (#621)
shows operators why the row was reaped and confirms it's safe to retry.

Safety properties:
- Local-status-only: never touches provider commitments. If the real
  executor did manage to create a commitment before dying, the
  retry hits the idempotency path (#636/#638/#652) which surfaces a
  duplicate-reservation error cleanly.
- CAS-protected: if the real executor wakes up and finishes between
  the SELECT and the transition, the CAS rejects; logged at INFO,
  not surfaced as an error. The real executor wins the race.
- Per-row error-isolation: a failure on row N never blocks N+1..K.

Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and
fall back to a 10m default on missing-or-malformed env (with a WARN
log so ops can spot a typo). Never panics — a misconfigured env var
must not crash the other scheduled tasks sharing the Lambda.

Tests cover:
- stale approved row → flipped to failed with canonical message
- stale running row → same
- younger-than-threshold rows are filtered out by the SELECT (no
  TransitionExecutionStatus / SavePurchaseExecution calls happen)
- terminal-status rows are never passed to the SELECT (regression
  guard against accidentally widening stuckStatuses)
- CAS race lost → logged + counted, not surfaced as sweep error
- CAS (nil, nil) defensive path treated as race-lost
- 3-stuck-row integration with per-row CAS expectations
- SELECT error → propagated as sweep error
- per-row save error after successful CAS → Reaped++ AND Errored++
- env-var: unset, valid, invalid, non-default unit (2h30m)

* feat(purchase): wire periodic reaper invocation (closes #678)

Wires the ReapStuckExecutions sweep (added in the previous commit) into
the existing scheduled-task pipeline:

- New ScheduledTaskType "reap_stuck_purchases" registered alongside the
  other periodic tasks (cleanup, analytics_refresh, ri_exchange_reshape).
  Adds a dedicated handleReapStuckPurchases dispatcher that reads the
  threshold via purchase.ParseReapAfterFromEnv on every invocation so an
  ops-side env-var tune via PURCHASE_APPROVED_REAP_AFTER takes effect on
  the next sweep without a redeploy.
- Adds ReapStuckExecutions to PurchaseManagerInterface so the handler
  contract is symmetric with the Manager + the testutil mock can satisfy
  it. MockPurchaseManager gains ReapStuckExecutionsFunc following the
  existing per-method-Func mock convention.
- New EventBridge rule "${stack_name}-reap-stuck-purchases" on
  rate(5 minutes), targeting the main Lambda with
  {"action":"reap_stuck_purchases"}. Cadence is intentionally more
  frequent than the 10m default threshold so a stuck row is reaped
  within ~1 threshold-window. The reaper itself is CAS-protected
  (TransitionExecutionStatus from approved/running → failed) so an
  over-run is safe — the real executor wins the race.
- New terraform vars: enable_reap_stuck_purchases_schedule (bool,
  default true), reap_stuck_purchases_schedule (string,
  default "rate(5 minutes)"), purchase_approved_reap_after (string,
  default "" → use in-code DefaultReapAfter). Empty-string default for
  the threshold avoids pinning a Lambda env value when ops hasn't taken
  a position.
- ParseScheduledEvent learns the "reap_stuck_purchases" action so the
  EventBridge → Lambda payload round-trips through the dispatcher.

Tests cover the dispatcher success path (asserts the default 10m
threshold is passed when the env var is unset), the store-error
propagation path, the lock-ID uniqueness guard, and the
ParseScheduledEvent action mapping.

* fix(purchase/reaper): distinguish CAS race-loss from real DB errors

The reaper's per-row error handler bucketed every failure from
TransitionExecutionStatus into RaceLost, including real DB outages
(connection refused, query timeout, etc.). That masked persistent
store failures: a downed DB would surface only as a quiet count
inflation on RaceLost, with no errored signal for ops to alert on.

Wrap the two legitimate race-loss outcomes from
TransitionExecutionStatus in sentinel errors so callers can use
errors.Is rather than brittle string matching:

  - ErrExecutionNotInExpectedStatus: row exists but its status moved
    out of the allowed set before the CAS (the real executor finished
    between SELECT and UPDATE — race lost).
  - ErrNotFound (reused for the "row vanished" case): row was deleted
    between SELECT and UPDATE — also a race outcome, nothing for the
    reaper to do.

In the reaper, classify errors.Is(err, ErrExecutionNotInExpectedStatus)
or errors.Is(err, ErrNotFound) as RaceLost (INFO log); everything else
bumps Errored with an ERROR log so the ops signal is visible.

Addresses A1 from CodeRabbit round-1 review on #681.

Regression tests:
  - TestReapStuckExecutions_CASRaceLostNoError now wraps the sentinel
    (was raw errors.New); guards the "race-loss is INFO-only" path.
  - TestReapStuckExecutions_RowVanishedTreatedAsRaceLost covers the
    ErrNotFound branch (manual DELETE between SELECT and CAS).
  - TestReapStuckExecutions_HardDBErrorClassifiedAsErrored asserts a
    non-sentinel error bumps Errored, not RaceLost — the regression
    guard for the original A1 finding.

The wrapped errors keep the original message substrings ("not found",
"cannot transition from ...") so existing call sites that match on
substrings (e.g. internal/purchase/manager.go's
RecoverStrandedApprovals, internal/purchase/approvals_test.go) are
unaffected.

* fix(purchase/reaper): reject non-positive reap durations + stabilize env tests

ParseReapAfterFromEnv accepted "0s" and negative durations such as
"-5m" because time.ParseDuration treats them as syntactically valid.
Feeding either into ListStuckExecutions would be catastrophic:

  - 0s: the cutoff "updated_at < NOW() - 0s" matches every
    approved/running row regardless of age — the reaper would flip
    fresh, in-flight executions to failed.
  - -5m: the cutoff "updated_at < NOW() - (-5m)" == "updated_at <
    NOW() + 5m" matches rows from 5 minutes in the future, i.e.
    effectively every row. Same outcome.

The store-side guard added in the original PR (ListStuckExecutions
rejects olderThan <= 0) prevents the broken SELECT from executing,
but a misconfigured env value would still cause every sweep to fail
silently with a confusing store error rather than a WARN at the
config-parse boundary. Reject non-positive durations at the env-parse
layer with a WARN log + fallback to DefaultReapAfter so the misconfig
is visible in the Lambda's startup logs.

Addresses A2 + A3 (defense-in-depth) + A4 + A5 from CodeRabbit
round-1 review on #681.

Regression tests (A4):
  - TestParseReapAfterFromEnv_ZeroFallsBackToDefault — "0s"
  - TestParseReapAfterFromEnv_NegativeFallsBackToDefault — "-5m"
  - TestParseReapAfterFromEnv_GarbageFallsBackToDefault — explicit
    "garbage" case complementing the existing _Invalid* test, named
    per the CR finding so the regression intent is searchable.

handler_test.go (A5):
  - TestHandleScheduledTask's table-driven loop now calls
    t.Setenv("PURCHASE_APPROVED_REAP_AFTER", "") inside the per-case
    sub-Run. The two reap_stuck_purchases cases assert reapAfter ==
    10*time.Minute (the default); without an explicit env clear, an
    ambient PURCHASE_APPROVED_REAP_AFTER in CI/dev would silently
    make them flaky. t.Setenv auto-restores the prior value at
    cleanup.

The store-side olderThan <= 0 guard (A3) was already folded into the
earlier rebased commit "feat(config): add ListStuckExecutions store
method" so no additional store-side change is needed here — A3 is
covered by the existing defense-in-depth, and this commit completes
the env-side validation A2 calls for.
…purchase flow (closes #677) (#680)

* feat(azure/reservations): add shared two-step calculatePrice->purchase helper

Introduces providers/azure/services/internal/reservations with
DoPurchaseTwoStep, IsSessionTimeout, CalculatePriceURL, and PurchaseURL.

The helper centralises the two-call flow (calculatePrice mints a
session-bound reservationOrderId; purchase commits it) and retries up to
3 times on Session timed out 400s by re-running calculatePrice from
scratch. Refs #677.

* fix(azure): switch 7 reservation clients to calculatePrice->purchase flow

Fixes the 400 "Session timed out" errors on newer Azure SKU families
(e.g., Standard_B2ats_v2, Burstable v2) by replacing the broken single-step
PUT reservationOrders/{client-id} pattern with the required two-step flow:

  1. POST calculatePrice -- mints an Azure session-bound reservationOrderId
  2. POST reservationOrders/{azure-id}/purchase -- commits the order

All 7 affected clients now delegate to the shared reservations helper:
compute, database, cache, search, cosmosdb, managedredis, synapse.

Idempotency (Option B): Azure mints the order ID at calculatePrice time so
client-supplied IDs are no longer feasible. Caller-level deduplication uses
the purchase-automation tag already attached to each reservation request.
The `uuid` dependency is removed from all 7 clients.

Tests updated to the two-step mock pattern (calculatePrice + purchase HTTP
expectations). Total: 530 tests pass, go vet clean.

Closes #677
Refs #641, #667, #632, #669, #671

* fix(azure/reservations): address CodeRabbit review on PR #680

- Honor ctx cancellation during retry backoff in DoPurchaseTwoStep:
  replace time.Sleep with a select on ctx.Done() so a deadline or
  cancel propagates immediately instead of blocking for purchaseRetryDelay.

- Add tag-injection regression tests to search and managedredis suites:
  assert purchase-automation tag is present in the request body when
  opts.Source is set and absent when it is empty.

* refactor(azure/reservations): use bytes.NewReader over strings.NewReader(string(bs))

Address remaining CR nitpick from PR #680 pass 2. Wrapping a []byte in an
io.Reader via strings.NewReader(string(bodyBytes)) forces a copy of the
underlying bytes; bytes.NewReader takes the slice directly. Two callsites
fixed (doCalculatePrice + doPurchase). Tests + go build + go vet clean.

* fix(azure): require non-empty opts.Source before relying on tag-based dedupe

CodeRabbit pass-3 round-1 A1 (outside-diff Major): Azure's reservation API
mints the order ID server-side in calculatePrice, so client-supplied
GUIDs cannot deduplicate re-driven purchases. The only stable dedupe
signal CUDly controls is the purchase-automation tag derived from
opts.Source via applyPurchaseAutomationTag (or the inline equivalent in
compute / managedredis). When opts.Source is empty the tag is dropped
silently and a re-driven purchase cannot recognise the prior attempt,
producing duplicate reservations.

Add a guard at the entry of PurchaseCommitment in every Azure service
that uses the two-step calculatePrice -> purchase flow with tag-based
dedupe so the request fails fast before any cloud API call:

  if opts.Source == "" {
      result.Error = fmt.Errorf("purchase source is required for ...")
      return result, result.Error
  }

Services covered (7): cache, database, search, cosmosdb, synapse,
compute, managedredis. Savingsplans is intentionally excluded -- it
uses the Azure OrderAlias SDK with native DisplayName-based dedup and
takes opts.PurchaseOptions as `_`, so the same invariant does not apply.

Per-service regression tests pin the guard: every PurchaseCommitment
*_RequiresSource test asserts the call returns an error containing
"purchase source is required" and that no HTTP call is issued. Existing
tests that previously passed common.PurchaseOptions{} are updated to
pass Source: common.PurchaseSourceCLI so they continue to exercise the
intended code paths.

Production safety: the call sites in internal/purchase/execution.go
already populate opts.Source from m.normalizePurchaseSource(exec); the
guard only blocks the silent untagged fallback (when the upstream
NormalizeSource rejects a malformed value), which is exactly the case
the CR finding flags as a duplicate-purchase risk.

go build ./... + go vet ./... + go test ./providers/azure/... + tests
for callers (internal/purchase, cmd) all pass.

* test(azure): assert purchase-automation tag is sent in calculatePrice body

CodeRabbit pass-3 round-1 N1 (nitpick) + N2 (nitpick covered by same shape):
the existing TwoStepFlow / Success regression tests for the Azure
two-step calculatePrice -> purchase flow only assert HTTP call counts
and outcome -- they do not capture the calculatePrice body or check that
opts.Source has been embedded as a purchase-automation tag. Without that
assertion the dedupe invariant (the only stable signal Azure's
server-minted-order-ID API exposes) could regress silently: tag drop
elsewhere in the code (e.g. helper refactor, JSON shape change) would
not fail the suite.

Add a *_PurchaseCommitment_TagInjection test in each service that lacks
one today: cache, database, cosmosdb, compute. Each test:

  1. Wires a mock that captures the calculatePrice request body via
     io.ReadAll and rewinds it with bytes.NewReader so the production
     code sees an intact request.
  2. Drives PurchaseCommitment with common.PurchaseOptions{Source:
     PurchaseSourceWeb}.
  3. Unmarshals the captured body and asserts the top-level "tags" map
     contains common.PurchaseTagKey ("purchase-automation") -> source.

Services already covered by an equivalent test, intentionally untouched:

  - search:        TestSearchClient_PurchaseCommitment_TagInjection
  - managedredis:  TestPurchaseCommitment_TagInjection
  - synapse:       TestPurchaseCommitment_withSource (asserts both
                   "purchase-automation" and the source value are present
                   in the captured body)

savingsplans excluded -- different API (OrderAlias) with native
DisplayName-based dedupe; opts.Source is `_`-ignored at the entry
point.

Net effect: the tag invariant is now pinned by a dedicated regression
test for every Azure two-step service.
…dation (closes #675) (#676)

* fix(providers/azure): case-insensitive + whitespace-tolerant SKU validation (closes #675)

Replace exact `==` comparison in ValidateOffering with
`strings.EqualFold(sku, strings.TrimSpace(rec.ResourceType))` across
compute, database, cache, search, and cosmosdb service clients, matching
the pattern already applied to synapse (commit 1313332).

Add regression sub-tests for case-insensitive and whitespace-padded SKU
inputs to each affected service.

* ci: nudge pull_request sync to trigger workflows
… per recommendation (closes #679) (#682)

* feat(providers/azure): expand reservation recs into all-upfront + no-upfront variants

Azure reservation recommendations previously emitted a single entry with
PaymentOption="upfront" (bare, non-canonical). Users could not see the
no-upfront (monthly billing) alternative even though Azure supports both
billing plans at the same total price.

Changes:
- Add ExpandPaymentVariants helper in providers/azure/internal/recommendations/
  (alongside the existing Extract helper). Given a base recommendation it returns
  two copies: one with PaymentOption="all-upfront" (RecurringMonthlyCost=0) and
  one with PaymentOption="no-upfront" (RecurringMonthlyCost=CommitmentCost/termMonths).
  EstimatedSavings and SavingsPercentage are identical across both variants since
  Azure charges the same total price for both billing plans.
- Update all five service converters (compute, cache, database, cosmosdb, search)
  and the Azure Advisor path to call ExpandPaymentVariants, producing two recs per
  API recommendation instead of one.
- Rename the import alias to azrecs in each service client to resolve the local-
  variable shadowing that the recommendations package name previously caused.
- Replace bare "upfront" with canonical "all-upfront" in all Azure code paths.
- Add focused unit tests for ExpandPaymentVariants covering: zero on-demand guard,
  zero commitment cost, 1yr vs 3yr term-month math, pointer independence between
  variants, savings equality across variants, and shared-field pass-through.
- Add TestComputeClient_GetRecommendations_EmitsBothPaymentVariants to exercise
  the end-to-end fan-out path via a mock pager.
- Update existing converter tests to expect "all-upfront" instead of "upfront".

Closes #679

* fix: apply CodeRabbit auto-fixes

Fixed 2 file(s) based on 2 unresolved review comments.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

* refactor(azure/recommendations): extract page-processing helper to fit gocyclo budget

Pull the per-page advisor-record loop body out of getAdvisorRecommendations
into appendAdvisorPageRecs, following the same pattern used for
resolveAdvisorRegion. getAdvisorRecommendations drops from cyclomatic 12
to 4; appendAdvisorPageRecs lands at 5. No behaviour change.

* test(azure/search): update converter test for nil-guard contract

After the azrecs.Extract refactor, convertAzureSearchRecommendation
returns nil on unusable SDK payloads (matching the contract shared
by cache, compute, cosmosdb, and database converters). Split the
single test into a nil-guard test and a field-population test that
uses BuildLegacyReservationRecommendation, consistent with the
other service converter test patterns.

* refactor(azure): use Azure-canonical upfront/monthly payment terms in variant expansion

ExpandPaymentVariants was emitting "all-upfront"/"no-upfront" (AWS terms)
instead of the Azure-canonical "upfront"/"monthly". This made the emitted
values inconsistent with every other Azure code path and with the
AZURE_PAYMENTS constant in commitmentOptions.ts which defines "upfront" and
"monthly" as the canonical values.

Rename throughout:
- "all-upfront" -> "upfront" in ExpandPaymentVariants, all five service
  converter functions (compute/cache/database/cosmosdb/search), and the
  convertAdvisorRecommendation base rec.
- "no-upfront" -> "monthly" in ExpandPaymentVariants.
- Update all test assertions and fixtures to match.
- Update the ExpandPaymentVariants docstring precondition and value list.

The purchase-path switch cases in all Azure reservation clients already
accept both the canonical value and the former alias
(e.g. case "all-upfront", "upfront": and case "monthly", "no-upfront":),
so renaming the emitted values does not break the purchase path.

Add "upfront" -> "Upfront" to PAYMENT_DISPLAY_LABELS in recommendations.ts
so the Payment column in the recommendations table renders a human-readable
label for Azure "upfront" recs rather than falling back to the raw string.
"monthly" was already covered.

* fix(azure): guard EstimatedSavings when OnDemandCost is zero (CR #682)

Three findings from CodeRabbit review 4348367061:

1. `converter.go:287-292` (recurring Major): the OnDemandCost-zero guard
   protected savingsPct (divide-by-zero) but the savings = totalOnDemand -
   totalReservation line ran unconditionally. Result: when OnDemandCost
   was zero and CommitmentCost was non-zero, EstimatedSavings was emitted
   as a negative number (-CommitmentCost), contradicting the function
   contract. Move savings inside the guard so both fields stay 0 when
   on-demand is zero.

2. `converter_test.go`: add
   TestExpandPaymentVariants_ZeroOnDemand_NonZeroCommitment regression
   asserting EstimatedSavings == 0 (not -CommitmentCost) when OnDemand=0,
   Commitment>0. The existing _ZeroOnDemand_NoSavings test only covered
   the 0/0 sub-case and would have missed the regression.

3. `client_test.go`: add cost-field assertions to
   TestSearchClient_ConvertAzureSearchRecommendation_PopulatesAllFields
   so the WithCosts(120, 80, 40) fixture is actually checked through to
   rec.OnDemandCost / rec.CommitmentCost / rec.EstimatedSavings. Locks
   in the converter behaviour this PR depends on.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
…Price (closes #685) (#686)

* fix(azure/reservations): add SanitizeDisplayName helper for Azure DisplayName allowlist

Azure's Reservations API rejects DisplayName values that contain characters
outside [A-Za-z0-9_-] with HTTP 400 DisplayNameInvalid. Add a shared helper
in services/internal/reservations that replaces non-conforming characters
with '_', collapses consecutive non-conforming chars into a single '_', and
truncates to 64 characters.

* fix(azure): use SanitizeDisplayName + underscore-form displayName across 5 services (closes #685)

Replace space-containing displayName format strings with underscore-form
prefixes and pipe through SanitizeDisplayName in all 5 Azure service clients
(cache, compute, cosmosdb, database, search). Fixes HTTP 400 DisplayNameInvalid
errors from Azure's Reservations API when SKU names contain unusual characters.

Extends each service's PurchaseCommitment test with a body-capture assertion
that guards the [A-Za-z0-9_-]{1,64} allowlist going forward. The capture mocks
the calculatePrice POST so it works correctly with the two-step
calculatePrice->purchase flow established by #680.

* feat(azure/reservations): rich self-describing displayName for purchases

Extend the Azure DisplayName fix from #686 with a richer, parseable
format that mirrors the AWS RI CSV's ReservationId column shape so
operators can identify reservations in the Azure portal without
cross-referencing the purchase audit log.

Format (each segment separated by '-'):
  {svc}-{region}-{sku}-{count}x-{term}-{paymt}-{ts}-{rand}

  vm     -eastus    -Standard_D2a_v4-1x-1yr-allup -20260522T190000-a1b2c3d4
  redis  -westeurope-Premium_P1     -1x-1yr-allup -20260522T190000-a1b2c3d4
  cosmos -northeurope-EnableCassandra-1x-1yr-allup-20260522T190000-a1b2c3d4
  sql    -centralus -GP_Gen5_2      -1x-1yr-allup -20260522T190000-a1b2c3d4
  search -westus2   -standard2      -1x-1yr-allup -20260522T190000-a1b2c3d4

Output is always sanitized to Azure's [A-Za-z0-9_-]{1,64} allowlist.
When the composed string would exceed 64 chars (long region + long
SKU + high count), the builder progressively drops the optional tail
segments — random suffix first, then timestamp, then payment-option.
The service code, region, SKU, count, and term are NEVER dropped:
those are the high-signal segments operators rely on. With every
optional segment dropped, the joined required segments are truncated
as a last-resort safety net so the builder never emits >64 chars.

Previous format was `VM_Reservation_<SKU>` (and equivalents per
service) — operators reported these were indistinguishable across
purchases. The new identifier carries enough context to identify the
specific reservation purchase from the portal alone.

Implementation lives in
providers/azure/services/internal/reservations/displayname.go
alongside the existing SanitizeDisplayName helper, and is wired into
all 5 service clients (cache, compute, cosmosdb, database, search).
Time and random sources are injectable so tests can pin them; tests
cover happy-path, per-service shape, length-fit truncation at each
drop priority (random → timestamp → payment), payment normalization,
sanitization of dirty input, embedded-dash collapse, determinism,
and UTC normalization. The 5 per-service allowlist regression tests
are extended to also assert the service-code prefix and SKU presence
so accidental swaps in the literal strings are caught.

Refs #685

* fix(azure/reservations): repair progressive-drop loop in BuildDisplayName

Address CodeRabbit findings on PR #686.

Critical: BuildDisplayName's progressive-drop loop was broken. It called
SanitizeDisplayName(candidate) inside the loop before checking the
length cap. Because the sanitizer hard-truncates to 64 chars, the
condition len(candidate) <= 64 was vacuously true on iteration 1 and
the loop exited immediately, returning a mid-string-truncated full
format instead of cleanly dropping the random suffix, timestamp, and
payment segments. The existing tests passed only by coincidence (all
test inputs happened to land their required segments within the first
64 bytes of the full format).

Fix: check the pre-sanitized length to gate segment-dropping, and
sanitize only on the exit path. Each segment is already conformant via
normalizeSegment, so the in-loop sanitization was both wrong and
redundant.

Add three regression tests that fail against the pre-fix code:
- DropLoopActuallyDropsTimestamp: forces the loop to drop both random
  and timestamp; asserts the exact clean output and that no timestamp
  digits leak in.
- DropLoopActuallyDropsPayment: forces required-only fallback; asserts
  no payment fragment survives.
- ZeroNowReplacedByWallClock: asserts a zero-value Now is replaced by
  time.Now() rather than emitting the placeholder "00010101T000000".

Also address three minor doc/safety findings on the same file:
- Clarify isAllowedDisplayNameChar godoc so the [A-Za-z0-9-] description
  matches reality (underscores are passed through by the caller).
- Guard the zero-value Now case so production callers that forget to
  set it get a real timestamp instead of "00010101T000000".
- Document generateRandSuffix's silent fallback for non-nil src shorter
  than 4 bytes.

Verification: stash-revert the production code change with the new
tests in place; all three regression tests fail with exactly the
expected mid-truncated outputs. Restore the fix; all 68 reservations
tests pass and all 620 azure-package tests pass.
* fix(purchases): narrow Describe*Offerings + cap pagination (#688)

Root cause: EC2 findOfferingID was missing the offering-type filter,
causing AWS to return all payment variants, triggering indefinite
pagination through sparse empty pages and burning the entire Lambda
60s budget (execution 9336a111).

Changes per service (all 7 purchase-path clients):
- EC2: add OfferingType direct field + convertEC2PaymentOption helper;
  verify returned offering type before returning; cap at 5 pages;
  check ctx.Err() between pages; add timing log per page
- RDS: add pagination cap + ctx.Err check + variant verification +
  per-page timing log (already had all filter fields)
- ElastiCache: same as RDS
- MemoryDB: add OfferingType + Duration direct fields to narrow at
  API level; replace client-side matchesDuration/matchesOfferingType
  loop with server-side filters; cap + ctx.Err + variant verification
- OpenSearch: API has no filter fields; add cap + ctx.Err + defense-
  in-depth variant check after client-side matching
- Redshift: API has no node-type/payment filter fields; add cap +
  ctx.Err + unknown-type guard
- SavingsPlans: add pagination loop + cap + ctx.Err to lookupOfferingID
  (was a single non-paginated call)

MaxResults/MaxRecords: EC2/Redshift/RDS/ElastiCache APIs have a hard
max of 100 per page; no change from current value. MemoryDB/OpenSearch
were already at 100.

Regression tests (3 per service = 21 new tests):
- Pagination cap fires after maxOfferingPages empty pages + next token
- Wrong-variant offering rejected by defense-in-depth guard
- Happy path returns correct offering ID on first page

Refs #684 #667 #632

* fix(ec2): use typed top-level fields, not Filters[], on offering describe (#688)

Followup on the same #688 issue. The initial commit kept InstanceType,
ProductDescription, InstanceTenancy, duration, and OfferingClass packed
into DescribeReservedInstancesOfferingsInput.Filters[], with only
OfferingType promoted to the typed top-level field.

Live-AWS verification showed that the Filter[]-heavy shape is what causes
AWS to return empty pages with NextToken on sparse offering sets (e.g.
t4g.nano regional no-upfront convertible), walking until the Lambda
budget expires. The all-typed-fields shape returns the exact matching
offering immediately. Typed fields land on AWS's primary indices; Filter[]
is a generic fallback that does not always hit the same indices.

Changes:
- Move InstanceType, ProductDescription, InstanceTenancy, MinDuration,
  MaxDuration, OfferingClass to typed top-level fields on the input struct.
- Keep only scope as a Filter[] entry (no typed equivalent on the input).
- Use the typed enum constants (types.InstanceType, types.RIProductDescription,
  types.Tenancy, types.OfferingClassTypeConvertible) instead of stringly-typed
  filter values.
- Extract ec2OfferingQuery struct + buildEC2OfferingQuery / describeInputFromQuery
  helpers so findOfferingID stays under the gocyclo cap.
- Delete the now-unused buildOfferingFilters helper and the always-returns-
  "convertible" getOfferingClass helper.
- Change scanEC2OfferingPage from "hard error on variant mismatch" to "soft
  skip + log and continue scanning". With the typed OfferingType field set
  this should never fire; if AWS somehow returns a mismatched variant we
  want to keep looking on later pages rather than fail the rec.
- Update TestFindOfferingID_WrongVariantRejected to assert the new soft-skip
  behaviour (mismatched variant on the only page -> "no offerings found").
- Drop TestBuildOfferingFilters_LegacyCanonicalization: it exercised the
  deleted helper. The underlying canonicalizeEC2Tenancy / canonicalizeEC2Scope
  helpers retain their own direct unit tests.

* fix(purchases): soft-skip guards + nil/empty token parity (closes #688)

Three related cleanups on the offering-lookup pipeline:

1. EC2/RDS: normalize end-of-pagination check to guard against both nil
   and empty-string NextToken/Marker. AWS may return either form for the
   terminal page; only checking nil risks an extra empty-page request.
   EC2 uses a new isLastEC2Page helper so the intent is explicit.

2. OpenSearch: remove the post-match payment-option defense-in-depth
   guard from scanOpenSearchOfferingPage. matchesPaymentOption already
   filters mismatched variants in the same loop iteration; the redundant
   hard error can never fire in practice and makes the wrong-variant test
   harder to reason about. Soft outcome (loop exhausts -> "no offerings
   found") is safer for the caller.

3. Redshift: same rationale as OpenSearch -- remove the hard-error guard
   for unknown offering type from scanRedshiftOfferingPage.

Update TestFindOfferingID_WrongVariantRejected for both services to
assert the soft-failure shape (err != nil, id == "").

* fix(purchases): address CR #690 category A/B/C findings

Category A (4 sites, nil-or-empty pagination terminator):
- elasticache: guard result.Marker against both nil and "" in
  paginateElastiCacheOfferings; avoid the redundant page that a
  non-nil empty-string marker would otherwise trigger
- opensearch, rds, redshift: same nil-or-empty guard on their
  respective NextToken/Marker fields (EC2 was already fixed in
  5f8f269)

Category B (ElastiCache payment option coercion):
- convertPaymentOption now returns (string, error); unknown values
  return an explicit "unsupported ElastiCache payment option" error
  instead of silently mapping to "Partial Upfront"
- findOfferingID propagates the error before any pagination begins
- TestClient_ConvertPaymentOption updated to assert the (result,
  error) return shape and verify the unknown-input error path

Category C (unreachable variant-mismatch guard):
- OpenSearch: replace the matchesPaymentOption pre-filter continue
  with a direct gotPayment != wantPayment check that returns an
  explicit "payment option mismatch" error; guard is now reachable
- Redshift: replace matchesOfferingType pre-filter continue with a
  direct "Regular"/"Upgradable" enum check that returns an explicit
  "unexpected type" error on first mismatch
- Test comments and assertions updated to match the new behaviour
  (mismatch surfaces as a diagnostic error, not a silent skip)

CR #690 review round 1

* chore: restore Terraform lock files to feat branch state

The previous commit accidentally included Terraform provider constraint
changes unrelated to the #688 offering-lookup fix. Restore the three
lock files to their state on feat/multicloud-web-frontend so this
branch stays focused on the purchase-path narrowing changes only.
…closes #687) (#689)

* feat(common): add BuildReservationName for rich AWS reservation IDs

Add a shared builder that composes self-describing identifiers for AWS
reservation purchases. The format mirrors the Azure DisplayName format
introduced in #686 (cross-cloud symmetry for ops dashboards):

  {svc}-{region}-{sku}-{count}x-{term}-{paymt}-{ts}-{rand}

Output is always sanitized via SanitizeReservationID for the AWS
reservation-name allowlist ([a-zA-Z0-9-], dots in SKUs become hyphens)
and capped at 60 characters — the tightest length cap across the five
AWS services that accept a customer-supplied reservation ID/name (RDS
ReservedDBInstanceId, ElastiCache ReservedCacheNodeId, MemoryDB
ReservationId, OpenSearch ReservationName).

When composed length exceeds the cap, optional tail segments drop in
priority order: random suffix first, then timestamp, then payment
option. The service code, region, SKU, count, and term are never
dropped. In the pathological worst case the builder truncates the SKU
itself so count and term still survive the final assembly. Time and
random sources are injectable for deterministic tests.

Refs #687

  pkg/common/reservation_name.go       | +201
  pkg/common/reservation_name_test.go  | +281

* feat(aws): rich self-describing reservation names matching Azure (closes #687)

Wire common.BuildReservationName into the five AWS reservation purchase
paths. On the no-token CLI fallback path (i.e. when opts.IdempotencyToken
is empty), each service now composes a self-describing identifier of
shape:

  {svc}-{region}-{sku}-{count}x-{term}-{paymt}-{ts}-{rand}

For example, an OpenSearch RI purchase that previously yielded the
opaque `opensearch-Standard_X-1779468234` now becomes
`opensearch-us-east-1-r6g-large-search-3x-1yr-allup-20260521T002019-a1b2c3d4`
(truncated to 60 chars per service allowlist; see #687).

Per service:

* rds: deriveReservationID's fmt.Sprintf fallback -> BuildReservationName
  ("rds-" prefix). opts.ReservationID still wins when set; idempotency
  token still wins above both.
* elasticache: ReservedCacheNodeId fallback -> BuildReservationName
  ("cache-" prefix).
* memorydb: ReservationId fallback -> BuildReservationName ("memdb-").
* opensearch: ReservationName fallback -> BuildReservationName
  ("opensearch-").
* redshift: PurchaseReservedNodeOffering doesn't accept a customer-
  supplied node ID, so the rich descriptors travel as tags instead.
  tagReservedNode now writes Count / Term / PaymentOption alongside
  the pre-existing NodeType / Region / PurchaseDate / Tool tags so
  the reserved node is identifiable from the AWS console alone.

The token-based path (issue #641) is unchanged: IdempotentReservationID
still derives the deterministic ID from the token so a re-drive sends
the identical name and AWS rejects the duplicate server-side. The rich
builder applies ONLY to the no-token fallback.

Tests: one new per-service capture test asserts ^{svc-code}- prefix,
region presence, SKU (dots->hyphens) presence, count+term embedded, and
the 60-char cap. Redshift's test asserts the new tag keys
(Count/Term/PaymentOption) appear in the CreateTags input.

Closes #687

* fix(common): default zero Now in reservation name + assert payment drop in length test

Default f.Now to time.Now() when IsZero() so a missed caller field produces a
current timestamp rather than a year-0001 artifact. Deterministic tests are
unaffected because they set Now: testFixedNow explicitly.

Add assert.NotContains(t, got, "allup", ...) to
TestBuildReservationName_LengthFit_DropsPaymentLast so the test actually
verifies the payment segment is dropped in the tight-length regime, not just
that timestamp and random were removed.

Addresses CodeRabbit findings on PR #689.
) (#693)

GetReservationPurchaseRecommendation, GetSavingsPlansPurchaseRecommendation,
and DescribeDBMajorEngineVersions each silently discarded every page after
the first. For large payer accounts, CE returns recommendations across
multiple pages, so callers were seeing a fraction of the real result set.

All three call sites now drive a token-driven pagination loop. The existing
rate-limiter retry loop stays as the inner (per-page) loop. ctx.Err() is
checked at the top of each iteration (feedback_ctx_cancel_terminal). The
loop terminates on nil OR empty-string token (PR #690 parity). Exceeding
maxRecommendationPages/maxEngineVersionPages = 20 returns a diagnostic
error citing the service and issue #692. gocyclo relief: fetchSPAllPages +
fetchSPPageWithRetry extracted for SP; queryMajorEngineVersionsWithClient +
fetchMajorEngineVersionsForEngine extracted for RDS (all stay under cap 15).

Three new test groups: multi-page accumulation (3 pages), empty-token
terminal (no extra call), and pagination cap error. Updated
TestGetRecommendations_ContextCancellation to assert context.Canceled
directly -- the new ctx.Err() guard at the top of the pagination loop fires
before the rate-limiter so the error path changed correctly.
closes #695) (#696)

* feat(ri-exchange): target-offerings endpoint + UUID validation + AWS 4xx mapping (#695)

Three defects addressed:

Defect 1 (backend): add GET /api/ri-exchange/target-offerings backed by
ListTargetOfferings on ec2svc.Client. Uses typed-field approach from PR
#690 to enumerate valid convertible exchange targets for a source RI.
Introduces targetOfferingsEC2Client interface + factory in handler.go for
testability.

Defect 2: validate targets[].offering_id against a UUID regex at both the
getExchangeQuote and validateExecuteExchangeBody call sites. Returns 400
with a descriptive message when an instance type (e.g. "t3.medium") is
submitted instead of an offering UUID.

Defect 3: mapAWSExchangeError helper extracts smithy.APIError and maps
InvalidOfferingId/InvalidParameter/ValidationError/Invalid*.NotFound to
400 with the original AWS message; all other AWS errors stay 500. Applied
at the quote and execute call sites.

Tests: 27 new test cases covering the new endpoint (happy path, 404, auth),
UUID format rejection, and AWS 4xx/5xx error classification.

Closes #695

* feat(ri-exchange/ui): replace free-text offering ID with picker (#695)

Replace the "Target Offering ID" free-text input in the exchange modal
with a <select> picker backed by two sources:

- AWS-driven: GET /api/ri-exchange/target-offerings loads valid target
  offerings for the source RI from DescribeReservedInstancesOfferings
  (the new endpoint from the companion backend commit). Options are
  grouped under "AWS Target Offerings".

- CE Recommendations: alternativeTargets from reshape recommendations are
  listed as a second optgroup "CE Recommendations" with per-instance
  monthly cost displayed in the option label.

The picker drives a hidden <input type="hidden" class="modal-exchange-target">
that holds the resolved UUID. collectTargets() validates the value against
a UUID regex before submission, rejecting any non-UUID value with a
user-visible error message. This closes the path where a user could
accidentally submit an instance type (e.g. "t3.medium") instead of a UUID.

Tests: 5 new test cases covering the picker select rendering, CE options
population, AWS offerings after async load, select-to-hidden-input sync,
and non-UUID rejection at submission time. Updated 8 existing tests to
use offering_id UUIDs in place of instance types where they inject values
directly into the hidden offering input.

Closes #695
…ter Clear (#712)

Closes #700

Two sub-bugs in the column filter popover and the zero-row render path:

1. Clear-categorical handler called checkboxes.forEach + setRecommendationsColumnFilter
   directly, bypassing commitAll(). That skipped updateAllTriState() so the (All)
   checkbox remained checked or indeterminate after Clear. Fix: expose commitAll as
   commitAllRef in the outer scope and delegate to it from the Clear handler, keeping
   the same call path as the (All) checkbox itself.

2. renderRecommendationsList replaced the entire container (including <thead>) with a
   <p class="empty"> when the filter produced zero rows. Fix: always call
   buildListMarkup so <thead> is preserved, then inject a hint <td class="empty">
   into the empty <tbody> via DOM methods.

Tests: 2 new Issue #700 tests added (tri-state visual state + thead survival),
existing empty-state test updated to match new structure. All 1933 frontend tests pass.
…closes #564) (#697)

* ci(pre-commit): auth tflint via GITHUB_TOKEN + retry transient flakes (closes #564)

The `Run pre-commit` step in `.github/workflows/pre-commit.yml`
intermittently fails because `terraform_tflint`'s `tflint --init`
downloads ruleset plugins from the GitHub Releases API anonymously,
hitting the 60-requests-per-hour per-IP limit that the runner shares
with every other workflow running on the same NAT egress. Most
recently observed on PR #696
(https://github.com/LeanerCloud/CUDly/actions/runs/26411029351).

Two-part fix:

1. Expose secrets.GITHUB_TOKEN to the step. tflint reads GITHUB_TOKEN
   natively; with it set the limit jumps to 5000/hr per-token, which
   matters because the token is unique per workflow run, not per IP.
   This alone fixes the rate-limit class.

2. Wrap the step with nick-fields/retry@v3.0.2 (SHA-pinned per
   project policy) at 3 attempts with a 90s wait. The token fix
   eliminates the AWS-plugin rate-limit case; the retry handles the
   residual flakes (GitHub Releases availability blips, plugin
   download timeouts, etc.) so a transient failure no longer requires
   a manual rerun.

Acceptance criteria from #564:
- GITHUB_TOKEN exposed to the `Run pre-commit` step
- Three consecutive `pre-commit` runs in the same hour all complete
  without a tflint rate-limit failure (the retry-wrapper also catches
  the residual flakes the token cannot)

* ci(pre-commit): bump job timeout to accommodate retry budget (CR #697)

CR pointed out the retry policy was ineffective: `nick-fields/retry`
uses `timeout_minutes` PER ATTEMPT (not total), so with
`max_attempts: 3` and `timeout_minutes: 15` the worst case is
~48 min, but the job-level `timeout-minutes: 15` killed any retry
before attempt 2 could start.

Fix: tune both timeouts so the retry budget fits inside the job
budget with safety margin.

- per-attempt timeout: 15 -> 10 minutes (normal pre-commit runs
  take ~3-5 min, 10 is comfortable margin without inflating
  hang-detection latency)
- job timeout: 15 -> 35 minutes
  worst case = 3 attempts * 10 min + 2 * 90 s retry waits = 33 min,
  fits in 35 min with margin

Avoids CR's suggested 50-minute job cap because that would also
delay killswitch on a genuinely hung step. 35 min is the tightest
value that still lets all 3 attempts complete.
cristim and others added 27 commits June 8, 2026 06:36
…ghlight + safe-area padding (#989)

- viewport: add viewport-fit=cover for iOS notch support; drop spurious
  initial-scale=1.0 decimal (spec says 1)
- meta: add format-detection=telephone=no to stop iOS auto-linkifying
  numeric data in tables/charts
- css: -webkit-tap-highlight-color: transparent on button/a/[role=button]
  with explicit opacity:0.85 :active feedback instead of the blue flash
- css: safe-area-inset-top on .app-topbar and safe-area-inset-bottom on
  .app-shell so content clears notch and home indicator on iPhone
- css: touch-target minimums (min-height/width: 44px) on all interactive
  controls under @media (pointer: coarse) so desktop layout is unaffected;
  checkboxes/radios grow via label:has() instead to avoid glyph bloat
- html: inputmode="numeric" pattern="[0-9]*" on all 15 numeric inputs
  (coverage %, grace days, notification days, ramp params) to surface the
  numeric keypad on mobile
- html: autocomplete="email" on the notification email input
…unt validation log (#969)

* sec(api/plans): stop leaking raw account UUID + DB error in plan-account validation log

The GetCloudAccount DB-error path inside validatePlanAccountProviders
returned fmt.Errorf("accounts: failed to get account %s: %w", aid, getErr),
a non-ClientError that propagated to the router's
logging.Errorf("API error: %v"), leaking the raw account UUID and the
raw DB error string into logs (gap left by #946 which closed #944).

Wrap the error with the existing mapCreatePlanStorageError helper so
that: (a) a DB error emits a structured logging.Errorf carrying only
the derived provider list and account count (never the UUID), and
(b) returns a generic 500 ClientError ("failed to validate plan
accounts") with no raw UUID or DB string. The ErrNotFound sentinel
still maps to 404; the explicit (nil, nil) not-found path is unchanged.

Add logging.SetOutput as a test seam on the default logger and two
regression tests:
- TestValidatePlanAccountProviders_GetAccountDBError_NoPIILeak: asserts
  that a DB error yields a 500 ClientError whose message and emitted
  log contain neither the account UUID nor the raw DB error string.
- TestValidatePlanAccountProviders_AccountNotFound_Still404: guards that
  the not-found path still returns a 404 referencing the missing ID.

Also fixes the pre-existing build failure on the base branch
(Session.Role field removed in #940 but the test was not updated) and
gofmt drift in handler_purchases_guards_test.go.

closes #965

* fix(api/plans): stop leaking raw DB error in plan-fetch validation branch (refs #965)

getPlanForAccountProviderValidation returned fmt.Errorf("accounts: failed to
get plan: %w", err) on a non-ErrNotFound storage error. Because this is not
a ClientError, the router logged it via logging.Errorf("API error: %v"),
emitting the raw DB error string on every POST /plans and PUT /plans/:id/accounts
call that hits a DB failure (same endpoints as the sibling GetCloudAccount leak
closed by PR #969).

Fix: delegate to the existing mapCreatePlanStorageError helper (introduced by
PR #969 for the GetCloudAccount branch) so the response is a generic 500
ClientError and the log line contains only a static format string -- no raw DB
error, no plan UUID.

Adds two regression tests:
- TestGetPlanForAccountProviderValidation_GetPlanDBError_NoPIILeak: asserts
  the handler returns a 500 ClientError and that neither the client response
  nor the captured log buffer contain the raw DB error string; FAILS pre-fix.
- TestGetPlanForAccountProviderValidation_PlanNotFound_Still404: guards that
  ErrNotFound still maps to 404 after the refactor.

* sec(api): fix sibling PII leaks in plan/purchase handler paths (followup #965)

PR #969 closed the raw-UUID + raw-DB-error leak in
validatePlanAccountProviders and getPlanForAccountProviderValidation.
Adversarial review surfaced five sibling leak sites of the exact same
shape -- a non-ClientError wrapped with a raw account/plan UUID and the
underlying DB error -- that all propagate through the router's
`logging.Errorf("API error: %v")` path and leak both the UUID and the
raw DB error string.

Fixed sites (all use `mapCreatePlanStorageError` so a DB error becomes a
generic 500 ClientError, and a structured log line is emitted without
PII):

- handler_plans.go:209 updatePlan -- existingPlan fetch
- handler_plans.go:333 getPlanForPurchaseCreation -- plan fetch for
  createPlannedPurchases
- handler_plans.go:447 patchPlan -- plan fetch for PATCH
- handler_purchases.go:323/328 disablePlan -- GetPurchasePlan and
  UpdatePurchasePlan failure paths
- handler_purchases.go:2006 lookupContactEmail -- the Warnf used to log
  the raw account UUID and the raw err; now logs only the failure point
- handler_purchases.go:1955 gatherAccountContactEmails -- the wrap site
  used to re-leak the same account UUID through `%s: %w`; now returns a
  generic 500 ClientError (lookupContactEmail has already logged without
  PII)

Test changes:

- New: TestHandler_patchPlan_GetPlanDBError_NoPIILeak,
  TestHandler_updatePlan_GetPlanDBError_NoPIILeak,
  TestDisablePlan_GetPlanDBError_NoPIILeak,
  TestDisablePlan_UpdatePlanDBError_NoPIILeak,
  TestLookupContactEmail_GetAccountDBError_NoPIILeak,
  TestGatherAccountContactEmails_DBError_NoPIILeak -- each asserts the
  ClientError is 500 (server-side fault), the returned message contains
  neither the raw UUID nor the raw DB error string, and (where the path
  logs via the default logger) the captured log buffer contains neither
  either. Pattern mirrors the existing
  TestValidatePlanAccountProviders_GetAccountDBError_NoPIILeak.

- Updated: TestHandler_patchPlan_NotFound,
  TestHandler_createPlannedPurchases_PlanNotFound -- assertions moved
  off the legacy "failed to get plan" wrapper string to the new
  contract: 500 ClientError without the raw plan UUID. Both tests
  return a plain `errors.New("not found")` from the mock (not
  config.ErrNotFound) so they exercise the 500 branch, not the 404
  branch.

- Updated: TestHandler_resolveApprovalRecipients_LookupErrorPropagates
  -- the test's stated intent (no silent globalNotify fallback) is
  retained via the require.Error guard; the ErrorIs(transient) assertion
  was guarding the implementation detail that broke the no-PII-leak
  contract. Replaced with 500-ClientError + no-UUID + no-raw-DB-error
  assertions consistent with the rest of the issue #965 family.

Full internal/api suite: 1474 passed; full short suite: only two
pre-existing internal/auth failures unrelated to this PR.

refs #965

* sec(api/purchases): keep lookupContactEmail failure at error level

The GetCloudAccount failure in lookupContactEmail was logged at Warnf,
but its caller gatherAccountContactEmails maps the error into a generic
500 ClientError. The router returns ClientErrors via the branch that
skips logging.Errorf("API error: %v"), so this log line is the only
error-level breadcrumb for the resulting user-visible 500. Downgrading
it to Warnf weakened error-level alerting for exactly the outage this
change set is preserving visibility for.

Restore Errorf while keeping the message generic (no raw account UUID,
no raw DB error string) so the PII-leak fix is unaffected. Extend the
regression test to assert the [ERROR] severity and the generic message;
it fails pre-fix (emits [WARN]) and passes after.

Addresses CodeRabbit review on PR #969.
The production code in `recommendations.ts:3015-3023` already renders the
action-box summary as a min/max range via `formatSavingsRange(min, max)`,
which collapses to a single value when min == max. This PR adds the
regression tests that pin that behaviour for two cases the previous test
suite did not cover:

- Multi-cell selection with a per-cell range: 2 cells, one with two term
  variants (1yr/3yr), one single-variant. Asserts the summary shows
  `$300 - $500/mo`, `$1,100 - $1,800 upfront`, `2 cells`.
- Multi-cell selection with identical totals: collapses to `$400/mo`
  without a range separator (per formatSavingsRange's same-value branch).

Closes #281. No production changes -- the range render already shipped via
the broader pageLevelRange + formatSavingsRange work; this PR just stops
that contract from regressing silently.
Bump .rec-cell-chevron from font-size 1.2em/24px target to 1.5em/32px,
add focus-visible highlight, and add two tests: one asserting the
chevron is a <button> outside .rec-cell-summary-content (keyboard
accessible without extra keydown wiring), one confirming no-overlap with
the summary content cell.
…435) (#526)

AllowOrigins: replace hardcoded "*" with !If [HasCustomDomain, ...] so
when DashboardDomainName + DashboardHostedZoneId are set the CORS origin
is locked to "https://${DashboardDomainName}". The wildcard fallback
only applies when no custom domain is configured.

AllowMethods: replace "*" with explicit list [GET, POST, PUT, DELETE]
matching the Terraform lambda module (modules/compute/aws/lambda/main.tf).

AllowHeaders: replace "*" with explicit list [Content-Type, Authorization,
X-CSRF-Token, X-Session-Token] matching the Terraform lambda module.

Closes #435.

Note: Terraform tfvars wildcard origins (#442) are handled by PR #519
which adds a validation block rejecting ["*"] in the lambda module.
…ld frontend (#1116)

Seven jest suites (plans, history-retry-button, history-approve-button,
history-approval-queue, history-cancel-button, history-cancel-permissions,
allowed-accounts) each had a jest.mock('../utils', ...) factory that
omitted escapeHtmlAttr. When plans.ts and history.ts call it at render
time the mock throws "TypeError: (0 , utils_1.escapeHtmlAttr) is not a
function", causing cascading test failures.

Added the same escapeHtmlAttr stub already present in history.test.ts
(encodes & < > " ') to each affected factory. No logic changes.

Closes #1115
Move kms:GetKeyPolicy from KMSCreateAndRead (Resource="*", account-wide)
to a new KMSReadTaggedOnly statement in policy_compute_b.tf gated on
aws:ResourceTag/Project=CUDly. This prevents a compromised deploy token
from reading key policies of unrelated KMS CMKs in the account
(key-policy reconnaissance), while still allowing the deploy SA to
read policies of CUDly-owned keys (which are tagged at CreateKey time
by the Terraform aws_kms_key resource).

Split to policy_compute_b.tf because policy_compute.tf is at the
AWS 6144-char managed-policy limit.

Closes #427.
Add a shared "Amortize upfront over term" checkbox that folds the
one-time upfront cost evenly across the commitment term, letting every
payment type (No Upfront / Partial / All Upfront) be compared on a
total-cost-per-month basis.

New shared helper `amortizedMonthly(monthlyCost, upfrontCost, termYears)`
in utils.ts returns `monthlyCost + upfrontCost / (termYears * 12)`.
Guards: term <= 0, non-finite term, or null/non-finite upfront all
fall back to monthlyCost unchanged (no NaN/Infinity).

Toggle state lives in `state.ts` (`getAmortizeUpfront` /
`setAmortizeUpfront` / `subscribeAmortizeUpfront`) persisted under
`cudly.amortizeUpfront` in localStorage. Same pattern as
`getCostPeriod`/`setCostPeriod`. All views read the same key so they
stay in sync across tab switches and page reloads.

Views updated:
- history.ts: checkbox injected into #history-controls and
  #purchases-approval-queue-section; both renderHistoryList and
  renderApprovalQueue use amortizedMonthly when the toggle is on;
  column header updates to "Monthly Cost (amortized)" when active.
  subscribeAmortizeUpfront re-renders both tables without a refetch.
- inventory.ts: checkbox injected into .section-header-actions;
  buildCommitmentRow uses amortizedMonthly on the monthly_cost cell;
  last-fetched commitments cached so subscribeAmortizeUpfront can
  re-render without an API round-trip.
- approval-details.ts: "Monthly cost" column added to the per-rec
  table; header built via DOM methods (no innerHTML); column label
  updates to "Monthly cost (amortized)" when toggle is on. Modal
  reads state at render time, no re-render needed.

plans.ts and dashboard.ts have no monthly_cost field on their
displayed data types (PlannedPurchase / dashboard summary), so the
toggle has no effect there and no changes were needed.

Tests: 9 new `amortizedMonthly` unit tests in utils.test.ts covering
all payment types and every guard path. Existing approval-details,
history, and inventory tests updated for the new column and the new
state mock entries. Full suite: 2442 pass, 2 pre-existing timezone
failures in utils.test.ts (unrelated).

Closes #1112
)

plans-range-validation.test.ts used a closed mock factory for ../utils
that listed every util function explicitly but omitted escapeHtmlAttr,
which was added to utils.ts after the test was written.  The missing
stub caused a "not a function" TypeError whenever plans.ts called
escapeHtmlAttr during renderPlannedPurchaseRow / renderPlanCard,
making the entire plans-range-validation suite fail in CI.

Added the stub with a faithful HTML-attribute escape implementation
(mirrors the real utils.ts implementation) so the closed factory
remains exhaustive.

Closes #1115

Co-authored-by: Claude <noreply@anthropic.com>
…controls (#1118)

The recommendations filter-status bar is flex/align-center, but the
`.cost-period-selector-wrapper` span had no CSS, so the small "Show costs:"
label aligned to the baseline of the taller base-styled select and appeared
stacked above it. Make the wrapper an inline-flex (label + select vertically
centered) and size the select to match the bar buttons, so the count,
Expand all, "Show costs: <period>" and Columns all sit on one line.
… fan-out (closes #197) (#838)

* feat(recommendations): per-rec payment seeding in multi-account buckets (closes #197)

Multi-account fan-out buckets now expose per-rec Payment dropdowns so
each rec gets its own override-seeded default instead of silently
falling back to the toolbar value for the whole bucket.

Changes:
- FanOutBucket gains optional perRecPayments map (rec.id -> payment),
  populated only for buckets spanning 2+ distinct cloud_account_ids.
  Each rec's default is seeded from its account's AccountServiceOverride
  (matching provider+service, normalized, support-checked), falling back
  to the bucket-level payment.
- openFanOutModal now pre-fetches overrides for ALL accounts across all
  buckets (was: only sole-account buckets). Single-account buckets are
  unaffected; the extra fetch just removes the narrowing guard.
- renderFanOutBucketSection renders a per-rec list with individual
  Payment selects (.fanout-per-rec-payment) when perRecPayments is set.
  Each select carries data-rec-id so tests and app code can target it.
- handleFanOutExecute (app.ts) uses perRecPayments.get(r.id) for the
  POST payment field, falling back to b.payment for single-account
  buckets and for any rec not in the map.
- getFanOutBuckets deep-copies perRecPayments to prevent callers from
  mutating module state.
- Tests (h) + (i): assert per-rec map seeding from overrides and that
  dropdown changes propagate to module state.

* fix(recommendations): keep bucket-level Payment effective for fan-out rows

openFanOutModal eagerly wrote every rec of a multi-account bucket into
perRecPayments, and the execute path prefers that map over b.payment via
`perRecPayments.get(id) ?? b.payment`. Once the map was fully populated the
fallback was dead, so changing the bucket-level Payment dropdown only mutated
b.payment while unedited rows still posted their stale per-rec value. The
visible bucket-level control was a no-op for any row that originally followed
the bucket default.

Treat perRecPayments as an explicit-override set: seed only recs whose payment
differs from the bucket default, drop a per-rec entry when the user selects the
current bucket default again, and re-sync visible per-rec selects (that have no
override) when the bucket-level dropdown changes. Skip the re-sync for rows
whose service does not support the new payment so the shown value never diverges
from the POST payload.

Adds a regression test asserting the bucket-level change propagates to the
resolved payment for non-overridden recs, and updates the per-rec test to the
override-only contract. Both fail pre-fix.
…874)

Column was already being rendered with formatDateTime (absolute). Switch
the "Last used" cell to formatRelativeTime with the ISO timestamp on a
title attribute for hover. Null last_used_at renders "Never" unchanged.

Backend last_used_at was already present: column in migration 000001,
UpdateLastUsed fires async in ValidateUserAPIKey, and ListUserAPIKeysAPI
already propagates the field. Migration 000065 not needed.

Add round-trip tests to service_apikeys_api_test.go asserting the field
is non-nil for used keys and nil (not zero-time) for never-used keys.
Add frontend tests for Never and for the relative+title rendering.
…1119)

Two tests fail on a pristine checkout of feat/multicloud-web-frontend,
turning base CI red and marking every open PR UNSTABLE:

1. internal/server TestAuthServiceAdapter_ValidateCSRFToken
   The mock service was built via NewService without a CSRFKey, so it
   generated a random ephemeral key each run. The test then asserted a
   literal "csrf-token" validated, which the HMAC derivation can never
   match. Make the test hermetic: pin a deterministic key with the new
   auth.TestCSRFKey() and assert the real derived token via
   auth.DeriveTestCSRFToken(). The CSRF validation logic is untouched.

2. frontend utils.test.ts (formatDate / getDateParts)
   A date-only string ("2024-03-15") was parsed as UTC midnight, so
   toLocaleDateString / getDate() shifted it to the previous day in any
   browser west of UTC. This is a real off-by-one for users (purchase
   scheduled_date, reservation end_date) and made the test pass only in
   UTC. Fix at root: parse bare "YYYY-MM-DD" as local midnight via a new
   parseDateInput helper (with round-trip validation so out-of-range
   digits like "2024-13-45" still reject). Full timestamps and Date
   objects are left untouched. Add timezone-invariance regression tests.

Issue #993 (TestLogin_WithMFA_NoSecret) no longer reproduces on the
current base: service.go already routes every login-failure path through
the single genericLoginError constant, so unknown-email and wrong-password
return a byte-identical message (no enumeration oracle). Closing it here
with the other base-CI repairs.
* test(frontend/auth): cover cross-tab logout sync (closes #493)

PR #487 added installStorageListener() in api/client.ts as the
compensating control for moving tokens off sessionStorage. The
behaviour is implemented but had no automated regression test; if the
listener's key filter, newValue check, or idempotency guard ever
regresses, the SECURITY claim in client.ts becomes silently false.

Add crosstab-session.test.ts covering:
- cleared auth keys (authToken, apiKey, csrfToken) zero in-memory state
  and call window.location.reload
- non-auth keys are ignored
- partial updates (newValue !== null, e.g. token refresh) are ignored
- initAuth() is idempotent: storage listener installs exactly once

Each test loads api/client in jest.isolateModules() so the
storageListenerInstalled guard is fresh per test.

* test(frontend/auth): patch only window.location.reload (CR nit)

Capture the original window.location reference before tests and restore
it in an afterEach, so each test starts from a clean Location object.
This implements CR's teardown recommendation while keeping the
Object.defineProperty(window, 'location', ...) approach required by jsdom
(window.location.reload is non-configurable and cannot be patched directly).

* test(frontend/auth): clear localStorage between cross-tab logout tests (CR nit)

initAuth() seeds in-memory auth from localStorage, so without clearing it
between tests a stale token from one case could leak into the next. Extend
the existing afterEach to call localStorage.clear() for proper isolation.

* test(frontend/auth): address CR nits on cross-tab logout test

- replace the out! non-null assertion in loadAuth() with an explicit
  guard that throws a clear error if isolateModules never ran
- add a regression case for the !e.key guard in installStorageListener:
  a null-key storage event (localStorage.clear()) must not clear
  in-memory auth or reload
…290) (#804)

* feat(purchases): in-app revocation within free-cancel window (closes #290)

POST /api/purchases/{purchaseId}/revoke: Azure returns via armreservations
(CalculateRefund + Return two-step), 7-day window. AWS/GCP return 422 and
the History UI hides the button for those providers.

- DB: migration 000057 adds revocation_window_closes_at, revoked_at,
  revoked_via, support_case_id columns to purchase_history
- RBAC: revoke-own / revoke-any actions, revoke-own granted to all users
  by default; ownership is via account-access (history rows pre-date
  created_by_user_id)
- Backend: fail-closed nil-auth guard, idempotency via revoked_at IS NULL,
  Azure CalculateRefund->Return with test-injectable client interfaces
- Frontend: canRevokeCompletedRow gate (azure + window open + not yet
  revoked), Revoke button in History action cell, confirm dialog + toast
- All existing mock stores updated with GetPurchaseHistoryByPurchaseID
  and MarkPurchaseRevoked; auth permission-count tests updated to 12

* fix(ci): extract provider dispatch and account check to reduce revoke complexity; regenerate permissions on PR #804

* fix(api/purchases): align revoke admin check with group-only authz (#907)

Rebase onto feat/multicloud-web-frontend brought in #907 (group-membership-
only authorization, no role field on Session). The revoke handler still
gated admin via `session.Role == "admin"`, which no longer compiles since
api.Session has no Role field. Replace with the same two-track pattern the
sibling authorizeSessionCancel / authorizeSessionApprove already use:

  - Stateless admin API key short-circuits via `session.UserID ==
    apiKeyAdminUserID` (no DB row exists to resolve permissions from).
  - Group-based admins fall through to HasPermissionAPI; the {admin, *}
    wildcard in DefaultAdminPermissions matches revoke-any:purchases there.

Drop the dead `Role` field from the revoke test sessions; the admin test
now pins the apiKeyAdminUserID short-circuit, and the existing RevokeAny
test already covers the group-admin path via HasPermissionAPI.

Also fold in the trailing pre-commit fixes that were red on the previous
push:

  - gofmt: realign struct-field padding in TestRevokePurchase_AzureReturnClientError.
  - go mod tidy: promote armreservations from indirect to direct (the
    revoke handler imports it directly).

Free-cancel window enforcement (AzureRevocationWindowDays + windowClosesAt
check) and revoke-call idempotency (early return when record.RevokedAt is
already set) are unchanged.

Refs #290

* fix(api/purchases/revoke): fail-closed on nil account + return error on DB persist failure

Two security fixes from CR findings:

1. checkRevokeOwnAccountAccess: return 403 when CloudAccountID is
   nil/empty instead of allowing the revoke. Without an account
   association, ownership cannot be verified for revoke-own callers.
   Add regression test for this fail-closed contract.

2. revokeAzurePurchase: return error when MarkPurchaseRevoked fails
   after a successful Azure return. The previous log-and-continue
   left the DB unmarked, breaking idempotency on retries. The error
   message notes that the refund was submitted so operators can
   investigate without re-issuing.

* fix(purchases/revoke): populate revocation window at write path so the button works

The Revoke button was shipped but dead: RevocationWindowClosesAt was never
populated when a completed purchase was written, so the frontend gate
canRevokeCompletedRow (which bails on a missing revocation_window_closes_at)
hid the button on every real row.

- Stamp RevocationWindowClosesAt in the real write path
  (purchase.savePurchaseHistory): Azure = Timestamp + 7 days (the free-cancel
  window), nil for AWS/GCP (out of Phase-1 scope), via a new shared
  config.RevocationWindowClosesAtFor helper + config.AzureRevocationWindowDays
  constant that is now the single source of truth for the window length.
- Make the backend window check (revokeAzurePurchase) read
  RevocationWindowClosesAt as the source of truth, falling back to recomputing
  from Timestamp only for legacy rows written before the column was populated.
- Reject an order-only ARM path (empty reservationID) in callAzureReturn
  rather than submitting an empty Return to Azure.
- Tests: backend asserts savePurchaseHistory stamps the window for Azure and
  leaves it nil for AWS/GCP; handler asserts the stamped window drives the
  deny decision and that an empty reservationID is rejected; FE test asserts
  the Revoke button shows for a completed Azure row with a populated
  revocation_window_closes_at and is hidden without it (plus closed-window,
  already-revoked, non-Azure, and anonymous cases).

* docs(auth/revoke): correct revoke-own doc to account-scope reality (#950)

The ActionRevokeOwn doc comment claimed "Own" means created_by_user_id
matches the session user, but checkRevokeOwnAccountAccess actually enforces
ACCOUNT scope (GetAllowedAccountsAPI), because purchase_history rows pre-date
created_by_user_id and have no reliable per-creator attribution.

Fix the doc comments to describe the account-scope behavior as implemented;
the authz model itself is unchanged. Whether revoke-own should instead be
creator-scoped is a product decision tracked in issue #950, noted inline in
both the auth constant doc and the handler check.

* feat(purchases/revoke): Gmail-style pre-fire delay unifies revoke across providers

Adds status=scheduled to purchase_executions so the approval step defers
the cloud SDK call by purchase_delay_hours. A scheduled execution can be
cancelled at $0 via the existing revoke endpoint before the scheduler fires.

Two-tier revoke path:
- status=scheduled: CancelExecutionAtomic (CAS) transitions to cancelled;
  no cloud API call, returns 200 with explicit "no cost incurred" message.
- status=completed (existing): provider SDK call via purchase_history path.

revokePurchase now tries GetExecutionByID first; falls through to the
existing purchase_history lookup when no scheduled execution is found.
authorizeSessionRevokeExecution mirrors authorizeSessionRevoke but uses
CreatedByUserID (not CloudAccountID) for revoke-own scoping.

FireScheduledDelayedPurchases (purchase.Manager) drives the scheduler tick:
queries GetScheduledExecutionsDue, CAS-transitions scheduled->approved,
stamps ApprovedBy="scheduler", calls executeAndFinalize. Registered as
TaskFireScheduledPurchases in the Lambda task dispatcher.

Also folds in three CodeRabbit findings from PR #804 review f9c66c1e:
- Remove unused mockAuth in two AzureCalcRefundClientError/ReturnClientError tests
- Add idempotent audit CHECK constraints to migration 000065
  (revoked_via, support_case_id, revoked_at/revoked_via pair)
- Frontend regression test: legacy blank-status Azure rows stay revocable
- analytics/collector_test.go: hook-backed GetPurchaseHistoryByPurchaseID
  and MarkPurchaseRevoked mocks (no more hardcoded no-ops)

* refactor(api/config): extract helpers to keep revoke+approve+email+scan under gocyclo limit

revokePurchase -> loadAndRevokePurchaseHistory (handler_purchases_revoke.go)
approvePurchase -> approveViaToken (handler_purchases.go)
sendPurchaseScheduledEmail -> buildScheduledEmailData (handler_purchases.go)
scanExecutionRows -> applyNullTimesToExecution (store_postgres.go)

Also applies gofmt alignment fix to internal/analytics/collector_test.go.

* refactor(test/mocks): consolidate MockConfigStore into shared internal/mocks (#804 cleanup)

Three per-package MockConfigStore definitions (internal/api, internal/purchase,
internal/scheduler) were duplicating ~450 LOC each every time a new StoreInterface
method was added. Replace all three with a type alias pointing at internal/mocks.MockConfigStore.

Changes to internal/mocks/stores.go:
- Add Fn-override fields imported from the api and purchase local mocks
  (GetCloudAccountFn, GetPurchasePlanFn, SetPlanAccountsFn, SavePurchaseExecutionFn,
  GetPlanAccountsFn, and 6 others) so callers that used those fields keep working
  without changes.
- Add isExpected guards to recommendation-cache and RI-utilization-cache methods
  and to CancelExecutionAtomic so existing tests that call these without explicit
  On() expectations don't panic (matches the "opt-in" pattern the per-package mocks
  used via hasRecExpectation / hasExpectation).
- Add 8 interface methods that were missing from the shared mock but present in all
  per-package variants: GetExecutionsByStatuses, GetPlannedExecutions,
  GetStaleApprovedExecutions, ListStuckExecutions, GetScheduledExecutionsDue,
  MarkCollectionStarted, ClearCollectionStarted, StampRIExchangeApprovedBy.
- Add compile-time check: var _ config.StoreInterface = (*MockConfigStore)(nil).
- Promote GetGlobalConfig and GetPurchasePlan to return sensible defaults when no
  expectation is registered (matches the api-package behaviour these tests relied on).

Per-package files reduced to a single type alias line each. The scheduler test also
had stray suppression/Tx method stubs added in a later commit; those are removed
because the methods already live on the shared mock.

Intentionally left local (incompatible shape or semantics):
- internal/analytics/collector_test.go: mockConfigStore (lowercase) -- hook-field
  only pattern with no testify embedding; analytics-specific subset; different name.
- internal/server/test_helpers_test.go: mockConfigStoreForHealth -- all-zero-value
  stubs for health check tests; distinct type name; no testify embedding.
- internal/server/handler_ri_exchange_test.go: mockConfigStoreForExchange -- test-
  specific struct overriding a handful of methods.
- internal/server/handler_coverage_test.go: mockConfigStoreForExchange{Complete,
  Fail,Stale} -- per-scenario stubs with distinct type names.

Net LOC: +274 insertions / -1601 deletions (-1327 net across 4 files).

* fix(api/purchases/revoke): use status='scheduled' CAS so pre-fire revoke isn't dead

The pre-fire delay revoke path called CancelExecutionAtomic, whose SQL
guard is `WHERE status IN ('pending','notified')`. A status='scheduled'
row never matches, so the CAS returned zero rows on the happy path --
the handler then surfaced 410 "revocation window has closed" even when
the window was wide open. The user would pay the cloud charge AND see a
"cancelled" attempt in the UI -- the worst possible outcome.

The bug was hidden by mocks defaulting CancelExecutionAtomic to
(true,"cancelled",nil), so every test in the scheduled-revoke suite was
green against the wrong SQL. No pgxmock test exercised the WHERE clause.

Fix: introduce CancelScheduledExecutionAtomic with the correct
`WHERE status = 'scheduled'` guard and switch the handler to it. The two
CAS variants are kept distinct on purpose -- the scheduled-revoke flow
surfaces 410 ("scheduler already fired") on race-loss, while the
pre-purchase cancel flow surfaces 409 ("not pending"). Sharing one method
would conflate the two race outcomes.

Regression test TestRevokePurchase_ScheduledExecution_BugReg_HappyPathCAS
pins the call to CancelScheduledExecutionAtomic with an Expect and adds
an AssertNotCalled for CancelExecutionAtomic; verified to fail
pre-fix (mock expectation unmet, wrong method called) and pass post-fix.

Touches:
  internal/config/{store_postgres,interfaces}.go -- add method + comments
  internal/api/handler_purchases_revoke{,_test}.go -- switch call site + reg test
  internal/mocks/stores.go -- mock the new method (default happy path)
  internal/server/test_helpers_test.go,
  internal/analytics/collector_test.go -- satisfy StoreInterface

* fix(frontend/history): gate Revoke button on revoke-{any,own} permission

canRevokeCompletedRow only checked getCurrentUser() truthiness, so the
inline Revoke button rendered for every signed-in user regardless of
the revoke-any / revoke-own grant. The backend correctly 403s, but the
UX-vs-RBAC drift is exactly what PR #995 caught for approve / delete
on the same page.

The peer predicates (canCancelPendingRow, canApprovePendingRow,
canRetryFailedRow) already check canAccess; canRevokeCompletedRow now
does too. Verbs match the backend handler one-to-one:

  - admin or revoke-any:purchases -> always allowed
  - revoke-own:purchases -> allowed (account-scope enforced server-side)
  - anything else -> hidden

Adds revoke-own / revoke-any to the closed Action union in
permissions.ts so a future drift becomes a compile error at the
canAccess call site.

Adds a regression test that mocks getCurrentUser with an
effectivePermissions set lacking revoke-* and asserts the button is
hidden; verified to fail without the canAccess gate and pass with it.
Also corrects the existing ADMIN_USER fixture to use the real
ADMINISTRATORS_GROUP_ID GUID -- the prior 'administrators' label was
inert because the new canAccess fallback drives off isAdmin() which
checks GUID membership.

* test(frontend/permissions): update USER_PERMS expected set for revoke-own

PR #804 added 'revoke-own:purchases' to USER_PERMS in
permissions.generated.ts but missed updating the user-role expected
list in __tests__/permissions.test.ts. The test asserts perms.size
matches expected.length, so the missing entry surfaced as
"Expected 11, received 12" after the addition.

This is the same scope as the parent permissions add -- not a separate
permission grant, just the test parity update PR #804 should have
included alongside the original add.

* fix(server): table-drive ParseScheduledEvent + cover scheduled-fire sweep

The fire_scheduled_purchases case pushed ParseScheduledEvent's cyclomatic
complexity to 11, tripping the gocyclo<=10 pre-commit hook and leaving the
PR UNSTABLE. Replace the action switch with a package-level lookup table so
the complexity no longer grows with the task list; adding a task type stays a
one-line change.

Also close the test gap on the Gmail-style pre-fire delay scheduler sweep:
FireScheduledDelayedPurchases / fireOneDue had no unit coverage of their
result accounting or CAS-race classification (only the dispatch wiring was
mocked). Add scheduled_fire_test.go mirroring reaper_test.go:

- no-due-rows and list-error paths
- CAS lost to a concurrent revoke -> RaceLost, not Errored, and no SDK fire
  (the safety property that prevents double-charging a revoked purchase)
- row-vanished (ErrNotFound) -> RaceLost
- hard DB error on the CAS -> Errored (per-row isolation, sweep still succeeds)

Each race/error test fails if the classification regresses (verified by
flipping fireOneDue's return). Add the fire_scheduled_purchases case to the
ParseScheduledEvent table test so the new action is positively asserted.

* fix(migrations): renumber 000068 -> 000070 to deconflict (refs #290)

PR #808 keeps 000068 and PR #847 takes 000069; bump this branch's
purchase_history_revocation migration to 000070 to avoid conflicts
on feat/multicloud-web-frontend.

* fix(scheduler): wire FireScheduledDelayedPurchases tick (CRITICAL: pre-fire delay branch was non-functional)

- Add testify-based FireScheduledDelayedPurchases to scheduler's MockPurchaseManager
  so it records calls and satisfies mock.AssertExpectations.
- Add TestSchedulerManagerInterface_FireScheduledDelayedPurchasesWired: compile-time
  guard that ManagerInterface exposes the method + call-recording smoke test.
- Add TestFireScheduledDelayedPurchases_EndToEnd in scheduled_fire_test.go:
  skipped placeholder (with documented skip reason + issue ref) for the full
  provider-stub e2e once #1005 4-eyes lands.
- Add TestFireScheduledDelayedPurchases_DelayPathNotSilentNoOp: compile-time
  guard that FireScheduledDelayedPurchases exists on Manager.

The server/handler_test.go "fire_scheduled_purchases success" and
"fire_scheduled_purchases propagates error" cases cover the full dispatch
chain (ScheduledTaskType -> handleFireScheduledPurchases -> Purchase.FireScheduledDelayedPurchases).

* fix(api/purchases): CAS-guard scheduleApprovedExecution to prevent silent revoke loss

Replace the blind SavePurchaseExecution write in scheduleApprovedExecution with
a two-step CAS pattern:
  1. TransitionExecutionStatus(pending|notified -> scheduled) -- atomic CAS.
  2. Stamp ScheduledExecutionAt + ApprovedBy on the returned post-CAS row.
  3. SavePurchaseExecution to persist the stamps.

Before this fix a concurrent Cancel that landed between the approve handler's
SELECT and its SavePurchaseExecution would be silently overwritten: the cancelled
row would become status="scheduled" and eventually fire the cloud SDK call the
user explicitly revoked. The CAS ensures the write succeeds only when the row is
still in pending or notified; a concurrent cancel causes ErrExecutionNotInExpectedStatus
which surfaces as a clear error to the caller.

Tests added:
- TestHandler_scheduleApprovedExecution_CASGuardsConcurrentCancel: injects a
  concurrent-cancel error and asserts SavePurchaseExecution is never called.
- TestHandler_scheduleApprovedExecution_HappyPath: normal flow, asserts
  ScheduledExecutionAt is stamped on the transitioned row.

* feat(purchases/revoke): two-step quote-then-confirm + persist refund amount for audit

Addresses adversarial-review Finding #4:

- Add GET /api/purchases/revoke/calculate/{id} endpoint (calculateAzureRevoke)
  that calls CalculateRefund and returns the quoted refund amount and currency;
  no state mutation, used by the frontend confirmation modal.
- Thread expectedRefundAmount through dispatchProviderRevoke -> revokeAzurePurchase
  -> callAzureReturn; on POST /revoke the client sends the amount it consented
  to and callAzureReturn re-runs CalculateRefund, rejecting with 422 when the
  new quote diverges by more than revokeQuoteEpsilon (0.01) to close the TOCTOU
  window between user confirmation and actual Return call.
- Persist calc_refund_amount / calc_refund_currency via MarkPurchaseRevoked so the
  audit row captures the quoted values even if the actual refund differs later.
- Migration 000071 adds the two new nullable columns and a consistency CHECK
  constraint (currency must be non-empty when amount is present).
- New tests: TestCallAzureReturn_TOCTOUDivergenceRejectedWith422,
  TestCallAzureReturn_TOCTOUWithinEpsilonSucceeds,
  TestCallAzureReturn_AuditRowPopulatedWithQuote.

* fix(purchases/revoke): partial-success reconciliation; never retry a refund that already succeeded

Addresses adversarial-review Finding #6:

- Migration 000072 adds revocation_in_flight BOOLEAN NOT NULL DEFAULT false to
  purchase_history plus a partial index on rows where the flag is true.
- callAzureReturn flips revocation_in_flight=true via
  FlipPurchaseRevocationInFlight immediately before the Azure Return API call so
  the row is visible to the finalize sweep if the subsequent MarkPurchaseRevoked
  DB write fails.
- MarkPurchaseRevoked is retried up to 3 times with 1s/3s/9s backoff after
  Azure Return succeeds; if all retries fail, the endpoint returns a
  revokeReconcilePendingResult (HTTP 207 body with code=RECONCILE_PENDING,
  azure_returned=true) so the frontend shows a non-retryable toast rather than
  prompting the user to retry a refund that Azure already issued.
- loadAndRevokePurchaseHistory detects a row with revocation_in_flight=true and
  revoked_at=nil and immediately returns 207 RECONCILE_PENDING to prevent any
  duplicate Azure Return call on retry.
- purchase.Manager.FinalizeInFlightRevocations sweeps rows matching
  GetPurchaseHistoryInFlight and retries MarkPurchaseRevoked with 2s/6s backoff
  per row; the finalize_revocations scheduled task wires this sweep into the
  Lambda event handler.
- New tests: TestCallAzureReturn_MarkPurchaseRevokedFailAllRetries,
  TestLoadAndRevokePurchaseHistory_RevocationInFlightReturns207,
  finalize_revocations success and error dispatch tests.

* fix(purchases/revoke): typed Azure error classification (no more substring match on err.Error())

Addresses adversarial-review Finding #7:

Replace the string-based isAzureClientError implementation with typed error
inspection using errors.As(err, &*azcore.ResponseError). The substring-match
approach had two failure modes:
- False positives: any error whose .Error() string contains "400" (e.g. a
  network timeout "timeout after 400ms") would be misclassified as a
  client error, hiding transient infra problems from the operator.
- False negatives: Azure refund-policy errors with HTTP codes not in the
  literal set (e.g. 403, 405) would be escalated as 500.

The typed approach classifies exactly the HTTP status codes Azure uses for
policy violations and bad requests (400, 403, 404, 405, 409, 422); all other
errors (transport errors, 5xx, plain errors) correctly classify as server-side.

Updated TestRevokePurchase_AzureCalcRefundClientError to inject a real
*azcore.ResponseError{StatusCode: 400} instead of errors.New("400: ...").

New tests: TestIsAzureClientError_SubstringFalsePositive,
TestIsAzureClientError_TypedResponseError (covers 4xx client + 5xx server).

* fix(purchases/revoke): 1h safety margin on local window + clean 422 on Azure window-edge rejection

Addresses adversarial-review Finding #3:

Add azureRefundSafetyMargin = 1h so the in-app revoke button disappears and
revoke requests are rejected 1h before Azure's hard 7-day deadline. This
eliminates the tail of RefundPolicyViolated failures caused by clock skew
between CUDly's clock and Azure's at the window boundary.

The safety margin is applied only in the local pre-flight check. The value
stored in purchase_history.revocation_window_closes_at remains the unmodified
Azure deadline so operators can see the true expiry.

Additionally, detect RefundPolicyViolated errors from the Return API via the
new isAzureWindowEdgeError helper (typed errors.As on *azcore.ResponseError,
checking ErrorCode == "RefundPolicyViolated") and map them to a clean 422 with
"window has closed" message rather than the generic "Azure refund rejected" 400
path. This handles the race where our safety-margin check passes but Azure's
clock disagrees mid-flight.

New tests:
- TestRevokePurchase_AzureWithinSafetyMarginRejected: purchase 6d23h30m ago
  (30min before edge) rejected locally despite Azure deadline not yet passed.
- TestRevokePurchase_AzureJustOutsideSafetyMarginAllowed: purchase 6d22h30m
  ago (90min before edge) accepted.
- TestIsAzureWindowEdgeError: table test covering RefundPolicyViolated, other
  error codes, nil, and plain-error false-positive.
- TestCallAzureReturn_RefundPolicyViolatedReturns422WindowEdge: Return API
  returning RefundPolicyViolated surfaces as 422, not 400.

* fix(migrations): allow support-case revoke to record in-flight state (case filed, awaiting AWS)

Addresses adversarial-review Finding #5:

The original purchase_history_revoked_pair_chk required revoked_at and
revoked_via to be set or unset together. This is too strict for the AWS
support-case revocation path (issue #291 wave-2): when a case is filed,
revoked_via='support-case' is recorded immediately for the audit trail, but
revoked_at stays NULL until AWS confirms the refund. The pair check fires
as a constraint violation in that in-flight state.

Migration 000073 drops the pair check and simultaneously tightens the
support-case companion check:

  Old: CHECK (support_case_id IS NULL OR revoked_via = 'support-case')
       (prevents support_case_id on non-support-case rows only)
  New: CHECK (revoked_via != 'support-case' OR support_case_id IS NOT NULL)
       (requires support_case_id whenever revoked_via = 'support-case')

The meaningful invariant (no dangling revoked_at without a known provider
path) is preserved by the existing purchase_history_revoked_via_chk which
constrains revoked_via to ('direct-api', 'support-case').

* test(purchases/revoke): DST-crossing window math + 4-eyes approval placeholder

Addresses adversarial-review Finding #9:

TestRevocationWindowClosesAtFor_DSTCrossing: verifies that AddDate(0,0,7)
(calendar arithmetic) produces the correct 7-day window across a DST
transition. The test uses the 2024 US spring-forward on March 10 (02:00 EST ->
03:00 EDT): a purchase at 01:30 must close at 01:30 seven days later, not at
00:30 as a naive Add(168*time.Hour) would produce. This pins the correct
behaviour and documents why a fixed-duration approach would be wrong.

TestRevokePurchase_FourEyesApproval: skipped placeholder for the revoke
4-eyes approval gate tracked in issue #1005. The skip keeps the suite green
while the feature is in flight and serves as a reminder to fill in the
implementation when #1005 lands.

* fix(api/purchases): map scheduleApprovedExecution CAS race to 409 (not 500)

When a concurrent Cancel flips the execution away from schedulable status
between the approve flow check and the CAS write, approveWithDelay now returns
409 instead of 500. ErrExecutionNotInExpectedStatus from TransitionExecutionStatus
is the discriminator; any other error continues to surface as 500.

* fix(api/purchases/revoke): drop pre-check, distinguish GetExecutionByID errors

Two related fixes on the same line (revokePurchase GetExecutionByID branch):

Finding B: Remove the racy status=="scheduled" pre-check. The old code read
the status from the DB and then only dispatched to revokeScheduledExecution
when it was "scheduled", but a concurrent writer could flip the status between
the read and the CancelScheduledExecutionAtomic call. Drop the pre-check and
let CancelScheduledExecutionAtomic's WHERE status='scheduled' CAS decide; a
lost CAS returns 410 as expected.

Finding C: Distinguish a genuine DB error (non-nil execErr) from a missing row
(nil, nil). Before the fix, any non-nil execErr was folded into the "execErr ==
nil && ..." condition and silently swallowed, falling through to the history
lookup. Now a non-nil execErr surfaces immediately as a wrapped error (500).

* fix(api/purchases/revoke): clear revocation_in_flight on Azure error paths (transient retryable)

When the Azure Return call fails (window-edge, client-error, or transient), the
revocation_in_flight flag was left stuck at true. The finalize_revocations sweep
would then treat the row as "Azure succeeded, DB write pending" and retry
MarkPurchaseRevoked unnecessarily, potentially marking a purchase as revoked
when it was never actually returned.

Fix: call ClearRevocationInFlight (new store method) on all Azure Return error
paths so the row reverts to its original status. On success the flag stays true
for the sweep to handle (existing wave-1 behaviour unchanged).

Adds ClearRevocationInFlight to StoreInterface, PostgresStore, and MockConfigStore.

* fix(frontend/history): expose Revoke button for status='scheduled' rows + regression test

Three related changes to surface the Revoke button on Gmail-style pre-fire
delayed executions (status='scheduled') in the History UI:

1. Add "scheduled" to historyExecutionStatuses so these rows appear in the
   /api/history response at all (they were previously invisible).

2. Populate RevocationWindowClosesAt from ScheduledExecutionAt for scheduled
   rows in annotateHistoryRowByStatus so the frontend window check
   (revocation_window_closes_at in the future) works without a new field.

3. Update canRevokeCompletedRow to accept status==="scheduled" in addition
   to "completed" and "" (legacy blank), so the Revoke button renders.

Adds regression test: a scheduled Azure row with a future revocation window
must show the Revoke button (Findings E + G, second-wave CR).

* fix(test/purchases/revoke): fix AssertNotCalled placement + isolate parallel test backoff state

Finding F-1: Move AssertNotCalled(t, "CancelExecutionAtomic") to after the
handler call in TestRevokePurchase_ScheduledExecution_BugReg_HappyPathCAS.
The assertion was placed before h.revokePurchase(), where it trivially passes
regardless of what the handler does.

Finding F-2: Drop t.Parallel() from TestCallAzureReturn_MarkPurchaseRevokedFailAllRetries.
The test mutates the package-global revokeMarkRetryBackoffs slice; running it
in parallel with other tests that read the same variable is a data race. The
test already uses t.Cleanup to restore the original value, which is sufficient
when run sequentially.

* fix(test/history-revoke): add missing mock entries for escapeHtmlAttr and getAmortizeUpfront

The state mock lacked getAmortizeUpfront/setAmortizeUpfront/subscribeAmortizeUpfront
and the utils mock lacked escapeHtmlAttr/amortizedMonthly. Both are called inside
renderHistoryList; the missing entries caused a TypeError that loadHistory's try/catch
swallowed, silently producing an empty history-list and hiding the Revoke button for
all three "shows Revoke" cases.

* fix(purchases/revoke): let the CAS decide scheduled cancellability (#290)

Remove the early window-expiry 410 in revokeScheduledExecution. A row
still in status="scheduled" has not been transitioned by the scheduler,
so the cloud SDK call has not fired regardless of how far
scheduled_execution_at is in the past (scheduler lag / backpressure).
Returning 410 purely on a past timestamp broke free-cancel during lag
even though CancelScheduledExecutionAtomic could still cancel the row
before any cloud call. Let the CAS be the sole arbiter: it returns
cancelled=false (410) only when the row has actually moved out of
"scheduled".

Updates the former WindowExpired test to assert the new contract (a
past-timestamp scheduled row is cancelled for free via the CAS), and
refreshes two stale "window-check" comments. Closes the last open
CodeRabbit thread on PR #804.

* fix(email/revoke): require recipient for scheduled-delay email; tidy tests (#290)

Address the second CodeRabbit review pass on PR #804:

- Security: SendPurchaseScheduledNotification (SES Sender) no longer
  falls back to the broadcast SendNotification path when RecipientEmail
  is empty. That email embeds a live, execution-scoped revoke link, so
  broadcasting it leaked an action link to every alert subscriber and
  broke the ownership/RBAC model around revocation. It now returns
  ErrNoRecipient, matching SendScheduledPurchaseNotification and the SMTP
  sender. Adds a regression test asserting ErrNoRecipient on empty
  recipient.
- Test: rename TestRevokePurchase_AzureJustOutsideSafetyMarginAllowed ->
  TestCallAzureReturn_JustOutsideSafetyMargin and correct its docstring;
  it drives callAzureReturn directly and never exercised the local
  1h safety-margin gate (which lives in dispatchProviderRevoke). The
  reject side of that gate stays covered end-to-end via
  TestRevokePurchase_AzureWithinSafetyMarginRejected.
- Build: implement the ClearRevocationInFlight StoreInterface method on
  the standalone analytics and server test mocks (mockConfigStore,
  mockConfigStoreForExchange, mockConfigStoreForHealth) so those test
  packages compile after the interface gained the method.

* refactor(purchases): reduce cyclomatic complexity below the pre-commit gate (#290)

The gocyclo pre-commit hook (threshold 10) failed CI on four functions
the #290 feature work grew past the limit. Split each into focused
helpers with no behavior change:

- calculateAzureRevoke (21): extract validateAzureRevokeRequest (auth +
  load + authorize + window/ID validation, itself split into
  azureRevokeWindowAndIDs) and extractAzureRefundQuote.
- callAzureReturn (21): extract azureCalculateRefund (CalculateRefund +
  parse), handleAzureReturnError (clear in-flight + status mapping), and
  persistAzureRevocation (MarkPurchaseRevoked retry + 207 result).
- annotateHistoryRowByStatus (12): move the in-flight / audit-gap cases
  into annotateInFlightOrAuditGapRow.
- dispatchTask (11): switch to a map-based dispatch.

gocyclo now reports nothing over 10; full internal test suite green
(except the pre-existing CSRF flake that also fails on base).
…1121)

* fix(auth): derive stable CSRF key from encryption key (closes #1120)

Production built the auth service with no CSRFKey, so NewService minted a
random per-process key. CSRF tokens are HMAC-SHA256(csrfKey, rawSessionToken)
minted at login and validated on every write, so a token issued by one Lambda
instance was rejected by another instance (and by the same instance after a
cold-start). This surfaced as intermittent "CSRF validation failed" on plan
create and admin user-group updates for both Standard and Admin users.

Derive a stable CSRFKey from the deploy-provided credential encryption key via
HKDF-SHA256 with a domain-separation label, so every instance and every
cold-start derives the same key. The encryption key is already mandatory in
production (the app refuses to start without it), so this fails closed without
requiring a new secret. The key load is memoized via sync.Once, so seeding the
CSRF key before the credential store adds no extra work.

Add a cross-instance regression test: a token minted by instance A validates on
instance B that shares the same master secret, and is rejected by an instance
built from a different secret (the pre-fix random-key situation). DeriveCSRFKey
also fails closed on an empty master secret.

* refactor(server): extract CSRF key derivation to satisfy gocyclo

Extract CSRF key derivation and auth.Service construction from
reinitializeAfterConnect into a new (*Application).initAuthService
helper. Reduces reinitializeAfterConnect cyclomatic complexity from
11 to 10 (project limit), with initAuthService at 3. No behavior,
ordering, error handling, or fail-closed semantics changed.
…es for non-admins (#981)

* fix(settings): suppress permission-denied banner on Purchasing policies for non-admins (closes #979)

When GET /api/config returns HTTP 403 with a "permission denied" error,
loadGlobalSettings now returns early instead of rendering the red error
banner. The section stays hidden and the page degrades gracefully,
matching the existing Exchange Automation hide-for-non-admins pattern.
Non-permission 403s, 5xx, and network failures still surface the banner.

Four new tests in settings-permissions.test.ts cover: the 403+permission
path (no banner), unrelated 403s (banner shown), 500s (banner shown),
and network errors (banner shown).

* refactor(settings): extract isPermissionDeniedError helper (#981 CR)

Pull the inline 403 + "permission denied" check out of the catch block
into a small helper so the catch reads as a single intent-revealing
line. Behavior unchanged; existing settings-permissions tests cover
the contract.

* refactor(settings): narrow Error before reading status in isPermissionDeniedError (#981 CR)

Reorder the guard so the `instanceof Error` narrowing precedes the
`.status` type assertion, matching the conventional TypeScript idiom
(narrow first, then access properties). Behavior is unchanged thanks to
short-circuit evaluation; this addresses the CodeRabbit style nitpick.
…e collector (#1123)

triggerAutoRefreshIfStale fired a background re-collect whenever the
recommendations cache looked stale, and startRecommendationsRefresh's
success path called onReload() (loadRecommendations on Opportunities,
loadDashboard on Home), which re-entered triggerAutoRefreshIfStale. When
the backend collection never advances last_collected_at (e.g. it fails at
a DB error, so freshness stays stale forever) the success path re-fired
the collect on every reload with no terminal condition.

Observed in production at ~4-10 POST /recommendations/refresh per second,
sustained (~18k calls/hour). The calls currently fail fast at a DB error
before reaching Cost Explorer, but once the imminent DB fix lands every
iteration becomes a paid Cost Explorer / pricing call (~$0.01 each),
continuously. The existing autoRefreshInFlight guard only dedups
*concurrent* refreshes and is cleared per cycle, so it does not stop this
sequential re-trigger.

Add a once-per-page-session latch (autoRefreshAttempted) so the
stale-on-open auto-refresh fires at most once regardless of how many
times loadRecommendations / loadDashboard re-run (filter changes, tab
re-entry, the post-refresh reload). The latch is checked before the
freshness GET so neither the GET nor the collect repeats. Explicit user
actions (lookback change, manual refresh) call startRecommendationsRefresh
directly and are never gated by the latch. A hard page reload resets
module state, re-arming the once-per-load convenience.

Add a regression test that reproduces the real scenario (collect POST
resolves, freshness stays stale) and asserts the collect endpoint is hit
exactly once across repeated loads and the post-refresh reload. Fails
pre-fix (2 calls), passes post-fix (1 call). Reset the latch in the
recommendations test beforeEach so the latch cannot leak across tests.
…routine (#1111)

A panic in one region's worker (nil deref, SDK type assertion failure,
slice OOB) crashed the whole CLI process. Add a deferred recover() that
logs the panic with region context and lets the remaining goroutines
complete, mirroring the pattern applied to Lambda goroutines in PR #859
(issue #672).

Closes #997

Co-authored-by: Claude <noreply@anthropic.com>
…498) (#1122)

The Savings History / savings-over-time chart and YTD KPI sparkline are
backed by GET /history/analytics. The handler read and applied the
account filter but never read params["provider"], and QueryHistory had
no provider predicate, so the series showed identical all-provider data
regardless of the selected AWS/Azure/GCP chip (QA row 2.3). The Home
dashboard trend chart also deliberately dropped the provider param when
calling the endpoint.

Thread the provider through end to end, mirroring the working
/history and trends (QueryByService) paths:

- handler getHistoryAnalytics: read provider, normalise the "all"
  sentinel to "" (no filter) and validateProvider at the boundary,
  then pass it to QueryHistory.
- PostgresAnalyticsClient.QueryHistory + AnalyticsClientInterface: add a
  provider param and an optional parameter-bound `AND provider = $N`
  WHERE fragment ("" = all providers), positioned after the dual-column
  account binds.
- dashboard.ts loadSavingsTrendChart: forward the provider chip (and
  name it in the empty-state via buildFilterDesc, now that the query is
  actually provider-scoped). modules/savings-history.ts already
  forwarded provider; correct its stale "no-op until #502" comment.

Regression tests fail pre-fix and pass post-fix:
- TestQueryHistory_ProviderFilter / _ProviderFilterWithAccount assert the
  provider bind and its position.
- TestHandler_getHistoryAnalytics_ProviderFilter / _ProviderAllIsNoFilter
  / _InvalidProvider cover threading, the "all" sentinel, and boundary
  rejection.
- dashboard.test.ts asserts the provider chip is forwarded to the API and
  omitted when no provider is selected.

Also reset the topbar-filter mocks in the savings-trend describe block so
tests stay isolated now that the chart reads the provider chip.
…1126)

The Home dashboard showed a non-zero "current / committed" figure for Azure
(e.g. $166) while the Coverage tab reported $0 / "No usage detected" for the
same subscription. The two surfaces read different fields of the same active
commitments, and the Coverage path silently dropped Azure rows.

Root cause: getCoverageBreakdown summed only the recurring MonthlyCost and
guarded it with `if p.MonthlyCost != nil`. Azure all-upfront RIs carry
MonthlyCost == nil (no recurring charge at the commitment layer, documented on
config.PurchaseHistoryRecord.MonthlyCost), so every such row was skipped and the
provider collapsed to nil Services / nil OverallCoveragePct. The dashboard, by
contrast, aggregates EstimatedSavings (always populated), so it still counted the
commitment. PR #1105 only fixed the recommendation converter's RecurringMonthlyCost
for future rows; it never touched the Coverage handler, so the divergence persisted.

Fix: derive the effective covered monthly spend via a shared commitmentCoveredMonthly
helper = recurring MonthlyCost (when present) + amortised upfront
(UpfrontCost / (Term * MonthsPerYear)), matching the canonical effective-monthly
formula already used by analytics.Collector and exchange_lookup. An all-upfront
commitment now contributes its amortised upfront instead of being silently dropped,
so Coverage reflects the same active commitments the dashboard does. A nil
MonthlyCost is treated as a real $0 recurring component (not a fabricated total),
and Term <= 0 is guarded against division by zero, mirroring the collector's
skip-bad-term defence.

Regression test TestHandler_getCoverageBreakdown_AzureAllUpfrontConsistency seeds
an active Azure compute all-upfront RI (MonthlyCost nil, $1200 upfront / 1y =
$100/mo) and asserts both that the dashboard primitive counts it AND that the
Coverage tab now reports $100/mo covered (100% coverage) instead of nil. It fails
pre-fix (azure.Services nil) and passes post-fix.
…re alarm (#1124)

* fix(migrations): opt-in dirty auto-heal, raise timeout default, failure alarm

Harden the lazy-on-connect migration path that caused a prod outage: a slow
migration (index build on a growing table) blew the 20s timeout mid-run,
golang-migrate left schema_migrations dirty, and since dirty is terminal every
later cold start fail-opened and served 500s on any query needing the
unapplied columns (070-073 never applied).

Three changes:

1. Default-on dirty auto-heal (internal/database/postgres/migrations/migrate.go).
   New maybeAutoHealDirty runs after maybeForceMigrationVersion and before
   Up(): when the DB is dirty it Force()s the CURRENT recorded version to clear
   the flag, then Up() re-applies the pending tail, so a cold start self-recovers
   instead of staying broken until a manual force (the multi-hour outage shape).
   Enabled by default; CUDLY_MIGRATION_AUTOHEAL=false is the escape hatch.
   Forces the current version, NEVER lower: forcing below already-applied seed
   migrations re-runs guards that raise on a second run (e.g. 000059 "Purchaser
   already exists"). The app ALWAYS starts: if auto-heal still can't reach head,
   the error propagates to ensureDB which fail-opens (existing behavior) and the
   alarm fires -- never a crash-loop. CUDLY_FORCE_MIGRATION_VERSION still takes
   precedence. Relies on migrations being idempotent (documented + tested).

2. Raise defaultMigrationsTimeout 20s -> 120s (internal/server/app.go), still
   well under the 300s Lambda limit, so a normal index build can't be killed
   mid-run. CUDLY_MIGRATION_TIMEOUT override unchanged.

3. CloudWatch migration-failure alarm (terraform/modules/compute/aws/lambda).
   Log metric filter on the "Migration failed" line + alarm, since the app
   fail-opens and AWS/Lambda Errors stays clean during a broken migration.
   Notification target optional via alarm_sns_topic_arn (default []); reuses
   the existing alarm/metric-filter patterns, no new SNS infra.

Fix latent migration bug surfaced by the new idempotency test: migration
000067 widened savings_snapshots.account_id BEFORE dropping the materialized
views that depend on it, so the full stack failed to reach head on a fresh DB
("cannot alter type of a column used by a view"). Moved the view DROPs ahead
of the ALTER COLUMN. This moved the migrations integration suite from 4/37 to
28/42 passing; the remaining failures are pre-existing and unrelated (users
role column, savings-plans split assertions).

Tests:
- TestMigrations_FullStackIdempotent (integration): migrate to head twice,
  second run is a clean no-op and not dirty.
- TestMigrations_AutoHealDirty (integration): dirty DB self-heals by default
  and with =true; CUDLY_MIGRATION_AUTOHEAL=false disables it so a dirty DB
  errors with the dirty flag intact.
- TestResolveMigrationsTimeout (unit): default is 120s, honors the override,
  falls back on invalid/non-positive input.

Spec specs/migration-resilience.md updated for the new default, the auto-heal
flag, precedence, and the alarm.

* fix(migrations): extract migrator recovery hooks to fix gocyclo

RunMigrations hit cyclomatic complexity 11 (gocyclo budget is 10) after
the default-on dirty auto-heal added a second pre-Up error branch, failing
the pre-commit CI hook. Extract migrator creation plus the operator-force
and auto-heal hooks into newMigratorWithRecovery, dropping RunMigrations to
9 while preserving hook ordering and error propagation. The helper closes
the migrator on its own error paths so callers never receive a
half-initialized migrator.

* test(migrations): enforce all migrations are transactional + address CR

Add a static guard test that reads every *.up.sql / *.down.sql under the
migrations dir and fails if any file contains a statement PostgreSQL refuses
to run inside a transaction block (CREATE/DROP INDEX CONCURRENTLY, REINDEX,
VACUUM, ALTER SYSTEM, CREATE/DROP DATABASE, CREATE TABLESPACE, REFRESH
MATERIALIZED VIEW CONCURRENTLY).

golang-migrate runs each migration file in one implicit transaction, so
forbidding non-transactional statements guarantees a failed migration rolls
back atomically with no partial schema. This is orthogonal to the dirty-flag
auto-heal in this PR; it closes the partial-apply hole.

The scanner strips SQL comments and dollar-quoted function/DO bodies before
matching, so CONCURRENTLY inside a stored function (e.g. migration 000003's
refresh_savings_materialized_views) is correctly treated as benign. Files may
opt out via the marker comment "-- migrate:no-transaction"; such files are
skipped with a loud warning and MUST be applied via the deploy-time runner
(issue #1125), never the in-request auto-migrate path.

Also address the CodeRabbit nitpick on specs/migration-resilience.md:
clarify that alarm_sns_topic_arn is a list of SNS topic ARNs.
…loses #419) (#856)

* sec: replace math/rand v1 with math/rand/v2 for rate-limiter jitter (#419)

math/rand/v2 (Go 1.22+) is the project standard for non-crypto jitter
(already used in pkg/retry/exponential.go). Swaps the package-level
import in providers/aws/recommendations/ratelimiter.go; the call-site
(rand.Float64()) is identical in v2.

Adds TestWait_JitterRange: 200 iterations verify jitter stays within
[0, 20%] of the base delay and spans more than 5% (non-stuck source).

* test(ratelimiter): make jitter test deterministic to remove wall-clock flakiness (refs #419)

Extract computeJitter(delay, r) as a pure function and replace the
wall-clock-based TestWait_JitterRange with TestComputeJitter, which
drives the function with 10,000 evenly-spaced r values in [0,1) and
asserts the output is within [0, 0.2*delay]. No sleeping means OS
scheduling overhead cannot blow the upper-bound check. The spread
assertion (max observed >= 95% of cap) still catches a regressed
always-zero jitter source.
…re-assert) (#1127)

* fix(migrations): repair partially-applied 058-067 schema (idempotent re-assert)

A prior emergency Force(67) on a dev/QA database marked migrations 058-067
as applied without running their SQL, leaving genuinely-unapplied schema
changes missing. The most visible symptom is live 500s on
planned/pending/stuck purchase-execution queries:
"column idempotency_key does not exist" (purchase_executions.idempotency_key
from 066). groups.system_managed (059) and other 058-067 schema are missing
too. The 058-067 range cannot simply be re-run: migration 059 has a
RAISE EXCEPTION seed-guard that conflicts with the Purchaser group already
relocated by 064.

Add migration 000074, which re-asserts ONLY the schema from 058-067 in fully
idempotent form so any partially-migrated (or fully-migrated) DB converges to
the correct schema. It intentionally omits the conflicting data seeds
(059 Purchaser INSERT + admin backfill, 060 universal-plans DELETE, 064
relocate); those rows already exist or were deliberately relocated on any
affected DB.

Schema re-asserted, by source migration:
- 058: purchase_executions.executed_by_user_id, .executed_at,
  .pre_approval_skip_reason; index idx_executions_direct_execute
- 059: groups.system_managed
- 063: purchase_history.monthly_cost DROP NOT NULL
- 065: function check_min_one_admin(); constraint triggers
  trg_min_one_admin_delete, trg_min_one_admin_update
- 066: purchase_executions.idempotency_key
- 067: savings_snapshots.account_id widened to VARCHAR(255);
  total_usage / coverage_percentage made NULLABLE (DROP NOT NULL + DROP
  DEFAULT); materialized views monthly_savings_summary, daily_savings_trend,
  provider_savings_summary recreated carrying cloud_account_id (+ unique
  indexes); function drop_old_savings_partitions()

Ordering differs from source 067 in one place: the materialized views are
dropped BEFORE the account_id type widening. On the corrupted DB the views
still exist from migration 000003 selecting account_id, which blocks
ALTER COLUMN account_id TYPE. Dropping the dependent views first lets the
repair converge from the view-present state.

The migration is fully transactional (no CONCURRENTLY), so golang-migrate
applies it atomically, and it is idempotent (re-applying is a clean no-op).
The .down.sql is a documented no-op: this repair does not own the objects
(the source migrations do), so dropping them on rollback would corrupt a
correctly-migrated DB.

Verified against a Postgres 16 container: reproduced the corrupted baseline
(001-057 applied, 058-067 skipped, 070-073 applied), confirmed all listed
objects were missing, applied 074, and confirmed every object converged;
re-applied 074 and confirmed it is a clean no-op; confirmed it runs inside a
single transaction. go build and migration unit tests green, gofmt clean.

* fix(db/migration): address CR findings on 000074 repair migration

Two CodeRabbit findings on the idempotent 058-067 repair migration:

- Materialized views (monthly_savings_summary, daily_savings_trend,
  provider_savings_summary) were created without WITH NO DATA, so each
  populated immediately and the subsequent REFRESH re-ran the same
  aggregation, doubling work and lock time on savings_snapshots. Create
  the views WITH NO DATA so the single REFRESH is the only populate step.

- drop_old_savings_partitions enumerated partitions via pg_tables with
  tablename LIKE 'savings_snapshots_%', where the LIKE wildcard '_' could
  match unrelated tables (e.g. savings_snapshots_backup_2024_01) and risk
  dropping non-partition data. Enumerate real child partitions from the
  catalog (pg_inherits) scoped to the savings_snapshots parent instead.

Migration stays fully idempotent, transactional, and CONCURRENTLY-free;
no 059 data seed is reintroduced. go build ./... and
go test ./internal/database/... pass.
… are not dropped (#1129)

Cost Explorer returns the EC2 RI tenancy title-cased ("Shared",
"Dedicated", "Host"), but resolveEC2Tenancy compared it case-sensitively
against lowercase "shared"/"dedicated". Every real recommendation hit the
default branch and errored; parseRecommendations logged a warning and
continue'd, silently dropping all EC2 RI recommendations. With Savings
Plans collection separately blocked by an IAM AccessDenied, AWS produced
zero recommendations and the dashboard showed no AWS opportunities.

Normalise the input with strings.ToLower(strings.TrimSpace(...)) before
matching, mirroring the existing whitespace-tolerant handling on the RDS
DeploymentOption path. The fail-loud M5 behaviour is preserved: "host"
(Dedicated Hosts, no RI product) and any unknown value still error rather
than collapse to a default-tenancy buy.

Regression test: replicates the production casing ("Shared"/"Dedicated"
plus surrounding whitespace) which fails on the pre-fix parser, and
asserts title-cased "Host" still errors.
* docs(readme): document web interface install + usage

Add a "Web Interface" section covering what the dashboard provides,
honest capabilities and limitations per provider/feature, Terraform
deployment across AWS/GCP/Azure, and how to access the app and
complete the first-login admin bootstrap.

Verified against: frontend/package.json, internal/api/router.go,
internal/server/static.go + http.go + app.go,
terraform/environments/{aws,gcp,azure}/*, docs/DEPLOYMENT.md.

* docs(readme): mark web GUI experimental + add CLI support matrix

Add a prominent experimental note to the Web Interface section heading
and opening blockquote. Retitle "AWS Services (Production Ready)" to
"AWS Services" to avoid overstating stability.

Add an "AWS CLI Support Matrix" subsection under Supported Cloud
Providers & Services listing all seven CLI-accessible AWS services
(rds, elasticache, ec2, opensearch, redshift, memorydb, savingsplans)
verified against providers/aws/services/ and cmd/main.go, with RDS
and ElastiCache marked Tested and the remaining five marked
Experimental (seeking testers), per owner guidance.
…to unblock deploys (#1130)

#1124's migration-failure metric filter requires logs:PutMetricFilter, which the
CI deploy SA lacks until the ci-cd-permissions bootstrap grants it, so every
terraform apply 403s and blocks all deploys (bootstrap-vs-runtime IAM split, see
CLAUDE.md CI/CD IAM). Gate the metric filter and alarm behind a new
enable_migration_alarm variable (default false) so deploys succeed; enable it
later by re-applying the bootstrap with the metric-filter permissions and
setting enable_migration_alarm=true.
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 526184a7-b6a6-4093-9c27-976b795c36c5

📥 Commits

Reviewing files that changed from the base of the PR and between b5953bd and ebfeaed.

⛔ Files ignored due to path filters (3)
  • frontend/package-lock.json is excluded by !**/package-lock.json
  • frontend/src/favicon.svg is excluded by !**/*.svg
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (297)
  • .air.toml
  • .dockerignore
  • .env.example
  • .gitallowed
  • .github/runbooks/compromised-dependency.md
  • .github/runbooks/credential-compromise.md
  • .github/runbooks/data-breach-response.md
  • .github/runbooks/ddos-mitigation.md
  • .github/workflows/README.md
  • .github/workflows/aws_sanity.yml
  • .github/workflows/azure_sanity.yml
  • .github/workflows/ci.yml
  • .github/workflows/cleanup-staging.yml
  • .github/workflows/database-migration.yml
  • .github/workflows/deploy-all.yml
  • .github/workflows/deploy-aws-fargate.yml
  • .github/workflows/deploy-aws-lambda.yml
  • .github/workflows/deploy-azure.yml
  • .github/workflows/deploy-gcp.yml
  • .github/workflows/destroy-fargate-dev.yml
  • .github/workflows/frontend-build-sentinel.yml
  • .github/workflows/frontend-build.yml
  • .github/workflows/pre-commit.yml
  • .github/workflows/rollback.yml
  • .gitignore
  • .gitleaksignore
  • .golangci.yml
  • .hadolint.yaml
  • .markdownlint.yaml
  • .pre-commit-config.yaml
  • .snyk
  • .tflint.hcl
  • .trivyignore
  • CHANGELOG.md
  • CLAUDE.md
  • CONTRIBUTING.md
  • Dockerfile
  • Dockerfile.dev
  • Makefile
  • Makefile.terraform
  • NOTICE
  • README.md
  • SECURITY.md
  • arm/CUDly-CrossSubscription/setup-gcp-wif.sh
  • arm/CUDly-CrossSubscription/setup.sh
  • arm/CUDly-CrossSubscription/template.json
  • ci_cd_sanity_tests/cmd/ri-exchange/main.go
  • ci_cd_sanity_tests/pkg/commitments/aws/ri_exchange.go
  • ci_cd_sanity_tests/pkg/sanity/aws/aws.go
  • ci_cd_sanity_tests/pkg/sanity/azure/azure.go
  • ci_cd_sanity_tests/pkg/sanity/azure/azure_test.go
  • cloudformation/stacks/CUDly-CrossAccount/template.yaml
  • cloudformation/stacks/CUDly/template.yaml
  • cmd/cleanup-lambda/main.go
  • cmd/configure_azure.go
  • cmd/configure_gcp.go
  • cmd/configure_test.go
  • cmd/gen-permissions/main.go
  • cmd/helpers.go
  • cmd/helpers_test.go
  • cmd/lambda/clear-rate-limit/main.go
  • cmd/lambda/main.go
  • cmd/lambda/main_test.go
  • cmd/main.go
  • cmd/main_test.go
  • cmd/multi_service.go
  • cmd/multi_service_coverage_test.go
  • cmd/multi_service_csv.go
  • cmd/multi_service_csv_test.go
  • cmd/multi_service_engine_versions.go
  • cmd/multi_service_engine_versions_paginate_test.go
  • cmd/multi_service_engine_versions_test.go
  • cmd/multi_service_filters.go
  • cmd/multi_service_filters_test.go
  • cmd/multi_service_helpers.go
  • cmd/multi_service_helpers_test.go
  • cmd/multi_service_stats.go
  • cmd/multi_service_stats_helpers.go
  • cmd/multi_service_stats_test.go
  • cmd/multi_service_test.go
  • cmd/multi_service_test_common_test.go
  • cmd/rekey/README.md
  • cmd/rekey/main.go
  • cmd/rekey/main_test.go
  • cmd/secrets_store.go
  • cmd/server/main.go
  • cmd/server/main_test.go
  • cmd/validators.go
  • cmd/validators_test.go
  • docker-compose.test.yml
  • docker-compose.yml
  • docs/DEPLOYMENT.md
  • docs/DEVELOPMENT.md
  • docs/smoke-test-multi-account.sh
  • frontend/deploy.sh
  • frontend/package.json
  • frontend/src/__tests__/a11y.test.ts
  • frontend/src/__tests__/allowed-accounts.test.ts
  • frontend/src/__tests__/api-accounts.test.ts
  • frontend/src/__tests__/api-apikeys.test.ts
  • frontend/src/__tests__/api-history.test.ts
  • frontend/src/__tests__/api-inventory.test.ts
  • frontend/src/__tests__/api-mfa.test.ts
  • frontend/src/__tests__/api-permissions.test.ts
  • frontend/src/__tests__/api.test.ts
  • frontend/src/__tests__/apikeys.test.ts
  • frontend/src/__tests__/app.test.ts
  • frontend/src/__tests__/approval-details.test.ts
  • frontend/src/__tests__/archera.test.ts
  • frontend/src/__tests__/auth-mfa-enroll.test.ts
  • frontend/src/__tests__/auth-mfa-login.test.ts
  • frontend/src/__tests__/auth.test.ts
  • frontend/src/__tests__/charts-css.test.ts
  • frontend/src/__tests__/chip-select.test.ts
  • frontend/src/__tests__/column-filters.test.ts
  • frontend/src/__tests__/commitmentOptions.test.ts
  • frontend/src/__tests__/confirmDialog.test.ts
  • frontend/src/__tests__/crosstab-session.test.ts
  • frontend/src/__tests__/css.test.ts
  • frontend/src/__tests__/dashboard-ownership-950.test.ts
  • frontend/src/__tests__/dashboard.test.ts
  • frontend/src/__tests__/docs.test.ts
  • frontend/src/__tests__/execute-mode-toggle.test.ts
  • frontend/src/__tests__/groups.test.ts
  • frontend/src/__tests__/history-approval-queue.test.ts
  • frontend/src/__tests__/history-approve-button.test.ts
  • frontend/src/__tests__/history-cancel-button.test.ts
  • frontend/src/__tests__/history-cancel-permissions.test.ts
  • frontend/src/__tests__/history-deeplink.test.ts
  • frontend/src/__tests__/history-retry-button.test.ts
  • frontend/src/__tests__/history-revoke-button.test.ts
  • frontend/src/__tests__/history.test.ts
  • frontend/src/__tests__/html.test.ts
  • frontend/src/__tests__/index.test.ts
  • frontend/src/__tests__/inventory.test.ts
  • frontend/src/__tests__/mobile-nav.test.ts
  • frontend/src/__tests__/mocks/styleMock.ts
  • frontend/src/__tests__/modal.test.ts
  • frontend/src/__tests__/navigation.test.ts
  • frontend/src/__tests__/permissions.test.ts
  • frontend/src/__tests__/plans-column-filters.test.ts
  • frontend/src/__tests__/plans-ownership-950.test.ts
  • frontend/src/__tests__/plans-permissions.test.ts
  • frontend/src/__tests__/plans-range-validation.test.ts
  • frontend/src/__tests__/plans.test.ts
  • frontend/src/__tests__/purchase-compatibility.test.ts
  • frontend/src/__tests__/purchase-execution-toast.test.ts
  • frontend/src/__tests__/purchases-deeplink.test.ts
  • frontend/src/__tests__/recommendations-enabled-providers.test.ts
  • frontend/src/__tests__/recommendations-lookback.test.ts
  • frontend/src/__tests__/recommendations-permissions.test.ts
  • frontend/src/__tests__/recommendations.test.ts
  • frontend/src/__tests__/registrations.test.ts
  • frontend/src/__tests__/responsive.test.ts
  • frontend/src/__tests__/riexchange-column-filters.test.ts
  • frontend/src/__tests__/riexchange-permissions.test.ts
  • frontend/src/__tests__/riexchange.test.ts
  • frontend/src/__tests__/savings-history.test.ts
  • frontend/src/__tests__/settings-accounts.test.ts
  • frontend/src/__tests__/settings-external-id.test.ts
  • frontend/src/__tests__/settings-permissions.test.ts
  • frontend/src/__tests__/settings-purchasing-grace.test.ts
  • frontend/src/__tests__/settings-subnav.test.ts
  • frontend/src/__tests__/settings.test.ts
  • frontend/src/__tests__/setup.ts
  • frontend/src/__tests__/sidebar-anchors.test.ts
  • frontend/src/__tests__/skeleton.test.ts
  • frontend/src/__tests__/state.test.ts
  • frontend/src/__tests__/toast.test.ts
  • frontend/src/__tests__/topbar-filters.test.ts
  • frontend/src/__tests__/users.test.ts
  • frontend/src/__tests__/utils.test.ts
  • frontend/src/__tests__/xss-provider-class.test.ts
  • frontend/src/__tests__/xss-purchase-status.test.ts
  • frontend/src/api.ts
  • frontend/src/api/accounts.ts
  • frontend/src/api/apikeys.ts
  • frontend/src/api/auth.ts
  • frontend/src/api/client.ts
  • frontend/src/api/dashboard.ts
  • frontend/src/api/groups.ts
  • frontend/src/api/history.ts
  • frontend/src/api/index.ts
  • frontend/src/api/inventory.ts
  • frontend/src/api/plans.ts
  • frontend/src/api/purchases.ts
  • frontend/src/api/recommendations.ts
  • frontend/src/api/registrations.ts
  • frontend/src/api/riexchange.ts
  • frontend/src/api/settings.ts
  • frontend/src/api/types.ts
  • frontend/src/api/users.ts
  • frontend/src/apikeys.ts
  • frontend/src/app.ts
  • frontend/src/approval-details.ts
  • frontend/src/archera.ts
  • frontend/src/auth.ts
  • frontend/src/commitmentOptions.ts
  • frontend/src/confirmDialog.ts
  • frontend/src/css-modules.d.ts
  • frontend/src/dashboard.ts
  • frontend/src/docs.css
  • frontend/src/docs.html
  • frontend/src/federation.ts
  • frontend/src/groups/groupActions.ts
  • frontend/src/groups/groupList.ts
  • frontend/src/groups/groupModals.ts
  • frontend/src/groups/handlers.ts
  • frontend/src/groups/index.ts
  • frontend/src/groups/state.ts
  • frontend/src/history.ts
  • frontend/src/index.html
  • frontend/src/index.ts
  • frontend/src/inventory.ts
  • frontend/src/lib/chip-select.ts
  • frontend/src/lib/column-filters.ts
  • frontend/src/lib/purchase-compatibility.ts
  • frontend/src/lib/skeleton.ts
  • frontend/src/modal.ts
  • frontend/src/modules/registrations.ts
  • frontend/src/modules/savings-history.ts
  • frontend/src/navigation.ts
  • frontend/src/permissions.generated.ts
  • frontend/src/permissions.ts
  • frontend/src/plans.ts
  • frontend/src/purchases-deeplink.ts
  • frontend/src/recommendations.ts
  • frontend/src/riexchange.ts
  • frontend/src/settings-subnav.ts
  • frontend/src/settings.ts
  • frontend/src/state.ts
  • frontend/src/styles.css
  • frontend/src/styles/base.css
  • frontend/src/styles/charts.css
  • frontend/src/styles/components.css
  • frontend/src/styles/forms.css
  • frontend/src/styles/index.css
  • frontend/src/styles/layout.css
  • frontend/src/styles/modals.css
  • frontend/src/styles/plans.css
  • frontend/src/styles/responsive.css
  • frontend/src/styles/settings.css
  • frontend/src/styles/tables.css
  • frontend/src/styles/tabs.css
  • frontend/src/toast.ts
  • frontend/src/topbar-filters.ts
  • frontend/src/types.ts
  • frontend/src/users.ts
  • frontend/src/users/filters.ts
  • frontend/src/users/handlers.ts
  • frontend/src/users/index.ts
  • frontend/src/users/permissionMatrix.ts
  • frontend/src/users/state.ts
  • frontend/src/users/userActions.ts
  • frontend/src/users/userList.ts
  • frontend/src/users/userModals.ts
  • frontend/src/users/utils.ts
  • frontend/src/utils.ts
  • frontend/tsconfig.json
  • frontend/webpack.config.js
  • go.mod
  • iac/embed.go
  • iac/federation/aws-cross-account/cloudformation/template.yaml
  • iac/federation/aws-cross-account/terraform/.terraform.lock.hcl
  • iac/federation/aws-cross-account/terraform/main.tf
  • iac/federation/aws-cross-account/terraform/outputs.tf
  • iac/federation/aws-cross-account/terraform/registration.tf
  • iac/federation/aws-cross-account/terraform/variables.tf
  • iac/federation/aws-target/cloudformation/template.yaml
  • iac/federation/aws-target/terraform/.terraform.lock.hcl
  • iac/federation/aws-target/terraform/main.tf
  • iac/federation/aws-target/terraform/outputs.tf
  • iac/federation/aws-target/terraform/registration.tf
  • iac/federation/aws-target/terraform/variables.tf
  • iac/federation/azure-target/bicep/azure-wif.arm.json
  • iac/federation/azure-target/bicep/azure-wif.bicep
  • iac/federation/azure-target/terraform/.terraform.lock.hcl
  • iac/federation/azure-target/terraform/main.tf
  • iac/federation/azure-target/terraform/outputs.tf
  • iac/federation/azure-target/terraform/registration.tf
  • iac/federation/azure-target/terraform/variables.tf
  • iac/federation/gcp-sa-impersonation/terraform/.terraform.lock.hcl
  • iac/federation/gcp-sa-impersonation/terraform/main.tf
  • iac/federation/gcp-sa-impersonation/terraform/outputs.tf
  • iac/federation/gcp-sa-impersonation/terraform/registration.tf
  • iac/federation/gcp-sa-impersonation/terraform/variables.tf
  • iac/federation/gcp-target/terraform/.terraform.lock.hcl
  • iac/federation/gcp-target/terraform/backend-config.example.hcl
  • iac/federation/gcp-target/terraform/backend.tf.example
  • iac/federation/gcp-target/terraform/main.tf
  • iac/federation/gcp-target/terraform/outputs.tf
  • iac/federation/gcp-target/terraform/registration.tf
  • iac/federation/gcp-target/terraform/variables.tf
  • internal/accounts/org_discovery.go
  • internal/accounts/org_discovery_extra_test.go
  • internal/accounts/org_discovery_test.go
  • internal/analytics/collector.go

📝 Walkthrough

Walkthrough

This PR adds local development, CI/CD, deployment, and incident-response files; expands CLI sizing and multi-service processing logic; introduces Lambda and cloud-setup commands; and updates documentation, security policy, and repository hygiene configs.

Changes

Runtime and purchase flow

Layer / File(s) Summary
Sizing and service selection
cmd/main.go, cmd/helpers.go, cmd/main_test.go
Config gains target-coverage and purchase-pipeline fields, Savings Plans service selection expands to per-plan types, purchase IDs derive from normalized engine labels, and the new sizing helpers/tests cover coverage routing and validation.
Multi-service processing
cmd/multi_service.go, cmd/multi_service_coverage_test.go
The multi-service pipeline now fetches existing coverage, filters and scores recommendations, handles interruptible purchases, and updates CSV-mode filtering with target-coverage and existing-coverage logic.
CSV import and export
cmd/multi_service_csv.go, cmd/multi_service_csv_test.go
CSV recommendations are parsed back into typed details, reports include sorting and totals, and tests cover the new field formatting and file handling.
Engine version lookup
cmd/multi_service_engine_versions.go, cmd/multi_service_engine_versions_paginate_test.go, cmd/multi_service_engine_versions_test.go
RDS instance/version discovery, major-version pagination, and extended-support filtering are added with tests for parsing, pagination, and adjustment behavior.
Lambda handlers and cloud credential setup
cmd/lambda/main.go, cmd/lambda/main_test.go, cmd/lambda/clear-rate-limit/main.go, cmd/cleanup-lambda/main.go, cmd/configure_azure.go, cmd/configure_gcp.go, cmd/configure_test.go
New Lambda entrypoints and cleanup behavior are added, and Azure/GCP credential configuration commands store secrets in Secrets Manager with validation and unit tests.

Infrastructure and delivery

Layer / File(s) Summary
Build and local dev setup
.air.toml, .env.example, .dockerignore, Dockerfile, Dockerfile.dev, Makefile, Makefile.terraform
Local development, container build, environment template, and Makefile tooling are added or expanded for build, test, Terraform, and Docker workflows.
Cloud infrastructure templates
arm/CUDly-CrossSubscription/*, cloudformation/stacks/CUDly*/*
Azure onboarding scripts/templates and AWS CloudFormation stacks add cross-subscription access, application infrastructure, dashboard resources, and Lambda/policy wiring.
GitHub Actions workflows
.github/workflows/*
New CI, deployment, migration, rollback, cleanup, frontend, and pre-commit workflows are added, and several existing workflow action references are pinned.

Documentation, policy, and repository rules

Layer / File(s) Summary
Docs and policy
README.md, SECURITY.md, CHANGELOG.md, CLAUDE.md, CONTRIBUTING.md, NOTICE
Project docs expand with deployment, security, contribution, changelog, and operating guidance, plus updated release notes and license attribution.
Runbooks
.github/runbooks/*
Incident response runbooks are added for compromised dependencies, credential compromise, data breaches, and DDoS mitigation.
Repository hygiene
.gitignore, .gitleaksignore, .golangci.yml, .hadolint.yaml, .markdownlint.yaml, .pre-commit-config.yaml, .snyk, .tflint.hcl, .trivyignore, .gitallowed
Repo ignore lists and lint/security tooling configs are expanded for Go, Terraform, Docker, Markdown, pre-commit, Snyk, Trivy, and leak scanning.

Sequence Diagram(s)

sequenceDiagram
  participant runToolMultiService
  participant fetchExistingCoverage
  participant filterAndAdjustRecommendations
  participant processPurchaseLoop
  participant audit record writer
  runToolMultiService->>fetchExistingCoverage: fetch existing RI coverage
  runToolMultiService->>filterAndAdjustRecommendations: score and filter recommendations
  runToolMultiService->>processPurchaseLoop: confirm and execute purchases
  processPurchaseLoop->>audit record writer: write per-recommendation audit records
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs

  • LeanerCloud/CUDly#601: Both PRs modify cmd/multi_service_csv.go’s CSV parsing and typed detail reconstruction.
  • LeanerCloud/CUDly#339: Both PRs touch the same target-coverage and sizing pipeline in cmd/helpers.go, cmd/main.go, and cmd/multi_service.go.
  • LeanerCloud/CUDly#536: Both PRs pin GitHub Actions workflow uses: references to immutable commit SHAs.

Suggested labels

triaged, priority/p2, severity/low, urgency/this-quarter, impact/all-users, effort/xl, type/chore

Poem

A bunny hops through build and cloud,
with workflows pinned and docs avowed.
🐇 Fresh carrots for the CSV tray,
and Lambda lights the morning way.
The burrow hums with tests and plans,
while tidy configs clap their hands.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/multicloud-web-frontend

@cristim cristim merged commit 3445821 into main Jun 9, 2026
17 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant