chore: promote multicloud web frontend to main (1380 commits)#1131
Conversation
… test does not panic (#663) When TransitionExecutionStatus returns an error, RecoverStrandedApprovals calls GetExecutionByID to distinguish a race (the row already left "approved") from a real store failure. The test TestManager_RecoverStrandedApprovals_LateCompletionNotClobbered had no expectation for GetExecutionByID, so testify panicked on the unstubbed call (mocks_test.go:277). Add a stub returning Status="completed" so the manager correctly skips the execution and the assertion (recovered==0) still genuinely proves that a row which completed between the stale SELECT and the recovery UPDATE is not clobbered by the sweep. Closes #657
… rows (#664) The production queryExecutions scan reads 22 columns; the mock row definitions in TestPGXMock_GetExecutionByID_Success and TestPGXMock_GetExecutionByID_WithTimestamps only declared 21, causing "incorrect argument number 22 for columns 21" at scan time. Add "approval_token_expires_at" as the 22nd column (after retry_attempt_n, matching the SELECT and rows.Scan order in store_postgres.go) and supply sql.NullTime{} as its value in both rows. Closes #627, Closes #624
…closes #136) (#580) * feat(settings): per-product Coverage and Enabled controls on SP cards (closes #136) Add Coverage % and Enabled toggle to each of the four Savings Plans cards (Compute, EC2 Instance, SageMaker, Database), matching the per-product settings already available on RI cards. - SERVICE_FIELDS SP entries extended with coverageId/enabledId - TRACKED_FIELDS includes the new IDs for dirty-state detection - loadGlobalSettings populates per-card coverage/enabled from service config, falling back to global default_coverage when absent - saveGlobalSettings reads per-card coverage/enabled from DOM when the card exposes the controls; non-SP cards continue to inherit global settings - index.html: Coverage % number input and Enabled checkbox added to each SP card in the settings form - Tests: DOM fixture updated; 3 new tests cover save-coverage, save-enabled, and load-from-service-config paths; HTML structural assertions extended * fix(settings): apply SP coverage/enabled fallback when service row absent loadGlobalSettings only populated per-product Savings Plans coverage and enabled controls for services present in the config response. A SP card whose service row was missing kept the HTML default (80) instead of falling back to global.default_coverage, which then persisted an incorrect value on the next save. Iterate SERVICE_FIELDS (the source of truth for SP card IDs) and overlay any matching service-config values, so every SP card gets the global default when its row is absent. Harden the regression test with a non-default seeded DOM state and a non-80 global default so the fallback assignment is actually exercised. Addresses CodeRabbit review on PR #580.
…story (closes #642, #646) (#650) * fix(purchases): record partial success and stamp target account on history Partial-failure runs (some recs commit, others fail) were marked the whole execution "failed" and skipped the success notification despite real commitments written to purchase_history with Purchased=true, inviting a re-approve that double-buys the already-purchased recs (#642). The single-account path also stamped the ambient AWS host account on history regardless of the resolved target account or provider (#646). - #642: introduce a "partially_completed" outcome. When at least one rec committed, never mark the row "failed"; record partially_completed, send the confirmation for the recs that purchased, and surface the row in History (flagged IsAuditGap so its execution-level dollars are excluded - the committed dollars come from the per-rec purchase_history rows). Applies to both the single-account path (via a partialPurchaseError sentinel that finalizeExecution maps) and the multi-account fan-out path. - #642 frontend: on a partial fan-out failure, name the buckets that created pending executions so the still-actionable requests are not silent orphans; add a distinct "Partial" history badge. - #646: resolve the target account's ExternalID from the resolved account and stamp that on purchase_history, falling back to the ambient AWS STS identity only when no target account can be identified. Fixes assume-role AWS and direct-execute Azure/GCP stamping. Regression tests: single-account + multi-account partial success (partially_completed status + notification sent), correct account stamping for AWS and Azure, History row synthesis with dollar-exclusion, and the fan-out partial toast naming submitted buckets. Closes #642 Closes #646 * fix(purchases): error when target account is unresolved, never fall back to ambient When a single-account purchase resolves a concrete cloudAccountID but GetCloudAccount returns a nil account (the account does not exist), resolveSingleAccountProvider returned (nil, "", nil). The caller then treated the empty target ID as "truly ambient" and fell back to the host AWS STS identity, purchasing and stamping purchase_history against the wrong account (#646). Surface a descriptive error instead so the caller's no-ambient-fallback contract holds once a target account is known. Add a regression test covering the not-found path.
…n-flight rows (closes #645) (#648) * fix(purchases): align token cancel path with session path and guard in-flight rows The token/email cancel path (loadCancelableExecution) only rejected completed/cancelled, so an email-link holder could cancel an approved/running/paused/failed/expired execution that the dashboard session path refuses. Cancelling an approved/running row is unsafe: the AWS commitment is being or has been created, so the cancel would leave the DB and the cloud out of sync. Restrict the token path to pending/notified only, matching cancelPurchaseViaSession. This restriction is itself the in-flight guard. Add table-driven tests on both the token (Manager.CancelExecution) and session (cancelPurchaseViaSession) paths covering every status: each non-cancelable status is rejected with no store write, and pending/notified still cancel successfully. Closes #645 * refactor(purchase): centralize cancelable-status predicate Extract the pending/notified cancelability rule into a single PurchaseExecution.IsCancelable method and call it from both cancel paths (purchase.Manager.loadCancelableExecution and the session-authed cancelPurchaseViaSession) so the policy can never drift between the token and session flows. Behavior is unchanged: only pending/notified rows stay cancelable; approved/running and other in-flight states remain non-cancelable. Addresses CodeRabbit nitpick on PR #648 (issue #645).
…isory-only (#649) These three GCP PurchaseCommitment implementations did not buy a committed-use discount: they provisioned a brand-new billable resource (a Cloud SQL instance, an empty GCS bucket, a Redis instance), so a "purchase" silently spun up infrastructure that kept billing and the commitment the user believed they bought never existed. None of these services exposes a safe programmatic commitment-purchase API: Cloud SQL and Memorystore CUDs are spend-based (bought via the Cloud Billing console / Cloud Commerce Consumer Procurement API) and GCS has no CUD purchase API at all. Their recommendations are therefore advisory only. Each PurchaseCommitment now returns the new sentinel common.ErrCommitmentPurchaseNotSupported and never calls any resource-creation API. The purchase executor already returns that error to the caller (no success path, no infrastructure created). Regression tests assert PurchaseCommitment returns the sentinel and never invokes InsertInstance / bucket Create / CreateInstance. Closes #640
…) (#653) * feat(purchases): make Azure reservation purchases idempotent Extends the #636 IdempotencyToken mechanism to all five Azure reservation executors (cache, compute, cosmosdb, database, search) so a re-drive (and Each executor previously minted a fresh reservationOrderID per call (uuid.New(), or a time.Now() timestamp in search), so a retry created a second reservation. The Azure Reservations API PUTs to reservationOrders/{id} and is idempotent on a stable order ID, so the order ID is now derived deterministically from the IdempotencyToken via common.IdempotencyGUID: the same token re-PUTs the same order and returns the existing reservation rather than creating a second. An empty/short token falls back to the prior non-idempotent ID, preserving behaviour for non-execution callers. Adds common.IdempotencyGUID (canonical 8-4-4-4-12 GUID from the first 128 bits of the SHA-256 token) and common.ReservationOrderID, a single helper that returns the derived GUID or the caller's fallback so each executor stays a one-statement assignment (keeps gocyclo under threshold and avoids repeating the empty-token guard across services). Per-provider regression tests on compute and search assert that a same-token re-drive PUTs to the identical reservationOrders/{id} URL (no second reservation) while a distinct token targets a distinct order. Refs #641. * fix(azure/idem): reject non-hex IdempotencyGUID input; use uuid.New() fallback IdempotencyGUID now validates the first 32 characters are valid lowercase hex before formatting them as a GUID, returning "" for any non-hex input rather than silently producing a malformed GUID. Azure search PurchaseCommitment replaces the Unix-second timestamp fallback with uuid.New().String() so the fallback is collision-resistant at sub-second granularity rather than guaranteed to collide on burst retries. Co-authored-by: CodeRabbit CR #653
…647) (#655) * fix(api/purchases): guard web purchase execution against malformed recs and double-submit Two related guards on the web executePurchase path: - #643: validate each client-supplied recommendation's Provider/Service/ Term/Payment/Count at the API boundary (validatePurchaseRecommendation in validation.go), wired into validateExecutePurchaseRequest after the account scope check. Per-provider Term (1/3) and Payment whitelists reject malformed values (Term:7, Payment:"foo", non-positive Count, empty/all provider) before they reach the cloud SDK with a silent default substituted. Scoped to the web execute path only; the retry path replays already-validated recs and is not re-gated. - #644: submit-time idempotency. purchaseIdempotencyKey hashes actor + sorted rec tuples + capacity; findDuplicatePendingExecution recomputes the key for recent web-sourced pending executions (no schema change) and collapses a double-click/retry within a 2-minute window onto the existing execution instead of minting a second approvable row (double-spend). The frontend execute button is now disabled before the confirm dialog and network call in both the single and fan-out paths, and re-enabled on cancel. Updated existing executePurchase / per-account-perms tests to carry valid recs and the new GetPendingExecutions lookup; added table-driven validation tests, idempotency-key and duplicate-lookup tests, and frontend button-disable tests. #647 (server-side capacity_percent consistency) requires plumbing the pre-scaling count through the frontend and rec type; split into a follow-up to keep this change focused. * fix(api/purchases): validate capacity_percent against scaled rec counts (#647) capacity_percent was a decorative audit-only field: the frontend scaled rec counts client-side, sent the percent alongside, and the backend stored it without ever cross-checking that the scaled counts agreed with the recorded percent. The audit trail for a financial action could claim e.g. "50% capacity" while the recs summed to 100%. Fix: stamp the pre-scaling recommended count onto each rec at submit time (RecommendedCount in config.RecommendationRecord, recommended_count on the TS Recommendation/LocalRecommendation wire types) and add validateCapacityConsistency, wired into validateExecutePurchaseRequest after the per-rec content validation. It rejects any rec where floor(RecommendedCount*pct/100) != Count with a 400. The check is opt-in per rec: recs that carry no recommended_count (legacy callers, single-rec full-capacity purchases, retry replays) are skipped, so the change is backward-compatible and never produces a false rejection for callers that predate the field. JSONB persistence means no schema change. The frontend stamps recommended_count = r.count on the scaled copy in handleBulkPurchaseClick before the counts are floored, so the value the backend verifies always describes the recs it accompanies. Tests: table-driven validateCapacityConsistency cases (full/partial/floor/ mismatch/absent/mixed); a frontend round-trip test that recommended_count survives into the executePurchase POST body on the single-rec path. * fix(api/purchases): atomic duplicate-detection via SELECT FOR UPDATE in tx (#643) Move the pending-execution scan inside the WithTx block so the duplicate check and the INSERT are a single atomic operation, closing the TOCTOU race that the pre-tx duplicatePurchaseResponse call could not prevent. - Add GetPendingExecutionsTx to StoreInterface (SELECT ... FOR UPDATE) - Extract scanExecutionRows helper to share row-scan logic between the pool path and the new tx path - Extract matchDuplicateInList to keep persistExecutionAndSuppressions and executePurchase under the gocyclo threshold - Add derefStringOrEmpty to eliminate the nil-check branch in executePurchase - Change validatePurchaseRecommendation to *RecommendationRecord receiver so provider/service/payment normalisation is written back to the caller - Add GetPendingExecutionsTx fallback to all six test mock types Closes #643
…client (closes #556) (#620) * feat(providers/azure/managed-redis): add Azure Managed Redis service client (closes #556) Add providers/azure/services/managedredis as Azure's counterpart to AWS MemoryDB. Azure Cache for Redis is the closest Azure equivalent: a fully- managed in-memory caching service with reservation-based pricing via the Azure Consumption and Retail Prices APIs. The new client registers under ServiceMemoryDB so it is routed and reported symmetrically with the AWS MemoryDB client at the provider-dispatch level. It reuses the armredis/v3 and armconsumption SDK packages already present in the azure provider module - no new dependencies. Changes: - providers/azure/services/managedredis/client.go: new ManagedRedisClient implementing the full provider.ServiceClient interface (GetRecommendations, GetExistingCommitments, PurchaseCommitment, ValidateOffering, GetOfferingDetails, GetValidResourceTypes) - providers/azure/services/managedredis/client_test.go: 34 tests covering all methods, pager injection, HTTP mock, token error, SKU fallback - providers/azure/services.go: NewManagedRedisClient factory - providers/azure/provider.go: ServiceMemoryDB in GetSupportedServices and newServiceClientForSubscription switch - providers/azure/services_test.go: TestNewManagedRedisClient - providers/azure/provider_test.go: ServiceMemoryDB assertions in GetSupportedServices and AllServiceTypes tests * fix(managedredis): address CR findings - error propagation, real order discovery, UUID, recommendation mapping - Propagate pager-creation and NextPage errors in GetExistingCommitments instead of swallowing them (return nil, err rather than empty slice or silent break) - Replace hardcoded zero-UUID order ID in reservationDetailsPager with the subscription-scope NewListPager so all real reservation orders are enumerated; update ReservationsDetailsPager interface to ListResponse (not ByReservationOrder) - Use uuid.New().String() for PurchaseCommitment reservation order ID to avoid second-level collisions under concurrent purchases - Replace static stub in convertRecommendation with recommendations.Extract so all Azure recommendation fields (SKU, count, cost, region, term) are mapped - Update tests: pager mock type, pager-error case now asserts error, new convertRecommendation tests cover nil and legacy recommendation shapes * fix(managedredis): address second CR pass - pagination + nil context - Follow NextPageLink in getRedisPricing so all pricing pages are consumed rather than only the first response - Replace nil context with context.Background() in TestGetValidResourceTypes_Fallback - Add TestGetOfferingDetails_Paginated to exercise the multi-page fetch path end-to-end * fix(providers/azure/managedredis): address CR third-pass production findings - Guard NewClientWithHTTP against a nil HTTPClient by substituting a default *http.Client with a 30s timeout, avoiding nil-deref panics on the first Do call. - Change collectSKUsFromPager to return (map, error) and have GetValidResourceTypes discard partial results and fall back to the curated commonSKUs() list on pager error, preventing false validation failures for valid SKUs. - Pass the converter's *float64 RecurringMonthlyCost through directly so nil (the documented "provider did not return a monthly breakdown" sentinel in pkg/common/types.go) stays distinct from an explicit 0. * test(providers/azure/managedredis): tighten mock + fixture coverage per CR - Register h.AssertExpectations(t) via t.Cleanup on every test using mockHTTPClient (TestGetOfferingDetails_{1yr,3yr,NoUpfront,APIError, NoPricing} and TestPurchaseCommitment_*), so the mocked Do call is required to fire instead of being silently optional. - Add a "3 Years" reservation entry to samplePricingJSON and assert the exact recurring/total cost in TestGetOfferingDetails_3yr so the test actually exercises 3-year term selection rather than just asserting RecurringCost > 0 against the 1-year fixture. - Replace nil context with context.Background() in TestValidateOffering_InvalidSKU for consistency with other tests. * fix(azure/managed-redis): fail closed on unknown terms and missing reservation price - Add parseTermYears helper (mirrors synapse) to reject any term string outside the explicit "1yr"/"3yr" allowlist instead of silently defaulting to 1 year in PurchaseCommitment and GetOfferingDetails - Remove hardcoded 45% reservation-price fallback in getRedisPricing; return an error when the API returns no reservation price for the requested SKU - Add TestPurchaseCommitment_InvalidTerm and TestGetOfferingDetails_InvalidTerm to exercise the new error paths
…578) The Purchases-tab Savings History chart (modules/savings-history.ts) only wired its local period dropdown and refresh button, so changing the global Provider or Account topbar chip did not refresh it (the tab had to be left and re-entered), and it ignored the account filter entirely. Mirror the PR #500 / #488 pattern: - isPurchasesTabActive() guards the reload so a chip change while the user is on another tab defers (switchTab reloads on entry). - queueMicrotask coalesces the account-then-provider subscriber pair fired from one user action (the #185 ordering rule), collapsing two reloads into one and avoiding stale-overwrite races. - loadSavingsHistory() now forwards the selected account_id (single-select, matching the backend) and the provider chip to getSavingsAnalytics. The provider param is honoured by the backend once #502 lands; until then account_ids does the filtering. Adds regression tests covering subscriber wiring, the active-tab guard, coalescing, account_id passthrough, and two consecutive account changes producing distinct fetches. Updates app.test.ts's state mock to expose the new subscribe getters since it exercises the real savings-history module. Closes #503
GCP Compute CUD creation named commitments cud-<unix-second> and ignored opts.IdempotencyToken, so a re-drive of the same execution (issue #639's recovery sweep) would create a SECOND committed-use discount: a financial double-purchase. Thread opts.IdempotencyToken into RegionCommitments.Insert via two deterministic mechanisms so the same token can never buy twice: - RequestId: GCP's native server-side idempotency key on Insert, which the API documents as preventing duplicate commitments. It must be a valid non-zero UUID, so we format the SHA-256 token into a canonical UUID with the existing common.IdempotencyGUID helper (the same derivation PR #653 used for the Azure reservationOrderID). The same token always yields the same RequestId, so the second Insert is a server-side no-op. - Name: also derived from the token (cud-<first-32-hex>, RFC1035-valid) as defense in depth. Commitment names are unique per project+region, so a re-drive that somehow reached Insert collides on the name and GCP rejects it with ALREADY_EXISTS rather than creating a duplicate. An empty token preserves the prior non-idempotent timestamp-based name (the CLI path, which has no owning execution). The token is masked in logs via common.MaskToken (never logged verbatim), matching PR #652. Adds a re-drive regression test mirroring the AWS/Azure ones: the same token on a second PurchaseCommitment yields an identical RequestId and commitment name, plus an empty-token test confirming the CLI path stays non-idempotent. Scope is computeengine only; CloudSQL/Storage/Memorystore were made advisory-only in #649 and are untouched. Closes #654 refs #641
… (refs #667) (#668) The purchase-execution flow had zero Info-level observability between the approve handler and the final aggregated error toast. When a sync approve hit a context-deadline timeout on the cloud SDK call (issue #667), the only signal was the wrapped error returned to the HTTP response. CloudWatch showed nothing about which rec ran, what details shape was used, how long the AWS API call took, or whether the SDK reached the purchase call at all. This change adds Info/Error log lines at every critical step inside executeSinglePurchase + processPurchaseRecommendations, every line tagged with the owning execution UUID so a CloudWatch filter by exec ID surfaces the full per-rec trace: purchase[<exec-id>]: dispatching N recommendation(s) for account=X plan=Y purchase[<exec-id>]: starting rec <provider>/<svc>/<region>/<sku> ... purchase[<exec-id>]: provider <provider> constructed in <ms> purchase[<exec-id>]: service client <provider>/<svc>/<region> ready in <ms> purchase[<exec-id>]: details ready for <provider>/<svc> (detailsAbsent=<bool> engine=<engine> idempotencyToken=<token>) purchase[<exec-id>]: <rec-tuple> PurchaseCommitment succeeded in <ms> (commitmentID=<id>, cost=<n>) Failure paths log at Error level with the same exec-ID tag + elapsed time so a slow SDK call is distinguishable from a fast IAM denial. To carry the execution ID into executeSinglePurchase without a new parameter (which would ripple through every caller including the CLI purchase path that has no owning execution), a new ExecutionID field is added to common.PurchaseOptions. The web/dashboard path now populates it from exec.ExecutionID at the processPurchaseRecommendations entry; the CLI path leaves it empty as documented in the field comment. The `detailsAbsent=true` signal in the "details ready" line is the key diagnostic — when set, the rec has no persisted ServiceDetails JSON and the cloud client falls back to defaults (Platform=Linux/UNIX, Tenancy=default, Scope=Region), which is the legacy-row code path from issue #453. Combined with the timing on the PurchaseCommitment line, operators can now distinguish "AWS SDK retry storm against a fully- specified rec" from "AWS returned zero offerings because a legacy rec fell back to defaults". Updated two existing tests that asserted exact PurchaseOptions equality to include the new ExecutionID field. Test results: 161/161 purchase tests pass, full project build clean. Refs #667
…541, #543) (#573) * sec(frontend/docs): add SRI + version pin to Go-served swagger-ui docs The Go /docs handler still loaded swagger-ui-dist@5.32.5 from unpkg with no integrity/crossorigin, leaving the public docs page open to CDN poisoning (issue #447). PR #521 only fixed the static frontend/src/docs.html, not the Go handler (issue #541). Bring handler_docs.go to parity with docs.html: pin @5.32.6, add the same sha384 SRI hashes + crossorigin="anonymous" to the css/bundle/standalone- preset tags, and load the standalone-preset script the bootstrap references (previously missing). The unpkg host stays in script-src because element-level SRI does not grant CSP load permission; the godoc explains why. Add a Go regression test asserting the served HTML keeps the version pin and SRI attrs and uses the shared hash constants. Closes #447, #541 Refs #521 * test(frontend/docs): guard docs.html swagger-ui version pin + SRI attrs PR #521 pinned the swagger-ui-dist version and added SRI integrity/crossorigin to docs.html but added no test, so a future edit could silently drop the integrity attribute or revert the @5.32.6 pin to a floating @5 tag (issue #543). Add a regression test that, for each unpkg/swagger-ui-dist link/script tag in docs.html, asserts an exact x.y.z version (never a floating major tag), an integrity="sha384-..." attribute, and crossorigin="anonymous". Closes #543 Refs #447, #521
…utines (closes #669) (#670) Defence-in-depth for the per-account / per-rec fan-out helper. Before this change, any panic inside fn (nil deref, type assertion failure, slice OOB in a future refactor) propagated up the unrecovered goroutine and crashed the entire Lambda process. Consequences: - Purchase execution row stays at 'approved' with no transition to 'failed' (state machine never runs the aggregator that would record the failure). - Lambda invocation terminates abnormally; CloudWatch may not flush the final log lines. - User sees a generic Lambda invocation error rather than the actual panic detail. The fan-out helper now installs a deferred recover() in every goroutine that converts a panic into a structured Err on the per-item Result slot, logs the goroutine stack at Error level for post-mortem, and lets the parent aggregator process the failure exactly like a normal fn-returned error. Current call sites in internal/purchase/execution.go (per-account executeForAccount + per-rec processPurchaseRecommendations) have no obvious panic source today; the recover() is cheap insurance for the high-impact failure mode and covers future refactors that might introduce one. Regression test (TestFanOut_PanicInFn) panics inside fn for one of three items, asserts the panicking item carries a non-nil Err containing the panic value AND that the other two items still succeed (i.e. one goroutine's panic doesn't cascade across the fan-out). Closes #669
…rove (closes #671) (#674) * fix(purchases): atomic CAS cancel to close TOCTOU with concurrent approve (closes #671) Both cancel paths (session-authed dashboard cancel and email-token CancelExecution) used an out-of-transaction IsCancelable check followed by an unconditional SavePurchaseExecutionTx upsert. A concurrent approve that transitioned the row to 'approved' between the check and the upsert would be silently overwritten, leaving the DB at 'cancelled' while the AWS/Azure/GCP commitment is real and billable. Add CancelExecutionAtomic(ctx, tx, executionID, cancelledBy) to the store layer (interface + PostgresStore + all three mocks). The method issues a conditional UPDATE WHERE status IN ('pending','notified') inside the caller-provided tx, returning (false, currentStatus, nil) on zero rows affected so callers can 409 with the racing status visible. On success it returns (true, "cancelled", nil). Update both cancel paths to call CancelExecutionAtomic inside WithTx (paired with DeleteSuppressionsByExecutionTx on the success branch) and return 409 / a descriptive error when the CAS loses the race. Fix the misleading "optimistic-locking guard inside the tx" comment on handler_purchases.go that claimed a guard existed when it did not. Regression tests added for both paths: - TestManager_CancelExecution_RaceWithApprove (token path, purchase pkg) - TestHandler_cancelPurchase_Session_RaceWithApprove (session path, api pkg) Update all existing cancel tests to assert CancelExecutionAtomic rather than SavePurchaseExecution to reflect the new call site. * fix(cancel): surface tx errors as 5xx, tighten suppression-cleanup assertions - handler_purchases.go: return fmt.Errorf (5xx) for WithTx failures instead of NewClientError(409); only the !cancelled path is a true conflict - handler_purchases_test.go: add DeleteSuppressionsByExecutionTx expectation + AssertCalled in runSessionCancelAllowed to cover the transactional contract - purchase/mocks_test.go: make DeleteSuppressionsByExecutionTx use m.Called so testify tracks calls for AssertNotCalled/AssertExpectations - approvals_test.go, coverage_extra_test.go: register the expectation on all success-path cancel tests; add AssertNotCalled on race test to guard against accidental cleanup writes when CAS misses Addresses CodeRabbit review on PR #674.
#555) (#622) * feat(providers/azure/synapse): add data-warehouse service client (closes #555) Add Azure Synapse Analytics Reserved Capacity service client as the Azure counterpart to AWS Redshift (issue #473 parity gap). - New package providers/azure/services/synapse implements the full ServiceClient interface: GetRecommendations, GetExistingCommitments, PurchaseCommitment, ValidateOffering, GetOfferingDetails, GetValidResourceTypes. - Recommendations sourced from Azure Consumption API (armconsumption.ReservationRecommendationsClient) using the "SQLDatabaseDTU" resource type filter; supports both Legacy (EA) and Modern (MCA) billing account shapes via the shared internal/recommendations.Extract converter. - Existing commitments retrieved via armconsumption.ReservationsDetails, filtered to DW*/SCU SKUs that identify Synapse SQL Pool reservations. - Purchase issues a PUT to Microsoft.Capacity/reservationOrders using reservedResourceType "SqlDW"; purchase-automation tag applied when PurchaseOptions.Source is non-empty. - Pricing via Azure Retail Prices API (azure-synapse service name); handles "1 Year" / "3 Years" reservation term strings as returned by the live API. - GetValidResourceTypes returns static DW100c-DW30000c list (same approach as AWS Redshift client); no armsynapse SDK dependency needed. - Wire into provider.go (GetSupportedServices + newServiceClientForSubscription) and services.go factory; update provider_test.go to assert ServiceDataWarehouse. - 31 unit tests covering all methods, pager mocks, HTTP mocks, credential errors, SKU filtering, pricing extraction, and tag application. * fix(synapse): address CR findings - error propagation, term validation, input guards, pricing safety - Propagate pager-creation error in GetExistingCommitments instead of returning empty - Extract parseReservationTermYears helper that fails closed on unknown terms; replace both silent 1yr-coercion blocks in PurchaseCommitment and GetOfferingDetails - Guard PurchaseCommitment against empty ResourceType and non-positive Count - Remove synthetic 40% reservation-cost fabrication in getSynapsePricing; return a clear "pricing data unavailable" error when the API returns no reservation price - Add test coverage for all four new code paths * test(synapse): address remaining CodeRabbit review findings - Propagate io.ReadAll error in doPurchaseRequest; when the HTTP response body read fails, surface a clear error message including the read failure rather than silently using an empty/partial body string in the status-error message. - Add common.ServiceDataWarehouse to TestAzureProvider_GetServiceClient_AllServiceTypes so both the account and non-account client paths are regression-protected for the DataWarehouse service type. * fix(synapse): filter GetRecommendations to client region The pager returns recommendations for all regions; downstream offering/purchase logic is region-bound. Gate each converted recommendation on strings.EqualFold(converted.Region, c.region) so callers receive only the region they constructed the client for. Adds TestGetRecommendations_regionFilter (verifies cross-region rec is dropped) and TestGetRecommendations_modernShape (Modern/MCA payload shape coverage requested by CR). Addresses CR finding on PR #622 (2026-05-22T11:20:18Z). * test(azure/provider): add ServiceSavingsPlans and ServiceSearch to test matrices Add the two newly wired services to all three test functions that guard provider service wiring: TestAzureProvider_GetSupportedServices, TestAzureProvider_GetServiceClient_AllServiceTypes, and TestAzureProvider_GetServiceClientForAccount. Also add ServiceMemoryDB to the GetServiceClientForAccount matrix which was missing. Guard NewClientWithHTTP against a nil httpClient by substituting http.DefaultClient, preventing a nil-dereference panic on purchase and pricing paths that call c.httpClient.Do. Closes CR findings: test matrix gap (ServiceSavingsPlans/ServiceSearch) and nil HTTP client guard. * fix(providers/azure/synapse): tighten SCU SKU filter + case-insensitive offering validation Addresses CodeRabbit round-5 findings on PR #622: - L210: Replace `strings.Contains(skuLower, "scu")` with `strings.HasPrefix(skuLower, "scu")` so unrelated SKUs that merely contain "scu" as a substring (e.g. "rescue_*") are no longer misclassified as Synapse reservations. - L351: Use `strings.EqualFold` with `strings.TrimSpace` in `ValidateOffering` so a valid SKU with different casing (e.g. "dw1000c") is accepted, matching the rest of the Azure provider's case-insensitive comparison pattern. Adds tests: - TestValidateOffering_caseInsensitive - TestValidateOffering_trimsWhitespace - TestConvertSynapseReservation_scuPrefix - TestConvertSynapseReservation_scuSubstringNotMatched (regression guard)
* sec(ci): pin all GitHub Actions to commit SHAs (#415) Replace all `uses: org/action@tag` refs across every workflow file with `uses: org/action@<sha> # tag` pins. Several workflows referenced non-existent future versions (v5/v6/v7 for actions still on v4); these are corrected to the actual latest released tag while being SHA-pinned. Closes #415 * fix(ci): disable credential persistence on all checkout steps All 11 actions/checkout uses defaulted to persist-credentials: true, leaving the GitHub token in git config after checkout. None of these jobs need persisted credentials for authenticated git operations. Set persist-credentials: false on every step to reduce the token exposure window. Fixes CodeRabbit finding (artipacked) on PR #536. * fix(ci): disable credential persistence in rollback and database-migration All 5 actions/checkout steps in rollback.yml and all 4 in database-migration.yml defaulted to persist-credentials: true, leaving the GitHub token in git config for the remainder of each job. Neither workflow has any step that performs authenticated git operations after checkout (cloud auth is done via dedicated aws-actions/configure-aws-credentials, google-github-actions/auth, and azure/login actions). Setting persist-credentials: false on every checkout step closes the token exposure window. Fixes remaining CodeRabbit artipacked findings on PR #536 (nitpick comments on rollback.yml and database-migration.yml). * fix(ci): pin actions to latest SHA of the in-use major, not a downgraded major The previous SHA-pin commit resolved each action to a commit in an older major version than the workflows were using (e.g. checkout pinned to v4.3.1 while the branch used v5, configure-aws-credentials pinned to v4.3.1 while the branch used v6, etc.). This corrects every pin to the latest release within the same major that the base branch (feat/multicloud-web-frontend) was already on. No logic changes; SHA + version comment only. Actions corrected (old major -> correct major): actions/checkout v4.3.1 -> v5.0.1 (34 occurrences) actions/download-artifact v4.3.0 -> v7.0.0 (4 occurrences) actions/upload-artifact v4.6.2 -> v6.0.0 (9 occurrences) actions/setup-go v5.6.0 -> v6.4.0 (8 occurrences) actions/setup-node v4.4.0 -> v6.4.0 (3 occurrences) actions/setup-python v5.6.0 -> v6.2.0 (1 occurrence) aws-actions/configure-aws-credentials v4.3.1 -> v6.1.1 (10 occurrences; standardises ALL to v6.1.1) azure/login v2.3.0 -> v3.0.0 (7 occurrences) docker/build-push-action v6.9.0 -> v7.2.0 (1 occurrence) docker/setup-buildx-action v3.9.0 -> v4.0.0 (2 occurrences) google-github-actions/auth v2.1.9 -> v3.0.0 (5 occurrences) google-github-actions/setup-gcloud v2.2.1 -> v3.0.1 (2 occurrences) hashicorp/setup-terraform v3.1.2 -> v4.0.1 (15 occurrences) codecov/codecov-action v5.5.4 -> v6.0.1 (1 occurrence) golangci/golangci-lint-action v6.5.2 -> v9.2.1 (1 occurrence) snyk/actions/golang v1.0.0 -> 0.4.0 (1 occurrence; base was on 0.x) securego/gosec v2.22.9 -> v2.26.1 (1 occurrence; updated to latest v2) github/codeql-action v4.35.5 -> v4.36.0 (2 occurrences; updated to latest v4) All 23 SHAs independently verified against their tags via GitHub API. No float tags remain. YAML validates on all 15 files.
…) (#540) * fix(db): add regression test for migration 000027 idempotency Migration 000027 (savings_snapshots_pk) already contains the DROP CONSTRAINT IF EXISTS guard that makes it safe on fresh databases where 000018 has already created savings_snapshots_pkey. This commit adds the integration test that would have caught the original failure: - TestMigration_SavingsSnapshotsPK/full_migration_from_scratch: runs all migrations from scratch, verifying 000027 does not fail with "multiple primary keys for table" even though 000018 already added the constraint. - TestMigration_SavingsSnapshotsPK/rollback_to_000026_then_re-apply: rolls back past 000027 (without rolling back 000018) and re-applies all migrations, verifying idempotency in the replay scenario. Closes #246 * fix(db): fix 000040 down SQL column scope and test rollback loop The ORDER BY CASE k in 000040_split_savingsplans.down.sql referenced column k from a UNION branch, which is out of scope at the outer ORDER BY level in PostgreSQL. Restructure the SP-collapse branch to use a subquery that exposes a numeric priority column, then ORDER BY that column name inside the subquery. The savings_snapshots_pk_test.go rollback sub-test computed stepsToRollback = currentVersion - 26 (= 21 on a 47-migration DB) and passed it in a single RollbackMigrations call, which hit the 10-step safety cap. Call in a loop of at-most-10-step batches. * refactor(migrations): extract getMigrationsPath to shared test helper Move getMigrationsPath() from split_savingsplans_test.go into a new helpers_test.go shared by all integration tests in the package. This removes the implicit cross-file dependency that CodeRabbit flagged in savings_snapshots_pk_test.go: the function was defined in a sibling file and invisible to a reader of the calling file alone. Both files are in the same package and build tag (integration), so they always compiled together. The refactor makes each test file self-evidently complete without relying on symbol spillover from a sibling. * fix(db): derive rollback step count from schema_migrations row count Replace `currentVersion - 26` arithmetic with a direct COUNT query on schema_migrations for rows with version > 26. This is correct even when migration numbering has gaps, since Steps(-n) steps back through n applied migrations (not n version-number units). Also clarify the comment about migration 000018 being "still in effect" to make clear that its schema_migrations entry persists but its original constraint was superseded by 000027. Addresses CodeRabbit finding on PR #540 (nitpick, lines 68-73 and 90-93 of savings_snapshots_pk_test.go).
closes #678) (#681) * feat(config): add ListStuckExecutions store method Selects purchase_executions rows whose status is in a caller-supplied set AND whose updated_at is older than a caller-supplied interval. Newest-stuck-first, capped at MaxListLimit per call. Mirrors the existing GetStaleProcessingExchanges pattern for RI exchanges. Used by the reaper sweep (issue #678) to find executions stuck in approved/running because the synchronous executor failed mid-flight (Lambda timeout, OOM, network hang) without flipping the row to a terminal state. Local SELECT only — no provider-side mutation. Adds the method to StoreInterface and to every test-side mock store that satisfies it (purchase, api, analytics, scheduler, server health). Adds pgxmock coverage for the happy path, the empty-statuses short-circuit, and a query-error path. * feat(purchase): add ReapStuckExecutions sweep + per-row CAS to failed New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds executions stuck in approved/running longer than reapAfter and atomically transitions them to "failed" via the existing TransitionExecutionStatus CAS. Each successful transition also writes a canonical, human-readable error string so the History UI (#621) shows operators why the row was reaped and confirms it's safe to retry. Safety properties: - Local-status-only: never touches provider commitments. If the real executor did manage to create a commitment before dying, the retry hits the idempotency path (#636/#638/#652) which surfaces a duplicate-reservation error cleanly. - CAS-protected: if the real executor wakes up and finishes between the SELECT and the transition, the CAS rejects; logged at INFO, not surfaced as an error. The real executor wins the race. - Per-row error-isolation: a failure on row N never blocks N+1..K. Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and fall back to a 10m default on missing-or-malformed env (with a WARN log so ops can spot a typo). Never panics — a misconfigured env var must not crash the other scheduled tasks sharing the Lambda. Tests cover: - stale approved row → flipped to failed with canonical message - stale running row → same - younger-than-threshold rows are filtered out by the SELECT (no TransitionExecutionStatus / SavePurchaseExecution calls happen) - terminal-status rows are never passed to the SELECT (regression guard against accidentally widening stuckStatuses) - CAS race lost → logged + counted, not surfaced as sweep error - CAS (nil, nil) defensive path treated as race-lost - 3-stuck-row integration with per-row CAS expectations - SELECT error → propagated as sweep error - per-row save error after successful CAS → Reaped++ AND Errored++ - env-var: unset, valid, invalid, non-default unit (2h30m) * feat(purchase): wire periodic reaper invocation (closes #678) Wires the ReapStuckExecutions sweep (added in the previous commit) into the existing scheduled-task pipeline: - New ScheduledTaskType "reap_stuck_purchases" registered alongside the other periodic tasks (cleanup, analytics_refresh, ri_exchange_reshape). Adds a dedicated handleReapStuckPurchases dispatcher that reads the threshold via purchase.ParseReapAfterFromEnv on every invocation so an ops-side env-var tune via PURCHASE_APPROVED_REAP_AFTER takes effect on the next sweep without a redeploy. - Adds ReapStuckExecutions to PurchaseManagerInterface so the handler contract is symmetric with the Manager + the testutil mock can satisfy it. MockPurchaseManager gains ReapStuckExecutionsFunc following the existing per-method-Func mock convention. - New EventBridge rule "${stack_name}-reap-stuck-purchases" on rate(5 minutes), targeting the main Lambda with {"action":"reap_stuck_purchases"}. Cadence is intentionally more frequent than the 10m default threshold so a stuck row is reaped within ~1 threshold-window. The reaper itself is CAS-protected (TransitionExecutionStatus from approved/running → failed) so an over-run is safe — the real executor wins the race. - New terraform vars: enable_reap_stuck_purchases_schedule (bool, default true), reap_stuck_purchases_schedule (string, default "rate(5 minutes)"), purchase_approved_reap_after (string, default "" → use in-code DefaultReapAfter). Empty-string default for the threshold avoids pinning a Lambda env value when ops hasn't taken a position. - ParseScheduledEvent learns the "reap_stuck_purchases" action so the EventBridge → Lambda payload round-trips through the dispatcher. Tests cover the dispatcher success path (asserts the default 10m threshold is passed when the env var is unset), the store-error propagation path, the lock-ID uniqueness guard, and the ParseScheduledEvent action mapping. * fix(purchase/reaper): distinguish CAS race-loss from real DB errors The reaper's per-row error handler bucketed every failure from TransitionExecutionStatus into RaceLost, including real DB outages (connection refused, query timeout, etc.). That masked persistent store failures: a downed DB would surface only as a quiet count inflation on RaceLost, with no errored signal for ops to alert on. Wrap the two legitimate race-loss outcomes from TransitionExecutionStatus in sentinel errors so callers can use errors.Is rather than brittle string matching: - ErrExecutionNotInExpectedStatus: row exists but its status moved out of the allowed set before the CAS (the real executor finished between SELECT and UPDATE — race lost). - ErrNotFound (reused for the "row vanished" case): row was deleted between SELECT and UPDATE — also a race outcome, nothing for the reaper to do. In the reaper, classify errors.Is(err, ErrExecutionNotInExpectedStatus) or errors.Is(err, ErrNotFound) as RaceLost (INFO log); everything else bumps Errored with an ERROR log so the ops signal is visible. Addresses A1 from CodeRabbit round-1 review on #681. Regression tests: - TestReapStuckExecutions_CASRaceLostNoError now wraps the sentinel (was raw errors.New); guards the "race-loss is INFO-only" path. - TestReapStuckExecutions_RowVanishedTreatedAsRaceLost covers the ErrNotFound branch (manual DELETE between SELECT and CAS). - TestReapStuckExecutions_HardDBErrorClassifiedAsErrored asserts a non-sentinel error bumps Errored, not RaceLost — the regression guard for the original A1 finding. The wrapped errors keep the original message substrings ("not found", "cannot transition from ...") so existing call sites that match on substrings (e.g. internal/purchase/manager.go's RecoverStrandedApprovals, internal/purchase/approvals_test.go) are unaffected. * fix(purchase/reaper): reject non-positive reap durations + stabilize env tests ParseReapAfterFromEnv accepted "0s" and negative durations such as "-5m" because time.ParseDuration treats them as syntactically valid. Feeding either into ListStuckExecutions would be catastrophic: - 0s: the cutoff "updated_at < NOW() - 0s" matches every approved/running row regardless of age — the reaper would flip fresh, in-flight executions to failed. - -5m: the cutoff "updated_at < NOW() - (-5m)" == "updated_at < NOW() + 5m" matches rows from 5 minutes in the future, i.e. effectively every row. Same outcome. The store-side guard added in the original PR (ListStuckExecutions rejects olderThan <= 0) prevents the broken SELECT from executing, but a misconfigured env value would still cause every sweep to fail silently with a confusing store error rather than a WARN at the config-parse boundary. Reject non-positive durations at the env-parse layer with a WARN log + fallback to DefaultReapAfter so the misconfig is visible in the Lambda's startup logs. Addresses A2 + A3 (defense-in-depth) + A4 + A5 from CodeRabbit round-1 review on #681. Regression tests (A4): - TestParseReapAfterFromEnv_ZeroFallsBackToDefault — "0s" - TestParseReapAfterFromEnv_NegativeFallsBackToDefault — "-5m" - TestParseReapAfterFromEnv_GarbageFallsBackToDefault — explicit "garbage" case complementing the existing _Invalid* test, named per the CR finding so the regression intent is searchable. handler_test.go (A5): - TestHandleScheduledTask's table-driven loop now calls t.Setenv("PURCHASE_APPROVED_REAP_AFTER", "") inside the per-case sub-Run. The two reap_stuck_purchases cases assert reapAfter == 10*time.Minute (the default); without an explicit env clear, an ambient PURCHASE_APPROVED_REAP_AFTER in CI/dev would silently make them flaky. t.Setenv auto-restores the prior value at cleanup. The store-side olderThan <= 0 guard (A3) was already folded into the earlier rebased commit "feat(config): add ListStuckExecutions store method" so no additional store-side change is needed here — A3 is covered by the existing defense-in-depth, and this commit completes the env-side validation A2 calls for.
…purchase flow (closes #677) (#680) * feat(azure/reservations): add shared two-step calculatePrice->purchase helper Introduces providers/azure/services/internal/reservations with DoPurchaseTwoStep, IsSessionTimeout, CalculatePriceURL, and PurchaseURL. The helper centralises the two-call flow (calculatePrice mints a session-bound reservationOrderId; purchase commits it) and retries up to 3 times on Session timed out 400s by re-running calculatePrice from scratch. Refs #677. * fix(azure): switch 7 reservation clients to calculatePrice->purchase flow Fixes the 400 "Session timed out" errors on newer Azure SKU families (e.g., Standard_B2ats_v2, Burstable v2) by replacing the broken single-step PUT reservationOrders/{client-id} pattern with the required two-step flow: 1. POST calculatePrice -- mints an Azure session-bound reservationOrderId 2. POST reservationOrders/{azure-id}/purchase -- commits the order All 7 affected clients now delegate to the shared reservations helper: compute, database, cache, search, cosmosdb, managedredis, synapse. Idempotency (Option B): Azure mints the order ID at calculatePrice time so client-supplied IDs are no longer feasible. Caller-level deduplication uses the purchase-automation tag already attached to each reservation request. The `uuid` dependency is removed from all 7 clients. Tests updated to the two-step mock pattern (calculatePrice + purchase HTTP expectations). Total: 530 tests pass, go vet clean. Closes #677 Refs #641, #667, #632, #669, #671 * fix(azure/reservations): address CodeRabbit review on PR #680 - Honor ctx cancellation during retry backoff in DoPurchaseTwoStep: replace time.Sleep with a select on ctx.Done() so a deadline or cancel propagates immediately instead of blocking for purchaseRetryDelay. - Add tag-injection regression tests to search and managedredis suites: assert purchase-automation tag is present in the request body when opts.Source is set and absent when it is empty. * refactor(azure/reservations): use bytes.NewReader over strings.NewReader(string(bs)) Address remaining CR nitpick from PR #680 pass 2. Wrapping a []byte in an io.Reader via strings.NewReader(string(bodyBytes)) forces a copy of the underlying bytes; bytes.NewReader takes the slice directly. Two callsites fixed (doCalculatePrice + doPurchase). Tests + go build + go vet clean. * fix(azure): require non-empty opts.Source before relying on tag-based dedupe CodeRabbit pass-3 round-1 A1 (outside-diff Major): Azure's reservation API mints the order ID server-side in calculatePrice, so client-supplied GUIDs cannot deduplicate re-driven purchases. The only stable dedupe signal CUDly controls is the purchase-automation tag derived from opts.Source via applyPurchaseAutomationTag (or the inline equivalent in compute / managedredis). When opts.Source is empty the tag is dropped silently and a re-driven purchase cannot recognise the prior attempt, producing duplicate reservations. Add a guard at the entry of PurchaseCommitment in every Azure service that uses the two-step calculatePrice -> purchase flow with tag-based dedupe so the request fails fast before any cloud API call: if opts.Source == "" { result.Error = fmt.Errorf("purchase source is required for ...") return result, result.Error } Services covered (7): cache, database, search, cosmosdb, synapse, compute, managedredis. Savingsplans is intentionally excluded -- it uses the Azure OrderAlias SDK with native DisplayName-based dedup and takes opts.PurchaseOptions as `_`, so the same invariant does not apply. Per-service regression tests pin the guard: every PurchaseCommitment *_RequiresSource test asserts the call returns an error containing "purchase source is required" and that no HTTP call is issued. Existing tests that previously passed common.PurchaseOptions{} are updated to pass Source: common.PurchaseSourceCLI so they continue to exercise the intended code paths. Production safety: the call sites in internal/purchase/execution.go already populate opts.Source from m.normalizePurchaseSource(exec); the guard only blocks the silent untagged fallback (when the upstream NormalizeSource rejects a malformed value), which is exactly the case the CR finding flags as a duplicate-purchase risk. go build ./... + go vet ./... + go test ./providers/azure/... + tests for callers (internal/purchase, cmd) all pass. * test(azure): assert purchase-automation tag is sent in calculatePrice body CodeRabbit pass-3 round-1 N1 (nitpick) + N2 (nitpick covered by same shape): the existing TwoStepFlow / Success regression tests for the Azure two-step calculatePrice -> purchase flow only assert HTTP call counts and outcome -- they do not capture the calculatePrice body or check that opts.Source has been embedded as a purchase-automation tag. Without that assertion the dedupe invariant (the only stable signal Azure's server-minted-order-ID API exposes) could regress silently: tag drop elsewhere in the code (e.g. helper refactor, JSON shape change) would not fail the suite. Add a *_PurchaseCommitment_TagInjection test in each service that lacks one today: cache, database, cosmosdb, compute. Each test: 1. Wires a mock that captures the calculatePrice request body via io.ReadAll and rewinds it with bytes.NewReader so the production code sees an intact request. 2. Drives PurchaseCommitment with common.PurchaseOptions{Source: PurchaseSourceWeb}. 3. Unmarshals the captured body and asserts the top-level "tags" map contains common.PurchaseTagKey ("purchase-automation") -> source. Services already covered by an equivalent test, intentionally untouched: - search: TestSearchClient_PurchaseCommitment_TagInjection - managedredis: TestPurchaseCommitment_TagInjection - synapse: TestPurchaseCommitment_withSource (asserts both "purchase-automation" and the source value are present in the captured body) savingsplans excluded -- different API (OrderAlias) with native DisplayName-based dedupe; opts.Source is `_`-ignored at the entry point. Net effect: the tag invariant is now pinned by a dedicated regression test for every Azure two-step service.
…dation (closes #675) (#676) * fix(providers/azure): case-insensitive + whitespace-tolerant SKU validation (closes #675) Replace exact `==` comparison in ValidateOffering with `strings.EqualFold(sku, strings.TrimSpace(rec.ResourceType))` across compute, database, cache, search, and cosmosdb service clients, matching the pattern already applied to synapse (commit 1313332). Add regression sub-tests for case-insensitive and whitespace-padded SKU inputs to each affected service. * ci: nudge pull_request sync to trigger workflows
… per recommendation (closes #679) (#682) * feat(providers/azure): expand reservation recs into all-upfront + no-upfront variants Azure reservation recommendations previously emitted a single entry with PaymentOption="upfront" (bare, non-canonical). Users could not see the no-upfront (monthly billing) alternative even though Azure supports both billing plans at the same total price. Changes: - Add ExpandPaymentVariants helper in providers/azure/internal/recommendations/ (alongside the existing Extract helper). Given a base recommendation it returns two copies: one with PaymentOption="all-upfront" (RecurringMonthlyCost=0) and one with PaymentOption="no-upfront" (RecurringMonthlyCost=CommitmentCost/termMonths). EstimatedSavings and SavingsPercentage are identical across both variants since Azure charges the same total price for both billing plans. - Update all five service converters (compute, cache, database, cosmosdb, search) and the Azure Advisor path to call ExpandPaymentVariants, producing two recs per API recommendation instead of one. - Rename the import alias to azrecs in each service client to resolve the local- variable shadowing that the recommendations package name previously caused. - Replace bare "upfront" with canonical "all-upfront" in all Azure code paths. - Add focused unit tests for ExpandPaymentVariants covering: zero on-demand guard, zero commitment cost, 1yr vs 3yr term-month math, pointer independence between variants, savings equality across variants, and shared-field pass-through. - Add TestComputeClient_GetRecommendations_EmitsBothPaymentVariants to exercise the end-to-end fan-out path via a mock pager. - Update existing converter tests to expect "all-upfront" instead of "upfront". Closes #679 * fix: apply CodeRabbit auto-fixes Fixed 2 file(s) based on 2 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai> * refactor(azure/recommendations): extract page-processing helper to fit gocyclo budget Pull the per-page advisor-record loop body out of getAdvisorRecommendations into appendAdvisorPageRecs, following the same pattern used for resolveAdvisorRegion. getAdvisorRecommendations drops from cyclomatic 12 to 4; appendAdvisorPageRecs lands at 5. No behaviour change. * test(azure/search): update converter test for nil-guard contract After the azrecs.Extract refactor, convertAzureSearchRecommendation returns nil on unusable SDK payloads (matching the contract shared by cache, compute, cosmosdb, and database converters). Split the single test into a nil-guard test and a field-population test that uses BuildLegacyReservationRecommendation, consistent with the other service converter test patterns. * refactor(azure): use Azure-canonical upfront/monthly payment terms in variant expansion ExpandPaymentVariants was emitting "all-upfront"/"no-upfront" (AWS terms) instead of the Azure-canonical "upfront"/"monthly". This made the emitted values inconsistent with every other Azure code path and with the AZURE_PAYMENTS constant in commitmentOptions.ts which defines "upfront" and "monthly" as the canonical values. Rename throughout: - "all-upfront" -> "upfront" in ExpandPaymentVariants, all five service converter functions (compute/cache/database/cosmosdb/search), and the convertAdvisorRecommendation base rec. - "no-upfront" -> "monthly" in ExpandPaymentVariants. - Update all test assertions and fixtures to match. - Update the ExpandPaymentVariants docstring precondition and value list. The purchase-path switch cases in all Azure reservation clients already accept both the canonical value and the former alias (e.g. case "all-upfront", "upfront": and case "monthly", "no-upfront":), so renaming the emitted values does not break the purchase path. Add "upfront" -> "Upfront" to PAYMENT_DISPLAY_LABELS in recommendations.ts so the Payment column in the recommendations table renders a human-readable label for Azure "upfront" recs rather than falling back to the raw string. "monthly" was already covered. * fix(azure): guard EstimatedSavings when OnDemandCost is zero (CR #682) Three findings from CodeRabbit review 4348367061: 1. `converter.go:287-292` (recurring Major): the OnDemandCost-zero guard protected savingsPct (divide-by-zero) but the savings = totalOnDemand - totalReservation line ran unconditionally. Result: when OnDemandCost was zero and CommitmentCost was non-zero, EstimatedSavings was emitted as a negative number (-CommitmentCost), contradicting the function contract. Move savings inside the guard so both fields stay 0 when on-demand is zero. 2. `converter_test.go`: add TestExpandPaymentVariants_ZeroOnDemand_NonZeroCommitment regression asserting EstimatedSavings == 0 (not -CommitmentCost) when OnDemand=0, Commitment>0. The existing _ZeroOnDemand_NoSavings test only covered the 0/0 sub-case and would have missed the regression. 3. `client_test.go`: add cost-field assertions to TestSearchClient_ConvertAzureSearchRecommendation_PopulatesAllFields so the WithCosts(120, 80, 40) fixture is actually checked through to rec.OnDemandCost / rec.CommitmentCost / rec.EstimatedSavings. Locks in the converter behaviour this PR depends on. --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
…Price (closes #685) (#686) * fix(azure/reservations): add SanitizeDisplayName helper for Azure DisplayName allowlist Azure's Reservations API rejects DisplayName values that contain characters outside [A-Za-z0-9_-] with HTTP 400 DisplayNameInvalid. Add a shared helper in services/internal/reservations that replaces non-conforming characters with '_', collapses consecutive non-conforming chars into a single '_', and truncates to 64 characters. * fix(azure): use SanitizeDisplayName + underscore-form displayName across 5 services (closes #685) Replace space-containing displayName format strings with underscore-form prefixes and pipe through SanitizeDisplayName in all 5 Azure service clients (cache, compute, cosmosdb, database, search). Fixes HTTP 400 DisplayNameInvalid errors from Azure's Reservations API when SKU names contain unusual characters. Extends each service's PurchaseCommitment test with a body-capture assertion that guards the [A-Za-z0-9_-]{1,64} allowlist going forward. The capture mocks the calculatePrice POST so it works correctly with the two-step calculatePrice->purchase flow established by #680. * feat(azure/reservations): rich self-describing displayName for purchases Extend the Azure DisplayName fix from #686 with a richer, parseable format that mirrors the AWS RI CSV's ReservationId column shape so operators can identify reservations in the Azure portal without cross-referencing the purchase audit log. Format (each segment separated by '-'): {svc}-{region}-{sku}-{count}x-{term}-{paymt}-{ts}-{rand} vm -eastus -Standard_D2a_v4-1x-1yr-allup -20260522T190000-a1b2c3d4 redis -westeurope-Premium_P1 -1x-1yr-allup -20260522T190000-a1b2c3d4 cosmos -northeurope-EnableCassandra-1x-1yr-allup-20260522T190000-a1b2c3d4 sql -centralus -GP_Gen5_2 -1x-1yr-allup -20260522T190000-a1b2c3d4 search -westus2 -standard2 -1x-1yr-allup -20260522T190000-a1b2c3d4 Output is always sanitized to Azure's [A-Za-z0-9_-]{1,64} allowlist. When the composed string would exceed 64 chars (long region + long SKU + high count), the builder progressively drops the optional tail segments — random suffix first, then timestamp, then payment-option. The service code, region, SKU, count, and term are NEVER dropped: those are the high-signal segments operators rely on. With every optional segment dropped, the joined required segments are truncated as a last-resort safety net so the builder never emits >64 chars. Previous format was `VM_Reservation_<SKU>` (and equivalents per service) — operators reported these were indistinguishable across purchases. The new identifier carries enough context to identify the specific reservation purchase from the portal alone. Implementation lives in providers/azure/services/internal/reservations/displayname.go alongside the existing SanitizeDisplayName helper, and is wired into all 5 service clients (cache, compute, cosmosdb, database, search). Time and random sources are injectable so tests can pin them; tests cover happy-path, per-service shape, length-fit truncation at each drop priority (random → timestamp → payment), payment normalization, sanitization of dirty input, embedded-dash collapse, determinism, and UTC normalization. The 5 per-service allowlist regression tests are extended to also assert the service-code prefix and SKU presence so accidental swaps in the literal strings are caught. Refs #685 * fix(azure/reservations): repair progressive-drop loop in BuildDisplayName Address CodeRabbit findings on PR #686. Critical: BuildDisplayName's progressive-drop loop was broken. It called SanitizeDisplayName(candidate) inside the loop before checking the length cap. Because the sanitizer hard-truncates to 64 chars, the condition len(candidate) <= 64 was vacuously true on iteration 1 and the loop exited immediately, returning a mid-string-truncated full format instead of cleanly dropping the random suffix, timestamp, and payment segments. The existing tests passed only by coincidence (all test inputs happened to land their required segments within the first 64 bytes of the full format). Fix: check the pre-sanitized length to gate segment-dropping, and sanitize only on the exit path. Each segment is already conformant via normalizeSegment, so the in-loop sanitization was both wrong and redundant. Add three regression tests that fail against the pre-fix code: - DropLoopActuallyDropsTimestamp: forces the loop to drop both random and timestamp; asserts the exact clean output and that no timestamp digits leak in. - DropLoopActuallyDropsPayment: forces required-only fallback; asserts no payment fragment survives. - ZeroNowReplacedByWallClock: asserts a zero-value Now is replaced by time.Now() rather than emitting the placeholder "00010101T000000". Also address three minor doc/safety findings on the same file: - Clarify isAllowedDisplayNameChar godoc so the [A-Za-z0-9-] description matches reality (underscores are passed through by the caller). - Guard the zero-value Now case so production callers that forget to set it get a real timestamp instead of "00010101T000000". - Document generateRandSuffix's silent fallback for non-nil src shorter than 4 bytes. Verification: stash-revert the production code change with the new tests in place; all three regression tests fail with exactly the expected mid-truncated outputs. Restore the fix; all 68 reservations tests pass and all 620 azure-package tests pass.
* fix(purchases): narrow Describe*Offerings + cap pagination (#688) Root cause: EC2 findOfferingID was missing the offering-type filter, causing AWS to return all payment variants, triggering indefinite pagination through sparse empty pages and burning the entire Lambda 60s budget (execution 9336a111). Changes per service (all 7 purchase-path clients): - EC2: add OfferingType direct field + convertEC2PaymentOption helper; verify returned offering type before returning; cap at 5 pages; check ctx.Err() between pages; add timing log per page - RDS: add pagination cap + ctx.Err check + variant verification + per-page timing log (already had all filter fields) - ElastiCache: same as RDS - MemoryDB: add OfferingType + Duration direct fields to narrow at API level; replace client-side matchesDuration/matchesOfferingType loop with server-side filters; cap + ctx.Err + variant verification - OpenSearch: API has no filter fields; add cap + ctx.Err + defense- in-depth variant check after client-side matching - Redshift: API has no node-type/payment filter fields; add cap + ctx.Err + unknown-type guard - SavingsPlans: add pagination loop + cap + ctx.Err to lookupOfferingID (was a single non-paginated call) MaxResults/MaxRecords: EC2/Redshift/RDS/ElastiCache APIs have a hard max of 100 per page; no change from current value. MemoryDB/OpenSearch were already at 100. Regression tests (3 per service = 21 new tests): - Pagination cap fires after maxOfferingPages empty pages + next token - Wrong-variant offering rejected by defense-in-depth guard - Happy path returns correct offering ID on first page Refs #684 #667 #632 * fix(ec2): use typed top-level fields, not Filters[], on offering describe (#688) Followup on the same #688 issue. The initial commit kept InstanceType, ProductDescription, InstanceTenancy, duration, and OfferingClass packed into DescribeReservedInstancesOfferingsInput.Filters[], with only OfferingType promoted to the typed top-level field. Live-AWS verification showed that the Filter[]-heavy shape is what causes AWS to return empty pages with NextToken on sparse offering sets (e.g. t4g.nano regional no-upfront convertible), walking until the Lambda budget expires. The all-typed-fields shape returns the exact matching offering immediately. Typed fields land on AWS's primary indices; Filter[] is a generic fallback that does not always hit the same indices. Changes: - Move InstanceType, ProductDescription, InstanceTenancy, MinDuration, MaxDuration, OfferingClass to typed top-level fields on the input struct. - Keep only scope as a Filter[] entry (no typed equivalent on the input). - Use the typed enum constants (types.InstanceType, types.RIProductDescription, types.Tenancy, types.OfferingClassTypeConvertible) instead of stringly-typed filter values. - Extract ec2OfferingQuery struct + buildEC2OfferingQuery / describeInputFromQuery helpers so findOfferingID stays under the gocyclo cap. - Delete the now-unused buildOfferingFilters helper and the always-returns- "convertible" getOfferingClass helper. - Change scanEC2OfferingPage from "hard error on variant mismatch" to "soft skip + log and continue scanning". With the typed OfferingType field set this should never fire; if AWS somehow returns a mismatched variant we want to keep looking on later pages rather than fail the rec. - Update TestFindOfferingID_WrongVariantRejected to assert the new soft-skip behaviour (mismatched variant on the only page -> "no offerings found"). - Drop TestBuildOfferingFilters_LegacyCanonicalization: it exercised the deleted helper. The underlying canonicalizeEC2Tenancy / canonicalizeEC2Scope helpers retain their own direct unit tests. * fix(purchases): soft-skip guards + nil/empty token parity (closes #688) Three related cleanups on the offering-lookup pipeline: 1. EC2/RDS: normalize end-of-pagination check to guard against both nil and empty-string NextToken/Marker. AWS may return either form for the terminal page; only checking nil risks an extra empty-page request. EC2 uses a new isLastEC2Page helper so the intent is explicit. 2. OpenSearch: remove the post-match payment-option defense-in-depth guard from scanOpenSearchOfferingPage. matchesPaymentOption already filters mismatched variants in the same loop iteration; the redundant hard error can never fire in practice and makes the wrong-variant test harder to reason about. Soft outcome (loop exhausts -> "no offerings found") is safer for the caller. 3. Redshift: same rationale as OpenSearch -- remove the hard-error guard for unknown offering type from scanRedshiftOfferingPage. Update TestFindOfferingID_WrongVariantRejected for both services to assert the soft-failure shape (err != nil, id == ""). * fix(purchases): address CR #690 category A/B/C findings Category A (4 sites, nil-or-empty pagination terminator): - elasticache: guard result.Marker against both nil and "" in paginateElastiCacheOfferings; avoid the redundant page that a non-nil empty-string marker would otherwise trigger - opensearch, rds, redshift: same nil-or-empty guard on their respective NextToken/Marker fields (EC2 was already fixed in 5f8f269) Category B (ElastiCache payment option coercion): - convertPaymentOption now returns (string, error); unknown values return an explicit "unsupported ElastiCache payment option" error instead of silently mapping to "Partial Upfront" - findOfferingID propagates the error before any pagination begins - TestClient_ConvertPaymentOption updated to assert the (result, error) return shape and verify the unknown-input error path Category C (unreachable variant-mismatch guard): - OpenSearch: replace the matchesPaymentOption pre-filter continue with a direct gotPayment != wantPayment check that returns an explicit "payment option mismatch" error; guard is now reachable - Redshift: replace matchesOfferingType pre-filter continue with a direct "Regular"/"Upgradable" enum check that returns an explicit "unexpected type" error on first mismatch - Test comments and assertions updated to match the new behaviour (mismatch surfaces as a diagnostic error, not a silent skip) CR #690 review round 1 * chore: restore Terraform lock files to feat branch state The previous commit accidentally included Terraform provider constraint changes unrelated to the #688 offering-lookup fix. Restore the three lock files to their state on feat/multicloud-web-frontend so this branch stays focused on the purchase-path narrowing changes only.
…closes #687) (#689) * feat(common): add BuildReservationName for rich AWS reservation IDs Add a shared builder that composes self-describing identifiers for AWS reservation purchases. The format mirrors the Azure DisplayName format introduced in #686 (cross-cloud symmetry for ops dashboards): {svc}-{region}-{sku}-{count}x-{term}-{paymt}-{ts}-{rand} Output is always sanitized via SanitizeReservationID for the AWS reservation-name allowlist ([a-zA-Z0-9-], dots in SKUs become hyphens) and capped at 60 characters — the tightest length cap across the five AWS services that accept a customer-supplied reservation ID/name (RDS ReservedDBInstanceId, ElastiCache ReservedCacheNodeId, MemoryDB ReservationId, OpenSearch ReservationName). When composed length exceeds the cap, optional tail segments drop in priority order: random suffix first, then timestamp, then payment option. The service code, region, SKU, count, and term are never dropped. In the pathological worst case the builder truncates the SKU itself so count and term still survive the final assembly. Time and random sources are injectable for deterministic tests. Refs #687 pkg/common/reservation_name.go | +201 pkg/common/reservation_name_test.go | +281 * feat(aws): rich self-describing reservation names matching Azure (closes #687) Wire common.BuildReservationName into the five AWS reservation purchase paths. On the no-token CLI fallback path (i.e. when opts.IdempotencyToken is empty), each service now composes a self-describing identifier of shape: {svc}-{region}-{sku}-{count}x-{term}-{paymt}-{ts}-{rand} For example, an OpenSearch RI purchase that previously yielded the opaque `opensearch-Standard_X-1779468234` now becomes `opensearch-us-east-1-r6g-large-search-3x-1yr-allup-20260521T002019-a1b2c3d4` (truncated to 60 chars per service allowlist; see #687). Per service: * rds: deriveReservationID's fmt.Sprintf fallback -> BuildReservationName ("rds-" prefix). opts.ReservationID still wins when set; idempotency token still wins above both. * elasticache: ReservedCacheNodeId fallback -> BuildReservationName ("cache-" prefix). * memorydb: ReservationId fallback -> BuildReservationName ("memdb-"). * opensearch: ReservationName fallback -> BuildReservationName ("opensearch-"). * redshift: PurchaseReservedNodeOffering doesn't accept a customer- supplied node ID, so the rich descriptors travel as tags instead. tagReservedNode now writes Count / Term / PaymentOption alongside the pre-existing NodeType / Region / PurchaseDate / Tool tags so the reserved node is identifiable from the AWS console alone. The token-based path (issue #641) is unchanged: IdempotentReservationID still derives the deterministic ID from the token so a re-drive sends the identical name and AWS rejects the duplicate server-side. The rich builder applies ONLY to the no-token fallback. Tests: one new per-service capture test asserts ^{svc-code}- prefix, region presence, SKU (dots->hyphens) presence, count+term embedded, and the 60-char cap. Redshift's test asserts the new tag keys (Count/Term/PaymentOption) appear in the CreateTags input. Closes #687 * fix(common): default zero Now in reservation name + assert payment drop in length test Default f.Now to time.Now() when IsZero() so a missed caller field produces a current timestamp rather than a year-0001 artifact. Deterministic tests are unaffected because they set Now: testFixedNow explicitly. Add assert.NotContains(t, got, "allup", ...) to TestBuildReservationName_LengthFit_DropsPaymentLast so the test actually verifies the payment segment is dropped in the tight-length regime, not just that timestamp and random were removed. Addresses CodeRabbit findings on PR #689.
) (#693) GetReservationPurchaseRecommendation, GetSavingsPlansPurchaseRecommendation, and DescribeDBMajorEngineVersions each silently discarded every page after the first. For large payer accounts, CE returns recommendations across multiple pages, so callers were seeing a fraction of the real result set. All three call sites now drive a token-driven pagination loop. The existing rate-limiter retry loop stays as the inner (per-page) loop. ctx.Err() is checked at the top of each iteration (feedback_ctx_cancel_terminal). The loop terminates on nil OR empty-string token (PR #690 parity). Exceeding maxRecommendationPages/maxEngineVersionPages = 20 returns a diagnostic error citing the service and issue #692. gocyclo relief: fetchSPAllPages + fetchSPPageWithRetry extracted for SP; queryMajorEngineVersionsWithClient + fetchMajorEngineVersionsForEngine extracted for RDS (all stay under cap 15). Three new test groups: multi-page accumulation (3 pages), empty-token terminal (no extra call), and pagination cap error. Updated TestGetRecommendations_ContextCancellation to assert context.Canceled directly -- the new ctx.Err() guard at the top of the pagination loop fires before the rate-limiter so the error path changed correctly.
closes #695) (#696) * feat(ri-exchange): target-offerings endpoint + UUID validation + AWS 4xx mapping (#695) Three defects addressed: Defect 1 (backend): add GET /api/ri-exchange/target-offerings backed by ListTargetOfferings on ec2svc.Client. Uses typed-field approach from PR #690 to enumerate valid convertible exchange targets for a source RI. Introduces targetOfferingsEC2Client interface + factory in handler.go for testability. Defect 2: validate targets[].offering_id against a UUID regex at both the getExchangeQuote and validateExecuteExchangeBody call sites. Returns 400 with a descriptive message when an instance type (e.g. "t3.medium") is submitted instead of an offering UUID. Defect 3: mapAWSExchangeError helper extracts smithy.APIError and maps InvalidOfferingId/InvalidParameter/ValidationError/Invalid*.NotFound to 400 with the original AWS message; all other AWS errors stay 500. Applied at the quote and execute call sites. Tests: 27 new test cases covering the new endpoint (happy path, 404, auth), UUID format rejection, and AWS 4xx/5xx error classification. Closes #695 * feat(ri-exchange/ui): replace free-text offering ID with picker (#695) Replace the "Target Offering ID" free-text input in the exchange modal with a <select> picker backed by two sources: - AWS-driven: GET /api/ri-exchange/target-offerings loads valid target offerings for the source RI from DescribeReservedInstancesOfferings (the new endpoint from the companion backend commit). Options are grouped under "AWS Target Offerings". - CE Recommendations: alternativeTargets from reshape recommendations are listed as a second optgroup "CE Recommendations" with per-instance monthly cost displayed in the option label. The picker drives a hidden <input type="hidden" class="modal-exchange-target"> that holds the resolved UUID. collectTargets() validates the value against a UUID regex before submission, rejecting any non-UUID value with a user-visible error message. This closes the path where a user could accidentally submit an instance type (e.g. "t3.medium") instead of a UUID. Tests: 5 new test cases covering the picker select rendering, CE options population, AWS offerings after async load, select-to-hidden-input sync, and non-UUID rejection at submission time. Updated 8 existing tests to use offering_id UUIDs in place of instance types where they inject values directly into the hidden offering input. Closes #695
…ter Clear (#712) Closes #700 Two sub-bugs in the column filter popover and the zero-row render path: 1. Clear-categorical handler called checkboxes.forEach + setRecommendationsColumnFilter directly, bypassing commitAll(). That skipped updateAllTriState() so the (All) checkbox remained checked or indeterminate after Clear. Fix: expose commitAll as commitAllRef in the outer scope and delegate to it from the Clear handler, keeping the same call path as the (All) checkbox itself. 2. renderRecommendationsList replaced the entire container (including <thead>) with a <p class="empty"> when the filter produced zero rows. Fix: always call buildListMarkup so <thead> is preserved, then inject a hint <td class="empty"> into the empty <tbody> via DOM methods. Tests: 2 new Issue #700 tests added (tri-state visual state + thead survival), existing empty-state test updated to match new structure. All 1933 frontend tests pass.
…closes #564) (#697) * ci(pre-commit): auth tflint via GITHUB_TOKEN + retry transient flakes (closes #564) The `Run pre-commit` step in `.github/workflows/pre-commit.yml` intermittently fails because `terraform_tflint`'s `tflint --init` downloads ruleset plugins from the GitHub Releases API anonymously, hitting the 60-requests-per-hour per-IP limit that the runner shares with every other workflow running on the same NAT egress. Most recently observed on PR #696 (https://github.com/LeanerCloud/CUDly/actions/runs/26411029351). Two-part fix: 1. Expose secrets.GITHUB_TOKEN to the step. tflint reads GITHUB_TOKEN natively; with it set the limit jumps to 5000/hr per-token, which matters because the token is unique per workflow run, not per IP. This alone fixes the rate-limit class. 2. Wrap the step with nick-fields/retry@v3.0.2 (SHA-pinned per project policy) at 3 attempts with a 90s wait. The token fix eliminates the AWS-plugin rate-limit case; the retry handles the residual flakes (GitHub Releases availability blips, plugin download timeouts, etc.) so a transient failure no longer requires a manual rerun. Acceptance criteria from #564: - GITHUB_TOKEN exposed to the `Run pre-commit` step - Three consecutive `pre-commit` runs in the same hour all complete without a tflint rate-limit failure (the retry-wrapper also catches the residual flakes the token cannot) * ci(pre-commit): bump job timeout to accommodate retry budget (CR #697) CR pointed out the retry policy was ineffective: `nick-fields/retry` uses `timeout_minutes` PER ATTEMPT (not total), so with `max_attempts: 3` and `timeout_minutes: 15` the worst case is ~48 min, but the job-level `timeout-minutes: 15` killed any retry before attempt 2 could start. Fix: tune both timeouts so the retry budget fits inside the job budget with safety margin. - per-attempt timeout: 15 -> 10 minutes (normal pre-commit runs take ~3-5 min, 10 is comfortable margin without inflating hang-detection latency) - job timeout: 15 -> 35 minutes worst case = 3 attempts * 10 min + 2 * 90 s retry waits = 33 min, fits in 35 min with margin Avoids CR's suggested 50-minute job cap because that would also delay killswitch on a genuinely hung step. 35 min is the tightest value that still lets all 3 attempts complete.
…ghlight + safe-area padding (#989) - viewport: add viewport-fit=cover for iOS notch support; drop spurious initial-scale=1.0 decimal (spec says 1) - meta: add format-detection=telephone=no to stop iOS auto-linkifying numeric data in tables/charts - css: -webkit-tap-highlight-color: transparent on button/a/[role=button] with explicit opacity:0.85 :active feedback instead of the blue flash - css: safe-area-inset-top on .app-topbar and safe-area-inset-bottom on .app-shell so content clears notch and home indicator on iPhone - css: touch-target minimums (min-height/width: 44px) on all interactive controls under @media (pointer: coarse) so desktop layout is unaffected; checkboxes/radios grow via label:has() instead to avoid glyph bloat - html: inputmode="numeric" pattern="[0-9]*" on all 15 numeric inputs (coverage %, grace days, notification days, ramp params) to surface the numeric keypad on mobile - html: autocomplete="email" on the notification email input
…unt validation log (#969) * sec(api/plans): stop leaking raw account UUID + DB error in plan-account validation log The GetCloudAccount DB-error path inside validatePlanAccountProviders returned fmt.Errorf("accounts: failed to get account %s: %w", aid, getErr), a non-ClientError that propagated to the router's logging.Errorf("API error: %v"), leaking the raw account UUID and the raw DB error string into logs (gap left by #946 which closed #944). Wrap the error with the existing mapCreatePlanStorageError helper so that: (a) a DB error emits a structured logging.Errorf carrying only the derived provider list and account count (never the UUID), and (b) returns a generic 500 ClientError ("failed to validate plan accounts") with no raw UUID or DB string. The ErrNotFound sentinel still maps to 404; the explicit (nil, nil) not-found path is unchanged. Add logging.SetOutput as a test seam on the default logger and two regression tests: - TestValidatePlanAccountProviders_GetAccountDBError_NoPIILeak: asserts that a DB error yields a 500 ClientError whose message and emitted log contain neither the account UUID nor the raw DB error string. - TestValidatePlanAccountProviders_AccountNotFound_Still404: guards that the not-found path still returns a 404 referencing the missing ID. Also fixes the pre-existing build failure on the base branch (Session.Role field removed in #940 but the test was not updated) and gofmt drift in handler_purchases_guards_test.go. closes #965 * fix(api/plans): stop leaking raw DB error in plan-fetch validation branch (refs #965) getPlanForAccountProviderValidation returned fmt.Errorf("accounts: failed to get plan: %w", err) on a non-ErrNotFound storage error. Because this is not a ClientError, the router logged it via logging.Errorf("API error: %v"), emitting the raw DB error string on every POST /plans and PUT /plans/:id/accounts call that hits a DB failure (same endpoints as the sibling GetCloudAccount leak closed by PR #969). Fix: delegate to the existing mapCreatePlanStorageError helper (introduced by PR #969 for the GetCloudAccount branch) so the response is a generic 500 ClientError and the log line contains only a static format string -- no raw DB error, no plan UUID. Adds two regression tests: - TestGetPlanForAccountProviderValidation_GetPlanDBError_NoPIILeak: asserts the handler returns a 500 ClientError and that neither the client response nor the captured log buffer contain the raw DB error string; FAILS pre-fix. - TestGetPlanForAccountProviderValidation_PlanNotFound_Still404: guards that ErrNotFound still maps to 404 after the refactor. * sec(api): fix sibling PII leaks in plan/purchase handler paths (followup #965) PR #969 closed the raw-UUID + raw-DB-error leak in validatePlanAccountProviders and getPlanForAccountProviderValidation. Adversarial review surfaced five sibling leak sites of the exact same shape -- a non-ClientError wrapped with a raw account/plan UUID and the underlying DB error -- that all propagate through the router's `logging.Errorf("API error: %v")` path and leak both the UUID and the raw DB error string. Fixed sites (all use `mapCreatePlanStorageError` so a DB error becomes a generic 500 ClientError, and a structured log line is emitted without PII): - handler_plans.go:209 updatePlan -- existingPlan fetch - handler_plans.go:333 getPlanForPurchaseCreation -- plan fetch for createPlannedPurchases - handler_plans.go:447 patchPlan -- plan fetch for PATCH - handler_purchases.go:323/328 disablePlan -- GetPurchasePlan and UpdatePurchasePlan failure paths - handler_purchases.go:2006 lookupContactEmail -- the Warnf used to log the raw account UUID and the raw err; now logs only the failure point - handler_purchases.go:1955 gatherAccountContactEmails -- the wrap site used to re-leak the same account UUID through `%s: %w`; now returns a generic 500 ClientError (lookupContactEmail has already logged without PII) Test changes: - New: TestHandler_patchPlan_GetPlanDBError_NoPIILeak, TestHandler_updatePlan_GetPlanDBError_NoPIILeak, TestDisablePlan_GetPlanDBError_NoPIILeak, TestDisablePlan_UpdatePlanDBError_NoPIILeak, TestLookupContactEmail_GetAccountDBError_NoPIILeak, TestGatherAccountContactEmails_DBError_NoPIILeak -- each asserts the ClientError is 500 (server-side fault), the returned message contains neither the raw UUID nor the raw DB error string, and (where the path logs via the default logger) the captured log buffer contains neither either. Pattern mirrors the existing TestValidatePlanAccountProviders_GetAccountDBError_NoPIILeak. - Updated: TestHandler_patchPlan_NotFound, TestHandler_createPlannedPurchases_PlanNotFound -- assertions moved off the legacy "failed to get plan" wrapper string to the new contract: 500 ClientError without the raw plan UUID. Both tests return a plain `errors.New("not found")` from the mock (not config.ErrNotFound) so they exercise the 500 branch, not the 404 branch. - Updated: TestHandler_resolveApprovalRecipients_LookupErrorPropagates -- the test's stated intent (no silent globalNotify fallback) is retained via the require.Error guard; the ErrorIs(transient) assertion was guarding the implementation detail that broke the no-PII-leak contract. Replaced with 500-ClientError + no-UUID + no-raw-DB-error assertions consistent with the rest of the issue #965 family. Full internal/api suite: 1474 passed; full short suite: only two pre-existing internal/auth failures unrelated to this PR. refs #965 * sec(api/purchases): keep lookupContactEmail failure at error level The GetCloudAccount failure in lookupContactEmail was logged at Warnf, but its caller gatherAccountContactEmails maps the error into a generic 500 ClientError. The router returns ClientErrors via the branch that skips logging.Errorf("API error: %v"), so this log line is the only error-level breadcrumb for the resulting user-visible 500. Downgrading it to Warnf weakened error-level alerting for exactly the outage this change set is preserving visibility for. Restore Errorf while keeping the message generic (no raw account UUID, no raw DB error string) so the PII-leak fix is unaffected. Extend the regression test to assert the [ERROR] severity and the generic message; it fails pre-fix (emits [WARN]) and passes after. Addresses CodeRabbit review on PR #969.
The production code in `recommendations.ts:3015-3023` already renders the action-box summary as a min/max range via `formatSavingsRange(min, max)`, which collapses to a single value when min == max. This PR adds the regression tests that pin that behaviour for two cases the previous test suite did not cover: - Multi-cell selection with a per-cell range: 2 cells, one with two term variants (1yr/3yr), one single-variant. Asserts the summary shows `$300 - $500/mo`, `$1,100 - $1,800 upfront`, `2 cells`. - Multi-cell selection with identical totals: collapses to `$400/mo` without a range separator (per formatSavingsRange's same-value branch). Closes #281. No production changes -- the range render already shipped via the broader pageLevelRange + formatSavingsRange work; this PR just stops that contract from regressing silently.
Bump .rec-cell-chevron from font-size 1.2em/24px target to 1.5em/32px, add focus-visible highlight, and add two tests: one asserting the chevron is a <button> outside .rec-cell-summary-content (keyboard accessible without extra keydown wiring), one confirming no-overlap with the summary content cell.
…435) (#526) AllowOrigins: replace hardcoded "*" with !If [HasCustomDomain, ...] so when DashboardDomainName + DashboardHostedZoneId are set the CORS origin is locked to "https://${DashboardDomainName}". The wildcard fallback only applies when no custom domain is configured. AllowMethods: replace "*" with explicit list [GET, POST, PUT, DELETE] matching the Terraform lambda module (modules/compute/aws/lambda/main.tf). AllowHeaders: replace "*" with explicit list [Content-Type, Authorization, X-CSRF-Token, X-Session-Token] matching the Terraform lambda module. Closes #435. Note: Terraform tfvars wildcard origins (#442) are handled by PR #519 which adds a validation block rejecting ["*"] in the lambda module.
…ld frontend (#1116) Seven jest suites (plans, history-retry-button, history-approve-button, history-approval-queue, history-cancel-button, history-cancel-permissions, allowed-accounts) each had a jest.mock('../utils', ...) factory that omitted escapeHtmlAttr. When plans.ts and history.ts call it at render time the mock throws "TypeError: (0 , utils_1.escapeHtmlAttr) is not a function", causing cascading test failures. Added the same escapeHtmlAttr stub already present in history.test.ts (encodes & < > " ') to each affected factory. No logic changes. Closes #1115
Move kms:GetKeyPolicy from KMSCreateAndRead (Resource="*", account-wide) to a new KMSReadTaggedOnly statement in policy_compute_b.tf gated on aws:ResourceTag/Project=CUDly. This prevents a compromised deploy token from reading key policies of unrelated KMS CMKs in the account (key-policy reconnaissance), while still allowing the deploy SA to read policies of CUDly-owned keys (which are tagged at CreateKey time by the Terraform aws_kms_key resource). Split to policy_compute_b.tf because policy_compute.tf is at the AWS 6144-char managed-policy limit. Closes #427.
Add a shared "Amortize upfront over term" checkbox that folds the one-time upfront cost evenly across the commitment term, letting every payment type (No Upfront / Partial / All Upfront) be compared on a total-cost-per-month basis. New shared helper `amortizedMonthly(monthlyCost, upfrontCost, termYears)` in utils.ts returns `monthlyCost + upfrontCost / (termYears * 12)`. Guards: term <= 0, non-finite term, or null/non-finite upfront all fall back to monthlyCost unchanged (no NaN/Infinity). Toggle state lives in `state.ts` (`getAmortizeUpfront` / `setAmortizeUpfront` / `subscribeAmortizeUpfront`) persisted under `cudly.amortizeUpfront` in localStorage. Same pattern as `getCostPeriod`/`setCostPeriod`. All views read the same key so they stay in sync across tab switches and page reloads. Views updated: - history.ts: checkbox injected into #history-controls and #purchases-approval-queue-section; both renderHistoryList and renderApprovalQueue use amortizedMonthly when the toggle is on; column header updates to "Monthly Cost (amortized)" when active. subscribeAmortizeUpfront re-renders both tables without a refetch. - inventory.ts: checkbox injected into .section-header-actions; buildCommitmentRow uses amortizedMonthly on the monthly_cost cell; last-fetched commitments cached so subscribeAmortizeUpfront can re-render without an API round-trip. - approval-details.ts: "Monthly cost" column added to the per-rec table; header built via DOM methods (no innerHTML); column label updates to "Monthly cost (amortized)" when toggle is on. Modal reads state at render time, no re-render needed. plans.ts and dashboard.ts have no monthly_cost field on their displayed data types (PlannedPurchase / dashboard summary), so the toggle has no effect there and no changes were needed. Tests: 9 new `amortizedMonthly` unit tests in utils.test.ts covering all payment types and every guard path. Existing approval-details, history, and inventory tests updated for the new column and the new state mock entries. Full suite: 2442 pass, 2 pre-existing timezone failures in utils.test.ts (unrelated). Closes #1112
) plans-range-validation.test.ts used a closed mock factory for ../utils that listed every util function explicitly but omitted escapeHtmlAttr, which was added to utils.ts after the test was written. The missing stub caused a "not a function" TypeError whenever plans.ts called escapeHtmlAttr during renderPlannedPurchaseRow / renderPlanCard, making the entire plans-range-validation suite fail in CI. Added the stub with a faithful HTML-attribute escape implementation (mirrors the real utils.ts implementation) so the closed factory remains exhaustive. Closes #1115 Co-authored-by: Claude <noreply@anthropic.com>
…controls (#1118) The recommendations filter-status bar is flex/align-center, but the `.cost-period-selector-wrapper` span had no CSS, so the small "Show costs:" label aligned to the baseline of the taller base-styled select and appeared stacked above it. Make the wrapper an inline-flex (label + select vertically centered) and size the select to match the bar buttons, so the count, Expand all, "Show costs: <period>" and Columns all sit on one line.
… fan-out (closes #197) (#838) * feat(recommendations): per-rec payment seeding in multi-account buckets (closes #197) Multi-account fan-out buckets now expose per-rec Payment dropdowns so each rec gets its own override-seeded default instead of silently falling back to the toolbar value for the whole bucket. Changes: - FanOutBucket gains optional perRecPayments map (rec.id -> payment), populated only for buckets spanning 2+ distinct cloud_account_ids. Each rec's default is seeded from its account's AccountServiceOverride (matching provider+service, normalized, support-checked), falling back to the bucket-level payment. - openFanOutModal now pre-fetches overrides for ALL accounts across all buckets (was: only sole-account buckets). Single-account buckets are unaffected; the extra fetch just removes the narrowing guard. - renderFanOutBucketSection renders a per-rec list with individual Payment selects (.fanout-per-rec-payment) when perRecPayments is set. Each select carries data-rec-id so tests and app code can target it. - handleFanOutExecute (app.ts) uses perRecPayments.get(r.id) for the POST payment field, falling back to b.payment for single-account buckets and for any rec not in the map. - getFanOutBuckets deep-copies perRecPayments to prevent callers from mutating module state. - Tests (h) + (i): assert per-rec map seeding from overrides and that dropdown changes propagate to module state. * fix(recommendations): keep bucket-level Payment effective for fan-out rows openFanOutModal eagerly wrote every rec of a multi-account bucket into perRecPayments, and the execute path prefers that map over b.payment via `perRecPayments.get(id) ?? b.payment`. Once the map was fully populated the fallback was dead, so changing the bucket-level Payment dropdown only mutated b.payment while unedited rows still posted their stale per-rec value. The visible bucket-level control was a no-op for any row that originally followed the bucket default. Treat perRecPayments as an explicit-override set: seed only recs whose payment differs from the bucket default, drop a per-rec entry when the user selects the current bucket default again, and re-sync visible per-rec selects (that have no override) when the bucket-level dropdown changes. Skip the re-sync for rows whose service does not support the new payment so the shown value never diverges from the POST payload. Adds a regression test asserting the bucket-level change propagates to the resolved payment for non-overridden recs, and updates the per-rec test to the override-only contract. Both fail pre-fix.
…874) Column was already being rendered with formatDateTime (absolute). Switch the "Last used" cell to formatRelativeTime with the ISO timestamp on a title attribute for hover. Null last_used_at renders "Never" unchanged. Backend last_used_at was already present: column in migration 000001, UpdateLastUsed fires async in ValidateUserAPIKey, and ListUserAPIKeysAPI already propagates the field. Migration 000065 not needed. Add round-trip tests to service_apikeys_api_test.go asserting the field is non-nil for used keys and nil (not zero-time) for never-used keys. Add frontend tests for Never and for the relative+title rendering.
…1119) Two tests fail on a pristine checkout of feat/multicloud-web-frontend, turning base CI red and marking every open PR UNSTABLE: 1. internal/server TestAuthServiceAdapter_ValidateCSRFToken The mock service was built via NewService without a CSRFKey, so it generated a random ephemeral key each run. The test then asserted a literal "csrf-token" validated, which the HMAC derivation can never match. Make the test hermetic: pin a deterministic key with the new auth.TestCSRFKey() and assert the real derived token via auth.DeriveTestCSRFToken(). The CSRF validation logic is untouched. 2. frontend utils.test.ts (formatDate / getDateParts) A date-only string ("2024-03-15") was parsed as UTC midnight, so toLocaleDateString / getDate() shifted it to the previous day in any browser west of UTC. This is a real off-by-one for users (purchase scheduled_date, reservation end_date) and made the test pass only in UTC. Fix at root: parse bare "YYYY-MM-DD" as local midnight via a new parseDateInput helper (with round-trip validation so out-of-range digits like "2024-13-45" still reject). Full timestamps and Date objects are left untouched. Add timezone-invariance regression tests. Issue #993 (TestLogin_WithMFA_NoSecret) no longer reproduces on the current base: service.go already routes every login-failure path through the single genericLoginError constant, so unknown-email and wrong-password return a byte-identical message (no enumeration oracle). Closing it here with the other base-CI repairs.
* test(frontend/auth): cover cross-tab logout sync (closes #493) PR #487 added installStorageListener() in api/client.ts as the compensating control for moving tokens off sessionStorage. The behaviour is implemented but had no automated regression test; if the listener's key filter, newValue check, or idempotency guard ever regresses, the SECURITY claim in client.ts becomes silently false. Add crosstab-session.test.ts covering: - cleared auth keys (authToken, apiKey, csrfToken) zero in-memory state and call window.location.reload - non-auth keys are ignored - partial updates (newValue !== null, e.g. token refresh) are ignored - initAuth() is idempotent: storage listener installs exactly once Each test loads api/client in jest.isolateModules() so the storageListenerInstalled guard is fresh per test. * test(frontend/auth): patch only window.location.reload (CR nit) Capture the original window.location reference before tests and restore it in an afterEach, so each test starts from a clean Location object. This implements CR's teardown recommendation while keeping the Object.defineProperty(window, 'location', ...) approach required by jsdom (window.location.reload is non-configurable and cannot be patched directly). * test(frontend/auth): clear localStorage between cross-tab logout tests (CR nit) initAuth() seeds in-memory auth from localStorage, so without clearing it between tests a stale token from one case could leak into the next. Extend the existing afterEach to call localStorage.clear() for proper isolation. * test(frontend/auth): address CR nits on cross-tab logout test - replace the out! non-null assertion in loadAuth() with an explicit guard that throws a clear error if isolateModules never ran - add a regression case for the !e.key guard in installStorageListener: a null-key storage event (localStorage.clear()) must not clear in-memory auth or reload
…290) (#804) * feat(purchases): in-app revocation within free-cancel window (closes #290) POST /api/purchases/{purchaseId}/revoke: Azure returns via armreservations (CalculateRefund + Return two-step), 7-day window. AWS/GCP return 422 and the History UI hides the button for those providers. - DB: migration 000057 adds revocation_window_closes_at, revoked_at, revoked_via, support_case_id columns to purchase_history - RBAC: revoke-own / revoke-any actions, revoke-own granted to all users by default; ownership is via account-access (history rows pre-date created_by_user_id) - Backend: fail-closed nil-auth guard, idempotency via revoked_at IS NULL, Azure CalculateRefund->Return with test-injectable client interfaces - Frontend: canRevokeCompletedRow gate (azure + window open + not yet revoked), Revoke button in History action cell, confirm dialog + toast - All existing mock stores updated with GetPurchaseHistoryByPurchaseID and MarkPurchaseRevoked; auth permission-count tests updated to 12 * fix(ci): extract provider dispatch and account check to reduce revoke complexity; regenerate permissions on PR #804 * fix(api/purchases): align revoke admin check with group-only authz (#907) Rebase onto feat/multicloud-web-frontend brought in #907 (group-membership- only authorization, no role field on Session). The revoke handler still gated admin via `session.Role == "admin"`, which no longer compiles since api.Session has no Role field. Replace with the same two-track pattern the sibling authorizeSessionCancel / authorizeSessionApprove already use: - Stateless admin API key short-circuits via `session.UserID == apiKeyAdminUserID` (no DB row exists to resolve permissions from). - Group-based admins fall through to HasPermissionAPI; the {admin, *} wildcard in DefaultAdminPermissions matches revoke-any:purchases there. Drop the dead `Role` field from the revoke test sessions; the admin test now pins the apiKeyAdminUserID short-circuit, and the existing RevokeAny test already covers the group-admin path via HasPermissionAPI. Also fold in the trailing pre-commit fixes that were red on the previous push: - gofmt: realign struct-field padding in TestRevokePurchase_AzureReturnClientError. - go mod tidy: promote armreservations from indirect to direct (the revoke handler imports it directly). Free-cancel window enforcement (AzureRevocationWindowDays + windowClosesAt check) and revoke-call idempotency (early return when record.RevokedAt is already set) are unchanged. Refs #290 * fix(api/purchases/revoke): fail-closed on nil account + return error on DB persist failure Two security fixes from CR findings: 1. checkRevokeOwnAccountAccess: return 403 when CloudAccountID is nil/empty instead of allowing the revoke. Without an account association, ownership cannot be verified for revoke-own callers. Add regression test for this fail-closed contract. 2. revokeAzurePurchase: return error when MarkPurchaseRevoked fails after a successful Azure return. The previous log-and-continue left the DB unmarked, breaking idempotency on retries. The error message notes that the refund was submitted so operators can investigate without re-issuing. * fix(purchases/revoke): populate revocation window at write path so the button works The Revoke button was shipped but dead: RevocationWindowClosesAt was never populated when a completed purchase was written, so the frontend gate canRevokeCompletedRow (which bails on a missing revocation_window_closes_at) hid the button on every real row. - Stamp RevocationWindowClosesAt in the real write path (purchase.savePurchaseHistory): Azure = Timestamp + 7 days (the free-cancel window), nil for AWS/GCP (out of Phase-1 scope), via a new shared config.RevocationWindowClosesAtFor helper + config.AzureRevocationWindowDays constant that is now the single source of truth for the window length. - Make the backend window check (revokeAzurePurchase) read RevocationWindowClosesAt as the source of truth, falling back to recomputing from Timestamp only for legacy rows written before the column was populated. - Reject an order-only ARM path (empty reservationID) in callAzureReturn rather than submitting an empty Return to Azure. - Tests: backend asserts savePurchaseHistory stamps the window for Azure and leaves it nil for AWS/GCP; handler asserts the stamped window drives the deny decision and that an empty reservationID is rejected; FE test asserts the Revoke button shows for a completed Azure row with a populated revocation_window_closes_at and is hidden without it (plus closed-window, already-revoked, non-Azure, and anonymous cases). * docs(auth/revoke): correct revoke-own doc to account-scope reality (#950) The ActionRevokeOwn doc comment claimed "Own" means created_by_user_id matches the session user, but checkRevokeOwnAccountAccess actually enforces ACCOUNT scope (GetAllowedAccountsAPI), because purchase_history rows pre-date created_by_user_id and have no reliable per-creator attribution. Fix the doc comments to describe the account-scope behavior as implemented; the authz model itself is unchanged. Whether revoke-own should instead be creator-scoped is a product decision tracked in issue #950, noted inline in both the auth constant doc and the handler check. * feat(purchases/revoke): Gmail-style pre-fire delay unifies revoke across providers Adds status=scheduled to purchase_executions so the approval step defers the cloud SDK call by purchase_delay_hours. A scheduled execution can be cancelled at $0 via the existing revoke endpoint before the scheduler fires. Two-tier revoke path: - status=scheduled: CancelExecutionAtomic (CAS) transitions to cancelled; no cloud API call, returns 200 with explicit "no cost incurred" message. - status=completed (existing): provider SDK call via purchase_history path. revokePurchase now tries GetExecutionByID first; falls through to the existing purchase_history lookup when no scheduled execution is found. authorizeSessionRevokeExecution mirrors authorizeSessionRevoke but uses CreatedByUserID (not CloudAccountID) for revoke-own scoping. FireScheduledDelayedPurchases (purchase.Manager) drives the scheduler tick: queries GetScheduledExecutionsDue, CAS-transitions scheduled->approved, stamps ApprovedBy="scheduler", calls executeAndFinalize. Registered as TaskFireScheduledPurchases in the Lambda task dispatcher. Also folds in three CodeRabbit findings from PR #804 review f9c66c1e: - Remove unused mockAuth in two AzureCalcRefundClientError/ReturnClientError tests - Add idempotent audit CHECK constraints to migration 000065 (revoked_via, support_case_id, revoked_at/revoked_via pair) - Frontend regression test: legacy blank-status Azure rows stay revocable - analytics/collector_test.go: hook-backed GetPurchaseHistoryByPurchaseID and MarkPurchaseRevoked mocks (no more hardcoded no-ops) * refactor(api/config): extract helpers to keep revoke+approve+email+scan under gocyclo limit revokePurchase -> loadAndRevokePurchaseHistory (handler_purchases_revoke.go) approvePurchase -> approveViaToken (handler_purchases.go) sendPurchaseScheduledEmail -> buildScheduledEmailData (handler_purchases.go) scanExecutionRows -> applyNullTimesToExecution (store_postgres.go) Also applies gofmt alignment fix to internal/analytics/collector_test.go. * refactor(test/mocks): consolidate MockConfigStore into shared internal/mocks (#804 cleanup) Three per-package MockConfigStore definitions (internal/api, internal/purchase, internal/scheduler) were duplicating ~450 LOC each every time a new StoreInterface method was added. Replace all three with a type alias pointing at internal/mocks.MockConfigStore. Changes to internal/mocks/stores.go: - Add Fn-override fields imported from the api and purchase local mocks (GetCloudAccountFn, GetPurchasePlanFn, SetPlanAccountsFn, SavePurchaseExecutionFn, GetPlanAccountsFn, and 6 others) so callers that used those fields keep working without changes. - Add isExpected guards to recommendation-cache and RI-utilization-cache methods and to CancelExecutionAtomic so existing tests that call these without explicit On() expectations don't panic (matches the "opt-in" pattern the per-package mocks used via hasRecExpectation / hasExpectation). - Add 8 interface methods that were missing from the shared mock but present in all per-package variants: GetExecutionsByStatuses, GetPlannedExecutions, GetStaleApprovedExecutions, ListStuckExecutions, GetScheduledExecutionsDue, MarkCollectionStarted, ClearCollectionStarted, StampRIExchangeApprovedBy. - Add compile-time check: var _ config.StoreInterface = (*MockConfigStore)(nil). - Promote GetGlobalConfig and GetPurchasePlan to return sensible defaults when no expectation is registered (matches the api-package behaviour these tests relied on). Per-package files reduced to a single type alias line each. The scheduler test also had stray suppression/Tx method stubs added in a later commit; those are removed because the methods already live on the shared mock. Intentionally left local (incompatible shape or semantics): - internal/analytics/collector_test.go: mockConfigStore (lowercase) -- hook-field only pattern with no testify embedding; analytics-specific subset; different name. - internal/server/test_helpers_test.go: mockConfigStoreForHealth -- all-zero-value stubs for health check tests; distinct type name; no testify embedding. - internal/server/handler_ri_exchange_test.go: mockConfigStoreForExchange -- test- specific struct overriding a handful of methods. - internal/server/handler_coverage_test.go: mockConfigStoreForExchange{Complete, Fail,Stale} -- per-scenario stubs with distinct type names. Net LOC: +274 insertions / -1601 deletions (-1327 net across 4 files). * fix(api/purchases/revoke): use status='scheduled' CAS so pre-fire revoke isn't dead The pre-fire delay revoke path called CancelExecutionAtomic, whose SQL guard is `WHERE status IN ('pending','notified')`. A status='scheduled' row never matches, so the CAS returned zero rows on the happy path -- the handler then surfaced 410 "revocation window has closed" even when the window was wide open. The user would pay the cloud charge AND see a "cancelled" attempt in the UI -- the worst possible outcome. The bug was hidden by mocks defaulting CancelExecutionAtomic to (true,"cancelled",nil), so every test in the scheduled-revoke suite was green against the wrong SQL. No pgxmock test exercised the WHERE clause. Fix: introduce CancelScheduledExecutionAtomic with the correct `WHERE status = 'scheduled'` guard and switch the handler to it. The two CAS variants are kept distinct on purpose -- the scheduled-revoke flow surfaces 410 ("scheduler already fired") on race-loss, while the pre-purchase cancel flow surfaces 409 ("not pending"). Sharing one method would conflate the two race outcomes. Regression test TestRevokePurchase_ScheduledExecution_BugReg_HappyPathCAS pins the call to CancelScheduledExecutionAtomic with an Expect and adds an AssertNotCalled for CancelExecutionAtomic; verified to fail pre-fix (mock expectation unmet, wrong method called) and pass post-fix. Touches: internal/config/{store_postgres,interfaces}.go -- add method + comments internal/api/handler_purchases_revoke{,_test}.go -- switch call site + reg test internal/mocks/stores.go -- mock the new method (default happy path) internal/server/test_helpers_test.go, internal/analytics/collector_test.go -- satisfy StoreInterface * fix(frontend/history): gate Revoke button on revoke-{any,own} permission canRevokeCompletedRow only checked getCurrentUser() truthiness, so the inline Revoke button rendered for every signed-in user regardless of the revoke-any / revoke-own grant. The backend correctly 403s, but the UX-vs-RBAC drift is exactly what PR #995 caught for approve / delete on the same page. The peer predicates (canCancelPendingRow, canApprovePendingRow, canRetryFailedRow) already check canAccess; canRevokeCompletedRow now does too. Verbs match the backend handler one-to-one: - admin or revoke-any:purchases -> always allowed - revoke-own:purchases -> allowed (account-scope enforced server-side) - anything else -> hidden Adds revoke-own / revoke-any to the closed Action union in permissions.ts so a future drift becomes a compile error at the canAccess call site. Adds a regression test that mocks getCurrentUser with an effectivePermissions set lacking revoke-* and asserts the button is hidden; verified to fail without the canAccess gate and pass with it. Also corrects the existing ADMIN_USER fixture to use the real ADMINISTRATORS_GROUP_ID GUID -- the prior 'administrators' label was inert because the new canAccess fallback drives off isAdmin() which checks GUID membership. * test(frontend/permissions): update USER_PERMS expected set for revoke-own PR #804 added 'revoke-own:purchases' to USER_PERMS in permissions.generated.ts but missed updating the user-role expected list in __tests__/permissions.test.ts. The test asserts perms.size matches expected.length, so the missing entry surfaced as "Expected 11, received 12" after the addition. This is the same scope as the parent permissions add -- not a separate permission grant, just the test parity update PR #804 should have included alongside the original add. * fix(server): table-drive ParseScheduledEvent + cover scheduled-fire sweep The fire_scheduled_purchases case pushed ParseScheduledEvent's cyclomatic complexity to 11, tripping the gocyclo<=10 pre-commit hook and leaving the PR UNSTABLE. Replace the action switch with a package-level lookup table so the complexity no longer grows with the task list; adding a task type stays a one-line change. Also close the test gap on the Gmail-style pre-fire delay scheduler sweep: FireScheduledDelayedPurchases / fireOneDue had no unit coverage of their result accounting or CAS-race classification (only the dispatch wiring was mocked). Add scheduled_fire_test.go mirroring reaper_test.go: - no-due-rows and list-error paths - CAS lost to a concurrent revoke -> RaceLost, not Errored, and no SDK fire (the safety property that prevents double-charging a revoked purchase) - row-vanished (ErrNotFound) -> RaceLost - hard DB error on the CAS -> Errored (per-row isolation, sweep still succeeds) Each race/error test fails if the classification regresses (verified by flipping fireOneDue's return). Add the fire_scheduled_purchases case to the ParseScheduledEvent table test so the new action is positively asserted. * fix(migrations): renumber 000068 -> 000070 to deconflict (refs #290) PR #808 keeps 000068 and PR #847 takes 000069; bump this branch's purchase_history_revocation migration to 000070 to avoid conflicts on feat/multicloud-web-frontend. * fix(scheduler): wire FireScheduledDelayedPurchases tick (CRITICAL: pre-fire delay branch was non-functional) - Add testify-based FireScheduledDelayedPurchases to scheduler's MockPurchaseManager so it records calls and satisfies mock.AssertExpectations. - Add TestSchedulerManagerInterface_FireScheduledDelayedPurchasesWired: compile-time guard that ManagerInterface exposes the method + call-recording smoke test. - Add TestFireScheduledDelayedPurchases_EndToEnd in scheduled_fire_test.go: skipped placeholder (with documented skip reason + issue ref) for the full provider-stub e2e once #1005 4-eyes lands. - Add TestFireScheduledDelayedPurchases_DelayPathNotSilentNoOp: compile-time guard that FireScheduledDelayedPurchases exists on Manager. The server/handler_test.go "fire_scheduled_purchases success" and "fire_scheduled_purchases propagates error" cases cover the full dispatch chain (ScheduledTaskType -> handleFireScheduledPurchases -> Purchase.FireScheduledDelayedPurchases). * fix(api/purchases): CAS-guard scheduleApprovedExecution to prevent silent revoke loss Replace the blind SavePurchaseExecution write in scheduleApprovedExecution with a two-step CAS pattern: 1. TransitionExecutionStatus(pending|notified -> scheduled) -- atomic CAS. 2. Stamp ScheduledExecutionAt + ApprovedBy on the returned post-CAS row. 3. SavePurchaseExecution to persist the stamps. Before this fix a concurrent Cancel that landed between the approve handler's SELECT and its SavePurchaseExecution would be silently overwritten: the cancelled row would become status="scheduled" and eventually fire the cloud SDK call the user explicitly revoked. The CAS ensures the write succeeds only when the row is still in pending or notified; a concurrent cancel causes ErrExecutionNotInExpectedStatus which surfaces as a clear error to the caller. Tests added: - TestHandler_scheduleApprovedExecution_CASGuardsConcurrentCancel: injects a concurrent-cancel error and asserts SavePurchaseExecution is never called. - TestHandler_scheduleApprovedExecution_HappyPath: normal flow, asserts ScheduledExecutionAt is stamped on the transitioned row. * feat(purchases/revoke): two-step quote-then-confirm + persist refund amount for audit Addresses adversarial-review Finding #4: - Add GET /api/purchases/revoke/calculate/{id} endpoint (calculateAzureRevoke) that calls CalculateRefund and returns the quoted refund amount and currency; no state mutation, used by the frontend confirmation modal. - Thread expectedRefundAmount through dispatchProviderRevoke -> revokeAzurePurchase -> callAzureReturn; on POST /revoke the client sends the amount it consented to and callAzureReturn re-runs CalculateRefund, rejecting with 422 when the new quote diverges by more than revokeQuoteEpsilon (0.01) to close the TOCTOU window between user confirmation and actual Return call. - Persist calc_refund_amount / calc_refund_currency via MarkPurchaseRevoked so the audit row captures the quoted values even if the actual refund differs later. - Migration 000071 adds the two new nullable columns and a consistency CHECK constraint (currency must be non-empty when amount is present). - New tests: TestCallAzureReturn_TOCTOUDivergenceRejectedWith422, TestCallAzureReturn_TOCTOUWithinEpsilonSucceeds, TestCallAzureReturn_AuditRowPopulatedWithQuote. * fix(purchases/revoke): partial-success reconciliation; never retry a refund that already succeeded Addresses adversarial-review Finding #6: - Migration 000072 adds revocation_in_flight BOOLEAN NOT NULL DEFAULT false to purchase_history plus a partial index on rows where the flag is true. - callAzureReturn flips revocation_in_flight=true via FlipPurchaseRevocationInFlight immediately before the Azure Return API call so the row is visible to the finalize sweep if the subsequent MarkPurchaseRevoked DB write fails. - MarkPurchaseRevoked is retried up to 3 times with 1s/3s/9s backoff after Azure Return succeeds; if all retries fail, the endpoint returns a revokeReconcilePendingResult (HTTP 207 body with code=RECONCILE_PENDING, azure_returned=true) so the frontend shows a non-retryable toast rather than prompting the user to retry a refund that Azure already issued. - loadAndRevokePurchaseHistory detects a row with revocation_in_flight=true and revoked_at=nil and immediately returns 207 RECONCILE_PENDING to prevent any duplicate Azure Return call on retry. - purchase.Manager.FinalizeInFlightRevocations sweeps rows matching GetPurchaseHistoryInFlight and retries MarkPurchaseRevoked with 2s/6s backoff per row; the finalize_revocations scheduled task wires this sweep into the Lambda event handler. - New tests: TestCallAzureReturn_MarkPurchaseRevokedFailAllRetries, TestLoadAndRevokePurchaseHistory_RevocationInFlightReturns207, finalize_revocations success and error dispatch tests. * fix(purchases/revoke): typed Azure error classification (no more substring match on err.Error()) Addresses adversarial-review Finding #7: Replace the string-based isAzureClientError implementation with typed error inspection using errors.As(err, &*azcore.ResponseError). The substring-match approach had two failure modes: - False positives: any error whose .Error() string contains "400" (e.g. a network timeout "timeout after 400ms") would be misclassified as a client error, hiding transient infra problems from the operator. - False negatives: Azure refund-policy errors with HTTP codes not in the literal set (e.g. 403, 405) would be escalated as 500. The typed approach classifies exactly the HTTP status codes Azure uses for policy violations and bad requests (400, 403, 404, 405, 409, 422); all other errors (transport errors, 5xx, plain errors) correctly classify as server-side. Updated TestRevokePurchase_AzureCalcRefundClientError to inject a real *azcore.ResponseError{StatusCode: 400} instead of errors.New("400: ..."). New tests: TestIsAzureClientError_SubstringFalsePositive, TestIsAzureClientError_TypedResponseError (covers 4xx client + 5xx server). * fix(purchases/revoke): 1h safety margin on local window + clean 422 on Azure window-edge rejection Addresses adversarial-review Finding #3: Add azureRefundSafetyMargin = 1h so the in-app revoke button disappears and revoke requests are rejected 1h before Azure's hard 7-day deadline. This eliminates the tail of RefundPolicyViolated failures caused by clock skew between CUDly's clock and Azure's at the window boundary. The safety margin is applied only in the local pre-flight check. The value stored in purchase_history.revocation_window_closes_at remains the unmodified Azure deadline so operators can see the true expiry. Additionally, detect RefundPolicyViolated errors from the Return API via the new isAzureWindowEdgeError helper (typed errors.As on *azcore.ResponseError, checking ErrorCode == "RefundPolicyViolated") and map them to a clean 422 with "window has closed" message rather than the generic "Azure refund rejected" 400 path. This handles the race where our safety-margin check passes but Azure's clock disagrees mid-flight. New tests: - TestRevokePurchase_AzureWithinSafetyMarginRejected: purchase 6d23h30m ago (30min before edge) rejected locally despite Azure deadline not yet passed. - TestRevokePurchase_AzureJustOutsideSafetyMarginAllowed: purchase 6d22h30m ago (90min before edge) accepted. - TestIsAzureWindowEdgeError: table test covering RefundPolicyViolated, other error codes, nil, and plain-error false-positive. - TestCallAzureReturn_RefundPolicyViolatedReturns422WindowEdge: Return API returning RefundPolicyViolated surfaces as 422, not 400. * fix(migrations): allow support-case revoke to record in-flight state (case filed, awaiting AWS) Addresses adversarial-review Finding #5: The original purchase_history_revoked_pair_chk required revoked_at and revoked_via to be set or unset together. This is too strict for the AWS support-case revocation path (issue #291 wave-2): when a case is filed, revoked_via='support-case' is recorded immediately for the audit trail, but revoked_at stays NULL until AWS confirms the refund. The pair check fires as a constraint violation in that in-flight state. Migration 000073 drops the pair check and simultaneously tightens the support-case companion check: Old: CHECK (support_case_id IS NULL OR revoked_via = 'support-case') (prevents support_case_id on non-support-case rows only) New: CHECK (revoked_via != 'support-case' OR support_case_id IS NOT NULL) (requires support_case_id whenever revoked_via = 'support-case') The meaningful invariant (no dangling revoked_at without a known provider path) is preserved by the existing purchase_history_revoked_via_chk which constrains revoked_via to ('direct-api', 'support-case'). * test(purchases/revoke): DST-crossing window math + 4-eyes approval placeholder Addresses adversarial-review Finding #9: TestRevocationWindowClosesAtFor_DSTCrossing: verifies that AddDate(0,0,7) (calendar arithmetic) produces the correct 7-day window across a DST transition. The test uses the 2024 US spring-forward on March 10 (02:00 EST -> 03:00 EDT): a purchase at 01:30 must close at 01:30 seven days later, not at 00:30 as a naive Add(168*time.Hour) would produce. This pins the correct behaviour and documents why a fixed-duration approach would be wrong. TestRevokePurchase_FourEyesApproval: skipped placeholder for the revoke 4-eyes approval gate tracked in issue #1005. The skip keeps the suite green while the feature is in flight and serves as a reminder to fill in the implementation when #1005 lands. * fix(api/purchases): map scheduleApprovedExecution CAS race to 409 (not 500) When a concurrent Cancel flips the execution away from schedulable status between the approve flow check and the CAS write, approveWithDelay now returns 409 instead of 500. ErrExecutionNotInExpectedStatus from TransitionExecutionStatus is the discriminator; any other error continues to surface as 500. * fix(api/purchases/revoke): drop pre-check, distinguish GetExecutionByID errors Two related fixes on the same line (revokePurchase GetExecutionByID branch): Finding B: Remove the racy status=="scheduled" pre-check. The old code read the status from the DB and then only dispatched to revokeScheduledExecution when it was "scheduled", but a concurrent writer could flip the status between the read and the CancelScheduledExecutionAtomic call. Drop the pre-check and let CancelScheduledExecutionAtomic's WHERE status='scheduled' CAS decide; a lost CAS returns 410 as expected. Finding C: Distinguish a genuine DB error (non-nil execErr) from a missing row (nil, nil). Before the fix, any non-nil execErr was folded into the "execErr == nil && ..." condition and silently swallowed, falling through to the history lookup. Now a non-nil execErr surfaces immediately as a wrapped error (500). * fix(api/purchases/revoke): clear revocation_in_flight on Azure error paths (transient retryable) When the Azure Return call fails (window-edge, client-error, or transient), the revocation_in_flight flag was left stuck at true. The finalize_revocations sweep would then treat the row as "Azure succeeded, DB write pending" and retry MarkPurchaseRevoked unnecessarily, potentially marking a purchase as revoked when it was never actually returned. Fix: call ClearRevocationInFlight (new store method) on all Azure Return error paths so the row reverts to its original status. On success the flag stays true for the sweep to handle (existing wave-1 behaviour unchanged). Adds ClearRevocationInFlight to StoreInterface, PostgresStore, and MockConfigStore. * fix(frontend/history): expose Revoke button for status='scheduled' rows + regression test Three related changes to surface the Revoke button on Gmail-style pre-fire delayed executions (status='scheduled') in the History UI: 1. Add "scheduled" to historyExecutionStatuses so these rows appear in the /api/history response at all (they were previously invisible). 2. Populate RevocationWindowClosesAt from ScheduledExecutionAt for scheduled rows in annotateHistoryRowByStatus so the frontend window check (revocation_window_closes_at in the future) works without a new field. 3. Update canRevokeCompletedRow to accept status==="scheduled" in addition to "completed" and "" (legacy blank), so the Revoke button renders. Adds regression test: a scheduled Azure row with a future revocation window must show the Revoke button (Findings E + G, second-wave CR). * fix(test/purchases/revoke): fix AssertNotCalled placement + isolate parallel test backoff state Finding F-1: Move AssertNotCalled(t, "CancelExecutionAtomic") to after the handler call in TestRevokePurchase_ScheduledExecution_BugReg_HappyPathCAS. The assertion was placed before h.revokePurchase(), where it trivially passes regardless of what the handler does. Finding F-2: Drop t.Parallel() from TestCallAzureReturn_MarkPurchaseRevokedFailAllRetries. The test mutates the package-global revokeMarkRetryBackoffs slice; running it in parallel with other tests that read the same variable is a data race. The test already uses t.Cleanup to restore the original value, which is sufficient when run sequentially. * fix(test/history-revoke): add missing mock entries for escapeHtmlAttr and getAmortizeUpfront The state mock lacked getAmortizeUpfront/setAmortizeUpfront/subscribeAmortizeUpfront and the utils mock lacked escapeHtmlAttr/amortizedMonthly. Both are called inside renderHistoryList; the missing entries caused a TypeError that loadHistory's try/catch swallowed, silently producing an empty history-list and hiding the Revoke button for all three "shows Revoke" cases. * fix(purchases/revoke): let the CAS decide scheduled cancellability (#290) Remove the early window-expiry 410 in revokeScheduledExecution. A row still in status="scheduled" has not been transitioned by the scheduler, so the cloud SDK call has not fired regardless of how far scheduled_execution_at is in the past (scheduler lag / backpressure). Returning 410 purely on a past timestamp broke free-cancel during lag even though CancelScheduledExecutionAtomic could still cancel the row before any cloud call. Let the CAS be the sole arbiter: it returns cancelled=false (410) only when the row has actually moved out of "scheduled". Updates the former WindowExpired test to assert the new contract (a past-timestamp scheduled row is cancelled for free via the CAS), and refreshes two stale "window-check" comments. Closes the last open CodeRabbit thread on PR #804. * fix(email/revoke): require recipient for scheduled-delay email; tidy tests (#290) Address the second CodeRabbit review pass on PR #804: - Security: SendPurchaseScheduledNotification (SES Sender) no longer falls back to the broadcast SendNotification path when RecipientEmail is empty. That email embeds a live, execution-scoped revoke link, so broadcasting it leaked an action link to every alert subscriber and broke the ownership/RBAC model around revocation. It now returns ErrNoRecipient, matching SendScheduledPurchaseNotification and the SMTP sender. Adds a regression test asserting ErrNoRecipient on empty recipient. - Test: rename TestRevokePurchase_AzureJustOutsideSafetyMarginAllowed -> TestCallAzureReturn_JustOutsideSafetyMargin and correct its docstring; it drives callAzureReturn directly and never exercised the local 1h safety-margin gate (which lives in dispatchProviderRevoke). The reject side of that gate stays covered end-to-end via TestRevokePurchase_AzureWithinSafetyMarginRejected. - Build: implement the ClearRevocationInFlight StoreInterface method on the standalone analytics and server test mocks (mockConfigStore, mockConfigStoreForExchange, mockConfigStoreForHealth) so those test packages compile after the interface gained the method. * refactor(purchases): reduce cyclomatic complexity below the pre-commit gate (#290) The gocyclo pre-commit hook (threshold 10) failed CI on four functions the #290 feature work grew past the limit. Split each into focused helpers with no behavior change: - calculateAzureRevoke (21): extract validateAzureRevokeRequest (auth + load + authorize + window/ID validation, itself split into azureRevokeWindowAndIDs) and extractAzureRefundQuote. - callAzureReturn (21): extract azureCalculateRefund (CalculateRefund + parse), handleAzureReturnError (clear in-flight + status mapping), and persistAzureRevocation (MarkPurchaseRevoked retry + 207 result). - annotateHistoryRowByStatus (12): move the in-flight / audit-gap cases into annotateInFlightOrAuditGapRow. - dispatchTask (11): switch to a map-based dispatch. gocyclo now reports nothing over 10; full internal test suite green (except the pre-existing CSRF flake that also fails on base).
…1121) * fix(auth): derive stable CSRF key from encryption key (closes #1120) Production built the auth service with no CSRFKey, so NewService minted a random per-process key. CSRF tokens are HMAC-SHA256(csrfKey, rawSessionToken) minted at login and validated on every write, so a token issued by one Lambda instance was rejected by another instance (and by the same instance after a cold-start). This surfaced as intermittent "CSRF validation failed" on plan create and admin user-group updates for both Standard and Admin users. Derive a stable CSRFKey from the deploy-provided credential encryption key via HKDF-SHA256 with a domain-separation label, so every instance and every cold-start derives the same key. The encryption key is already mandatory in production (the app refuses to start without it), so this fails closed without requiring a new secret. The key load is memoized via sync.Once, so seeding the CSRF key before the credential store adds no extra work. Add a cross-instance regression test: a token minted by instance A validates on instance B that shares the same master secret, and is rejected by an instance built from a different secret (the pre-fix random-key situation). DeriveCSRFKey also fails closed on an empty master secret. * refactor(server): extract CSRF key derivation to satisfy gocyclo Extract CSRF key derivation and auth.Service construction from reinitializeAfterConnect into a new (*Application).initAuthService helper. Reduces reinitializeAfterConnect cyclomatic complexity from 11 to 10 (project limit), with initAuthService at 3. No behavior, ordering, error handling, or fail-closed semantics changed.
…es for non-admins (#981) * fix(settings): suppress permission-denied banner on Purchasing policies for non-admins (closes #979) When GET /api/config returns HTTP 403 with a "permission denied" error, loadGlobalSettings now returns early instead of rendering the red error banner. The section stays hidden and the page degrades gracefully, matching the existing Exchange Automation hide-for-non-admins pattern. Non-permission 403s, 5xx, and network failures still surface the banner. Four new tests in settings-permissions.test.ts cover: the 403+permission path (no banner), unrelated 403s (banner shown), 500s (banner shown), and network errors (banner shown). * refactor(settings): extract isPermissionDeniedError helper (#981 CR) Pull the inline 403 + "permission denied" check out of the catch block into a small helper so the catch reads as a single intent-revealing line. Behavior unchanged; existing settings-permissions tests cover the contract. * refactor(settings): narrow Error before reading status in isPermissionDeniedError (#981 CR) Reorder the guard so the `instanceof Error` narrowing precedes the `.status` type assertion, matching the conventional TypeScript idiom (narrow first, then access properties). Behavior is unchanged thanks to short-circuit evaluation; this addresses the CodeRabbit style nitpick.
…e collector (#1123) triggerAutoRefreshIfStale fired a background re-collect whenever the recommendations cache looked stale, and startRecommendationsRefresh's success path called onReload() (loadRecommendations on Opportunities, loadDashboard on Home), which re-entered triggerAutoRefreshIfStale. When the backend collection never advances last_collected_at (e.g. it fails at a DB error, so freshness stays stale forever) the success path re-fired the collect on every reload with no terminal condition. Observed in production at ~4-10 POST /recommendations/refresh per second, sustained (~18k calls/hour). The calls currently fail fast at a DB error before reaching Cost Explorer, but once the imminent DB fix lands every iteration becomes a paid Cost Explorer / pricing call (~$0.01 each), continuously. The existing autoRefreshInFlight guard only dedups *concurrent* refreshes and is cleared per cycle, so it does not stop this sequential re-trigger. Add a once-per-page-session latch (autoRefreshAttempted) so the stale-on-open auto-refresh fires at most once regardless of how many times loadRecommendations / loadDashboard re-run (filter changes, tab re-entry, the post-refresh reload). The latch is checked before the freshness GET so neither the GET nor the collect repeats. Explicit user actions (lookback change, manual refresh) call startRecommendationsRefresh directly and are never gated by the latch. A hard page reload resets module state, re-arming the once-per-load convenience. Add a regression test that reproduces the real scenario (collect POST resolves, freshness stays stale) and asserts the collect endpoint is hit exactly once across repeated loads and the post-refresh reload. Fails pre-fix (2 calls), passes post-fix (1 call). Reset the latch in the recommendations test beforeEach so the latch cannot leak across tests.
…routine (#1111) A panic in one region's worker (nil deref, SDK type assertion failure, slice OOB) crashed the whole CLI process. Add a deferred recover() that logs the panic with region context and lets the remaining goroutines complete, mirroring the pattern applied to Lambda goroutines in PR #859 (issue #672). Closes #997 Co-authored-by: Claude <noreply@anthropic.com>
…498) (#1122) The Savings History / savings-over-time chart and YTD KPI sparkline are backed by GET /history/analytics. The handler read and applied the account filter but never read params["provider"], and QueryHistory had no provider predicate, so the series showed identical all-provider data regardless of the selected AWS/Azure/GCP chip (QA row 2.3). The Home dashboard trend chart also deliberately dropped the provider param when calling the endpoint. Thread the provider through end to end, mirroring the working /history and trends (QueryByService) paths: - handler getHistoryAnalytics: read provider, normalise the "all" sentinel to "" (no filter) and validateProvider at the boundary, then pass it to QueryHistory. - PostgresAnalyticsClient.QueryHistory + AnalyticsClientInterface: add a provider param and an optional parameter-bound `AND provider = $N` WHERE fragment ("" = all providers), positioned after the dual-column account binds. - dashboard.ts loadSavingsTrendChart: forward the provider chip (and name it in the empty-state via buildFilterDesc, now that the query is actually provider-scoped). modules/savings-history.ts already forwarded provider; correct its stale "no-op until #502" comment. Regression tests fail pre-fix and pass post-fix: - TestQueryHistory_ProviderFilter / _ProviderFilterWithAccount assert the provider bind and its position. - TestHandler_getHistoryAnalytics_ProviderFilter / _ProviderAllIsNoFilter / _InvalidProvider cover threading, the "all" sentinel, and boundary rejection. - dashboard.test.ts asserts the provider chip is forwarded to the API and omitted when no provider is selected. Also reset the topbar-filter mocks in the savings-trend describe block so tests stay isolated now that the chart reads the provider chip.
…1126) The Home dashboard showed a non-zero "current / committed" figure for Azure (e.g. $166) while the Coverage tab reported $0 / "No usage detected" for the same subscription. The two surfaces read different fields of the same active commitments, and the Coverage path silently dropped Azure rows. Root cause: getCoverageBreakdown summed only the recurring MonthlyCost and guarded it with `if p.MonthlyCost != nil`. Azure all-upfront RIs carry MonthlyCost == nil (no recurring charge at the commitment layer, documented on config.PurchaseHistoryRecord.MonthlyCost), so every such row was skipped and the provider collapsed to nil Services / nil OverallCoveragePct. The dashboard, by contrast, aggregates EstimatedSavings (always populated), so it still counted the commitment. PR #1105 only fixed the recommendation converter's RecurringMonthlyCost for future rows; it never touched the Coverage handler, so the divergence persisted. Fix: derive the effective covered monthly spend via a shared commitmentCoveredMonthly helper = recurring MonthlyCost (when present) + amortised upfront (UpfrontCost / (Term * MonthsPerYear)), matching the canonical effective-monthly formula already used by analytics.Collector and exchange_lookup. An all-upfront commitment now contributes its amortised upfront instead of being silently dropped, so Coverage reflects the same active commitments the dashboard does. A nil MonthlyCost is treated as a real $0 recurring component (not a fabricated total), and Term <= 0 is guarded against division by zero, mirroring the collector's skip-bad-term defence. Regression test TestHandler_getCoverageBreakdown_AzureAllUpfrontConsistency seeds an active Azure compute all-upfront RI (MonthlyCost nil, $1200 upfront / 1y = $100/mo) and asserts both that the dashboard primitive counts it AND that the Coverage tab now reports $100/mo covered (100% coverage) instead of nil. It fails pre-fix (azure.Services nil) and passes post-fix.
…re alarm (#1124) * fix(migrations): opt-in dirty auto-heal, raise timeout default, failure alarm Harden the lazy-on-connect migration path that caused a prod outage: a slow migration (index build on a growing table) blew the 20s timeout mid-run, golang-migrate left schema_migrations dirty, and since dirty is terminal every later cold start fail-opened and served 500s on any query needing the unapplied columns (070-073 never applied). Three changes: 1. Default-on dirty auto-heal (internal/database/postgres/migrations/migrate.go). New maybeAutoHealDirty runs after maybeForceMigrationVersion and before Up(): when the DB is dirty it Force()s the CURRENT recorded version to clear the flag, then Up() re-applies the pending tail, so a cold start self-recovers instead of staying broken until a manual force (the multi-hour outage shape). Enabled by default; CUDLY_MIGRATION_AUTOHEAL=false is the escape hatch. Forces the current version, NEVER lower: forcing below already-applied seed migrations re-runs guards that raise on a second run (e.g. 000059 "Purchaser already exists"). The app ALWAYS starts: if auto-heal still can't reach head, the error propagates to ensureDB which fail-opens (existing behavior) and the alarm fires -- never a crash-loop. CUDLY_FORCE_MIGRATION_VERSION still takes precedence. Relies on migrations being idempotent (documented + tested). 2. Raise defaultMigrationsTimeout 20s -> 120s (internal/server/app.go), still well under the 300s Lambda limit, so a normal index build can't be killed mid-run. CUDLY_MIGRATION_TIMEOUT override unchanged. 3. CloudWatch migration-failure alarm (terraform/modules/compute/aws/lambda). Log metric filter on the "Migration failed" line + alarm, since the app fail-opens and AWS/Lambda Errors stays clean during a broken migration. Notification target optional via alarm_sns_topic_arn (default []); reuses the existing alarm/metric-filter patterns, no new SNS infra. Fix latent migration bug surfaced by the new idempotency test: migration 000067 widened savings_snapshots.account_id BEFORE dropping the materialized views that depend on it, so the full stack failed to reach head on a fresh DB ("cannot alter type of a column used by a view"). Moved the view DROPs ahead of the ALTER COLUMN. This moved the migrations integration suite from 4/37 to 28/42 passing; the remaining failures are pre-existing and unrelated (users role column, savings-plans split assertions). Tests: - TestMigrations_FullStackIdempotent (integration): migrate to head twice, second run is a clean no-op and not dirty. - TestMigrations_AutoHealDirty (integration): dirty DB self-heals by default and with =true; CUDLY_MIGRATION_AUTOHEAL=false disables it so a dirty DB errors with the dirty flag intact. - TestResolveMigrationsTimeout (unit): default is 120s, honors the override, falls back on invalid/non-positive input. Spec specs/migration-resilience.md updated for the new default, the auto-heal flag, precedence, and the alarm. * fix(migrations): extract migrator recovery hooks to fix gocyclo RunMigrations hit cyclomatic complexity 11 (gocyclo budget is 10) after the default-on dirty auto-heal added a second pre-Up error branch, failing the pre-commit CI hook. Extract migrator creation plus the operator-force and auto-heal hooks into newMigratorWithRecovery, dropping RunMigrations to 9 while preserving hook ordering and error propagation. The helper closes the migrator on its own error paths so callers never receive a half-initialized migrator. * test(migrations): enforce all migrations are transactional + address CR Add a static guard test that reads every *.up.sql / *.down.sql under the migrations dir and fails if any file contains a statement PostgreSQL refuses to run inside a transaction block (CREATE/DROP INDEX CONCURRENTLY, REINDEX, VACUUM, ALTER SYSTEM, CREATE/DROP DATABASE, CREATE TABLESPACE, REFRESH MATERIALIZED VIEW CONCURRENTLY). golang-migrate runs each migration file in one implicit transaction, so forbidding non-transactional statements guarantees a failed migration rolls back atomically with no partial schema. This is orthogonal to the dirty-flag auto-heal in this PR; it closes the partial-apply hole. The scanner strips SQL comments and dollar-quoted function/DO bodies before matching, so CONCURRENTLY inside a stored function (e.g. migration 000003's refresh_savings_materialized_views) is correctly treated as benign. Files may opt out via the marker comment "-- migrate:no-transaction"; such files are skipped with a loud warning and MUST be applied via the deploy-time runner (issue #1125), never the in-request auto-migrate path. Also address the CodeRabbit nitpick on specs/migration-resilience.md: clarify that alarm_sns_topic_arn is a list of SNS topic ARNs.
…loses #419) (#856) * sec: replace math/rand v1 with math/rand/v2 for rate-limiter jitter (#419) math/rand/v2 (Go 1.22+) is the project standard for non-crypto jitter (already used in pkg/retry/exponential.go). Swaps the package-level import in providers/aws/recommendations/ratelimiter.go; the call-site (rand.Float64()) is identical in v2. Adds TestWait_JitterRange: 200 iterations verify jitter stays within [0, 20%] of the base delay and spans more than 5% (non-stuck source). * test(ratelimiter): make jitter test deterministic to remove wall-clock flakiness (refs #419) Extract computeJitter(delay, r) as a pure function and replace the wall-clock-based TestWait_JitterRange with TestComputeJitter, which drives the function with 10,000 evenly-spaced r values in [0,1) and asserts the output is within [0, 0.2*delay]. No sleeping means OS scheduling overhead cannot blow the upper-bound check. The spread assertion (max observed >= 95% of cap) still catches a regressed always-zero jitter source.
…re-assert) (#1127) * fix(migrations): repair partially-applied 058-067 schema (idempotent re-assert) A prior emergency Force(67) on a dev/QA database marked migrations 058-067 as applied without running their SQL, leaving genuinely-unapplied schema changes missing. The most visible symptom is live 500s on planned/pending/stuck purchase-execution queries: "column idempotency_key does not exist" (purchase_executions.idempotency_key from 066). groups.system_managed (059) and other 058-067 schema are missing too. The 058-067 range cannot simply be re-run: migration 059 has a RAISE EXCEPTION seed-guard that conflicts with the Purchaser group already relocated by 064. Add migration 000074, which re-asserts ONLY the schema from 058-067 in fully idempotent form so any partially-migrated (or fully-migrated) DB converges to the correct schema. It intentionally omits the conflicting data seeds (059 Purchaser INSERT + admin backfill, 060 universal-plans DELETE, 064 relocate); those rows already exist or were deliberately relocated on any affected DB. Schema re-asserted, by source migration: - 058: purchase_executions.executed_by_user_id, .executed_at, .pre_approval_skip_reason; index idx_executions_direct_execute - 059: groups.system_managed - 063: purchase_history.monthly_cost DROP NOT NULL - 065: function check_min_one_admin(); constraint triggers trg_min_one_admin_delete, trg_min_one_admin_update - 066: purchase_executions.idempotency_key - 067: savings_snapshots.account_id widened to VARCHAR(255); total_usage / coverage_percentage made NULLABLE (DROP NOT NULL + DROP DEFAULT); materialized views monthly_savings_summary, daily_savings_trend, provider_savings_summary recreated carrying cloud_account_id (+ unique indexes); function drop_old_savings_partitions() Ordering differs from source 067 in one place: the materialized views are dropped BEFORE the account_id type widening. On the corrupted DB the views still exist from migration 000003 selecting account_id, which blocks ALTER COLUMN account_id TYPE. Dropping the dependent views first lets the repair converge from the view-present state. The migration is fully transactional (no CONCURRENTLY), so golang-migrate applies it atomically, and it is idempotent (re-applying is a clean no-op). The .down.sql is a documented no-op: this repair does not own the objects (the source migrations do), so dropping them on rollback would corrupt a correctly-migrated DB. Verified against a Postgres 16 container: reproduced the corrupted baseline (001-057 applied, 058-067 skipped, 070-073 applied), confirmed all listed objects were missing, applied 074, and confirmed every object converged; re-applied 074 and confirmed it is a clean no-op; confirmed it runs inside a single transaction. go build and migration unit tests green, gofmt clean. * fix(db/migration): address CR findings on 000074 repair migration Two CodeRabbit findings on the idempotent 058-067 repair migration: - Materialized views (monthly_savings_summary, daily_savings_trend, provider_savings_summary) were created without WITH NO DATA, so each populated immediately and the subsequent REFRESH re-ran the same aggregation, doubling work and lock time on savings_snapshots. Create the views WITH NO DATA so the single REFRESH is the only populate step. - drop_old_savings_partitions enumerated partitions via pg_tables with tablename LIKE 'savings_snapshots_%', where the LIKE wildcard '_' could match unrelated tables (e.g. savings_snapshots_backup_2024_01) and risk dropping non-partition data. Enumerate real child partitions from the catalog (pg_inherits) scoped to the savings_snapshots parent instead. Migration stays fully idempotent, transactional, and CONCURRENTLY-free; no 059 data seed is reintroduced. go build ./... and go test ./internal/database/... pass.
… are not dropped (#1129) Cost Explorer returns the EC2 RI tenancy title-cased ("Shared", "Dedicated", "Host"), but resolveEC2Tenancy compared it case-sensitively against lowercase "shared"/"dedicated". Every real recommendation hit the default branch and errored; parseRecommendations logged a warning and continue'd, silently dropping all EC2 RI recommendations. With Savings Plans collection separately blocked by an IAM AccessDenied, AWS produced zero recommendations and the dashboard showed no AWS opportunities. Normalise the input with strings.ToLower(strings.TrimSpace(...)) before matching, mirroring the existing whitespace-tolerant handling on the RDS DeploymentOption path. The fail-loud M5 behaviour is preserved: "host" (Dedicated Hosts, no RI product) and any unknown value still error rather than collapse to a default-tenancy buy. Regression test: replicates the production casing ("Shared"/"Dedicated" plus surrounding whitespace) which fails on the pre-fix parser, and asserts title-cased "Host" still errors.
* docs(readme): document web interface install + usage
Add a "Web Interface" section covering what the dashboard provides,
honest capabilities and limitations per provider/feature, Terraform
deployment across AWS/GCP/Azure, and how to access the app and
complete the first-login admin bootstrap.
Verified against: frontend/package.json, internal/api/router.go,
internal/server/static.go + http.go + app.go,
terraform/environments/{aws,gcp,azure}/*, docs/DEPLOYMENT.md.
* docs(readme): mark web GUI experimental + add CLI support matrix
Add a prominent experimental note to the Web Interface section heading
and opening blockquote. Retitle "AWS Services (Production Ready)" to
"AWS Services" to avoid overstating stability.
Add an "AWS CLI Support Matrix" subsection under Supported Cloud
Providers & Services listing all seven CLI-accessible AWS services
(rds, elasticache, ec2, opensearch, redshift, memorydb, savingsplans)
verified against providers/aws/services/ and cmd/main.go, with RDS
and ElastiCache marked Tested and the remaining five marked
Experimental (seeking testers), per owner guidance.
…to unblock deploys (#1130) #1124's migration-failure metric filter requires logs:PutMetricFilter, which the CI deploy SA lacks until the ci-cd-permissions bootstrap grants it, so every terraform apply 403s and blocks all deploys (bootstrap-vs-runtime IAM split, see CLAUDE.md CI/CD IAM). Gate the metric filter and alarm behind a new enable_migration_alarm variable (default false) so deploys succeed; enable it later by re-applying the bootstrap with the metric-filter permissions and setting enable_migration_alarm=true.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (3)
📒 Files selected for processing (297)
📝 WalkthroughWalkthroughThis PR adds local development, CI/CD, deployment, and incident-response files; expands CLI sizing and multi-service processing logic; introduces Lambda and cloud-setup commands; and updates documentation, security policy, and repository hygiene configs. ChangesRuntime and purchase flow
Infrastructure and delivery
Documentation, policy, and repository rules
Sequence Diagram(s)sequenceDiagram
participant runToolMultiService
participant fetchExistingCoverage
participant filterAndAdjustRecommendations
participant processPurchaseLoop
participant audit record writer
runToolMultiService->>fetchExistingCoverage: fetch existing RI coverage
runToolMultiService->>filterAndAdjustRecommendations: score and filter recommendations
runToolMultiService->>processPurchaseLoop: confirm and execute purchases
processPurchaseLoop->>audit record writer: write per-recommendation audit records
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90+ minutes Possibly related PRs
Suggested labels
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
Promotes the long-lived integration branch
feat/multicloud-web-frontendtomain.mainis a clean ancestor (0 commits ahead, ~1380 behind), so this is a conflict-free fast-forward preserving full history.After merge:
feat/multicloud-web-frontendtomain;mainas the integration base;feat/multicloud-web-frontendis retired.Deploy workflows already trigger on
main, so deployment continuity is preserved. The migration-failure alarm is gated off by default (#1130) so the deploy does not 403.Summary by CodeRabbit
Release Notes
New Features
Enhancements
Documentation