Skip to content

perf(azure database): batched per-server lookup for AZConfig (closes #149)#812

Open
cristim wants to merge 2 commits into
mainfrom
fix/149-wave5
Open

perf(azure database): batched per-server lookup for AZConfig (closes #149)#812
cristim wants to merge 2 commits into
mainfrom
fix/149-wave5

Conversation

@cristim

@cristim cristim commented May 28, 2026

Copy link
Copy Markdown
Member

Summary

  • Replace deferred per-server AZConfig/Deployment lookups with two subscription-wide batched pager walks (managed instances + regular servers), each walked at most once per GetRecommendations call via sync.Once.
  • For a 10-server account: before = 0 SDK calls (fields left empty); after = 2 SDK calls total regardless of N recommendations -- ManagedInstancesClient.NewListPager + ServersClient.NewListPager, both cached for the client lifetime.
  • AZConfig derived from dominant ManagedInstanceProperties.ZoneRedundant ("zoneRedundant" / "none" / "" when ambiguous). Deployment derived from managed-instance vs. regular-server presence ("managed" / "single" / "").

Test plan

  • _PopulatesAZConfig: all zone-redundant managed instances -> AZConfig="zoneRedundant"
  • _AmbiguousAZConfig: mixed zone-redundancy -> AZConfig=""
  • _AZConfigPagerErrorFallsBack: pager error -> AZConfig="", conversion still succeeds
  • _PopulatesDeployment: only managed instances -> Deployment="managed"
  • _DeploymentSingle: only regular servers, no managed instances -> Deployment="single"
  • _DeploymentPagerErrorFallsBack: servers pager error -> Deployment="", conversion still succeeds
  • _ServerInfo_FetchedOnce: 10 converter calls -> managed-instances pager walked exactly once
  • All 53 existing tests remain green
  • go build ./... clean

@cristim cristim added triaged Item has been triaged priority/p3 Polish / idea / may never ship severity/low Minor harm urgency/this-quarter Within the quarter impact/few Limited audience effort/s Hours type/feat New capability labels May 28, 2026
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@cristim, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 1 second. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8154859c-b63f-4766-93d9-e4aa5edd5029

📥 Commits

Reviewing files that changed from the base of the PR and between 451a70f and 9330bda.

📒 Files selected for processing (2)
  • providers/azure/services/database/client.go
  • providers/azure/services/database/client_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/149-wave5

Comment @coderabbitai help to get the list of available commands and usage tips.

@cristim

cristim commented May 28, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim

cristim commented May 30, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 30, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Full review triggered.

@cristim

cristim commented Jun 4, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Rate Limit Exceeded

@cristim have exceeded the limit for the number of chat messages per hour. Please wait 37 minutes and 44 seconds before sending another message.

cristim commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai full review


Generated by Claude Code

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get PR details and files changed
gh pr view 812 --repo LeanerCloud/CUDly --json title,body,files,additions,deletions,commits

echo "=== PR DIFF ==="
gh pr diff 812 --repo LeanerCloud/CUDly

Length of output: 24586



@cristim: Thorough review below. The overall design is solid — the sync.Once/two-pager pattern mirrors the existing skuCacheOnce approach cleanly, interfaces enable proper mocking, and the 8 new tests cover the important scenarios. Two issues worth addressing:


🔴 Medium — Misleading log message when a page walk partially succeeds

File: providers/azure/services/database/client.gowalkManagedInstances

When a page fetch fails mid-walk, the warning says "signal unavailable", but the function returns the partial counts it collected up to that point. fetchServerInfo then derives azConfig from those partial counts.

Example: page 1 returns 3 zone-redundant instances, page 2 errors → azConfig = "zoneRedundant" is emitted while the log says the signal is unavailable. This is a silent correctness hazard: the operator sees "unavailable" but the recommendation is actually enriched with incomplete data.

Choose one of two consistent paths:

Option A — document partial-data use (optimistic):

logging.Warnf("azure database: managed instances page fetch failed after %d instances: %v — AZConfig/Deployment derived from partial data", total, err)
return zoneRedundant, nonZoneRedundant, total

Option B — zero out on any error (conservative, matches the "unavailable" wording):

logging.Warnf("azure database: managed instances page fetch failed: %v — AZConfig/Deployment signal unavailable", err)
return 0, 0, 0

The same choice applies to the equivalent warning in hasRegularServers (though that one is inherently early-exit so partial-vs-full isn't a semantic risk there).


🟡 Low — Missing test for the mixed-deployment branch

fetchServerInfo has a default: deployment = "" arm covering the case where both managed instances and regular servers exist simultaneously. All other arms are pinned by tests, but this one is not. A small addition would complete coverage:

func TestDatabaseClient_ConvertAzureSQLRecommendation_DeploymentAmbiguous(t *testing.T) {
    client := NewClient(nil, "test-subscription", "eastus")
    client.SetCapabilitiesClient(&MockCapabilitiesClient{})
    zr := true
    client.SetManagedInstancesPager(&MockSQLManagedInstancesPager{
        pages: []armsql.ManagedInstancesClientListResponse{
            {ManagedInstanceListResult: armsql.ManagedInstanceListResult{
                Value: []*armsql.ManagedInstance{
                    {Properties: &armsql.ManagedInstanceProperties{ZoneRedundant: &zr}},
                },
            }},
        },
    })
    // Also has regular servers → mixed → Deployment=""
    client.SetServersPager(&MockSQLServersPager{
        pages: []armsql.ServersClientListResponse{buildServerPage(1)},
    })

    rec := mocks.BuildLegacyReservationRecommendation(mocks.WithNormalizedSize("GeneralPurpose_Gen5_2"))
    out := client.convertAzureSQLRecommendation(context.Background(), rec)
    require.NotNil(t, out)
    details, ok := out.Details.(common.DatabaseDetails)
    require.True(t, ok)
    assert.Empty(t, details.Deployment, "mixed managed+regular must leave Deployment empty")
}

ℹ️ Informational — ZoneRedundant==nil instances: counted for deployment, excluded from AZConfig

walkManagedInstances increments total (returned as managedCount) for every instance, but only increments zoneRedundant/nonZoneRedundant when ZoneRedundant != nil. So instances with a nil redundancy field count toward "is this a managed deployment?" but are invisible to the AZConfig dominance logic. This is the right call, but a brief comment in fetchServerInfo would prevent future readers from wondering whether it's intentional:

// total (managedCount) includes all managed instances, even those lacking
// a ZoneRedundant signal, so Deployment reflects presence correctly while
// AZConfig only counts instances with a known redundancy setting.
zoneRedundantCount, nonZoneRedundantCount, managedCount := c.walkManagedInstances(ctx)

🧠 Learnings used
Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-05-05T15:44:54.463Z
Learning: In the LeanerCloud/CUDly repository, when a CodeRabbit review flags something as "outside-diff" that was already implemented in an earlier commit of the same PR, it is likely a stale-snapshot artifact where the review diff was generated against a pre-fix view. The inline comment convention `(CR pass on PR #<number> Finding N)` is used by the maintainer (cristim) to mark lines that were changed in direct response to a prior CodeRabbit review pass.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T19:32:26.182Z
Learning: In LeanerCloud/CUDly `providers/aws/recommendations/parser_services.go`, `parseRDSDetails` uses an exhaustive switch on `DeploymentOption`: "Multi-AZ" maps to "multi-az", "Single-AZ" maps to "single-az", nil leaves AZConfig unset, and any other non-nil value returns an error. Silent defaulting to "single-az" for unknown values was removed in PR `#1085` commit 681fd9e81. Regression test added.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T18:28:28.139Z
Learning: In LeanerCloud/CUDly `providers/gcp/recommendations.go`, `collectRegion` fans out to all four GCP service clients (computeengine, cloudsql, memorystore, cloudstorage) concurrently (fixed in PR `#1047` commit 95a390f, H-2 finding). `regionResult` gains `cache` and `storage` slices; merge is updated accordingly. The decision was to wire (not trim) memorystore and cloudstorage because both have complete `GetRecommendations` implementations and their no-op `PurchaseCommitment` paths are orthogonal to surfacing advisory recommendations. Regression tests: `TestRegionResult_HasCacheAndStorageFields` and `TestShouldIncludeService_Cache_Storage`.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-05-04T14:28:19.903Z
Learning: In providers/azure/recommendations.go (Go, Azure provider), the service constructors (compute.NewClient, cache.NewClient, etc.) build concrete Azure SDK types with no interface injection seam. Introducing a concurrency timing test that asserts elapsed ≈ max-single-service latency would require adding package-level function variables — a refactor that is out of scope for small performance PRs.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T19:32:26.182Z
Learning: In LeanerCloud/CUDly `providers/aws/services/rds/client.go`, `findOfferingID` validates AZConfig with a switch that has a `default: error` case covering non-empty invalid values (e.g. "typo-az"), not just an empty-string check. A bad non-empty AZConfig previously fell through to `multiAZ==false` in `paginateRDSOfferings` (same mis-buy as the old default). Fixed in PR `#1085` commit 681fd9e81. Regression test added.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T18:28:28.139Z
Learning: In LeanerCloud/CUDly GCP service clients (computeengine, cloudsql, memorystore, cloudstorage), `GetRecommendations` filters out non-ACTIVE recommendations using the typed `recommenderpb.RecommendationStateInfo_ACTIVE` enum (CLAIMED/SUCCEEDED/FAILED/DISMISSED are skipped). State is accessed nil-safely via the proto getter. Fixed in PR `#1047` commit 95a390f (H-1 finding). Regression tests: `TestGetRecommendations_FiltersNonActiveStates` (5-rec mock, only ACTIVE survives) and `TestGetRecommendations_ActiveRecIncluded`. Existing mock tests in all four services had StateInfo added to satisfy the new filter.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T17:45:38.069Z
Learning: In LeanerCloud/CUDly `providers/gcp/services/cloudsql/client.go`, `getSQLPricing` had the same per-hour/term-total unit mismatch as memorystore and cloudstorage: `commitmentPrice` (per-hour SKU rate) was passed directly to savings math that expected a term total, producing ~99.99% savings, and `HourlyRate` was set to `per-hour / hoursInTerm` (near-zero). Fixed in PR `#1047` commit 20590c6b4 (issue `#1078` folded in): `commitmentPriceTerm := commitmentPrice * hoursInTerm`; `HourlyRate = commitmentPrice` (raw per-hour). `convertGCPRecommendation` also hardcoded `rec.Term = "1yr"`, ignoring `params.Term`; fixed with same defaulting pattern as memorystore/cloudstorage. Regression tests: `TestGetSQLPricing_CommitmentPriceIsTermTotal` and `TestCloudSQLConvertGCPRecommendation_PropagatesTermFromParams`. PaymentOption was already correct (`if paymentOption == "" { paymentOption = "monthly" }`).

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T17:38:19.433Z
Learning: In LeanerCloud/CUDly GCP service converters (`convertGCPRecommendation` in memorystore/client.go and cloudstorage/client.go), `rec.Term` must be derived from `params.Term` with a `"1yr"` default — not hardcoded to `"1yr"`. Without this, 3-year callers always emit 1-year commitments even though the downstream `termYears` derivation from `rec.Term` is correct. Fixed in PR `#1047` commit c6280c390 (F2 for memorystore, F4 for cloudstorage). Regression tests: `TestConvertGCPRecommendation_PropagatesTermFromParams` and `TestCloudStorageConvertGCPRecommendation_PropagatesTermFromParams`.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T17:38:19.433Z
Learning: In LeanerCloud/CUDly `providers/gcp/services/computeengine/client.go`, `convertGCPRecommendation` must force `paymentOption = "monthly"` unconditionally (logging a WARN when a non-monthly value such as "upfront", "all-upfront", or "partial-upfront" is supplied) because GCP CUDs are monthly-only and any non-monthly value passed through would be a silent misconfiguration. Fixed in PR `#1047` commit c6280c390 (F1). Regression test: `TestConvertGCPRecommendation_NonMonthlyPaymentOptionForcedToMonthly`.

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 0
File: :0-0
Timestamp: 2026-06-07T18:13:37.445Z
Learning: In LeanerCloud/CUDly `providers/gcp/services/computeengine/client.go`, `convertGCPRecommendation` no longer hardcodes `Term = "1yr"` (fixed in PR `#1047` commit 8f9a787, H-3 audit finding). It propagates `params.Term` with a `"1yr"` default, validates via `termPlan`, and returns nil on unknown term. This is stricter than memorystore/cloudstorage/cloudsql (which default and continue): computeengine returns nil on unknown term because it is on the purchase path. Regression tests: `TestConvertGCPRecommendation_PropagatesParamsTerm`, `TestConvertGCPRecommendation_RejectsUnknownTerm`.

@cristim

cristim commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@cristim cristim changed the base branch from feat/multicloud-web-frontend to main June 9, 2026 15:45
cristim added 2 commits June 19, 2026 17:38
…loses #149)

Replace deferred per-server AZConfig and Deployment lookups with two
subscription-wide batched pager walks (managed instances + servers),
gated by a single sync.Once so the pagers are each walked at most once
per GetRecommendations call regardless of recommendation count.

Before: 0 SDK calls (fields left empty, deferred).
After: 2 SDK calls total for any N recommendations -- one
ManagedInstancesClient.NewListPager walk and one ServersClient.NewListPager
walk, both cached for the client lifetime.

Adds 8 unit tests covering: zoneRedundant/none/ambiguous AZConfig,
managed/single/error Deployment, pager-error fallback, and the
_FetchedOnce invariant pinning the 2-call guarantee.
@cristim

cristim commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

effort/s Hours impact/few Limited audience priority/p3 Polish / idea / may never ship severity/low Minor harm triaged Item has been triaged type/feat New capability urgency/this-quarter Within the quarter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant