feat(blueprints): consolidate single-node into full-multi-node-cluster with Arc machine support #581

Open
nguyena2 wants to merge 13 commits into main from feat/aio-on-arc-machine-blueprint

nguyena2 (Contributor) commented Jun 3, 2026

This PR retires the standalone full-single-node-cluster blueprint and folds its capabilities into full-multi-node-cluster, which now serves as the single canonical cluster blueprint for both single- and multi-node deployments. The unified blueprint can target either Azure-provisioned VMs or pre-existing Arc-enabled machines, selected through a should_use_arc_machines toggle (Terraform) / shouldUseArcMachines parameter (Bicep). Supporting changes reach into the 100-cncf-cluster component, the messaging Azure Functions module, CI templates, developer tooling, and documentation across the repository.

This is a breaking change: the blueprints/full-single-node-cluster/ directory is removed with no compatibility shim. Consumers must move to blueprints/full-multi-node-cluster/.

Description

Blueprint consolidation

The entire blueprints/full-single-node-cluster/ directory was deleted, including both bicep and terraform implementations, the generated READMEs, and the top-level README. Its .tfvars.example files (dataflow, dataflow-graphs variants, dataflow-endpoint, foundry-project, leak-detection, sse-connector-assets) and its Terratest suite were moved into blueprints/full-multi-node-cluster/, with test names updated from the single-node naming (for example, TestTerraformFullSingleNodeClusterDeploy became TestTerraformFullMultiNodeClusterDeploy).

Arc machine targeting

The full-multi-node-cluster blueprint gained a deployment-mode switch so it can target Arc-enabled machines instead of provisioning VMs:

  • Terraform (main.tf, variables.tf): added should_use_arc_machines, arc_machine_count, arc_machine_name, arc_machine_name_prefix, and arc_machine_resource_group_name. When the toggle is on, a data "azurerm_arc_machine" "arc_machines" block resolves machines by exact name (when arc_machine_count == 1) or a {prefix}{N} pattern, the cloud_vm_host module is gated off, and arc_onboarding_principal_ids is derived from each machine's system-assigned identity. A new cluster_server_ip input is validated as required in Arc mode.
  • Bicep (main.bicep): added shouldUseArcMachines, arcMachineCount, arcMachineName, arcMachineNamePrefix, and arcMachineResourceGroupName. Existing Microsoft.HybridCompute/machines resources are referenced with existing and gated by shouldUseArcMachines. The adminPassword parameter became optional (required only when targeting VMs), and outputs were regrouped into semantic sections.
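
The Terraform side of this switch could look roughly like the sketch below. It is illustrative rather than the blueprint's actual code: the variable names come from the description above, but the defaults, the 1-based numbering of the {prefix}{N} pattern, and the `identity[0].principal_id` attribute path are assumptions.

```hcl
# Sketch only. Names follow the PR description; defaults and numbering are assumed.
variable "should_use_arc_machines" {
  type    = bool
  default = false
}

variable "arc_machine_count" {
  type    = number
  default = 1
}

variable "arc_machine_name" {
  type    = string
  default = null
}

variable "arc_machine_name_prefix" {
  type    = string
  default = ""
}

variable "arc_machine_resource_group_name" {
  type = string
}

locals {
  # Exact name for a single machine; otherwise {prefix}{N}, assumed to start at 1.
  arc_machine_names = var.arc_machine_count == 1 ? [var.arc_machine_name] : [
    for i in range(var.arc_machine_count) : "${var.arc_machine_name_prefix}${i + 1}"
  ]
}

# Resolved only when the toggle is on; with count = 0 nothing is looked up.
data "azurerm_arc_machine" "arc_machines" {
  count               = var.should_use_arc_machines ? length(local.arc_machine_names) : 0
  name                = local.arc_machine_names[count.index]
  resource_group_name = var.arc_machine_resource_group_name
}

locals {
  # System-assigned identity of each resolved machine, used for Arc onboarding.
  arc_onboarding_principal_ids = [
    for m in data.azurerm_arc_machine.arc_machines : m.identity[0].principal_id
  ]
}
```

With the toggle off, the data source has zero instances and the principal-ID list is empty, which would leave the VM path unaffected.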

The multi-node blueprint defaults also shifted toward a production-ready single-node baseline: host_machine_count now defaults to 1, should_deploy_resource_sync_rules defaults to true, and aks_should_enable_private_cluster defaults to true.

CNCF cluster component

The 100-cncf-cluster Bicep component was extended to onboard Arc machines: arcOnboardingPrincipalIds is now an array, clusterServerArcMachineName and clusterNodeArcMachineNames were added, and a shouldDeployArcMachines toggle routes script deployment to a new deploy-scripts-to-arc.bicep module (the Arc counterpart of deploy-scripts-to-vm.bicep). The key-vault-role-assignment module was updated to consume the principal-ID array.

Networking, messaging, and other components

  • The NAT toggles in the Bicep blueprint were collapsed into a single shouldEnableManagedOutboundAccess parameter (default true).
  • The messaging azure-functions Terraform module moved Event Hub client-ID computation into a locals block and conditionally sets EventHubConnection__clientId, avoiding a circular dependency.
  • The ONVIF connector docs moved from username_password_ref to usernamePasswordCredentials with ConfigMap-based trust lists, and the MQTT tools manifest clarified container selection (-c mqtt-tools) and listener ports (18883 authenticated, 18884 anonymous).
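
One way the conditional EventHubConnection__clientId setting in the azure-functions module might be expressed is sketched below. The variable names `event_hub_identity_client_id` and `base_app_settings` are hypothetical and do not come from this PR; only the setting name is taken from the description.

```hcl
# Hypothetical inputs; the real module's variable names are not shown in this PR.
variable "event_hub_identity_client_id" {
  type    = string
  default = null
}

variable "base_app_settings" {
  type    = map(string)
  default = {}
}

locals {
  # Computed inside the module, so the blueprint does not have to pass a
  # value derived from this module's own outputs back into it.
  event_hub_client_id = var.event_hub_identity_client_id

  # The setting is added only when a client ID is available.
  app_settings = merge(
    var.base_app_settings,
    local.event_hub_client_id == null ? {} : {
      EventHubConnection__clientId = local.event_hub_client_id
    }
  )
}
```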

Documentation and tooling

Repository-wide references to full-single-node-cluster were repointed to full-multi-node-cluster across blueprints/README.md, docs/, ADR libraries, scenario prerequisites, copilot/ guidance, .github/ agents/instructions/prompts, the .azdo cluster-test template, and scripts/build/Detect-Folder-Changes.ps1. The VS Code workspace added a launch.json and updated settings.json to wire the Azure Functions project at src/500-application/513-tiered-notification-service, and .gitignore now ignores compiled Bicep output (**/bicep/main.json).

Related Issue

None

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Blueprint modification or addition
  • Component modification or addition
  • Documentation update
  • CI/CD pipeline change
  • Other (please describe):

Implementation Details

Rather than maintaining parallel single-node and multi-node blueprints, the single-node blueprint was removed and its examples and tests were migrated so that full-multi-node-cluster covers the full range from a one-machine cluster to a multi-node cluster. Deployment mode is driven by a single toggle (should_use_arc_machines / shouldUseArcMachines): when off, the blueprint provisions VMs via the existing cloud_vm_host path; when on, it resolves pre-existing Arc machines through data.azurerm_arc_machine / existing resources and feeds their system-assigned identities into Arc onboarding. The CNCF cluster component carries the matching shouldDeployArcMachines logic and the new deploy-scripts-to-arc.bicep module so the in-cluster setup scripts run on Arc machines. Optional capabilities remain behind their existing should_* toggles.

Testing Performed

  • Terraform plan/apply
  • Blueprint deployment test
  • Unit tests
  • Integration tests
  • Bug fix includes regression test (see Test Policy)
  • Manual validation
  • Other:

Validation Steps

  1. From blueprints/full-multi-node-cluster/terraform, run terraform init and terraform validate for both default (VM) and Arc modes.
  2. For Arc mode, set should_use_arc_machines = true, provide arc_machine_* and cluster_server_ip values, and run terraform plan to confirm the Arc machine data sources resolve and the VM host module is skipped.
  3. Build the Bicep blueprint (az bicep build) and confirm shouldUseArcMachines gates the Microsoft.HybridCompute/machines references.
  4. Confirm there are no remaining references to full-single-node-cluster in docs, CI templates, or component READMEs.
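
For step 2, an Arc-mode variables file might look like the following. The variable names are the ones introduced by this PR; every value is a placeholder, and the file name is an assumption.

```hcl
# Hypothetical arc.tfvars for a single Arc machine; all values are placeholders.
should_use_arc_machines         = true
arc_machine_count               = 1
arc_machine_name                = "my-arc-machine"
arc_machine_resource_group_name = "rg-my-arc-machines"
cluster_server_ip               = "10.0.0.4"
```

It could then be passed to the plan with `terraform plan -var-file=arc.tfvars`.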

Checklist

  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have run terraform fmt on all Terraform code
  • I have run terraform validate on all Terraform code
  • I have run az bicep format on all Bicep code
  • I have run az bicep build to validate all Bicep code
  • I have checked for any sensitive data/tokens that should not be committed
  • Lint checks pass (run applicable linters for changed file types)

Security Review

  • No credentials, secrets, or tokens are hardcoded or logged
  • RBAC and identity changes follow least-privilege principles
  • No new network exposure or public endpoints introduced without justification
  • Dependency additions or updates have been reviewed for known vulnerabilities
  • Container image changes use pinned digests or SHA references

Additional Notes

This PR does not modify the security-sensitive paths flagged by the template (SECURITY.md, src/000-cloud/010-security-identity/, deploy/). Secret-bearing inputs and outputs remain marked sensitive. The notification outputs in the Bicep blueprint are intentionally stubbed ("Not deployed") and kept for parity with Terraform; a note explains that Bicep does not yet wire the 045-notification component.

Screenshots (if applicable)

nguyena2 added 3 commits June 3, 2026 17:49
…urations

- specify required providers: azurerm, azuread, azapi
- set required Terraform version constraints
- configure azurerm provider with storage use and partner ID settings

🔧 - Generated by Copilot
…Arc cluster blueprint

- separate long description into multiple lines for clarity
- adjust table formatting for better alignment

🔧 - Generated by Copilot
nguyena2 requested a review from a team as a code owner June 3, 2026 19:18

github-actions (Bot) commented Jun 3, 2026

📚 Documentation Health Report

Generated on: 2026-06-03 19:23:38 UTC

📈 Documentation Statistics

Category                         File Count
Main Documentation               222
Infrastructure Components        197
Blueprints                       41
GitHub Resources                 26
AI Assistant Guides (Copilot)    17
Total                            503

🏗️ Three-Tree Architecture Status

  • ✅ Bicep Documentation Tree: Auto-generated navigation
  • ✅ Terraform Documentation Tree: Auto-generated navigation
  • ✅ README Documentation Tree: Manual README organization

🔍 Quality Metrics

  • Frontmatter Validation: success
  • Link Validation: success

This report is automatically generated by the Documentation Automation workflow.

bindsi (Member) left a comment

🤖 Automated review: Large blueprint addition (full single-node Arc machine cluster). Adds devcontainer lock, Terraform/Bicep configurations, and supporting infrastructure. Structure follows established blueprint patterns. No obvious security or functional issues in the scaffolding visible from the diff.

- downgrade golang.org/x/crypto to v0.52.0
- upgrade golang.org/x/mod to v0.35.0
- upgrade golang.org/x/net to v0.55.0
- upgrade golang.org/x/sys to v0.45.0
- upgrade golang.org/x/text to v0.37.0
- upgrade golang.org/x/tools to v0.44.0

🔧 - Generated by Copilot

bindsi (Member) left a comment

Automated batch review: requesting changes for one blocking Terraform dependency-cycle issue.

Comment thread blueprints/full-single-arc-machine-cluster/terraform/main.tf Outdated
…local variable for client ID

- remove direct reference to client ID in locals
- compute client ID in a local variable to avoid module self-dependency

🔧 - Generated by Copilot

bindsi (Member) left a comment

Automated batch re-review: no actionable findings. The prior Terraform self-dependency concern appears resolved: EventHubConnection__clientId is now computed inside the Azure Functions module rather than fed back through blueprint locals.

katriendg (Collaborator) commented

@nguyena2 thanks for the work here, though I fear we already have this functionality in the full-multi-node-cluster blueprint, since you can already supply the Arc machines. It's potentially just missing the option to supply a single one (input variable validation always requires more than one).
One idea we had was to simplify by removing the full-single-node-cluster blueprint altogether and making the multi-node one the default, with the option of a single VM/Arc machine or multiple. That would also simplify future updates.
Could you review whether this new blueprint still brings enough value, or would benefit from consolidating into the existing one? WDYT?

nguyena2 (Contributor, Author) commented Jun 9, 2026

> @nguyena2 thanks for the work here, though I fear we already have this functionality in the full-multi-node-cluster blueprint, since you can already supply the Arc machines. It's potentially just missing the option to supply a single one (input variable validation always requires more than one). One idea we had was to simplify by removing the full-single-node-cluster blueprint altogether and making the multi-node one the default, with the option of a single VM/Arc machine or multiple. That would also simplify future updates. Could you review whether this new blueprint still brings enough value, or would benefit from consolidating into the existing one? WDYT?

I actually like this idea. I'll move to removing the two single cluster blueprints and modify the multi-node one to be more streamlined.

nguyena2 marked this pull request as draft June 9, 2026 16:05
nguyena2 added 4 commits June 10, 2026 21:08
…-node to multi-node cluster blueprints

- replace references to full-single-node-cluster with full-multi-node-cluster
- ensure consistency across multiple application components and README files
- clarify deployment instructions for production environments

📚 - Generated by Copilot
…settings

- rename username_password_ref to usernamePasswordCredentials
- update secret names for camera credentials
- change trustSettings to use a config map for trusted CA certificates

🔒 - Generated by Copilot
@github-actions

📚 Documentation Health Report

Generated on: 2026-06-10 21:51:07 UTC

📈 Documentation Statistics

Category                         File Count
Main Documentation               222
Infrastructure Components        197
Blueprints                       36
GitHub Resources                 26
AI Assistant Guides (Copilot)    17
Total                            498

🏗️ Three-Tree Architecture Status

  • ✅ Bicep Documentation Tree: Auto-generated navigation
  • ✅ Terraform Documentation Tree: Auto-generated navigation
  • ✅ README Documentation Tree: Manual README organization

🔍 Quality Metrics

  • Frontmatter Validation: success
  • Link Validation: success

This report is automatically generated by the Documentation Automation workflow.

katriendg (Collaborator) commented Jun 11, 2026

> > @nguyena2 thanks for the work here, though I fear we already have this functionality in the full-multi-node-cluster blueprint, since you can already supply the Arc machines. It's potentially just missing the option to supply a single one (input variable validation always requires more than one). One idea we had was to simplify by removing the full-single-node-cluster blueprint altogether and making the multi-node one the default, with the option of a single VM/Arc machine or multiple. That would also simplify future updates. Could you review whether this new blueprint still brings enough value, or would benefit from consolidating into the existing one? WDYT?
>
> I actually like this idea. I'll move to removing the two single cluster blueprints and modify the multi-node one to be more streamlined.

Would you feel like picking that one up? It's a major overhaul in the sense that the blueprint itself (both in TF and Bicep) is not major work, but the documentation and references to the single one are spread out so you probably need some dedicated work on this. Docs, copilot instructions, build system paths, etc. Would be lovely if you want to take it in!

nguyena2 added 2 commits June 11, 2026 18:33
…tructions

- clarify connection command for MQTT tools container
- specify usage of anonymous listener for mqttui

🔧 - Generated by Copilot
nguyena2 changed the title from "feat(blueprints): add full single-node Arc machine cluster blueprint" to "feat(blueprints): consolidate single-node into full-multi-node-cluster with Arc machine support" on Jun 11, 2026
nguyena2 marked this pull request as ready for review June 11, 2026 18:54
