This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Add DM intents merkle root check via k8s adapter#12

Merged
ianchen0119 merged 11 commits into
Gthulhu:mainfrom
vx416:feature/intents_sync
Feb 17, 2026

Conversation

@yanun0323
Contributor

Summary

  • Implement Merkle tree utilities and minimal tests
  • Build and cache intents Merkle root in DecisionMaker
  • Add DM endpoint to fetch Merkle root
  • Extend Manager adapter/client to compare DB intents root vs DM nodes
  • Add regression tests for newly introduced intents-sync flows across DecisionMaker and Manager

Changes

  • CheckDMIntents now queries DM pods via k8s_adapter and logs mismatches
  • New GET /api/v1/intents/merkle endpoint returns root hash
  • New DecisionMakerAdapter.GetIntentMerkleRoot method
  • Add decisionmaker/service/intents_svc_test.go to cover:
    • nil request handling
    • depth truncation behavior
    • subtree lookup by RootHash
    • not-found hash behavior
    • merkle root refresh from cached intents
    • deterministic hashing regardless of label map order
  • Add manager/service/cron_svc_test.go to cover:
    • missing K8SAdapter
    • DM pod query error path
    • no DM pods path
    • missing DMAdapter on online nodes
    • happy path with online/offline node filtering
    • deterministic intent sorting and hashing
  • Add manager/client/deicison_maker_test.go to cover:
    • successful merkle root fetch
    • non-200 response handling
    • empty response data handling

Testing

  • go test ./decisionmaker/service
  • go test ./manager/service
  • go test ./manager/client
  • go test ./...
  • go test -race ./decisionmaker/service ./manager/service ./manager/client

Notes

  • Merkle hash rule is deterministic based on intent fields
  • Offline DM pods are skipped in the check
  • New tests focus on regression coverage for the intent merkle sync path without changing public APIs
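To illustrate what "deterministic based on intent fields" can mean in practice, here is a minimal sketch of a field-ordered hash that sorts labels before serializing. The `Intent` type below is a trimmed, hypothetical stand-in, and the exact field set and separator are assumptions; the project's own `hashIntent` may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Intent is a hypothetical, reduced stand-in for the PR's intent type.
type Intent struct {
	PodName  string
	NodeID   string
	Priority int
	Labels   map[string]string
}

// hashIntent serializes fields in a fixed order with name prefixes and
// sorts the labels, so Go's random map iteration order cannot change
// the resulting digest.
func hashIntent(in Intent) string {
	labels := make([]string, 0, len(in.Labels))
	for k, v := range in.Labels {
		labels = append(labels, k+"="+v)
	}
	sort.Strings(labels)

	payload := strings.Join([]string{
		"podName=" + in.PodName,
		"nodeID=" + in.NodeID,
		"priority=" + strconv.Itoa(in.Priority),
		"labels=" + strings.Join(labels, ","),
	}, "|")

	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := Intent{PodName: "web", NodeID: "n1", Priority: 5,
		Labels: map[string]string{"app": "web", "tier": "fe"}}
	b := Intent{PodName: "web", NodeID: "n1", Priority: 5,
		Labels: map[string]string{"tier": "fe", "app": "web"}}

	// Same fields, different map construction order: digests match.
	fmt.Println(hashIntent(a) == hashIntent(b))
}
```

Both sides of a comparison have to use the same serialization for the digests to agree, which is why the format is worth pinning in tests.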

vx416 and others added 4 commits January 4, 2026 12:04
- cover TraverseIntentMerkleTree branches and deterministic hashing
- cover CheckDMIntents paths including online-only DM comparison
- cover GetIntentMerkleRoot success, non-OK, and empty-data cases
Contributor

Copilot AI left a comment

Pull request overview

This PR implements Merkle tree-based integrity checking for scheduling intents between the Manager and DecisionMaker nodes. The Manager periodically queries DM pods for their intent Merkle root hashes and compares them against the expected hash computed from the database to detect synchronization issues.

Changes:

  • Added Merkle tree utilities with SHA-256 hashing for building and traversing trees
  • Extended DecisionMaker service to cache intents and maintain a Merkle root hash
  • Added new REST endpoint GET /api/v1/intents/merkle in DecisionMaker for retrieving the Merkle root
  • Implemented Manager's CheckDMIntents cron function to compare Merkle roots across online DM nodes
  • Added comprehensive regression tests covering intent synchronization flows
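For readers unfamiliar with the technique, a Merkle root over SHA-256 leaf hashes can be computed along these lines. This is a sketch under stated assumptions, not the contents of pkg/util/merkle.go: the empty-tree convention (hash of the empty string) and the handling of an odd node (paired with itself) are choices made here for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashHex returns the hex-encoded SHA-256 digest of data.
func hashHex(data string) string {
	sum := sha256.Sum256([]byte(data))
	return hex.EncodeToString(sum[:])
}

// merkleRoot folds leaf hashes pairwise until a single hash remains.
// Assumptions of this sketch: an empty input yields the hash of the
// empty string, and an unpaired node is combined with itself.
func merkleRoot(leaves []string) string {
	if len(leaves) == 0 {
		return hashHex("")
	}
	// Copy so the caller's slice is never modified.
	level := append([]string(nil), leaves...)
	for len(level) > 1 {
		next := make([]string, 0, (len(level)+1)/2)
		for i := 0; i < len(level); i += 2 {
			left := level[i]
			right := left
			if i+1 < len(level) {
				right = level[i+1]
			}
			next = append(next, hashHex(left+right))
		}
		level = next
	}
	return level[0]
}

func main() {
	leaves := []string{hashHex("intent-a"), hashHex("intent-b"), hashHex("intent-c")}
	fmt.Println(merkleRoot(leaves))
}
```

Because any change to a leaf changes the root, comparing a single root hash is enough to detect that two intent sets differ, though not where they differ.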

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

File | Description
pkg/util/merkle.go | New Merkle tree utilities with hash computation, tree building, node finding, and truncation functions
pkg/util/merkle_test.go | Basic tests for empty tree, traversal, and truncation operations
decisionmaker/domain/pod.go | Added IntentID field to Intent struct (unused)
decisionmaker/service/service.go | Added intent caching with Merkle root computation in ProcessIntents and refresh logic
decisionmaker/service/intents_svc.go | New TraverseIntentMerkleTree service method for querying Merkle tree structure
decisionmaker/service/intents_svc_test.go | Tests covering nil requests, depth truncation, subtree lookup, and deterministic hashing
decisionmaker/rest/intent_handler.go | New GetIntentMerkleRoot handler returning root hash
decisionmaker/rest/handler.go | Registered new Merkle endpoint in routes
manager/domain/interface.go | Added GetIntentMerkleRoot method to DecisionMakerAdapter interface
manager/domain/mock_domain.go | Generated mock for new adapter method
manager/client/deicison_maker.go | Implemented GetIntentMerkleRoot client method
manager/client/deicison_maker_test.go | Tests for successful fetch, non-200 status, and empty data scenarios
manager/service/cron_svc.go | New CheckDMIntents function with deterministic intent sorting and hashing
manager/service/cron_svc_test.go | Tests covering error paths, online/offline filtering, and hash determinism

Comment on lines +87 to +94
intent.PodName,
intent.PodID,
intent.NodeID,
intent.K8sNamespace,
intent.CommandRegex,
strconv.Itoa(intent.Priority),
strconv.FormatInt(intent.ExecutionTime, 10),
strings.Join(labels, ","),

Copilot AI Feb 13, 2026

Hash serialization format inconsistency between Manager and DecisionMaker. The Manager's scheduleIntentSortKey function (lines 86-95) creates a pipe-delimited string without field name prefixes (e.g., "pod-name|pod-id|..."), while the DecisionMaker's hashIntent function uses field name prefixes (e.g., "podName=pod-name|podID=pod-id|..."). This will cause Merkle root mismatches even when intents are identical. Either both should use prefixes or both should omit them to ensure deterministic hash comparison.

Suggested change
- intent.PodName,
- intent.PodID,
- intent.NodeID,
- intent.K8sNamespace,
- intent.CommandRegex,
- strconv.Itoa(intent.Priority),
- strconv.FormatInt(intent.ExecutionTime, 10),
- strings.Join(labels, ","),
+ "podName=" + intent.PodName,
+ "podID=" + intent.PodID,
+ "nodeID=" + intent.NodeID,
+ "k8sNamespace=" + intent.K8sNamespace,
+ "commandRegex=" + intent.CommandRegex,
+ "priority=" + strconv.Itoa(intent.Priority),
+ "executionTime=" + strconv.FormatInt(intent.ExecutionTime, 10),
+ "labels=" + strings.Join(labels, ","),

Copilot uses AI. Check for mistakes.
Comment thread decisionmaker/service/intents_svc.go Outdated
Comment on lines +31 to +45
if svc.intentMerkleRoot == nil {
svc.refreshIntentMerkleTreeIfNeeded()
}

root := svc.intentMerkleRoot
if req.RootHash != "" && root != nil {
found := util.FindMerkleNode(root, req.RootHash)
if found == nil {
return &TraverseIntentMerkleTreeResp{RootNode: nil}, nil
}
root = found
}

truncated := util.TruncateMerkleTree(root, req.Depth)
return &TraverseIntentMerkleTreeResp{RootNode: convertMerkleNode(truncated)}, nil

Copilot AI Feb 13, 2026

Race condition: svc.intentMerkleRoot is accessed without holding the intentCacheMu lock. Line 31 checks if the root is nil, line 32 calls refreshIntentMerkleTreeIfNeeded(), and line 35 reads the root again, all without proper synchronization. This can cause data races when ProcessIntents is updating the Merkle root concurrently. Wrap the access to svc.intentMerkleRoot with appropriate read locks.
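One common way to address a race like the one described above is to read the pointer once under a read lock and work only with that snapshot. The sketch below assumes a `sync.RWMutex` guarding the root; the `service` and `node` types are hypothetical stand-ins that borrow field names from the review comment, not the project's actual structs.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for the Merkle node type.
type node struct{ Hash string }

// service borrows field names from the review comment; it is not the
// project's actual struct.
type service struct {
	intentCacheMu    sync.RWMutex
	intentMerkleRoot *node
}

// setRoot replaces the root under the write lock.
func (s *service) setRoot(n *node) {
	s.intentCacheMu.Lock()
	defer s.intentCacheMu.Unlock()
	s.intentMerkleRoot = n
}

// rootSnapshot reads the pointer once under the read lock. Callers
// should use the returned snapshot instead of re-reading the field.
func (s *service) rootSnapshot() *node {
	s.intentCacheMu.RLock()
	defer s.intentCacheMu.RUnlock()
	return s.intentMerkleRoot
}

func main() {
	svc := &service{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(2)
		go func(i int) { // writer
			defer wg.Done()
			svc.setRoot(&node{Hash: fmt.Sprintf("h%d", i)})
		}(i)
		go func() { // reader
			defer wg.Done()
			_ = svc.rootSnapshot()
		}()
	}
	wg.Wait()
	fmt.Println(svc.rootSnapshot() != nil)
}
```

The snapshot approach is only safe if writers replace the tree rather than mutate nodes in place; running with `go test -race` or `go run -race` is the usual way to confirm.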

svc.intentMerkleRootHash = root.Hash
} else {
svc.intentMerkleRootHash = ""
}

Copilot AI Feb 13, 2026

The function refreshIntentMerkleTreeIfNeeded is missing a closing brace. This will cause a compilation error. Add a closing brace after line 360 before the comment on line 362.

Suggested change
}
}
}

Comment on lines +52 to +58
for _, dm := range dms {
if dm.State != domain.NodeStateOnline {
continue
}
if svc.DMAdapter == nil {
return fmt.Errorf("decision maker adapter is nil")
}

Copilot AI Feb 13, 2026

The DMAdapter nil check should occur before the loop over DM pods. Currently, the code checks if svc.DMAdapter == nil inside the loop for each online node, which means it will only error on the first online node. This check should be moved before line 52 to fail fast if the adapter is missing, rather than iterating through pods unnecessarily.

Suggested change
- for _, dm := range dms {
-     if dm.State != domain.NodeStateOnline {
-         continue
-     }
-     if svc.DMAdapter == nil {
-         return fmt.Errorf("decision maker adapter is nil")
-     }
+ if svc.DMAdapter == nil {
+     return fmt.Errorf("decision maker adapter is nil")
+ }
+ for _, dm := range dms {
+     if dm.State != domain.NodeStateOnline {
+         continue
+     }

Comment thread decisionmaker/service/intents_svc.go Outdated
RootNode *Node
}

// TODO: TraverseIntentMerkleTree

Copilot AI Feb 13, 2026

TODO comment should be removed or clarified. This function appears to be fully implemented, so the TODO comment on line 25 is misleading. Either remove the comment if the implementation is complete, or clarify what remains to be done.

Suggested change
- // TODO: TraverseIntentMerkleTree
+ // TraverseIntentMerkleTree returns a truncated view of the in-memory intent Merkle tree.
+ // If RootHash is provided and found, traversal starts from that node; otherwise it starts
+ // from the current Merkle root. The tree is truncated to the specified Depth.

Comment on lines +61 to +65
logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm)
continue
}
if rootHash != expectedRoot {
logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash)

Copilot AI Feb 13, 2026

Incorrect format verb in log statement. The dm variable (of type *domain.DecisionMakerPod) is used with %s format verb, but likely needs a proper string representation. While this will compile, it will print the memory address instead of meaningful information. Consider either implementing a String method for DecisionMakerPod or using structured logging fields (e.g., Str("nodeID", dm.NodeID)) for better log readability.

Suggested change
- logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm)
- continue
- }
- if rootHash != expectedRoot {
- logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash)
+ logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %v", dm)
+ continue
+ }
+ if rootHash != expectedRoot {
+ logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %v: expected=%s actual=%s", dm, expectedRoot, rootHash)
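The other option the comment mentions, implementing a String method, could look roughly like the sketch below. The `DecisionMakerPod` fields here are assumptions for illustration (only the type name and `NodeID` appear in the review), and `describe` is a hypothetical helper that stands in for the logger call so the formatting can be checked on its own.

```go
package main

import "fmt"

// DecisionMakerPod is a hypothetical, reduced version of the type named
// in the review; Host and Port are assumed fields.
type DecisionMakerPod struct {
	NodeID string
	Host   string
	Port   int
}

// String makes *DecisionMakerPod satisfy fmt.Stringer, so both %s and
// %v print a compact, readable label.
func (p *DecisionMakerPod) String() string {
	if p == nil {
		return "<nil dm>"
	}
	return fmt.Sprintf("%s(%s:%d)", p.NodeID, p.Host, p.Port)
}

// describe formats the same message as the log call, without a logger.
func describe(p *DecisionMakerPod) string {
	return fmt.Sprintf("failed to get merkle root from dm %s", p)
}

func main() {
	dm := &DecisionMakerPod{NodeID: "node-1", Host: "10.0.0.5", Port: 8080}
	fmt.Println(describe(dm))
	// prints: failed to get merkle root from dm node-1(10.0.0.5:8080)
}
```

With a Stringer in place the original `%s` verb becomes valid as written, so the log format strings would not need to change.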

Comment on lines +61 to +65
logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm)
continue
}
if rootHash != expectedRoot {
logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash)

Copilot AI Feb 13, 2026

Incorrect format verb in log statement. Same issue as line 61 - the dm variable is used with %s format verb which will print a memory address instead of meaningful information. Consider implementing a String method for DecisionMakerPod or using structured logging fields.

Suggested change
- logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm)
- continue
- }
- if rootHash != expectedRoot {
- logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash)
+ logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %+v", dm)
+ continue
+ }
+ if rootHash != expectedRoot {
+ logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %+v: expected=%s actual=%s", dm, expectedRoot, rootHash)

@@ -16,6 +16,7 @@ type PodInfo struct {
}

type Intent struct {

Copilot AI Feb 13, 2026

New field IntentID added but never used. This field is added to the Intent struct but is not referenced anywhere in the codebase (neither in hashing logic, serialization, nor any other operations). If this field is intended for future use, consider adding a comment to clarify its purpose. If it should be included in the hash computation for intent comparison, it must be added to the hashIntent function.

Suggested change
- type Intent struct {
+ type Intent struct {
+ // IntentID is an optional opaque identifier for this intent.
+ // It is used by higher-level components (e.g. the manager service or external clients)
+ // and is intentionally excluded from hashing/comparison logic such as hashIntent,
+ // which only considers the scheduling properties of an intent.

* compare manager checks against node-specific intent roots
* normalize intent hashing and nil filtering in merkle helpers
* lock merkle root reads in TraverseIntentMerkleTree
* add concurrency and node-scoped regression tests
@yanun0323
Contributor Author

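Commit b77ccf4 mainly fixes the comparison logic and concurrency safety of intents sync.

Changes in this commit

  • manager/service/cron_svc.go: CheckDMIntents now computes a node-scoped expected Merkle root keyed by NodeID, then compares it one by one against the value reported by the corresponding DM node.
  • manager/service/cron_svc.go: when a node has no intents, the empty-tree root is used as a fallback so the comparison baseline stays consistent.
  • manager/service/cron_svc.go: added normalizeScheduleIntents, which filters out nil intents first.
  • manager/service/cron_svc.go: hashScheduleIntent now uses explicit field serialization plus label sorting to keep the hash deterministic.
  • decisionmaker/service/intents_svc.go: TraverseIntentMerkleTree now takes RLock/RUnlock when reading intentMerkleRoot, to avoid concurrent read/write races.

Test additions

  • decisionmaker/service/intents_svc_test.go: added TestTraverseIntentMerkleTreeConcurrentReadWrite, covering concurrent reads and writes of the Merkle root.
  • manager/service/cron_svc_test.go: added TestCheckDMIntentsComparesNodeScopedMerkleRoots, verifying that each node's root is compared separately across multiple nodes.
  • manager/service/cron_svc_test.go: adjusted the expected root of the existing happy-path test to match the node-scoped behavior.
  • manager/service/cron_svc_test.go: strengthened the deterministic-hash assertions by pinning the serialization format.

Impact

  • No public API changes.
  • Behavior is corrected so that each DM node is compared against its own intents root, avoiding the false mismatches caused by the previous global root.
  • Size of this commit: 4 files changed, 208 insertions(+), 34 deletions(-)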
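The node-scoped comparison described in this comment can be sketched as follows. This is an illustration under assumptions, not the code in manager/service/cron_svc.go: the `intent` type and all function names are hypothetical, and `flatRoot` is a simplified stand-in for a real Merkle root (a single hash over sorted leaf digests) that keeps the example short.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// intent is a hypothetical reduced type: Key stands in for the
// serialized scheduling fields of one intent.
type intent struct {
	NodeID string
	Key    string
}

func digest(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// flatRoot is a simplified stand-in for a Merkle root: it hashes the
// sorted leaf digests in sequence. With no keys it returns the hash of
// the empty input, which serves as the empty-tree root here.
func flatRoot(keys []string) string {
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, k := range sorted {
		h.Write([]byte(digest(k)))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// expectedRoots drops nil intents, groups the rest by NodeID, and
// computes one expected root per node.
func expectedRoots(intents []*intent) map[string]string {
	byNode := map[string][]string{}
	for _, in := range intents {
		if in == nil {
			continue
		}
		byNode[in.NodeID] = append(byNode[in.NodeID], in.Key)
	}
	roots := make(map[string]string, len(byNode))
	for nodeID, keys := range byNode {
		roots[nodeID] = flatRoot(keys)
	}
	return roots
}

// mismatchedNodes compares each reported root with that node's expected
// root, falling back to the empty-tree root for nodes without intents.
func mismatchedNodes(intents []*intent, reported map[string]string) []string {
	roots := expectedRoots(intents)
	var bad []string
	for nodeID, actual := range reported {
		expected, ok := roots[nodeID]
		if !ok {
			expected = flatRoot(nil)
		}
		if actual != expected {
			bad = append(bad, nodeID)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	intents := []*intent{{NodeID: "n1", Key: "a"}, nil, {NodeID: "n2", Key: "b"}}
	reported := map[string]string{
		"n1": flatRoot([]string{"a"}), // in sync
		"n2": "stale-root",            // out of sync
		"n3": flatRoot(nil),           // no intents, empty root
	}
	fmt.Println(mismatchedNodes(intents, reported))
}
```

Comparing against one global root would flag every node that legitimately holds only its own subset of intents, which is the false mismatch this commit is described as removing.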

@ianchen0119
Member

LGTM

I will update the status if the system test reports an error.

@ianchen0119 ianchen0119 merged commit 781269a into Gthulhu:main Feb 17, 2026
2 checks passed
4 participants