Add DM intents merkle root check via k8s adapter#12
Conversation
- cover TraverseIntentMerkleTree branches and deterministic hashing - cover CheckDMIntents paths including online-only DM comparison - cover GetIntentMerkleRoot success, non-OK, and empty-data cases
There was a problem hiding this comment.
Pull request overview
This PR implements Merkle tree-based integrity checking for scheduling intents between the Manager and DecisionMaker nodes. The Manager periodically queries DM pods for their intent Merkle root hashes and compares them against the expected hash computed from the database to detect synchronization issues.
Changes:
- Added Merkle tree utilities with SHA-256 hashing for building and traversing trees
- Extended DecisionMaker service to cache intents and maintain a Merkle root hash
- Added new REST endpoint
GET /api/v1/intents/merklein DecisionMaker for retrieving the Merkle root - Implemented Manager's
CheckDMIntentscron function to compare Merkle roots across online DM nodes - Added comprehensive regression tests covering intent synchronization flows
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/util/merkle.go | New Merkle tree utilities with hash computation, tree building, node finding, and truncation functions |
| pkg/util/merkle_test.go | Basic tests for empty tree, traversal, and truncation operations |
| decisionmaker/domain/pod.go | Added IntentID field to Intent struct (unused) |
| decisionmaker/service/service.go | Added intent caching with Merkle root computation in ProcessIntents and refresh logic |
| decisionmaker/service/intents_svc.go | New TraverseIntentMerkleTree service method for querying Merkle tree structure |
| decisionmaker/service/intents_svc_test.go | Tests covering nil requests, depth truncation, subtree lookup, and deterministic hashing |
| decisionmaker/rest/intent_handler.go | New GetIntentMerkleRoot handler returning root hash |
| decisionmaker/rest/handler.go | Registered new Merkle endpoint in routes |
| manager/domain/interface.go | Added GetIntentMerkleRoot method to DecisionMakerAdapter interface |
| manager/domain/mock_domain.go | Generated mock for new adapter method |
| manager/client/deicison_maker.go | Implemented GetIntentMerkleRoot client method |
| manager/client/deicison_maker_test.go | Tests for successful fetch, non-200 status, and empty data scenarios |
| manager/service/cron_svc.go | New CheckDMIntents function with deterministic intent sorting and hashing |
| manager/service/cron_svc_test.go | Tests covering error paths, online/offline filtering, and hash determinism |
| intent.PodName, | ||
| intent.PodID, | ||
| intent.NodeID, | ||
| intent.K8sNamespace, | ||
| intent.CommandRegex, | ||
| strconv.Itoa(intent.Priority), | ||
| strconv.FormatInt(intent.ExecutionTime, 10), | ||
| strings.Join(labels, ","), |
There was a problem hiding this comment.
Hash serialization format inconsistency between Manager and DecisionMaker. The Manager's scheduleIntentSortKey function (lines 86-95) creates a pipe-delimited string without field name prefixes (e.g., "pod-name|pod-id|..."), while the DecisionMaker's hashIntent function uses field name prefixes (e.g., "podName=pod-name|podID=pod-id|..."). This will cause Merkle root mismatches even when intents are identical. Either both should use prefixes or both should omit them to ensure deterministic hash comparison.
| intent.PodName, | |
| intent.PodID, | |
| intent.NodeID, | |
| intent.K8sNamespace, | |
| intent.CommandRegex, | |
| strconv.Itoa(intent.Priority), | |
| strconv.FormatInt(intent.ExecutionTime, 10), | |
| strings.Join(labels, ","), | |
| "podName=" + intent.PodName, | |
| "podID=" + intent.PodID, | |
| "nodeID=" + intent.NodeID, | |
| "k8sNamespace=" + intent.K8sNamespace, | |
| "commandRegex=" + intent.CommandRegex, | |
| "priority=" + strconv.Itoa(intent.Priority), | |
| "executionTime=" + strconv.FormatInt(intent.ExecutionTime, 10), | |
| "labels=" + strings.Join(labels, ","), |
| if svc.intentMerkleRoot == nil { | ||
| svc.refreshIntentMerkleTreeIfNeeded() | ||
| } | ||
|
|
||
| root := svc.intentMerkleRoot | ||
| if req.RootHash != "" && root != nil { | ||
| found := util.FindMerkleNode(root, req.RootHash) | ||
| if found == nil { | ||
| return &TraverseIntentMerkleTreeResp{RootNode: nil}, nil | ||
| } | ||
| root = found | ||
| } | ||
|
|
||
| truncated := util.TruncateMerkleTree(root, req.Depth) | ||
| return &TraverseIntentMerkleTreeResp{RootNode: convertMerkleNode(truncated)}, nil |
There was a problem hiding this comment.
Race condition: svc.intentMerkleRoot is accessed without holding the intentCacheMu lock. Line 31 checks if the root is nil, line 32 calls refreshIntentMerkleTreeIfNeeded(), and line 35 reads the root again, all without proper synchronization. This can cause data races when ProcessIntents is updating the Merkle root concurrently. Wrap the access to svc.intentMerkleRoot with appropriate read locks.
| svc.intentMerkleRootHash = root.Hash | ||
| } else { | ||
| svc.intentMerkleRootHash = "" | ||
| } |
There was a problem hiding this comment.
The function refreshIntentMerkleTreeIfNeeded is missing a closing brace. This will cause a compilation error. Add a closing brace after line 360 before the comment on line 362.
| } | |
| } | |
| } |
| for _, dm := range dms { | ||
| if dm.State != domain.NodeStateOnline { | ||
| continue | ||
| } | ||
| if svc.DMAdapter == nil { | ||
| return fmt.Errorf("decision maker adapter is nil") | ||
| } |
There was a problem hiding this comment.
The DMAdapter nil check should occur before the loop over DM pods. Currently, the code checks if svc.DMAdapter == nil inside the loop for each online node, which means it will only error on the first online node. This check should be moved before line 52 to fail fast if the adapter is missing, rather than iterating through pods unnecessarily.
| for _, dm := range dms { | |
| if dm.State != domain.NodeStateOnline { | |
| continue | |
| } | |
| if svc.DMAdapter == nil { | |
| return fmt.Errorf("decision maker adapter is nil") | |
| } | |
| if svc.DMAdapter == nil { | |
| return fmt.Errorf("decision maker adapter is nil") | |
| } | |
| for _, dm := range dms { | |
| if dm.State != domain.NodeStateOnline { | |
| continue | |
| } |
| RootNode *Node | ||
| } | ||
|
|
||
| // TODO: TraverseIntentMerkleTree |
There was a problem hiding this comment.
TODO comment should be removed or clarified. This function appears to be fully implemented, so the TODO comment on line 25 is misleading. Either remove the comment if the implementation is complete, or clarify what remains to be done.
| // TODO: TraverseIntentMerkleTree | |
| // TraverseIntentMerkleTree returns a truncated view of the in-memory intent Merkle tree. | |
| // If RootHash is provided and found, traversal starts from that node; otherwise it starts | |
| // from the current Merkle root. The tree is truncated to the specified Depth. |
| logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm) | ||
| continue | ||
| } | ||
| if rootHash != expectedRoot { | ||
| logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash) |
There was a problem hiding this comment.
Incorrect format verb in log statement. The dm variable (of type *domain.DecisionMakerPod) is used with %s format verb, but likely needs a proper string representation. While this will compile, it will print the memory address instead of meaningful information. Consider either implementing a String method for DecisionMakerPod or using structured logging fields (e.g., Str("nodeID", dm.NodeID)) for better log readability.
| logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm) | |
| continue | |
| } | |
| if rootHash != expectedRoot { | |
| logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash) | |
| logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %v", dm) | |
| continue | |
| } | |
| if rootHash != expectedRoot { | |
| logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %v: expected=%s actual=%s", dm, expectedRoot, rootHash) |
| logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm) | ||
| continue | ||
| } | ||
| if rootHash != expectedRoot { | ||
| logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash) |
There was a problem hiding this comment.
Incorrect format verb in log statement. Same issue as line 61 - the dm variable is used with %s format verb which will print a memory address instead of meaningful information. Consider implementing a String method for DecisionMakerPod or using structured logging fields.
| logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %s", dm) | |
| continue | |
| } | |
| if rootHash != expectedRoot { | |
| logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %s: expected=%s actual=%s", dm, expectedRoot, rootHash) | |
| logger.Logger(ctx).Warn().Err(err).Msgf("failed to get merkle root from dm %+v", dm) | |
| continue | |
| } | |
| if rootHash != expectedRoot { | |
| logger.Logger(ctx).Warn().Msgf("intent merkle mismatch for dm %+v: expected=%s actual=%s", dm, expectedRoot, rootHash) |
| @@ -16,6 +16,7 @@ type PodInfo struct { | |||
| } | |||
|
|
|||
| type Intent struct { | |||
There was a problem hiding this comment.
New field IntentID added but never used. This field is added to the Intent struct but is not referenced anywhere in the codebase (neither in hashing logic, serialization, nor any other operations). If this field is intended for future use, consider adding a comment to clarify its purpose. If it should be included in the hash computation for intent comparison, it must be added to the hashIntent function.
| type Intent struct { | |
| type Intent struct { | |
| // IntentID is an optional opaque identifier for this intent. | |
| // It is used by higher-level components (e.g. the manager service or external clients) | |
| // and is intentionally excluded from hashing/comparison logic such as hashIntent, | |
| // which only considers the scheduling properties of an intent. |
* compare manager checks against node-specific intent roots * normalize intent hashing and nil filtering in merkle helpers * lock merkle root reads in TraverseIntentMerkleTree * add concurrency and node-scoped regression tests
|
commit 這次調整
測試補強
影響範圍
|
|
LGTM I will update the status if system test has a error. |
Summary
Changes
CheckDMIntentsnow queries DM pods viak8s_adapterand logs mismatchesGET /api/v1/intents/merkleendpoint returns root hashDecisionMakerAdapter.GetIntentMerkleRootmethoddecisionmaker/service/intents_svc_test.goto cover:RootHashmanager/service/cron_svc_test.goto cover:K8SAdapterDMAdapteron online nodesmanager/client/deicison_maker_test.goto cover:Testing
go test ./decisionmaker/servicego test ./manager/servicego test ./manager/clientgo test ./...go test -race ./decisionmaker/service ./manager/service ./manager/clientNotes