boardwalk: improve operator recovery during caught workflows #231
asullivan-blze
left a comment
On the whole, I agree that this would be useful, especially for deduplicating cases where we run health check loops that ultimately time out or fail and yield webhooks that just concatenate all of the failures:
[ ... message truncated at 2000 characters; see log for 16268 remaining character(s) ... ]
I've added some inline comments for review, and also one that I don't have an easy place to attach to a line:
- Please include an example configuration file, in test/server-client/, for the Slack error advice that's being added; having an example file that matches the existing test playbooks will help with future debugging, versus needing to write one on the fly. I'd suggest using the fail message from test/server-client/playbooks/playbook-job-test-should-fail.yml as an example pattern, where the advice would be along the lines of "Successful test, this may be ignored", or something similar.
I'll probably have more once you review these comments/suggestions -- and actually pull the branch to kick the tires -- but I figured getting your thoughts now would be ideal, since the PR is still in a draft state.
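For illustration of the deduplication point raised in the review above, here is a minimal sketch of collapsing repeated failure messages before they are sent to a webhook. It is not Boardwalk's implementation: the function name `summarize_failures`, the assumption that failures arrive as a list of strings, and the 2000-character default (borrowed from the truncation notice quoted above) are all hypothetical.

```python
from collections import Counter


def summarize_failures(failures: list[str], limit: int = 2000) -> str:
    """Collapse repeated failure messages and cap the total length.

    Hypothetical helper for illustration only; not part of Boardwalk.
    """
    # Counter keeps first-seen order, so messages stay in the order they occurred.
    counts = Counter(failures)
    lines = [
        f"{message} (x{count})" if count > 1 else message
        for message, count in counts.items()
    ]
    text = "\n".join(lines)
    if len(text) > limit:
        # Report how much was dropped instead of silently cutting the message.
        omitted = len(text) - limit
        text = f"{text[:limit]}\n[... truncated; {omitted} character(s) omitted ...]"
    return text
```

With this approach, a health check loop that fails the same way many times would produce one line with a repeat count rather than one line per attempt, which may keep the message under the truncation limit in the first place.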
thanks so much for the review! made the requested changes:
also did a local smoke test with boardwalkd + a fake slack webhook:
What and why
This improves the operator experience when long-running Boardwalk workflows pause or fail in recoverable ways.
The cleanup actions are worker-mediated, so boardwalkd does not need SSH/sudo access to target hosts.
How this was tested