boardwalk: improve operator recovery during caught workflows by bzlo · Pull Request #231 · Backblaze/boardwalk

bzlo · 2026-05-21T23:37:17Z

What and why

This improves the operator experience when long-running Boardwalk workflows pause or fail in recoverable ways.

- Posts Boardwalk API auth login prompts to Slack, including worker/Jenkins context when provided.
- Shows active auth prompts in the Boardwalk UI so operators can find the login link without digging through worker logs.
- Adds generic Slack error-advice rules so deployments can attach local guidance to matching failure messages without hardcoding internal behavior into Boardwalk.
- Mentions the Jenkins-triggering user in Slack error notifications when Boardwalk can resolve their email to a Slack user.
- Improves Ansible failure summaries by surfacing useful result details, including health-check assertion messages.
- Adds caught-workspace UI actions for the active worker to remove:
  - /etc/ansible/facts.d/boardwalk_state.fact
  - /opt/boardwalk.mutex

The cleanup actions are worker-mediated, so boardwalkd does not need SSH/sudo access to target hosts.
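
To illustrate the worker-mediated idea, here is a minimal sketch and not the code in this PR. `CleanupTarget`, `CleanupQueue`, `worker_poll`, and the `remove_on_host` callback are hypothetical names; only the two file paths come from the description above. The assumption is that the server only records a request, and the active worker picks it up and performs the removal with the host access it already has.

```python
from dataclasses import dataclass, field
from enum import Enum


class CleanupTarget(Enum):
    """The two files named in this PR that a worker may be asked to remove."""

    STATE_FACT = "/etc/ansible/facts.d/boardwalk_state.fact"
    MUTEX = "/opt/boardwalk.mutex"


@dataclass
class CleanupQueue:
    """Hypothetical server-side store of pending requests, keyed by workspace."""

    pending: dict[str, list[CleanupTarget]] = field(default_factory=dict)

    def request(self, workspace: str, target: CleanupTarget) -> None:
        # Called when an operator triggers a cleanup action in the UI.
        self.pending.setdefault(workspace, []).append(target)

    def take(self, workspace: str) -> list[CleanupTarget]:
        # Called when the active worker polls; each request is handed out once.
        return self.pending.pop(workspace, [])


def worker_poll(queue, workspace, remove_on_host):
    """Worker side: fetch pending requests and remove each file.

    remove_on_host is a hypothetical callback standing in for however the
    worker reaches the target host; the server never touches the host itself.
    """
    done = []
    for target in queue.take(workspace):
        remove_on_host(target.value)
        done.append(target.value)
    return done
```

In this sketch the server side never needs credentials for the target host: it only calls `request()`, and everything that touches the host happens inside `worker_poll()`.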

How this was tested

Focused local regression tests cover auth prompts, Slack advice/mentions, Ansible error summaries, and cleanup request/worker paths. Checks run:
poetry run pytest test/boardwalkd test/boardwalk -v
poetry run ruff check ...
poetry run ruff format --check ...
poetry run pyright
git diff --check

asullivan-blze

On the whole, I agree that this would be useful, especially for deduplicating cases where we run health check loops that eventually time out or fail and yield webhooks that just concatenate all the failures:

[ ... message truncated at 2000 characters; see log for 16268 remaining character(s) ... ]

I've added some inline comments for review, and also one that I don't have an easy place to attach to a line:

Please include an example configuration file for the Slack error advice that's being added, in test/server-client/. Having an example file that matches the existing test playbooks will help with future debugging, rather than needing to write one on the fly. I'd suggest using the fail message from test/server-client/playbooks/playbook-job-test-should-fail.yml as the example pattern, with advice along the lines of "Successful test, this may be ignored".

I'll probably have more once you review these comments/suggestions -- and actually pull the branch to kick the tires -- but I figured getting your thoughts now would be ideal, since the PR is still in a draft state.

bzlo · 2026-05-27T21:40:11Z

thanks so much for the review! made the requested changes:

- added an example slack error advice TOML config under test/server-client/, using the existing failing test playbook message
- switched slack error advice from inline JSON to TOML config files, and updated the matching configMgmt PR to use the deployed TOML config path too
- generalized the public jenkins_* fields to deployment_* and updated the worker/UI/Slack references

also did a local smoke test with boardwalkd + a fake slack webhook:

- auth prompt slack payload included the login link + deployment context
- /workspaces showed the active auth prompt with the deployment link
- posting a matching error event through local boardwalkd produced a Slack error payload with the configured advice block

the standard pytest, ruff, and pyright checks are passing too.
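
in case it helps to reproduce the smoke test, a fake slack webhook can be as small as the sketch below. this is not the exact script used here, just one way to do it with the standard library; `FakeSlackHandler`, `start_fake_webhook`, and `post_json` are illustrative names, and how boardwalkd gets pointed at the resulting URL is left out.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # JSON payloads captured by the fake webhook


class FakeSlackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body, then keep it for later inspection.
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        # Slack incoming webhooks answer a successful post with a plain "ok".
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, format, *args):
        pass  # silence per-request logging


def start_fake_webhook():
    """Start the server on a free local port and return (server, url)."""
    server = HTTPServer(("127.0.0.1", 0), FakeSlackHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/"


def post_json(url, payload):
    """POST a JSON payload the way a webhook client would; bypasses any proxy."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.read()
```

start it with `server, url = start_fake_webhook()`, send events at `url`, inspect `received`, then call `server.shutdown()` when done.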

boardwalk: improve operator recovery UX

3e39345

asullivan-blze requested changes May 27, 2026

Comment thread src/boardwalkd/protocol.py Outdated

Comment thread src/boardwalkd/cli.py Outdated

boardwalk: address Slack advice review feedback

b03c269

bzlo wants to merge 2 commits into Backblaze:main from bzlo:lorelei/auth-login-slack-notification
