boardwalk: improve operator recovery during caught workflows #231

Draft
bzlo wants to merge 2 commits into Backblaze:main from bzlo:lorelei/auth-login-slack-notification

bzlo (Contributor) commented May 21, 2026

What and why

This improves the operator experience when long-running Boardwalk workflows pause or fail in recoverable ways.

  • Posts Boardwalk API auth login prompts to Slack, including worker/Jenkins context when provided.
  • Shows active auth prompts in the Boardwalk UI so operators can find the login link without digging through worker logs.
  • Adds generic Slack error-advice rules so deployments can attach local guidance to matching failure messages without hardcoding internal behavior into Boardwalk.
  • Mentions the Jenkins-triggering user in Slack error notifications when Boardwalk can resolve their email to a Slack user.
  • Improves Ansible failure summaries by surfacing useful result details, including health-check assertion messages.
  • Adds caught-workspace UI actions that ask the active worker to remove:
    • /etc/ansible/facts.d/boardwalk_state.fact
    • /opt/boardwalk.mutex

The cleanup actions are worker-mediated, so boardwalkd does not need SSH/sudo access to target hosts.
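
As a rough illustration of the worker-mediated cleanup described above, the sketch below shows one way a worker could map a named cleanup request onto a fixed path, so the server only ever sends an action name and never a path. The two paths come from this description; the action names (`state_fact`, `mutex`), the function names, and the `rm -f` command are hypothetical and are not taken from the Boardwalk code.

```python
from pathlib import PurePosixPath

# The two paths are the ones listed in this PR description.
# The action names and the allowlist structure are assumptions for illustration.
ALLOWED_CLEANUP_TARGETS = {
    "state_fact": PurePosixPath("/etc/ansible/facts.d/boardwalk_state.fact"),
    "mutex": PurePosixPath("/opt/boardwalk.mutex"),
}


def resolve_cleanup_target(action: str) -> PurePosixPath:
    """Map a requested action name to a fixed path; reject anything else."""
    try:
        return ALLOWED_CLEANUP_TARGETS[action]
    except KeyError:
        raise ValueError(f"unsupported cleanup action: {action!r}") from None


def build_cleanup_command(action: str) -> list[str]:
    """Build the argv a worker might run on the target host (hypothetical)."""
    return ["rm", "-f", "--", str(resolve_cleanup_target(action))]
```

Because the request carries only an action name, a malformed or malicious request cannot point the worker at an arbitrary file; it either resolves to one of the two known paths or is rejected.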

How this was tested

  • Focused local regression tests covering auth prompts, Slack advice/mentions, Ansible error summaries, and cleanup request/worker paths.
  • poetry run pytest test/boardwalkd test/boardwalk -v
  • poetry run ruff check ...
  • poetry run ruff format --check ...
  • poetry run pyright
  • git diff --check

asullivan-blze (Contributor) left a comment

On the whole, I'd concur that this would be useful, especially for deduplicating the cases where we run health check loops that ultimately time out or fail and yield webhooks that just concatenate all failures:

[ ... message truncated at 2000 characters; see log for 16268 remaining character(s) ... ]

I've added some inline comments for review, and also one that I don't have an easy place to attach to a line:

  • Please include an example configuration file for the new Slack error advice in test/server-client/. Having an example file that matches the existing test playbooks will help with future debugging, versus needing to write one on the fly. I'd suggest using the fail message from test/server-client/playbooks/playbook-job-test-should-fail.yml as the example pattern, with advice along the lines of "Successful test, this may be ignored".

I'll probably have more once you review these comments/suggestions -- and actually pull the branch to kick the tires -- but I figured getting your thoughts now would be ideal, since the PR is still in a draft state.

Comment thread src/boardwalkd/protocol.py Outdated
Comment thread src/boardwalkd/cli.py Outdated
bzlo (Contributor, Author) commented May 27, 2026

thanks so much for the review! made the requested changes:

  • added an example slack error advice TOML config under test/server-client/, using the existing failing test playbook message
  • switched slack error advice from inline JSON to TOML config files, and updated the matching configMgmt PR to use the deployed TOML config path too
  • generalized the public jenkins_* fields to deployment_* and updated the worker/UI/Slack references

also did a local smoke test with boardwalkd + a fake slack webhook:

  • auth prompt slack payload included the login link + deployment context
  • /workspaces showed the active auth prompt with the deployment link
  • posting a matching error event through local boardwalkd produced a Slack error payload with the configured advice block

the standard pytest, ruff, and pyright checks are passing too.
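
For anyone who wants to repeat a smoke test like the one above, a fake Slack webhook can be as small as the sketch below. This is not the script used for the test in this comment; the class and function names are hypothetical. It records each JSON body posted to it and answers `200 ok`, which is how Slack incoming webhooks respond on success.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Every JSON payload posted to the fake webhook is appended here for inspection.
received_payloads: list[dict] = []


class FakeSlackWebhook(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a Slack incoming webhook endpoint."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        received_payloads.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, format: str, *args) -> None:
        pass  # keep smoke-test output quiet


def start_fake_webhook() -> HTTPServer:
    """Start the fake webhook on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), FakeSlackWebhook)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After calling `start_fake_webhook()`, point the local boardwalkd Slack webhook setting at `http://127.0.0.1:<server.server_port>/` and inspect `received_payloads` for the login link, deployment context, or advice block.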
