feat(tamanu): TAM-6782: add reload subcommand for safe rolling restart #313

Open
dannash100 wants to merge 11 commits into main from reload-command

@dannash100

Restarts running tamanu services one at a time with a wait between each,
mirroring the approach in the ops repo's tamanu-single-upgrade playbook.
On Linux, drives systemd units `tamanu-{kind}-*`, `tamanu-frontend@*` and
`tamanu-patientportal`, reloading caddy + flushing systemd-resolved
between restarts so caddy picks up the new container IP. On Windows,
restarts every online pm2 process. Optional HTTP probe via `--check-url`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
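
For readers skimming the thread, the rolling approach described above can be sketched roughly as follows. This is not the code from `reload.rs`; `roll_services`, `restart`, and `is_ready` are hypothetical names, and the callbacks stand in for whatever the systemd or pm2 backend actually does.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Restart each service in order, gate on a readiness check, and pause
/// between restarts. Returns how many services were rolled.
///
/// `restart` and `is_ready` are hypothetical callbacks, not bestool APIs.
fn roll_services<R, C>(
    services: &[&str],
    wait: Duration,
    mut restart: R,
    mut is_ready: C,
) -> Result<usize, String>
where
    R: FnMut(&str) -> Result<(), String>,
    C: FnMut(&str) -> bool,
{
    for (i, &service) in services.iter().enumerate() {
        restart(service)?;
        if !is_ready(service) {
            // Stop early so the services not yet touched keep running.
            return Err(format!("{service} did not become ready"));
        }
        // No cooldown is needed after the final service.
        if i + 1 < services.len() {
            sleep(wait);
        }
    }
    Ok(services.len())
}
```

Stopping at the first service that fails its readiness check is a design assumption of this sketch: it leaves the remaining services untouched rather than restarting them behind a broken one.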

dannash100 and others added 3 commits May 14, 2026 11:34
Restarts running tamanu services one at a time with a wait between each,
mirroring the approach in the ops repo's tamanu-single-upgrade playbook.
On Linux, drives systemd units `tamanu-{kind}-*`, `tamanu-frontend@*` and
`tamanu-patientportal`, reloading caddy + flushing systemd-resolved
between restarts so caddy picks up the new container IP. On Windows,
restarts every online pm2 process. Optional HTTP probe via --check-url.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`systemctl is-active` only proves podman started the container, not that
the Node app inside is serving. After is-active, look up the container by
its `PODMAN_SYSTEMD_UNIT` label, get its netavark IP, and probe
http://<ip>:3000/ until it responds (any non-network-error counts). Skip
for known workers (`*-tasks`, `*-fhir-*`) that don't listen on a port.
Opt-out via --no-strict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
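
A rough sketch of the readiness gate this commit describes, using only the standard library. It assumes the container IP has already been resolved elsewhere (the podman label lookup is not shown), and `is_worker`, `probe_once`, and `wait_until_serving` are illustrative names rather than the functions in the PR.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Workers that do not listen on a port, per the patterns named above
/// (`*-tasks`, `*-fhir-*`). Stripping `.service` first is an assumption.
fn is_worker(unit: &str) -> bool {
    let name = unit.strip_suffix(".service").unwrap_or(unit);
    name.ends_with("-tasks") || name.contains("-fhir-")
}

/// One probe attempt: true if the server sends back any bytes at all.
/// The HTTP status is ignored; only network-level failures count as "down".
fn probe_once(addr: SocketAddr, timeout: Duration) -> bool {
    let Ok(mut stream) = TcpStream::connect_timeout(&addr, timeout) else {
        return false;
    };
    let _ = stream.set_read_timeout(Some(timeout));
    let _ = stream.set_write_timeout(Some(timeout));
    let request = format!("GET / HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }
    let mut buf = [0u8; 16];
    matches!(stream.read(&mut buf), Ok(n) if n > 0)
}

/// Retry `probe_once` until it succeeds or `attempts` run out.
fn wait_until_serving(addr: SocketAddr, attempts: u32, delay: Duration) -> bool {
    for attempt in 0..attempts {
        if probe_once(addr, delay) {
            return true;
        }
        if attempt + 1 < attempts {
            std::thread::sleep(delay);
        }
    }
    false
}
```

A caller would skip the probe when `is_worker(unit)` is true and otherwise call `wait_until_serving` with the container address on port 3000.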
Trip the typos CI check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dannash100 and others added 4 commits May 14, 2026 11:39
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
macos-15 runners may ship without rustup pre-installed; without this,
`rustup toolchain install --no-self-update` fails and downstream
`cargo auditable build` ends up routed to `rustup-init`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
USAGE.md is the Linux-canonical form; update-usage.sh on macOS leaks
the macOS data dir string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pm2 path now rolls one pm_id at a time (so scaled apps like tamanu-api
roll instance-by-instance instead of all at once) and the strict probe
hits http://127.0.0.1:<PORT>/ per process, using the resolved PORT env
var from pm2's increment_var. Workers without a PORT skip the probe,
matching the Linux worker carve-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
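
To illustrate the per-process behaviour on the pm2 path, here is a small sketch. The `Pm2Process` struct and its field names are invented for illustration, and restarting in ascending `pm_id` order is an assumption; the commit only says that one `pm_id` is rolled at a time.

```rust
/// A pm2 process as this sketch models it; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Pm2Process {
    pm_id: u32,
    name: String,
    /// Resolved PORT for this instance, if the process listens at all.
    port: Option<u16>,
}

/// Order processes so each pm_id is restarted on its own, lowest id first.
/// (Ascending order is an assumption of this sketch.)
fn restart_order(mut procs: Vec<Pm2Process>) -> Vec<Pm2Process> {
    procs.sort_by_key(|p| p.pm_id);
    procs
}

/// URL to probe after restarting `process`, or None for port-less workers.
fn probe_url(process: &Pm2Process) -> Option<String> {
    process.port.map(|port| format!("http://127.0.0.1:{port}/"))
}
```

Because each instance carries its own resolved port, two instances of a scaled app get distinct probe URLs, and a worker with no `PORT` yields `None` and skips the probe.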
Comment thread crates/bestool/src/actions/tamanu/reload.rs Outdated
Comment thread crates/bestool/src/actions/tamanu/reload.rs Outdated
Comment thread crates/bestool/src/actions/tamanu/reload.rs Outdated
dannash100 and others added 2 commits May 18, 2026 11:32
- Deserialize systemd `active` as bool instead of String
- Collapse detect_backend into a single if/else chain
- Rename --no-strict to --no-probe-http for clarity

Co-authored-by: Félix Saparelli <felix@passcod.name>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dannash100 dannash100 requested a review from passcod May 18, 2026 03:05
@passcod passcod changed the title from "feat(tamanu): add reload subcommand for safe rolling restart" to "feat(tamanu): TAM-6782: add reload subcommand for safe rolling restart" May 18, 2026
@passcod
Member

passcod commented May 18, 2026

LGTM but please test builds on actual non-prod before we merge :)

@passcod
Member

passcod commented May 19, 2026

Yeah, that's not working right

sudo ./bestool-new t reload
2026-05-19T09:12:52.695611Z  INFO bestool::actions::tamanu::reload: rolling tamanu services backend=Systemd kind=Facility
2026-05-19T09:12:52.707640Z  INFO bestool::actions::tamanu::reload: will restart 3 service(s):
2026-05-19T09:12:52.707658Z  INFO bestool::actions::tamanu::reload:   - tamanu-frontend@a.service
2026-05-19T09:12:52.707662Z  INFO bestool::actions::tamanu::reload:   - tamanu-frontend@b.service
2026-05-19T09:12:52.707663Z  INFO bestool::actions::tamanu::reload:   - tamanu-patientportal.service
2026-05-19T09:12:52.715172Z  INFO bestool::actions::tamanu::reload: [1/3] restarting tamanu-frontend@a.service
2026-05-19T09:12:53.567732Z  INFO bestool::actions::tamanu::reload: waiting 10s before next restart
2026-05-19T09:13:03.571674Z  INFO bestool::actions::tamanu::reload: [2/3] restarting tamanu-frontend@b.service
2026-05-19T09:13:04.347646Z  INFO bestool::actions::tamanu::reload: waiting 10s before next restart
2026-05-19T09:13:14.348755Z  INFO bestool::actions::tamanu::reload: [3/3] restarting tamanu-patientportal.service
2026-05-19T09:13:15.113652Z  INFO bestool::actions::tamanu::reload: rolled 3 service(s)

Also needs to call sudo

And I don't see it check that the frontends are up before moving on?

@dannash100
Author

@passcod on it!

If invoked without root on a systemd host, re-execs the same args via
sudo so users don't have to remember the prefix. Promotes the per-unit
active/HTTP probe logs from debug to info so each restart visibly shows
the readiness gate before the cooldown.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
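
The re-exec idea can be sketched like this. How the PR detects root is not shown in this thread, so `is_root` is passed in as a parameter here; `sudo_reexec` and `sudo_reexec_self` are hypothetical helper names.

```rust
use std::ffi::OsString;
use std::process::Command;

/// If not root, build `sudo <program> <args...>`; if already root, return
/// None because no re-exec is needed.
///
/// `is_root` is supplied by the caller: the standard library has no
/// effective-uid check, and the real detection logic is not shown here.
fn sudo_reexec(is_root: bool, program: OsString, args: Vec<OsString>) -> Option<Command> {
    if is_root {
        return None;
    }
    let mut cmd = Command::new("sudo");
    cmd.arg(program).args(args);
    Some(cmd)
}

/// Convenience wrapper that re-uses the current process's own invocation.
fn sudo_reexec_self(is_root: bool) -> std::io::Result<Option<Command>> {
    let exe = std::env::current_exe()?;
    let args = std::env::args_os().skip(1).collect();
    Ok(sudo_reexec(is_root, exe.into_os_string(), args))
}
```

A caller would run the returned command with `status()` and exit with its exit code, so the elevated child does all the work and the unprivileged parent only relays the result.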
@passcod
Member

passcod commented May 20, 2026

Ah, something that was missing but I didn't say explicitly there is that this was just restarting the frontends and not doing the apis/workers/tasks/etc. Maybe it's just matching on tamanu-[a-z0-9@]+ (doesn't include -s?)

…efixed

The deployment uses unprefixed unit names (tamanu-api, tamanu-tasks,
tamanu-sync, tamanu-fhir-*), same as on pm2. The kind-prefix filter
('tamanu-central-' / 'tamanu-facility-') matched nothing, so reload
was only catching the explicit tamanu-frontend@* and tamanu-patientportal
units. Broaden to all active 'tamanu-*' units, excluding templates,
'tamanu-meta*', and 'tamanu-alertd'. Switch systemd kind detection to
the same tamanu-sync heuristic the pm2 path uses.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
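
The broadened filter described in this commit could look something like the sketch below. It is written from the commit message alone: `should_reload` and `select_units` are illustrative names, treating a "template" as a unit whose name ends in `@` with no instance is an assumption, and so is ignoring anything that is not a `.service` unit.

```rust
/// Decide whether an active systemd unit belongs in the rolling restart.
fn should_reload(unit: &str) -> bool {
    // Assumption: only `.service` units are considered.
    let Some(name) = unit.strip_suffix(".service") else {
        return false;
    };
    // Assumption: a bare template such as `tamanu-frontend@` has nothing
    // after the `@`, while instances like `tamanu-frontend@a` do.
    let is_template = name.ends_with('@');
    name.starts_with("tamanu-")
        && !is_template
        && !name.starts_with("tamanu-meta")
        && name != "tamanu-alertd"
}

/// Keep only the units that should be rolled, preserving input order.
fn select_units<'a>(active: &[&'a str]) -> Vec<&'a str> {
    active.iter().copied().filter(|u| should_reload(u)).collect()
}
```

Matching on the bare `tamanu-` prefix means both the unprefixed names in the commit message and kind-prefixed names such as `tamanu-central-api@1.service` are selected.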
@passcod
Member

passcod commented May 20, 2026

> The deployment uses unprefixed unit names (tamanu-api, tamanu-tasks,
> tamanu-sync, tamanu-fhir-*), same as on pm2. The kind-prefix filter
> ('tamanu-central-' / 'tamanu-facility-') matched nothing, so reload
> was only catching the explicit tamanu-frontend@* and tamanu-patientportal
> units. Broaden to all active 'tamanu-*' units, excluding templates,
> 'tamanu-meta*', and 'tamanu-alertd'. Switch systemd kind detection to
> the same tamanu-sync heuristic the pm2 path uses.

While the fix probably works, the reasoning is bonkers lmao. Wonder what it thinks of:

$ sudo systemctl list-units tamanu-*
  UNIT                                LOAD   ACTIVE SUB     DESCRIPTION
  tamanu-central-api@1.service        loaded active running Tamanu Central server API
  tamanu-central-api@2.service        loaded active running Tamanu Central server API
  tamanu-central-fhir-refresh.service loaded active running Tamanu Central FHIR worker (refresh)
  tamanu-central-fhir-resolve.service loaded active running Tamanu Central FHIR worker (resolve)
  tamanu-central-tasks.service        loaded active running Tamanu Central server scheduled tasks
  tamanu-frontend@a.service           loaded active running Tamanu frontend
  tamanu-frontend@b.service           loaded active running Tamanu frontend
  tamanu-patientportal.service        loaded active running Tamanu patient portal
