Category: bug Severity: major
Location: lib/arcp/runtime/job_manager.rb:129-145
Spec: ARCP v1.1 §7.4, §7.3
What
Spec §7.4: cancellation applies to "a non-terminal job." JobManager#cancel has no terminal-state guard. If a client cancels a job that has already reached success/error/timed_out, the method calls record.task&.stop (a no-op) and then unconditionally publish_error(... final_status: 'cancelled' ...), which (a) rewrites the stored status to cancelled, flipping a completed job's terminal state, (b) emits a second terminal job.error (CANCELLED) into the event stream and fans it out to all subscribers, and (c) re-runs credential revocation and lease teardown. A late/duplicate cancel thus corrupts a successful job's recorded outcome and double-emits a terminal event.
Evidence
# lib/arcp/runtime/job_manager.rb:129-145 — no check that the job is non-terminal
def cancel(job_id:, principal_id:, reason: nil)
  record = @mutex.synchronize { @jobs[job_id] }
  raise Arcp::Errors::JobNotFound, "no such job: #{job_id}" unless record
  # ... principal check ...
  record.task&.stop
  publish_error(job_id, Arcp::Job::JobError.new(
    job_id: job_id, final_status: 'cancelled',
    code: 'CANCELLED', message: reason, retryable: false, details: {}))
end
Proposed fix
Before stopping the task / publishing, check the current status; if it is already terminal (success/succeeded/error/cancelled/timed_out), return without re-publishing (optionally a no-op ack, never a status overwrite). Guard the read-and-transition under the existing mutex to avoid racing a concurrently-finishing agent. Add a test that completes a job, then cancels it, and asserts the recorded status stays terminal-success and no second job.error is emitted.
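The guard described above could look roughly like the following. This is a minimal, self-contained sketch, not the actual JobManager: SketchJobManager, JobRecord, TERMINAL_STATUSES, register, status_of, and the published array (standing in for the event stream) are all hypothetical names introduced for illustration, and the principal check, credential revocation, and lease teardown are omitted. It also assumes the status can be claimed under the mutex before publishing, whereas the real publish_error may be what writes the status.

```ruby
# Hypothetical, simplified stand-in for the real JobManager. Only the
# terminal-state guard is the point; every name here is an assumption.
class SketchJobManager
  # Status names taken from the list in the proposed fix.
  TERMINAL_STATUSES = %w[success succeeded error cancelled timed_out].freeze

  JobRecord = Struct.new(:status, :task)

  attr_reader :published

  def initialize
    @mutex = Mutex.new
    @jobs = {}
    @published = [] # stands in for the event stream / subscriber fan-out
  end

  def register(job_id, status: 'running', task: nil)
    @mutex.synchronize { @jobs[job_id] = JobRecord.new(status, task) }
  end

  def status_of(job_id)
    @mutex.synchronize { @jobs.fetch(job_id).status }
  end

  # Returns true if this call performed the cancellation, false if the job
  # was already terminal (a no-op acknowledgement, never a status overwrite).
  def cancel(job_id:, reason: nil)
    record = @mutex.synchronize do
      rec = @jobs[job_id]
      raise KeyError, "no such job: #{job_id}" unless rec
      next nil if TERMINAL_STATUSES.include?(rec.status)

      rec.status = 'cancelled' # claim the transition while holding the lock
      rec
    end
    return false unless record

    record.task&.stop
    @published << { job_id: job_id, code: 'CANCELLED', message: reason }
    true
  end
end

manager = SketchJobManager.new
manager.register('job-1', status: 'success')
manager.cancel(job_id: 'job-1', reason: 'late cancel') # no-op: job already terminal
manager.status_of('job-1')                             # still 'success'
```

Reading the status and claiming the cancelled transition inside one synchronize block is what closes the race with a concurrently finishing agent. For that to hold in the real code, the completion path would presumably need to take the same lock and apply the same terminal check before writing its own final status.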
Acceptance criteria
No second job.error is fanned out for an already-terminal job.