feat: surface transaction abort reason on TxnConflictException #290

Open
rahst12 wants to merge 2 commits into dgraph-io:main from rahst12:txn-abort-reason-surface-phase-1

@rahst12 rahst12 commented Jun 17, 2026

Problem

When Dgraph aborts a transaction, the Java client collapses every cause into a single
opaque exception:

io.dgraph.TxnConflictException: Transaction has been aborted. Please retry

TxnConflictException exposes no way to tell why the transaction aborted, so an
application cannot distinguish the cases that warrant different responses:

  • a write-write conflict — retry immediately with a fresh transaction;
  • a predicate move — a predicate is relocating between groups and commits on it are
    temporarily blocked, so back off and retry once the move completes;
  • a stale start-ts — the transaction predates the current Zero leader (a leader
    change); retry with a fresh transaction.

The server now reports the category (see the companion dgraph PR), encoding it as a
"<code>: <detail>" prefix on the gRPC ABORTED status description. Without client
support, that information is only visible by string-scraping getMessage().

Example

Before, every abort looked the same — there was no way to branch on the cause:

catch (TxnConflictException e) {
  retryWithNewTxn(); // the only option, regardless of why it aborted
}

Now the category is available via getReason(), so callers can respond appropriately:

try {
  txn.mutate(mutation); // or txn.commit()
} catch (TxnConflictException e) {
  switch (e.getReason()) {
    case CONFLICT:       // write-write conflict — retry now with a fresh txn
    case STALE_STARTTS:  // leader change — retry with a fresh txn
      retryWithNewTxn();
      break;
    case PREDICATE_MOVE: // predicate relocating — back off, then retry
      backoffAndRetry();
      break;
    case UNKNOWN:        // older server, or unrecognized reason
    default:
      log.warn("Txn aborted: {}", e.getMessage()); // full text still available
      retryWithNewTxn();
  }
}
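
The "back off, then retry" branch above could be driven by a small delay policy keyed on the abort reason. The following is only a sketch under stated assumptions: the AbortReason enum here is a local stand-in for the one this PR adds, and the class name, base delay, and cap are hypothetical values chosen for illustration, not part of the client.

```java
// Hypothetical helper, not part of the dgraph4j client.
final class AbortBackoffSketch {
  // Local stand-in for the AbortReason enum introduced by this PR.
  enum AbortReason { CONFLICT, PREDICATE_MOVE, STALE_STARTTS, UNKNOWN }

  // Assumed tuning values; pick ones that suit the deployment.
  static final long BASE_MILLIS = 100;
  static final long MAX_MILLIS = 5_000;

  /**
   * Returns how long to wait before retrying with a fresh transaction.
   * Only PREDICATE_MOVE waits (capped exponential backoff); every other
   * reason, including UNKNOWN, retries immediately as in the example above.
   */
  static long delayMillis(AbortReason reason, int attempt) {
    if (reason != AbortReason.PREDICATE_MOVE) {
      return 0L;
    }
    // Clamp the exponent so the shift cannot overflow a long.
    int shift = Math.min(Math.max(attempt, 0), 20);
    return Math.min(BASE_MILLIS << shift, MAX_MILLIS);
  }
}
```

A retry loop would then sleep for delayMillis(e.getReason(), attempt) before opening a new transaction, incrementing attempt on each abort.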

Fix

Prerequisite: dgraph-io/dgraph#9747

Expose the category as a typed value on TxnConflictException, parsed from the status
description the server already sends.

  • AbortReason enum: CONFLICT, PREDICATE_MOVE, STALE_STARTTS, UNKNOWN.
  • TxnConflictException.getReason(): parses the "<code>: <detail>" prefix off the
    gRPC status description and maps it to an AbortReason. The full human-readable
    text remains available via getMessage().
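
As a rough illustration of the prefix parsing described above, the sketch below maps a status description to a reason. It is not the code in this PR. The exact code strings the server sends are not shown here, so the sketch assumes they match the enum names after upper-casing and replacing hyphens with underscores (e.g. "predicate-move"); the class and method names are likewise hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch; the real parsing lives in the client and may differ.
final class AbortReasonSketch {
  // Same four values as the AbortReason enum described above.
  enum AbortReason { CONFLICT, PREDICATE_MOVE, STALE_STARTTS, UNKNOWN }

  /**
   * Parses a "<code>: <detail>" status description into an AbortReason.
   * A null description, a description with no prefix, or an unrecognized
   * code all degrade to UNKNOWN.
   */
  static AbortReason parse(String description) {
    if (description == null) {
      return AbortReason.UNKNOWN;
    }
    int sep = description.indexOf(": ");
    if (sep <= 0) {
      return AbortReason.UNKNOWN; // no prefix, or an empty code
    }
    // ASSUMPTION: wire codes normalize to the enum constant names.
    String code = description.substring(0, sep).trim()
        .toUpperCase(Locale.ROOT)
        .replace('-', '_');
    try {
      return AbortReason.valueOf(code);
    } catch (IllegalArgumentException e) {
      return AbortReason.UNKNOWN; // code from a newer server
    }
  }
}
```

Under these assumptions, the legacy message "Transaction has been aborted. Please retry" contains no ": " separator and therefore parses to UNKNOWN, which matches the backward-compatible behavior described for older servers.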

Backward compatible by design:

  • Against an older server that reports no reason, getReason() returns
    UNKNOWN, so callers degrade gracefully.
  • getMessage() is unchanged, isRetryable() still returns true, and the exception
    type/hierarchy is unchanged — existing catch blocks and retry loops are unaffected.

Tests

  • AbortReasonTest: unit tests feeding synthetic gRPC ABORTED statuses through
    Exceptions.translate, asserting each category parses correctly and that an absent or
    prefix-less description degrades to UNKNOWN.
  • AbortReasonLiveTest: cross-language end-to-end test that drives a real
    (locally patched) Dgraph cluster, forces each abort category, and asserts it propagates
    all the way to getReason(). Skips gracefully when the cluster prerequisites
    (multi-group for predicate-move, a Zero-restart hook for stale-startts) are not
    configured. A docker-compose.abort-reason.yml is included to stand up that cluster.

Future work

This change surfaces the category only, which is the slice the server provides today. Once the
server enriches aborts with the contended predicate and UID/token (planned dgraph
follow-up via the gRPC rich-error model, then a first-class TxnContext field), this
client can add typed accessors such as getConflictPredicates() and getConflicts()
without breaking the getReason() surface introduced here. A later phase may also expand
AbortReason (e.g. splitting non-move predicate-move cases into PREDICATE_UNAVAILABLE
/ INTERNAL) and revisit isRetryable() for the non-retryable internal case. Unrecognized
codes already map to UNKNOWN, so adding values stays backward compatible.

@rahst12 rahst12 requested a review from a team as a code owner June 17, 2026 05:53
CLAassistant commented Jun 17, 2026

CLA assistant check
All committers have signed the CLA.

@rahst12 rahst12 changed the title from "Txn abort reason surface phase 1" to "feat: surface transaction abort reason on TxnConflictException" Jun 17, 2026
@amalistari amalistari requested a review from mlwelles June 18, 2026 13:54