DAOS-19212 object: some improvement to avoid server overload #18609

Draft: Nasf-Fan wants to merge 1 commit into master from Nasf-Fan/DAOS-19212_2

@Nasf-Fan Nasf-Fan commented Jul 3, 2026

This PR mainly includes two fixes:

  1. Make the client retry the modification RPC if the related IO handler repeatedly hits -DER_TX_RESTART.

On the server side, when an IO handler repeatedly hits -DER_TX_RESTART, the failure is quite possibly related to server overload or to RPC delay caused by congestion. In that case, retrying on the server with a newer epoch may further increase server workload and congestion, so the server asks the client to retry after some backoff delay instead.
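A minimal sketch of this first fix, for illustration only and not taken from the patch. All `sk_`/`SK_` names, the placeholder error values, the local restart limit of 3, and the capped exponential backoff are assumptions; the PR does not state the actual threshold or the backoff policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder codes for this sketch only; NOT the real DAOS error values. */
enum {
	SK_OK           = 0,
	SK_TX_RESTART   = -1, /* stands in for -DER_TX_RESTART */
	SK_CLIENT_RETRY = -2, /* hypothetical "client, retry later" reply */
};

/* Assumed limit on server-side restarts; the PR does not give a number. */
#define SK_MAX_LOCAL_RESTARTS 3

/* Hypothetical single attempt of the modification at a given epoch. */
typedef int (*sk_attempt_fn)(uint64_t epoch, void *arg);

/*
 * Server side: retry locally with a newer epoch a bounded number of times,
 * then stop and hand the retry back to the client.
 */
static int
sk_handle_update(sk_attempt_fn attempt, void *arg, uint64_t epoch)
{
	int restarts = 0;
	int rc;

	assert(attempt != NULL);
	while ((rc = attempt(epoch, arg)) == SK_TX_RESTART) {
		if (++restarts >= SK_MAX_LOCAL_RESTARTS)
			return SK_CLIENT_RETRY;
		epoch++; /* stand-in for "pick a newer epoch" */
	}
	return rc;
}

/* Client side: capped exponential backoff (ms) before resending the RPC. */
static uint32_t
sk_backoff_ms(unsigned int retry, uint32_t base_ms, uint32_t cap_ms)
{
	uint64_t delay = base_ms;

	while (retry-- > 0 && delay < cap_ms)
		delay *= 2;
	return (uint32_t)(delay < cap_ms ? delay : cap_ms);
}

/* Demo attempt: fails with SK_TX_RESTART until *arg (remaining failures) is 0. */
static int
sk_fail_n_times(uint64_t epoch, void *arg)
{
	int *remaining = arg;

	(void)epoch;
	if (*remaining > 0) {
		(*remaining)--;
		return SK_TX_RESTART;
	}
	return SK_OK;
}
```

With these assumptions, a handler that restarts twice still completes on the server, while one that keeps restarting is returned to the client after the third restart, and the client waits `sk_backoff_ms(retry, base, cap)` before resending.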

  2. Restrict inflight object RPC bulk transfers.

Too many inflight object RPC bulk transfers may cause server overload and network congestion. This patch introduces a new server environment variable, "DAOS_OBJ_RPC_BULK_THD", to control the number of inflight object RPC bulk transfers. Once that threshold is exceeded, the server asks the client to retry. The default value of DAOS_OBJ_RPC_BULK_THD is 512 (per target). The admin can disable the restriction by setting it to zero.
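The per-target limit could be sketched roughly as follows. This is an illustrative sketch, not the code in the patch: only the variable name `DAOS_OBJ_RPC_BULK_THD`, the default of 512, and the "zero disables" rule come from the description above. The `sk_` names, the fallback to the default on invalid input, the choice to reject a transfer once the inflight count has reached the threshold, and the assumption that each target's counter is touched by a single thread are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_BULK_THD_ENV     "DAOS_OBJ_RPC_BULK_THD"
#define SK_BULK_THD_DEFAULT 512u

/* Per-target state; assumed to be accessed by one thread only. */
struct sk_bulk_gate {
	uint32_t threshold; /* 0 disables the limit */
	uint32_t inflight;  /* bulk transfers currently in progress */
};

/* Parse the threshold; fall back to the default on missing or invalid input. */
static uint32_t
sk_parse_threshold(const char *val)
{
	char          *end;
	unsigned long  n;

	if (val == NULL || *val < '0' || *val > '9')
		return SK_BULK_THD_DEFAULT;
	errno = 0;
	n = strtoul(val, &end, 10);
	if (errno != 0 || *end != '\0' || n > UINT32_MAX)
		return SK_BULK_THD_DEFAULT;
	return (uint32_t)n;
}

/* Read the environment variable once when the target is set up. */
static void
sk_gate_init(struct sk_bulk_gate *g)
{
	assert(g != NULL);
	g->threshold = sk_parse_threshold(getenv(SK_BULK_THD_ENV));
	g->inflight  = 0;
}

/* Returns true if the bulk transfer may start; false means "ask client to retry". */
static bool
sk_gate_enter(struct sk_bulk_gate *g)
{
	if (g->threshold != 0 && g->inflight >= g->threshold)
		return false;
	g->inflight++;
	return true;
}

/* Call once for every transfer that was admitted by sk_gate_enter(). */
static void
sk_gate_exit(struct sk_bulk_gate *g)
{
	if (g->inflight > 0)
		g->inflight--;
}
```

In this sketch an RPC handler would call `sk_gate_enter()` before starting the bulk transfer, reply with a retry request if it returns false, and call `sk_gate_exit()` when the transfer completes.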

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jul 3, 2026

Ticket title is 'Aurora: performance jobs keep timing out with v2.8.0-rc1 and MDonSSD'
Status is 'In Progress'
Labels: '2.8.0rc1,md_on_ssd,scrubbed_2.8,test_2.8.0rc'
https://daosio.atlassian.net/browse/DAOS-19212

Mainly includes two fixes:

1. Make the client retry the modification RPC if the related IO handler
   repeatedly hits -DER_TX_RESTART.

On the server side, when an IO handler repeatedly hits -DER_TX_RESTART,
the failure is quite possibly related to server overload or to RPC delay
caused by congestion. In that case, retrying on the server with a newer
epoch may further increase server workload and congestion, so the server
asks the client to retry after some backoff delay instead.

2. Restrict inflight object RPC bulk transfers.

Too many inflight object RPC bulk transfers may cause server overload
and network congestion. This patch introduces a new server environment
variable, "DAOS_OBJ_RPC_BULK_THD", to control the number of inflight
object RPC bulk transfers. Once that threshold is exceeded, the server
asks the client to retry. The default value of DAOS_OBJ_RPC_BULK_THD is
512 (per target). The admin can disable the restriction by setting it
to zero.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19212_2 branch from 237e152 to 771e673 Compare July 3, 2026 07:08