Add Realm execution support for parallel operators (Replicate, Repartition, Combine, Reduction) #1641
Open
seemamirch wants to merge 5 commits into flexflow:master from
Conversation
added 5 commits
April 9, 2026 15:49
- Add perform_pass_expansion_for_replicate for fwd/bwd pass expansion
- Add perform_shard_expansion_for_replicate and _bwd for shard expansion
- Add build_replicate_invocation in make_dynamic_open_dataflow_graph
- Add is_replicate_attrs helper and guard replicate in copy_insertion
- Add ReplicateAttrs to TrainingOperationAttrs
- Add SumReductionFloat/Double for backward replicate reduce operation
- Add issue_replicate_bwd in spawn_dynamic_node_invocation
- Fix per_device_op_state init race condition with direct write
- Fix .value() calls on optional per_device_op_state across op impls
- Update issue_copy to support optional reduction op
- Add testcase for replicate op
…ion) in Realm backend (CPU versions)

Each parallel op is handled via Realm copies rather than op tasks:
- Replicate FWD: broadcast copy; BWD: sum-reduce replica gradients
- Repartition FWD: scatter into shards; BWD: gather shards into full tensor
- Combine FWD: gather shards into full tensor; BWD: scatter gradient into shards
- Reduction FWD: sum-reduce partials; BWD: broadcast gradient to all partials

Key implementation details:
- Parallel ops have no ComputationGraphOpAttrs equivalent
- Instance allocation uses offset index spaces for sharded tensors
- issue_copy uses actual instance index space via get_indexspace()
- Add CopyDomain::SRC/DST to select correct copy domain
- Combine FWD and Reduction FWD register only first invocation in ManyToOne
- Add get_per_device_shape() for correct per-device tensor size
- Add perform_shard_expansion_one_to_many and _many_to_one generic functions
- Add parallel_op_utils.h shared header for is_parallel_op_attrs
- Add CopyDomain enum and create_instance_with_offset to RealmContext
- Add multi-cpu tests for the parallel operators
Author
@lockshaw @elliottslaughter - please review
Summary
This PR implements CPU-based support for all four parallel operators (Replicate, Repartition, Combine, Reduction) in the Realm execution backend.
Motivation
Parallel operators are fundamental to distributed training in FlexFlow — they handle data redistribution between devices (repartition), replication (replicate), gathering (combine), and partial sum reduction (reduction). Previously only the graph pass infrastructure existed; this PR adds the Realm execution layer.
Approach
Each parallel op is executed as a Realm copy operation rather than a regular op task, since they involve data movement between devices rather than computation.
Data Movement
Key Design Decisions
Offset index space allocation: Each shard instance is allocated with a non-zero origin rect reflecting its position in the full tensor (e.g. shard 1 of a [10,16] tensor split along dim 0 is allocated at [5..9, 0..15]). This allows plain Realm copies between shards and combined tensors to work correctly.
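The offset calculation described above can be sketched as follows (a minimal illustration of the arithmetic only; `Rect2` and `shard_rect` are hypothetical names, not the Realm or FlexFlow API):

```cpp
#include <array>
#include <cassert>

// Inclusive-bound rect, mirroring Realm's Rect convention.
struct Rect2 {
  std::array<int, 2> lo;
  std::array<int, 2> hi;
};

// Shard `shard_idx` of a tensor split along `dim` into `degree` pieces gets a
// rect whose origin along `dim` is shifted by shard_idx * (extent / degree),
// e.g. shard 1 of a [10,16] tensor split on dim 0 occupies [5..9, 0..15].
Rect2 shard_rect(std::array<int, 2> extents, int dim, int degree,
                 int shard_idx) {
  Rect2 r{{0, 0}, {extents[0] - 1, extents[1] - 1}};
  int per_shard = extents[dim] / degree;
  r.lo[dim] = shard_idx * per_shard;
  r.hi[dim] = r.lo[dim] + per_shard - 1;
  return r;
}
```

Because each shard instance already sits at its true coordinates in the full index space, a plain Realm copy between a shard and the combined tensor touches exactly the overlapping rectangle.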
CopyDomain: A new CopyDomain enum (SRC/DST) selects which instance's index space is used as the copy domain. SRC is used when the source is the smaller piece; DST when the destination is the smaller piece.

get_per_device_shape: A new function that computes the per-device tensor size by dividing each dimension by its shard degree, used for instance allocation.
ManyToOne collision fix: Combine FWD and Reduction FWD produce multiple invocations with the same output DynamicValueAttrs. Only the first is registered in the producer map to avoid a collision, while still marking the value as having a producer.

Changes
New files
- lib/task-spec/include/task-spec/dynamic_graph/parallel_op_utils.h — shared is_parallel_op_attrs helper
- lib/realm-execution/test/src/realm-execution/test_op_combine.cc — combine op e2e test
- lib/realm-execution/test/src/realm-execution/test_op_reduce.cc — reduction op e2e test
- lib/realm-execution/test/src/realm-execution/test_op_repartition.cc — repartition op e2e test

Modified files
- shard_expansion.cc — add perform_shard_expansion_one_to_many and perform_shard_expansion_many_to_one generic functions covering all eight FWD/BWD combinations
- pass_expansion.cc — add perform_pass_expansion_for_parallel_op
- copy_insertion.cc — guard parallel ops in copy insertion
- make_dynamic_open_dataflow_graph_from_mapped_pcg.cc — dispatch parallel ops to build_parallel_op_invocation
- dynamic_open_dataflow_graph.cc — fix ManyToOne collision for combine/reduction FWD
- pcg_instance.cc — add Realm copy dispatch for all parallel ops
- realm_context.cc/h — add CopyDomain, create_instance_with_offset; update issue_copy to use get_indexspace()
- instance_allocation.cc — use offset index spaces for sharded tensor allocation
- parallel_tensor_dims.cc/h — add get_per_device_dims
- parallel_tensor_shape.cc/h — add get_per_device_shape
- task_id_t.cc — return nullopt for all parallel ops in get_init_task_id_for_op_attrs

Testing
Added e2e tests for each parallel op running on 2 CPU devices:
- test_op_combine.cc — repartition → combine → relu
- test_op_reduce.cc — repartition → linear → reduction → relu
- test_op_repartition.cc — repartition → relu