feat: support AllReduce example with OpenMPI backend implementation #1
Open
Ziminli wants to merge 9 commits into `feat/dev-infra`
- link the example programs with the `src/` library in CMake
- use internal device/runtime/traits for validation
- add malloc/memcpy/free runtime calls in the `examples/all_reduce` example
… `examples/all_reduce.cc`
- support `infiniFinalize()` and add its ompi implementation, used in `examples/all_reduce.cc`
- create `examples/utils.h` to hold the utilities shared by the example programs, and move the `CHECK_INFINI` macro into it
…t related required features, and fix errors
- support `infiniCommInitAll()` and `infiniCommDestroy()` with the ompi backend
- change `Init()` to use `MPI_THREAD_FUNNELED` in the ompi implementation (it hangs otherwise)
- add some mutators to the `Communicator` class
- update `OmpiInstance` with a default handle value and a `Destroy()` method
- add a `SetDevice()` alias for NVIDIA's runtime
- print error code information in the `CHECK_INFINI` macro in `examples/utils.h`
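The error-code printing added to `CHECK_INFINI` can be pictured with a minimal sketch. Everything named `Demo…` below is hypothetical and exists only for illustration; the real result type, codes, and message format in `examples/utils.h` may differ.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical status codes (assumption); not the real InfiniCCL result type.
enum DemoResult { kDemoSuccess = 0, kDemoInvalidArgument = 1, kDemoInternalError = 2 };

// Maps a status code to a readable name for the error message.
inline const char* DemoResultName(DemoResult r) {
  switch (r) {
    case kDemoSuccess: return "success";
    case kDemoInvalidArgument: return "invalid argument";
    case kDemoInternalError: return "internal error";
  }
  return "unknown";
}

// Builds the failure message: call site, failing expression, numeric code, name.
inline std::string FormatDemoError(const char* file, int line, const char* expr,
                                   DemoResult r) {
  return std::string(file) + ":" + std::to_string(line) + ": `" + expr +
         "` failed with code " + std::to_string(static_cast<int>(r)) + " (" +
         DemoResultName(r) + ")";
}

// Evaluates `call` once; on failure prints the formatted message and exits.
#define CHECK_DEMO(call)                                                   \
  do {                                                                     \
    DemoResult demo_result_ = (call);                                      \
    if (demo_result_ != kDemoSuccess) {                                    \
      std::fprintf(stderr, "%s\n",                                         \
                   FormatDemoError(__FILE__, __LINE__, #call, demo_result_) \
                       .c_str());                                          \
      std::exit(EXIT_FAILURE);                                             \
    }                                                                      \
  } while (0)
```

Wrapping each API call as `CHECK_DEMO(SomeCall(args));` keeps the example code linear while still reporting which call failed and with which code.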
…the message printing
- add a simple `Logger` and its `PrintMsg()` method in `src/logging.h`
- update the places where it is used: `src/base/comm_init_all.h` and `src/ompi/impl/comm_init_all.h`
…O` comments.
- add a `LOG` macro for convenient logging; this will later be replaced with `glog`
- update the `TODO` comments that track the logging task
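As a rough illustration of a simple logger with a `PrintMsg()` method and a convenience macro, the sketch below uses hypothetical names (`DemoLogger`, `DemoLevel`, `DEMO_LOG`) and an assumed message format; it is not the actual contents of `src/logging.h`.

```cpp
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical severity levels (assumption).
enum class DemoLevel { kInfo, kError };

class DemoLogger {
 public:
  // Writes "[LEVEL] file:line: message" followed by a newline to `out`.
  static void PrintMsg(std::ostream& out, DemoLevel level, const char* file,
                       int line, const std::string& msg) {
    out << (level == DemoLevel::kError ? "[ERROR] " : "[INFO] ") << file << ":"
        << line << ": " << msg << "\n";
  }
};

// Convenience macro that fills in the call site and logs to stderr.
#define DEMO_LOG(level, msg) \
  DemoLogger::PrintMsg(std::cerr, DemoLevel::level, __FILE__, __LINE__, (msg))
```

A call such as `DEMO_LOG(kError, "communicator init failed");` then records where the message came from without the caller passing `__FILE__` and `__LINE__` by hand, which is also the part a later switch to `glog` would take over.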
… and result validation in the allreduce example
- support `infiniAllReduce()` and its ompi backend
- add `Timer`, `Metrics`, and `Validator` in `examples/utils.h` for simple profiling and result checking
- add `infiniRedOp_t` and its internal mapping
- add two synchronization runtime aliases for NVIDIA's runtime backend
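To make the result check concrete, here is a minimal host-only sketch of what validating an AllReduce can look like: compute a reference reduction over the per-rank inputs, then count mismatches against what a rank actually received. `DemoRedOp`, `ReferenceAllReduce`, and `CountMismatches` are hypothetical names, and the sketch assumes all ranks contribute buffers of equal length; the real `Validator` and `infiniRedOp_t` may work differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reduction ops (assumption); stands in for the real op enum.
enum class DemoRedOp { kSum, kMax };

// Reference AllReduce on the host: combines the per-rank buffers element-wise.
// After a real AllReduce, every rank is expected to hold this same result.
// Assumes all buffers in `per_rank` have the same length.
inline std::vector<float> ReferenceAllReduce(
    const std::vector<std::vector<float>>& per_rank, DemoRedOp op) {
  if (per_rank.empty()) return {};
  std::vector<float> out = per_rank[0];
  for (std::size_t r = 1; r < per_rank.size(); ++r) {
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] = (op == DemoRedOp::kSum) ? out[i] + per_rank[r][i]
                                       : std::max(out[i], per_rank[r][i]);
    }
  }
  return out;
}

// Counts elements of `got` that differ from `want` by more than `tol`.
// A size mismatch is reported as every element of the longer buffer being wrong.
inline std::size_t CountMismatches(const std::vector<float>& got,
                                   const std::vector<float>& want,
                                   float tol = 1e-5f) {
  if (got.size() != want.size()) return std::max(got.size(), want.size());
  std::size_t bad = 0;
  for (std::size_t i = 0; i < got.size(); ++i) {
    if (std::fabs(got[i] - want[i]) > tol) ++bad;
  }
  return bad;
}
```

Comparing with a tolerance rather than exact equality is a deliberate choice here, since floating-point sums can differ slightly depending on the order in which a backend combines contributions.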
…/all_reduce.cc`
- add `warmup_iters` and `profile_iters` to control the number of rounds in the warmup and profiling loops
- move the body of the original main function in `examples/all_reduce.cc` into `RunAllReduceExample()`; the main function now only sets the control parameters and calls `RunAllReduceExample()`
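The warmup/profiling split can be sketched as follows. Only the parameter names `warmup_iters` and `profile_iters` come from the commit message; `DemoProfile` and `ProfileOp` are hypothetical, and the operation is passed in as a callable so the sketch stays independent of any communication backend.

```cpp
#include <chrono>
#include <functional>

// Hypothetical result record (assumption): how many rounds were timed and
// the average time per round in milliseconds.
struct DemoProfile {
  int timed_iters;
  double avg_ms;
};

// Runs `op` for `warmup_iters` untimed rounds, then times `profile_iters`
// rounds and reports the average duration of a timed round.
inline DemoProfile ProfileOp(const std::function<void()>& op, int warmup_iters,
                             int profile_iters) {
  using Clock = std::chrono::steady_clock;
  for (int i = 0; i < warmup_iters; ++i) op();  // untimed warmup rounds
  const Clock::time_point start = Clock::now();
  for (int i = 0; i < profile_iters; ++i) op();  // timed rounds
  const double total_ms =
      std::chrono::duration<double, std::milli>(Clock::now() - start).count();
  return DemoProfile{profile_iters,
                     profile_iters > 0 ? total_ms / profile_iters : 0.0};
}
```

Warmup rounds are excluded from the measurement so that one-time costs such as lazy initialization do not skew the average. For device-side operations, the callable would also need to synchronize before returning, otherwise only the launch cost is measured.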
Summary
This PR adds support for six communication operations required by the AllReduce example (i.e., `examples/all_reduce.cc`), enabling InfiniCCL to perform AllReduce using an OpenMPI backend. In addition, this PR introduces a lightweight logging module and refactors the AllReduce example to improve structure, usability, and maintainability.
Changes
- `infiniGetRank()`
- `infiniGetSize()`
- `infiniFinalize()`
- `infiniCommInitAll()`
- `infiniCommDestroy()`
- `infiniAllReduce()`.
- … `examples/`;
- … `Timer`, `Metrics`, and `Validator` in `examples/utils.h` for simple profiling and result checking;
- … `src/logging.h` for structured information and error reporting.

Known Issues & Future Work
- … `glog` in the future.
- … `mpirun has exited due to process rank <RANK#> with PID <PID#> on node <IP> exiting improperly` message. This issue has been investigated but not fully resolved. It does not appear to impact functionality at this time, but should be addressed in future work.

Logs & Screenshots