Description
I've been comparing the training efficiency of AReaL and verl on text generation tasks, and ran into a confusing result that I hope to get some insight into. Here's the detailed context:
Experiment Setup
- Task: Text generation training (comparing AReaL and verl)
- Reward Function: Simplified to always return 1 (no evaluation logic; see the sketch after this list)
- Dataset: Identical dataset used for both frameworks
- Text Length: All sequences (including prompt and response) are within 8k tokens
- Inference Acceleration: Both use SGLang for inference speedup
- Training Algorithm: GRPO
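
For reference, the reward used in both runs is essentially a no-op. A minimal sketch is below; the function name and signature are illustrative, since the actual reward interfaces of AReaL and verl take framework-specific arguments, but the point is that no per-sample evaluation logic runs:

```python
# Illustrative only: the constant reward used in both frameworks.
# Real AReaL/verl reward hooks receive framework-specific arguments;
# here the signature is simplified to show that no evaluation happens.
def constant_reward(prompt: str, response: str, **kwargs) -> float:
    return 1.0
```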
Observation
Under the above identical settings, the training duration of AReaL is significantly longer than that of verl.
After profiling, I found that the most time-consuming part of AReaL's model_worker is the loss computation and model update step, specifically the code around this line in ppo_interface.py.
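
Roughly how I measured it (a simplified sketch; the timing wrapper is mine, not part of AReaL, and `train_step` stands in for the actual loss/update call in ppo_interface.py):

```python
import time
import torch

def timed(label, fn, *args, **kwargs):
    # Synchronize so previously launched CUDA kernels are not attributed
    # to this phase, then measure wall-clock time of the call.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return out

# Hypothetical usage around the suspected hotspot:
# stats = timed("loss + model update", interface.train_step, model, batch)
```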
Question
Could anyone help explain why AReaL takes longer in this scenario, particularly what drives the bottleneck in the loss computation and model update stage? Are there any configuration adjustments or optimizations I might be missing for AReaL in this setting?
Thanks in advance for your insights!