AReaL's Megatron training is much slower than FSDP #193

@spencergotowork

Description

I've been comparing the training efficiency of AReaL and verl on text generation tasks and ran into a puzzling result that I hope to get some insights on. Here's the detailed context:

Experiment Setup

  • Task: Text generation training (comparing AReaL and verl)
  • Reward Function: Simplified to always return 1, with no complex evaluation logic (see the sketch after this list)
  • Dataset: Identical dataset used for both frameworks
  • Text Length: All sequences (including prompt and response) are within 8k tokens
  • Inference Acceleration: Both use SGLang for inference speedup
  • Training Algorithm: GRPO

Observation

Under these identical settings, AReaL's training takes significantly longer than verl's.

After profiling, I found that the most time-consuming part of AReaL's model_worker is the loss calculation and model update step, specifically the code around this line in ppo_interface.py.
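The per-stage timings came from a generic wall-clock wrapper along these lines. The `timed` helper below is my own and not part of AReaL; the CUDA synchronization calls matter because asynchronous kernel launches would otherwise hide the real cost of a GPU-side step:

```python
import time
import torch

def timed(label, fn, *args, **kwargs):
    """Run fn and print its wall-clock time, synchronizing CUDA
    before and after so async kernel launches are fully accounted for."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return out

# Hypothetical usage around the suspected hotspot in model_worker:
# loss = timed("loss+update", interface.train_step, model, batch)
```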

Question

Could anyone help explain why AReaL takes longer in this scenario, particularly what causes the performance bottleneck in the loss calculation and model update stage? Are there any configuration adjustments or optimizations I might be missing for AReaL under these settings?

Thanks in advance for your insights!
