Hi, I am trying to fine-tune the lerobot pi0_base model on the lerobot/aloha_sim_transfer_cube_human dataset. pi0 expects:
3 cameras (observation.images.base_0_rgb, observation.images.left_wrist_0_rgb, observation.images.right_wrist_0_rgb), each of type VISUAL with shape [224,224,3], plus observation.state and action of shape 32.
However, the lerobot/aloha_sim_transfer_cube_human dataset has only a single camera, observation.images.top, stored as video (not images) with shape [480,640,3], and its observation.state and action have shape 14.
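To make the mismatch concrete, here is a minimal sketch of the arithmetic involved in bridging the two sets of shapes. It assumes (this is an assumption, not something confirmed in this post) that the 14-dim vectors are zero-padded up to 32 and that the 480x640 frame is resized with its aspect ratio preserved and then padded out to 224x224. The helper names `pad_vector` and `resize_with_pad_geometry` are hypothetical and written for illustration only; they are not presented as lerobot functions.

```python
# Illustrative sketch only: hypothetical helpers, not the lerobot API.

def pad_vector(values, target_dim, fill=0.0):
    """Right-pad a 1-D list with `fill` until it has `target_dim` entries."""
    if len(values) > target_dim:
        raise ValueError(f"vector of length {len(values)} exceeds {target_dim}")
    return list(values) + [fill] * (target_dim - len(values))


def resize_with_pad_geometry(height, width, target_h, target_w):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize.

    new_h/new_w is the resized image size; pad_h/pad_w is the total number
    of rows/columns of padding needed to reach the target size.
    """
    scale = min(target_h / height, target_w / width)
    new_h = round(height * scale)
    new_w = round(width * scale)
    return new_h, new_w, target_h - new_h, target_w - new_w


# Shapes taken from the post: state/action 14 -> 32, image 480x640 -> 224x224.
state_14 = [0.1] * 14  # placeholder values
state_32 = pad_vector(state_14, 32)
geometry = resize_with_pad_geometry(480, 640, 224, 224)

print(len(state_32))  # 32
print(geometry)       # (168, 224, 56, 0)
```

Under these assumptions, 18 of the 32 state and action dimensions would carry no information, and the single top camera would occupy a 168x224 region of the 224x224 input. What the model receives for the two missing wrist cameras is a separate question that this sketch does not address.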
Following is the script and arguments I used to train (offline training):
Training ran to completion, but the log showed the following warnings:
WARNING 2025-10-23 10:15:57 ing_pi0.py:1046 Vision embedding key might need handling: paligemma_with_expert.paligemma.model.vision_tower.vision_model.embeddings.patch_embedding.bias
WARNING 2025-10-23 10:15:57 ing_pi0.py:1046 Vision embedding key might need handling: paligemma_with_expert.paligemma.model.vision_tower.vision_model.embeddings.patch_embedding.weight
Remapped 777 state dict keys
Warning: Could not remap state dict keys: Error(s) in loading state_dict for PI0Policy:
Missing key(s) in state_dict: "model.paligemma_with_expert.paligemma.model.language_model.embed_tokens.weight".
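One way to narrow down a message like this is to compare the key names in the checkpoint against the names the model expects. The sketch below is a hedged illustration of that idea in plain Python. The function `remap_and_diff` is hypothetical, the `model.` prefix rule is only an assumption inferred from the key names in the log above, and the key lists are tiny placeholders rather than the real checkpoint contents.

```python
# Illustrative sketch only: a hypothetical helper with placeholder key lists.

def remap_and_diff(checkpoint_keys, expected_keys, prefix="model."):
    """Add `prefix` to checkpoint keys that lack it, then diff against expected keys.

    Returns (missing, unexpected):
      missing    = expected keys that are absent after remapping
      unexpected = remapped checkpoint keys that the model does not expect
    """
    remapped = {
        key if key.startswith(prefix) else prefix + key
        for key in checkpoint_keys
    }
    expected = set(expected_keys)
    return sorted(expected - remapped), sorted(remapped - expected)


# Placeholder keys built from names that appear in the log above.
vision = "paligemma_with_expert.paligemma.model.vision_tower.vision_model.embeddings"
checkpoint_keys = [
    f"{vision}.patch_embedding.bias",
    f"{vision}.patch_embedding.weight",
]
expected_keys = [
    f"model.{vision}.patch_embedding.bias",
    f"model.{vision}.patch_embedding.weight",
    "model.paligemma_with_expert.paligemma.model.language_model.embed_tokens.weight",
]

missing, unexpected = remap_and_diff(checkpoint_keys, expected_keys)
print(missing)     # the embed_tokens key from the log
print(unexpected)  # []
```

If a key is reported missing, it is worth checking whether the checkpoint stores that tensor under another name (for example because of weight sharing) before concluding that it was left at its initial value. The sketch itself cannot tell these cases apart.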
The final loss is around 0.165. I don't know whether what I did is right or wrong; something just doesn't feel right. I have attached the full log for reference. Could the experts here advise me on any issues in my training? Thanks so much!
pi0 train log 3.txt