Out of memory when fine-tuning on the ALOHA dataset #450
Unanswered
HERBERT-WH asked this question in Q&A
Replies: 0 comments
This is my config:
`TrainConfig(
    name="1",
    model=pi0.Pi0Config(
        paligemma_variant="gemma_2b_lora",
        # action_dim = 14,
        action_expert_variant="gemma_300m_lora",
    ),
    data=LeRobotAlohaDataConfig(
        repo_id="physical-intelligence/aloah_pen",
        assets=AssetsConfig(
            assets_dir="/home/xsuper/.cache/openpi/local_dir/assets",
            # assets_dir="s3://openpi-assets/checkpoints/pi0_base/assets",
            # asset_id="trossen_mobile",
            asset_id="trossen",
        ),
        default_prompt="uncap the pen",
        repack_transforms=_transforms.Group(
            inputs=[
                _transforms.RepackTransform(
                    {
                        "images": {
                            "cam_high": "observation.images.cam_high",
                            "cam_left_wrist": "observation.images.cam_left_wrist",
                            "cam_right_wrist": "observation.images.cam_right_wrist",
                        },
                        "state": "observation.state",
                        "actions": "action",
                    }
                )
            ]
        ),
        base_config=DataConfig(
            local_files_only=True,  # Set to True for local-only datasets.
        ),
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("/home/xsuper/.cache/openpi/local_dir/params"),
    # weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
)
`
And this is my error:
`2025-04-24 21:13:29.885224: W external/xla/xla/tsl/framework/bfc_allocator.cc:501] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.96GiB (rounded to 2106589184) requested by op
2025-04-24 21:13:29.885289: W external/xla/xla/tsl/framework/bfc_allocator.cc:512] *************************************************************************************************___
E0424 21:13:29.885320 1335894 pjrt_stream_executor_client.cc:3045] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2106589184 bytes. [tf-allocator-allocation-error='']
Traceback (most recent call last):
  File "/home/xsuper/HJW/openpi/scripts/train.py", line 275, in <module>
    main(_config.cli())
  File "/home/xsuper/HJW/openpi/scripts/train.py", line 231, in main
    train_state, train_state_sharding = init_train_state(config, init_rng, mesh, resume=resuming)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xsuper/HJW/openpi/.venv/lib/python3.11/site-packages/jaxtyping/_decorator.py", line 559, in wrapped_fn
    return wrapped_fn_impl(args, kwargs, bound, memos)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xsuper/HJW/openpi/.venv/lib/python3.11/site-packages/jaxtyping/_decorator.py", line 483, in wrapped_fn_impl
    out = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/xsuper/HJW/openpi/scripts/train.py", line 127, in init_train_state
    train_state = jax.jit(
                  ^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2106589184 bytes.
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
`
On ARM: 48G
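To put the numbers in the log in context, here is a minimal sketch that is not part of the original post. It converts the failed request to GiB and estimates how large JAX's preallocated GPU pool would be, assuming the 48G above is the memory visible to the GPU allocator and assuming JAX's documented default of preallocating about 75% of device memory still applies to the installed version. `XLA_PYTHON_CLIENT_MEM_FRACTION` is a real JAX environment variable; the value 0.9 and the helper names `bytes_to_gib` and `pool_gib` are illustrative choices.

```python
import os

GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count from the allocator log into GiB."""
    return n_bytes / GIB


def pool_gib(total_gib: float, fraction: float) -> float:
    """Estimate the size of the preallocated pool for a given fraction."""
    return total_gib * fraction


# Assumption: raise the preallocation fraction from the default to 0.9.
# This only takes effect if it is set before `import jax` runs.
os.environ.setdefault("XLA_PYTHON_CLIENT_MEM_FRACTION", "0.9")

failed_request = 2106589184  # bytes, taken from the error message above
total_memory_gib = 48        # assumption: the "48G" refers to device memory

print(f"failed request: {bytes_to_gib(failed_request):.2f} GiB")
print(f"pool at 0.75:   {pool_gib(total_memory_gib, 0.75):.1f} GiB")
print(f"pool at 0.90:   {pool_gib(total_memory_gib, 0.90):.1f} GiB")
```

The same variable can be exported in the shell before launching `scripts/train.py`. Whether the extra headroom is enough depends on how much memory the train state itself needs, which this sketch does not estimate.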