
Ray bf16 availability: the check does not happen in the gpu worker, so it always says bf16 is not available #3179

@denadai2

Description


Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

This check is always false in settings where a Ray cluster has a GPU node: https://github.com/axolotl-ai-cloud/axolotl/blob/b3b92687c4ba8792d343b6b1a616f541840db8b3/src/axolotl/cli/config.py#L222C21-L222C48. Why? Because the transformers capability check is executed on the driver, not on a GPU worker, so it never sees the GPU and always reports that bf16 is unavailable.

This check should instead run inside the RayTrainer, on the GPU workers.
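A minimal sketch of the fix's shape: defer the capability probe so it executes wherever the GPUs actually live, instead of hard-coding a driver-side call. The names here (`resolve_bf16`, `probe`) are illustrative, not axolotl's actual API.

```python
# Hypothetical sketch: the validator takes a `probe` callable that runs on
# the training hardware, rather than checking CUDA on the driver directly.
from typing import Callable


def resolve_bf16(requested: bool, probe: Callable[[], bool]) -> bool:
    """Honor a bf16 request only if the hardware-side probe confirms support.

    `probe` is expected to execute where the GPUs are visible -- on a plain
    single-node run that can be torch.cuda.is_bf16_supported itself, but
    under Ray it should be a GPU-scheduled remote task.
    """
    if not requested:
        return False
    return probe()


# Under Ray, the probe would look roughly like (assumption, untested):
#
#     @ray.remote(num_gpus=1)
#     def _probe() -> bool:
#         import torch
#         return torch.cuda.is_bf16_supported()
#
#     probe = lambda: ray.get(_probe.remote())
```

The point of the indirection is that validation no longer assumes the process running `validate_config` has a GPU attached, which is exactly the assumption a Ray driver violates.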

Current behaviour

CUDA for now.

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 121, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 63, in do_cli
    parsed_cfg = load_cfg(config, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/cli/config.py", line 219, in load_cfg
    cfg = validate_config(
          ^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/utils/config/__init__.py", line 295, in validate_config
    AxolotlConfigWCapabilities(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for AxolotlConfigWCapabilities
  Value error, bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above. [type=value_error, input_value={'base_model': '/mnt/prol...r_prefetch_factor': 256}, input_type=dict]
```

Steps to reproduce

See above.

Config yaml

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

master

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
