@@ -51,7 +51,10 @@ spec:
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /data
Member
@alvarobartt The Falcon model has this error. Tried a couple of different versions > 2; all of them failed.

INFO 2025-01-16T23:04:59.120810571Z [resource.labels.containerName: llm] 2025-01-16T23:04:59.120688Z ERROR text_generation_launcher: Error when initializing model
INFO 2025-01-16T23:04:59.120839184Z [resource.labels.containerName: llm] Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
INFO 2025-01-16T23:04:59.120940322Z [resource.labels.containerName: llm] > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
INFO 2025-01-16T23:04:59.120942462Z [resource.labels.containerName: llm] model = get_model(
INFO 2025-01-16T23:04:59.120944599Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 714, in get_model
INFO 2025-01-16T23:04:59.120946477Z [resource.labels.containerName: llm] return FlashRWSharded(
INFO 2025-01-16T23:04:59.120948410Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 77, in __init__
INFO 2025-01-16T23:04:59.120950825Z [resource.labels.containerName: llm] model = FlashRWForCausalLM(config, weights)
INFO 2025-01-16T23:04:59.120953217Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 659, in __init__
INFO 2025-01-16T23:04:59.120955289Z [resource.labels.containerName: llm] self.transformer = FlashRWModel(config, weights)
INFO 2025-01-16T23:04:59.120957337Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 593, in __init__
INFO 2025-01-16T23:04:59.120959345Z [resource.labels.containerName: llm] [
INFO 2025-01-16T23:04:59.120961371Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 594, in <listcomp>
INFO 2025-01-16T23:04:59.120963316Z [resource.labels.containerName: llm] FlashRWLargeLayer(layer_id, config, weights)
INFO 2025-01-16T23:04:59.120965300Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 525, in __init__
INFO 2025-01-16T23:04:59.120967342Z [resource.labels.containerName: llm] self.ln_layer = FlashRWLayerNorm(config, prefix, weights)
INFO 2025-01-16T23:04:59.120969434Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 484, in __init__
INFO 2025-01-16T23:04:59.120971427Z [resource.labels.containerName: llm] self.num_ln = config.num_ln_in_parallel_attn
INFO 2025-01-16T23:04:59.120973477Z [resource.labels.containerName: llm] File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 264, in __getattribute__
INFO 2025-01-16T23:04:59.120975858Z [resource.labels.containerName: llm] return super().__getattribute__(key)
INFO 2025-01-16T23:04:59.120977917Z [resource.labels.containerName: llm] AttributeError: 'RWConfig' object has no attribute 'num_ln_in_parallel_attn'
INFO 2025-01-16T23:04:59.160341634Z [resource.labels.containerName: llm] 2025-01-16T23:04:59.160221Z ERROR text_generation_launcher: Error when initializing model

Member
Found an issue, but it was resolved > 6 months ago:
huggingface/text-generation-inference#2349

# mountPath is set to /tmp as it's the path where the HF_HOME environment
# variable points to i.e. where the downloaded model from the Hub will be
# stored
- mountPath: /tmp
name: ephemeral-volume
volumes:
- name: dshm
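For context, the change in this hunk only moves the ephemeral scratch volume from /data to /tmp so that it backs the Hugging Face Hub cache. A minimal, hypothetical sketch of how the pieces fit together after the edit follows; the container name llm is taken from the logs in this thread, and the explicit HF_HOME entry and emptyDir definitions are assumptions, not the exact manifests touched by this PR:

# Hypothetical excerpt, not the exact manifest in this PR; it only illustrates
# why the ephemeral volume is mounted at /tmp (the path HF_HOME points to).
spec:
  containers:
    - name: llm                  # container name taken from the logs above (assumption)
      env:
        - name: HF_HOME
          value: /tmp            # models downloaded from the Hub are cached here
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /tmp        # was /data before this change
          name: ephemeral-volume
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory           # assumption: memory-backed shared memory
    - name: ephemeral-volume
      emptyDir: {}               # assumption: any ephemeral/scratch volume works here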
@@ -56,7 +56,10 @@ spec:
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /data
# mountPath is set to /tmp as it's the path where the HF_HOME environment
# variable points to i.e. where the downloaded model from the Hub will be
# stored
- mountPath: /tmp
name: ephemeral-volume
volumes:
- name: dshm
Member
@raushan2016, Jan 8, 2025

There is still some issue in this sample. It fails with:

"Traceback (most recent call last):
  File ""/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py"", line 1269, in warmup
    _, batch, _ = self.generate_token(batch)
  File ""/opt/conda/lib/python3.11/contextlib.py"", line 81, in inner
    return func(*args, **kwds)
  File ""/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py"", line 1730, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 514.00 MiB. GPU 0 has a total capacity of 21.96 GiB of which 393.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 20.87 GiB is allocated by PyTorch, and 382.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
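The warmup pass above runs out of memory on a GPU with roughly 22 GiB total, and the PyTorch message itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A hedged sketch of how such mitigations could be wired into the llm container follows; the TGI sizing variables are assumptions based on the TGI launcher generally accepting its flags as environment variables, so verify the exact names and values against the TGI version shipped in the image:

# Hypothetical mitigation sketch for the warmup OOM above, not a verified fix:
# relax the CUDA allocator and shrink TGI's warmup/prefill footprint, then retest.
spec:
  containers:
    - name: llm
      env:
        # Suggested by the PyTorch error message itself to reduce fragmentation.
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: expandable_segments:True
        # Assumed TGI launcher options set via env vars; verify the names for this TGI version.
        - name: MAX_BATCH_PREFILL_TOKENS
          value: "2048"
        - name: MAX_INPUT_TOKENS
          value: "2048"
        - name: MAX_TOTAL_TOKENS
          value: "4096"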


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I even tried with g2-standard-48 and it still keeps crashing. Can you please run some tests to validate?

Member
For all the logs, feel free to request access, but the error above should be enough to find the issue and the difference between the default TGI image and the TGI DLC image, as the only differences are the image and the mount path:
https://docs.google.com/spreadsheets/d/1hKZP9X2ueP-Zvnb9zIXMLk6LeGXvfr-mGR6z-NGfA3s/edit?gid=1556804789#gid=1556804789

Member
@raushan2016, Jan 9, 2025

It worked with a2-highgpu-2g, which has 40 GB (A100) GPU memory. That means there is something not right with the DLC image that breaks LLAMA3-70B running on the L4 GPU.

Contributor Author
Thanks for the detailed report @raushan2016, let me run some tests on our end to investigate and I'll ping you as soon as those are completed!

Contributor Author
Sure sounds fair, I'll try to investigate the issue today! Thanks for your time!

Member
@alvarobartt Somehow things are working now; maybe the issue was fixed in the image, since the image tag points to latest?

Contributor Author
Thanks for the confirmation, but yes, that is odd indeed, because AFAIK us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 was released some time ago (not sure if they rolled an update over that image, but it doesn't look like it, as per https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face).

Member
Sorry, to clarify: for LLAMA3 this image works: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310

But the current image us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 doesn't work.

So it's something related to the changed TGI or CUDA version. It would really be helpful if you could look into this.
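One way to unblock while the cu124/TGI image is investigated would be to pin the image that was reported working. A minimal sketch, assuming the container is named llm as in the logs above:

# Hypothetical: pin the TGI DLC image reported working for LLAMA3 until the
# regression in the cu124 image is understood.
spec:
  containers:
    - name: llm
      image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310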

Member
Here is the PR with the right images, which are working:
#1591

@@ -58,7 +58,10 @@ spec:
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /data
# mountPath is set to /tmp as it's the path where the HF_HOME environment
# variable points to i.e. where the downloaded model from the Hub will be
# stored
- mountPath: /tmp
name: ephemeral-volume
volumes:
- name: dshm
@@ -56,7 +56,10 @@ spec:
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /data
# mountPath is set to /tmp as it's the path where the HF_HOME environment
# variable points to i.e. where the downloaded model from the Hub will be
# stored
- mountPath: /tmp
name: ephemeral-volume
volumes:
- name: dshm