Checklist
- I've prepended issue tag with type of change: [bug]
- (If applicable) I've attached the script to reproduce the bug
- (If applicable) I've documented below the DLC image/dockerfile this relates to
- (If applicable) I've documented below the tests I've run on the DLC image
- I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
- I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description:
Create a SageMaker model using the container 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-tensorrtllm0.21.0-cu128 with the environment variables:
HF_MODEL_ID: Qwen/Qwen3-0.6B
OPTION_ENGINE: MPI
OPTION_TRUST_REMOTE_CODE: true
Deploying the endpoint on ml.g6.48xlarge fails with:
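To make the setup reproducible, here is a minimal sketch of the CreateModel request payload as it could be passed to boto3's SageMaker client. The image URI and environment variables are taken from this report; the model name and role ARN are placeholders, and the endpoint-config/instance-type step is only indicated in a comment.

```python
# Sketch of the SageMaker CreateModel payload used to hit the error.
# Placeholders: model_name, role_arn. Image and env vars are from the report.

IMAGE_URI = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "djl-inference:0.33.0-tensorrtllm0.21.0-cu128"
)

def build_create_model_request(model_name: str, role_arn: str) -> dict:
    """Build kwargs for sagemaker_client.create_model(**kwargs)."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": IMAGE_URI,
            "Environment": {
                "HF_MODEL_ID": "Qwen/Qwen3-0.6B",
                "OPTION_ENGINE": "MPI",
                "OPTION_TRUST_REMOTE_CODE": "true",
            },
        },
    }

# The endpoint is then created from this model with an endpoint config
# whose ProductionVariants request InstanceType "ml.g6.48xlarge".
```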
ImportError: /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so: undefined symbol: cuKernelGetName
This might be related to the CUDA driver version on the host. If so, it is hard to verify, because the container does not log the driver version it is running against.
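The missing symbol cuKernelGetName appears to be a CUDA 12 driver-API entry point, which would be consistent with the host driver being older than what the cu128 build of TensorRT-LLM expects. A quick way to check from inside the container (a sketch, assuming libcuda.so.1 is present, i.e. a GPU host) is to call cuDriverGetVersion, which encodes the version as 1000*major + 10*minor:

```python
import ctypes

def decode_driver_version(raw: int) -> tuple:
    """Decode cuDriverGetVersion's integer encoding, e.g. 12080 -> (12, 8)."""
    return (raw // 1000, (raw % 1000) // 10)

def get_driver_version() -> tuple:
    """Query the installed CUDA driver; raises OSError if libcuda is absent."""
    libcuda = ctypes.CDLL("libcuda.so.1")
    version = ctypes.c_int()
    libcuda.cuDriverGetVersion(ctypes.byref(version))
    return decode_driver_version(version.value)
```

If get_driver_version() reports something below (12, 8) on the g6 host, that would support the driver-mismatch theory.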
DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-tensorrtllm0.21.0-cu128
Current behavior:
Deploying Qwen3-0.6B fails with the ImportError above
Expected behavior:
Successfully deploy Qwen3-0.6B on ml.g6.48xlarge
Additional context:
Might be related to the CUDA driver version on the host