Commit d6d5c57

add guide for quantizing llm on gke (#1813)

* feat: add llm quantization example
* add readme file for quantize guide

1 parent 0a556a6
File tree

5 files changed: +230 −0 lines changed

- Dockerfile
- README.md
- job.yaml
- main.py
- requirements.txt
Dockerfile: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM pytorch/pytorch:2.8.0-cuda12.6-cudnn9-runtime

COPY requirements.txt ./

RUN pip install -r requirements.txt

COPY main.py ./

CMD ["python", "main.py"]
README.md: 58 additions & 0 deletions

@@ -0,0 +1,58 @@
# LLM Quantization on GKE

This document describes how to run LLM quantization on a GKE cluster.

## Prerequisites

- A GKE cluster with NVIDIA H100 GPUs (one way to create one is sketched after this list).
- `gcloud` CLI installed and configured.
- `kubectl` CLI installed and configured.
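
If you do not already have a cluster, the following is a minimal sketch of creating a GKE Autopilot cluster; the cluster name and location are placeholders. On Autopilot, the `nodeSelector` in `job.yaml` requests an H100 node on demand, so no node pool needs to be created up front.

```bash
# Minimal sketch; cluster name and location are placeholders.
gcloud container clusters create-auto llm-quantize-cluster \
    --location=us-central1
gcloud container clusters get-credentials llm-quantize-cluster \
    --location=us-central1
```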

## Steps

1. **Create an Artifact Registry repository:**

    ```bash
    export REPO_NAME=llm-quantize
    export REGION=us-central1
    gcloud artifacts repositories create $REPO_NAME --repository-format=docker --location=$REGION
    ```

2. **Build and push the Docker image:**

    ```bash
    export IMAGE_URL=${REGION}-docker.pkg.dev/$(gcloud config get-value project)/${REPO_NAME}/llm-processor-gptq
    gcloud auth configure-docker ${REGION}-docker.pkg.dev
    gcloud builds submit --tag $IMAGE_URL .
    ```
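
    If you prefer building locally to using Cloud Build, an equivalent sketch (assuming Docker is installed; the `gcloud auth configure-docker` step above sets up registry credentials):

    ```bash
    docker build -t $IMAGE_URL .
    docker push $IMAGE_URL
    ```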

3. **Set environment variables.** The default `MODEL_ID` is a gated model, so the token must belong to a Hugging Face account with access to it, and it needs write scope because `main.py` pushes the quantized model back to the Hub:

    ```bash
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export HF_TOKEN="your-hugging-face-token"
    ```

4. **Create a Kubernetes secret for the Hugging Face token:**

    ```bash
    kubectl create secret generic hf-secret --from-literal=hf_api_token=$HF_TOKEN
    ```
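
    To confirm the secret exists without printing the token, you can describe it:

    ```bash
    kubectl describe secret hf-secret
    ```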

5. **Deploy the quantization Job to GKE.** `envsubst` fills the `$IMAGE_URL` and `$MODEL_ID` placeholders in `job.yaml` from your shell environment before the manifest is applied:

    ```bash
    envsubst < job.yaml | kubectl apply -f -
    ```
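
    To preview the rendered manifest before applying it, a client-side dry run works:

    ```bash
    envsubst < job.yaml | kubectl apply --dry-run=client -f -
    ```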

6. **Monitor the Job:**

    ```bash
    kubectl get pods -w
    ```
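
    If the pod stays `Pending` while GKE provisions an H100 node, its scheduling events explain why:

    ```bash
    kubectl describe pod -l job-name=quantize
    ```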

7. **View the logs:**

    ```bash
    kubectl logs -f -l job-name=quantize
    ```
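
Once the Job completes, `main.py` pushes the compressed model to your Hugging Face account as `<model-name>-W4A16-G128`. A minimal sketch of serving it with vLLM (not part of this guide; assumes vLLM is installed and `your-username` is a placeholder for your Hugging Face namespace):

```python
# Minimal sketch: serve the quantized model with vLLM.
# "your-username" is a placeholder for your Hugging Face namespace.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/Meta-Llama-3-8B-Instruct-W4A16-G128")
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```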
job.yaml: 52 additions & 0 deletions

@@ -0,0 +1,52 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: quantize
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      containers:
      - name: llm-compressor
        image: $IMAGE_URL
        command: ["python", "main.py"]
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "12"
            memory: "80Gi"
            ephemeral-storage: "80Gi"
        env:
        - name: LD_LIBRARY_PATH
          value: ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
        - name: MODEL_ID
          value: $MODEL_ID
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      restartPolicy: Never
main.py: 96 additions & 0 deletions

@@ -0,0 +1,96 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
model_id = os.environ["MODEL_ID"]
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# * quantize the weights to 4-bit with GPTQ, using a group size of 128
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

model.push_to_hub(SAVE_DIR)
tokenizer.push_to_hub(SAVE_DIR)
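
A minimal sketch, not part of the commit, of reloading the compressed checkpoint that this script saves (assumes the `compressed-tensors` package is installed, which `llmcompressor` pulls in, and that `MODEL_ID` was left at its default):

```python
# Minimal sketch: reload the W4A16 checkpoint saved by main.py.
# The directory name matches the default MODEL_ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

save_dir = "Meta-Llama-3-8B-Instruct-W4A16-G128"
model = AutoModelForCausalLM.from_pretrained(save_dir, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```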
requirements.txt: 1 addition & 0 deletions

@@ -0,0 +1 @@
llmcompressor==0.8.1
