SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
SpecEdge is an edge-assisted LLM inference framework that leverages consumer-grade GPUs for cost-effective serving at scale. By splitting workloads between edge and server using speculative decoding, SpecEdge achieves 2.22x server throughput and 11.24% lower latency compared to server-only baselines. The framework features proactive edge drafting and pipeline-aware scheduling to maximize resource utilization and serving efficiency.
Our experiments were conducted with the following hardware:
- GCP a2-highgpu-1g (1x NVIDIA A100 40GB)
  - Ubuntu 24.04 LTS, NVIDIA driver version 580
- GCP a2-ultragpu-1g (1x NVIDIA A100 80GB)
  - Ubuntu 24.04 LTS, NVIDIA driver version 580
- Ubuntu 20.04.6 LTS (Focal Fossa)
  - Two Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
  - 208 GiB memory
  - 10GbE network interface
git clone https://github.com/kaist-ina/specedge
cd specedge
uv sync

Before running, ensure that SSH communication is established between the node executing the client_host.sh script and other edge nodes. Additionally, the SpecEdge repository must be cloned to identical absolute paths across all edge nodes (e.g., all nodes should have the repository at /home/user/specedge).
# server
./script/batch_server.sh -f config/specedge.example.yaml
# edge
./script/client_host.sh -f config/specedge.example.yaml

The specedge.example.yaml configuration file contains the following settings:
Base Settings:
- result_path: Base directory where experiment results are saved
- exp_name: Experiment name for folder name and identification
- dtype: Model precision (fp16/fp32)
- seed: Random seed for reproducibility
- ssh_key: SSH key path for remote server access
- max_len: Maximum sequence length (max KV cache length)
Server Settings:
- process_name: Server process identifier for logging
- target_model: HuggingFace model path (e.g., Qwen/Qwen3-14B)
- device: CUDA device identifier for the server target model (e.g., cuda:0)
- temperature: Sampling temperature for generation
- max_batch_size: Maximum batch size for concurrent requests
- num_clients: Expected number of concurrent edge clients, must match the number of edge clients
- batch_type: Batching strategy (static/dynamic)
- cache_prefill: Enable prefill KV cache preloading
  - true: Pre-compute and cache all dataset prompts at server startup for benchmark experiments
  - false: Perform prefill at runtime using client-provided prompts
Client Settings:
- host: Server endpoint (e.g., 127.0.0.1:8000)
- process_name: Client process identifier for logging
- draft_model: HuggingFace model path for draft generation (e.g., Qwen/Qwen3-1.7B)
- dataset: Benchmark dataset name (c4, mtbench, oasst, wikitext, or specbench)
- sample_req_cnt: Sampling frequency of requests from the dataset
- reasoning: Enable reasoning
- req_offset: Offset for request sampling
- max_n_beams, max_beam_len, max_branch_width, max_budget: Speculative decoding parameters
- proactive: Proactive edge drafting configuration
  - type: Proactive drafting mode (excluded/included)
  - max_n_beams, max_beam_len, max_branch_width, max_budget: Proactive drafting parameters
- max_new_tokens: Maximum tokens to generate per request
- max_request_num: Total requests to process (-1 for all)
Node Settings:
- node-name: Name of each edge node for SSH access (must match the SSH hostname configured in your SSH config or be a resolvable hostname)
- device: CUDA device identifier for the edge process on this node (e.g., cuda:0, cuda:1)
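For orientation, here is a minimal sketch of how these settings could be laid out. The top-level grouping (server/client/nodes), the node names, and every value shown are illustrative assumptions, not taken from the shipped file; treat config/specedge.example.yaml in the repository as the authoritative reference.

```yaml
# Illustrative sketch only; the shipped config/specedge.example.yaml is authoritative.
result_path: result/demo
exp_name: specedge
dtype: fp16
seed: 42
ssh_key: ~/.ssh/id_rsa
max_len: 2048

server:
  process_name: server
  target_model: Qwen/Qwen3-14B
  device: cuda:0
  temperature: 0.0
  max_batch_size: 8
  num_clients: 2            # must match the number of edge clients defined below
  batch_type: static
  cache_prefill: true

client:
  host: "127.0.0.1:8000"
  process_name: client
  draft_model: Qwen/Qwen3-1.7B
  dataset: mtbench
  sample_req_cnt: 80
  reasoning: false
  req_offset: 0
  max_n_beams: 16
  max_beam_len: 4
  max_branch_width: 8
  max_budget: 64
  proactive:
    type: included
    max_n_beams: 16
    max_beam_len: 2
    max_branch_width: 8
    max_budget: 32
  max_new_tokens: 256
  max_request_num: -1

nodes:
  edge-node-1:              # hypothetical SSH-resolvable hostname
    device: cuda:0
  edge-node-2:
    device: cuda:1
```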
Before running the metric script, you need to collect the JSONL files from both the server and edge into a single location.
. .venv/bin/activate
python src/metric/specedge.py -d result/demo/specedge --gpu "A100-40" # A100 40GB
python src/metric/specedge.py -d result/demo/specedge --gpu "A100-80" # A100 80GB

To run the auto_batch experiment, use:

./script/auto_batch.sh -f config/auto_batch.example.yaml

The auto_batch.example.yaml configuration file contains the following settings:
Base Settings:
- result_path: Base directory where experiment results are saved
- exp_name: Experiment name for folder name and identification
- seed: Random seed for reproducibility
- model: HuggingFace model path (e.g., Qwen/Qwen3-14B)
- device: CUDA device identifier for the model (e.g., cuda:0)
- dtype: Model precision (fp16/fp32)
- temperature: Sampling temperature for generation
- dataset: Benchmark dataset name (c4, mtbench, oasst, wikitext, or specbench)
- batch_size: Batch size for concurrent request processing
- max_len: Maximum sequence length (max KV cache length)
- max_new_tokens: Maximum tokens to generate per request
- max_request_num: Total requests to process (-1 for all)
- sample_req_cnt: Number of sample requests from dataset
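As a rough illustration only (all values below are placeholders, not tuned settings, and the flat layout is an assumption; see config/auto_batch.example.yaml for the shipped file), the configuration might look like:

```yaml
# Illustrative sketch only; see config/auto_batch.example.yaml for the shipped example.
result_path: result/demo
exp_name: auto_batch
seed: 42
model: Qwen/Qwen3-14B
device: cuda:0
dtype: fp16
temperature: 0.0
dataset: mtbench
batch_size: 8
max_len: 2048
max_new_tokens: 256
max_request_num: -1
sample_req_cnt: 80
```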
. .venv/bin/activate
python src/metric/auto_batch.py -d result/demo/auto_batch --gpu "A100-40" # A100 40GB
python src/metric/auto_batch.py -d result/demo/auto_batch --gpu "A100-80" # A100 80GB

To run the server_only experiment, use:

./script/server_only.sh -f config/server_only.example.yaml

The server_only.example.yaml configuration file contains the following settings:
Base Settings:
- result_path: Base directory where experiment results are saved
- exp_name: Experiment name for folder name and identification
- dtype: Model precision (fp16/fp32)
- seed: Random seed for reproducibility
- ssh_key: SSH key path for remote server access (not required for server_only)
- max_len: Maximum sequence length (max KV cache length)
Server Settings:
- process_name: Server process identifier for logging
- target_model: HuggingFace model path (e.g., Qwen/Qwen3-14B)
- device: CUDA device identifier for the server target model (e.g., cuda:0)
- temperature: Sampling temperature for generation
- num_clients: Expected number of concurrent clients
Client Settings:
- host: Server endpoint (not required for server_only)
- process_name: Client process identifier for logging
- draft_model: HuggingFace model path (e.g., Qwen/Qwen3-1.7B)
- dataset: Benchmark dataset name (c4, mtbench, oasst, wikitext, or specbench)
- max_n_beams, max_beam_len, max_branch_width, max_budget: Speculative decoding parameters
- max_batch_size: Maximum batch size for requests
- max_new_tokens: Maximum tokens to generate per request
- max_request_num: Total requests to process (-1 for all)
- sample_req_cnt: Number of sample requests from dataset
- device: CUDA device identifier for the client draft model (e.g., cuda:0)
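Purely as an illustration (the server/client grouping and all values are assumptions; consult config/server_only.example.yaml for the authoritative layout), the file might resemble:

```yaml
# Illustrative sketch only; see config/server_only.example.yaml for the shipped example.
result_path: result/demo
exp_name: server_only
dtype: fp16
seed: 42
max_len: 2048

server:
  process_name: server
  target_model: Qwen/Qwen3-14B
  device: cuda:0
  temperature: 0.0
  num_clients: 1

client:
  process_name: client
  draft_model: Qwen/Qwen3-1.7B
  dataset: mtbench
  device: cuda:0
  max_batch_size: 8
  max_n_beams: 16
  max_beam_len: 4
  max_branch_width: 8
  max_budget: 64
  max_new_tokens: 256
  max_request_num: -1
  sample_req_cnt: 80
```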
. .venv/bin/activate
python src/metric/server_only.py -d result/demo/server_only --gpu "A100-40" # A100 40GB
python src/metric/server_only.py -d result/demo/server_only --gpu "A100-80" # A100 80GB

SpecEdge uses a tree-based speculative decoding approach to generate multiple candidate tokens efficiently. The following parameters control the structure and size of the speculation tree:
- max_n_beams: Maximum number of nodes that can be forwarded through the model in a single iteration. This parameter limits how many tree nodes are selected as candidates for the forward pass. Even if more nodes could potentially be forwarded based on their probabilities, only the top max_n_beams nodes are processed to control computational cost.
- max_beam_len: Maximum depth of the speculation tree, representing how many sequential token generation steps are performed during one draft phase. The tree grows iteratively for max_beam_len steps, with each step expanding the tree by forwarding selected candidate nodes.
- max_branch_width: Maximum number of child tokens generated from each parent node in the tree. When processing logits from a forward pass, up to max_branch_width tokens with the highest probabilities are selected as potential continuations from each node.
- max_budget: Maximum total number of nodes allowed in the speculation tree (excluding the prefix). This parameter controls the overall tree size by pruning nodes with lower probabilities. After tree construction, if the number of nodes exceeds max_budget, only the top max_budget nodes based on cumulative log probabilities are retained. The budget mechanism uses a priority-based filtering approach with a decay factor (0.9) applied to parent node scores.
This speculative decoding approach is based on the SpecExec algorithm. For more details, please refer to the SpecExec paper.
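To make the interplay between these four parameters concrete, here is an annotated example block; the numbers are illustrative assumptions, not recommended settings:

```yaml
# Illustrative values only; tune per model pair and workload.
max_n_beams: 16       # at most 16 tree nodes are forwarded through the draft model per step
max_beam_len: 4       # the tree is grown for 4 sequential drafting steps
max_branch_width: 8   # each forwarded node contributes up to 8 candidate child tokens
max_budget: 64        # after drafting, only the 64 highest-scoring nodes are retained
# Upper bound before budget pruning: up to max_n_beams * max_branch_width new nodes per step
# over max_beam_len steps (16 * 8 * 4 = 512 here), so max_budget is what effectively caps
# the size of the tree that is sent for verification.
```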
@inproceedings{park2025specedge,
author = {Jinwoo Park and Seunggeun Cho and Dongsu Han},
title = {SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs},
booktitle = {Annual Conference on Neural Information Processing Systems},
year = {2025},
eprint = {2505.17052},
archivePrefix = {arXiv},
primaryClass= {cs.CL}
}

