Commit ffa1b3e

sleepcoo and zhaochenyang20 authored Mar 7, 2025
Add an example of using deepseekv3 int8 sglang. (sgl-project#4177)
Co-authored-by: zhaochenyang20 <[email protected]>
1 parent 7e3bb52 commit ffa1b3e

File tree

2 files changed: +22 −0 lines changed
 

benchmark/deepseek_v3/README.md

@@ -184,6 +184,26 @@ AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization.
python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
```
### Example: Serving with 16 A100/A800 with int8 Quantization

Both block-wise and per-channel int8 quantization are supported, and pre-quantized weights have already been uploaded to Hugging Face:

- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
```bash
# node 0 (master)
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
    HEAD_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8

# node 1
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 --dist-init-addr \
    HEAD_IP:5000 --nnodes 2 --node-rank 1 --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
```
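Once both nodes are up, the master node serves SGLang's OpenAI-compatible HTTP API (port 30000 by default, unless `--port` is set). The sketch below shows one way to query it with only the standard library; `HEAD_IP`, the port, and the helper names are placeholders and assumptions for illustration, not part of this commit:

```python
import json
import urllib.request

# Assumption: the server listens on SGLang's default port 30000 at HEAD_IP,
# the same address passed to --dist-init-addr in the launch commands above.

def build_chat_request(base_url: str, prompt: str,
                       model: str = "meituan/DeepSeek-R1-Block-INT8") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(base_url: str, prompt: str) -> str:
    """Send the request and return the first completion (requires a live server)."""
    req = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example, once both nodes have finished loading the weights:
# print(ask("http://HEAD_IP:30000", "Explain block-wise int8 quantization."))
```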
### Example: Serving on any cloud or Kubernetes with SkyPilot

SkyPilot helps find the cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1).

docs/references/deepseek.md

@@ -17,6 +17,7 @@ SGLang is recognized as one of the top engines for [DeepSeek model inference](ht
| | 4 x 8 x A100/A800 |
| **Quantized weights (AWQ)** | 8 x H100/800/20 |
| | 8 x A100/A800 |
| **Quantized weights (int8)** | 16 x A100/A800 |

<style>
.md-typeset__table {
@@ -54,6 +55,7 @@ Detailed commands for reference:
- [2 x 8 x H200](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h208-nodes)
- [4 x 8 x A100](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes)
- [8 x A100 (AWQ)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-8-a100a800-with-awq-quantization)
- [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)

### Download Weights
