Refactor eagle speculative decoding #3986

Open · Ying1123 wants to merge 1 commit into main from ying-eagle
Conversation

@Ying1123 (Member) commented Mar 2, 2025

No description provided.

@Ying1123 Ying1123 marked this pull request as draft March 2, 2025 02:56
@Ying1123 Ying1123 force-pushed the ying-eagle branch 5 times, most recently from e80b2de to 9c28d33 on March 2, 2025 10:20
@Ying1123 Ying1123 marked this pull request as ready for review March 3, 2025 01:50
@Ying1123 Ying1123 requested a review from HaiShaw as a code owner March 3, 2025 01:50
@zhyncs (Member) commented Mar 3, 2025

@Ying1123 Could you help resolve the conflicts? Thanks!

@Ying1123 Ying1123 force-pushed the ying-eagle branch 3 times, most recently from 5e9c58a to 172c25f on March 3, 2025 05:46
```diff
@@ -151,7 +156,7 @@ def get_attention_tp_cpu_group(self):
     def get_memory_pool(self):
         return (
             self.model_runner.req_to_token_pool,
-            self.model_runner.token_to_kv_pool,
+            self.model_runner.token_to_kv_pool_allocator,
```
A Collaborator commented on this change:

I think this is a bit confusing style-wise. If we define the allocator to be in charge of the layout-agnostic memory operations, and define a separate memory-pool class for the underlying layouts, then we should use the allocator consistently in the scheduler and touch the memory pool only in lower-level code.
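
To illustrate the separation this comment suggests, here is a minimal hypothetical sketch; the class and method names are illustrative only and are not the actual SGLang classes:

```python
# Hypothetical sketch of the allocator/pool split suggested above.
# Names and signatures are illustrative, NOT the actual SGLang API.
import torch


class TokenToKVPool:
    """Owns the physical KV layout; used only by low-level model code."""

    def __init__(self, size: int, head_dim: int):
        self.kv_data = torch.zeros(size, head_dim)


class TokenToKVPoolAllocator:
    """Layout-agnostic free-slot management; the only interface the
    scheduler sees."""

    def __init__(self, pool: TokenToKVPool):
        self._pool = pool
        self._free_slots = list(range(pool.kv_data.shape[0]))

    def alloc(self, n: int) -> list[int]:
        # Hand out n free slot indices; the scheduler never touches
        # the underlying tensor layout.
        if n > len(self._free_slots):
            raise RuntimeError("KV cache out of memory")
        slots, self._free_slots = self._free_slots[:n], self._free_slots[n:]
        return slots

    def free(self, slots: list[int]) -> None:
        self._free_slots.extend(slots)
```

Under this convention, the scheduler would hold only the allocator, and passing the raw pool upward (as the diff above does for the allocator's sibling) would be confined to the model runner.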

@Ying1123 Ying1123 force-pushed the ying-eagle branch 3 times, most recently from d2686e5 to a574770 on March 3, 2025 08:58
@mpjlu (Contributor) commented Mar 3, 2025

commit a574770: there is an illegal memory access bug when running DeepSeek:

```
  File "/data/peng/sglang/python/sglang/srt/managers/scheduler.py", line 1218, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
  File "/data/peng/sglang/python/sglang/srt/speculative/eagle_worker.py", line 189, in forward_batch_speculative_generation
    spec_info, to_free_cache_loc = self.draft(batch)
  File "/data/peng/sglang/python/sglang/srt/speculative/eagle_worker.py", line 244, in draft
    assign_draft_cache_locs[(num_seqs,)](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
```
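
For context on this failure mode, a minimal hypothetical sketch follows; it is NOT the actual assign_draft_cache_locs kernel. A CUDA "illegal memory access" from a Triton kernel typically means a load or store went through an out-of-range index, for example an unmasked tail block:

```python
# Hypothetical Triton kernel illustrating the failure mode above; this is
# NOT sglang's assign_draft_cache_locs kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def copy_locs(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # Without this mask, the last block reads/writes past n_elements and
    # triggers "an illegal memory access was encountered".
    mask = offs < n_elements
    vals = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, vals, mask=mask)


src = torch.arange(1000, device="cuda", dtype=torch.int64)
dst = torch.empty_like(src)
copy_locs[(triton.cdiv(src.numel(), 256),)](src, dst, src.numel(), BLOCK=256)
```

Note that a correctly masked kernel can still crash this way if the indices it is given are themselves out of range (e.g. cache locations beyond the pool size), which is the kind of bug a refactor of the memory pool/allocator boundary could introduce.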

@Ying1123 (Member, Author) commented Mar 3, 2025

> commit a574770: there is an illegal memory access bug when running DeepSeek: […]

@mpjlu Thanks for reporting. Could you provide the reproducible command?
@ispobock @zhyncs Could you also help take a look?

@mpjlu (Contributor) commented Mar 3, 2025

> @mpjlu Thanks for reporting. Could you provide the reproducible command? @ispobock @zhyncs Could you also help take a look?
The following command reproduces the error:
```bash
# With --nnodes 2, a matching command with --node-rank 1 must also be
# launched on the second node.
python3 -m sglang.launch_server \
    --model-path $model_path \
    --tp $tp_size \
    --dist-init-addr 29.224.56.106:5000 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --mem-fraction-static 0.6 \
    --max-running-requests 64 \
    --speculative-draft-model-path $draft_path \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 2 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 4 \
    --disable-cuda-graph
```

test.py:

```python
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
start = time.time()
for i in range(100):
    response = client.chat.completions.create(
        model="default",
        messages=[
            # Prompt (Chinese): "Please write a 1000-word essay on the
            # theme of integrity."
            {"role": "user", "content": "请以诚信为主题写一篇1000字作文?"},
        ],
        temperature=0.6,
        max_tokens=1000,
        extra_body={"top_p": 0.6, "top_k": 50},
    )
    print(response)
print("dur=", time.time() - start)
```

@zhyncs mentioned this pull request Mar 3, 2025
@ispobock (Collaborator) commented Mar 4, 2025

I verified e19e733 on 8*H200 for the DeepSeek-V3 model with nextn enabled, and it works fine.

@mpjlu I cannot reproduce your error in my environment. I am not sure whether it is specific to the multi-node setting, since I tried the same args as your command but on one node.
