Refactor eagle speculative decoding #3986

Open · Ying1123 wants to merge 1 commit into main from ying-eagle
Conversation

@Ying1123 (Member) commented Mar 2, 2025

No description provided.

@Ying1123 Ying1123 marked this pull request as draft March 2, 2025 02:56
@Ying1123 Ying1123 force-pushed the ying-eagle branch 5 times, most recently from e80b2de to 9c28d33 on March 2, 2025 10:20
@Ying1123 Ying1123 marked this pull request as ready for review March 3, 2025 01:50
@Ying1123 Ying1123 requested a review from HaiShaw as a code owner March 3, 2025 01:50
@zhyncs (Member) commented Mar 3, 2025

@Ying1123 Could you help resolve the conflicts? Thanks!

@Ying1123 Ying1123 force-pushed the ying-eagle branch 3 times, most recently from 5e9c58a to 172c25f on March 3, 2025 05:46
```diff
@@ -151,7 +156,7 @@ def get_attention_tp_cpu_group(self):
     def get_memory_pool(self):
         return (
             self.model_runner.req_to_token_pool,
-            self.model_runner.token_to_kv_pool,
+            self.model_runner.token_to_kv_pool_allocator,
```
A Collaborator commented on this change:

I think this is a bit confusing style-wise. If we define the allocator to be in charge of the layout-agnostic memory operations, and define a separate memory-pool class for the underlying layouts, then we should use the allocator consistently in the scheduler and touch the memory pool only in lower-level code.
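
To illustrate the separation this comment suggests, here is a minimal hypothetical sketch; the class and method names are illustrative only and are not the actual SGLang classes:

```python
# Hypothetical sketch of the allocator/pool split suggested above.
# Names and signatures are illustrative, NOT the actual SGLang API.
import torch


class TokenToKVPool:
    """Owns the physical KV layout; used only by low-level model code."""

    def __init__(self, size: int, head_dim: int):
        self.kv_data = torch.zeros(size, head_dim)


class TokenToKVPoolAllocator:
    """Layout-agnostic free-slot management; the only interface the
    scheduler sees."""

    def __init__(self, pool: TokenToKVPool):
        self._pool = pool
        self._free_slots = list(range(pool.kv_data.shape[0]))

    def alloc(self, n: int) -> list[int]:
        # Hand out n free slot indices; the scheduler never touches
        # the underlying tensor layout.
        if n > len(self._free_slots):
            raise RuntimeError("KV cache out of memory")
        slots, self._free_slots = self._free_slots[:n], self._free_slots[n:]
        return slots

    def free(self, slots: list[int]) -> None:
        self._free_slots.extend(slots)
```

Under this convention, the scheduler would hold only the allocator, and passing the raw pool upward (as the diff above does for the allocator's sibling) would be confined to the model runner.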

@Ying1123 Ying1123 force-pushed the ying-eagle branch 3 times, most recently from d2686e5 to a574770 on March 3, 2025 08:58
@mpjlu (Contributor) commented Mar 3, 2025

commit a574770: there is an illegal memory access bug when running DeepSeek:

```
  File "/data/peng/sglang/python/sglang/srt/managers/scheduler.py", line 1218, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
  File "/data/peng/sglang/python/sglang/srt/speculative/eagle_worker.py", line 189, in forward_batch_speculative_generation
    spec_info, to_free_cache_loc = self.draft(batch)
  File "/data/peng/sglang/python/sglang/srt/speculative/eagle_worker.py", line 244, in draft
    assign_draft_cache_locs[(num_seqs,)](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
```
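
For context on this failure mode, a minimal hypothetical sketch follows; it is NOT the actual assign_draft_cache_locs kernel. A CUDA "illegal memory access" from a Triton kernel typically means a load or store went through an out-of-range index, for example an unmasked tail block:

```python
# Hypothetical Triton kernel illustrating the failure mode above; this is
# NOT sglang's assign_draft_cache_locs kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def copy_locs(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # Without this mask, the last block reads/writes past n_elements and
    # triggers "an illegal memory access was encountered".
    mask = offs < n_elements
    vals = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, vals, mask=mask)


src = torch.arange(1000, device="cuda", dtype=torch.int64)
dst = torch.empty_like(src)
copy_locs[(triton.cdiv(src.numel(), 256),)](src, dst, src.numel(), BLOCK=256)
```

Note that a correctly masked kernel can still crash this way if the indices it is given are themselves out of range (e.g. cache locations beyond the pool size), which is the kind of bug a refactor of the memory pool/allocator boundary could introduce.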

@Ying1123 (Member, Author) commented Mar 3, 2025

> commit a574770: there is an illegal memory access bug when running DeepSeek: […]

@mpjlu Thanks for reporting. Could you provide the reproducible command?
@ispobock @zhyncs Could you also help take a look?

@mpjlu (Contributor) commented Mar 3, 2025

> @mpjlu Thanks for reporting. Could you provide the reproducible command? @ispobock @zhyncs Could you also help take a look?
The following command reproduces the error:
```bash
# With --nnodes 2, a matching command with --node-rank 1 must also be
# launched on the second node.
python3 -m sglang.launch_server \
    --model-path $model_path \
    --tp $tp_size \
    --dist-init-addr 29.224.56.106:5000 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --mem-fraction-static 0.6 \
    --max-running-requests 64 \
    --speculative-draft-model-path $draft_path \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 2 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 4 \
    --disable-cuda-graph
```

test.py:

```python
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
start = time.time()
for i in range(100):
    response = client.chat.completions.create(
        model="default",
        messages=[
            # Prompt (Chinese): "Please write a 1000-word essay on the
            # theme of integrity."
            {"role": "user", "content": "请以诚信为主题写一篇1000字作文?"},
        ],
        temperature=0.6,
        max_tokens=1000,
        extra_body={"top_p": 0.6, "top_k": 50},
    )
    print(response)
print("dur=", time.time() - start)
```

@zhyncs mentioned this pull request Mar 3, 2025
@ispobock (Collaborator) commented Mar 4, 2025

I verified e19e733 on 8*H200 for the DeepSeek-V3 model with nextn enabled, and it works fine.

@mpjlu I cannot reproduce your error in my environment. I am not sure whether it is specific to the multi-node setting, since I tried the same args as your command but on one node.
