Refactor eagle speculative decoding #3986
base: main
e80b2de to 9c28d33
@Ying1123 Could you help resolve the conflicts? Thanks!
5e9c58a to 172c25f
@@ -151,7 +156,7 @@ def get_attention_tp_cpu_group(self):
     def get_memory_pool(self):
         return (
             self.model_runner.req_to_token_pool,
-            self.model_runner.token_to_kv_pool,
+            self.model_runner.token_to_kv_pool_allocator,
I think style-wise this is a bit confusing. If we define the allocator to be in charge of the agnostic memory operations and define a separate memory pool class for the underlying layouts, we should use allocators consistently in the scheduler and only use the memory pool in lower-level code.
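To make the suggested split concrete, here is a minimal sketch of what it might look like. The names `KVPool`, `TokenAllocator`, `alloc`, `free`, `available`, and `get_kvcache` are hypothetical and chosen for illustration only; they are not taken from this PR. The assumption is that scheduler-level code talks only to the allocator, and only lower-level code reaches through to the pool and its layout.

```python
from typing import List, Optional


class KVPool:
    """Hypothetical low-level pool: owns the physical KV layout."""

    def __init__(self, size: int, head_dim: int) -> None:
        self.size = size
        # One row of storage per token slot; the layout is private to this class.
        self.buffer = [[0.0] * head_dim for _ in range(size)]

    def write(self, slot: int, values: List[float]) -> None:
        # Layout-specific write, intended for lower-level code only.
        self.buffer[slot] = list(values)


class TokenAllocator:
    """Hypothetical allocator: hands out slot indices and never touches layout."""

    def __init__(self, pool: KVPool) -> None:
        self._pool = pool
        self._free = list(range(pool.size))

    def alloc(self, n: int) -> Optional[List[int]]:
        # Return n free slot indices, or None if not enough are available.
        if n > len(self._free):
            return None
        taken, self._free = self._free[:n], self._free[n:]
        return taken

    def free(self, slots: List[int]) -> None:
        # Return slots to the free list.
        self._free.extend(slots)

    def available(self) -> int:
        return len(self._free)

    def get_kvcache(self) -> KVPool:
        # Escape hatch for lower-level code that needs the actual layout.
        return self._pool
```

With this shape, a scheduler would only ever call `alloc`, `free`, and `available`, while attention backends or similar lower-level code would call `get_kvcache()` to reach the storage.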
14f5d33 to 5d9a5ec
d2686e5 to a574770
Commit a574770: there is an illegal bug when running DeepSeek:
@mpjlu Thanks for reporting. Could you provide a command to reproduce it?
test.py
import openai
# Chat completion
start = time.time()