feat: Support cos_sin_cache in all cases. #3020
base: main
Conversation
/bot run --add-multi-gpu-test
PR_Github #283 [ run ] triggered by Bot
PR_Github #283 [ run ] completed with state
/bot run --add-multi-gpu-test
PR_Github #387 [ run ] triggered by Bot
PR_Github #387 [ run ] completed with state
/bot run --add-multi-gpu-test
PR_Github #430 [ run ] triggered by Bot
PR_Github #430 [ run ] completed with state
Signed-off-by: Yuxian Qiu <[email protected]>
Force-pushed from 86c5593 to 95840f8 (Compare)
Signed-off-by: Yuxian Qiu <[email protected]>
Force-pushed from 95840f8 to 54d797b (Compare)
/bot run --add-multi-gpu-test
PR_Github #510 [ run ] triggered by Bot
I think I am pinged by mistake; is the review request actually intended for @litaotju?
PR_Github #510 [ run ] completed with state
@yuxianq Can we split this PR into several smaller PRs? For example, the first item could be a single PR.
@QiJune I will give it a try. Let me pass the CI first to validate that these features work correctly.
Signed-off-by: Yuxian Qiu <[email protected]>
/bot run --add-multi-gpu-test
PR_Github #550 [ run ] triggered by Bot
PR_Github #550 [ run ] completed with state
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #584 [ run ] triggered by Bot
PR_Github #584 [ run ] completed with state
@@ -5,9 +5,8 @@
from tensorrt_llm.quantization import (quantize_and_export,
                                       quantize_nemo_and_export)

mp.set_start_method("spawn", force=True)
Just curious, is someone importing this file?
It should be used as a CLI command only.
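A minimal sketch of one way to address this (the `main` entry point below is hypothetical): keep the start-method side effect under the `__main__` guard so importing the module stays side-effect free.

```python
import multiprocessing as mp


def main():
    # hypothetical entry point; the real script's argument parsing and
    # quantize_and_export calls would live here
    pass


if __name__ == "__main__":
    # "spawn" avoids forking CUDA state into worker processes; forcing it only
    # matters when this file is the actual entry point, so importing the module
    # no longer changes the global start method.
    mp.set_start_method("spawn", force=True)
    main()
```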
@@ -327,57 +338,77 @@ def from_config(config) -> "RopeParams":
rope_params.beta_slow = rope_scaling.get("beta_slow", 1)
rope_params.mscale = rope_scaling.get("mscale", 1.0)
rope_params.mscale_all_dim = rope_scaling.get("mscale_all_dim", 0.0)
if config.model_type == "deepseek_v3":
This looks somewhat ad hoc to me. Is it possible to avoid relying on the hard-coded model-type string here, in favor of a more general interface?
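One way to avoid the hard-coded model name, sketched under the assumption that the config carries a standard `rope_scaling` dict, would be to dispatch on the declared scaling type instead of `config.model_type`.

```python
# Hedged sketch only: the field names ("rope_scaling", "rope_type"/"type")
# follow the common Hugging Face convention and are an assumption here,
# not the PR's actual code.
def resolve_rope_scaling_type(config) -> str:
    rope_scaling = getattr(config, "rope_scaling", None) or {}
    # newer HF configs use "rope_type"; older ones use "type"
    return rope_scaling.get("rope_type", rope_scaling.get("type", "none"))


# Inside RopeParams.from_config one could then branch on the scaling type
# rather than the model name, e.g.:
#     if resolve_rope_scaling_type(config) == "yarn":
#         ...  # yarn/mscale handling shared by DeepSeek-V3 and other models
```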
assert self.scale_type != RotaryScalingType.longrope, "Long RoPE is not yet supported."
if self.scale_type == RotaryScalingType.yarn:
    rope_inv_freq = None
    rope_cos_sin = RopeEmbeddingUtils.create_sinusoidal_positions_for_deepseek_attention_plugin(
Maybe a nit:
Can we name it create_*_positions_for_yarn or something similar, so it isn't too specific to DeepSeek?
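A sketch of the renaming idea (the generic name `create_sinusoidal_positions_for_yarn` is the reviewer's suggestion, not an existing function): define a yarn-generic helper and keep the DeepSeek-specific name as an alias while call sites migrate.

```python
class RopeEmbeddingUtils:  # illustrative stub, not the real utility class

    @staticmethod
    def create_sinusoidal_positions_for_yarn(*args, **kwargs):
        """Yarn-style cos/sin position table; the algorithm is not DeepSeek-specific."""
        ...  # the existing implementation would move here unchanged

    # keep the old name as a thin alias during the transition
    create_sinusoidal_positions_for_deepseek_attention_plugin = (
        create_sinusoidal_positions_for_yarn)
```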
@@ -102,7 +102,17 @@ def __init__(
self.quant_config = config.get_quant_config()
self.attn_backend = config.attn_backend
self.pos_embd_params = pos_embd_params
self.rotary_emb = rotary_emb

self.support_rope = self.attn_backend == "TRTLLM"
You mean the custom op will do the RoPE fusion, right?
Maybe rename it to something like self.rope_fused_in_custom_op = True?
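A minimal sketch of the suggested rename (attribute name as proposed above, class shape assumed), making it explicit that the TRTLLM backend's custom op applies RoPE internally:

```python
class Attention:  # minimal stand-in for the real attention module
    def __init__(self, attn_backend: str):
        self.attn_backend = attn_backend
        # True means the backend's custom op applies RoPE internally, so the
        # Python-side rotary embedding must be skipped for this backend.
        self.rope_fused_in_custom_op = attn_backend == "TRTLLM"
```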
self.rotary_emb = rotary_emb

self.support_rope = self.attn_backend == "TRTLLM"
self.support_fused_qkv = self.attn_backend == "TRTLLM"
"support" is a vague word.
support means which one?
- configuable, both fused qkv and unfused qkv can run. or
- requires fused qkv?
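If the second reading is the intended one, a name that states the requirement removes the ambiguity; a short sketch with a hypothetical attribute name:

```python
class Attention:  # minimal stand-in, attribute name is hypothetical
    def __init__(self, attn_backend: str):
        self.attn_backend = attn_backend
        # If the TRTLLM custom op only accepts the fused QKV layout, a name
        # like this states the requirement instead of a vague "support".
        self.requires_fused_qkv = attn_backend == "TRTLLM"
```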
@@ -249,6 +249,8 @@ def __init__(
    attn_backend=attn_backend,
    load_format=pytorch_backend_config.load_format,
)
if not hasattr(self.model, 'extra_attrs'):
When will this be true?
Can we always attach extra_attrs in _load_model so that this check isn't needed?
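A hedged sketch of the suggestion: attach the attribute unconditionally when the model is loaded, so the call-site `hasattr` guard disappears. The method body below is illustrative, not the PR's code.

```python
def _load_model(self, *args, **kwargs):
    model = ...  # existing model construction / checkpoint loading
    if not hasattr(model, "extra_attrs"):
        model.extra_attrs = {}  # always present for every loaded model
    return model
```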
                gather_ids)
        else:
            return self._forward_step(inputs, gather_ids)
        with model_extra_attrs(self.model.extra_attrs):
Shall we wrap the whole forward function inside this context manager?
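If the whole step should see the attributes, the context manager could be entered once at the top of `forward`; a sketch with the structure assumed and the branch bodies elided:

```python
def forward(self, inputs, gather_ids=None):
    # enter once, so both branches run inside the context manager
    with model_extra_attrs(self.model.extra_attrs):
        if gather_ids is not None:
            ...  # the gather_ids branch from the snippet above
        else:
            return self._forward_step(inputs, gather_ids)
```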
@@ -122,7 +122,7 @@ def submit_sync(self, task: Callable[..., T], *args, **kwargs) -> List[T]:

    def shutdown(self):
        if self.mpi_pool is not None:
            self.mpi_pool.shutdown(wait=False)
@Superjomn do you remember whether we had some reason to make this "wait=False"?
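For reference, the trade-off being asked about: `wait=False` returns immediately and lets pending tasks finish in the background, while the blocking variant looks like this (a sketch, assuming the pool follows the `concurrent.futures` shutdown signature):

```python
def shutdown(self):
    if self.mpi_pool is not None:
        # blocking shutdown waits for in-flight tasks and gives deterministic
        # cleanup, at the cost of a slower exit
        self.mpi_pool.shutdown(wait=True)
        self.mpi_pool = None
```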
@@ -36,6 +36,7 @@ def test_llm_api(self, import_oot_code: bool):
llm = LLM(model=model_dir,
          kv_cache_config=kv_cache_config,
          max_num_tokens=2048)
del llm
Won't this llm object be automatically destroyed when the function returns? It's just a local variable.
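For context on why tests sometimes do this anyway, a hedged sketch (illustrative names, not the PR's actual test): a local is normally released on return, but a failing test keeps the frame alive through the captured traceback, so the LLM and its GPU memory can outlive the test unless the reference is dropped explicitly.

```python
from tensorrt_llm import LLM


def test_llm_api(model_dir, kv_cache_config):
    llm = LLM(model=model_dir,
              kv_cache_config=kv_cache_config,
              max_num_tokens=2048)
    try:
        ...  # run generation and assertions against llm
    finally:
        del llm  # drop the reference even if an assertion above raised
```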
@@ -216,3 +216,4 @@ async def test():
        1.0), f"Expected '{expected}' but get '{result}'"

    asyncio.run(test())
    del llm
Same here: is there some reason the object isn't deleted by Python when the function returns?
Signed-off-by: Yuxian Qiu <[email protected]>
/bot run --disable-fail-fast --stage-list "A30-7"
This PR contains the following updates:
- Handle `fuse_pos_embd=True/False` and `createRotaryEmbedding` inside the attention module, so that users don't need to handle it in the modeling files.
- Compute `cos_sin` for the unfused rope implementation. If flashinfer is available, use `apply_rope_with_cos_sin_cache_inplace` instead of `apply_rope_inplace`. Otherwise, we fall back to a pure PyTorch implementation, which can now support any rope (see the dispatch sketch below).
- Use `create_rope_const_params` to create and cache `cos_sin_cache` for all rope types, including the Deepseek yarn rope.
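To make the fallback concrete, here is a hedged sketch of such a dispatch. The flashinfer entry point, its argument order, and the `(max_pos, [cos | sin])` cache layout are assumptions based on its documented `apply_rope_with_cos_sin_cache_inplace` API; the helper names are illustrative, and the fallback implements a plain neox-style rotation against the cached table.

```python
import torch

try:
    import flashinfer
    HAS_FLASHINFER = True
except ImportError:
    HAS_FLASHINFER = False


def _rotate_neox(x, cos, sin):
    # x: (tokens, heads, rot_dim); cos/sin: (tokens, 1, rot_dim // 2)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)


def apply_rope(q, k, positions, cos_sin_cache, head_size):
    """q, k: (tokens, heads * head_size); cos_sin_cache: (max_pos, rot_dim)."""
    if HAS_FLASHINFER:
        # fused in-place kernel path (signature assumed; check flashinfer docs)
        flashinfer.apply_rope_with_cos_sin_cache_inplace(
            positions, q, k, head_size, cos_sin_cache, is_neox=True)
        return q, k
    # pure-PyTorch fallback: gather cos/sin per token and rotate in place
    rot_dim = cos_sin_cache.shape[-1]
    cos, sin = cos_sin_cache[positions].chunk(2, dim=-1)   # (tokens, rot_dim // 2)
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)          # broadcast over heads
    for t in (q, k):
        x = t.view(t.shape[0], -1, head_size)              # view shares storage with t
        x[..., :rot_dim] = _rotate_neox(x[..., :rot_dim], cos, sin)
    return q, k
```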