
Fix breakage problem when using custom_ar #4052

Open · wants to merge 4 commits into main
Conversation

@kkHuang-amd (Contributor) commented Mar 4, 2025

Motivation

Ran the model test and the following error was reported:

==============================================
File "/root/sglang-0.4.3post3_car/python/sglang/srt/managers/scheduler.py", line 1887, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/managers/scheduler.py", line 243, in __init__
    self.tp_worker = TpWorkerClass(
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/model_executor/model_runner.py", line 192, in __init__
    min_per_gpu_memory = self.init_torch_distributed()
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/model_executor/model_runner.py", line 268, in init_torch_distributed
    initialize_model_parallel(tensor_model_parallel_size=self.tp_size)
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/distributed/parallel_state.py", line 1116, in initialize_model_parallel
    _TP = init_model_parallel_group(
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/distributed/parallel_state.py", line 943, in init_model_parallel_group
    return GroupCoordinator(
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/distributed/parallel_state.py", line 263, in __init__
    self.ca_comm = CustomAllreduce(
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 281, in __init__
    self.meta = ops.allocate_meta_buffer(ops.meta_size() + max_size)
  File "/root/sglang-0.4.3post3_car/python/sglang/srt/_custom_ops.py", line 124, in meta_size
    return sgl_kernel.ops.meta_size()
AttributeError: module 'sgl_kernel.ops' has no attribute 'meta_size'
=====================================================

Checking the path in sgl-kernel shows that ops has been split into multiple files rather than living in __init__.py, so sgl_kernel.ops.meta_size no longer resolves.
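
A quick way to confirm the missing attribute from a Python shell (illustrative sketch, assuming a build of sgl-kernel with the split ops layout is installed):

```python
# Reproduce the symptom without starting the scheduler.
import sgl_kernel.ops as ops

# After ops was split into multiple files, meta_size is no longer
# re-exported at this level, so this prints False and the call site
# in _custom_ops.py raises AttributeError.
print(hasattr(ops, "meta_size"))
```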

Modifications

Change _custom_ops.py to call through the correct path.
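
A minimal sketch of the shape of the fix, assuming the custom all-reduce symbols now live in a dedicated submodule; the module name sgl_kernel.ops.custom_reduce is an assumption for illustration, not taken from this PR's diff:

```python
# sglang/srt/_custom_ops.py (sketch)
# Hypothetical submodule path; check sgl-kernel's actual layout before use.
from sgl_kernel.ops import custom_reduce as _custom_ar


def meta_size() -> int:
    # Previously called sgl_kernel.ops.meta_size(), which raises
    # AttributeError now that ops is split across multiple files.
    return _custom_ar.meta_size()
```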


@HaiShaw (Collaborator) left a comment:


LG
