How to make vLLM run inference with your own Ascend operators
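vLLM-Ascend inference testing tutorial

This document covers testing `vllm-ascend` on the same Ascend machine and explains how to plug your own Ascend operators into the vLLM path.

Note: this project's current `build/libllm_ascend.so` is a standalone C inference engine, and vLLM does not call it automatically. To have vLLM run inference with "your own operators", port them into the `vllm-ascend` custom aclnn operator framework, then call them from the vLLM-Ascend model execution path.

1. Environment check

```bash
npu-smi info

python - <<'PY'
import torch
import torch_npu

print("torch:", torch.__version__)
print("torch_npu:", getattr(torch_npu, "__version__", "unknown"))
print("npu available:", torch.npu.is_available())
print("npu count:", torch.npu.device_count())
PY
```

If your machine uses CANN 8.5.1, prefer the `vllm-ascend` version that matches your current CANN, torch, and torch-npu. The examples below use `vllm==0.19.1` and `vllm-ascend==0.19.1rc1`.

2. Install vllm-ascend

Create a new environment so that the existing inference environment of this project is not affected.

```bash
python -m venv ~/venvs/vllm-ascend
source ~/venvs/vllm-ascend/bin/activate

# Load the CANN environment from whichever install location exists
if [ -f /usr/local/Ascend/cann-8.5.1/set_env.sh ]; then
  source /usr/local/Ascend/cann-8.5.1/set_env.sh
elif [ -f /usr/local/Ascend/ascend-toolkit/set_env.sh ]; then
  source /usr/local/Ascend/ascend-toolkit/set_env.sh
fi

python -m pip install -U pip setuptools wheel
pip install vllm==0.19.1
pip install \
  --extra-index-url https://mirrors.huaweicloud.com/repository/pypi/simple \
  vllm-ascend==0.19.1rc1
```

Verify the installation:

```bash
python - <<'PY'
import vllm
import vllm_ascend
import torch
import torch_npu

print("vllm:", vllm.__version__)
print("vllm_ascend imported")
print("npu available:", torch.npu.is_available())
PY
```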
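Because the installation depends on `vllm`, `vllm-ascend`, torch, and torch-npu matching each other, it can help to record which versions are actually installed. The following is a minimal sketch that uses only the Python standard library and does not import the packages themselves, so it still reports something when an import would fail. The helper name `installed_versions` is hypothetical, and the distribution names are simply the pip names used in this tutorial.

```python
from importlib.metadata import PackageNotFoundError, version

# pip distribution names used in this tutorial
PACKAGES = ["vllm", "vllm-ascend", "torch", "torch-npu"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Saving this output next to each benchmark log makes it easier to tell later which software stack produced a given number.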
3. Single-NPU offline inference test

This test is meant to be compared against this project's `python_infer.py --lib ./build/libllm_ascend.so` with the same prompt and the same 128 generated tokens.

```bash
cd ~/LLM-inference-engine

cat > vllm_ascend_offline_test.py <<'PY'
import time

from vllm import LLM, SamplingParams

MODEL = "./deepseek-r1-7b"
# Benchmark prompt, kept in Chinese: "Hegel's philosophy can be summarized as"
PROMPT = "黑格尔的哲学思想可以概括为"

# Greedy decoding with a fixed output length keeps runs comparable
sampling = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(
    model=MODEL,
    tokenizer=MODEL,
    trust_remote_code=True,
    dtype="float16",
    max_model_len=800,
    max_num_seqs=1,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
)

llm.generate([PROMPT], sampling)  # warmup

t0 = time.perf_counter()
outputs = llm.generate([PROMPT], sampling)
t1 = time.perf_counter()

out = outputs[0].outputs[0]
new_tokens = len(out.token_ids)
elapsed = t1 - t0

print("=== generated text ===")
print(out.text)
print()
print("=== performance ===")
print(f"generated_tokens={new_tokens}")
print(f"elapsed_s={elapsed:.6f}")
print(f"tokens_per_s={new_tokens / elapsed:.3f}")
PY

export ASCEND_VISIBLE_DEVICES=4
export ASCEND_RT_VISIBLE_DEVICES=4
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

python vllm_ascend_offline_test.py 2>&1 | tee vllm_ascend_offline_128.log
```

If `dtype="float16"` raises an error, change it to `dtype="bfloat16"` or `dtype="auto"`.

4. OpenAI API server inference

Start the server:

```bash
cd ~/LLM-inference-engine

export ASCEND_VISIBLE_DEVICES=4
export ASCEND_RT_VISIBLE_DEVICES=4
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

vllm serve ./deepseek-r1-7b \
  --served-model-name deepseek-r1-7b \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 800 \
  --max-num-seqs 1 \
  --dtype float16 \
  --enforce-eager \
  --trust-remote-code
```

Send a request from another terminal:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-7b",
    "prompt": "黑格尔的哲学思想可以概括为",
    "max_tokens": 128,
    "temperature": 0
  }' | python -m json.tool
```

Server mode is suited to testing API behavior and concurrent throughput. To compare against this project's per-token decode logs, prefer the offline script from section 3.
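The offline script from section 3 writes its performance figures into the `tee` log. A minimal sketch for pulling them back out and computing a throughput ratio follows. It assumes the lines have the form `tokens_per_s=12.345`; `parse_perf` and `speedup` are hypothetical helper names, and the sample log text contains made-up placeholder numbers, not measurements. Whether the direct `.so` log uses the same keys is not stated in this document, so treat that as something to check before reusing the parser on it.

```python
import re

# Matches the three key=value lines assumed to be printed by the offline script
PERF_LINE = re.compile(r"^(generated_tokens|elapsed_s|tokens_per_s)=([0-9.]+)$")


def parse_perf(log_text):
    """Collect performance key/value pairs from a log; the last occurrence wins."""
    perf = {}
    for line in log_text.splitlines():
        match = PERF_LINE.match(line.strip())
        if match:
            perf[match.group(1)] = float(match.group(2))
    return perf


def speedup(candidate, baseline):
    """Throughput ratio of candidate over baseline (above 1.0 means faster)."""
    return candidate["tokens_per_s"] / baseline["tokens_per_s"]


# Placeholder log text with invented numbers, only to exercise the parser
SAMPLE_LOG = """=== performance ===
generated_tokens=128
elapsed_s=6.400000
tokens_per_s=20.000
"""

if __name__ == "__main__":
    print(parse_perf(SAMPLE_LOG))
```

In practice the text would come from a file such as `open("vllm_ascend_offline_128.log").read()`, and `speedup` would be applied to two parsed logs produced with identical prompt and token settings.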
5. Running vLLM offline with torchrun

`torchrun` is better suited to offline tensor-parallel / pipeline-parallel inference. For multi-process runs, set vLLM's `distributed_executor_backend="external_launcher"` so that the workers are launched externally by `torchrun`.

Write the script first:

```bash
cd ~/LLM-inference-engine

cat > torchrun_vllm_ascend_offline.py <<'PY'
import argparse
import os
import time

import torch.distributed as dist
from vllm import LLM, SamplingParams


def is_rank0():
    # Prefer the initialized process group; fall back to torchrun's RANK variable
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return int(os.environ.get("RANK", "0")) == 0


parser = argparse.ArgumentParser()
parser.add_argument("--model", default="./deepseek-r1-7b")
parser.add_argument("--prompt", default="黑格尔的哲学思想可以概括为")
parser.add_argument("--max-model-len", type=int, default=800)
parser.add_argument("--max-tokens", type=int, default=128)
parser.add_argument("--tp-size", type=int, default=1)
parser.add_argument("--pp-size", type=int, default=1)
parser.add_argument("--dtype", default="float16")
parser.add_argument("--enforce-eager", action="store_true")
args = parser.parse_args()

world_size = int(os.environ.get("WORLD_SIZE", "1"))

llm_kwargs = dict(
    model=args.model,
    tokenizer=args.model,
    trust_remote_code=True,
    dtype=args.dtype,
    tensor_parallel_size=args.tp_size,
    pipeline_parallel_size=args.pp_size,
    max_model_len=args.max_model_len,
    max_num_seqs=1,
    gpu_memory_utilization=0.90,
    seed=1,
)
if args.enforce_eager:
    llm_kwargs["enforce_eager"] = True
if world_size > 1:
    # Workers are started by torchrun, not by vLLM itself
    llm_kwargs["distributed_executor_backend"] = "external_launcher"

llm = LLM(**llm_kwargs)
sampling = SamplingParams(temperature=0.0, max_tokens=args.max_tokens)

llm.generate([args.prompt], sampling)  # warmup

t0 = time.perf_counter()
outputs = llm.generate([args.prompt], sampling)
t1 = time.perf_counter()

# Print once, from rank 0 only
if is_rank0():
    out = outputs[0].outputs[0]
    new_tokens = len(out.token_ids)
    elapsed = t1 - t0
    print("=== generated text ===")
    print(out.text)
    print()
    print("=== performance ===")
    print(f"generated_tokens={new_tokens}")
    print(f"elapsed_s={elapsed:.6f}")
    print(f"tokens_per_s={new_tokens / elapsed:.3f}")
PY
```

Single-NPU torchrun:

```bash
export ASCEND_VISIBLE_DEVICES=4
export ASCEND_RT_VISIBLE_DEVICES=4
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

torchrun --nproc-per-node=1 torchrun_vllm_ascend_offline.py \
  --model ./deepseek-r1-7b \
  --max-model-len 800 \
  --max-tokens 128 \
  --tp-size 1 \
  --pp-size 1 \
  --enforce-eager \
  2>&1 | tee torchrun_vllm_ascend_1npu_128.log
```

Two-NPU tensor parallel example:

```bash
export ASCEND_VISIBLE_DEVICES=0,1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

torchrun --nproc-per-node=2 torchrun_vllm_ascend_offline.py \
  --model ./deepseek-r1-7b \
  --max-model-len 800 \
  --max-tokens 128 \
  --tp-size 2 \
  --pp-size 1 \
  --enforce-eager \
  2>&1 | tee torchrun_vllm_ascend_2npu_tp2_128.log
```

`--nproc-per-node` should equal `--tp-size * --pp-size`. If you only want to compare against this project's single-batch decode, start with the single-NPU run.

6. How to make vLLM run inference with your own operators

Current path of this project:

```text
python_infer.py -> build/libllm_ascend.so -> AscendCL / ACLNN
```

vLLM-Ascend path:

```text
vLLM Python engine -> vllm-ascend plugin -> torch_npu / aclnn custom ops
```

So `build/libllm_ascend.so` cannot simply be handed to vLLM. There are two routes for getting vLLM to use your operators.

Route A: port to a vllm-ascend custom aclnn op

This route is suited to plugging your current QKV fusion, MLP gate/up fusion, and attention kernel into vLLM. The rough steps:

```bash
cd ~
git clone --depth 1 --branch v0.19.1rc1 https://github.com/vllm-project/vllm-ascend.git
cd ~/vllm-ascend
```

Then add the new operator following the directory layout of vllm-ascend's custom aclnn ops, for example:

```text
csrc/
  custom_ops/
    my_qkv_fused/
      op_host/
      op_kernel/
      CMakeLists.txt
```

Next, do three things:

1. Implement your operator's kernel and host launch on the C/AscendC side.
2. Register it in the Python binding as a callable op, for example `torch.ops._C_ascend.my_qkv_fused(...)`.
3. In the vllm-ascend model execution path, replace the original torch/torch_npu operator with your op.

Test a single custom op:

```bash
python - <<'PY'
import torch
import torch_npu

# Example only; the actual name is whatever op you registered
# y = torch.ops._C_ascend.my_qkv_fused(x, w)
print("custom op smoke test placeholder")
PY
```

Route B: keep this project's direct .so and use vLLM only as a baseline

This is currently the most reliable evaluation route:

```text
torch_npu baseline     -> python_infer_ascend.py
vllm-ascend baseline   -> vllm_ascend_offline_test.py
your direct Ascend .so -> python_infer.py --lib ./build/libllm_ascend.so
```

This gives a clear comparison of your direct .so against torch_npu and of your direct .so against vllm-ascend. It does not mean that vLLM is calling your operators. Only after completing Route A can you say "vLLM inference is using my operators."
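When a fused op from Route A is smoke-tested, its output is usually checked against the unfused computation. The host-side sketch below, in plain Python with no torch or NPU dependency, illustrates the identity such a check relies on for a QKV fusion: one matrix multiply against column-concatenated weights yields the same values as three separate multiplies. All names here (`matmul`, `concat_columns`, `fused_qkv`, `max_abs_diff`) and the tiny shapes are hypothetical. The sketch does not call `torch.ops._C_ascend` and makes no claim about how this project's kernel lays out its weights.

```python
def matmul(x, w):
    """Multiply x [n][k] by w [k][m], returning [n][m] as nested lists."""
    return [
        [sum(row[t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
        for row in x
    ]


def concat_columns(*mats):
    """Concatenate matrices with equal row counts along the column axis."""
    return [sum((m[r] for m in mats), []) for r in range(len(mats[0]))]


def fused_qkv(x, wq, wk, wv):
    """Reference 'fused' projection: one matmul, then split into q, k, v."""
    y = matmul(x, concat_columns(wq, wk, wv))
    dq, dk = len(wq[0]), len(wk[0])
    q = [row[:dq] for row in y]
    k = [row[dq:dq + dk] for row in y]
    v = [row[dq + dk:] for row in y]
    return q, k, v


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Small illustrative inputs; k is deliberately narrower than q and v
X = [[1.0, 2.0], [3.0, 4.0]]
WQ = [[1.0, 0.0], [0.0, 1.0]]
WK = [[2.0], [1.0]]
WV = [[0.5, -1.0], [1.5, 2.0]]

if __name__ == "__main__":
    q, k, v = fused_qkv(X, WQ, WK, WV)
    print(max_abs_diff(q, matmul(X, WQ)),
          max_abs_diff(k, matmul(X, WK)),
          max_abs_diff(v, matmul(X, WV)))
```

On real hardware with float16 or bfloat16, the comparison against the unfused result would normally use a tolerance rather than exact equality, since a fused kernel may accumulate in a different order.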
7. Recommended comparison settings

Use the same settings for every run:

```text
model: ./deepseek-r1-7b
prompt: 黑格尔的哲学思想可以概括为
max_new_tokens: 128
max_model_len / max_seq: 800
batch size: 1
temperature: 0
```

This project's direct .so:

```bash
python python_infer.py \
  --model ./deepseek-r1-7b \
  --lib ./build/libllm_ascend.so \
  --prompt "黑格尔的哲学思想可以概括为" \
  --max-new-tokens 128 \
  --max-seq 800 \
  --tokenizer-backend tokenizers \
  --no-chat-template \
  2>&1 | tee ascend_super_128.log
```

vLLM-Ascend:

```bash
python vllm_ascend_offline_test.py 2>&1 | tee vllm_ascend_offline_128.log
```

torchrun vLLM-Ascend:

```bash
torchrun --nproc-per-node=1 torchrun_vllm_ascend_offline.py \
  --model ./deepseek-r1-7b \
  --max-model-len 800 \
  --max-tokens 128 \
  --tp-size 1 \
  --pp-size 1 \
  --enforce-eager \
  2>&1 | tee torchrun_vllm_ascend_1npu_128.log
```

References

vLLM-Ascend installation: https://docs.vllm.ai/projects/ascend/en/main/installation.html
vLLM-Ascend single NPU tutorial: https://vllm-ascend.readthedocs.io/en/v0.8.4rc1/tutorials/single_npu.html
vLLM torchrun example: https://docs.vllm.ai/en/stable/examples/features/torchrun/