[XPU] Support W4A8C8-TP4-300B Model (#4068)

* support w4a8

* delete ep block attn

* delete moe_topk_select

* update note

* update

* delete useless info

* update

* add some notes

* fix some format

* update scale info

* add ans baseline

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Author: yinwei
Date: 2025-10-10 15:41:32 +08:00
Committed by: GitHub
Parent: c46d5e48f8
Commit: 20c7b741f4
21 changed files with 2029 additions and 714 deletions


@@ -15,9 +15,9 @@ The scheduling flow is shown below - users randomly request IP and port, obtain
```python
prompts = [
"Hello, my name is",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲内容是李白穿越到现代最后成为公司文职人员的故事",
"我要采访一位科幻作家创建一个包含5个问题的列表"
]
@@ -83,9 +83,9 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
```
### Parameter Description
- num-servers: Number of API servers to launch
- ports: Ports for API servers
- args: Arguments for API servers
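A hedged sketch of a full invocation, since the command above is truncated in this hunk: the flag spellings (`--num-servers`, `--ports`, `--args`) are assumed from the parameter names listed here, and the port numbers and model path are illustrative only.

```shell
# Illustrative sketch (flag names assumed from the parameter list above):
# launch 2 API servers on ports 8180 and 8181, passing the quoted string
# in --args through to each server instance.
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --num-servers 2 \
    --ports "8180,8181" \
    --args "--model /path/to/model --tensor-parallel-size 4"
```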
### Data Parallelism + Disaggregated Deployment
Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated-deployment)
@@ -94,9 +94,8 @@ Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated
For multi-machine deployment, ensure network cards support RDMA and all cluster nodes are interconnected.
**Note**:
- `KVCACHE_RDMA_NICS` specifies RDMA network cards for the current machine; multiple cards should be separated by commas.
- The repository provides an automatic RDMA network card detection script `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
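The two notes above can be combined into a short setup step. This is a hedged sketch: the detection script path is taken from the docs, but the exact NIC names in the manual fallback are illustrative, not taken from the repository.

```shell
# Detect RDMA NICs on a GPU machine and export the comma-separated list
# that FastDeploy reads for KV cache transfer (script path from the docs).
export KVCACHE_RDMA_NICS=$(bash scripts/get_rdma_nics.sh gpu)
echo "Using RDMA NICs: ${KVCACHE_RDMA_NICS}"

# Or set the list manually; NIC names here are illustrative examples:
# export KVCACHE_RDMA_NICS="mlx5_0,mlx5_1"
```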
**Prefill Instance**
```bash
@@ -148,4 +147,4 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-ttl 9000 \
--scheduler-topic "test" \
--splitwise-role "decode"
```