# OpenAI Protocol-Compatible API Server
FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --port 8188 --tensor-parallel-size 8 \
    --max-model-len 32768
```
To enable log probability output, simply deploy with the following command:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --port 8188 --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --enable-logprob
```
For more usage methods of the command line during service deployment, refer to [Parameter Descriptions](../parameters.md).
## Chat Completion API
FastDeploy provides a Chat Completion API that is compatible with the OpenAI protocol, allowing user requests to be sent directly using OpenAI's request method.
### Sending User Requests
Here is an example of sending a user request using the curl command:
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```
Here's an example curl command demonstrating how to include the logprobs parameter in a user request:
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello!"}
],
"logprobs": true, "top_logprobs": 0,
}'
```
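The same request can be issued through the OpenAI Python client. The sketch below assumes the server mirrors OpenAI's logprobs response structure (a `logprobs.content` list with per-token `token` and `logprob` fields) and that a non-zero `top_logprobs` returns candidate tokens:
```python
import openai

# Host and port are placeholders; match them to your deployment.
client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,
    top_logprobs=5,  # assumption: a non-zero value returns the top-5 candidates per position
)

# Print each generated token together with its log probability.
for token_info in response.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
```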
Here is an example of sending a user request using a Python script:
```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thought' as a modern poem"},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
For a description of the OpenAI protocol, refer to the document [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create).
### Compatible OpenAI Parameters
```python
messages: Union[List[Any], List[int]]
# List of input messages, which can be text messages (`List[Any]`, typically `List[dict]`) or token ID lists (`List[int]`).
tools: Optional[List[ChatCompletionToolsParam]] = None
# List of tool call configurations, used to enable function calling or tool use (e.g., the ReAct framework).
model: Optional[str] = "default"
# Specifies the model name or version to use, defaulting to `"default"` (which may point to the base model).
frequency_penalty: Optional[float] = None
# Frequency penalty coefficient; penalizes tokens in proportion to how often they have already appeared (positive values suppress repetition, negative values encourage it, default `None` disables).
logprobs: Optional[bool] = False
# Whether to return the log probabilities of each generated token, used for debugging or analysis.
top_logprobs: Optional[int] = 0
# Returns the top `top_logprobs` tokens and their log probabilities for each generated position (default `0` means no return).
max_tokens: Optional[int] = Field(
default=None,
deprecated="max_tokens is deprecated in favor of the max_completion_tokens field",
)
# Deprecated: Maximum number of tokens to generate (recommended to use `max_completion_tokens` instead).
max_completion_tokens: Optional[int] = None
# Maximum number of tokens to generate (recommended alternative to `max_tokens`), no default limit (restricted by the model's context window).
presence_penalty: Optional[float] = None
# Presence penalty coefficient; penalizes tokens that have already appeared at least once, encouraging the model to move on to new content (positive values encourage new topics, negative values discourage them, default `None` disables).
stream: Optional[bool] = False
# Whether to enable streaming output (return results token by token), default `False` (returns complete results at once).
stream_options: Optional[StreamOptions] = None
# Additional configurations for streaming output (such as chunk size, timeout, etc.), refer to the specific definition of `StreamOptions`.
temperature: Optional[float] = None
# Temperature coefficient, controlling generation randomness (`0.0` for deterministic generation, `>1.0` for more randomness, default `None` uses model default).
top_p: Optional[float] = None
# Nucleus sampling threshold; sampling is restricted to the smallest set of tokens whose cumulative probability reaches `top_p` (default `None` disables).
response_format: Optional[AnyResponseFormat] = None
# Specifies the output format (such as JSON, XML, etc.), requires passing a predefined format configuration object.
user: Optional[str] = None
# User identifier, used for tracking or distinguishing requests from different users (default `None` does not pass).
metadata: Optional[dict] = None
# Additional metadata, used for passing custom information (such as request ID, debug markers, etc.).
```
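A minimal sketch combining several of the parameters above in a single request (the host, port, and parameter values are illustrative):
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Summarize nucleus sampling in one sentence."},
    ],
    temperature=0.7,            # moderate randomness
    top_p=0.9,                  # nucleus sampling threshold
    max_completion_tokens=256,  # cap on the number of generated tokens
    frequency_penalty=0.5,      # discourage repeated tokens
    stream=False,
)

print(response.choices[0].message.content)
```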
### Additional Parameters Added by FastDeploy
> Note:
> When sending requests using curl, the following parameters can be used directly in the request body;
> when sending requests using openai.Client, these parameters need to be placed in the `extra_body` argument, e.g. `extra_body={"chat_template_kwargs": {"enable_thinking": True}, "include_stop_str_in_output": True}`.

The following sampling parameters are supported:
```python
top_k: Optional[int] = None
# Limits the consideration to the top K tokens with the highest probability at each generation step, used to control randomness (default None means no limit).
min_p: Optional[float] = None
# Minimum probability threshold; tokens whose probability (relative to the most likely token) falls below min_p are filtered out of sampling (default None means disabled).
min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
include_stop_str_in_output: Optional[bool] = False
# Whether to include the stop string content in the output (default False, meaning output is truncated when a stop string is encountered).
bad_words: Optional[List[str]] = None
# List of forbidden words (e.g., sensitive words) that the model should avoid generating (default None means no restriction).
bad_words_token_ids: Optional[List[int]] = None
# List of forbidden token ids that the model should avoid generating (default None means no restriction).
repetition_penalty: Optional[float] = None
# Repetition penalty coefficient, reducing the probability of repeating already generated tokens (`>1.0` suppresses repetition, `<1.0` encourages repetition, default None means disabled).
```
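For example, these sampling fields can be placed directly in the JSON body of a curl request (the values shown are illustrative):
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Hello!"}
        ],
        "top_k": 20,
        "min_tokens": 10,
        "repetition_penalty": 1.05,
        "include_stop_str_in_output": true
      }'
```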
The following extra parameters are supported:
```python
chat_template_kwargs: Optional[dict] = None
# Additional parameters passed to the chat template, used for customizing dialogue formats (default None).
chat_template: Optional[str] = None
# Custom chat template will override the model's default chat template (default None).
reasoning_max_tokens: Optional[int] = None
# Maximum number of tokens to generate during reasoning (e.g., CoT, chain of thought) (default None means using global max_tokens).
structural_tag: Optional[str] = None
# Structural tag, used to mark specific structures of generated content (such as JSON, XML, etc., default None).
guided_json: Optional[Union[str, dict, BaseModel]] = None
# Guides the generation of content conforming to JSON structure, can be a JSON string, dictionary, or Pydantic model (default None).
guided_regex: Optional[str] = None
# Guides the generation of content conforming to regular expression rules (default None means no restriction).
guided_choice: Optional[List[str]] = None
# Guides the generation of content selected from a specified candidate list (default None means no restriction).
guided_grammar: Optional[str] = None
# Guides the generation of content conforming to grammar rules (such as BNF) (default None means no restriction).
return_token_ids: Optional[bool] = None
# Whether to return the token IDs of the generation results instead of text (default None means return text).
prompt_token_ids: Optional[List[int]] = None
# Directly passes the token ID list of the prompt, skipping the text encoding step (default None means using text input).
disable_chat_template: Optional[bool] = False
# Whether to disable chat template rendering, using raw input directly (default False means template is enabled).
temp_scaled_logprobs: Optional[bool] = False
# Whether to divide the logits by the temperature coefficient when calculating logprobs (default is False, meaning the logits are not divided by the temperature coefficient).
top_p_normalized_logprobs: Optional[bool] = False
# Whether to perform top-p normalization when calculating logprobs (default is False, indicating that top-p normalization is not performed).
```
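When calling through `openai.Client`, these extra fields go into `extra_body`. A minimal sketch using `guided_choice` and `chat_template_kwargs` as documented above (values are illustrative, and the constraint behavior is assumed to follow the descriptions in this list):
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
    extra_body={
        "guided_choice": ["yes", "no"],                      # constrain output to one of the candidates
        "chat_template_kwargs": {"enable_thinking": False},  # extra arguments passed to the chat template
    },
)

print(response.choices[0].message.content)
```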
### Differences in Return Fields
Additional return fields added by FastDeploy:
- `arrival_time`: Cumulative time consumed for all tokens
- `reasoning_content`: Return results of the chain of thought
- `prompt_token_ids`: List of token IDs for the input sequence
- `completion_token_ids`: List of token IDs for the output sequence
Overview of return parameters:
```python
ChatCompletionResponse:
    id: str
    object: str = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseChoice]
    usage: UsageInfo

ChatCompletionResponseChoice:
    index: int
    message: ChatMessage
    logprobs: Optional[LogProbs] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls", "recover_stop"]]

ChatMessage:
    role: str
    content: str
    reasoning_content: Optional[str] = None
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None

# Fields returned for streaming responses
ChatCompletionStreamResponse:
    id: str
    object: str = "chat.completion.chunk"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseStreamChoice]
    usage: Optional[UsageInfo] = None

ChatCompletionResponseStreamChoice:
    index: int
    delta: DeltaMessage
    logprobs: Optional[LogProbs] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]] = None
    arrival_time: Optional[float] = None

DeltaMessage:
    role: Optional[str] = None
    content: Optional[str] = None
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None
    reasoning_content: Optional[str] = None
```
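Because these extra fields are not part of the standard OpenAI response models, a plain HTTP request makes them easy to inspect. A sketch, assuming `return_token_ids` behaves as documented above:
```python
import requests

payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "return_token_ids": True,  # ask the server to include token ID lists in the response
}
resp = requests.post("http://0.0.0.0:8188/v1/chat/completions", json=payload).json()

message = resp["choices"][0]["message"]
print(message["content"])
print(message.get("reasoning_content"))     # chain-of-thought output, if any
print(message.get("prompt_token_ids"))      # token IDs of the input sequence
print(message.get("completion_token_ids"))  # token IDs of the output sequence
```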
## Completion API
The Completion API is mainly intended for text-continuation scenarios: the user supplies a fully customized context and expects the model to output only the continuation; no additional `prompt` concatenation (such as a chat template) is applied during inference.
### Sending User Requests
Here is an example of sending a user request using the curl command:
```bash
curl -X POST "http://0.0.0.0:8188/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"prompt": "以下是一篇关于深圳文心公园的500字游记和赏析"
}'
```
Here is an example of sending a user request using a Python script:
```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.completions.create(
    model="default",
    prompt="Below is a 500-word travel essay and appreciation of Wenxin Park in Shenzhen:",
    stream=False,
)
print(response.choices[0].text)
```
For an explanation of the OpenAI protocol, refer to the [OpenAI Completion API](https://platform.openai.com/docs/api-reference/completions/create).
### Compatible OpenAI Parameters
```python
model: Optional[str] = "default"
# Specifies the model name or version to use, defaulting to `"default"` (which may point to the base model).
prompt: Union[List[int], List[List[int]], str, List[str]]
# Input prompt, supporting multiple formats:
# - `str`: Plain text prompt (e.g., `"Hello, how are you?"`).
# - `List[str]`: Multiple text segments (e.g., `["User:", "Hello!", "Assistant:", "Hi!"]`).
# - `List[int]`: Directly passes a list of token IDs (e.g., `[123, 456]`).
# - `List[List[int]]`: List of multiple token ID lists (e.g., `[[123], [456, 789]]`).
best_of: Optional[int] = None
# Generates `best_of` candidate results and returns the highest-scoring one (requires `n=1`).
frequency_penalty: Optional[float] = None
# Frequency penalty coefficient; penalizes tokens in proportion to how often they have already appeared (positive values suppress repetition, negative values encourage it).
logprobs: Optional[int] = None
# Returns the log probabilities of each generated token, can specify the number of candidates to return.
max_tokens: Optional[int] = None
# Maximum number of tokens to generate, no default limit (the prompt plus the completion is restricted by the model's context window).
presence_penalty: Optional[float] = None
# Presence penalty coefficient; penalizes tokens that have already appeared at least once, encouraging the model to move on to new content (positive values encourage new topics, negative values discourage them).
```
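For example, requesting completion log probabilities together with a cap on generated tokens (values are illustrative):
```bash
curl -X POST "http://0.0.0.0:8188/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Below is a 500-word travel essay and appreciation of Wenxin Park in Shenzhen:",
        "max_tokens": 128,
        "logprobs": 5,
        "frequency_penalty": 0.5
      }'
```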
### Additional Parameters Added by FastDeploy
> Note:
> When sending requests using curl, the following parameters can be used directly in the request body;
> when sending requests using openai.Client, these parameters need to be placed in the `extra_body` argument, e.g. `extra_body={"chat_template_kwargs": {"enable_thinking": True}, "include_stop_str_in_output": True}`.

The following sampling parameters are supported:
```python
top_k: Optional[int] = None
# Limits the consideration to the top K tokens with the highest probability at each generation step, used to control randomness (default None means no limit).
min_p: Optional[float] = None
# Minimum probability threshold; tokens whose probability (relative to the most likely token) falls below min_p are filtered out of sampling (default None means disabled).
min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
include_stop_str_in_output: Optional[bool] = False
# Whether to include the stop string content in the output (default False, meaning output is truncated when a stop string is encountered).
bad_words: Optional[List[str]] = None
# List of forbidden words (e.g., sensitive words) that the model should avoid generating (default None means no restriction).
bad_words_token_ids: Optional[List[int]] = None
# List of forbidden token ids that the model should avoid generating (default None means no restriction).
repetition_penalty: Optional[float] = None
# Repetition penalty coefficient, reducing the probability of repeating already generated tokens (`>1.0` suppresses repetition, `<1.0` encourages repetition, default None means disabled).
```
The following extra parameters are supported:
```python
guided_json: Optional[Union[str, dict, BaseModel]] = None
# Guides the generation of content conforming to JSON structure, can be a JSON string, dictionary, or Pydantic model (default None).
guided_regex: Optional[str] = None
# Guides the generation of content conforming to regular expression rules (default None means no restriction).
guided_choice: Optional[List[str]] = None
# Guides the generation of content selected from a specified candidate list (default None means no restriction).
guided_grammar: Optional[str] = None
# Guides the generation of content conforming to grammar rules (such as BNF) (default None means no restriction).
return_token_ids: Optional[bool] = None
# Whether to return the token IDs of the generation results instead of text (default None means return text).
prompt_token_ids: Optional[List[int]] = None
# Directly passes the token ID list of the prompt, skipping the text encoding step (default None means using text input).
```
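A minimal sketch passing these fields through `extra_body` with the OpenAI client (the regular expression is illustrative, and the constraint behavior is assumed to follow the descriptions above):
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

response = client.completions.create(
    model="default",
    prompt="The latest release version is ",
    extra_body={
        "guided_regex": "v[0-9]+\\.[0-9]+\\.[0-9]+",  # constrain the continuation to a version-number pattern
        "return_token_ids": True,                     # also return token IDs alongside the text
    },
)

print(response.choices[0].text)
```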
### Overview of Return Parameters
```python
CompletionResponse:
    id: str
    object: str = "text_completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[CompletionResponseChoice]
    usage: UsageInfo

CompletionResponseChoice:
    index: int
    text: str
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None
    arrival_time: Optional[float] = None
    logprobs: Optional[int] = None
    reasoning_content: Optional[str] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]]

# Fields returned for streaming responses
CompletionStreamResponse:
    id: str
    object: str = "text_completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[CompletionResponseStreamChoice]
    usage: Optional[UsageInfo] = None

CompletionResponseStreamChoice:
    index: int
    text: str
    arrival_time: float = None
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None
    logprobs: Optional[float] = None
    reasoning_content: Optional[str] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]] = None
```
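For reference, a minimal streaming read of the Completion API with the OpenAI client, which consumes the `text` field of each chunk described above:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8188/v1", api_key="null")

stream = client.completions.create(
    model="default",
    prompt="Below is a 500-word travel essay and appreciation of Wenxin Park in Shenzhen:",
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="")
print()
```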