diff --git a/docs/offline_inference.md b/docs/offline_inference.md
index 20d7fe59c..48bb56aaf 100644
--- a/docs/offline_inference.md
+++ b/docs/offline_inference.md
@@ -4,6 +4,7 @@
 FastDeploy supports offline inference by loading models locally and processing user data. Usage examples:
 
 ### Chat Interface (LLM.chat)
+
 ```python
 from fastdeploy import LLM, SamplingParams
@@ -77,10 +78,12 @@ for output in outputs:
     prompt = output.prompt
     generated_text = output.outputs.text
 ```
+
 > Note: Text completion interface, suitable for scenarios where users have predefined the context input and expect the model to output only the continuation content. No additional `prompt` concatenation will be added during the inference process.
 > For the `chat` model, it is recommended to use the Chat Interface (`LLM.chat`).
 
 For multimodal models, such as `baidu/ERNIE-4.5-VL-28B-A3B-Paddle`, when calling the `generate interface`, you need to provide a prompt that includes images. The usage is as follows:
+
 ```python
 import io
 import os
@@ -96,7 +99,7 @@ tokenizer = ErnieBotTokenizer.from_pretrained(os.path.dirname(PATH))
 
 messages = [
     {
-        "role": "user", 
+        "role": "user",
         "content": [
             {"type":"image_url", "image_url": {"url":"https://ku.baidu-int.com/vk-assets-ltd/space/2024/09/13/933d1e0a0760498e94ec0f2ccee865e0"}},
             {"type":"text", "text":"这张图片的内容是什么"}
@@ -141,6 +144,7 @@ for output in outputs:
     reasoning_text = output.outputs.reasoning_content
 ```
+
 >Note: The `generate interface` does not currently support passing parameters to control the thinking function (on/off). It always uses the model's default parameters.
 
 ## 2. API Documentation
@@ -159,12 +163,12 @@ For ```LLM``` configuration, refer to [Parameter Documentation](parameters.md).
 * messages(list[dict],list[list[dict]]): Input messages (batch supported)
 * sampling_params: See 2.4 for parameter details
 * use_tqdm: Enable progress visualization
-* chat_template_kwargs(dict): Extra template parameters (currently supports enable_thinking(bool))
+* chat_template_kwargs(dict): Extra template parameters (currently supports enable_thinking(bool)) *usage example: `chat_template_kwargs={"enable_thinking": False}`*
 
 ### 2.3 fastdeploy.LLM.generate
 
-* prompts(str, list[str], list[int], list[list[int]], dict[str, Any], list[dict[str, Any]]): : Input prompts (batch supported), accepts decoded token ids
+* prompts(str, list[str], list[int], list[list[int]], dict[str, Any], list[dict[str, Any]]): Input prompts (batch supported), accepts decoded token ids *example of using a dict-type parameter: `prompts={"prompt": prompt, "multimodal_data": {"image": images}}`*
 * sampling_params: See 2.4 for parameter details
 * use_tqdm: Enable progress visualization
@@ -176,6 +180,7 @@ For ```LLM``` configuration, refer to [Parameter Documentation](parameters.md).
 * repetition_penalty(float): Direct penalty for repeated tokens (>1 penalizes, <1 encourages)
 * temperature(float): Controls randomness (higher = more random)
 * top_p(float): Probability threshold for token selection
+* top_k(int): Number of highest-probability tokens to consider during sampling
 * max_tokens(int): Maximum generated tokens (input + output)
 * min_tokens(int): Minimum forced generation length
@@ -206,4 +211,4 @@ For ```LLM``` configuration, refer to [Parameter Documentation](parameters.md).
 * first_token_time(float): First token latency
 * time_in_queue(float): Queuing time
 * model_forward_time(float): Forward pass duration
-* model_execute_time(float): Total execution time (including preprocessing)
\ No newline at end of file
+* model_execute_time(float): Total execution time (including preprocessing)
diff --git a/docs/zh/offline_inference.md b/docs/zh/offline_inference.md
index 382e65740..eac4a8b5b 100644
--- a/docs/zh/offline_inference.md
+++ b/docs/zh/offline_inference.md
@@ -78,10 +78,12 @@ for output in outputs:
     prompt = output.prompt
     generated_text = output.outputs.text
 ```
-> 注: 续写接口, 适应于用户自定义好上下文输入, 并希望模型仅输出续写内容的场景; 推理过程不会增加其他 `prompt `拼接。
+
+> 注: 续写接口, 适用于用户自定义好上下文输入, 并希望模型仅输出续写内容的场景; 推理过程不会增加其他 `prompt`拼接。
 > 对于 `chat`模型, 建议使用对话接口(LLM.chat)。
 
 对于多模模型, 例如`baidu/ERNIE-4.5-VL-28B-A3B-Paddle`, 在调用`generate接口`时, 需要提供包含图片的prompt, 使用方式如下:
+
 ```python
 import io
 import os
@@ -97,7 +99,7 @@ tokenizer = ErnieBotTokenizer.from_pretrained(os.path.dirname(PATH))
 
 messages = [
     {
-        "role": "user", 
+        "role": "user",
         "content": [
             {"type":"image_url", "image_url": {"url":"https://ku.baidu-int.com/vk-assets-ltd/space/2024/09/13/933d1e0a0760498e94ec0f2ccee865e0"}},
             {"type":"text", "text":"这张图片的内容是什么"}
@@ -142,6 +144,7 @@ for output in outputs:
     reasoning_text = output.outputs.reasoning_content
 ```
+
 > 注: `generate` 接口, 暂时不支持思考开关参数控制, 均使用模型默认思考能力。
 
 ## 2. 接口说明
@@ -155,18 +158,17 @@ for output in outputs:
 > 2. 模型服务启动后，会在日志文件log/fastdeploy.log中打印如 `Doing profile, the total_block_num:640` 的日志，其中640即表示自动计算得到的KV Cache block数量，将它乘以block_size(默认值64)，即可得到部署后总共可以在KV Cache中缓存的Token数。
 > 3. `max_num_seqs` 用于配置decode阶段最大并发处理请求数，该参数可以基于第1点中缓存的Token数来计算一个较优值，例如线上统计输入平均token数800, 输出平均token数500，本次计算得到KV Cache block为640, block_size为64。那么我们可以配置 `kv_cache_ratio = 800 / (800 + 500) = 0.6` , 配置 `max_seq_len = 640 * 64 / (800 + 500) = 31`。
 
-
 ### 2.2 fastdeploy.LLM.chat
 
 * messages(list[dict],list[list[dict]]): 输入的message, 支持batch message 输入
 * sampling_params: 模型超参设置具体说明见2.4
 * use_tqdm: 是否打开推理进度可视化
-* chat_template_kwargs(dict): 传递给对话模板的额外参数,当前支持enable_thinking(bool)
+* chat_template_kwargs(dict): 传递给对话模板的额外参数,当前支持enable_thinking(bool) *使用示例`chat_template_kwargs={"enable_thinking": False}`*
 
 ### 2.3 fastdeploy.LLM.generate
 
-* prompts(str, list[str], list[int], list[list[int]], dict[str, Any], list[dict[str, Any]]): 输入的prompt, 支持batch prompt 输入，解码后的token ids 进行输入
+* prompts(str, list[str], list[int], list[list[int]], dict[str, Any], list[dict[str, Any]]): 输入的prompt, 支持batch prompt 输入，解码后的token ids 进行输入 *dict 类型使用示例`prompts={"prompt": prompt, "multimodal_data": {"image": images}}`*
 * sampling_params: 模型超参设置具体说明见2.4
 * use_tqdm: 是否打开推理进度可视化
@@ -178,7 +180,7 @@ for output in outputs:
 * repetition_penalty(float): 直接对重复生成的token进行惩罚的系数(>1时惩罚重复,<1时鼓励重复)
 * temperature(float): 控制生成随机性的参数，值越高结果越随机，值越低结果越确定
 * top_p(float): 概率累积分布截断阈值，仅考虑累计概率达到此阈值的最可能token集合
-* top_k(int): 采样概率最高的的token数量，考虑概率最高的k个token进行采样
+* top_k(int): 采样概率最高的token数量，考虑概率最高的k个token进行采样
 * max_tokens(int): 限制模型生成的最大token数量（包括输入和输出）
 * min_tokens(int): 强制模型生成的最少token数量，避免过早结束
diff --git a/fastdeploy/model_executor/layers/sample/ops/__init__.py b/fastdeploy/model_executor/layers/sample/ops/__init__.py
index 73424e5be..37c803ca3 100644
--- a/fastdeploy/model_executor/layers/sample/ops/__init__.py
+++ b/fastdeploy/model_executor/layers/sample/ops/__init__.py
@@ -16,10 +16,10 @@
 from .apply_penalty_multi_scores import (
     apply_penalty_multi_scores, apply_speculative_penalty_multi_scores)
-from .top_p_sampling import top_p_sampling
+from .top_k_top_p_sampling import top_k_top_p_sampling
 
 __all__ = [
     "apply_penalty_multi_scores",
     "apply_speculative_penalty_multi_scores",
-    "top_p_sampling",
+    "top_k_top_p_sampling",
 ]
diff --git a/fastdeploy/model_executor/layers/sample/ops/top_p_sampling.py b/fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py
similarity index 99%
rename from fastdeploy/model_executor/layers/sample/ops/top_p_sampling.py
rename to fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py
index 08635f810..04eea97a2 100644
--- a/fastdeploy/model_executor/layers/sample/ops/top_p_sampling.py
+++ b/fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py
@@ -25,7 +25,7 @@ if current_platform.is_gcu():
     from fastdeploy.model_executor.ops.gcu import \
         top_p_sampling as gcu_top_p_sampling
 
-def top_p_sampling(
+def top_k_top_p_sampling(
     x: paddle.Tensor,
     top_p: paddle.Tensor,
     top_k: Optional[paddle.Tensor] = None,
diff --git a/fastdeploy/model_executor/layers/sample/sampler.py b/fastdeploy/model_executor/layers/sample/sampler.py
index b5e44af11..598a77ee8 100644
--- a/fastdeploy/model_executor/layers/sample/sampler.py
+++ b/fastdeploy/model_executor/layers/sample/sampler.py
@@ -27,7 +27,7 @@ from fastdeploy.model_executor.guided_decoding.base_guided_decoding import \
 from fastdeploy.model_executor.layers.sample.meta_data import SamplingMetadata
 from fastdeploy.model_executor.layers.sample.ops import (
     apply_penalty_multi_scores, apply_speculative_penalty_multi_scores,
-    top_p_sampling)
+    top_k_top_p_sampling)
 from fastdeploy.platforms import current_platform
@@ -214,7 +214,7 @@ class Sampler(nn.Layer):
         probs = F.softmax(logits)
 
-        _, next_tokens = top_p_sampling(probs, sampling_metadata.top_p, sampling_metadata.top_k)
+        _, next_tokens = top_k_top_p_sampling(probs, sampling_metadata.top_p, sampling_metadata.top_k)
 
         self.processor.update_output_tokens(next_tokens, skip_idx_list)
 
         return next_tokens
@@ -367,5 +367,5 @@ class MTPSampler(nn.Layer):
         )
         probs = F.softmax(logits)
 
-        _, next_tokens = top_p_sampling(probs, sampling_metadata.top_p, sampling_metadata.top_k)
+        _, next_tokens = top_k_top_p_sampling(probs, sampling_metadata.top_p, sampling_metadata.top_k)
         return next_tokens
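
The documentation hunks above add `top_k` next to `top_p` in `fastdeploy.SamplingParams`. A minimal offline-inference sketch of the combined settings follows; it is not part of the patch itself and only reuses names that appear in the edited docs (`LLM`, `SamplingParams`, `LLM.generate`, `output.outputs.text`). The model name and `max_model_len` value are illustrative placeholders, and the comments restate the parameter descriptions from section 2.4.

```python
# Hedged sketch: exercises the newly documented top_k parameter together with top_p.
from fastdeploy import LLM, SamplingParams

prompts = ["The largest ocean on Earth is"]

# Per section 2.4: top_k bounds the number of highest-probability tokens considered,
# and top_p truncates the candidate set by cumulative probability.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, top_k=20)

# Placeholder model name and max_model_len; substitute a locally available model.
llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs.text)
```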
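
Section 2.2 now documents `chat_template_kwargs` with the usage example `chat_template_kwargs={"enable_thinking": False}`. The sketch below shows that call pattern end to end; again, the model name is a placeholder and the message content and sampling values are illustrative, not taken from this diff.

```python
# Hedged sketch of the chat_template_kwargs usage documented in section 2.2.
from fastdeploy import LLM, SamplingParams

messages = [{"role": "user", "content": "把李白的《静夜思》改写成现代诗"}]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Placeholder model name; use any locally deployed chat model.
llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)

# chat_template_kwargs is forwarded to the chat template; enable_thinking=False
# turns off the reasoning ("thinking") output for models that support it.
outputs = llm.chat(messages, sampling_params, chat_template_kwargs={"enable_thinking": False})

print(outputs[0].outputs.text)
```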