# Sampling Strategies

Sampling strategies determine how the next token is selected from a model's output probability distribution. FastDeploy currently supports several strategies, including Top-p, Top-k_Top-p, and Min-p sampling.
- **Top-p Sampling**
  - Top-p sampling truncates the cumulative probability distribution, considering only the smallest set of most likely tokens whose cumulative probability reaches a specified threshold `p`.
  - It dynamically selects the number of tokens considered, ensuring diversity in the results while avoiding unlikely tokens.
- **Top-k_Top-p Sampling**
  - First performs top-k sampling, then normalizes the probabilities within the top-k results, and finally performs top-p sampling on them.
  - By limiting the initial selection range (top-k) and then accumulating probabilities within it (top-p), it improves the quality and coherence of the generated text.
- **Min-p Sampling**
  - Min-p sampling computes `pivot = max_prob * min_p`, then retains only tokens with probabilities greater than the pivot (setting the others to zero) for subsequent sampling.
  - It filters out tokens with relatively low probabilities, sampling only from high-probability tokens to improve generation quality (all three filters are sketched below).
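To make the three strategies concrete, here is a minimal NumPy sketch of the filtering step each one applies to a toy next-token distribution. This is an illustrative reimplementation, not FastDeploy's actual sampling kernels, and the function names are ours:

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]            # token ids sorted by descending probability
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, top_p) + 1  # size of the smallest set reaching top_p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()                     # renormalize over the kept tokens

def top_k_top_p_filter(probs, top_k, top_p):
    """Restrict to the top_k tokens, renormalize, then apply top-p within that set."""
    keep = np.argsort(probs)[::-1][:top_k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return top_p_filter(out / out.sum(), top_p)

def min_p_filter(probs, min_p):
    """Zero out tokens at or below pivot = max_prob * min_p, then renormalize."""
    pivot = probs.max() * min_p
    out = np.where(probs > pivot, probs, 0.0)
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # toy next-token distribution
print(top_p_filter(probs, 0.8))                 # keeps tokens 0-2 (0.5 + 0.2 + 0.15 >= 0.8)
print(top_k_top_p_filter(probs, 3, 0.8))        # top-3 first, then top-p within them
print(min_p_filter(probs, 0.1))                 # pivot = 0.05, drops the last token
```

The next token is then drawn from the filtered, renormalized distribution.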
## Usage Instructions
During deployment, you can choose the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. Available values are `base`, `base_non_truncated`, `air`, and `rejection`.
### Algorithms Supporting Only Top-p Sampling
- `base` (default): directly normalizes using the `top_p` value, favoring tokens with higher probabilities.
- `base_non_truncated`: strictly follows the Top-p sampling logic: first selects the smallest set of tokens whose cumulative probability reaches `top_p`, then normalizes over the selected elements.
- `air`: inspired by TensorRT-LLM; supports Top-p sampling.
### Algorithms Supporting Top-p and Top-k_Top-p Sampling
- `rejection`: inspired by flashinfer; allows flexibly setting the `top_k` and `top_p` parameters for Top-p or Top-k_Top-p sampling (see the sketch below).
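For intuition on the rejection approach: instead of sorting and truncating the whole vocabulary up front, it draws a token from the full distribution and accepts it only if it falls inside the top-k/top-p set, retrying otherwise; accepted draws follow exactly the truncated, renormalized distribution. The sketch below is conceptual only (the acceptance mask is precomputed here for brevity, whereas flashinfer-style GPU kernels avoid a full sort), and the names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(probs, allowed, max_tries=1000):
    """Sample from probs restricted to the allowed set: draw from the full
    distribution and retry until the draw lands inside the allowed set."""
    for _ in range(max_tries):
        tok = rng.choice(len(probs), p=probs)
        if allowed[tok]:
            return tok
    raise RuntimeError("acceptance rate too low")

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
allowed = probs >= np.sort(probs)[-3]    # acceptance mask for top_k = 3
draws = [rejection_sample(probs, allowed) for _ in range(10_000)]
print(np.bincount(draws, minlength=5) / len(draws))  # ~[0.59, 0.24, 0.18, 0, 0]
```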
## Configuration Method
### Top-p Sampling
- During deployment, set the environment variable to select the sampling algorithm (the default is `base`):

```bash
export FD_SAMPLING_CLASS=rejection  # or base, base_non_truncated, air
```
- When sending a request, specify the following parameters:
- Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "How old are you"}
  ],
  "top_p": 0.8
}'
```
- Example request with Python:
```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    stream=True,
    top_p=0.8
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
### Top-k_Top-p Sampling
- During deployment, set the environment variable to select the rejection sampling algorithm:
```bash
export FD_SAMPLING_CLASS=rejection
```
- When sending a request, specify the following parameters:
- Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "How old are you"}
  ],
  "top_p": 0.8,
  "top_k": 50
}'
```
- Example request with Python:
```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    stream=True,
    top_p=0.8,
    top_k=50
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
### Min-p Sampling
If you want to use min-p sampling before top-p or top-k_top-p sampling, specify the following parameters when sending a request:
- Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "How old are you"}
  ],
  "min_p": 0.1,
  "top_p": 0.8,
  "top_k": 20
}'
```
- Example request with Python:
```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    stream=True,
    top_p=0.8,
    top_k=20,
    min_p=0.1
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
With the configurations above, you can flexibly choose the sampling strategy that best fits a given generation task.
## Parameter Description
- `top_p`: cumulative probability truncation threshold; only the smallest set of most likely tokens whose cumulative probability reaches this threshold is considered. Type: float, range [0.0, 1.0]. With `top_p=1.0`, all tokens are considered; with `top_p=0.0`, sampling degenerates into greedy search.
- `top_k`: number of highest-probability tokens to sample from, limiting the sampling range to the top k tokens. Type: int, range [0, vocab_size].
- `min_p`: low-probability filtering threshold; only tokens with probability greater than or equal to `max_prob * min_p` are considered (worked example below). Type: float, range [0.0, 1.0].
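As a quick worked example of how the three parameters combine (illustrative numbers only, assuming min-p is applied before top-k and top-p as described above):

```python
probs = [0.4, 0.3, 0.2, 0.07, 0.03]        # toy next-token probabilities
min_p, top_k, top_p = 0.2, 3, 0.9

pivot = max(probs) * min_p                 # 0.4 * 0.2 = 0.08
kept = sorted((p for p in probs if p >= pivot), reverse=True)  # min-p: [0.4, 0.3, 0.2]
kept = kept[:top_k]                        # top-k: no-op here, already 3 tokens
# top-p: smallest prefix of the survivors whose cumulative probability reaches top_p
final, acc = [], 0.0
for p in kept:
    final.append(p)
    acc += p
    if acc >= top_p:
        break
total = sum(final)
print([p / total for p in final])          # renormalized: ~[0.444, 0.333, 0.222]
```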
# Bad Words
The `bad_words` parameter prevents the model from generating specific words during inference. It is commonly used for safety control, content filtering, and constraining model behavior.
## Usage Instructions
Include the `bad_words` parameter in the request:
- Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "How old are you"}
  ],
  "bad_words": ["age", "I"]
}'
```
- Example request with Python:
```python
import openai

host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    extra_body={"bad_words": ["you", "me"]},
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
## Parameter Description
- `bad_words`: list of forbidden words. Type: list of str. Each word must correspond to a single token (see the check below).
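Because each entry in `bad_words` must correspond to a single token, it can help to check candidate words against the deployed model's tokenizer before sending a request. A minimal sketch, assuming a Hugging Face-style tokenizer and a hypothetical model path:

```python
from transformers import AutoTokenizer

# Hypothetical path -- substitute the tokenizer of the model you actually deployed.
tokenizer = AutoTokenizer.from_pretrained("your/model-path")

for word in ["age", "I"]:
    ids = tokenizer.encode(word, add_special_tokens=False)
    note = "ok" if len(ids) == 1 else f"spans {len(ids)} tokens; will not work as a bad word"
    print(f"{word!r} -> {ids} ({note})")
```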