Requesting a large number of generated tokens can lead to increased latencies:
- Lower max tokens: for requests with a similar token generation count, those that have a lower max_tokens parameter incur less latency.
- Include stop sequences: to prevent generating unneeded tokens, add a stop sequence. For example, you can use a stop sequence to cut off a generated numbered list once it reaches the desired length. A minimal request sketch follows this list.
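Here is a minimal sketch of both tips, assuming the Python `openai` client and the classic completions endpoint; the model name and prompt are only placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Keep max_tokens close to the length you actually need, and add a stop
# sequence ("11.") so a numbered list ends after 10 items instead of
# running on until the token limit is hit.
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # example model name
    prompt="List 10 ice cream flavors:\n1.",
    max_tokens=100,                  # low ceiling: fewer tokens to generate
    stop=["11."],                    # generation halts when "11." would be emitted
)

print(response.choices[0].text)
```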
Intuition: prompt tokens add very little latency to completion calls. Generating completion tokens takes much longer, because tokens are produced one at a time, so longer generation lengths accumulate latency with each additional token.
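You can see this accumulation by timing the same prompt at different generation lengths. A rough sketch, again assuming the Python client and a placeholder model name (actual timings will vary):

```python
import time

from openai import OpenAI

client = OpenAI()

def timed_completion(max_tokens: int) -> float:
    """Return wall-clock seconds for one completion of the given length."""
    start = time.perf_counter()
    client.completions.create(
        model="gpt-3.5-turbo-instruct",  # example model name
        prompt="Write a short story about a lighthouse keeper.",
        max_tokens=max_tokens,
    )
    return time.perf_counter() - start

# The long request should take noticeably longer than the short one,
# roughly in proportion to the number of tokens actually generated.
print(f"short (16 tokens):  {timed_completion(16):.2f}s")
print(f"long  (256 tokens): {timed_completion(256):.2f}s")
```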
Depending on your use case, batching may help. If you are sending multiple requests to the same endpoint, you can batch the prompts into a single request, which reduces the number of requests you need to make. The prompt parameter can hold up to 20 unique prompts. We advise you to test out this method and see if it helps, since in some cases it may not reduce overall latency.
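A sketch of batching with the completions endpoint, assuming the Python client; the prompts and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Up to 20 prompts can be sent in a single completions request.
prompts = [
    "Translate to French: Good morning.",
    "Translate to French: How are you?",
    "Translate to French: See you tomorrow.",
]

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # example model name
    prompt=prompts,                  # a list of prompts is batched into one request
    max_tokens=32,
)

# Choices are not guaranteed to come back in prompt order, so match them
# to their prompts via the index field on each choice.
completions = [""] * len(prompts)
for choice in response.choices:
    completions[choice.index] = choice.text

for prompt, completion in zip(prompts, completions):
    print(f"{prompt!r} -> {completion.strip()!r}")
```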