Generative AI Product Problems #3: Latency

Intelligence isn't everything. Sometimes, LLMs are just too slow for good user experiences.

LLMs generate tokens one…

at…

a…

time.

That starts to feel slow pretty quickly.

It seems like a no-brainer that LLMs will replace every first-generation customer support chatbot. LLMs are surely smarter, but they lose big in one category: speed.

GPT-3.5’s latencies start in the ballpark of 0.5 seconds. For GPT-4, expect even the shortest tasks to take at least 1 second.

LLM latency grows roughly linearly with the number of tokens in the output, because tokens are generated sequentially: total latency is essentially a fixed time to first token plus the output length divided by the generation speed. Asking for a multi-paragraph response from GPT-4? You could be waiting on the order of minutes or more.
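As a rough back-of-envelope sketch (the ~20 tokens-per-second decode speed and 0.5-second time to first token below are illustrative assumptions, not measured figures):

```python
def estimate_latency_s(output_tokens: int,
                       tokens_per_second: float = 20.0,    # assumed decode speed
                       time_to_first_token_s: float = 0.5  # assumed startup cost
                       ) -> float:
    """Rough total latency: fixed time to first token plus sequential decoding time."""
    return time_to_first_token_s + output_tokens / tokens_per_second

print(estimate_latency_s(30))   # short reply (~30 tokens): about 2 seconds
print(estimate_latency_s(800))  # multi-paragraph reply (~800 tokens): about 40 seconds
```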

The quick solution is to set a cap on the maximum number of tokens you request from the LLM. Output length is the dominant factor in latency: output tokens are generated one at a time, while the input prompt is processed in parallel, so trimming your input makes comparatively little difference.
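As a minimal sketch with the OpenAI Python client (the 150-token cap and the prompt are arbitrary illustrations):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize your return policy in two sentences."}],
    max_tokens=150,  # hard cap on output length, which bounds worst-case latency
)
print(response.choices[0].message.content)
```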

You could also try using a different model. Do you really need GPT-4 in all parts of your product? You could reduce latency and cost by switching to GPT-3.5.
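For instance, here is a simple routing sketch; the keyword heuristic for deciding what counts as a "routine" request is purely hypothetical:

```python
def pick_model(user_message: str) -> str:
    """Route routine requests to the faster, cheaper model and reserve GPT-4
    for harder ones. The keyword list is an illustrative placeholder."""
    routine_topics = ("order status", "opening hours", "reset password")
    if any(topic in user_message.lower() for topic in routine_topics):
        return "gpt-3.5-turbo"  # lower latency and cost
    return "gpt-4"              # keep the stronger model for complex requests
```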

You may not even need an LLM to handle all parts of your application. Sometimes, old-fashioned business logic works perfectly fine and is much faster.
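For example, a hypothetical canned-answer lookup that short-circuits the LLM entirely for the most common questions:

```python
# Hypothetical FAQ table: the questions and answers are placeholders.
CANNED_ANSWERS = {
    "opening hours": "We're open 9am to 5pm, Monday through Friday.",
    "reset password": "You can reset your password from the account settings page.",
}

def quick_answer(question: str) -> str | None:
    """Return an instant canned answer if one matches; otherwise return None
    so the caller can fall back to the (slower) LLM."""
    q = question.lower()
    for keyword, reply in CANNED_ANSWERS.items():
        if keyword in q:
            return reply
    return None
```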

Test these tradeoffs before committing to them. Speed is important, but understand what else you’re sacrificing to achieve it.

With Context.ai, you can run A/B tests on your LLM-powered products. Experiment with strategies to reduce latency and measure the end-user impact. That way, you can find the right balance between speed and quality.

To learn more, request a demo at context.ai/demo.
