Chamath Palihapitiya recently hosted a Twitter Space with Jonathan Ross, CEO and founder of Groq, an AI solutions company.
They covered Groq’s origin story, generative AI metrics, open source vs closed source models, and much more.
Below is the Twitter Space and my notes from the conversation.
Generative AI is Sequential
Generative AI has a sequential nature: you can't predict the 100th token or word until you've predicted the 99th token or word.
It’s like playing chess. At any point in a chess game, there are about 30 moves you can make, and the move you make determines which moves can come next. With an LLM it’s exactly the same, except instead of 30 moves there are about 32,000: that’s how many tokens the model can choose from. If I start saying, “The President of Botswana is”, but I don’t know who the President is, I’ve checkmated myself. I’m now committed to giving the answer. Like chess, you have to think sequentially: which tokens am I putting out there, and what comes next?
The existing hardware is not great at sequential processing; it’s great at processing things in parallel that don’t depend on each other.
LPUs are designed specifically to run sequential workloads very fast.
Quality & Speed
For generative AI and LLMs, the two metrics to measure are quality and speed.
Quality is how good the output the model produces is. If the quality is bad, users won’t trust it and won’t use it.
Speed matters because users are accustomed to it: they expect websites and apps to load fast, and they will expect the model to deliver results just as quickly. Useful speed metrics:
- Time to first token
- Tokens per second (output rate)
- Time to first sentence
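Two of the metrics above can be measured against any streaming interface. The sketch below assumes only an iterable that yields tokens as they arrive (real APIs may stream chunks rather than single tokens):

```python
import time

def measure_stream(token_stream):
    """Measure time-to-first-token and output tokens/sec for any
    iterable that yields tokens as they are produced."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            # Latency until the very first token arrives.
            first_token_time = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token": first_token_time,
        "tokens_per_sec": count / total if total > 0 else 0.0,
        "total_tokens": count,
    }
```

Time to first sentence can be measured the same way by stopping the first-arrival clock at the first sentence-ending token instead of the first token.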
Open source models have high quality (arguably even higher than closed source models), so quality is essentially free. The question then becomes: where will you get speed?
App Developer Considerations
Watch the new foundational models as they come out. The risk of fine-tuning your model on a tokens-as-a-service business is that you are locked in with them: when an even better foundational model comes out, you have to re-fine-tune your model, which takes forever.
Look for providers that run the largest, most powerful open-source models at a cost your application can afford. In other words: the largest models, at the lowest cost, with speed.
The larger the model, the better it is at recognizing when a hallucination is coming and avoiding it. A common mistake is running a smaller model because it’s faster, only to start getting more hallucinations.
When making the design decision between open source and closed source models: take the largest raw, non-fine-tuned open source model possible, because that carries the least business risk.
Ask an LLM to do a task (e.g. write code, write a document) and look at how good the output is, considering that the LLM had no access to a Backspace or Delete key, yet made no mistakes. We are handicapping these models by asking them for stream-of-consciousness output, and that’s what we accept.
Reflection means saying: I got this output; what are 5 ways this answer could be better? Great, now apply those 5 improvements to your answer.
General rule: three reflections yield roughly a model generation’s worth of quality improvement.
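The reflection loop is straightforward to wire up. In the sketch below, `llm` is a hypothetical callable (prompt in, text out) standing in for whatever chat-completion client you use; the prompts and the interface are assumptions, not any specific API:

```python
def reflect(task, llm, rounds=3):
    """Run the reflection loop: generate an answer, then repeatedly
    ask the model to critique and improve it. `llm` is any callable
    mapping a prompt string to a response string (assumed interface).
    ~3 rounds is the rule of thumb from the talk."""
    answer = llm(task)
    for _ in range(rounds):
        # Ask the model to critique its own answer.
        critique = llm(
            f"Here is an answer to the task:\n{task}\n\n"
            f"Answer:\n{answer}\n\n"
            "What are 5 ways this answer could be better?"
        )
        # Ask it to rewrite the answer applying the critique.
        answer = llm(
            f"Task: {task}\n"
            f"Current answer: {answer}\n"
            f"Improvements to apply:\n{critique}\n"
            "Rewrite the answer applying these improvements."
        )
    return answer
```

Note the cost trade-off: each reflection round adds two extra model calls, so three rounds means seven calls total per task.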
An independent analysis of AI models and hosting providers was also discussed, highlighting practical metrics such as model quality, speed (tokens per second), and price.