LLMs
Memory Considerations
Since co-occurrence matrices are square, they grow quadratically with the number of entities being embedded. For 50k entities and a 32-bit data format, a dense matrix already takes 10GB. 100k entities puts it at 40GB.
If you are trying to embed even more entities than that or have limited RAM available, you may need to use a...
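The figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a fully dense square matrix with 4 bytes per value (the 32-bit format mentioned in the excerpt) and decimal gigabytes; the function name `dense_matrix_bytes` is hypothetical and not taken from the original article.

```python
def dense_matrix_bytes(n_entities: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for a dense n x n matrix of fixed-width values.

    Assumption: every cell is stored, so memory is n * n * bytes_per_value.
    """
    return n_entities * n_entities * bytes_per_value


if __name__ == "__main__":
    for n in (50_000, 100_000):
        gb = dense_matrix_bytes(n) / 1e9  # decimal gigabytes
        print(f"{n:,} entities -> {gb:.0f} GB")
```

Doubling the number of entities quadruples the memory, which is the quadratic growth the excerpt describes.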
What I've Learned Building Interactive Embedding Visualizations
Deploying a Generative AI model requires more than a VM with a GPU. It normally includes:
- Container Service: Most often Kubernetes to run LLM Serving solutions like Hugging Face Text Generation Inference or vLLM.
- Compute Resources: GPUs for running models, CPUs for management services
- Networking and DNS: Routing traffic to the appropriate
Understanding the Cost of Generative AI Models in Production
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Table of Contents
- Introduction
- Key LLM Serving Techniques
- Dynamic SplitFuse: A Novel Prompt and Generation Composition Strategy
- Performance Evaluation
- DeepSpeed-FastGen: Implementation and Usage
- Try out DeepSpeed-FastGen
- Acknowledgements
1. Introduction
Large...
microsoft • DeepSpeed-FastGen
The Gemini API context caching feature is designed to reduce the cost of requests that contain repeated content with high input token counts.
When to use context caching
Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases...
Context caching guide | Google AI for Developers | Google for Developers
So right now, LLMs (Large Language Models) are all the rage. But in the future, it’s possible that the way we get things done is composing things with a combination of LLMs, SMMs (Small, Mighty Models), agents and tools.
It’s what I call Cognitive Composition (because it sounds cool and I have a longtime love affair with alliteration).
This is how we...
Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]
The OpenAI Assistants API offers more than a simple prompt-sharing interface; it provides a sophisticated framework for AI interactions. It allows for persistent conversation sessions with automatic context management (Threads), structured interactions (Messages and Runs), integration with various tools for enhanced capabilities, customization...
Discord - A New Way to Chat with Friends & Communities
The multiple cantilevered AI overhangs:
Compute overhang. We have much more compute than we are using. Scale can go much further.
Idea overhang. There are many obvious research ideas and combinations of ideas that haven’t been tried in earnest yet.
Capability overhang. Even if we stopped all research now, it would take ten years to digest the new...
Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]
“I think a lot of people obviously want to talk about the sexy kind of new consumer applications. I would tell you that I think that the earliest and most significant effect that AI is going to have on our company is actually going to be as it relates to our developer productivity. Some of the tools that we’re seeing are going to allow our devs to...
Adam Huda • The Transformative Power of Generative AI in Software Development: Lessons from Uber's Tech-Wide Hackathon
Setting up the necessary machine learning infrastructure to run these big models is another challenge. We need a dedicated model server for running model inference (using frameworks like Triton or vLLM), powerful GPUs to run everything robustly, and configurability in our servers to make sure they're high throughput and low latency. Tuning the...