LLMs
Announcing Together Inference Engine – the fastest inference available
November 13, 2023・By Together
The Together Inference Engine is multiple times faster than any other inference service, with 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat
Today we are announcing Together Inference Engine, the world’s... See more
November 13, 2023・By Together
The Together Inference Engine is multiple times faster than any other inference service, with 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat
Today we are announcing Together Inference Engine, the world’s... See more
Announcing Together Inference Engine – the fastest inference available
Matei Zaharia, Omar Khattab, Lingjiao Chen, et al. • The Shift From Models to Compound AI Systems
When it comes to identifying where generative AI can make an impact, we dig into challenges that commonly:
- Involve analysis, interpretation, or review of unstructured content (e.g. text) at scale
- Require massive scaling that may be otherwise prohibitive due to limited resources
- Would be challenging for rules-based or traditional ML approaches
Developing Rapidly with Generative AI
pair-preference-model-LLaMA3-8B by RLHFlow: Really strong reward model, trained to take in two inputs at once, which is the top open reward model on RewardBench (beating one of Cohere’s).
DeepSeek-V2 by deepseek-ai (21B active, 236B total param.): Another strong MoE base model from the DeepSeek team. Some people are questioning the very high MMLU... See more
DeepSeek-V2 by deepseek-ai (21B active, 236B total param.): Another strong MoE base model from the DeepSeek team. Some people are questioning the very high MMLU... See more
Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Table of Contents
1. Introduction
Large... See more
Table of Contents
- Introduction
- Key LLM Serving Techniques
- Dynamic SplitFuse: A Novel Prompt and Generation Composition Strategy
- Performance Evaluation
- DeepSpeed-FastGen: Implementation and Usage
- Try out DeepSpeed-FastGen
- Acknowledgements
1. Introduction
Large... See more
microsoft • DeepSpeed-FastGen
GPT-4 Turbo can accept images as inputs in the Chat Completions API, enabling use cases such as generating captions, analyzing real world images in detail, and reading documents with figures. For example, BeMyEyes uses this technology to help people who are blind or have low vision with daily tasks like identifying a product or navigating a store.... See more
New models and developer products announced at DevDay
The OpenAI Assistants API offers more than a simple prompt-sharing interface; it provides a sophisticated framework for AI interactions. It allows for persistent conversation sessions with automatic context management (Threads), structured interactions (Messages and Runs), integration with various tools for enhanced capabilities, customization... See more
Discord - A New Way to Chat with Friends & Communities
Setting up the necessary machine learning infrastructure to run these big models is another challenge. We need a dedicated model server for running model inference (using frameworks like Triton oder vLLM), powerful GPUs to run everything robustly, and configurability in our servers to make sure they're high throughput and low latency. Tuning the... See more
Developing Rapidly with Generative AI
Disruptive innovation comes in two flavors: (1) New-market disruption, where the company creates and claims a new segment in an existing market by catering to an underserved customer base, or (2) Low-end disruption, in which a company uses a low-cost business model to enter at the bottom of an existing market and claim a segment.
Copilots don’t... See more
Copilots don’t... See more