LLM Inference
The human-centric platform for production ML & AI
Access data easily, scale compute cost-efficiently, and ship to production confidently with fully managed infrastructure, running securely in your cloud.
Infrastructure for ML, AI, and Data Science | Outerbounds
MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with KFServing's V2 Dataplane spec.
- Multi-model serving, letting users run multiple models within the same process.
- Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.
GitHub - SeldonIO/MLServer: An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
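A minimal sketch of querying a model served by MLServer over the KServe/KFServing V2 REST dataplane. The host, port, model name (`my-model`), tensor name, and shape are assumptions for illustration; adjust them to match your deployment and model-settings.

```python
# Hedged sketch: call the V2 dataplane inference endpoint of a running MLServer.
# Host/port, model name, tensor name, and shape are assumptions, not defaults you
# can rely on; MLServer's HTTP port and your model's input schema may differ.
import requests

inference_request = {
    "inputs": [
        {
            "name": "input-0",                 # tensor name expected by the model (assumed)
            "shape": [1, 4],                   # one row of four features (assumed)
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],      # flat, row-major tensor contents
        }
    ]
}

# V2 dataplane inference route: /v2/models/{model_name}/infer
resp = requests.post(
    "http://localhost:8080/v2/models/my-model/infer",
    json=inference_request,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["outputs"])
```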
The Most Affordable Cloud for AI/ML Inference at Scale
Deploy AI/ML production models without headaches on the lowest priced GPUs (starting from $0.02/hr) in the market. Get 10X-100X more inferences per dollar compared to managed services and hyperscalers.
Salad - GPU Cloud | 10k+ GPUs for Generative AI
Koyeb is a developer-friendly serverless platform designed to let businesses easily deploy reliable and scalable applications globally. The platform has been created by Cloud Computing Veterans and is financially backed by industry leaders.
Koyeb allows you to deploy all kinds of services, including full web applications, APIs, and background workers.
LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
microsoft • GitHub - microsoft/LLMLingua: To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
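A minimal sketch of prompt compression with the llmlingua package, based on the usage the project describes. The prompt text and the target_token budget below are illustrative placeholders.

```python
# Hedged sketch: compress a long prompt with LLMLingua before sending it to an LLM.
# The default PromptCompressor loads a small scoring LM on first use (a sizable
# download); the prompt, question, and token budget here are made-up examples.
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()  # compact LM used to score and drop non-essential tokens

prompt = (
    "You are a helpful assistant. Use the retrieved documents below.\n"
    "Document 1: ...\n"
    "Document 2: ...\n"
)

compressed = llm_lingua.compress_prompt(
    prompt,
    instruction="Answer the question using the documents.",
    question="What does LLMLingua do?",
    target_token=200,  # rough budget for the compressed prompt (assumed value)
)

print(compressed["compressed_prompt"])  # the shortened prompt to pass to your LLM
```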
Why Infinity:
Infinity provides the following features:
- Deploy virtually any SentenceTransformer - deploy the model you know from SentenceTransformers
- Fast inference backends: the inference server is built on top of torch, fastembed (onnx-cpu), and CTranslate2, getting the most out of your CUDA or CPU hardware.
- Dynamic batching: new embedding requests are queued while the GPU is busy with the previous ones.
michaelfeil • GitHub - michaelfeil/infinity: Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of sentence-transformer models and frameworks.
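A minimal client sketch against a running Infinity server, assuming it was started with a SentenceTransformer model such as BAAI/bge-small-en-v1.5. The port (7997) and the OpenAI-style /embeddings payload are assumptions; adjust them to your deployment.

```python
# Hedged sketch: request embeddings from a locally running Infinity server.
# Port, route, and model id are assumptions for illustration; check the server's
# startup logs / docs for the exact endpoint in your version.
import requests

payload = {
    "model": "BAAI/bge-small-en-v1.5",  # the embedding model being served (assumed)
    "input": [
        "an example sentence to embed",
        "another sentence",
    ],
}

resp = requests.post("http://localhost:7997/embeddings", json=payload, timeout=10)
resp.raise_for_status()

for item in resp.json()["data"]:
    print(len(item["embedding"]))  # dimensionality of each returned vector
```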
LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.
Table of contents
- Table of contents
- Features
- Models
- Getting started with Docker
- Launch LoRAX Server
- Prompt via REST API
- Prompt via Python
predibase • GitHub - predibase/lorax: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
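A minimal sketch of prompting a running LoRAX server over its REST API, in the TGI-style /generate shape the project documents. The host, port, and adapter_id below are placeholders; adapter_id selects which fine-tuned LoRA adapter to apply on top of the shared base model for this request.

```python
# Hedged sketch: send a generation request to a LoRAX server and pick a LoRA
# adapter per request. Host/port and the adapter name are illustrative only.
import requests

request_body = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "max_new_tokens": 64,
        "adapter_id": "some-org/some-lora-adapter",  # hypothetical fine-tuned adapter
    },
}

resp = requests.post("http://localhost:8080/generate", json=request_body, timeout=30)
resp.raise_for_status()
print(resp.json()["generated_text"])
```

Omitting adapter_id would prompt the base model directly, which is how LoRAX keeps thousands of adapters cheap: one shared base model, with adapters swapped in per request.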
llamafile lets you distribute and run LLMs with a single file.
Our goal is to make open source large language models much more accessible to both developers and end users. We're doing that by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable that runs locally on most computers, with no installation.
Mozilla-Ocho • GitHub - Mozilla-Ocho/llamafile: Distribute and run LLMs with a single file.