LLM Inference

Outerbounds - Infrastructure for ML, AI, and Data Science

GitHub - SeldonIO/MLServer: An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
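
For a sense of how MLServer is used, here is a minimal sketch of a custom inference runtime, based on MLServer's documented Python API (MLModel plus the V2 inference types); the SumModel class and its toy logic are hypothetical.

    # Hypothetical custom runtime; MLServer loads it via a model-settings.json
    # that points at this class.
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput

    class SumModel(MLModel):
        async def load(self) -> bool:
            # Real runtimes would load weights or other artifacts here.
            self.ready = True
            return self.ready

        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            # Sum the first input tensor, assuming a flat numeric payload.
            total = float(sum(payload.inputs[0].data))
            return InferenceResponse(
                model_name=self.name,
                outputs=[
                    ResponseOutput(
                        name="total", shape=[1], datatype="FP32", data=[total]
                    )
                ],
            )

The server is then started with MLServer's CLI (mlserver start) in the directory containing the settings file.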

Salad - GPU Cloud: 10k+ GPUs for Generative AI

Introduction

GitHub - microsoft/LLMLingua: To speed up LLM inference and help LLMs perceive key information, LLMLingua compresses the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.
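
As a sketch of the compression API, the snippet below follows the PromptCompressor usage shown in the project's README as I recall it; the context strings are placeholders, and the exact result keys (compressed_prompt, origin_tokens, compressed_tokens) should be treated as assumptions.

    # Sketch: compress a long retrieved context down to a ~200-token budget.
    from llmlingua import PromptCompressor

    compressor = PromptCompressor()  # downloads its compression model on first use

    context = ["<long retrieved passage 1>", "<long retrieved passage 2>"]
    result = compressor.compress_prompt(
        context,
        instruction="Answer the question using only the context.",
        question="What were the key findings?",
        target_token=200,  # token budget for the compressed prompt
    )
    print(result["compressed_prompt"])
    print(result["origin_tokens"], "->", result["compressed_tokens"])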

GitHub - michaelfeil/infinity: Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of sentence-transformers models and frameworks.
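
Infinity is driven over HTTP; a hedged sketch of a client call follows, assuming the server was launched against a sentence-transformers model and exposes an OpenAI-style /embeddings route. The port (7997), model id, and payload shape here are assumptions.

    # Hypothetical call against a locally running Infinity server.
    import requests

    resp = requests.post(
        "http://localhost:7997/embeddings",
        json={
            "model": "BAAI/bge-small-en-v1.5",  # whichever model the server loaded
            "input": ["first sentence to embed", "second sentence to embed"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    vectors = [item["embedding"] for item in resp.json()["data"]]
    print(len(vectors), len(vectors[0]))  # number of vectors and their dimension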

GitHub - predibase/lorax: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
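
The point of LoRAX is per-request adapter routing: each request can name a different fine-tuned LoRA on top of one shared base model. A sketch follows, assuming a TGI-style /generate route and an adapter_id generation parameter; the endpoint, payload shape, and adapter name are assumptions.

    # Hypothetical request to a running LoRAX server; adapter_id selects
    # which fine-tuned LoRA adapter to apply for this one request.
    import requests

    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "Summarize: LoRAX serves many adapters on one base model.",
            "parameters": {
                "adapter_id": "my-org/customer-support-lora",  # hypothetical adapter
                "max_new_tokens": 64,
            },
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["generated_text"])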

GitHub - Mozilla-Ocho/llamafile: Distribute and run LLMs with a single file.
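
Once a llamafile is marked executable and run, it starts a local server with an OpenAI-compatible API; the sketch below assumes the README's default host and port (localhost:8080) and the /v1/chat/completions route.

    # Hypothetical chat request against a locally running llamafile.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # placeholder; the local server serves one model
            "messages": [{"role": "user", "content": "Say hello in five words."}],
            "max_tokens": 32,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])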
