LLM Inference

Outerbounds - Infrastructure for ML, AI, and Data Science

GitHub - SeldonIO/MLServer: An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
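
For a sense of how MLServer is used, here is a minimal sketch of a custom inference runtime, based on MLServer's documented Python API (MLModel plus the V2 inference types); the SumModel class and its toy logic are hypothetical.

    # Hypothetical custom runtime; MLServer loads it via a model-settings.json
    # that points at this class.
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput

    class SumModel(MLModel):
        async def load(self) -> bool:
            # Real runtimes would load weights or other artifacts here.
            self.ready = True
            return self.ready

        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            # Sum the first input tensor, assuming a flat numeric payload.
            total = float(sum(payload.inputs[0].data))
            return InferenceResponse(
                model_name=self.name,
                outputs=[
                    ResponseOutput(
                        name="total", shape=[1], datatype="FP32", data=[total]
                    )
                ],
            )

The server is then started with MLServer's CLI (mlserver start) in the directory containing the settings file.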

Salad - GPU Cloud: 10k+ GPUs for Generative AI

Introduction

GitHub - microsoft/LLMLingua: To speed up LLM inference and help LLMs perceive key information, LLMLingua compresses the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.
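
As a sketch of the compression API, the snippet below follows the PromptCompressor usage shown in the project's README as I recall it; the context strings are placeholders, and the exact result keys (compressed_prompt, origin_tokens, compressed_tokens) should be treated as assumptions.

    # Sketch: compress a long retrieved context down to a ~200-token budget.
    from llmlingua import PromptCompressor

    compressor = PromptCompressor()  # downloads its compression model on first use

    context = ["<long retrieved passage 1>", "<long retrieved passage 2>"]
    result = compressor.compress_prompt(
        context,
        instruction="Answer the question using only the context.",
        question="What were the key findings?",
        target_token=200,  # token budget for the compressed prompt
    )
    print(result["compressed_prompt"])
    print(result["origin_tokens"], "->", result["compressed_tokens"])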

GitHub - michaelfeil/infinity: Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of sentence-transformers models and frameworks.
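
Infinity is driven over HTTP; a hedged sketch of a client call follows, assuming the server was launched against a sentence-transformers model and exposes an OpenAI-style /embeddings route. The port (7997), model id, and payload shape here are assumptions.

    # Hypothetical call against a locally running Infinity server.
    import requests

    resp = requests.post(
        "http://localhost:7997/embeddings",
        json={
            "model": "BAAI/bge-small-en-v1.5",  # whichever model the server loaded
            "input": ["first sentence to embed", "second sentence to embed"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    vectors = [item["embedding"] for item in resp.json()["data"]]
    print(len(vectors), len(vectors[0]))  # number of vectors and their dimension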

GitHub - predibase/lorax: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
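
The point of LoRAX is per-request adapter routing: each request can name a different fine-tuned LoRA on top of one shared base model. A sketch follows, assuming a TGI-style /generate route and an adapter_id generation parameter; the endpoint, payload shape, and adapter name are assumptions.

    # Hypothetical request to a running LoRAX server; adapter_id selects
    # which fine-tuned LoRA adapter to apply for this one request.
    import requests

    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "Summarize: LoRAX serves many adapters on one base model.",
            "parameters": {
                "adapter_id": "my-org/customer-support-lora",  # hypothetical adapter
                "max_new_tokens": 64,
            },
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["generated_text"])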

GitHub - Mozilla-Ocho/llamafile: Distribute and run LLMs with a single file.
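
Once a llamafile is marked executable and run, it starts a local server with an OpenAI-compatible API; the sketch below assumes the README's default host and port (localhost:8080) and the /v1/chat/completions route.

    # Hypothetical chat request against a locally running llamafile.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # placeholder; the local server serves one model
            "messages": [{"role": "user", "content": "Say hello in five words."}],
            "max_tokens": 32,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])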
