DeepSpeed-FastGen

RelatedHighlights

Overview

MaxText is a high performance , highly scalable , open-source LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for training and inference . MaxText achieves high MFUs and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.

MaxText... See more

google • GitHub - google/maxtext: A simple, performant and scalable Jax LLM!

Nicolay Gerold added

in models with between 2 and 7 billion parameters, new capabilities emerge such as the ability to generate different creative text in formats like poems, code, scripts, musical pieces, emails, and letters, and to answer even open-ended and challenging questions in an informative way.

Ben Auffarth • Generative AI with LangChain: Build large language model (LLM) apps with Python, ChatGPT, and other LLMs

Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.

Building on top of Hugginface's transformers and datasets, this repo provides an easy to use and easy... See more

young-geng • GitHub - young-geng/EasyLM: Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax.

Nicolay Gerold added

TL;DR

LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.

LLMLingua: Compressing Prompts for Accelerated Inference of La

microsoft • GitHub - microsoft/LLMLingua: To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

Nicolay Gerold added

StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.

mit-han-lab • GitHub - mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks

Darren LI added

LLM-PowerHouse: A Curated Guide for Large Language Models with Custom Training and Inferencing

Welcome to LLM-PowerHouse, your ultimate resource for unleashing the full potential of Large Language Models (LLMs) with custom training and inferencing. This GitHub repository is a comprehensive and curated guide designed to empower developers, researche... See more

ghimiresunil • GitHub - ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing: LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.

Nicolay Gerold added

We consider these aspects of our problem:

Latency : How fast does the system need to respond to user input?

Task Complexity : What level of understanding is required from the LLM? Is the input context and prompt super domain-specific?

Prompt Length : How much context needs to be provided for the LLM to do its task?

Quality : What is the acceptable

Developing Rapidly with Generative AI

Nicolay Gerold added

GPT4All: An ecosystem of open-source on-edge large language models.

Important

GPT4All v2.5.0 and newer only supports models in GGUF format (.gguf). Models used with a previous version of GPT4All (.bin extension) will no longer work.

GPT4All is an ecosystem to run powerful and customized large language models that work locally on consumer grade CPUs a... See more

nomic-ai • GitHub - nomic-ai/gpt4all: gpt4all: open-source LLM chatbots that you can run anywhere

Nicolay Gerold added