GitHub - turboderp/exllamav2: A fast inference library for running LLMs locally on modern consumer-class GPUs
A OnePlus phone with 24GB of RAM running Mixtral 8x7B at 11 tokens/second with PowerInfer-2 🤯
Much faster inference speed than llama.cpp and MLC-LLM.
Uses swap and caching to run the model even when it doesn't fit in the available RAM.
📌 Between Apple's LLM in a flash and...
Rohan Paul (x.com)
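The swap-and-caching point above is essentially demand paging: keep the weights on flash and let the OS page in only the slices a forward pass actually touches, so a model larger than RAM can still run. A minimal sketch of that general pattern (not PowerInfer-2's actual implementation; the file name, layout, and layer size are assumptions for illustration):

```python
import numpy as np

# Sketch of the "larger than RAM" trick, assuming a hypothetical weights.bin:
# one flat little-endian float16 array, 32 layers of 4096x4096 weights laid
# out back to back. np.memmap keeps the data on disk; the OS pages in only
# the bytes we actually touch and evicts them under memory pressure, so peak
# resident memory stays near one layer rather than the whole model.

WEIGHT_FILE = "weights.bin"   # hypothetical path, not a real PowerInfer-2 artifact
HIDDEN = 4096
N_LAYERS = 32

def load_layer(mm: np.memmap, layer_idx: int) -> np.ndarray:
    """Return one layer's weight matrix as a lazily-paged view into the file."""
    start = layer_idx * HIDDEN * HIDDEN
    return mm[start : start + HIDDEN * HIDDEN].reshape(HIDDEN, HIDDEN)

def forward(x: np.ndarray) -> np.ndarray:
    mm = np.memmap(WEIGHT_FILE, dtype=np.float16, mode="r")
    for i in range(N_LAYERS):
        w = load_layer(mm, i)
        # Materialise only the current layer in RAM for the matmul.
        x = np.tanh(x @ w.astype(np.float32))
        del w
    return x

if __name__ == "__main__":
    print(forward(np.ones((1, HIDDEN), dtype=np.float32)).shape)
```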
Llama 2 7B chat, running 100% private on Mac, powered by CoreML! ⚡️
We're optimising this setup to get much faster generation. 🔥 https://t.co/IchaNckIK2
Vaibhav (VB) Srivastav (x.com)

One of the best tutorial-style repos since @karpathy's minGPT! GPT-Fast: a minimalistic, PyTorch-only decoding implementation loaded with best practices: int8/int4 quantization, speculative decoding, tensor parallelism, etc. Boosts the "clock speed" of LLM OS by 10x with no model change!
We need more minGPTs and...
Jim Fan (x.com)
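Of the best-practice tricks GPT-Fast bundles, weight-only int8 quantization is the easiest to show in isolation: store each weight row as int8 plus a per-row scale, and dequantize on the fly inside the matmul. A rough PyTorch sketch of the idea (a generic illustration, not GPT-Fast's actual code; Int8Linear and quantize_linears are hypothetical names):

```python
import torch
import torch.nn.functional as F

class Int8Linear(torch.nn.Module):
    """Weight-only int8 linear layer: int8 weights plus per-output-channel scales."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        w = linear.weight.data  # [out_features, in_features]
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("weight_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize just in time; the weights are stored and read as int8,
        # roughly halving weight traffic versus fp16 during decoding.
        w = self.weight_int8.to(x.dtype) * self.scale.to(x.dtype)
        return F.linear(x, w, self.bias)

def quantize_linears(model: torch.nn.Module) -> torch.nn.Module:
    """Recursively swap every nn.Linear in a model for the int8 version."""
    for name, child in model.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(model, name, Int8Linear(child))
        else:
            quantize_linears(child)
    return model
```

Since decoding is memory-bandwidth bound, reading half as many weight bytes per token is where most of the speedup from this trick comes from.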