GitHub - turboderp/exllamav2: A fast inference library for running LLMs locally on modern consumer-class GPUs
I just spent two hours reading the DeepSeek V3 Technical Report; here are my thoughts.
First, the contributions come mainly from systems work (training infra) rather than the model itself, which is still a conventional Transformer:
1) They are the first to systematically deploy fp8 (8-bit floating point) quantization in large-scale LLM training, which sharply reduces GPU memory requirements and speeds up training (see the quantization sketch after this post);
2) To make fp8 matrix multiplication work correctly, they optimized and improved how CUDA kernels are invoked, and even gave NVIDIA a number of design suggestions for Tensor Cores;
3) They built their own training framework, DualPipe, implementing 16-way pipeline and 64-way expert (MoE) parallelism, which greatly reduces the communication/computation contention of parallel training and removes the scheduling bottleneck.
In the end, DeepSeek achieved, on 2,048 …
勃勃OC • x.com
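Since the fp8 point is the headline systems contribution, here is a minimal, hypothetical sketch of tile-wise fp8 (E4M3) quantization in plain NumPy. It is not DeepSeek's kernel: the tile size, all names, and the rounding helper are illustrative assumptions; real training does this inside fused CUDA/Tensor Core kernels.

```python
import numpy as np

# Hypothetical sketch of tile-wise fp8 (E4M3) quantization, NOT DeepSeek's
# actual kernel: giving each tile its own scale keeps one tile's outliers
# from wrecking the precision of every other tile.
E4M3_MAX = 448.0  # largest finite value representable in fp8 E4M3

def round_to_e4m3_mantissa(x: np.ndarray) -> np.ndarray:
    """Crude simulation of E4M3's 3 mantissa bits (ignores subnormals;
    the exponent range is handled by the per-tile scale and clip)."""
    m, e = np.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0  # keep 1 implicit + 3 explicit mantissa bits
    return np.ldexp(m, e)

def quantize_tile(tile: np.ndarray):
    """Scale one tile into the E4M3 range; return (quantized, scale)."""
    scale = max(np.abs(tile).max() / E4M3_MAX, 1e-12)  # avoid divide-by-zero
    q = round_to_e4m3_mantissa(np.clip(tile / scale, -E4M3_MAX, E4M3_MAX))
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)  # one 128x128 tile
q, s = quantize_tile(w)
w_hat = q * s                                       # dequantize
print("max relative error:", (np.abs(w - w_hat) / np.abs(w)).max())  # ~3%
```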
Overview
MaxText is a high-performance, highly scalable, open-source LLM written in pure Python/JAX, targeting Google Cloud TPUs and GPUs for training and inference. MaxText achieves high MFUs and scales from a single host to very large clusters while staying simple and "optimization-free" thanks to the power of JAX and the XLA compiler.
google • GitHub - google/maxtext: A simple, performant and scalable Jax LLM!
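As a toy illustration of that "optimization-free" claim, here is a minimal JAX sketch (not MaxText code; the function and shapes are made up): you write plain Python, and the XLA compiler fuses and optimizes it into device kernels.

```python
import jax

# Toy sketch of the "optimization-free" idea: no hand-written CUDA,
# just plain JAX that XLA compiles and fuses on first call.
@jax.jit
def mlp_block(x, w1, w2):
    h = jax.nn.gelu(x @ w1)   # matmul + GELU, fused by XLA
    return h @ w2             # second matmul

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
x  = jax.random.normal(k1, (8, 512))
w1 = jax.random.normal(k2, (512, 2048)) * 0.02
w2 = jax.random.normal(k3, (2048, 512)) * 0.02

y = mlp_block(x, w1, w2)  # first call traces + compiles; later calls reuse it
print(y.shape)            # (8, 512)
```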
2-5x faster, 50% less memory local LLM finetuning
- Manual autograd engine: hand-derived backprop steps (see the sketch after this entry).
- 2x to 5x faster than QLoRA. 50% less memory usage.
- All kernels written in OpenAI's Triton language.
- 0% loss in accuracy - no approximation methods - all exact.
- No change of hardware necessary. Supports NVIDIA GPUs from 2018 onward. Minimum CUDA Compute Capability …
unslothai • GitHub - unslothai/unsloth: 5X faster 50% less memory LLM finetuning
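For the "manual autograd engine" bullet, here is a toy NumPy sketch of what hand-derived backprop means for a single linear layer. It illustrates the idea only, not Unsloth's Triton kernels, and all names are hypothetical: the gradients come from the chain rule written out by hand rather than from an autograd graph.

```python
import numpy as np

# Hand-derived forward and backward for y = x @ W (illustration only,
# not Unsloth's implementation).
def linear_forward(x, W):
    return x @ W                       # y has shape (batch, out)

def linear_backward(x, W, dy):
    dW = x.T @ dy                      # dL/dW = x^T @ dL/dy
    dx = dy @ W.T                      # dL/dx = dL/dy @ W^T
    return dx, dW

rng = np.random.default_rng(0)
x  = rng.normal(size=(4, 8))
W  = rng.normal(size=(8, 3))
dy = rng.normal(size=(4, 3))           # upstream gradient dL/dy

dx, dW = linear_backward(x, W, dy)

# Sanity check: compare dW[0, 0] with a finite-difference estimate.
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
num = ((linear_forward(x, Wp) - linear_forward(x, W)) * dy).sum() / eps
print(np.isclose(num, dW[0, 0]))       # True
```

Writing the backward pass explicitly like this is what lets a library replace the generic autograd graph with exact, fused kernels.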
Edition 22: A Framework to Securely Use LLMs in Companies - Part 2: Managing Risk
Sandesh Mysore Anand • boringappsec.substack.com
"How to install an LLM on macOS: trying Ollama" - ZDNET Japan #SmartNews