LLM (27)
Fundamentals (12)
[Paper Reading] ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
[Paper Reading] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
[Paper Reading] The Llama 3 Herd of Models (Section 3: Pre-Training)
[Paper Reading] Reducing Activation Recomputation in Large Transformer Models
A Detailed Look at Backpropagation and Optimizers in Deep Learning
A First Look at PyTorch torch.distributed and NCCL
An Overview of GPU Architecture
A Brief Analysis of GPU Memory Usage in Large Models
A Brief Analysis of the Transformer KV Cache
Differences Between Decoder-only, Encoder-only, and Encoder-Decoder Transformer Architectures