[Megatron-LM Source Code Analysis (3)] Performance Profiling

For compute utilization, Megatron-LM supports profiling with PyTorch Profiler and Nsys. Note that the two are mutually exclusive in Megatron-LM.

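PyTorch Profiler: the framework's native tool. It works at a higher level and focuses on Python/PyTorch operators, exposing the call chain at code level. It is suited to finding slow operators on the Python side, memory leaks, and scheduling overhead.
Nsys: a system-level tracing tool. It works at a lower level and focuses on CUDA and hardware performance. It is suited to analyzing CUDA kernel execution, PCIe bandwidth utilization, GPU memory transfers, and multi-GPU communication (NCCL).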

For GPU memory usage, Megatron-LM supports recording memory allocations through PyTorch's built-in snapshot feature.

The sections below describe how to enable each of these analysis methods, with an example for each.

PyTorch Profiler Performance Analysis

Usage

Typical PyTorch Profiler code looks like this:

import torch
import torch.profiler
import os

logdir = "tb_profiler_test"

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1,
    ),
    on_trace_ready=torch.profiler.tensorboard_trace_handler(
        logdir
    ),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step in range(6):
        x = torch.randn(4096, 4096, device="cuda")
        y = x @ x
        torch.cuda.synchronize()
        prof.step()

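First define a torch.profiler.schedule, then call prof.step() at the end of each iteration to advance the profiler's step counter. With wait=1, warmup=1, and active=2, the profiler idles on the first step, warms up on the second, and records the next two. The resulting trace can be viewed in TensorBoard or in Chrome at chrome://tracing/.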
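To make the schedule arguments concrete, here is a minimal pure-Python model of how wait, warmup, active, and repeat map step indices to profiler actions. It is a simplified sketch that omits skip_first and is not PyTorch's actual implementation; the names Action and schedule_action are made up for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARMUP = "warmup"
    RECORD = "record"
    RECORD_AND_SAVE = "record_and_save"


def schedule_action(step, wait, warmup, active, repeat=0):
    """Simplified model of torch.profiler.schedule (skip_first omitted)."""
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return Action.NONE           # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return Action.NONE           # waiting: the profiler is idle
    if pos < wait + warmup:
        return Action.WARMUP         # tracing runs, but results are discarded
    if pos < cycle - 1:
        return Action.RECORD
    return Action.RECORD_AND_SAVE    # last active step of the cycle


# Same settings as the example above: 6 steps, wait=1, warmup=1, active=2.
actions = [schedule_action(s, wait=1, warmup=1, active=2, repeat=1)
           for s in range(6)]
recorded = [s for s, a in enumerate(actions)
            if a in (Action.RECORD, Action.RECORD_AND_SAVE)]
print(recorded)  # prints [2, 3]
```

Only steps 2 and 3 (0-based) end up in the trace; steps 4 and 5 run unprofiled because repeat=1 allows a single cycle.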

Example

The script is as follows. The key part is the arguments added to PROFILER_ARGS, which tell Megatron-LM to capture steps 110 and 111.

#!/bin/bash

# Runs the "857m" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=$4 #<Specify path to file>/gpt2-merges.txt
DATA_PATH=$5 #<Specify path and file prefix>_text_document
USE_NSYS=0
if [[ ${6:-} == "--nsys" ]]; then
  USE_NSYS=1
fi

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE 
    --nnodes $NUM_NODES 
    --master_addr $MASTER_ADDR 
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers 24 
    --hidden-size 1024 
    --num-attention-heads 16 
    --seq-length 2048 
    --max-position-embeddings 2048 
    --attention-backend auto # Can use (flash/fused/unfused/local)
)

TRAINING_ARGS=(
    --micro-batch-size 4 
    --global-batch-size 16 
    # --rampup-batch-size 16 16 5859375 
    --train-iters 20000 
    --weight-decay 0.1 
    --adam-beta1 0.9 
    --adam-beta2 0.95 
    --init-method-std 0.006 
    --clip-grad 1.0 
    --fp16
    --lr 6.0e-5 
    --lr-decay-style cosine 
    --min-lr 6.0e-6
    --lr-warmup-fraction .001 
    --lr-decay-iters 20000 
)

MODEL_PARALLEL_ARGS=(
  --tensor-model-parallel-size 1 
  --pipeline-model-parallel-size 1 
)

DATA_ARGS=(
    --data-path $DATA_PATH 
    --vocab-file $VOCAB_FILE 
    --merge-file $MERGE_FILE 
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 200
    --save-interval 10000 
    --eval-interval 1000 
    --save $CHECKPOINT_PATH 
    --load $CHECKPOINT_PATH 
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH 
)

PROFILER_ARGS=(
    --profile
    --use-pytorch-profiler
    --profile-step-start 110
    --profile-step-end 112
    --profile-ranks 0
)

# Build command as an array (no string concatenation)
CMD=(
  torchrun
  "${DISTRIBUTED_ARGS[@]}"
  pretrain_gpt.py
  "${GPT_MODEL_ARGS[@]}"
  "${TRAINING_ARGS[@]}"
  "${MODEL_PARALLEL_ARGS[@]}"
  "${DATA_ARGS[@]}"
  "${EVAL_AND_LOGGING_ARGS[@]}"
  "${PROFILER_ARGS[@]}"
)

if [[ "$USE_NSYS" -eq 1 ]]; then
  NSIGHT_PREFIX="./nsight_profile/gpt3_857m"
  echo "Running with Nsight profiling, output prefix: ${NSIGHT_PREFIX}"
  exec nsys profile \
    -s none -t nvtx,cuda \
    --cudabacktrace=all \
    --cuda-graph-trace=node \
    --python-backtrace=cuda \
    --wait all \
    -o "${NSIGHT_PREFIX}" \
    --force-overwrite true \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    "${CMD[@]}"
else
  exec "${CMD[@]}"
fi

The command is as follows:

bash examples/gpt3/train_gpt3_857m_distributed.sh     /workspace/megatron-lm/model_ckpt/gpt3_857m_2     /workspace/megatron-lm/tb_logs/gpt3_857m_profiler     /workspace/megatron-lm/data/tokenizer/gpt2-vocab.json     /workspace/megatron-lm/data/tokenizer/gpt2-merges.txt     /workspace/megatron-lm/data/TinyStoriesV2-GPT4-train_text_document      > gpt3_857m2.log 2>&1 &

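After the run, a pt.trace.json file appears under tensorboard-dir. In this run it was tb_logs/gpt3_857m_profiler/6dacc15685cd_821091.1766741105666343018.pt.trace.json.

The file can be opened in TensorBoard or Chrome. The results below are from Chrome's chrome://tracing/:

The CPU-side analysis of the Python code is shown below; the whole call chain is quite clear:

The GPU-side analysis is shown below. Because this run uses plain data parallelism, each step ends with an all-reduce that synchronizes gradients across ranks, and the overall logic is easy to follow.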
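Besides the GUI viewers, the pt.trace.json file can be inspected programmatically, since it follows the Chrome trace event format. The sketch below sums the duration of complete events ("ph": "X") per name. The helper names top_events and load_trace are hypothetical, the "kernel" category label is an assumption about how the PyTorch Profiler tags GPU kernels (it may differ between versions), and the demo data is synthetic rather than taken from the run above.

```python
import json
from collections import defaultdict


def top_events(trace, category="kernel", k=5):
    """Return the k event names with the largest total duration (microseconds)."""
    totals = defaultdict(float)
    for ev in trace.get("traceEvents", []):
        # "X" marks a complete event that carries its own duration.
        if ev.get("ph") != "X":
            continue
        # Compare case-insensitively, since category spelling is an assumption.
        if str(ev.get("cat", "")).lower() != category:
            continue
        totals[ev["name"]] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


def load_trace(path):
    """Load a trace file such as the pt.trace.json written by the profiler."""
    with open(path) as f:
        return json.load(f)


# Tiny synthetic trace so the sketch runs without a real profile.
demo = {"traceEvents": [
    {"ph": "X", "cat": "kernel", "name": "gemm", "dur": 120},
    {"ph": "X", "cat": "kernel", "name": "ncclAllReduce", "dur": 300},
    {"ph": "X", "cat": "kernel", "name": "gemm", "dur": 80},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 50},
    {"ph": "M", "name": "process_name"},
]}
print(top_events(demo))
```

On a real trace, calling top_events(load_trace(path)) gives a quick ranking of where GPU time goes before opening the timeline view.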

Nsys Performance Analysis

Usage

Typical Nsys usage looks like the following, where range_push("xxx") and range_pop() attach a name to the code region between them:

import torch
import torch.cuda.nvtx as nvtx
import time

device = "cuda"

# warmup
for _ in range(2):
    x = torch.randn(4096, 4096, device=device)
    y = x @ x
    torch.cuda.synchronize()

# profiled region
nvtx.range_push("matmul_step")

x = torch.randn(4096, 4096, device=device)
y = x @ x
torch.cuda.synchronize()

nvtx.range_pop()

time.sleep(0.2)  # make the CPU timeline easier to see

The script must be launched through an nsys command like this:

nsys profile \
  --trace=cuda,nvtx,osrt \
  -o simple_matmul \
  python example.py

The run produces simple_matmul.nsys-rep, which can be opened with Nsight Systems after downloading and installing it.
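The range_push/range_pop pair in the example above has to stay balanced, which is easy to break when the annotated code raises. One way to guard against that is a small context manager, sketched below. The name nvtx_range and its push/pop parameters are hypothetical and exist only so the sketch can be dry-run without a GPU; PyTorch also ships torch.cuda.nvtx.range, which can be used as a context manager directly.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, push=None, pop=None):
    """Open an NVTX range and always close it, even if the body raises."""
    if push is None or pop is None:
        # Imported lazily so the dry run below works without CUDA.
        import torch.cuda.nvtx as nvtx
        push, pop = nvtx.range_push, nvtx.range_pop
    push(name)
    try:
        yield
    finally:
        pop()


# Dry run with stand-in callbacks that only log the calls.
events = []


def fake_push(name):
    events.append(("push", name))


def fake_pop():
    events.append(("pop",))


with nvtx_range("forward", push=fake_push, pop=fake_pop):
    with nvtx_range("attention", push=fake_push, pop=fake_pop):
        pass

print(events)
```

Nested ranges close in reverse order, which is what Nsight Systems needs to draw them as stacked bars on the NVTX row.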

Example

The script is as follows. Note that, compared with the PyTorch Profiler run, --use-pytorch-profiler has been removed.

#!/bin/bash

# Runs the "857m" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=$4 #<Specify path to file>/gpt2-merges.txt
DATA_PATH=$5 #<Specify path and file prefix>_text_document
USE_NSYS=0
if [[ ${6:-} == "--nsys" ]]; then
  USE_NSYS=1
fi

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE 
    --nnodes $NUM_NODES 
    --master_addr $MASTER_ADDR 
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers 24 
    --hidden-size 1024 
    --num-attention-heads 16 
    --seq-length 2048 
    --max-position-embeddings 2048 
    --attention-backend auto # Can use (flash/fused/unfused/local)
)

TRAINING_ARGS=(
    --micro-batch-size 4 
    --global-batch-size 16 
    # --rampup-batch-size 16 16 5859375 
    --train-iters 20000 
    --weight-decay 0.1 
    --adam-beta1 0.9 
    --adam-beta2 0.95 
    --init-method-std 0.006 
    --clip-grad 1.0 
    --fp16
    --lr 6.0e-5 
    --lr-decay-style cosine 
    --min-lr 6.0e-6
    --lr-warmup-fraction .001 
    --lr-decay-iters 20000 
)

MODEL_PARALLEL_ARGS=(
  --tensor-model-parallel-size 1 
  --pipeline-model-parallel-size 1 
)

DATA_ARGS=(
    --data-path $DATA_PATH 
    --vocab-file $VOCAB_FILE 
    --merge-file $MERGE_FILE 
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 200
    --save-interval 10000 
    --eval-interval 1000 
    --save $CHECKPOINT_PATH 
    --load $CHECKPOINT_PATH 
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH 
)

PROFILER_ARGS=(
    --profile
    --profile-step-start 110
    --profile-step-end 112
    --profile-ranks 0
)

# Build command as an array (no string concatenation)
CMD=(
  torchrun
  "${DISTRIBUTED_ARGS[@]}"
  pretrain_gpt.py
  "${GPT_MODEL_ARGS[@]}"
  "${TRAINING_ARGS[@]}"
  "${MODEL_PARALLEL_ARGS[@]}"
  "${DATA_ARGS[@]}"
  "${EVAL_AND_LOGGING_ARGS[@]}"
  "${PROFILER_ARGS[@]}"
)

if [[ "$USE_NSYS" -eq 1 ]]; then
  NSIGHT_PREFIX="./nsight_profile/gpt3_857m"
  echo "Running with Nsight profiling, output prefix: ${NSIGHT_PREFIX}"
  exec nsys profile \
    -s none -t nvtx,cuda \
    --cudabacktrace=all \
    --cuda-graph-trace=node \
    --python-backtrace=cuda \
    --wait all \
    -o "${NSIGHT_PREFIX}" \
    --force-overwrite true \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    "${CMD[@]}"
else
  exec "${CMD[@]}"
fi

The command is as follows. Note the added --nsys, which makes the script launch the training job under nsys:

bash examples/gpt3/train_gpt3_857m_distributed.sh     /workspace/megatron-lm/model_ckpt/gpt3_857m_2     /workspace/megatron-lm/tb_logs/gpt3_857m_profiler     /workspace/megatron-lm/data/tokenizer/gpt2-vocab.json     /workspace/megatron-lm/data/tokenizer/gpt2-merges.txt     /workspace/megatron-lm/data/TinyStoriesV2-GPT4-train_text_document     --nsys      > gpt3_857m2.log 2>&1 &

This produces nsight_profile/gpt3_857m.nsys-rep. Opening it in Nsight Systems gives the following:

The view is indeed lower-level, and the CUDA-related analysis is more comprehensive.
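The script passes --capture-range=cudaProfilerApi to nsys, which means nsys only records between the program's calls to the CUDA profiler start and stop APIs. The sketch below shows one way a training loop could gate those calls on a step window. It is an illustration under assumptions, not Megatron-LM's actual loop: run_with_capture_window and its parameters are hypothetical, and the default hooks use torch.cuda.profiler.start and torch.cuda.profiler.stop.

```python
def run_with_capture_window(num_steps, start_step, end_step, train_step,
                            profiler_start=None, profiler_stop=None):
    """Call the profiler hooks around steps in [start_step, end_step)."""
    if profiler_start is None or profiler_stop is None:
        # Imported lazily so the dry run below works without CUDA.
        import torch.cuda.profiler as cuda_profiler
        profiler_start, profiler_stop = cuda_profiler.start, cuda_profiler.stop
    for step in range(num_steps):
        if step == start_step:
            profiler_start()   # nsys starts capturing from here
        if step == end_step:
            profiler_stop()    # with --capture-range-end=stop, capture ends
        train_step(step)


# Dry run with stand-in hooks that only log the call order.
log = []
run_with_capture_window(
    num_steps=5, start_step=2, end_step=4,
    train_step=lambda s: log.append(f"step{s}"),
    profiler_start=lambda: log.append("start"),
    profiler_stop=lambda: log.append("stop"),
)
print(log)
```

Gating the capture this way keeps the report small and skips the unrepresentative first iterations, which is the same reason the script sets a start step well after training begins.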

Memory Snapshot GPU Memory Analysis

Usage

The overall usage of PyTorch's memory snapshot is as follows:

torch.cuda.memory._record_memory_history()               # start recording
run_your_code()                                          # training or inference code
torch.cuda.memory._dump_snapshot("my_snapshot.pickle")   # save to file
torch.cuda.memory._record_memory_history(enabled=None)   # stop recording

The run produces my_snapshot.pickle, which can then be viewed at https://docs.pytorch.org/memory_viz.
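The snapshot file is an ordinary pickle, so it can also be summarized in a few lines of Python before uploading it to the viewer. The sketch below assumes the snapshot is a dict with a "segments" list whose entries carry "total_size" and "allocated_size" in bytes, which matches the structure returned by torch.cuda.memory._snapshot() as far as I know. That API is private and may change; the helper names and the demo data are made up for illustration.

```python
import pickle


def summarize_snapshot(snapshot):
    """Sum reserved and allocated bytes over all allocator segments."""
    segments = snapshot.get("segments", [])
    return {
        "segments": len(segments),
        # Bytes the caching allocator has reserved from the driver.
        "reserved_bytes": sum(seg["total_size"] for seg in segments),
        # Bytes currently handed out to tensors.
        "allocated_bytes": sum(seg["allocated_size"] for seg in segments),
    }


def load_snapshot(path):
    """Load a file written by torch.cuda.memory._dump_snapshot."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Synthetic snapshot so the sketch runs without a GPU.
demo = {"segments": [
    {"total_size": 2 * 1024**2, "allocated_size": 1 * 1024**2},
    {"total_size": 20 * 1024**2, "allocated_size": 5 * 1024**2},
]}
print(summarize_snapshot(demo))
```

A large gap between reserved and allocated bytes hints at fragmentation in the caching allocator, which the timeline view in memory_viz can then confirm.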

Example

The script is as follows. The key part is adding --record-memory-history and --memory-snapshot-path './snapshot/snapshot.pickle' to PROFILER_ARGS.

#!/bin/bash

# Runs the "857m" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=$4 #<Specify path to file>/gpt2-merges.txt
DATA_PATH=$5 #<Specify path and file prefix>_text_document
USE_NSYS=0
if [[ ${6:-} == "--nsys" ]]; then
  USE_NSYS=1
fi

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE 
    --nnodes $NUM_NODES 
    --master_addr $MASTER_ADDR 
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers 24 
    --hidden-size 1024 
    --num-attention-heads 16 
    --seq-length 2048 
    --max-position-embeddings 2048 
    --attention-backend auto # Can use (flash/fused/unfused/local)
)

TRAINING_ARGS=(
    --micro-batch-size 4 
    --global-batch-size 16 
    # --rampup-batch-size 16 16 5859375 
    --train-iters 20000 
    --weight-decay 0.1 
    --adam-beta1 0.9 
    --adam-beta2 0.95 
    --init-method-std 0.006 
    --clip-grad 1.0 
    --fp16
    --lr 6.0e-5 
    --lr-decay-style cosine 
    --min-lr 6.0e-6
    --lr-warmup-fraction .001 
    --lr-decay-iters 20000 
)

MODEL_PARALLEL_ARGS=(
  --tensor-model-parallel-size 1 
  --pipeline-model-parallel-size 1 
)

DATA_ARGS=(
    --data-path $DATA_PATH 
    --vocab-file $VOCAB_FILE 
    --merge-file $MERGE_FILE 
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 200
    --save-interval 10000 
    --eval-interval 1000 
    --save $CHECKPOINT_PATH 
    --load $CHECKPOINT_PATH 
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH 
)

PROFILER_ARGS=(
    --profile
    --record-memory-history
    --memory-snapshot-path ./snapshot/snapshot.pickle
    --profile-step-start 110
    --profile-step-end 112
    --profile-ranks 0
)

# Build command as an array (no string concatenation)
CMD=(
  torchrun
  "${DISTRIBUTED_ARGS[@]}"
  pretrain_gpt.py
  "${GPT_MODEL_ARGS[@]}"
  "${TRAINING_ARGS[@]}"
  "${MODEL_PARALLEL_ARGS[@]}"
  "${DATA_ARGS[@]}"
  "${EVAL_AND_LOGGING_ARGS[@]}"
  "${PROFILER_ARGS[@]}"
)

if [[ "$USE_NSYS" -eq 1 ]]; then
  NSIGHT_PREFIX="./nsight_profile/gpt3_857m"
  echo "Running with Nsight profiling, output prefix: ${NSIGHT_PREFIX}"
  exec nsys profile \
    -s none -t nvtx,cuda \
    --cudabacktrace=all \
    --cuda-graph-trace=node \
    --python-backtrace=cuda \
    --wait all \
    -o "${NSIGHT_PREFIX}" \
    --force-overwrite true \
    --capture-range=cudaProfilerApi \
    --capture-range-end=stop \
    "${CMD[@]}"
else
  exec "${CMD[@]}"
fi

The command is:

bash examples/gpt3/train_gpt3_857m_distributed.sh     /workspace/megatron-lm/model_ckpt/gpt3_857m_2     /workspace/megatron-lm/tb_logs/gpt3_857m_profiler     /workspace/megatron-lm/data/tokenizer/gpt2-vocab.json     /workspace/megatron-lm/data/tokenizer/gpt2-merges.txt     /workspace/megatron-lm/data/TinyStoriesV2-GPT4-train_text_document      > gpt3_857m2.log 2>&1 &

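The run produces snapshot/snapshot.pickle. Loading it into https://docs.pytorch.org/memory_viz gives the following:

The bottom band is the static memory held by the model and the optimizer. The dynamic activation memory above it is clearly periodic. Usage peaks at about 15 GB when the loss is computed through cross_entropy, because at that point every forward-pass activation has been computed and is still alive; during the backward pass that follows, the activations are released one after another.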


http://example.com/2025/12/26/megatron-lm-profiler/

Author: 滑滑蛋

Published: December 26, 2025
