[Verl Source Code Analysis (4)] Training with a Sandbox in Verl

In reinforcement learning, when the reward depends on "execution results" rather than on text similarity or a scoring model, a sandbox is needed to run the model's rollout output and judge it. The most typical case is adding code-generation tasks to RL training, or more complex training scenarios involving tool calls.

Verl now supports integrating SandboxFusion into the training loop, so this post follows Verl's official example to see how it works. The example uses SandboxFusion on the Eurus-2-RL-Data dataset to strengthen math and coding ability.

Note that this post examines the v0.4.1.x branch of Verl: https://github.com/verl-project/verl/tree/v0.4.1.x

A brief introduction to SandboxFusion

See the official site for a detailed introduction: https://bytedance.github.io/SandboxFusion/

SandboxFusion is a versatile code sandbox for LLMs developed by ByteDance. A sandbox environment can be started quickly by pulling and running the official image. It provides two main capabilities:

  • Running code: submit code to the serving endpoint, e.g. http://localhost:8080/run_code; the sandbox compiles and runs it and returns the result. This is the basic, general-purpose capability.

  • Dataset-based judging: built on top of code execution, this wraps a number of common code datasets. You can directly submit a model's output for a given problem in a supported dataset; the sandbox extracts the code itself, compiles and runs it, compares the result against the reference for that problem, and returns a verdict. Custom datasets can also be added.

Verl uses only SandboxFusion's code-execution capability; the comparison against the dataset's ground truth is implemented in Verl itself, which is more flexible.
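As a quick illustration of the run_code capability, the sketch below POSTs a snippet to a locally running sandbox. The payload and response fields are assumptions based on the SandboxFusion docs; check the official API reference for the exact schema:

```python
import json
import urllib.request

SANDBOX_URL = "http://localhost:8080/run_code"  # default port of the official image

def build_payload(code, language="python"):
    # Field names follow the SandboxFusion run_code API; treat them as
    # illustrative if your sandbox version differs.
    return {"code": code, "language": language}

def run_code(payload, url=SANDBOX_URL):
    # POST the payload as JSON and return the decoded response, which
    # contains the compile/run status and the captured output.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Requires a running sandbox container:
    # print(run_code(build_payload("print(1 + 1)")))
    pass
```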

A brief introduction to the Eurus-2-RL-Data dataset

The Eurus-2-RL-Data dataset contains 455K math problems and 26K coding problems:

  • Math problems: drawn from the NuminaMath-CoT collection. The problems range from Chinese high-school math to International Mathematical Olympiad questions, and the final answer must be given in LaTeX. An example:
{
'data_source': 'numina_olympiads',
'prompt': array([
{'content': '\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n', 'role': 'system'},
{'content': 'Find the matrix of the operator $\\widehat{A}$ in the basis $\\mathbf{e}_{1}^{\\prime}, \\mathbf{e}_{2}^{\\prime}, \\mathbf{e}_{3}^{\\prime}$, where\n\n$$\n\\begin{aligned}\n& \\mathbf{e}_{1}^{\\prime}=\\mathbf{e}_{1}+\\mathbf{e}_{2}+2 \\mathbf{e}_{3}, \\\\\n& \\mathbf{e}_{2}^{\\prime}=2 \\mathbf{e}_{1}-\\mathbf{e}_{2} \\\\\n& \\mathbf{e}_{3}^{\\prime}=-\\mathbf{e}_{1}+\\mathbf{e}_{2}+\\mathbf{e}_{3},\n\\end{aligned}\n$$\n\nif in the basis $\\mathbf{e}_{1}, \\mathbf{e}_{2}, \\mathbf{e}_{3}$ its matrix is given by\n\n$$\nA_{\\mathbf{e}}=\\left(\\begin{array}{rrr}\n2 & 0 & -1 \\\\\n0 & 1 & -2 \\\\\n-1 & 2 & 0\n\\end{array}\\right)\n$$\n\nPresent the answer in LaTex format: \\boxed{Your answer}', 'role': 'user'}],
dtype=object),
'ability': 'math',
'reward_model': {'ground_truth': '\\begin{pmatrix}\n -7 & 6 & -8 \\\\\n 11 & -9 & 12 \\\\\n 15 & -16 & 19\n \\end{pmatrix}', 'style': 'rule'},
'extra_info': {'index': 0, 'split': 'dummy'}
}

  • Coding problems: mainly sourced from APPS, CodeContests, TACO, and Codeforces. The difficulty is roughly at competitive-programming level. The ground truth is no longer a single numeric answer but a set of inputs with their corresponding expected outputs. An example:
{
'data_source': 'taco',
'prompt': array([
{'content': '\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n', 'role': 'system'},
{'content': 'Xander Cage has a list of cities he can visit on his new top-secret mission. He represents each city as a tuple of $(latitude,longitude,height,points)$. The values of $latitude$, $longitude$, and $height$ are distinct across all cities.\n\nWe define a mission as a sequence of cities, $c_1,c_2,c_3,\\ldots,c_k$, that he visits. We define the total $\\text{points}$ of such a mission to be the sum of the $\\text{points}$ of all the cities in his mission list.\n\nBeing eccentric, he abides by the following rules on any mission:\n\nHe can choose the number of cities he will visit (if any).\nHe can start the mission from any city.\nHe visits cities in order of strictly increasing $height$.\nThe absolute difference in $latitude$ between adjacent visited cities in his mission must be at most $d_l\\textbf{at}$.\nThe absolute difference in $longitude$ between adjacent visited cities in his mission must be at most $d_long$.\n\nGiven $\\boldsymbol{d\\text{_lat}}$, $d\\text{_long}$, and the definitions for $n$ cities, find and print the maximum possible total $\\text{points}$ that Xander can earn on a mission.\n\nInput Format\n\nThe first line contains three space-separated integers describing the respective values of $n$, $\\boldsymbol{d\\text{_lat}}$, and $d\\text{_long}$. 
\n\nEach line $\\boldsymbol{i}$ of the $n$ subsequent lines contains four space-separated integers denoting the respective $latitude$, $longitude$, $height$, and $\\text{points}$ for a city.\n\nConstraints\n\n$1\\leq n\\leq2\\times10^5$ \n$1\\leq d\\_\\textit{lat},d\\textit{long}\\leq2\\times10^5$ \n$1\\leq latitude,longitude,height\\leq2\\times10^5$ \n$-2\\times10^5\\leq\\textit{points}\\leq2\\times10^5$\n\nOutput Format\n\nPrint a single integer denoting the maximum possible $\\text{points}$ that Xander can earn on a mission.\n\nSample Input 0\n3 1 1\n1 1 1 3\n2 2 2 -1\n3 3 3 3\n\nSample Output 0\n5\n\nExplanation 0\n\nXander can start at city $1$, then go to city $2$, and then go to city $3$ for a maximum value of total $points=3+-1+3=5$ \n\nNote that he cannot go directly from city $1$ to city $3$ as that would violate his rules that the absolute difference in $latitude$ between adjacent visited cities be $\\leq d\\text{_lat}$ and the absolute difference in $longitude$ between adjacent visited cities be $\\leq d\\text{_long}$. Because $d\\textit{_lat}=1$ and $d\\textit{_long}=1$, he cannot directly travel between those cities.\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.', 'role': 'user'}],
dtype=object),
'ability': 'code',
'reward_model': {'ground_truth': '{"inputs": ["3 2 2\\n1 1 1 3\\n2 2 2 -1\\n3 3 3 3\\n", "4 2 2\\n1 1 1 3\\n2 2 2 -1\\n3 3 3 3\\n4 4 4 5\\n", "5 2 2\\n1 1 1 3\\n2 2 2 -1\\n3 3 3 3\\n4 4 4 5\\n5 5 5 1\\n", "2 1 1\\n1 1 1 3\\n2 2 2 5\\n", "3 1 1\\n1 1 1 3\\n1 2 2 5\\n1 3 3 6\\n", "5 200000 200000\\n1 1 1 200000\\n200000 200000 200000 200000\\n400000 400000 400000 200000\\n600000 600000 600000 200000\\n800000 800000 800000 200000\\n"], "outputs": ["6", "11", "12", "8", "14", "1000000"]}', 'style': 'rule'},
'extra_info': {'index': 0, 'split': 'dummy'}
}
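Since Verl compares against the ground truth itself, the two formats above imply two different checks: extracting the \boxed{...} answer for math, and replaying the inputs/outputs pairs for code. A minimal sketch, where extract_boxed, grade_code, and run_program are illustrative names rather than Verl's actual API:

```python
import json

def extract_boxed(text):
    # Math answers: pull the content of the last \boxed{...}, counting
    # braces so nested groups like \boxed{\frac{1}{2}} survive intact.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len("\\boxed{"), 1, []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

def grade_code(ground_truth, run_program):
    # Code answers: ground_truth is the JSON string from the dataset,
    # {"inputs": [...], "outputs": [...]}. run_program(stdin) stands in
    # for executing the extracted code in the sandbox and capturing stdout.
    cases = json.loads(ground_truth)
    passed = sum(
        run_program(stdin).strip() == expected.strip()
        for stdin, expected in zip(cases["inputs"], cases["outputs"])
    )
    return passed / len(cases["inputs"])
```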

Running the sandbox-based reward example in Verl

For environment setup, see my earlier post: https://slipegg.github.io/2026/01/29/Verl-Install-Demo/

I only have 4x RTX 4090s, which cannot run the official example script examples/ppo_trainer/run_deepseek7b_llm_sandbox_fusion.sh as-is, so I adapted it: mainly shrinking the batch sizes and switching to the smaller Qwen2.5-0.5B-Instruct model. The adjusted script:

set -x

python3 -m verl.trainer.main_ppo \
reward_model.sandbox_fusion.url='http://localhost:8080/run_code' \
reward_model.sandbox_fusion.max_concurrent=16 \
reward_model.reward_manager=prime \
algorithm.adv_estimator=gae \
data.train_files=huggingface/Eurus-2-RL-Data/train.parquet \
data.val_files=huggingface/Eurus-2-RL-Data/validation.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation=right \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_example_sandbox_fusion' \
trainer.experiment_name='deepseek_llm_7b_function_sandbox_fusion' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=-1 \
trainer.total_epochs=15 2>&1 | tee verl_sandbox.log

The run output shows the training loop iterating normally.

How Verl computes rewards through the sandbox

This section focuses on how coding problems, which need the sandbox, are handled. The overall flow is the usual one: a batch is drawn from the dataset, the rollout worker generates an answer for each problem (for coding problems, the generated code), and then rewards are computed. To verify the generated code, it is compiled and run in the sandbox on each input from the dataset, and the resulting outputs are compared against the expected outputs to produce the reward. Advantages are then computed as the algorithm requires, and finally the model is updated.
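For a coding problem, the first step of scoring is pulling the candidate code out of the rollout text. A minimal sketch of that extraction (extract_code is an illustrative name, not Verl's actual helper; Verl's own extraction lives in its scorers):

```python
import re

FENCE = "`" * 3  # a literal triple backtick, built here to keep this snippet fence-safe

def extract_code(response):
    # The prompt asks the model to end with a fenced python block;
    # take the last such block as the candidate solution.
    pattern = FENCE + r"python\n(.*?)" + FENCE
    blocks = re.findall(pattern, response, re.DOTALL)
    return blocks[-1].strip() if blocks else None
```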

The rest of this section looks at the details of calling the sandbox during reward computation, in particular how the work is orchestrated.

Reward function initialization

During initialization, reward_fn and val_reward_fn are created. In this example, the setting reward_model.reward_manager=prime loads PrimeRewardManager. In addition, reward_model.sandbox_fusion.max_concurrent=16 configures a global semaphore that caps concurrent sandbox API access at 16; this semaphore is bound into the reward-computation function, which is then passed to PrimeRewardManager:

def load_reward_manager(config, tokenizer, num_examine, **reward_kwargs):
    """
    Load and initialize a reward manager based on the configuration.

    Args:
        config: PPO trainer configuration object containing reward_model fields.
        tokenizer: Tokenizer object used for processing text.
        num_examine: Number of samples to examine.
        **reward_kwargs: Additional keyword arguments for the reward manager.

    Returns:
        An instance of the specified reward manager class.
    """
    from verl.workers.reward_manager import get_reward_manager_cls

    # The list of pre-defined reward managers are defined in `verl/workers/reward_manager/`:
    # naive: NaiveRewardManager
    # prime: PrimeRewardManager
    # batch: BatchRewardManager
    # dapo: DAPORewardManager
    # Note(haibin.lin): For custom reward managers, please make sure they are imported and
    # registered via `verl.workers.reward_manager.register`
    # By default reward_manager is set to naive (NaiveRewardManager)
    reward_manager_name = config.reward_model.get("reward_manager", "naive")
    reward_manager_cls = get_reward_manager_cls(reward_manager_name)

    # Try to get a custom reward function based on the configuration
    compute_score = get_custom_reward_fn(config)
    final_compute_score = compute_score

    if compute_score is None:
        sandbox_config = config.reward_model.get("sandbox_fusion")
        sandbox_url = sandbox_config.get("url") if sandbox_config else None
        memory_limit_mb = sandbox_config.get("memory_limit_mb", 1024)
        if sandbox_url:
            sandbox_manager = multiprocessing.Manager()
            # Create a semaphore to control concurrent access to the sandbox
            _concurrent_semaphore = sandbox_manager.Semaphore(sandbox_config.get("max_concurrent", 64))
            final_compute_score = partial(
                default_compute_score,
                sandbox_fusion_url=sandbox_url,
                concurrent_semaphore=_concurrent_semaphore,
                memory_limit_mb=memory_limit_mb,
            )
        else:
            final_compute_score = default_compute_score

    # Instantiate and return the reward manager with the specified parameters
    return reward_manager_cls(
        tokenizer=tokenizer,
        num_examine=num_examine,
        compute_score=final_compute_score,
        reward_fn_key=config.data.reward_fn_key,
        **reward_kwargs,
    )

Reward computation

  • In the training loop, once a batch is drawn from train_dataloader and the rollout worker has generated responses for it, the reward manager (PrimeRewardManager) is invoked to compute rewards. The overall flow is shown in the figure below.

  • Internally, PrimeRewardManager by default creates a process pool with 64 workers. Each problem's rollout result in the batch becomes a per-problem task submitted to the pool; when no worker is free, tasks queue and wait.

  • When a worker handles a per-problem task, it first creates a thread pool of size max(32, os.cpu_count() * 5). Because the result must be tested against every input case, each input becomes its own per-input task, which is submitted to the thread pool for the threads to process.

There is an obvious optimization opportunity here: even though every input exercises the same piece of code, the code is re-submitted to the sandbox once per input, causing repeated code transfer and compilation. Ideally the sandbox would support submitting one piece of code with multiple inputs at once, improving efficiency.
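The two levels of fan-out above can be sketched as nested pools. All names here (score_batch, score_one_problem, check_one_input, fake_run) are illustrative stand-ins for Verl's internals, and fake_run replaces the real sandbox call:

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fake_run(code, stdin):
    # Stand-in for a sandbox call; the real code POSTs to the sandbox here.
    return stdin.upper()

def check_one_input(args):
    # Per-input task: run the code on one stdin and compare stdout.
    code, stdin, expected = args
    return fake_run(code, stdin) == expected

def score_one_problem(code, cases):
    # Per-problem worker: fan the test cases out to a thread pool sized
    # max(32, os.cpu_count() * 5), mirroring the sizing described above.
    workers = max(32, (os.cpu_count() or 1) * 5)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_one_input, [(code, s, e) for s, e in cases]))
    return sum(results) / len(results)

def score_batch(problems):
    # Batch level: one process per problem, default pool size 64; tasks
    # queue when all workers are busy.
    with ProcessPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(score_one_problem, code, cases) for code, cases in problems]
        return [f.result() for f in futures]
```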

  • When a per-input task runs, it first prepares its data; before calling the sandbox API it must acquire the global concurrency-control semaphore, and only proceeds once a slot is granted. Note that this semaphore is global: it caps the total number of requests in flight to the sandbox across the whole job.

  • To issue a sandbox request, the task first builds a payload, sends it to the sandbox for execution, and then receives the response.
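Put together, a single sandbox call looks roughly like the following. This is a sketch: the payload fields are illustrative, and a threading.Semaphore stands in for the cross-process Manager().Semaphore that Verl actually uses:

```python
import json
import threading
import urllib.request

# Global cap on in-flight sandbox requests, mirroring
# reward_model.sandbox_fusion.max_concurrent=16. A threading.Semaphore is
# enough for this single-process sketch.
SANDBOX_SEMAPHORE = threading.Semaphore(16)

def call_sandbox(code, url="http://localhost:8080/run_code"):
    # Build the payload, then hold one semaphore slot for the duration of
    # the HTTP request so at most 16 requests are in flight globally.
    payload = {"code": code, "language": "python"}  # illustrative fields
    with SANDBOX_SEMAPHORE:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))
```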


[Verl Source Code Analysis (4)] Training with a Sandbox in Verl
http://example.com/2026/02/23/Verl-sandbox-usage/
Author: 滑滑蛋 · Published February 23, 2026