[Verl Source Code Analysis (4)] Training with a Sandbox in Verl

In reinforcement learning, when the reward depends on "execution results" rather than on text similarity or a scoring model, a sandbox is needed to run the model's rollout output and judge it. The most typical case is adding code-generation tasks to RL training, or more complex training scenarios involving tool calls.

Verl now supports integrating SandboxFusion into the training loop, so this post follows Verl's official example to see how it works. The example uses SandboxFusion on the Eurus-2-RL-Data dataset to strengthen math and coding ability.

Note that this post examines the v0.4.1.x branch of Verl: https://github.com/verl-project/verl/tree/v0.4.1.x

A brief introduction to SandboxFusion

See the official site for a detailed introduction: https://bytedance.github.io/SandboxFusion/

SandboxFusion is a versatile code sandbox for LLMs developed by ByteDance. A sandbox environment can be started quickly by pulling and running the official image. It provides two main capabilities:

  • Running code: submit code to the serving endpoint, e.g. http://localhost:8080/run_code; the sandbox compiles and runs it and returns the result. This is the basic, general-purpose capability.

  • Dataset-based judging: built on top of code execution, this wraps a number of common code datasets. You can directly submit a model's output for a given problem in a supported dataset; the sandbox extracts the code itself, compiles and runs it, compares the result against the reference for that problem, and returns a verdict. Custom datasets can also be added.

Verl uses only SandboxFusion's code-execution capability; the comparison against the dataset's ground truth is implemented in Verl itself, which is more flexible.
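As a quick illustration of the run_code capability, the sketch below POSTs a snippet to a locally running sandbox. The payload and response fields are assumptions based on the SandboxFusion docs; check the official API reference for the exact schema:

```python
import json
import urllib.request

SANDBOX_URL = "http://localhost:8080/run_code"  # default port of the official image

def build_payload(code, language="python"):
    # Field names follow the SandboxFusion run_code API; treat them as
    # illustrative if your sandbox version differs.
    return {"code": code, "language": language}

def run_code(payload, url=SANDBOX_URL):
    # POST the payload as JSON and return the decoded response, which
    # contains the compile/run status and the captured output.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Requires a running sandbox container:
    # print(run_code(build_payload("print(1 + 1)")))
    pass
```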

A brief introduction to the Eurus-2-RL-Data dataset

The Eurus-2-RL-Data dataset contains 455K math problems and 26K coding problems:

  • Math problems: drawn from the NuminaMath-CoT collection. The problems range from Chinese high-school math to International Mathematical Olympiad questions, and the final answer must be given in LaTeX. An example:
{
'data_source': 'numina_olympiads',
'prompt': array([
{'content': '\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n', 'role': 'system'},
{'content': 'Find the matrix of the operator $\\widehat{A}$ in the basis $\\mathbf{e}_{1}^{\\prime}, \\mathbf{e}_{2}^{\\prime}, \\mathbf{e}_{3}^{\\prime}$, where\n\n$$\n\\begin{aligned}\n& \\mathbf{e}_{1}^{\\prime}=\\mathbf{e}_{1}+\\mathbf{e}_{2}+2 \\mathbf{e}_{3}, \\\\\n& \\mathbf{e}_{2}^{\\prime}=2 \\mathbf{e}_{1}-\\mathbf{e}_{2} \\\\\n& \\mathbf{e}_{3}^{\\prime}=-\\mathbf{e}_{1}+\\mathbf{e}_{2}+\\mathbf{e}_{3},\n\\end{aligned}\n$$\n\nif in the basis $\\mathbf{e}_{1}, \\mathbf{e}_{2}, \\mathbf{e}_{3}$ its matrix is given by\n\n$$\nA_{\\mathbf{e}}=\\left(\\begin{array}{rrr}\n2 & 0 & -1 \\\\\n0 & 1 & -2 \\\\\n-1 & 2 & 0\n\\end{array}\\right)\n$$\n\nPresent the answer in LaTex format: \\boxed{Your answer}', 'role': 'user'}],
dtype=object),
'ability': 'math',
'reward_model': {'ground_truth': '\\begin{pmatrix}\n -7 & 6 & -8 \\\\\n 11 & -9 & 12 \\\\\n 15 & -16 & 19\n \\end{pmatrix}', 'style': 'rule'},
'extra_info': {'index': 0, 'split': 'dummy'}
}

  • Coding problems: mainly sourced from APPS, CodeContests, TACO, and Codeforces. The difficulty is roughly at competitive-programming level. The ground truth is no longer a single numeric answer but a set of inputs with their corresponding expected outputs. An example:
{
'data_source': 'taco',
'prompt': array([
{'content': '\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n', 'role': 'system'},
{'content': 'Xander Cage has a list of cities he can visit on his new top-secret mission. He represents each city as a tuple of $(latitude,longitude,height,points)$. The values of $latitude$, $longitude$, and $height$ are distinct across all cities.\n\nWe define a mission as a sequence of cities, $c_1,c_2,c_3,\\ldots,c_k$, that he visits. We define the total $\\text{points}$ of such a mission to be the sum of the $\\text{points}$ of all the cities in his mission list.\n\nBeing eccentric, he abides by the following rules on any mission:\n\nHe can choose the number of cities he will visit (if any).\nHe can start the mission from any city.\nHe visits cities in order of strictly increasing $height$.\nThe absolute difference in $latitude$ between adjacent visited cities in his mission must be at most $d_l\\textbf{at}$.\nThe absolute difference in $longitude$ between adjacent visited cities in his mission must be at most $d_long$.\n\nGiven $\\boldsymbol{d\\text{_lat}}$, $d\\text{_long}$, and the definitions for $n$ cities, find and print the maximum possible total $\\text{points}$ that Xander can earn on a mission.\n\nInput Format\n\nThe first line contains three space-separated integers describing the respective values of $n$, $\\boldsymbol{d\\text{_lat}}$, and $d\\text{_long}$. 
\n\nEach line $\\boldsymbol{i}$ of the $n$ subsequent lines contains four space-separated integers denoting the respective $latitude$, $longitude$, $height$, and $\\text{points}$ for a city.\n\nConstraints\n\n$1\\leq n\\leq2\\times10^5$ \n$1\\leq d\\_\\textit{lat},d\\textit{long}\\leq2\\times10^5$ \n$1\\leq latitude,longitude,height\\leq2\\times10^5$ \n$-2\\times10^5\\leq\\textit{points}\\leq2\\times10^5$\n\nOutput Format\n\nPrint a single integer denoting the maximum possible $\\text{points}$ that Xander can earn on a mission.\n\nSample Input 0\n3 1 1\n1 1 1 3\n2 2 2 -1\n3 3 3 3\n\nSample Output 0\n5\n\nExplanation 0\n\nXander can start at city $1$, then go to city $2$, and then go to city $3$ for a maximum value of total $points=3+-1+3=5$ \n\nNote that he cannot go directly from city $1$ to city $3$ as that would violate his rules that the absolute difference in $latitude$ between adjacent visited cities be $\\leq d\\text{_lat}$ and the absolute difference in $longitude$ between adjacent visited cities be $\\leq d\\text{_long}$. Because $d\\textit{_lat}=1$ and $d\\textit{_long}=1$, he cannot directly travel between those cities.\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.', 'role': 'user'}],
dtype=object),
'ability': 'code',
'reward_model': {'ground_truth': '{"inputs": ["3 2 2\\n1 1 1 3\\n2 2 2 -1\\n3 3 3 3\\n", "4 2 2\\n1 1 1 3\\n2 2 2 -1\\n3 3 3 3\\n4 4 4 5\\n", "5 2 2\\n1 1 1 3\\n2 2 2 -1\\n3 3 3 3\\n4 4 4 5\\n5 5 5 1\\n", "2 1 1\\n1 1 1 3\\n2 2 2 5\\n", "3 1 1\\n1 1 1 3\\n1 2 2 5\\n1 3 3 6\\n", "5 200000 200000\\n1 1 1 200000\\n200000 200000 200000 200000\\n400000 400000 400000 200000\\n600000 600000 600000 200000\\n800000 800000 800000 200000\\n"], "outputs": ["6", "11", "12", "8", "14", "1000000"]}', 'style': 'rule'},
'extra_info': {'index': 0, 'split': 'dummy'}
}
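Since Verl compares against the ground truth itself, the two formats above imply two different checks: extracting the \boxed{...} answer for math, and replaying the inputs/outputs pairs for code. A minimal sketch, where extract_boxed, grade_code, and run_program are illustrative names rather than Verl's actual API:

```python
import json

def extract_boxed(text):
    # Math answers: pull the content of the last \boxed{...}, counting
    # braces so nested groups like \boxed{\frac{1}{2}} survive intact.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len("\\boxed{"), 1, []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

def grade_code(ground_truth, run_program):
    # Code answers: ground_truth is the JSON string from the dataset,
    # {"inputs": [...], "outputs": [...]}. run_program(stdin) stands in
    # for executing the extracted code in the sandbox and capturing stdout.
    cases = json.loads(ground_truth)
    passed = sum(
        run_program(stdin).strip() == expected.strip()
        for stdin, expected in zip(cases["inputs"], cases["outputs"])
    )
    return passed / len(cases["inputs"])
```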

Running the sandbox-based reward example in Verl

For environment setup, see my earlier post: https://slipegg.github.io/2026/01/29/Verl-Install-Demo/

I only have 4x RTX 4090s, which cannot run the official example script examples/ppo_trainer/run_deepseek7b_llm_sandbox_fusion.sh as-is, so I adapted it: mainly shrinking the batch sizes and switching to the smaller Qwen2.5-0.5B-Instruct model. The adjusted script:

set -x

python3 -m verl.trainer.main_ppo \
reward_model.sandbox_fusion.url='http://localhost:8080/run_code' \
reward_model.sandbox_fusion.max_concurrent=16 \
reward_model.reward_manager=prime \
algorithm.adv_estimator=gae \
data.train_files=huggingface/Eurus-2-RL-Data/train.parquet \
data.val_files=huggingface/Eurus-2-RL-Data/validation.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation=right \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_example_sandbox_fusion' \
trainer.experiment_name='deepseek_llm_7b_function_sandbox_fusion' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=-1 \
trainer.total_epochs=15 2>&1 | tee verl_sandbox.log

The run output shows the training loop iterating normally.

How Verl computes rewards through the sandbox

This section focuses on how coding problems, which need the sandbox, are handled. The overall flow is the usual one: a batch is drawn from the dataset, the rollout worker generates an answer for each problem (for coding problems, the generated code), and then rewards are computed. To verify the generated code, it is compiled and run in the sandbox on each input from the dataset, and the resulting outputs are compared against the expected outputs to produce the reward. Advantages are then computed as the algorithm requires, and finally the model is updated.
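For a coding problem, the first step of scoring is pulling the candidate code out of the rollout text. A minimal sketch of that extraction (extract_code is an illustrative name, not Verl's actual helper; Verl's own extraction lives in its scorers):

```python
import re

FENCE = "`" * 3  # a literal triple backtick, built here to keep this snippet fence-safe

def extract_code(response):
    # The prompt asks the model to end with a fenced python block;
    # take the last such block as the candidate solution.
    pattern = FENCE + r"python\n(.*?)" + FENCE
    blocks = re.findall(pattern, response, re.DOTALL)
    return blocks[-1].strip() if blocks else None
```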

The rest of this section looks at the details of calling the sandbox during reward computation, in particular how the work is orchestrated.

Reward function initialization

During initialization, reward_fn and val_reward_fn are created. In this example, the setting reward_model.reward_manager=prime loads PrimeRewardManager. In addition, reward_model.sandbox_fusion.max_concurrent=16 configures a global semaphore that caps concurrent sandbox API access at 16; this semaphore is bound into the reward-computation function, which is then passed to PrimeRewardManager:

def load_reward_manager(config, tokenizer, num_examine, **reward_kwargs):
    """
    Load and initialize a reward manager based on the configuration.

    Args:
        config: PPO trainer configuration object containing reward_model fields.
        tokenizer: Tokenizer object used for processing text.
        num_examine: Number of samples to examine.
        **reward_kwargs: Additional keyword arguments for the reward manager.

    Returns:
        An instance of the specified reward manager class.
    """
    from verl.workers.reward_manager import get_reward_manager_cls

    # The list of pre-defined reward managers are defined in `verl/workers/reward_manager/`:
    # naive: NaiveRewardManager
    # prime: PrimeRewardManager
    # batch: BatchRewardManager
    # dapo: DAPORewardManager
    # Note(haibin.lin): For custom reward managers, please make sure they are imported and
    # registered via `verl.workers.reward_manager.register`
    # By default reward_manager is set to naive (NaiveRewardManager)
    reward_manager_name = config.reward_model.get("reward_manager", "naive")
    reward_manager_cls = get_reward_manager_cls(reward_manager_name)

    # Try to get a custom reward function based on the configuration
    compute_score = get_custom_reward_fn(config)
    final_compute_score = compute_score

    if compute_score is None:
        sandbox_config = config.reward_model.get("sandbox_fusion")
        sandbox_url = sandbox_config.get("url") if sandbox_config else None
        memory_limit_mb = sandbox_config.get("memory_limit_mb", 1024)
        if sandbox_url:
            sandbox_manager = multiprocessing.Manager()
            # Create a semaphore to control concurrent access to the sandbox
            _concurrent_semaphore = sandbox_manager.Semaphore(sandbox_config.get("max_concurrent", 64))
            final_compute_score = partial(
                default_compute_score,
                sandbox_fusion_url=sandbox_url,
                concurrent_semaphore=_concurrent_semaphore,
                memory_limit_mb=memory_limit_mb,
            )
        else:
            final_compute_score = default_compute_score

    # Instantiate and return the reward manager with the specified parameters
    return reward_manager_cls(
        tokenizer=tokenizer,
        num_examine=num_examine,
        compute_score=final_compute_score,
        reward_fn_key=config.data.reward_fn_key,
        **reward_kwargs,
    )

Reward computation

  • In the training loop, once a batch is drawn from train_dataloader and the rollout worker has generated responses for it, the reward manager (PrimeRewardManager) is invoked to compute rewards. The overall flow is shown in the figure below.

  • Internally, PrimeRewardManager by default creates a process pool with 64 workers. Each problem's rollout result in the batch becomes a per-problem task submitted to the pool; when no worker is free, tasks queue and wait.

  • When a worker handles a per-problem task, it first creates a thread pool of size max(32, os.cpu_count() * 5). Because the result must be tested against every input case, each input becomes its own per-input task, which is submitted to the thread pool for the threads to process.

There is an obvious optimization opportunity here: even though every input exercises the same piece of code, the code is re-submitted to the sandbox once per input, causing repeated code transfer and compilation. Ideally the sandbox would support submitting one piece of code with multiple inputs at once, improving efficiency.
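The two levels of fan-out above can be sketched as nested pools. All names here (score_batch, score_one_problem, check_one_input, fake_run) are illustrative stand-ins for Verl's internals, and fake_run replaces the real sandbox call:

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fake_run(code, stdin):
    # Stand-in for a sandbox call; the real code POSTs to the sandbox here.
    return stdin.upper()

def check_one_input(args):
    # Per-input task: run the code on one stdin and compare stdout.
    code, stdin, expected = args
    return fake_run(code, stdin) == expected

def score_one_problem(code, cases):
    # Per-problem worker: fan the test cases out to a thread pool sized
    # max(32, os.cpu_count() * 5), mirroring the sizing described above.
    workers = max(32, (os.cpu_count() or 1) * 5)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_one_input, [(code, s, e) for s, e in cases]))
    return sum(results) / len(results)

def score_batch(problems):
    # Batch level: one process per problem, default pool size 64; tasks
    # queue when all workers are busy.
    with ProcessPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(score_one_problem, code, cases) for code, cases in problems]
        return [f.result() for f in futures]
```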

  • When a per-input task runs, it first prepares its data; before calling the sandbox API it must acquire the global concurrency-control semaphore, and only proceeds once a slot is granted. Note that this semaphore is global: it caps the total number of requests in flight to the sandbox across the whole job.

  • To issue a sandbox request, the task first builds a payload, sends it to the sandbox for execution, and then receives the response.
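Put together, a single sandbox call looks roughly like the following. This is a sketch: the payload fields are illustrative, and a threading.Semaphore stands in for the cross-process Manager().Semaphore that Verl actually uses:

```python
import json
import threading
import urllib.request

# Global cap on in-flight sandbox requests, mirroring
# reward_model.sandbox_fusion.max_concurrent=16. A threading.Semaphore is
# enough for this single-process sketch.
SANDBOX_SEMAPHORE = threading.Semaphore(16)

def call_sandbox(code, url="http://localhost:8080/run_code"):
    # Build the payload, then hold one semaphore slot for the duration of
    # the HTTP request so at most 16 requests are in flight globally.
    payload = {"code": code, "language": "python"}  # illustrative fields
    with SANDBOX_SEMAPHORE:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))
```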


[Verl Source Code Analysis (4)] Training with a Sandbox in Verl
http://example.com/2026/02/23/Verl-sandbox-usage/
Author: 滑滑蛋 · Published February 23, 2026