版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
基于verl
进行大模型强化学习的最佳实践Sofar,verl
hasgained:•
18k+stars•
3k+forks•
1.9k+commits•
490contributors•
2k+
issuesMany
popularRLprojects
built
ontop
ofverl:•
TinyZero(12k
stars)•
Easy-R1(4.4k
stars)•
Search-R1(3.8kstars)•
SimpleRL-Zoo(3.8kstars)•
OpenManus-RL(3.8kstars)•
SkyThought(3.4kstars)verl’sOpen-SourceCommunityMegatron-LM
TensorRT-LLM•…Lessionlearnedoverthe
pastyear•
Training:lack
of
abstraction,redundant
code
for
different
backends.•
Rollout:spmd
mode
is
intrusive
and
unfriendly
to
multi-turn
conversation.•
Single-controller:coupled
control
flow
and
data
flow,limiting
scalability•
LacknativesupportforasynchronoustrainingCore
Design:
HybridFlowHybridFlow=Single-controller(MPMD)+
Multi-controller(SPMD)Programminginterfacebasedonthe
“single-controller”
paradigmWithsingle-controller,
RLalgorithmcorelogicis
implemented
in
a
few
lines
of
code!
Facilitatediverse
RLalgorithmslike:PPO,GRPO,RLOO,ReMax,PRIME,DAPO,etc.Flexibility
in
Programming:
“Single-Controller”➡verl-core:building
blocks
for
RL
pipline•Model
Engine:
efficienttraining•RolloutEngine:generation,
environment
interaction,reward
calculation•TransferQueue:datatransmission,replay
buffer•Checkpoint
Engine:weight
synchronizationverl-trainer:
RL
pipelines
built
on
top
verl-core•On
policy:synchronous•One-step-off/Fully
async:asynchronous•
VLA•...verlarchitectureModel
EngineCoretrainingengineforbothSFT
and
RLtraining.Goal:support
largermodel,
longercontext,pushing
MFU
to
its
limit.API
Design:•
Abstract
Tinker-like
API•
RuninSPMD,parallelelsm
aware
dispatch&collect•
Traineragnostictobackend,new
backend
pluginwithouttrainercodechangeFeature:•
Multiplebackendsandvariousparallelismsupport•
Sequencebalancing•
LoRA•
Efficientkernelsfortraining
FlashAttenion
Liger-kernel
GroupGEMM/FusedMoE/DeepEP
FP8trainingBackendParallelismPerformanceSupport
ModelNew
ModelDaysFSDPFSDP+SPDensemedium/MoELowAlltransformermodelsDay
0MCoreDP+TP+PP+EP+CPHighsee
Megatron-Bridge
support
listfew
weeks
ormonthVeOmniFSDP+SP+EPMediumseeVeOmnisupport
list~1weekRollout
EngineAgent
loopcentricmulti-turnconversation
rollout•
LLMServer:
nativeinferenceserverwithout
intrusivemodifications Weight
synchronization
FP8onlinequantization
Router
replay•
AgentLoop:customizableagentictask
loops,
ReAct,
SWE,GUI,
etc•
RewardLoop:asynchronous
rewardcalculation,support
rulebased,model
based(GRM,
DistRM)Source:https://novasky-ai.notion.site/skyrl-v0Databusforallcomponents,replay
buffer.•
Currentlimitation:singlecontroller
handlesbothcontrolanddataflow,performanceissue
in
large
scale•
Earlierfailattempt:
rayobjectstore,
hightensorserializationcost,lackfine-grainedaccess,opaquegcmechanism•
TransferQueue
Zero-serialization
Extensible:multipletransport
layer,TCP,
RDMA
Fine-grainedaccess:read/write/appendsubsetcolumns
Proactivelifecycle
managementTransferQueueLongercontextandmulti-turnagentictasks
amplifythe"long-tail"problem,
increasethe
needforasynchronoustraining.Abstractionlayertosynchronizeweightsbetweentrainingandinference
backends.•
UnifiedAPI:send_weights/receive_weights/get_weights•
Extensible:plugabletransportbackend
collective:nccl,
hccl,
uccl
p2p:nixl,
mooncake
localcache:sharememory,
local
diskBackendTopologyPerformanceElasticUse
caseCollectiveall_gather+
broadcastVery
HighLow:
rebuild
ncclgroupFixedclusterP2Pall_gather+
ring
p2pMedium/HighHigh:dynamicadjust•
Elastic
rollout•
Faulttolerance•
HeterogeneousCheckpoint
Engineverl-trainerBuiltontopverl-core,construct
RLtrainingpipelines
flexibly.•
Onpolicy
trainer•
One-step-off-policytrainer•
Fullyasynctrainer•
VLAtrainer•
Manymorecustomtrainer
inverl-recipeSource:
MeituanSearchAI
InfraTeamSource:openvla/openvlaAgentic
RL:AgentloopAbstractionAgent:softwaresystemsthatuseAI
to
reasoning,planning,and
memoryandautonomyto
makedecisions,
learn,and
adapt.●Toolcalling:Allowingthe
LLMtoselect
and
use
varioustoolsas
needed.●Memory:
Enablingtheagent
to
retain
and
use
informationfrom
previoussteps.●Planning:
Empoweringthe
LLMtocreate
andfollow
multi-step
planstoachievegoals.Agent
RL:training
LLMto
makebetterdecisionsin
complex,dynamic,realworld.What
is
Agent?ReAct(fromlangchain-ai)Drawbacksofsynchronousrollout●
Batchgenerateandenvironmentexecutionareserial●
Rolloutandreward
calculation
stagesare
serial●
Rolloutandtrainingstagesare
serialLowinferenceandtrainingefficiency!How
to
do
agentic
RL?source:https://novasky-ai.notion.site/skyrl-v0●
Search:onlinewebsearch●
MCPtools:
image,videoedit,
...●
Codesandbox:executecode,python,java,
...●Virtual
machine:
operate
browser,
ppt,
excel,
...●
Androidemulator:operateappAgentLoop:givenauserprompt,execute
user
defined
loop,output
multi-turnchathistoryas
trajectory.AgentLoop●
Servermode:vllm/sglangAsyncLLM
engine●
Parallel
running:asyncio
loop
run
multiple
prompts
in
parallel●
Loadbalanceandsticky
session:
betterkv
cache
utilizationAgentLoopHighlightAgentic
RL
Practice
1:
RetoolReTool:training
LLMtowrite
pythoncodetosolvemath
problem.ReTool●
Basemodel:Qwen/Qwen2.5-32B-Instruct●
SFTdataset:JoeYing/ReTool-SFT●
RLdataset:
BytedTsinghua-SIA/DAPO-Math-17k●
Valdataset:yentinglin/aime_2025●
Recipe:verl/recipe/retoolReToolwithAgentLoopOverviewstage2:
GRPOstage
1:
SFTAgentic
RL
Practice2:SWEagent●SWEAgent:enable
LLMto
autonomously
use
tools
to
fix
issues●Sandbox:dockercontainer
launched
by
remote
container
service●SWE-Rex:runtime
interfacefor
interactingwith
sandbox
shell
environmentSWEAgentInfrastracture●Step
1~5:setup
container,
installtools,
and
initialize
shell
session●Step
6:setup
agentwith
tool
config
yaml,
e.g
tool
definition●Step7~11:
agent
query
model,
parse
action
and
executeshell
commandSWEAgent
Loophttps:/swe-ag/latest/background/architecture/Retokenization
Drift●BPE
Irreversible:
“HAVING”
=“H”+
“AVING”
or
“HAV”+
“
ING”●Tool
parser:
Parsingand
re-rendering
mightchangewhitespaceandformat.●Chattemplatedifference:vLLM,
SGLang
and
HuggingFaceChatModel:Avoid
Retokenization
DriftSome
Early
Experiment
Result●
Model:Qwen3-Coder-30B-A3B-Instruct●
Context
Length:64k●Training
dataset:
r2e-gym
(4500+images)●
Evaluationdataset:swe-verified(500
images)OngoingWork●
Fullyasync:taming“long-tail”
problem●
LLMGateway•OpenAIAPI,tokenize/detokenize,
prefix
change
detection•
Partialrollout
auto
resume•
KVcacheawareness
load
balancing●
Multi-trajecties:contextcompression,multi-agent,etcPerformancewith
NVIDIAsupportNsightSystem•
Profiler.nsys.
discrete=False,True•Afiler_enable:True•Actor.all_ranks:
True;
ranks:
[1,2]Profile
&
Iteratewith
Nsight
System•
Profiler.tool:
nsysrewardOld
log
prob
ref•
Profiler
steps:
[1,2,
5],
null,
[]•
Profiler.continuous_steps=True,
Falsegeneration
Update_actorWorkload
balance
for
Megatron
trainingWorkload
balance
in
long
tailed
data
training•RL
datasets
havevariable-lengthsequences,
causing
significant
efficiency
challenges
duringtrainingLong-TailedSequence
Length
Distributionsequence
lengthfrequency•RL
datasets
show
skewed,
long-tailed
distributions
of
sequence
lengths
.•Result:
GPU
under-utilization
in
both
memory
and
computation
efficiency..Rank
1Waitfor
thes
lowestRank
2Rank
3Rank
0Workload
balance
in
long
tailed
data
trainingImbalance
in
data
parallel•RLwithout
Packing/Dynamic
Batching•DP
synchronization
waits
for
the
s
lowest
rank
(stragglers).
.•GRPO
Qwen2.5-7B,
DP=4,
PP=2,
no
sequence
packing/dynamic
batchingWorkload
balance
in
long
tailed
data
trainingImbalance
in
pipeline
parallel•GRPO
Qwen2.5-7B,
DP=4,
PP=2,
no
sequence
packing/dynamic
batching.Workload
balance
in
long
tailed
data
trainingSolution•
Inter
DP•
Intra
DP
.•Workloadware
dynamic
batchingto
eventheworkload
across
micro-batches•Sortthe
micro-batchesto
make
consecutive
ones
have
similarworkloads•Place
smaller
micro-batches
at
both
endsto
reducethe
bubbles
exposed
duringthewarm-up
and
cool-down..•Workload
aware
data
parallel
split,
including
quadratic
complexity
of
attention
and
linear
complexity
of
FFNReduce
PP
bubbles
atwarmup
and
cooldown
stagesSorted
Dynamic
BatchingWorkload
balance
in
long
tailed
data
trainingPerformance•GRPO
training
7B
model
with
8*
Hopper
80G
GPUBest
Performancewith
Megatron
backendMegatron
Perf
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 环境保护执行细则
- 殡葬服务标准化工程师考试试卷及答案
- IDSA 2025版耐药革兰氏阴性杆菌感染治疗指南 完整版权威解读
- 心肌修复中生物材料与外泌体互作
- 专题六光、热学和近代物理(培优学生版)
- 少数民族医疗创新技术的知情同意文化适配
- 安徽省蚌埠市2026年高三年级新起点考试化学试题含解析
- 患者参与:线上线下共同决策模式构建
- 超市转让合同
- 被迫解除劳动合同通知书
- 2026年4月23日四川省宜宾市五方面人员选拔笔试真题及答案深度解析
- 2026广东建设职业技术学院第二批招聘6人备考题库附答案详解(考试直接用)
- 2026年科级干部任职资格政治理论考核要点
- GB/T 17498.6-2026室内固定式健身器材第6部分:跑步机附加的特殊安全要求和试验方法
- 2026秋招:重庆水务环境控股集团笔试题及答案
- 四百米障碍完整的教案
- 《材料分析测试技术》全套教学课件
- 天津英华插班生考试卷五年级
- 2021一级消防工程师继续教育考试石油化工类答案
- 小学音乐人教版 六年级下册爱我中华1 课件
- 深圳珠宝参展商名录
评论
0/150
提交评论