Best Practices for Reinforcement Learning of Large Models with verl

So far, verl has gained:
• 18k+ stars
• 3k+ forks
• 1.9k+ commits
• 490 contributors
• 2k+ issues

Many popular RL projects are built on top of verl:
• TinyZero (12k stars)
• Easy-R1 (4.4k stars)
• Search-R1 (3.8k stars)
• SimpleRL-Zoo (3.8k stars)
• OpenManus-RL (3.8k stars)
• SkyThought (3.4k stars)

verl's open-source community also connects with Megatron-LM, TensorRT-LLM, and more.

Lessons Learned over the Past Year
• Training: lack of abstraction; redundant code for different backends.
• Rollout: SPMD mode is intrusive and unfriendly to multi-turn conversation.
• Single-controller: coupled control flow and data flow, limiting scalability.
• Lack of native support for asynchronous training.

Core Design: HybridFlow
HybridFlow = Single-controller (MPMD) + Multi-controller (SPMD)

• The programming interface is based on the "single-controller" paradigm.
• With a single controller, the core logic of an RL algorithm is implemented in a few lines of code.
• This facilitates diverse RL algorithms: PPO, GRPO, RLOO, ReMax, PRIME, DAPO, etc.

Flexibility in Programming: "Single-Controller"
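The single-controller claim above can be sketched in a few lines: one driver script owns the control flow and delegates data-heavy work to worker groups. The worker class and method names below are hypothetical stand-ins for verl's worker groups, not its actual API.

```python
# Minimal single-controller sketch: one driver owns the control flow and
# calls into (stubbed) distributed worker groups.

class StubWorker:
    """Pretends to be an SPMD worker group; here it just echoes data."""

    def generate(self, prompts):                 # rollout
        return [p + " -> response" for p in prompts]

    def compute_reward(self, trajectories):      # reward (toy rule: length)
        return [float(len(t)) for t in trajectories]

    def compute_advantage(self, rewards):        # e.g. a GRPO-style group baseline
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]

    def update(self, trajectories, advantages):  # policy update (stubbed metric)
        return {"loss": sum(a * a for a in advantages) / len(advantages)}


def train_step(actor, prompts):
    # The whole RL step reads top-to-bottom in one place: control flow lives
    # here, while data flow is delegated to the workers.
    trajectories = actor.generate(prompts)
    rewards = actor.compute_reward(trajectories)
    advantages = actor.compute_advantage(rewards)
    return actor.update(trajectories, advantages)


metrics = train_step(StubWorker(), ["2+2=?", "capital of France?"])
```

Swapping PPO for GRPO or RLOO then only changes the body of this driver loop, not the workers.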
verl-core: Building Blocks for the RL Pipeline
• Model Engine: efficient training
• Rollout Engine: generation, environment interaction, reward calculation
• TransferQueue: data transmission, replay buffer
• Checkpoint Engine: weight synchronization

verl-trainer: RL Pipelines Built on Top of verl-core
• On-policy: synchronous
• One-step-off / fully async: asynchronous
• VLA
• ...

verl architecture (diagram).
Model Engine

Core training engine for both SFT and RL training. Goal: support larger models and longer context, pushing MFU to its limit.

API design:
• Abstract, Tinker-like API
• Runs in SPMD, with parallelism-aware dispatch & collect
• Trainer is agnostic to the backend; a new backend plugs in without trainer code changes

Features:
• Multiple backends and support for various parallelisms
• Sequence balancing
• LoRA
• Efficient kernels for training: FlashAttention, Liger-kernel, GroupGEMM / FusedMoE / DeepEP, FP8 training

| Backend | Parallelism | Performance | Supported models | New model support |
| --- | --- | --- | --- | --- |
| FSDP | FSDP + SP | Dense: medium; MoE: low | All transformer models | Day 0 |
| MCore | DP + TP + PP + EP + CP | High | See Megatron-Bridge support list | Few weeks to a month |
| VeOmni | FSDP + SP + EP | Medium | See VeOmni support list | ~1 week |
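The "trainer agnostic to backend" bullet can be illustrated with a small plugin registry: the trainer codes against a narrow interface and backends register themselves. `ModelEngine`, `FSDPEngine`, and `MCoreEngine` here are toy stand-ins, not verl's real classes.

```python
# Sketch of a backend-agnostic engine interface: the trainer only sees a small
# Protocol; each backend (FSDP, Megatron-Core, ...) is a plugin implementing it.
from typing import Protocol


class ModelEngine(Protocol):
    def forward_backward(self, batch: dict) -> dict: ...
    def optimizer_step(self) -> None: ...


class FSDPEngine:
    def forward_backward(self, batch: dict) -> dict:
        return {"loss": float(len(batch["tokens"])), "backend": "fsdp"}

    def optimizer_step(self) -> None:
        pass  # stand-in for the real optimizer update


class MCoreEngine:
    def forward_backward(self, batch: dict) -> dict:
        return {"loss": float(len(batch["tokens"])), "backend": "mcore"}

    def optimizer_step(self) -> None:
        pass


REGISTRY = {"fsdp": FSDPEngine, "mcore": MCoreEngine}


def train_one_batch(engine: ModelEngine, batch: dict) -> dict:
    # Trainer code never names a concrete backend; adding a backend means
    # adding a registry entry, not editing this function.
    metrics = engine.forward_backward(batch)
    engine.optimizer_step()
    return metrics


m = train_one_batch(REGISTRY["fsdp"](), {"tokens": [1, 2, 3]})
```

Registering a new backend is then one dictionary entry; the trainer loop stays untouched.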
Rollout Engine

Agent-loop-centric multi-turn conversation rollout.
• LLM Server: native inference server without intrusive modifications (weight synchronization, FP8 online quantization, router replay)
• AgentLoop: customizable agentic task loops (ReAct, SWE, GUI, etc.)
• Reward Loop: asynchronous reward calculation; supports rule-based and model-based (GRM, DistRM) rewards

Source: https://novasky-ai.notion.site/skyrl-v0
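The asynchronous reward calculation mentioned above can be sketched with asyncio: each trajectory is scored as soon as its own rollout finishes, rather than after the whole batch. All names below are illustrative stubs, not verl's reward-loop API.

```python
# Sketch of an asynchronous reward loop: rollouts are scored as they finish
# instead of waiting for the whole batch. The reward function is a toy
# rule-based check, standing in for rule-based / model-based (GRM, DistRM) scoring.
import asyncio


async def rollout(prompt: str) -> str:
    await asyncio.sleep(0)            # stand-in for LLM generation latency
    return prompt + " answer: 4"


async def reward(trajectory: str) -> float:
    await asyncio.sleep(0)            # stand-in for reward-model latency
    return 1.0 if "4" in trajectory else 0.0


async def rollout_and_score(prompt: str) -> tuple[str, float]:
    traj = await rollout(prompt)      # reward starts as soon as THIS rollout
    return traj, await reward(traj)   # ends, not when the whole batch does


async def main(prompts):
    return await asyncio.gather(*(rollout_and_score(p) for p in prompts))


results = asyncio.run(main(["2+2=?", "3+3=?"]))
```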
TransferQueue

Data bus for all components; serves as the replay buffer.
• Current limitation: the single controller handles both control and data flow, causing performance issues at large scale.
• Earlier failed attempt: the Ray object store, which incurred high tensor-serialization cost, lacked fine-grained access, and had an opaque GC mechanism.
• TransferQueue: zero serialization; extensible (multiple transport layers: TCP, RDMA); fine-grained access (read/write/append a subset of columns); proactive lifecycle management.

Longer context and multi-turn agentic tasks amplify the "long-tail" problem, increasing the need for asynchronous training.
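A minimal in-memory sketch of the fine-grained column access described above; the class and method names are invented for illustration, not TransferQueue's actual interface, and real transports (TCP/RDMA) are out of scope.

```python
# Columnar store sketch: producers append rows, consumers read/write only the
# columns they need, avoiding whole-batch serialization.
class ColumnQueue:
    def __init__(self):
        self.columns: dict[str, list] = {}

    def append(self, row: dict) -> None:
        for key, value in row.items():
            self.columns.setdefault(key, []).append(value)

    def read(self, keys: list[str], start: int, stop: int) -> dict:
        # Fine-grained: fetch only the requested columns and row range.
        return {k: self.columns[k][start:stop] for k in keys}

    def write(self, key: str, index: int, value) -> None:
        # Fine-grained: a worker updates one cell of one column in place.
        self.columns[key][index] = value


q = ColumnQueue()
q.append({"prompt": "p0", "response": "r0", "reward": None})
q.append({"prompt": "p1", "response": "r1", "reward": None})
q.write("reward", 0, 1.0)                  # reward worker fills only its column
view = q.read(["prompt", "reward"], 0, 2)  # trainer reads only what it needs
```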
Checkpoint Engine

Abstraction layer to synchronize weights between training and inference backends.
• Unified API: send_weights / receive_weights / get_weights
• Extensible: pluggable transport backends
  - Collective: nccl, hccl, uccl
  - P2P: nixl, mooncake
  - Local cache: shared memory, local disk

| Backend | Topology | Performance | Elasticity | Use case |
| --- | --- | --- | --- | --- |
| Collective | all_gather + broadcast | Very high | Low: must rebuild the nccl group | Fixed cluster |
| P2P | all_gather + ring p2p | Medium/high | High: dynamic adjustment | Elastic rollout, fault tolerance, heterogeneous clusters |
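The unified API named on the slide (send_weights / receive_weights / get_weights) with a pluggable transport can be sketched as follows; the transport class is a hypothetical local-cache stand-in, while real backends would wrap nccl/hccl collectives or nixl/mooncake p2p.

```python
# Sketch of the weight-sync abstraction with a pluggable transport backend.
from typing import Protocol


class Transport(Protocol):
    def send(self, weights: dict) -> None: ...
    def recv(self) -> dict: ...


class LocalCacheTransport:
    """Stand-in for a shared-memory / local-disk cache backend."""

    def __init__(self):
        self._buf: dict = {}

    def send(self, weights: dict) -> None:
        self._buf = dict(weights)

    def recv(self) -> dict:
        return dict(self._buf)


class CheckpointEngine:
    def __init__(self, transport: Transport):
        self.transport = transport
        self._latest: dict = {}

    def send_weights(self, weights: dict) -> None:   # trainer side
        self.transport.send(weights)

    def receive_weights(self) -> None:               # rollout side
        self._latest = self.transport.recv()

    def get_weights(self) -> dict:
        return self._latest


engine = CheckpointEngine(LocalCacheTransport())
engine.send_weights({"layer0.weight": [0.1, 0.2]})
engine.receive_weights()
synced = engine.get_weights()
```

Swapping `LocalCacheTransport` for a collective or p2p transport changes the topology and elasticity trade-off (see the table above) without touching the engine API.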
verl-trainer

Built on top of verl-core to construct RL training pipelines flexibly.
• On-policy trainer
• One-step-off-policy trainer
• Fully async trainer
• VLA trainer
• Many more custom trainers in verl-recipe

Sources: Meituan Search AI Infra Team; openvla/openvla
Agentic RL: Agent Loop Abstraction

What is an agent? An agent is a software system that uses AI for reasoning, planning, and memory, with the autonomy to make decisions, learn, and adapt.
• Tool calling: allowing the LLM to select and use various tools as needed.
• Memory: enabling the agent to retain and use information from previous steps.
• Planning: empowering the LLM to create and follow multi-step plans to achieve goals.

Agentic RL: training the LLM to make better decisions in complex, dynamic, real-world settings. (ReAct, from langchain-ai.)

Drawbacks of Synchronous Rollout
• Batch generation and environment execution are serial.
• The rollout and reward-calculation stages are serial.
• The rollout and training stages are serial.
Result: low inference and training efficiency.

How to Do Agentic RL?

Source: https://novasky-ai.notion.site/skyrl-v0
Environments:
• Search: online web search
• MCP tools: image/video editing, ...
• Code sandbox: execute code (Python, Java, ...)
• Virtual machine: operate a browser, PPT, Excel, ...
• Android emulator: operate apps

AgentLoop: given a user prompt, execute a user-defined loop and output the multi-turn chat history as a trajectory.
AgentLoop
• Server mode: vLLM/SGLang AsyncLLM engine
• Parallel execution: an asyncio loop runs multiple prompts in parallel
• Load balancing and sticky sessions: better KV-cache utilization

AgentLoop Highlights
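The bullets above can be sketched together: one asyncio loop drives many agent loops concurrently, and a sticky-session router pins each conversation to the same (stubbed) server so its KV cache stays warm across turns. The server names and the chat stub are invented for illustration.

```python
# Sketch of the AgentLoop execution model: concurrent multi-turn loops with
# sticky-session routing to (stubbed) LLM servers.
import asyncio

SERVERS = ["server-0", "server-1"]


def route(session_id: str) -> str:
    # Sticky session: the same session always hashes to the same server,
    # so that server's KV cache can be reused on every turn.
    return SERVERS[hash(session_id) % len(SERVERS)]


async def chat(server: str, history: list[str]) -> str:
    await asyncio.sleep(0)                      # stand-in for generation latency
    return f"{server}:turn{len(history)}"


async def agent_loop(session_id: str, turns: int) -> list[str]:
    server = route(session_id)                  # pinned once per conversation
    history: list[str] = []
    for _ in range(turns):                      # user-defined multi-turn loop
        history.append(await chat(server, history))
    return history


async def main():
    # Many prompts progress in parallel inside one asyncio event loop.
    return await asyncio.gather(*(agent_loop(f"s{i}", 2) for i in range(4)))


trajectories = asyncio.run(main())
```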
Agentic RL Practice 1: ReTool

ReTool: training the LLM to write Python code to solve math problems.
• Base model: Qwen/Qwen2.5-32B-Instruct
• SFT dataset: JoeYing/ReTool-SFT
• RL dataset: BytedTsinghua-SIA/DAPO-Math-17k
• Validation dataset: yentinglin/aime_2025
• Recipe: verl/recipe/retool

ReTool with AgentLoop overview: stage 1: SFT; stage 2: GRPO.
Agentic RL Practice 2: SWE Agent

• SWE Agent: enables the LLM to autonomously use tools to fix issues
• Sandbox: a Docker container launched by a remote container service
• SWE-ReX: runtime interface for interacting with the sandbox shell environment

SWE Agent infrastructure:
• Steps 1-5: set up the container, install tools, and initialize the shell session
• Step 6: set up the agent with a tool-config YAML, e.g. tool definitions
• Steps 7-11: the agent queries the model, parses the action, and executes the shell command

SWE Agent loop: https:/swe-ag/latest/background/architecture/
Retokenization Drift
• BPE is not reversible: "HAVING" may tokenize as "H" + "AVING" or as "HAV" + "ING".
• Tool parser: parsing and re-rendering may change whitespace and formatting.
• Chat template differences: vLLM, SGLang, and HuggingFace chat templates can render the same conversation differently.

Avoid Retokenization Drift
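The "HAVING" example can be made concrete with a toy vocabulary: the model emits one segmentation, but decoding to text and re-encoding picks another, so the training-side token ids drift from what the rollout engine actually sampled. The vocabulary and greedy encoder below are invented for the demo; real BPE tokenizers behave analogously.

```python
# Toy illustration of retokenization drift: the same string admits two valid
# segmentations, so decode-then-re-encode need not return the original ids.
VOCAB = {"H": 0, "AVING": 1, "HAV": 2, "ING": 3}
ID2TOK = {v: k for k, v in VOCAB.items()}


def decode(ids):
    return "".join(ID2TOK[i] for i in ids)


def encode_greedy_longest(text):
    # Greedy longest-match encoder: at each position it prefers the longest
    # matching token, so it picks "HAV" + "ING" for "HAVING".
    ids, i = [], 0
    while i < len(text):
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                ids.append(VOCAB[tok])
                i += len(tok)
                break
    return ids


generated_ids = [0, 1]                    # model emitted "H" + "AVING"
roundtrip_ids = encode_greedy_longest(decode(generated_ids))
drifted = roundtrip_ids != generated_ids  # re-encoding chose "HAV" + "ING"
```

The usual mitigation, as the slide suggests, is to carry the rollout engine's token ids through the pipeline instead of re-tokenizing decoded text.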
Some Early Experiment Results
• Model: Qwen3-Coder-30B-A3B-Instruct
• Context length: 64k
• Training dataset: r2e-gym (4500+ images)
• Evaluation dataset: swe-verified (500 images)

Ongoing Work
• Fully async: taming the "long-tail" problem
• LLM Gateway
  - OpenAI API, tokenize/detokenize, prefix-change detection
  - Partial-rollout auto-resume
  - KV-cache-aware load balancing
• Multi-trajectory: context compression, multi-agent, etc.

Performance with NVIDIA Support: Nsight Systems
Profile & Iterate with Nsight Systems

Key profiler options, each shown with its possible values:
• profiler.tool: nsys
• profiler.steps: [1, 2, 5], null, or []
• profiler.nsys.discrete: False or True
• profiler_enable: True
• actor.all_ranks: True, or ranks: [1, 2]
• profiler.continuous_steps: True or False

(Profiled timeline stages: generation, reward, old log prob, ref, update_actor.)
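The options above, transcribed into one illustrative config fragment. The key names come from the slide, but the exact nesting varies across verl versions, so treat this as a sketch of the knobs rather than a drop-in config.

```yaml
# Illustrative sketch assembled from the bullet list above; verify the exact
# key nesting against your verl version before use.
profiler_enable: True
profiler:
  tool: nsys
  steps: [1, 2, 5]          # null or [] per the slide's alternative values
  continuous_steps: False   # True/False per the slide
  nsys:
    discrete: False         # False/True per the slide
actor:
  all_ranks: True           # or: all_ranks: False with ranks: [1, 2]
```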
Workload Balance for Megatron Training

Workload balance in long-tailed data training:
• RL datasets have variable-length sequences, causing significant efficiency challenges during training.
• RL datasets show skewed, long-tailed distributions of sequence lengths.
• Result: GPU under-utilization in both memory and computation efficiency.

(Figures: long-tailed sequence-length distribution; ranks 0-3 waiting for the slowest rank.)

Imbalance in data parallel:
• RL without packing/dynamic batching: DP synchronization waits for the slowest rank (stragglers).
• Example: GRPO with Qwen2.5-7B, DP=4, PP=2, no sequence packing/dynamic batching.

Imbalance in pipeline parallel:
• Same setup: GRPO with Qwen2.5-7B, DP=4, PP=2, no sequence packing/dynamic batching.

Solution:
• Inter-DP: workload-aware data-parallel split, modeling the quadratic complexity of attention and the linear complexity of the FFN.
• Intra-DP:
  - Workload-aware dynamic batching to even the workload across micro-batches.
  - Sort the micro-batches so that consecutive ones have similar workloads.
  - Place smaller micro-batches at both ends to reduce the bubbles exposed during pipeline warm-up and cool-down.

Sorted dynamic batching reduces PP bubbles at the warm-up and cool-down stages.
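The solution bullets can be sketched end to end: a per-sequence cost model (quadratic attention term plus linear FFN term), greedy packing into cost-balanced micro-batches, and an ordering that places the cheapest micro-batches at the pipeline's two ends. The constants and the packing heuristic are illustrative, not verl's implementation.

```python
# Workload-aware dynamic batching sketch: cost-model each sequence, pack
# micro-batches to even the workload, and order them for the pipeline.


def cost(seq_len, alpha=1e-4, beta=1.0):
    # Per-sequence workload model: quadratic attention + linear FFN terms.
    return alpha * seq_len ** 2 + beta * seq_len


def pack_micro_batches(seq_lens, max_cost):
    # Greedy first-fit-decreasing packing so micro-batches carry similar cost.
    batches, totals = [], []
    for s in sorted(seq_lens, reverse=True):
        for i in range(len(batches)):
            if totals[i] + cost(s) <= max_cost:
                batches[i].append(s)
                totals[i] += cost(s)
                break
        else:
            batches.append([s])
            totals.append(cost(s))
    return batches, totals


def order_for_pipeline(batches, totals):
    # "Organ-pipe" order: cheapest micro-batches at both ends (warm-up and
    # cool-down), heaviest in the middle, shrinking exposed PP bubbles.
    ranked = [b for _, b in sorted(zip(totals, batches))]
    return ranked[0::2] + ranked[1::2][::-1]


batches, totals = pack_micro_batches([10, 20, 30], max_cost=41.0)
schedule = order_for_pipeline(batches, totals)
```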
Performance:
• GRPO training of a 7B model on 8x Hopper 80 GB GPUs achieves the best performance with the Megatron backend.

(Megatron performance chart.)