2026 NVIDIA GTC: Best Practices for Reinforcement Learning of Large Models with verl

verl's Open-Source Community

So far, verl has gained:
• 18k+ stars
• 3k+ forks
• 1.9k+ commits
• 490 contributors
• 2k+ issues

Many popular RL projects are built on top of verl:
• TinyZero (12k stars)
• Easy-R1 (4.4k stars)
• Search-R1 (3.8k stars)
• SimpleRL-Zoo (3.8k stars)
• OpenManus-RL (3.8k stars)
• SkyThought (3.4k stars)

Integrations:
• Megatron-LM

• TensorRT-LLM
• …

Lessons Learned over the Past Year
• Training: lack of abstraction; redundant code for different backends.
• Rollout: SPMD mode is intrusive and unfriendly to multi-turn conversation.
• Single-controller: coupled control flow and data flow, limiting scalability.
• Lack of native support for asynchronous training.

Core Design: HybridFlow
HybridFlow = single-controller (MPMD) + multi-controller (SPMD)
The programming interface is based on the "single-controller" paradigm.

Flexibility in Programming: "Single-Controller"
With the single-controller paradigm, the core logic of an RL algorithm is implemented in a few lines of code!
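The "few lines of code" claim can be made concrete with a toy sketch: one driver function owns the whole algorithm, and each call conceptually fans out to a distributed worker group. Every class and method name below is a hypothetical stand-in, not verl's actual interface.

```python
# Minimal single-controller sketch of a PPO-style step. In verl, the driver
# dispatches each call to distributed worker groups; here, plain Python objects
# stand in for those workers. All names are illustrative, not verl's real API.

class Rollout:
    def generate(self, prompts):
        # A real rollout engine returns token ids; strings keep the sketch small.
        return {"prompts": prompts, "responses": [p + " -> answer" for p in prompts]}

class Critic:
    def compute_values(self, batch):
        return [0.5 for _ in batch["responses"]]   # placeholder value estimates
    def update(self, batch):
        pass                                       # value-regression step

class Actor:
    def update(self, batch):
        pass                                       # policy-gradient step

def ppo_step(prompts, rollout, actor, critic, reward_fn):
    batch = rollout.generate(prompts)                              # 1. sample
    batch["rewards"] = [reward_fn(r) for r in batch["responses"]]  # 2. score
    batch["values"] = critic.compute_values(batch)                 # 3. estimate
    batch["advantages"] = [r - v for r, v in zip(batch["rewards"], batch["values"])]
    actor.update(batch)                                            # 4. update policy
    critic.update(batch)                                           # 5. update critic
    return batch

batch = ppo_step(["1+1=?"], Rollout(), Actor(), Critic(), lambda resp: 1.0)
```

Because the driver sees the whole batch between calls, swapping PPO for GRPO or RLOO only changes how advantages are computed in this one function.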

This facilitates diverse RL algorithms: PPO, GRPO, RLOO, ReMax, PRIME, DAPO, etc.

verl Architecture
verl-core: building blocks for the RL pipeline
• Model Engine: efficient training
• Rollout Engine: generation, environment interaction, reward calculation
• TransferQueue: data transmission, replay buffer
• Checkpoint Engine: weight synchronization
verl-trainer: RL pipelines built on top of verl-core
• On-policy: synchronous
• One-step-off / fully async: asynchronous
• VLA
• …

Model Engine
The core training engine for both SFT and RL training.
Goal: support larger models and longer context, pushing MFU to its limit.

API design:
• Abstract, Tinker-like API
• Runs in SPMD; parallelism-aware dispatch & collect
• The trainer is agnostic to the backend: a new backend plugs in without trainer code changes

Features:
• Multiple backends and support for various parallelisms
• Sequence balancing
• LoRA
• Efficient training kernels: FlashAttention, Liger-Kernel, GroupGEMM / FusedMoE / DeepEP, FP8 training

| Backend | Parallelism    | Performance              | Supported models                     | New model support     |
| FSDP    | FSDP+SP        | Dense: medium / MoE: low | All transformer models               | Day 0                 |
| MCore   | DP+TP+PP+EP+CP | High                     | See the Megatron-Bridge support list | A few weeks or months |
| VeOmni  | FSDP+SP+EP     | Medium                   | See the VeOmni support list          | ~1 week               |
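A backend-agnostic engine API of the kind described above can be sketched with a small registry: the trainer only ever touches the abstract interface, and adding a backend means adding one registry entry. All names here (`ModelEngine`, `register_engine`, the `"fsdp"` key) are illustrative, not verl's real classes.

```python
# Sketch of a trainer-agnostic Model Engine interface with pluggable backends.
# Hypothetical names; verl's actual abstractions may differ.
from abc import ABC, abstractmethod

ENGINE_REGISTRY = {}

def register_engine(name):
    # A new backend registers itself; trainer code never changes.
    def deco(cls):
        ENGINE_REGISTRY[name] = cls
        return cls
    return deco

class ModelEngine(ABC):
    @abstractmethod
    def forward_backward(self, batch): ...
    @abstractmethod
    def optimizer_step(self): ...

@register_engine("fsdp")
class FSDPEngine(ModelEngine):
    def forward_backward(self, batch):
        return {"loss": sum(batch) / len(batch)}  # placeholder loss
    def optimizer_step(self):
        return "stepped"

def make_engine(backend):
    return ENGINE_REGISTRY[backend]()

engine = make_engine("fsdp")
loss = engine.forward_backward([1.0, 3.0])["loss"]
```

In this scheme a Megatron-style or VeOmni-style backend would be another `@register_engine(...)` class behind the same two methods.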

Rollout Engine
Agent-loop-centric multi-turn conversation rollout.
• LLM Server: a native inference server without intrusive modifications
  - Weight synchronization
  - FP8 online quantization
  - Router replay
• AgentLoop: customizable agentic task loops: ReAct, SWE, GUI, etc.
• RewardLoop: asynchronous reward calculation; supports rule-based and model-based (GRM, DistRM) rewards
Source: https://novasky-ai.notion.site/skyrl-v0

TransferQueue
The data bus for all components; also the replay buffer.
• Current limitation: the single controller handles both control flow and data flow, causing performance issues at large scale.
• Earlier failed attempt: the Ray object store suffered high tensor-serialization cost, lacked fine-grained access, and had an opaque GC mechanism.
• TransferQueue:
  - Zero serialization
  - Extensible: multiple transport layers (TCP, RDMA)
  - Fine-grained access: read/write/append subsets of columns
  - Proactive lifecycle management
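The fine-grained column access described above can be illustrated with a toy in-memory version: each worker touches only the columns it needs, instead of serializing whole batches through a central store. This is an illustrative sketch, not verl's TransferQueue implementation.

```python
# Toy columnar data bus: readers and writers address subsets of columns.
# Hypothetical class; verl's real TransferQueue adds transports and lifecycle.

class TransferQueue:
    def __init__(self):
        self.columns = {}          # column name -> list of rows

    def append(self, **cols):
        # Any worker can append just its own columns.
        for name, rows in cols.items():
            self.columns.setdefault(name, []).extend(rows)

    def read(self, names, start=0, stop=None):
        # Read a subset of columns without touching the rest.
        return {n: self.columns[n][start:stop] for n in names}

    def write(self, name, start, rows):
        # Overwrite a slice of one column in place.
        self.columns[name][start:start + len(rows)] = rows

q = TransferQueue()
q.append(prompt=["p0", "p1"], response=["r0", "r1"])  # rollout worker
q.append(reward=[0.0, 1.0])                           # reward worker, its column only
view = q.read(["response", "reward"])                 # trainer reads what it needs
```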

Longer context and multi-turn agentic tasks amplify the "long-tail" problem and increase the need for asynchronous training.

Checkpoint Engine
An abstraction layer to synchronize weights between training and inference backends.

• Unified API: send_weights / receive_weights / get_weights
• Extensible: pluggable transport backends
  - Collective: nccl, hccl, uccl
  - P2P: nixl, mooncake
  - Local cache: shared memory, local disk

| Backend    | Topology               | Performance | Elasticity               | Use case      |
| Collective | all_gather + broadcast | Very high   | Low: rebuild nccl group  | Fixed cluster |
| P2P        | all_gather + ring p2p  | Medium/high | High: dynamic adjustment | Elastic rollout, fault tolerance, heterogeneous clusters |
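The unified API above can be sketched as a thin engine over a pluggable transport. The `send_weights`/`receive_weights` names come from the slide; the `LocalTransport` class and the in-process "channel" are made up for the demo and stand in for nccl/hccl collectives or nixl/mooncake p2p.

```python
# Sketch of a weight-sync engine with a swappable transport backend.
# Illustrative only; not verl's Checkpoint Engine implementation.

class LocalTransport:
    """Stands in for a collective or p2p transport (nccl, nixl, mooncake, ...)."""
    def __init__(self):
        self.channel = {}
    def send(self, name, tensor):
        self.channel[name] = list(tensor)   # copy, as a real transport would
    def recv(self, name):
        return self.channel[name]

class CheckpointEngine:
    def __init__(self, transport):
        self.transport = transport          # pluggable: swap without API changes
    def send_weights(self, state_dict):
        for name, tensor in state_dict.items():
            self.transport.send(name, tensor)
    def receive_weights(self, names):
        return {name: self.transport.recv(name) for name in names}

engine = CheckpointEngine(LocalTransport())
engine.send_weights({"layer0.weight": [0.1, 0.2]})   # training side
weights = engine.receive_weights(["layer0.weight"])  # inference side
```

Swapping `LocalTransport` for a collective vs. p2p backend is exactly the trade-off in the table above: topology and elasticity change, the engine API does not.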

verl-trainer
Built on top of verl-core; constructs RL training pipelines flexibly.
• On-policy trainer
• One-step-off-policy trainer
• Fully async trainer
• VLA trainer
• Many more custom trainers in verl-recipe
Source: Meituan Search AI Infra Team
Source: openvla/openvla

Agentic RL: AgentLoop Abstraction

What is an agent?
Agent: a software system that uses AI for reasoning, planning, and memory, with the autonomy to make decisions, learn, and adapt.
● Tool calling: allowing the LLM to select and use various tools as needed.
● Memory: enabling the agent to retain and use information from previous steps.
● Planning: empowering the LLM to create and follow multi-step plans to achieve goals.
Agentic RL: training the LLM to make better decisions in complex, dynamic, real-world settings.
ReAct (from langchain-ai)

Drawbacks of synchronous rollout

● Batch generation and environment execution are serial.
● The rollout and reward-calculation stages are serial.
● The rollout and training stages are serial.
The result: low inference and training efficiency!

How to do agentic RL?
source: https://novasky-ai.notion.site/skyrl-v0

● Search: online web search
● MCP tools: image and video editing, …
● Code sandbox: execute code (Python, Java, …)
● Virtual machine: operate a browser, PPT, Excel, …
● Android emulator: operate apps

AgentLoop
AgentLoop: given a user prompt, execute a user-defined loop and output the multi-turn chat history as the trajectory.
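That contract can be sketched in a few lines: the loop alternates model turns and tool turns until the model stops calling tools, and the accumulated message list is the training trajectory. The model and tool below are toy stand-ins, not verl's real classes.

```python
# Toy AgentLoop: user prompt in, multi-turn chat history (trajectory) out.
# fake_llm and the <tool> convention are invented for this sketch.

def fake_llm(messages):
    # Emits a tool call on the first turn, a final answer once a tool ran.
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "<tool>search(verl)</tool>"}
    return {"role": "assistant", "content": "verl is an RL training library."}

def agent_loop(prompt, llm, tools, max_turns=4):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = llm(messages)
        messages.append(reply)
        if "<tool>" not in reply["content"]:
            break                                  # final answer: stop the loop
        result = tools["search"]("verl")           # parse + execute the tool call
        messages.append({"role": "tool", "content": result})
    return messages                                # the trajectory for training

traj = agent_loop("What is verl?", fake_llm, {"search": lambda q: f"docs about {q}"})
```

A ReAct, SWE, or GUI loop differs only in which tools are registered and how actions are parsed; the trajectory contract stays the same.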

AgentLoop Highlights
● Server mode: vLLM/SGLang AsyncLLM engine
● Parallel execution: an asyncio loop runs multiple prompts in parallel
● Load balancing and sticky sessions: better KV-cache utilization
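The last two highlights can be sketched together: an asyncio loop overlaps many rollouts, and a sticky hash routes every turn of one conversation to the same server replica so its KV cache can be reused. The servers are simulated; none of these names are verl's real classes.

```python
# Sketch of parallel rollout with sticky-session routing (illustrative only).
import asyncio
import zlib

SERVERS = ["server-0", "server-1"]

def pick_server(session_id):
    # Sticky session: the same conversation always hits the same replica,
    # so its KV cache stays warm there.
    return SERVERS[zlib.crc32(session_id.encode()) % len(SERVERS)]

async def generate(session_id, prompt):
    server = pick_server(session_id)
    await asyncio.sleep(0)              # stand-in for the HTTP round trip
    return f"{server}: reply to {prompt!r}"

async def run_batch(prompts):
    # All rollouts overlap instead of waiting for the longest one.
    tasks = [generate(f"sess-{i}", p) for i, p in enumerate(prompts)]
    return await asyncio.gather(*tasks)

replies = asyncio.run(run_batch(["p0", "p1", "p2"]))
```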

Agentic RL Practice 1: ReTool
ReTool: training an LLM to write Python code to solve math problems.
● Base model: Qwen/Qwen2.5-32B-Instruct
● SFT dataset: JoeYing/ReTool-SFT
● RL dataset: BytedTsinghua-SIA/DAPO-Math-17k
● Validation dataset: yentinglin/aime_2025
● Recipe: verl/recipe/retool

ReTool with AgentLoop: Overview
Stage 1: SFT
Stage 2: GRPO

Agentic RL Practice 2: SWE Agent
● SWE agent: enables the LLM to autonomously use tools to fix issues
● Sandbox: a Docker container launched by a remote container service
● SWE-ReX: the runtime interface for interacting with the sandbox shell environment

SWE Agent Infrastructure

● Steps 1~5: set up the container, install tools, and initialize the shell session
● Step 6: set up the agent with a tool-config YAML, e.g. the tool definitions
● Steps 7~11: the agent queries the model, parses the action, and executes the shell command

SWE Agent Loop
https:/swe-ag/latest/background/architecture/
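The numbered steps above can be condensed into one loop. Every class here is a toy stand-in: `Sandbox` simulates the container service plus shell session, and `fake_model` simulates the LLM; neither is part of SWE-agent or verl.

```python
# Toy SWE-agent episode: setup, tool config, then query-parse-execute loop.

class Sandbox:
    """Steps 1~5: container setup and shell session, simulated in-process."""
    def run(self, command):
        return f"$ {command}\nok"

def fake_model(history):
    # Step 7: the agent queries the model for the next action; this toy model
    # issues one shell command, then submits once it sees a successful run.
    return "submit" if "ok" in history[-1] else "ls /repo"

def swe_agent_episode(sandbox, model, tool_config, max_steps=10):
    history = [f"tools: {sorted(tool_config)}"]      # step 6: tool definitions
    for _ in range(max_steps):
        action = model(history)                      # step 7: query the model
        if action == "submit":                       # terminal action
            break
        history.append(sandbox.run(action))          # steps 8~11: parse + execute
    return history

history = swe_agent_episode(Sandbox(), fake_model, {"bash", "edit"})
```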

Retokenization Drift
● BPE is irreversible: "HAVING" = "H" + "AVING" or "HAV" + "ING"
● Tool parser: parsing and re-rendering may change whitespace and formatting.
● Chat template differences: vLLM, SGLang, and HuggingFace

ChatModel: avoiding retokenization drift
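The BPE point can be demonstrated with a toy tokenizer: decoding tokens to text and re-encoding the text does not always reproduce the original token sequence, which is why trajectories should carry the rollout engine's token ids rather than be re-tokenized from text. The vocabulary and greedy tokenizer below are made up for the demo.

```python
# Toy demonstration of retokenization drift: same text, two segmentations.

def encode_greedy(text, vocab):
    # Greedy longest-match tokenizer over a fixed vocabulary.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"H", "HAV", "AVING", "ING", "A", "V", "I", "N", "G"}
rollout_tokens = ["H", "AVING"]           # what the inference engine emitted
text = "".join(rollout_tokens)            # decode to plain text: "HAVING"
retokenized = encode_greedy(text, vocab)  # what a trainer-side re-encode yields
# retokenized is ["HAV", "ING"]: identical text, different token sequence,
# so log-probs computed on the re-encoded ids would not match the rollout.
```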

Some Early Experiment Results
● Model: Qwen3-Coder-30B-A3B-Instruct
● Context length: 64k
● Training dataset: r2e-gym (4500+ images)
● Evaluation dataset: swe-verified (500 images)

Ongoing Work
● Fully async: taming the "long-tail" problem
● LLM Gateway
  • OpenAI API; tokenize/detokenize; prefix-change detection
  • Partial-rollout auto resume
  • KV-cache-aware load balancing
● Multi-trajectory: context compression, multi-agent, etc.

Performance with NVIDIA Support

Profile & Iterate with Nsight Systems
• profiler.tool: nsys
• profiler.enable: True
• profiler.nsys.discrete: False | True
• actor.all_ranks: True; ranks: [1, 2]
• profiler.steps: [1, 2, 5] | null | []
• profiler.continuous_steps: True | False
[Figure: Nsight Systems timeline of one training step: generation, reward, old log prob, ref, update_actor]
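The knobs listed above can be gathered into a single configuration fragment. The exact field paths below are an assumption reconstructed from the slide; check them against verl's profiler configuration reference before use.

```yaml
# Hypothetical profiler section, mirroring the slide's knobs.
profiler:
  enable: true
  tool: nsys
  steps: [1, 2, 5]         # or null (profile every step) or [] (profile none)
  continuous_steps: false  # true: merge the listed steps into one capture
  nsys:
    discrete: false        # true: emit one report per profiled region
actor:
  profiler:
    all_ranks: false       # profile every rank, or only the ranks below
    ranks: [1, 2]
```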

Workload Balance for Megatron Training

Workload balance in long-tailed data training
• RL datasets have variable-length sequences, causing significant efficiency challenges during training.
• RL datasets show skewed, long-tailed distributions of sequence lengths.
• Result: GPU under-utilization in both memory and compute efficiency.
[Figure: long-tailed sequence-length distribution (sequence length vs. frequency)]
[Figure: ranks 0~3 in data parallel, each waiting for the slowest rank]

Workload balance in long-tailed data training: imbalance in data parallelism
• RL without packing/dynamic batching: DP synchronization waits for the slowest rank (stragglers).
• Setup: GRPO, Qwen2.5-7B, DP=4, PP=2, no sequence packing/dynamic batching.

Workload balance in long-tailed data training: imbalance in pipeline parallelism
• Setup: GRPO, Qwen2.5-7B, DP=4, PP=2, no sequence packing/dynamic batching.

Workload balance in long-tailed data training: solution
• Inter-DP: workload-aware data-parallel splitting, modeling both the quadratic complexity of attention and the linear complexity of the FFN.
• Intra-DP: workload-aware dynamic batching to even out the workload across micro-batches.
  - Sort the micro-batches so that consecutive ones have similar workloads.
  - Place smaller micro-batches at both ends to reduce the bubbles exposed during pipeline warm-up and cool-down.
This sorted dynamic batching reduces PP bubbles at the warm-up and cool-down stages.
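The intra-DP strategy above can be sketched in two steps: pack variable-length sequences into micro-batches under a token budget, then order the micro-batches so the lightest ones sit at both ends of the pipeline schedule. `pack_by_budget` and `schedule` are illustrative helpers, not verl's implementation.

```python
# Sketch of sorted dynamic batching for long-tailed sequence lengths.

def pack_by_budget(lengths, budget):
    # First-fit packing of sequence lengths into micro-batches capped by a
    # total-token budget (longest-first to keep batches balanced).
    batches, cur, cur_tokens = [], [], 0
    for n in sorted(lengths, reverse=True):
        if cur and cur_tokens + n > budget:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(n)
        cur_tokens += n
    if cur:
        batches.append(cur)
    return batches

def schedule(batches):
    # Sort by workload, then alternate front/back so the heaviest micro-batches
    # sit in the middle and the lightest at both ends, shrinking the bubbles
    # exposed during pipeline warm-up and cool-down.
    order = sorted(batches, key=sum)
    front, back = [], []
    for i, b in enumerate(order):
        (front if i % 2 == 0 else back).append(b)
    return front + back[::-1]

lengths = [9, 1, 8, 2, 7, 3]
micro = pack_by_budget(lengths, budget=10)
plan = schedule(micro)
```

Here `sum(batch)` stands in for the real cost model; the slide's inter-DP split would instead score each batch with a quadratic attention term plus a linear FFN term.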

Workload balance in long-tailed data training: performance
• GRPO training of a 7B model on 8× Hopper 80 GB GPUs.
• Best performance with the Megatron backend.
[Figure: Megatron backend performance results]
