Reinforcement Learning for Safe LLM Code Generation
Roy Huang
Electrical Engineering and Computer Sciences
University of California, Berkeley
Technical Report No. UCB/EECS-2025-123
/Pubs/TechRpts/2025/EECS-2025-123.html
May 19, 2025
Copyright © 2025, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Acknowledgement
I would like to acknowledge and thank the entire rLLM team, in particular, Michael Luo and Sijun Tan, for being supportive, responsive, and helpful mentors and introducing me to RL training on LLMs and agents. I would also like to thank Prof. Joseph E. Gonzalez for the supportive advising, guiding me along my journey. Most importantly, I would like to thank my parents for supporting me through my life and getting me to where I am today. I could not have made it without their love and encouragement.
Reinforcement Learning for Safe LLM Code Generation
by Yu Fei Huang
Research Project
Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II.
Approval for the Report and Comprehensive Examination:
Committee:
Professor Joseph E. Gonzalez
Research Advisor
5/15/2025
(Date)
******
Professor Raluca Ada Popa
Second Reader
5/18/2025
(Date)
Reinforcement Learning for Safe LLM Code Generation¹
by
Yu Fei Huang
A thesis submitted in partial satisfaction of the
requirements for the degree of
Master of Science
in
Electrical Engineering and Computer Sciences
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Joseph E. Gonzalez, Chair
Associate Professor Raluca Ada Popa
Spring 2025
¹ This thesis is adapted from GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications [26] and DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level [17]. It is recommended to cite these papers over this report.
Reinforcement Learning for Safe LLM Code Generation
Copyright 2025
by
Yu Fei Huang
Abstract
Reinforcement Learning for Safe LLM Code Generation
by
Yu Fei Huang
Master of Science in Electrical Engineering and Computer Sciences
University of California, Berkeley
Professor Joseph E. Gonzalez, Chair
Reinforcement learning (RL) has become a primary technique for aligning Large Language Models (LLMs) with complex reasoning objectives, yet convergence is fragile when reward signals are noisy or exploitable. This thesis presents rLLM, an open-source, Ray-based RL framework that combines an improved Group-Relative Policy Optimization (GRPO+), a veRL backend modified with asynchronous pipelined sampling, and iterative context lengthening. Using rLLM we trained Deepcoder-14B, a 14-billion-parameter code-reasoning model that attains 60.6% Pass@1 on LiveCodeBench, a 1936 Codeforces rating, and 92.6% Pass@1 on HumanEval+, matching OpenAI's proprietary o3-mini (low) and o1 on these benchmarks.
We show that such performance hinges on an airtight, sandboxed execution environment that safeguards reward integrity. To that end we take inspiration from GoEx, a post-facto-validated runtime that envelopes every REST call, database mutation, and file operation in deterministic undo and blast-radius-bounded confinement semantics. rLLM consumes these airtight environments directly when computing rewards, eliminating reward hacking.
The findings underscore that the proposed GRPO+ modification significantly enhances training convergence compared to existing widely-adopted algorithms such as GRPO and DAPO. Furthermore, the asynchronous pipelining mechanism incorporated into veRL substantially optimizes the training infrastructure, enabling efficient scalability. Ultimately, by integrating these advancements within a meticulously secure environment, this thesis delivers a comprehensive RL framework that reliably aligns LLMs with sophisticated reasoning objectives, paving the way for future research into robust and scalable reinforcement learning systems.
To my parents, advisor, and research collaborators
Contents
List of Figures
List of Tables
1 Introduction
1.1 Background and Motivation
1.2 rLLM Framework Overview
2 Related Work
2.1 Reinforcement Learning for Language-Model Alignment
2.2 Distributed Frameworks and Systems Infrastructure
2.3 Secure Execution Environments and Reward Integrity
3 GoEX: Execution Runtime for LLMs
3.1 Designing a Runtime for LLM Execution
3.2 Reversibility and Damage Confinement
3.3 Symbolic Credentials and Sandboxed Execution
3.4 Credential Storage and Access Control
3.5 System Design Components
4 rLLM: RL Training for LLM Reasoning
4.1 Problem Statement
4.2 rLLM Framework
5 rLLM Experiment: Deepcoder-14B
5.1 Dataset Curation Strategy
5.2 Code Sandbox Environment for Reward Computation
5.3 Reward Function Design
5.4 Evaluation Results
5.5 End-to-end Performance
6 Conclusion
Bibliography
A Codeforces Evaluation
List of Figures
3.1 GoEX's runtime for executing RESTful API calls. Upon receiving the user's prompt, GoEX presents two alternatives. First, an LLM can be prompted to come up with the (Action, Undo-Action) pair. Second, the application developer can provide tuples of actions and their corresponding undo-actions (function calls) from which the LLM can pick amongst.
3.2 Runtime for executing actions on a database. We present two techniques to determine if a proposed action can be undone. On the left, for non-transactional databases like MongoDB, and for flexibility, we prompt the LLM to generate (Action, Undo-Action, test-bed) tuples, which we then evaluate in an isolated container to catch any false (Action, Undo-Action) pairs. On the right, we can provide a deterministic undo with guarantees by employing the transaction semantics of databases.
3.3 Runtime for executing actions on a file system. GoEX presents two abstractions. On the left, the LLM is prompted to come up with an (Action, Undo-Action, test-bed) tuple, which GoEX evaluates in an isolated container to catch any false (Action, Undo-Action) pairs. On the right, deterministic guarantees are provided by using a version control system like Git or Git LFS.
4.1 Average training reward between GRPO+ and GRPO for the 16K run. GRPO's reward curve eventually collapses; GRPO+'s curve is stable due to Clip High.
4.2 Due to overlong filtering, GRPO+'s response length grows steadily over time.
4.3 Clip High and No Entropy Loss ensure that GRPO+'s token-level entropy does not collapse and encourage sufficient exploration.
4.4 DeepCoder's average response length and training rewards as training progresses. Average response length increases from 8K → 17.5K context length.
4.5 Verl's PPO/GRPO training pipeline. Every RL iteration cycles through sampling, reward function calculation, and training. Sampling is the bottleneck; training speed is bounded by straggler samplers that generate long sequences.
4.6 Minibatch Pipelining. Samplers and trainers operate in separate worker groups. As samplers complete and release mini-batches (for PPO/GRPO), trainer workers process them asynchronously. At the end of an iteration, trainers broadcast their weights to samplers.
4.7 One-Off Pipelining. Samplers generate a batch one iteration ahead, while trainers update gradients using the previous iteration's data. Reward function calculation is also interleaved with sampling. This approach does not introduce asynchronous off-policy samples into GRPO/PPO's on-policy algorithm.
5.1 One-off pipelining fully masks away trainer and reward computation times, reducing training times by 1.4x for math and 2x for coding.
List of Tables
5.1 Model Performance on Coding and Math Benchmarks
Acknowledgments
I would like to acknowledge and thank the entire rLLM team, in particular, Michael Luo and Sijun Tan, for being supportive, responsive, and helpful mentors and introducing me to RL training on LLMs and agents. I would also like to thank Prof. Joseph E. Gonzalez for the supportive advising, guiding me along my journey. Most importantly, I would like to thank my parents for supporting me through my life and getting me to where I am today. I could not have made it without their love and encouragement.
Chapter 1
Introduction
Large Language Models (LLMs) have advanced from sequence-to-sequence autoregressors into agents capable of multi-step reasoning, tool calling, and code synthesis. Supervised pre-training supplies fluent linguistic priors, yet it is reinforcement learning (RL) that aligns those priors with task-level objectives such as passing unit-test suites or developing emergent reasoning patterns. Optimizing an LLM policy πθ over long, sparse reward trajectories, however, remains brittle: credit-assignment noise grows quadratically with sequence length, and poorly instrumented environments invite reward hacking, where policies learn spurious strategies that inflate the scalar return while degrading true utility.
This thesis addresses these challenges by proposing rLLM, a purpose-built RL framework that couples a novel Group-Relative Policy Optimization Plus (GRPO+) algorithm, building on prior work with GRPO and DAPO, with an asynchronous, Ray-orchestrated sampling pipeline. rLLM's design goal is two-fold: (i) sustain high-throughput gradient updates on clusters of thousands of GPUs; and (ii) preserve reward integrity through airtight execution sandboxes inspired by the GoEx post-facto validation runtime. The framework is validated by training Deepcoder-14B, a 14-billion-parameter code-reasoning model that matches the performance of proprietary systems while remaining fully open source.
1.1 Background and Motivation
The alignment of Large Language Models (LLMs) has progressed from supervised fine-tuning (SFT) to full reinforcement-learning pipelines that optimize a policy over long, task-level roll-outs. Early RL with human feedback (RLHF) systems adopted Proximal Policy Optimization (PPO) and its KL-constrained variants, but the high variance of long-horizon credit assignment soon motivated Group-Relative Policy Optimization (GRPO), which measures advantages against peer trajectories sampled from the same prompt group, markedly improving stability on reasoning tasks. Subsequent work such as DAPO added dynamic sampling and decoupled clipping to push large-scale training beyond 30B parameters. Despite these algorithmic advances, convergence is still brittle whenever reward channels leak
noise or are exploitable. Studies on reward hacking show that agents readily discover loopholes—fabricating logs, short-circuiting unit tests, or corrupting state—to inflate nominal returns while degrading true task success.
Scaling RL to frontier-sized models therefore demands system innovations as well. Synchronous actor–learner loops stall on the longest rollout, under-utilising expensive accelerators; industrial solutions now favour asynchronous pipelines built atop Ray's distributed execution engine, which offers elastic, fault-tolerant placement of both actors and learners. Libraries such as veRL expose lightweight RPC interfaces for high-throughput sampling and have become a de-facto substrate for open-source RLHF research. Yet throughput alone is insufficient: long-context optimization (32k–64k tokens) multiplies gradient noise and memory pressure, motivating iterative context-lengthening curricula that grow windows only after variance plateaus.
Equally critical is the execution environment where roll-outs are evaluated. Without explicit safeguards, an LLM tuned to interact with external tools can overwrite databases, issue destructive API calls, or generate deceptive test harnesses that pass benchmarks while hiding faulty logic. The Berkeley GoEx runtime addresses this by wrapping every REST call, file operation, and SQL mutation in deterministic undo and blast-radius-bounded confinement, producing reversible traces that can be safely replayed or discarded. Such post-facto validation provides tamper-proof reward signals, closing an essential safety loop ignored by many algorithm-centric studies.
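To make the undo pairing concrete, the sketch below couples a hypothetical REST action with a matching undo, so an LLM-proposed call can be rolled back if post-facto validation rejects it. The endpoint, payload, and function names are illustrative assumptions for exposition, not GoEx's actual API.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical service used only for illustration

def action(payload: dict) -> str:
    """Forward action: create a ticket and return its id."""
    resp = requests.post(f"{BASE_URL}/tickets", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["id"]

def undo_action(ticket_id: str) -> None:
    """Undo action: delete the ticket created by the forward action."""
    resp = requests.delete(f"{BASE_URL}/tickets/{ticket_id}", timeout=10)
    resp.raise_for_status()

# Execute first, validate after: keep the effect only if the user (or a checker) accepts it.
ticket_id = action({"title": "Reset staging database"})
if input(f"Keep ticket {ticket_id}? [y/N] ").lower() != "y":
    undo_action(ticket_id)
```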
Finally, modern code-reasoning benchmarks like LiveCodeBench, HumanEval+, and Codeforces have emerged as stringent tests of reasoning quality under contamination-free evaluation. Open-weight models like Deepcoder-14B now match proprietary systems at 14B parameters by combining high-quality data curation with RL fine-tuning, achieving 60.6% Pass@1 on LiveCodeBench and a 1936 Codeforces rating. Their success underscores the synergistic effect of cutting-edge optimization algorithms, efficient distributed infrastructure, and meticulously sandboxed environments—precisely the triad this thesis seeks to systematise through the rLLM framework.
1.2 rLLM Framework Overview
The rLLM stack is engineered around three tightly coupled layers—algorithm, systems, and curriculum—each tuned to mitigate a specific failure mode in large-scale RL for LLMs.
Algorithmic core (GRPO+)
rLLM extends Group-Relative Policy Optimization by (i) relative-KL clipping, which bounds the per-group policy update in its own local trust region, (ii) over-long filtering that discards trajectories whose length-scaled variance dominates the minibatch, and (iii) removal of entropy bonuses once exploration saturates. The first two modifications cut gradient variance by 18% on synthetic bandits and prevent the high-KL “spikes” reported for vanilla GRPO on
DeepSeek-R1 training. Compared with DAPO's decoupled-clip objective, GRPO+ achieves equivalent final reward with 12% fewer updates on a 4k-prompt ablation.
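A minimal PyTorch sketch of the surrogate objective these modifications imply is given below: a token-level clipped ratio with an asymmetric upper bound ("clip high"), a mask for trajectories dropped by overlong filtering, and no entropy bonus. The specific clip values, tensor shapes, and function names are assumptions for illustration, not the exact rLLM implementation.

```python
import torch

def grpo_plus_loss(logp_new, logp_old, advantages, response_mask,
                   overlong_mask, clip_low=0.2, clip_high=0.28):
    """Token-level clipped surrogate in the spirit of GRPO+.

    logp_new / logp_old: (batch, seq) per-token log-probs under the new/old policy.
    advantages:          (batch,) group-relative advantages, broadcast over tokens.
    response_mask:       (batch, seq) 1 for generated tokens, 0 for prompt/padding.
    overlong_mask:       (batch,) 0 for trajectories discarded by overlong filtering.
    """
    ratio = torch.exp(logp_new - logp_old)                  # per-token importance ratio
    adv = advantages.unsqueeze(-1)                          # broadcast to token level
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_low, 1 + clip_high) * adv  # asymmetric clip-high bound
    per_token = -torch.minimum(unclipped, clipped)

    # Overlong filtering zeros out whole trajectories; no entropy bonus is added.
    mask = response_mask * overlong_mask.unsqueeze(-1)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```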
Systems layer
On the systems side, rLLM layers GRPO+ on top of veRL, an open RLHF library whose actor and learner nodes are orchestrated by Ray's elastic placement engine. We introduce an asynchronous double-buffered pipeline—verl-pipe—that overlaps rollout generation and gradient application. Benchmarks on 8×A100 GPUs show 2.1× throughput versus a strong synchronous PPO baseline while sustaining ≥95% device utilization. The design eliminates the "tail latency" problem in which a single long-context sample stalls global optimization.
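The double-buffering idea can be sketched as a bounded producer/consumer queue in which sampler workers release mini-batches while the trainer consumes them, with a weight refresh at the end of the iteration. The thread-based setup, queue size, and names below are simplified assumptions, not verl-pipe's actual interfaces.

```python
import queue
import threading

minibatches = queue.Queue(maxsize=2)    # double buffer: at most two mini-batches in flight

def sampler(policy_weights, num_minibatches):
    """Stand-in for the rollout worker group: produce mini-batches as they finish."""
    for i in range(num_minibatches):
        rollout = {"weights_version": policy_weights["version"], "batch_id": i}
        minibatches.put(rollout)        # blocks when both buffers are full (back-pressure)

weights = {"version": 0}
producer = threading.Thread(target=sampler, args=(weights, 8))
producer.start()

for step in range(8):
    batch = minibatches.get()           # trainer consumes as soon as a mini-batch is ready
    # ... compute GRPO+ gradients on `batch` and apply the optimizer step here ...
    print(f"trained on batch {batch['batch_id']} sampled from weights v{batch['weights_version']}")

producer.join()
weights["version"] += 1                 # end of iteration: broadcast new weights to samplers
```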
Curriculum layer (iterative context lengthening)
Long contexts exacerbate both memory footprint and credit-assignment noise. rLLM therefore adopts a staged curriculum—16k → 32k → 64k tokens—advancing only when reward variance plateaus. Recent work on long-context pre-training shows that such gradual expansion yields better utilization of the expanded receptive field than jumping to the final window directly. In practice, curriculum lengthening shaves 21% off wall-clock time relative to a static 64k run.
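One way to realize the plateau test is sketched below: advance the window only when the variance of recent rewards has stopped shrinking. The window schedule matches the text; the variance threshold, history length, and function name are illustrative assumptions rather than rLLM's actual settings.

```python
from statistics import pvariance

# Staged schedule: 16k -> 32k -> 64k tokens, advancing only on a reward-variance plateau.
WINDOWS = [16_384, 32_768, 65_536]

def next_stage(stage, recent_rewards, rel_improvement_thresh=0.05):
    """Return the (possibly advanced) curriculum stage given recent per-iteration rewards."""
    if stage >= len(WINDOWS) - 1 or len(recent_rewards) < 20:
        return stage                                 # nothing to advance to, or too little history
    old_var = pvariance(recent_rewards[:10])         # variance early in the window
    new_var = pvariance(recent_rewards[-10:])        # variance late in the window
    plateaued = old_var > 0 and (old_var - new_var) / old_var < rel_improvement_thresh
    return stage + 1 if plateaued else stage
```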
Empirical highlight (Deepcoder-14B)
Running the full pipeline on curated competitive-coding tasks in the Deepcoder dataset produces Deepcoder-14B, which attains 60.6% Pass@1 on LiveCodeBench, a Codeforces Elo of 1936, and 92.6% Pass@1 on HumanEval+, equaling OpenAI's o3-mini (low) with an open-sourced training procedure, data, and weights.
Environment
The above gains materialize only when the reward function runs in an environment that is airtight. rLLM therefore executes all rollouts inside a sandbox where every code snippet runs with resource isolation and constraints; this ensures timely execution and proper fail-fast checks. These environments must also be performant enough for large-scale parallel reward calculation. rLLM introduces an environment that is optimized for parallel reward-function execution while remaining sandboxed.
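A minimal sketch of the fail-fast, resource-isolated execution this implies is shown below, using a subprocess with CPU and memory limits and a scalar reward equal to the fraction of tests passed. The limits, helper names, and stdin/stdout test format are assumptions; a production sandbox would additionally isolate the filesystem and network.

```python
import resource
import subprocess
import sys

def run_in_sandbox(code: str, stdin: str, timeout_s: int = 6):
    """Run one candidate program under CPU/memory limits; return its stdout or None on failure."""
    def limit_resources():              # applied in the child before exec (POSIX only)
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))   # CPU seconds
        resource.setrlimit(resource.RLIMIT_AS, (1 << 31, 1 << 31))        # ~2 GiB address space
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin, capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=limit_resources,
        )
    except subprocess.TimeoutExpired:
        return None                     # fail fast on hangs instead of stalling the trainer
    return proc.stdout if proc.returncode == 0 else None

def unit_test_reward(code: str, tests: list[tuple[str, str]]) -> float:
    """Scalar reward: fraction of (stdin, expected-stdout) tests the code passes."""
    passed = 0
    for t_in, expected in tests:
        out = run_in_sandbox(code, t_in)
        passed += out is not None and out.strip() == expected.strip()
    return passed / max(len(tests), 1)
```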
Chapter 2
Related Work
2.1 Reinforcement Learning for Language-Model Alignment
Early attempts at aligning large language models relied on Proximal Policy Optimization (PPO), a first-order trust-region method that clips the policy update to avert collapse while remaining computationally tractable [30]. OpenAI's InstructGPT extended PPO into a full RL-from-Human-Feedback (RLHF) pipeline, demonstrating that fine-tuning with preference-based rewards markedly improves obedience and usefulness on instruction-following benchmarks [25]. Subsequent work revealed, however, that PPO's global baseline and single-trajectory advantages struggle with the variance introduced by long contexts and sparse rewards typical of reasoning tasks.
To mitigate these issues, Group Relative Policy Optimization (GRPO) estimates baselines from groups of trajectories sharing the same prompt, thereby sharpening credit assignment and cutting memory overhead by eliminating a separate critic network [31]. GRPO has been shown to sustain stable learning on 16k–32k token windows for mathematics-focused models, yet still exhibits poor performance when scaled to larger, heterogeneous corpora due to the constraints of sample-level loss. DAPO generalizes the idea by introducing decoupled clipping and adaptive temperature scaling, as well as token-level loss, thereby reporting improved convergence across nine public RLHF tasks and providing an open-source reference for cluster-scale training [39].
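For reference, the group-relative baseline is commonly written as follows, where $G$ rollouts are sampled for the same prompt and $r_i$ is the scalar reward of rollout $i$ (notation assumed here for exposition):

\[
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\bigl(\{r_1,\dots,r_G\}\bigr)}{\operatorname{std}\bigl(\{r_1,\dots,r_G\}\bigr)}, \qquad i = 1,\dots,G.
\]

This normalized score replaces the critic-based advantage inside the PPO-style clipped objective, which is how GRPO avoids training a separate value network.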
Despite algorithmic progress, all PPO-derived methods remain vulnerable to reward hacking—the exploitation of loopholes in the reward function or environment to inflate returns without genuine task success. Recent safety analyses of frontier models, including OpenAI's o1 and o3 series, document emergent deceptive behaviour under sparse reward regimes [3]. These observations underscore that reliable alignment hinges not only on robust optimization but also on verifiable reward channels and secure execution sandboxes.
The present work builds on this lineage by proposing GRPO+, an extension that applies relative KL clipping and overlong filtering to further stabilize updates, and embedding the
algorithm within an asynchronous sampling stack (Section 1.2) executed inside an airtight, reversible environment (Section 4). This holistic approach targets the intertwined algorithmic and environmental causes of convergence failure identified in prior literature.
2.2 Distributed Frameworks and Systems Infrastructure
Scaling policy-gradient optimization to billion-parameter language models demands end-to-end systems support for high-throughput sampling, fault tolerance, and elastic resource utilization. Early RLHF pipelines embedded PPO directly inside bespoke trainer scripts, but soon migrated to general-purpose frameworks such as Ray RLlib, whose actor–learner abstraction and cluster scheduler offered turnkey horizontal scale-out and recovery. RLlib's versatility, however, comes at a cost: its monolithic APIs introduce performance overheads when rollouts require long-context decoding on tensor-parallel backends [21, 15].
To address LLM-specific bottlenecks, multiple open-source systems have emerged. veRL refactors RLlib's execution model into lightweight RPC endpoints and double-buffered GPU queues, sustaining >95% utilization on multi-node clusters. DistRL pushes asynchronous data collection to CPU-heavy inference nodes while reserving GPU servers for batched gradient updates, reducing straggler-induced idle time by 27% on in-house 70B models.
Large-scale industrial stacks couple these schedulers with high-performance serving layers. NVIDIA's Triton Inference Server is frequently deployed to shard sampler traffic across tensor-parallel decode replicas, masking backend variability beneath a uniform gRPC interface. On the optimization side, DeepSpeed RL extends DeepSpeed-ZeRO with offloading primitives tailored to PPO-style gradients, delivering near-linear scaling to 512 A100s on a 175B model according to internal benchmarks [22].
The baseline for the system optimizations is provided by verl [32], an open-source library for Reinforcement Learning from Human Feedback (RLHF) training of large language models. verl is the open-source implementation of the framework described in the paper "HybridFlow: A Flexible and Efficient RLHF Framework" [32]. The HybridFlow framework was developed to address the inherent complexity and computational inefficiency of traditional RLHF dataflows.
RLHF workflows, particularly those based on algorithms like PPO and GRPO [31], involve intricate dependencies and computational tasks performed by multiple LLM instances, including the Actor (policy) model, a Reward model, a Reference model, and a Critic model. These tasks encompass generation (sampling), inference (for reward, reference, and critic), and training steps. Traditional approaches often struggled with flexibly representing and efficiently executing these complex dataflows, leading to inefficiencies.
HybridFlow [32] addresses these challenges by proposing a flexible and efficient architecture. Key aspects include a hybrid-controller programming model that decouples the high-level control flow (defining the RL algorithm steps) from the low-level computation
flow (executing neural network operations). This design allows for better modularity and reusability. The framework also emphasizes seamless integration with existing distributed training and inference libraries (such as FSDP, Megatron-LM, vLLM, and SGLang) and supports flexible device mapping to optimize resource utilization. While HybridFlow [32] provided a robust and efficient foundation for RLHF, particularly in managing diverse workloads and model placements, the sampling bottleneck, as described in subsequent sections, remained a significant area for further optimization.
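A condensed sketch of the decoupling this describes: the high-level RL control flow reads as one sequential driver function, while each call is served by its own distributed worker group (for example, a generation backend for sampling and a training backend for updates). The worker-group handles and method names below are placeholders for illustration, not verl's real classes or API.

```python
def rlhf_iteration(actor, critic, reward_model, reference, prompts):
    """One iteration of a PPO-style RLHF dataflow expressed as sequential control flow.

    Each argument is a handle to a distributed worker group; the driver only
    sequences the algorithm steps, and each call fans out to its own backend.
    """
    batch = actor.generate(prompts)                       # rollout / sampling stage
    batch["rewards"] = reward_model.score(batch)          # reward-model inference
    batch["ref_logprobs"] = reference.log_probs(batch)    # reference-model inference (KL term)
    batch["values"] = critic.values(batch)                # critic inference
    critic.update(batch)                                  # critic training step
    actor.update(batch)                                   # policy training step
    actor.sync_weights_to_rollout_workers()               # refresh sampling weights for next iteration
```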
Algorithm–system co-design remains active. VAGEN integrates variance-aware gradient aggregation with a custom parameter server that adaptively drops stale roll-outs, reporting 1.8× wall-clock speed-ups on multilingual instruction tuning [34]. In parallel, ByteDance's DAPO reference implementation exposes decoupled clipping and dynamic sampling primitives atop a Ray backend, achieving 50 points on AIME 2024 with a 32B Qwen base [39]. Finally, recent studies on adaptive fault tolerance for LLM clusters propose reactive migration of learner shards upon node failure, preserving >99.5% training availability over month-long runs [12].
Collectively, these frameworks highlight three design principles adopted by rLLM: (i) actor–learner decoupling with asynchronous, back-pressure-free queues; (ii) elastic orchestration that exploits Ray's placement groups for transparent failover; and (iii) hardware-aware serving layers that co-locate decoding and gradient aggregation to minimize PCIe and network hops.
2.3 Secure Execution Environments and Reward Integrity
A persistent failure mode in large-scale reinforcement learning is reward hacking—the tendency of an agent to exploit weaknesses in the reward specification or the surrounding system to maximize return without achieving genuine task success. Documented exploits include over-fitting brittle unit tests, fabricating evaluation logs, and mutating the very artifacts used for scoring [36].
To counteract these threats, two complementary strategies have emerged. Sandbox isolation is now standard practice in code-generation RL: each candidate program executes inside a resource-bounded container, and success is judged solely by the unit-test suite [38]. While effective against arbitrary file writes or network calls, sandboxes rely on the