Thoughts and Insights on DeepSeek R1 - Xipeng Qiu


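Fudan University / Shanghai Innovation Institute
February 12, 2025

OpenAI o1 marks a breakthrough in the reasoning ability of large models
- o1 reached human-expert level on competition problems.
- o1 realized the second stage of AGI (Reasoner).
- The five stages: Chatbot (Conversational AI), Reasoner (Reasoning AI), Agent (Autonomous AI), Innovator (Innovating AI), Organizer (Organizational AI).
[Figure: three panels, Competition Math (AIME 2024), Competition Code (Codeforces), and PhD-Level Science Questions (GPQA Diamond); axes: accuracy, accuracy, percentile; series labels include preview, o1, and human.]

Pre-training as we know it will end
- Compute is growing: better hardware, better algorithms, larger clusters.
- Data is not growing: we have but one internet, the fossil fuel of AI.
The pre-training era may be coming to an end, but does the scaling law continue?

The new paradigm introduced by o1: scaling reinforcement learning and test-time compute
[Figure: two panels of o1 AIME accuracy (pass@1 accuracy), one against train-time compute (log scale) and one against test-time compute (log scale).]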

Reinforcement Learning

Reinforcement learning in the large-model setting
- Agent => LLM
- Action => next token / step / solution
- State => LLM inputs
- Policy: π(action | state)
[Figure: agent-environment loop with labels Agent, Policy π(a|s), Environment, State, Action, Reward; in the LLM version the state holds the Question and Step 1 … Step t, and the action is Step t+1.]
Zeng et al., Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, https://arxiv.org/abs/2412.14135

A reasoning model built around reinforcement learning
1. Policy Initialization: shape the reasoning behavior.
2. Reward Design: provide reward signals for RL.
3. Search: find the best solution to a problem.
4. Learning: optimize the model parameters.
[Figure labels: Question, LLM, Policy Initialization, Reward Design, Search, Learning, Data, Reinforcement Learning, o1 Policy.]
Zeng et al., Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, https://arxiv.org/abs/2412.14135
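
The mapping above (agent = LLM, state = LLM inputs, action = next step, policy = π(action | state)) can be sketched as a small rollout loop. This is a toy illustration only: `toy_policy`, `toy_reward`, and the candidate steps are hypothetical stand-ins, and a real system would sample steps from an LLM and score them with a proper reward.

```python
import random
from dataclasses import dataclass, field


@dataclass
class State:
    """State = everything the LLM sees: the question plus the steps so far."""
    question: str
    steps: list = field(default_factory=list)


def toy_policy(state, rng):
    """Hypothetical stand-in for pi(action | state); a real policy is an LLM."""
    candidates = ["add the numbers", "check the result", "ANSWER: 4"]
    return rng.choice(candidates)


def toy_reward(state):
    """Hypothetical outcome reward: 1 if the final step is the right answer."""
    return 1.0 if state.steps and state.steps[-1] == "ANSWER: 4" else 0.0


def rollout(question, max_steps=5, seed=0):
    """Run one episode: sample steps until an answer or the step budget."""
    rng = random.Random(seed)
    state = State(question)
    for _ in range(max_steps):
        action = toy_policy(state, rng)   # action = next reasoning step
        state.steps.append(action)        # transition: the step joins the LLM input
        if action.startswith("ANSWER"):
            break
    return state, toy_reward(state)


state, reward = rollout("What is 2 + 2?")
print(state.steps, reward)
```

The transition is deterministic here (the chosen step is appended to the context), which is the usual way to view text generation as a decision process.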

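Conjecture on o1's policy initialization
- Pre-training: acquire basic logical reasoning and analytical ability from large amounts of logic-rich math and code text.
- Prompt engineering and supervised fine-tuning: further inject human-like reasoning behaviors, giving the model the ability to explore complex solution spaces.
Zeng et al., Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, https://arxiv.org/abs/2412.14135

Policy Initialization (Policy Init)
Pre-training, prompt engineering, and supervised fine-tuning teach the large model to imitate human reasoning behaviors.

Human-like reasoning behaviors, with the examples shown on the slide:
- Problem understanding (Goal Clarification): "Our task is: Use the example above to decode: …"
- Task decomposition (Task Decomposition): "Implementation steps: 1. Capture input … 2. Remove …"
- Attempting a solution (Solution Generation): "Let me try coding the bash script step …"
- Switching approach (Alternative Proposal): "Alternatively, perhaps the cipher is complex."
- Verification and analysis (Self-Evaluation): "Is macrodontia associated with KBG syndrome? … Yes …"
- Error correction (Self-Correction): "pH = 7 + 0.5 \times … Wait, the correct formula is \[ pH = 7 + …"

[Figure labels: Pre-Training, Instruction Fine-Tuning, Prompt, SFT, Expert Trajectory, Human-like Reasoning Behaviors.]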

Reward Design
Learn reward signals from the environment or from data.
- Obtain reward signals directly from the environment.
- Learn rewards from expert data or preference data (Reward Model).
- Reward shaping: move from outcome supervision to process supervision.

Conjecture on o1's reward design
- A mix of several reward design methods.
- In environments with ground truth: convert the ORM into a PRM.
- In environments without ground truth: learn rewards from expert or preference data.
- Train the reward model on a large number of domains to improve generalization.

ORM and PRM example (the slide shows the same question and answer for both):
Question: Three pencils and a jumbo eraser cost $1.24. Five pencils and a jumbo eraser cost $1.82. No prices include tax. In cents, what is the cost of a pencil?
Answer: Let the cost of a pencil be p cents, and the cost of a jumbo eraser be e cents. From the problem, we are given two equations based on the cost: 3p + e = 124, 5p + e = 182. To solve this system of equations, we subtract the first equation from the second to eliminate e: (5p + e) - (3p + e) = 182 - 124. Solving for p = 58/2 = 27. Thus, the cost of a pencil is 29 cents.

Zeng et al., Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, https://arxiv.org/abs/2412.14135
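
The difference between outcome supervision (ORM) and process supervision (PRM) can be illustrated on the pencil example. The sketch below is a hand-written checker, not a learned reward model: the step list is transcribed from the example answer as it appears on the slide (where the intermediate step reads 58/2 = 27 while the final answer is 29), and the function names are hypothetical.

```python
def outcome_reward(final_answer, ground_truth):
    """ORM-style signal: a single reward for the final answer only."""
    return 1.0 if final_answer == ground_truth else 0.0


def process_rewards(steps):
    """PRM-style signal: one reward per step.

    Each step is (description, claimed_value, recomputed_value).
    """
    return [1.0 if claimed == actual else 0.0 for _, claimed, actual in steps]


# Steps transcribed from the slide's answer; the third field recomputes each value.
steps = [
    ("subtract equations: 182 - 124", 58, 182 - 124),
    ("solve for p: 58 / 2", 27, 58 // 2),
    ("final answer in cents", 29, (182 - 124) // 2),
]

print(outcome_reward(steps[-1][1], 29))  # 1.0
print(process_rewards(steps))            # [1.0, 0.0, 1.0]
```

The outcome reward sees only a correct final answer, while the per-step rewards flag the inconsistent intermediate step, which is the extra information process supervision provides.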

Search
Find the best solution for a given problem.

Conjecture on o1's search
- During training, o1 uses tree search with external guidance.
- During inference, o1 uses sequential revisions with internal guidance.
Zeng et al., Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, https://arxiv.org/abs/2412.14135

Example question (used for both Tree Search and Sequential Revisions): Let $\omega \neq 1$ be a 13th root of unity. Find the remainder when $\prod_{k=0}^{12} (2 - 2\omega^k + \omega^{2k})$ is divided by 1000.

[Figure, Search Strategies: Tree Search expands candidate solutions S1, S2, S3 and returns the solution with the highest rewards; Sequential Revisions refines one answer over several attempts ("I made a mistake. The answer is …", "Sorry, I think the answer is …", "I see. The final answer is 321"). Intermediate answers shown: 561, 257, 197.]
[Figure, Guiding Signals: Internal Guidance, External Guidance, Internal + External Guidance; labels include Model Uncertainty, Self-evaluation, Environmental Feedback, Heuristic Rules, Uncertainty + Verifier, Value Function.]
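
The two families of search strategy can be sketched on a toy problem. Note the substitutions: best-of-N sampling stands in for tree search as the simplest parallel strategy, and a hill-climbing `revise` function stands in for a model revising its own answer. `TARGET`, `score`, `propose`, and `revise` are all hypothetical; in practice the score would come from a verifier, a value function, or the model's own self-evaluation.

```python
import random

TARGET = 321  # hypothetical ground truth, visible only to the toy scorer


def score(x):
    """Toy guiding signal: higher is better."""
    return -abs(x - TARGET)


def propose(rng):
    """Toy proposal: sample a candidate answer at random."""
    return rng.randint(0, 999)


def revise(x):
    """Toy revision: move one unit in the direction the guiding signal prefers."""
    return x + 1 if score(x + 1) > score(x) else x - 1


def best_of_n(propose, score, n, rng):
    """Parallel search: sample n candidates and keep the highest-scoring one."""
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)


def sequential_revision(initial, revise, score, max_rounds):
    """Sequential search: revise the current answer, keep only improvements."""
    current = initial
    for _ in range(max_rounds):
        candidate = revise(current)
        if score(candidate) > score(current):
            current = candidate
    return current


print(best_of_n(propose, score, n=16, rng=random.Random(0)))
print(sequential_revision(300, revise, score, max_rounds=50))  # 321
```

Both strategies spend more inference-time compute to get a better answer; they differ in whether that compute is spent in parallel or in a chain.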

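Learning
Learn from the data generated by search and iteratively improve the policy.

Policy Gradient vs. Behavior Cloning
- Policy Gradient: high data utilization, but high GPU memory consumption.
- Behavior Cloning: efficient and stable, but cannot use negative samples.

Conjecture on o1's learning: two-stage training
- Warm-up stage: behavior cloning, for fast convergence.
- Second stage: policy gradient, to raise the performance ceiling.
Zeng et al., Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, https://arxiv.org/abs/2412.14135

[Figure labels: Search, D_search, D_expert, All solutions, Solutions, Solutions with Rewards, Behavior Cloning, Policy Gradient, REINFORCE, PPO, DPO.]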
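
The contrast between behavior cloning and policy gradient can be made concrete on a toy softmax policy over three actions. This is a minimal sketch under simplifying assumptions (a single state, hand-picked learning rate, plain REINFORCE without PPO-style clipping); it only shows why policy gradient can learn from negative samples while behavior cloning cannot.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_grad(logits, action):
    """Gradient of log pi(action) with respect to the logits: one_hot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def behavior_cloning_grad(logits, expert_action):
    """Behavior cloning: maximize the log-likelihood of the expert's action."""
    return log_prob_grad(logits, expert_action)


def policy_gradient_grad(logits, action, reward, baseline=0.0):
    """REINFORCE: scale the log-likelihood gradient by (reward - baseline)."""
    advantage = reward - baseline
    return [advantage * g for g in log_prob_grad(logits, action)]


def ascend(logits, grad, lr=0.5):
    return [l + lr * g for l, g in zip(logits, grad)]


logits = [0.0, 0.0, 0.0]

# Stage 1 (warm-up): imitate an expert that always picks action 2.
for _ in range(20):
    logits = ascend(logits, behavior_cloning_grad(logits, 2))

# Stage 2: a negative sample (action 0, reward -1) lowers the probability of action 0.
before = softmax(logits)[0]
logits = ascend(logits, policy_gradient_grad(logits, 0, reward=-1.0))
after = softmax(logits)[0]
print(after < before)  # True
```

Behavior cloning has no way to express "avoid this action"; the signed advantage in the policy-gradient update is what lets a bad sample push probability mass away.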

R1-Zero: a reasoning model driven purely by RL
- R1-Zero rewards: accuracy reward + format reward.
- Long-CoT ability emerges naturally. [Figure: average length per response versus training steps.]
- R1-Zero drawbacks: poor readability, language mixing.

Training template:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: prompt. Assistant:
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training.

"Aha moment" example:
Question: If a > 1, then the sum of the real solutions of $\sqrt{a - \sqrt{a + x}} = x$ is equal to
Response: <think>
To solve the equation $\sqrt{a - \sqrt{a + x}} = x$, let's start by squaring both …
$\left(\sqrt{a - \sqrt{a + x}}\right)^2 = x^2 \implies a - \sqrt{a + x} = x^2.$
Rearrange to isolate the inner square root term:
$(a - x^2)^2 = a + x \implies a^2 - 2ax^2 + (x^2)^2 = a + x \implies x^4 - 2ax^2 - x + (a^2 - a) = 0$
…
Wait, wait. Wait. That's an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be …
We started with the equation: $\sqrt{a - \sqrt{a + x}} = x$
First, let's square both sides: $a - \sqrt{a + x} = x^2 \implies \sqrt{a + x} = a - x^2$
Next, I could square both sides again, treating the equation: …
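
The two rule-based rewards named for R1-Zero (accuracy and format) can be sketched with a regular expression over the template's tags. This is an assumption-laden illustration: the exact matching rules, the string comparison for correctness, and the equal weighting in `total_reward` are all hypothetical choices, not the published implementation.

```python
import re

# Matches a response of the form <think>...</think><answer>...</answer>.
TEMPLATE = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def format_reward(response):
    """1.0 if the response follows the think/answer template, else 0.0."""
    return 1.0 if TEMPLATE.match(response.strip()) else 0.0


def accuracy_reward(response, ground_truth):
    """1.0 if the extracted answer equals the ground truth (exact string match)."""
    m = TEMPLATE.match(response.strip())
    if not m:
        return 0.0
    return 1.0 if m.group(2).strip() == ground_truth.strip() else 0.0


def total_reward(response, ground_truth):
    """Sum of the two rewards; equal weighting is an assumption."""
    return accuracy_reward(response, ground_truth) + format_reward(response)


good = "<think>2 + 2 = 4</think> <answer>4</answer>"
wrong = "<think>2 + 2 = 5</think> <answer>5</answer>"
untagged = "The answer is 4"

print(total_reward(good, "4"))      # 2.0
print(total_reward(wrong, "4"))     # 1.0
print(total_reward(untagged, "4"))  # 0.0
```

Because both checks are plain rules, no learned reward model is involved at this stage.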

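R1 training consists of four stages:
1. Cold start
   - Starting from DeepSeek-V3, construct and collect a small amount of Long-CoT data to fine-tune the model, preventing the instability and poor readability seen early in RL training.
2. Reasoning-oriented reinforcement learning
   - Starting from the stage 1 model, train on reasoning-intensive tasks such as code, math, and logical reasoning, using the same large-scale RL as R1-Zero.
   - Introduce a language consistency reward (the proportion of target-language words in the CoT) to mitigate language mixing.
3. Rejection sampling and supervised fine-tuning
   - Use rejection sampling with the stage 2 model to synthesize high-quality reasoning data.
   - Add general-domain SFT data (V3 SFT data + V3 CoT synthetic data).
   - Starting from DeepSeek-V3, fine-tune the model to strengthen its ability on general tasks such as writing and role-playing.
4. Reinforcement learning for general tasks
   - Starting from the stage 3 model, use RL to improve the model's helpfulness and harmlessness while refining its reasoning ability.
   - For reasoning tasks, rule-based rewards guide training; for other tasks, a reward model aligns the model with human preferences.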
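
The language consistency reward in stage 2 is described as the proportion of target-language words in the CoT. A minimal sketch of that ratio follows; the tokenization (`\w+`) and the `is_english_word` heuristic are assumptions made for illustration, and a real system would need a proper language identifier.

```python
import re


def is_english_word(word):
    """Crude heuristic (an assumption): ASCII letters only."""
    return word.isascii() and word.isalpha()


def language_consistency_reward(cot, is_target_word=is_english_word):
    """Fraction of word tokens in the chain of thought that are in the target language."""
    words = re.findall(r"\w+", cot)
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_word(w)) / len(words)


print(language_consistency_reward("First square both sides"))       # 1.0
print(language_consistency_reward("First 我们 square both sides"))   # 0.8
```

With this heuristic, numbers count against the ratio as well, which is one reason the word classifier is the part most in need of replacement in practice.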

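R1 technical roadmap
[Figure: R1 training pipeline. Image source: https://www.zhihu.com/question/10175007563/answer/87819242331]
Labels recoverable from the figure:
- Base model: DeepSeek-V3 Base (671B / 37B activated).
- DeepSeek-R1-Zero: Reasoning-Oriented RL (GRPO) with rule-based reward (accuracy, formatting).
- Cold start: Cold Start Long CoT Data (~k samples), SFT.
- Reasoning-oriented reinforcement learning: GRPO with rule-based reward and a CoT Language Consistency Reward.
- Rejection sampling: Reasoning Prompts + Rejection Sampling (rule-based, DeepSeek-V3 as judge), Reasoning Data (600k samples), Non-Reasoning Data (200k samples), Combined SFT Data (800k samples); also DeepSeek-V3 SFT Data, CoT Prompting.
- Supervised fine-tuning: SFT, 2 epochs, 800k samples.
- Reinforcement learning for general scenarios: Reasoning + Preference Reward, Diverse Training Prompts.
- Distillation: DeepSeek-R1-Distill-(Qwen/Llama)-(*B), from Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct (SFT, 2 epochs, 800k samples).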
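
The roadmap names GRPO as the RL algorithm. Its central idea is to replace a learned value baseline with statistics of a group of sampled responses to the same prompt: each response's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below shows only that normalization step; the use of the population standard deviation and the `eps` term are implementation assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a rule-based reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient.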

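Thoughts and insights on DeepSeek R1

Differences between the R1/R1-Zero technical route and the community's o1 reproductions
- Earlier community reproductions of o1 almost always involved distillation and search.
- R1-Zero uses no SFT, no process supervision, and no search, yet it still trains to an o1-like result. Academia had run many similar experiments before, but none succeeded on smaller models. This suggests that scaling RL works well only when the base model is strong enough.
- Although the R1 report stresses that MCTS was not effective, a simple majority vote substantially improves R1's results, which suggests that search is still an important scaling paradigm.
- R1's success also depends on DeepSeek's strong system efficiency and RL tuning skill.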
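
Majority voting, mentioned above as a simple form of search, can be sketched in a few lines: sample several answers to the same question and return the most frequent one. The sampled answers below are made up for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers from five independent samples.
samples = ["321", "561", "321", "197", "321"]
print(majority_vote(samples))  # 321
```

No reward model or verifier is needed, which is why this is often the first test-time scaling method to try.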

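Policy initialization
- R1-Zero is a good attempt, but R1 itself still went through SFT first (roughly a few thousand examples) before RL.
- The focus of post-training will gradually shift toward RL, but a small amount of training spent on SFT may still be necessary.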

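Reward model
- R1's reward design is not very different from ordinary post-training (Qwen2, Tulu 3): where ground truth exists, use exact match (EM) against it; otherwise use an RM.
- RM-related issues (training data volume, model size, OOD problems, iteration cycle) remain critical throughout the training pipeline. Using the strong open-source RMs currently available may achieve fairly good …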
