版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
unTowardsRevealingtheMysterybehindChainofunGuhaoFeng*BohangZhang*YuntianGu*HaotianYe*DiHeLiweiWangfenguhao@zhangbohang@guyuntian@haotianye@dihe@wanglw@Abstract1Introductionl2r2PreliminaryX(l)=X(l−1)+Attn(l)(X(l−1))+FFN(l)(X(l−1)+Attn(l)(X(l−1))),l∈[L],(1)HAttn(l)(X)=工softmax(XWl,h)(XW))⊤+M)XWl,h)Wh=1FFN(l)(X)=σ(XWl))Wx3nearEquationsInput:3北+3y+12=6;2北+5y+14=7;2北+4y+15=6;⟹Output:北+y+4=2;3y+6=3;2y+7=2;⟹北+2=1;y+2=1;3=0;⟹y=1;=0;nsns3CoTistheKeytoSolvingMathematicalProblems3.1Problemformulationeconsistingofnumbers,addition(+),subtrac-ArithmeticExpressionInput(7+5)÷(6+4×3−2×7)=Output:12÷(6+4×3−2×7)=12÷(6+12−2×7)=12÷(18−2×7)=12÷(18−14)=12÷4=3at43.2Theoreticalresultsyllcnp5eso4CoTistheKeytoSolvingDecision-MakingProblems4.1DynamicProgrammingngsndp(i)=T(i,s,{(j,dp(j)):j≺i}),(4)ydp(i)=f(i,sg1(i),···,sgJ(i),dp(h1(i)),···,dp(hK(i))),efA({(i,dp(i)):i∈I},s)=u(□i∈Adp(i)),(6)6Twostringss(1),s(2)oflengthn1=|s(1)|{(j,k):j∈[n],k∈{0,···,j−1}}{0,···,n1}×{0,···,n2}dp(j,k)=〈(1ifk=0max(dp(j,k−1),dp(k,k−1)×ifk>0I[sj>sk]+1)dp(j,k)=〈(akbjifj=0ifk=0min(dp(j,k−1)+a,dp(j−1,k−1)otherwise+dp(j−1,k−1)otherwise+cI[ss])maxi∈[n]dp(i,i−1)dp(n1,n2)4.2TheoreticalresultsIdpiIfinalanswerfXinXoutwhereXin⊂RdinandXout⊂Rdout,wesayfcanbeapproximatedbyPθwithθsatisfyingthat(i)∥f(x)−Pθ(x+δ)∥∞<ϵ+λ∥δ∥∞forallx∈Xinandall7Nmes5Experiments5.1ExperimentalDesign8Accuracy(%)Accuracy(%)Accuracy(%)Accuracy(%)Accuracy(%)Direct(L=4)Direct(L=4)10090807060504030201009080706050403020ArithmeticExpressionDirect(L=3)Direct(L=5)CoT(L=3)45645NumberofOperatorsLongestIncreasingSubsequenceDirect(L=3)Direct(L=5)CoT(L=3)50805080InputSequenceLength10080604020010090807060504030LinearEquationDirect(L=3)Direct(L=4)Direct(L=5)CoT(L=3)45345NumberofVariablesEditDistanceDirect(L=3)Direct(L=4)Direct(L=5)CoT(L=3)1216201216LengthoftheFirstStringe5.2ExperimentalResultsnLengthextrapolation.WenextstudywhethertheAccuracy(%)100Accuracy(%)806040200ArithmeticExpressionExtrapolationCoTCoT(L=3)161718NumberofOperators96RelatedWorknals7LimitationsandFutureDirectionsReferencessrnnnddsamsnegowonlrainedppendixllAAdditionalBackgroundandNotationA.1FinitefieldoAssociativityforanya,b,c∈F,(a+b)+c=a+(b+c)and(a×b)×c=a×(b×c);alla∈F;a+(−a)=0;a+(−a)=0;fA.2CircuitcomplexityenfA.3Log-precisionshBFormalDefinitionsofCoTinSection3a•op∈{+,−}ands1∈{(,empty},s2{×,÷};•op∈{×,÷}ands1{×,÷}.Systemoflinearequations.Assumethatwehaveasystemofmlinearequationswithvariablesx1,x2,...,xm.Thei-thequationintheinputsequenceisgrammaticallywrittenasai1x1+ai2x2+ij···+aimxm=bi,whereaij∈{0,···,pijsaxi+axi+1+···+axm=bi,(7)−(a))−1a)andaddtheresultingequationtothej-thequation.Thiswilleliminatethetermxiinthej-thequation.Wefurthernormalizeequationisothatthecoefficientabecomes1.Dependingonwhetherj≤iorj>i,theresultingequationintheCoToutputwillhavethe•Ifj≤i,thej-thequationwillbewrittenasxj+j,i+1xi+1+···+jmxm=j;•Ifj>i,thej-thequationwillbewrittenasj,i+1xi+1+···+jmxm=j.CTechnicalLemmasC.1UsefullemmasforMLPorO(poly(M,1/ϵ))suchthat|f(a,b)−ab|≤ϵholdsforalla,b∈[−M,M].f(a,b)=(σ()+σ(−aλ−b)−σ(ab)−σ(−aλ+b)).ngλ>2M,wehave'σ()+σ(−aλ−b)−σ(ab)−σ(−aλ+b)−'≤'4(()2+(−aλ−b)2−(ab)2−(−aλ+b)2)−'+'x1]σ(3)(x)'3λ3x∈[−1,1]^2π23^2πλ3.=16M3max13λ3x∈[−1,1]^2π23^2πλ3.thatforallx∈Rd1,wehave∥f(x)−g(x)∥∞≤ϵ.Proof.Letg(x)=W2·ReLU(W1x).Weconstructf(x)=W2·GeLU(λW1x)whereλ>0inceinceW2(ReLU(z)−GeLU(λz))∞≤∥W2∥∞ReLU(z)−GeLU(λz))∞"1"≤MdReLU(z)−λGeLU(λz))∞,'GeLU(λy)−ReLU(y)'='ReLU(λy)−\−exp(−)dt'.'\−exp(−)dt−\−exp(−)dt'=\exp(−)dt.s'1''λGeLU(λy)−ReLU(y)'=\exp(−)dt\texp(−)dt=exp(−)1^2πλ.dimensionisd.LetW∈Rd2×d1beanymatrixanddenoteM=maxij|Wij|.Then,foranyoranyx∈Rd1,wehave∥f(x)−Wx∥∞≤ϵ.Wx=ReLU(Wx)+ReLU(−Wx)g(x,y,t)=(9)nddThenforanyandMthereexistMLPparameterswithavefxytgxytℓ∞avefxytgxyth(x,y,t)=(h1,h2,h3,h4):=(x+2Mt1d1,y−2Mt1d2,2Mt,−2Mt)∈Rd1+d2+2,f(x,y,t)=(ReLU(h1)−ReLU(h3)1d1,ReLU(h2)−ReLU(h4)1d2),[−M,M]d2,andt∈[−∞,−1/2]∪[1/2,+∞],wehavef(x,y,t)=g(x,y[−M,M]d2,andt∈[−∞,−1/2]∪[1/2,+∞],wehavef(x,y,t)=g(x,y,t).Moreover,allh(i1,···,ik)(x)=2(fj(x)=之ReLU(h(i1,···,ik)(x)).gj(ei1,···,eik)=1onhave''∥f(x+δ)−g(x)∥∞=∥f(x+δ)''='之(ReLU(h(i1,···,ik)(x+δ))−ReLU(h(i1,···,ik)(x))''g(e,···,e)=1'(i1,···,ik)∈[d]k≤max'h(i1,···,ik)(x+δ))−h(i1,···,ik)(i1,···,ik)∈[d]kTherefore,byusingLemmaC.2,wecanalsoimplementg(x)byatwo-layerMLPwithGeLUC.2Usefullemmasfortheattentionlayerlayer(withcausalmask).Below,letn∈Nbeanintegerandletx1,x2,···,xnbeasequenceofCOPYTheoutputisasequenceofvectorsuunwithui=vpos(i),wherepos(i)=argmaxj∈Sirj.•MEAN:Theoutputisasequenceofvectorsu1,···,unwithui=meanj∈Sivj.•Foranyi,j∈[n],either|qi·kj|≤ρorqi·kj≤−δ.•Foranyi,j∈[n],eitheri=jor|ri−rj|≥δ.Y•Query:(λqi,µ)∈Rd+1•Key:(ki,ri)∈Rd+1ai,j== exp(λ(qi·kj)+µrj)exp(λ(qiai,j==对j\exp(λ(qi·kj\)+µrj\)对j\exp(λ(qi·kj\)+µ(rj\−rj)).dedbyOpolyMlognlogwehaveSinceρ≤andM≥δ,wehaveδ−ρ≥δ.Bysettingλ=8MldedbyOpolyMlognlogwehaveai,pos(i)ai,pos(i)≥exp(−λρ)+(n−1)exp(max(−λδ+2Mµ,λρ−µδ))=1+(n−1)exp(max(−λ(δ−ρ)+2Mµ,2λρ−µδ))≥1−nexp(max(−λ(δ−ρ)+2Mµ,2λρ−µδ))≥1−nexp(max(−ln(),−ln()))≥1−nexp(−ln())=1=1−4M,itherqi·kj\≤−δor(qi·kj\≤ρandrj\−rj≤−δ);in(∥oi−ui∥∞=工aijv∥oi−ui∥∞=工aijvj−vpos(i)≤2M·\1−ai,pos(i)+工ai,j=2M(2−2ai,pos(iLemmaLemmaC.8.AssumeAssumptionC.6holdswithρ≤).Foranyϵ>0,thereexistsanattentionlayerwithembeddingsizeOdandonecausalattentionheadthatcanapproximatetheMEANoperationdefinedabove.Formally,foranyconsideredsequenceofvectorsx1,x2,...,xn,•Key:ki∈Rdexp(λqi·kj)对j\exp(λqi·kj\).Bysettingλn)(whichisboundedbyO(poly(M,1/δ,log(n),log(1/ϵ)))),wehave:工aij≤(n−|Si|)exp(−λδ)jSi11+ni|exp(−λ(ρ−δ))ϵ8M.图aij−|i|图≤max(|i|−|Si|exp(ρλ)i|)exp(−λδ),|Si)−|i|)≤|i|max(1−exp(2ρλ)+nxp(−λ(δ−ρ)),exp(2ρλ)−1) (exp(2ρλ)−exp(−2ρλ)1+nexp(λδ−ρλ))(exp(2ρλ)−exp(−2ρλ)(1−8))8Mvj∞≤2M·aij+图aij−|i|图≤ϵ,≤nexp(λ(ρ−δ))≤nexp(−λδ/2)≤(n−|Si|)exp(−λδ)+|Si|exp(−λρ)""∥oi−ui∥∞工1i1i≤aijvj−ai,j=|Si|"j"DArithmeticFormulaD.1ProofofTheorem3.3cnpp).ThiscanbeachievedbysettingSi={j≤i:sj=‘=’},rj=jand,vj=jinLemmaC.7.Usingtheresidualconnection,theoutputoftheattentionlayerhastheform(ei,i,1,n/i,p).WecanuseanMLPtomultiplyn/ianditoobtainnaccordingtoLemmaC.1andsimultaneousQx=[1,(n)2,−2n,1,(d)2,−2d]⊤.Kx=[(n)2−2n+1,1,Qx=[1,(n)2,−2n,1,(d)2,−2d]⊤.Kx·Qx=(n−n−1)2+(d−d+k)2.(ei,ej1,ej2,ej3,ej4,ej5,i,n,d,(n)2,(d)2,1),whethertheexpressioncanbecalculatedornotformsalook-uptable.Therefore,itcanbeimple-mentedbyatwo-layerMLPwithhiddendimensionO(poly(p))acorrdingtoLemmaC.5.Similarly,outputofthethirdlayerisrepresentedbyx=(eutcome,ej,tagi,d,n,numi).Here,tagiisaBooleanvaluerecordingwhetherthecopiedexpressioncanbecalculated,eutcomeistheone-hotembeddingoftheoutcomewhentheexpressioncanbecalculated,andnumirecordsthelengthofthemggieMEANoperationandtheCOPYoperationsuchthatKx=[(n)2,1,n]⊤andQx=[1,(n)2,−2n]⊤.WehaveKx·Qx=(n−n)2.Therefore,Kx·Qx=0onlyskcansocopythecorrespondingtokenforgeneratingtheoutputwhentag=1.Similartopreviouslayers,wecancopytheembeddingejlocatedatpositionjsuchthatn=n−1andd=dThen,wepasstheembeddingsthroughtheMLPtofilterandobtaintheoutput.Iftag=1,thenmaintainejandsetj=0,utcome=0,andiftagi=0thenmaintainjandutcomeandsetyO(poly(n)).D.2ProofofTheorem3.1Cy•f(0)=0andf(1)=1;•ForanyBooleanformulaφ,f(¬φ)=(1−φ);•ForanyBooleanformulaeφ1,φ2,f((φ1∧φ2))=f(φ1)×f(φ2);•ForanyBooleanformulaeφ1,φ2,f((φ1∨φ2))=(1−(1−f(φ1))×(1−f(φ2))).nESystemofLinearEquationsiiE.1ProofofTheorem3.4egandki=0.Ifitisthevariablexki(ki∈[m]),thenli=(ki,m2sin(),m2cos()).3.Determinethenumberofvariablesbycopyingkjsuchthatrj=5kjm3+jisthelargest...t(ei,li,i,i2,1,nq,not,nar,pei−1,li−1).thecurrentpositionweareeliminatingvariablexn.Bytheuniquenessofthesolution,theremustexistanequationwithaokentputoftheMLPis(ei,li,i,i2,1,d(dq)2,not,(not)2,nar,nyardflagieextkeiions1.not=n+12.not=kjork=kj3.nq−nar=norn=nq\rE.2ProofofTheorem3.2ft(x∗=对q∈Fxn,qx0,q0=1x0,q=0xi,q=对6(r,ωi)=qxi−1,rfor0<i≤n,q∈QonisfyingxFDynamicProgrammingF.1Examplesdp(i)=1+maxdp(j).j<i,sj<siddp(j,k)={ax(dp(j,k−1),dp(k,k−1)·I[sj>sk]+1)Thefinalanswerwillbemaxi∈[n]dp(i,i−1).ThisDPformulationfitsourframework(5).syasakdp(j,k)=〈bjifjakdp(j,k)=〈bjmin(dp(min(dp(j,k−1)+a,dp(j−1,k)+b,dp(j−1,k−1)+cI[ss])lyinChomskyNormalFormMoreoverfortheruleRis→s2,wedenoteleft(R[i])=s1,right(R[i])=s2.rithmispaceI={(i,j,k,l):1≤i≤k≤j≤n,l∈[|R|]}.dp(i,j,k,l)meanswhetherv[i:j]canbedp(i,j,k,l)={i,t[1l)·](,),dp(i,j,k−1,l),dp(i,j,j,l\)),otherwise,numbersuchthatright(R[l])[0]=left(R[l1]),right(R[l])[1]=left(R[l2]).Andtheaggregatione1.max(a,b),a,b∈R2.I[ab],a,b∈Z3.I[a<b],a,b∈ZF.2ProofofTheorem4.7tebeddingsoftheinput,states,valuesintheDParray,answers,andsomespecialsignsrespectively.leN=(pp1,pp2−pp1,···,ppN−pp(N-1)).ftartask,weuseoneattentionheadtosimplycopytheembeddingwithtanswer=1andthelargestorsmallestvalueinthedparray,whichcanimplementthemin,maxoperation.Forthe对operation,otherattentionheadcanbeusedtocalculatethefractionofelementsinthesequenceandthenobtainthenumber.Wedenotethemaximum,theminimum,orthesummationaseTherefore,theoutputofthislayerisesF.3ProofoftheTheorem4.8GExperimentalDetailsifsp÷or(o∈{+,−}ands[p−1]∈{−,×})or(o∈{+,−}andsp∈{×,÷})thenG.1Datasets1+5×(1−2)=71+5×(1−2)=1+5×10=1+6=7Input:VocabularyofnumbersV={0,1...10}//numberfieldp=111SamplethefirstnumbertuniformlyfromV;2s=[];3Appendttos;4fori←1tondo10insert[(],[t1],[o],[t2],[)]sequentiallyintos[p];11else12pops[p];13insert[t1],[o],[t2]sequentiallyintos[p];14end15end2x1+3x2+3x3=8,1x1+7x2+0x3=0,0x1+2x2+1
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 教育硕士学位论文
- 餐饮工作制度与规范
- 企业战略制度规范
- 健全规范管理制度
- 打印机管理制度规范
- 台球室防盗制度规范
- 规范车间成本管控制度
- 司法规范化三项制度
- 车站规范化管理制度
- 通知规范化管理制度
- 老年医院重点专科建设方案
- 2025年江苏省苏州市初二(上)英语期末模拟卷(二)含答案
- 规培中医病例讨论流程规范
- 银行解封协议书模板
- 小学生必读书试题及答案
- 超星尔雅学习通《学术规范与学术伦理(华东师范大学)》2025章节测试附答案
- (完整版)现用九年级化学电子版教材(下册)
- 卫生院、社区卫生服务中心《死亡医学证明书》领用、发放、管理制度
- 《金融科技概论》完整全套课件
- 市政道路工程危大工程安全管理措施
- 康复治疗技术历年真题单选题100道及答案
评论
0/150
提交评论