DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-AI

Abstract

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v

Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts (Claude-Opus-4.6-Max, GPT-5.4-xHigh, Gemini-3.1-Pro-High) on knowledge & reasoning and agentic benchmarks (SimpleQA, HLE, TerminalBench 2.0, ApexShortlist, SWE Verified, Codeforces, Toolathlon). Right: single-token inference FLOPs and accumulated KV cache size of the DeepSeek-V4 series and DeepSeek-V3.2 up to a 1024K context, where DeepSeek-V4-Pro and DeepSeek-V4-Flash reach 3.7× and 9.8× lower FLOPs and 9.5× and 13.7× smaller KV cache, respectively.

Contents

1 Introduction
2 Architecture
  2.1 Designs Inherited from DeepSeek-V3
  2.2 Manifold-Constrained Hyper-Connections
  2.3 Hybrid Attention with CSA and HCA
    2.3.1 Compressed Sparse Attention
    2.3.2 Heavily Compressed Attention
    2.3.3 Other Details
    2.3.4 Efficiency Discussion
  2.4 Muon Optimizer
3 Infrastructures
  3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
  3.2 Flexible and Efficient Kernel Development with TileLang
  3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
  3.4 FP4 Quantization-Aware Training
  3.5 Training Framework
    3.5.1 Efficient Implementation of Muon
    3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
    3.5.3 Contextual Parallelism for Long-Context Attention
    3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
  3.6 Inference Framework
    3.6.1 KV Cache Structure and Management
    3.6.2 On-Disk KV Cache Storage
4 Pre-Training
  4.1 Data Construction
  4.2 Pre-Training Setups
    4.2.1 Model Setups
    4.2.2 Training Setups
    4.2.3 Mitigating Training Instability
  4.3 Evaluations
    4.3.1 Evaluation Benchmarks
    4.3.2 Evaluation Results
5 Post-Training
  5.1 Post-Training Pipeline
    5.1.1 Specialist Training
    5.1.2 On-Policy Distillation
  5.2 RL and OPD Infrastructures
    5.2.1 FP4 Quantization Integration
    5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
    5.2.3 Preemptible and Fault-Tolerant Rollout Service
    5.2.4 Scaling RL Framework for Million-Token Context
    5.2.5 Sandbox Infrastructure for Agentic AI
  5.3 Standard Benchmark Evaluation
    5.3.1 Evaluation Setup
    5.3.2 Evaluation Results
  5.4 Performance on Real-World Tasks
    5.4.1 Chinese Writing
    5.4.3 White-Collar Task
    5.4.4 Code Agent
6 Conclusion, Limitations, and Future Directions
A Author List and Acknowledgment
  A.1 Author List
  A.2 Acknowledgment
B Evaluation Details

1. Introduction

The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a new paradigm of test-time scaling, driving substantial performance gains for Large Language Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently, the emergence of long-horizon scenarios and tasks, from complex agentic workflows to massive cross-document analysis, has also made efficient support for ultra-long contexts critical for future progress. While recent open-source efforts (Bai et al., 2025a; DeepSeek-AI, 2024; MiniMax, 2025; Qwen, 2025) have advanced general capabilities, this core architectural inefficiency in handling ultra-long sequences remains a key impediment, limiting further gains from test-time scaling and hindering further exploration into long-horizon scenarios and tasks.

To break the efficiency barrier in ultra-long contexts, we develop the DeepSeek-V4 series, including the preview versions of DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Through architectural innovations, the DeepSeek-V4 series achieves a dramatic leap in computational efficiency for processing ultra-long sequences. This breakthrough enables efficient support for a context length of one million tokens, ushering in a new era of million-length contexts for next-generation LLMs. We believe our capability to efficiently handle ultra-long sequences unlocks the next frontier of test-time scaling, paves the way for deeper research into long-horizon tasks, and establishes a necessary foundation for exploring future paradigms like online learning.

Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), the DeepSeek-V4 series retains the DeepSeekMoE framework (Dai et al., 2024) and the Multi-Token Prediction (MTP) strategy, while introducing several key innovations in architecture and optimization. To enhance long-context efficiency, we design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses
the KV caches along the sequence dimension and then performs DeepSeek Sparse Attention (DSA) (DeepSeek-AI, 2025), whereas HCA applies more aggressive compression to the KV caches but keeps dense attention. To strengthen modeling capability, we incorporate Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) that upgrade conventional residual connections. Additionally, we introduce the Muon optimizer (Jordan et al., 2024; Liu et al., 2025) to the training of the DeepSeek-V4 series, leading to faster convergence and improved training stability.

To enable efficient training and inference for the DeepSeek-V4 series as well as productive development, we introduce several infrastructure optimizations. First, we design and implement a single fused kernel for MoE modules that fully overlaps computation, communication, and memory access. Second, we employ TileLang (Wang et al., 2026), a Domain-Specific Language (DSL), to balance development productivity and runtime efficiency. Third, we provide efficient batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference. Fourth, we incorporate FP4 quantization-aware training for MoE expert weights and the indexer QK path to reduce memory and computation. Fifth, for the training framework, we extend the autograd framework with tensor-level checkpointing for fine-grained recomputation control, and we enhance training efficiency with a hybrid ZeRO strategy for the Muon optimizer, cost-effective mHC implementations via recomputation and fused kernels, and two-stage contextual parallelism to manage compressed attention. Finally, for the inference framework, we design a heterogeneous KV cache structure with on-disk storage strategies to enable efficient shared-prefix reuse.

By employing hybrid CSA and HCA, along with precision optimizations on computation and storage, the DeepSeek-V4 series achieves significantly lower inference FLOPs and a substantially reduced KV cache size compared with DeepSeek-V3.2, especially in long-context settings. The right part of Figure 1 demonstrates the estimated single-token inference FLOPs and accumulated KV cache size of DeepSeek-V3.2 and the DeepSeek-V4 series. In the scenario of a 1M-token context, even DeepSeek-V4-Pro, which has a larger number of activated
parameters, attains only 27% of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache size relative to DeepSeek-V3.2. Furthermore, DeepSeek-V4-Flash, with its smaller number of activated parameters, pushes efficiency even further: in the 1M-token context setting, it achieves only 10% of the single-token FLOPs and 7% of the KV cache size compared with DeepSeek-V3.2. Additionally, for the DeepSeek-V4 series, the routed expert parameters utilize FP4 precision. While the peak FLOPs for FP4×FP8 operations are currently the same as FP8×FP8 on existing hardware, they can theoretically be implemented to be 1/3 more efficient on future hardware, which will further enhance the efficiency of the DeepSeek-V4 series.

During pre-training, we train DeepSeek-V4-Flash on 32T tokens and DeepSeek-V4-Pro on 33T tokens, respectively. After pre-training, these two models can natively and efficiently support 1M-length contexts. In our internal evaluations, DeepSeek-V4-Flash-Base already surpasses DeepSeek-V3.2-Base across a majority of benchmarks with its more parameter-efficient design. DeepSeek-V4-Pro-Base further extends DeepSeek foundation models, achieving comprehensive superiority across reasoning, coding, long-context, and world knowledge tasks.

The post-training pipeline of the DeepSeek-V4 series features a two-stage paradigm: the independent cultivation of domain-specific experts, followed by unified model consolidation via on-policy distillation (Lu and Lab, 2025). Initially, for each target domain (such as mathematics, coding, agent, and instruction following), a separate expert model is trained independently. The base model first undergoes Supervised Fine-Tuning (SFT) on high-quality, domain-specific data to establish foundational capabilities. Subsequently, Reinforcement Learning (RL) is applied using Group Relative Policy Optimization (GRPO) (DeepSeek-AI, 2025), which further optimizes the model for domain-aligned behaviors guided by reward models tailored to specific success criteria. This phase yields a diverse set of specialized experts, each excelling in its respective field. Finally, to integrate these distinct proficiencies, a single unified model is trained through on-policy distillation, wherein the unified model acts as the student learning to optimize the reverse KL loss with teacher models.

Summary of Core Evaluations:

• Knowledge: In assessments of broad world knowledge, DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly outperforms leading open-source models on the SimpleQA (OpenAI, 2024d) and Chinese-SimpleQA (He et al., 2024) benchmarks. Regarding educational knowledge, evaluated via MMLU-Pro (Wang et al., 2024b), HLE (Phan et al., 2025), and GPQA (Rein et al., 2023), DeepSeek-V4-Pro-Max shows a marginal lead over its open-source counterparts. DeepSeek-V4-Pro-Max has significantly closed the gap with the leading proprietary model, Gemini-3.1-Pro, despite still trailing it in these knowledge-based evaluations.

• Reasoning: Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months. Furthermore, DeepSeek-V4-Flash-Max achieves comparable performance to GPT-5.2 and Gemini-3.0-Pro, establishing itself as a highly cost-effective architecture for complex reasoning tasks.

• Agent: On public benchmarks, DeepSeek-V4-Pro-Max is on par with leading open-source models, such as Kimi-K2.6 and GLM-5.1, but slightly worse than frontier closed models. In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches the level of Opus 4.5.

• Long-Context: DeepSeek-V4-Pro-Max delivers strong results on synthetic and real use cases with a 1-million-token context window, surpassing even Gemini-3.1-Pro on academic benchmarks.

• DeepSeek-V4-Pro vs. DeepSeek-V4-Flash: DeepSeek-V4-Flash-Max exhibits lower performance in knowledge evaluations due to its smaller parameter scale. However, it achieves comparable results on reasoning tasks when allocated a larger thinking budget. In agent evaluations, while DeepSeek-V4-Flash-Max matches the performance of DeepSeek-V4-Pro-Max on several benchmarks, it still trails its larger counterpart on more complex, high-difficulty tasks.

2. Architecture

Figure 2 | Overall architecture of the DeepSeek-V4 series. We use hybrid CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) for attention layers, DeepSeekMoE for feed-forward layers, and strengthen conventional residual connections with mHC. (Figure omitted; it depicts the embedding, L stacked Transformer blocks with pre-block mixing, CSA/HCA, DeepSeekMoE, post-block mixing, and residual mixing, plus MTP modules with the prediction head and MTP loss.)

Overall, the DeepSeek-V4 series retains the Transformer (Vaswani et al., 2017) architecture and Multi-Token Prediction (MTP) modules (DeepSeek-AI, 2024; Gloeckle et al., 2024), while introducing several key upgrades over DeepSeek-V3: (1) first, we introduce Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) to strengthen conventional residual connections; (2) second, we design a hybrid attention architecture, which greatly improves long-context efficiency through Compressed Sparse Attention and Heavily Compressed Attention; (3) third, we employ Muon (Jordan et al., 2024; Liu et al., 2025) as the optimizer. For the Mixture-of-Experts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with only minor adjustments from DeepSeek-V3. The Multi-Token Prediction (MTP) (DeepSeek-AI, 2024; Gloeckle et al., 2024; Li et al., 2024; Qi et al., 2020) configuration remains identical to that of DeepSeek-V3. All other unspecified details follow the settings established in DeepSeek-V3 (DeepSeek-AI, 2024). Figure 2 illustrates the overall architecture of DeepSeek-V4, and the details are described below.

2.1. Designs Inherited from DeepSeek-V3

Mixture-of-Experts. As with previous DeepSeek-series models (DeepSeek-AI, 2024), the DeepSeek-V4 series also adopts the DeepSeekMoE paradigm (Dai et al., 2024) for Feed-Forward Networks (FFNs), which sets fine-grained routed experts and shared experts. Different from DeepSeek-V3, we change the activation function that computes the affinity scores from Sigmoid(·) to Sqrt(Softplus(·)). For load balancing, we also employ the auxiliary-loss-free strategy (DeepSeek-AI, 2024; Wang et al., 2024a), augmented by a slight sequence-wise balance loss that prevents extreme imbalance within individual sequences. For DeepSeek-V4, we remove the constraint on the number of routing target nodes, and carefully redesign the parallelism strategy to maintain training efficiency. Furthermore, compared with DeepSeek-V3, we replace the dense FFN layers in the initial several Transformer blocks with MoE layers that employ Hash routing (Roller et al., 2021). The Hash routing strategy determines the target experts of each token according to a predefined hash function with regard to the input token ID.

Multi-Token Prediction. As in DeepSeek-V3, the DeepSeek-V4 series also sets MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for the DeepSeek-V4 series without modification.

2.2. Manifold-Constrained Hyper-Connections

As shown in Figure 2, the DeepSeek-V4 series incorporates Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) to strengthen the conventional residual connections between adjacent Transformer blocks. Compared with naive Hyper-Connections (HC) (Zhu et al., 2025), the core idea of mHC is to constrain the residual mapping onto a specific manifold, and thus enhance the stability of signal propagation across layers while preserving model expressivity. This subsection briefly introduces the standard HC and describes how we design mHC for stable training.

Standard Hyper-Connections. The standard HC expands the width of the residual stream by a factor of n_hc. Specifically, the shape of the residual stream is expanded from R^d to R^{n_hc×d}, where d is the hidden size of the actual layer input. Let x_l = [x_{l,1}; ...; x_{l,n_hc}]^T ∈ R^{n_hc×d} be the residual state before the l-th layer. HC introduces three linear mappings: an input mapping A_l ∈ R^{1×n_hc}, a residual transformation B_l ∈ R^{n_hc×n_hc}, and an output mapping C_l ∈ R^{n_hc×1}. The update of the residual state is then formulated as:

    x_{l+1} = B_l x_l + C_l F_l(A_l x_l),    (1)

where F_l denotes the l-th layer (e.g., an MoE layer), whose input and output shapes are both R^d. Note that the actual layer input A_l x_l ∈ R^d is also d-dimensional, so the expanded residual width does not influence the design of the inner layers. HC decouples the residual width from the actual hidden size, offering a complementary scaling axis with minimal computational overhead, as n_hc is typically much smaller than the hidden size d. However, even though HC has demonstrated potential in improving model performance, we find that training will frequently exhibit numerical instability when stacking multiple layers, which hinders the scaling of HC.

Manifold-Constrained Residual Mapping. The core innovation of mHC is to constrain the residual mapping matrix B_l to the manifold of doubly stochastic matrices (the Birkhoff polytope) M, and thus enhance the stability of signal propagation across layers:

    B_l ∈ M.    (2)

This constraint ensures that the spectral norm of the mapping matrix ‖B_l‖_2 is bounded by 1, so the residual transformation is non-expansive, which increases the numerical stability during both the forward pass and backpropagation. Besides, the set M is closed under multiplication, which guarantees stability in the scenarios of deep stacks of mHC. In addition, the input transformation A_l and output transformation C_l are also constrained to be non-negative and bounded via a Sigmoid function to avoid the risk of signal cancellation.

Dynamic Parameterization. The parameters of the three linear mappings are dynamically generated, and are decomposed into a dynamic (input-dependent) component and a static (input-independent) component. Given the input x_l ∈ R^{n_hc×d}, it is first flattened and normalized: x̄_l = RMSNorm(vec(x_l)) ∈ R^{1×n_hc d}. Then, we follow the conventional HC to generate the unconstrained raw parameters Â_l ∈ R^{1×n_hc}, B̂_l ∈ R^{n_hc×n_hc}, and Ĉ_l ∈ R^{n_hc×1}:

    Â_l = α_pre · (x̄_l w_l^pre) + s_pre,    (3)
    B̂_l = α_res · Mat(x̄_l w_l^res) + s_res,    (4)
    Ĉ_l = α_post · (x̄_l w_l^post)^T + s_post,    (5)

where w_l^pre, w_l^post ∈ R^{n_hc d×n_hc} and w_l^res ∈ R^{n_hc d×n_hc²} are learnable parameters for generating the dynamic components; Mat(·) reshapes a vector of size 1×n_hc² into a matrix of size n_hc×n_hc; α_pre, α_res, and α_post are learnable gating factors initialized to small values; and s_pre, s_res, and s_post are the static (input-independent) components.

Applying Parameter Constraints. After obtaining the unconstrained raw parameters Â_l, B̂_l, Ĉ_l, we then apply the constraints described earlier to them to enhance numerical stability. To be specific, for the input and output mappings, we employ a Sigmoid function σ(·) to ensure their non-negativity and boundedness:

    A_l = σ(Â_l),    (6)
    C_l = 2σ(Ĉ_l).    (7)

As for the residual mapping B̂_l, we project it onto the manifold of doubly stochastic matrices M. This is achieved by the Sinkhorn-Knopp algorithm, which first applies an exponential function to B̂_l to ensure positivity, getting M^(0) = exp(B̂_l), and then iteratively performs column and row normalization:

    M^(t) = T_r(T_c(M^(t-1))),    (8)

where T_r and T_c denote row and column normalization, respectively. This iteration converges to a constrained doubly stochastic matrix B_l = M^(t_max). We choose t_max = 20 as a practical value.

Figure 3 | Core architecture of CSA. It compresses the number of KV entries to 1/m times, and then applies DeepSeek Sparse Attention for further acceleration. Additionally, a small set of sliding-window KV entries is combined with the selected compressed KV entries to enhance local fine-grained dependencies. (Figure omitted; it depicts the token-level compressor, the Lightning Indexer producing index scores, the top-k selector over compressed KV entries, and multi-query attention over the selected compressed plus sliding-window KV entries.)

2.3. Hybrid Attention with CSA and HCA

As the context length reaches extreme scales, the attention mechanism emerges as the dominant computational bottleneck in a model. For DeepSeek-V4, we design two efficient attention architectures, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), and employ their interleaved hybrid configuration, which substantially reduces the computational cost of attention in long-text scenarios. CSA integrates both compression and sparse attention strategies: it first compresses the Key-Value (KV) cache of every m tokens into one entry, and then applies DeepSeek Sparse Attention, where each query token attends to only k compressed KV entries. HCA aims for extreme compression by consolidating the KV cache of every m′ (≫ m) tokens into a single entry. The hybrid architecture of CSA and HCA remarkably improves the long-context efficiency of the DeepSeek-V4 series, making one-million-token contexts feasible in practice. This subsection describes the core techniques of our hybrid attention architecture, and we also provide an open-source implementation[1] to specify more details unambiguously.

2.3.1. Compressed Sparse Attention

The core architecture of CSA is illustrated in Figure 3, which first compresses the KV cache of each m tokens into one entry, and then applies DeepSeek Sparse Attention for further acceleration.

Compressed Key-Value Entries. Let H ∈ R^{n×d} be a sequence of input hidden states, where n is the sequence length and d is the hidden size. CSA first computes two series of KV entries c^a, c^b ∈ R^{n×c} and their corresponding compression weights z^a, z^b ∈ R^{n×c}, where c is the head dimension:

    c^a = H · w_kv^a,  c^b = H · w_kv^b,    (9)
    z^a = H · w_z^a,  z^b = H · w_z^b,    (10)

where w_kv^a, w_kv^b, w_z^a, w_z^b ∈ R^{d×c} are trainable parameters. Next, each m KV entries in c^a and c^b will be compressed into one entry according to their compression weights and learnable positional biases B^a, B^b ∈ R^{m×c}, producing c^Comp. Each compressed entry c_i^Comp ∈ R^c is computed by

    s_i = Softmax_row([z^a_{(i-1)m:im-1} + B^a; z^b_{im:(i+1)m-1} + B^b]),    (11)
    c_i^Comp = Σ_j (s_i ⊙ [c^a_{(i-1)m:im-1}; c^b_{im:(i+1)m-1}])_j,    (12)

where ⊙ denotes the Hadamard product; Softmax_row(·) denotes the softmax operation along the row dimension, which performs normalization across the total of 2m elements from both z^a and z^b; and the sum in Equation (12) runs over the 2m rows of the window. When i = 0, z^a_{(i-1)m:im-1} is padded with negative infinity and c^a_{(i-1)m:im-1} is padded with zeros. Note that each c_i^Comp is derived from 2m KV entries, but the indexes of c^b used for c_i^Comp and the indexes of c^a used for c_{i+1}^Comp are overlapped. Therefore, CSA in fact compresses the sequence length to 1/m times.

Lightning Indexer for Sparse Selection. After obtaining the compressed KV entries c^Comp, CSA applies the DSA strategy to select top-k compressed KV entries for core attention. First, CSA performs the same compression operation used for c^Comp to get compressed indexer keys k_I^Comp, where c_I is the indexer head dimension. Then, for a query token t, we produce the indexer queries {q^I_{t,1}; q^I_{t,2}; ...; q^I_{t,n_I}} in a low-rank manner:

    [q^I_{t,1}; q^I_{t,2}; ...; q^I_{t,n_I}] = c_t^Q · w_I^UQ,  c_t^Q = h_t · w^DQ,    (13)

where h_t ∈ R^d is the input hidden state of the query token t; c_t^Q ∈ R^{d_c} is the compressed latent vector for queries; d_c denotes the query compression dimension; n_I denotes the number of indexer query heads; w^DQ ∈ R^{d×d_c} and w_I^UQ ∈ R^{d_c×c_I n_I} are the down-projection and up-projection matrices for indexer queries, respectively. Next, the index score I_{t,s} ∈ R between the query token t and a preceding compressed block s (s < Floor(t/m)) is computed by

    I_{t,s} = Σ_{h=1}^{n_I} w_{t,h} · ReLU(q^I_{t,h} · k^Comp_{I,s}),    (14)
    [w_{t,1}; w_{t,2}; ...; w_{t,n_I}] = w_t = h_t · w_w,    (15)

where w_w ∈ R^{d×n_I} is a learnable matrix; w_{t,h} ∈ R is the weight of the h-th indexer head. For a query token t, given its index scores I_{t,:}, we employ a top-k selector to selectively retain a subset of compressed KV entries C_t^SprsComp for subsequent core attention:

    C_t^SprsComp = {c_s^Comp | I_{t,s} ∈ Top-k(I_{t,:})}.    (16)

[1] https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/tree/main/inference

Figure 4 | Core architecture of HCA. It performs heavier compression, where the KV entries of m′ (≫ m) tokens will be consolidated into one. Also, we additionally introduce a small set of sliding-window KV entries to enhance local fine-grained dependencies. (Figure omitted; it depicts the token-level compressor producing heavily compressed KV entries, concatenated with sliding-window KV entries for shared key-value multi-query attention.)

Shared Key-Value MQA. After selecting the sparse KV entries, CSA then performs core attention in a Multi-Query Attention (MQA) (Shazeer, 2019) manner, where each com
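The Hash routing used in the early MoE layers (Section 2.1) is simple enough to sketch directly. Below is a minimal illustration, assuming a multiplicative-modular hash, 8 routed experts, and 2 routing slots per token; the paper does not specify its actual hash function or these sizes:

```python
# Toy sketch of Hash routing (Roller et al., 2021): the target experts of each
# token are a fixed, predefined function of the input token ID, so no learned
# router is needed. The hash and the sizes below are illustrative assumptions.
N_EXPERTS = 8   # assumed number of routed experts in a Hash-routed layer
TOP_K = 2       # assumed number of experts activated per token

def hash_route(token_id: int) -> list[int]:
    """Deterministically map a token ID to TOP_K expert indices."""
    # Knuth-style multiplicative hash, then a modulo into the expert range.
    return [(token_id * 2654435761 + j) % N_EXPERTS for j in range(TOP_K)]

print(hash_route(42))   # the same token ID always routes to the same experts
```

Because the mapping is static, load balance depends entirely on the token-ID distribution rather than on a learned gate.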
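The Hyper-Connections update of Equation (1) in Section 2.2 can be sketched with plain Python lists. This is a minimal sketch, assuming n_hc = 2, d = 3, hand-picked mapping matrices, and an identity stand-in for the layer F_l:

```python
# Sketch of the HC residual update x_{l+1} = B_l x_l + C_l F_l(A_l x_l),
# with n_hc = 2 expanded residual streams and hidden size d = 3.
# The toy layer and all weights are illustrative, not the paper's values.
N_HC, D = 2, 3

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def hc_update(x, A, B, C, layer):
    """x: n_hc x d residual state; A: 1 x n_hc; B: n_hc x n_hc; C: n_hc x 1."""
    layer_in = matmul(A, x)[0]      # A_l x_l in R^d, the actual layer input
    layer_out = [layer(layer_in)]   # F_l(A_l x_l), still 1 x d
    residual = matmul(B, x)         # B_l x_l, n_hc x d
    update = matmul(C, layer_out)   # C_l F_l(.), broadcast back to n_hc x d
    return [[residual[i][j] + update[i][j] for j in range(D)]
            for i in range(N_HC)]

def identity_layer(v):              # toy F_l: identity over R^d
    return v

x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5]]                    # input mapping: average the two streams
B = [[1.0, 0.0], [0.0, 1.0]]        # residual mapping (identity here)
C = [[1.0], [0.0]]                  # output mapping: write into stream 0 only
x_next = hc_update(x, A, B, C, identity_layer)
print(x_next)                       # [[1.5, 3.5, 4.5], [0.0, 1.0, 0.0]]
```

Note that the inner layer only ever sees the d-dimensional vector A_l x_l, which is why the expanded residual width leaves the layer design untouched.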
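The Sinkhorn-Knopp projection of Equation (8) is likewise easy to sketch. A minimal pure-Python version, assuming n_hc = 4 and an arbitrary raw matrix B̂_l; a real implementation would run batched on GPU inside the training graph:

```python
# Sketch of the Sinkhorn-Knopp projection used by mHC to map the raw matrix
# B_hat onto (approximately) the Birkhoff polytope of doubly stochastic
# matrices: exp(.) for positivity, then alternating normalizations.
import math

def sinkhorn_knopp(b_hat, t_max=20):
    """Project an n_hc x n_hc raw matrix toward doubly stochastic form."""
    m = [[math.exp(v) for v in row] for row in b_hat]
    n = len(m)
    for _ in range(t_max):
        # column normalization T_c: make each column sum to 1
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        # row normalization T_r: make each row sum to 1
        m = [[v / sum(row) for v in row] for row in m]
    return m

b_hat = [[0.3, -1.2, 0.5, 0.0],
         [1.1, 0.2, -0.7, 0.4],
         [-0.5, 0.8, 0.1, -0.2],
         [0.0, -0.3, 0.9, 0.6]]
b = sinkhorn_knopp(b_hat)
# rows sum to 1 exactly after the final row step; columns converge toward 1
print([round(sum(row), 6) for row in b])
```

Since every row of the result is a convex combination, the projected matrix is non-expansive in spectral norm, which is exactly the stability property Section 2.2 relies on.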