arXiv:2512.24880v1 [cs.CL] 31 Dec 2025
mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie*†, Yixuan Wei*, Huanqi Cao*,
Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang

DeepSeek-AI
Abstract

Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.
[Figure 1: structural diagrams of (a) Residual Connection, (b) Hyper-Connections (HC), and (c) Manifold-Constrained HC (mHC), each showing the layer $\mathcal{F}$ together with the Pre, Post, and Res mappings ($\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, $\mathcal{H}^{\mathrm{res}}_l$); in (c) the mappings pass through manifold projections $\mathcal{P}_{\mathcal{M}^{\mathrm{pre}}}$, $\mathcal{P}_{\mathcal{M}^{\mathrm{post}}}$, $\mathcal{P}_{\mathcal{M}^{\mathrm{res}}}$.]
Figure 1 | Illustrations of Residual Connection Paradigms. This figure compares the structural design of (a) the standard Residual Connection, (b) Hyper-Connections (HC), and (c) our proposed Manifold-Constrained Hyper-Connections (mHC). Unlike the unconstrained HC, mHC focuses on optimizing the residual connection space by projecting the matrices onto a constrained manifold to ensure stability.
*Core contributors. †Corresponding author: xie.zhenda@
Contents

Introduction
Related Works
  Micro Design
  Macro Design
Preliminary
  Numerical Instability
  System Overhead
Method
  Manifold-Constrained Hyper-Connections
  Parameterization and Manifold Projection
  Efficient Infrastructure Design
    Kernel Fusion
    Recomputing
    Overlapping Communication in DualPipe
Experiments
  Experimental Setup
  Main Results
  Scaling Experiments
  Stability Analysis
Conclusion and Outlook
A Appendix
  Detailed Model Specifications and Hyper-parameters
Introduction
Deep neural network architectures have undergone rapid evolution since the introduction of ResNets (He et al., 2016a). As illustrated in Fig. 1(a), the structure of a single layer can be formulated as follows:

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathbf{W}_l), \tag{1}$$
where $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ denote the $C$-dimensional input and output of the $l$-th layer, respectively, and $\mathcal{F}$ represents the residual function. Although the residual function has evolved over the past decade to include various operations such as convolution, attention mechanisms, and feed-forward networks, the paradigm of the residual connection has maintained its original form. Accompanying the progression of the Transformer (Vaswani et al., 2017) architecture, this paradigm has established itself as a fundamental design element in large language models (LLMs) (Brown et al., 2020; Liu et al., 2024b; Touvron et al., 2023).

This success is primarily attributed to the concise form of the residual connection. More importantly, early research (He et al., 2016b) revealed that the identity mapping property of the residual connection maintains stability and efficiency during large-scale training. By recursively extending the residual connection across multiple layers, Eq. (1) yields:

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathbf{W}_i), \tag{2}$$

where $L$ and $l$ correspond to deeper and shallower layers, respectively. The term identity mapping refers to the component $\mathbf{x}_l$ itself, which emphasizes the property that the signal from the shallower layer maps directly to the deeper layer without any modification.
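As a quick numerical check of the recursion from Eq. (1) to Eq. (2), the following NumPy sketch stacks a few toy residual layers and verifies that the deep state equals the shallow state plus the summed residual branches; the layer function f and all sizes here are illustrative assumptions, not the paper's setup.

    import numpy as np

    rng = np.random.default_rng(6)
    C, depth = 8, 6
    Ws = [0.1 * rng.standard_normal((C, C)) for _ in range(depth)]

    def f(h, W):                  # stand-in for the residual function F
        return np.tanh(h @ W)

    x_l = rng.standard_normal(C)  # shallow-layer state
    x, branches = x_l.copy(), []
    for W in Ws:                  # apply Eq. (1) layer by layer
        b = f(x, W)
        branches.append(b)
        x = x + b

    # Eq. (2): the deep state is the shallow state plus the summed branches,
    # i.e. x_l passes through unmodified (the identity mapping).
    print(np.allclose(x, x_l + sum(branches)))   # True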
Recently, studies exemplified by Hyper-Connections (HC) (Zhu et al., 2024) have introduced a new dimension to the residual connection and empirically demonstrated its performance potential. The single-layer architecture of HC is illustrated in Fig. 1(b). By expanding the width of the residual stream and enhancing connection complexity, HC significantly increases topological complexity without altering the computational overhead of individual units regarding FLOPs. Formally, single-layer propagation in HC is defined as:

$$\mathbf{x}_{l+1} = \mathcal{H}^{\mathrm{res}}_l \mathbf{x}_l + (\mathcal{H}^{\mathrm{post}}_l)^\top \mathcal{F}(\mathcal{H}^{\mathrm{pre}}_l \mathbf{x}_l, \mathbf{W}_l), \tag{3}$$

where $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ denote the input and output of the $l$-th layer, respectively. Unlike the formulation in Eq. (1), the feature dimension of $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ is expanded from $C$ to $n \times C$, where $n$ is the expansion rate. The term $\mathcal{H}^{\mathrm{res}}_l \in \mathbb{R}^{n \times n}$ represents a learnable mapping that mixes features within the residual stream. Also a learnable mapping, $\mathcal{H}^{\mathrm{pre}}_l \in \mathbb{R}^{1 \times n}$ aggregates features from the $nC$-dim stream into a $C$-dim layer input, and conversely, $\mathcal{H}^{\mathrm{post}}_l \in \mathbb{R}^{1 \times n}$ maps the layer output back onto the stream.
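To make the data flow of Eq. (3) concrete, here is a minimal NumPy sketch of one HC layer acting on an n-stream hidden state; the layer function f, the random mappings, and the toy sizes are illustrative assumptions rather than the paper's implementation.

    import numpy as np

    n, C = 4, 8                              # expansion rate and feature width (toy sizes)
    rng = np.random.default_rng(0)

    x = rng.standard_normal((n, C))          # n-stream residual state
    H_pre = rng.standard_normal((1, n))      # read-out mapping,  R^{1 x n}
    H_post = rng.standard_normal((1, n))     # write-in mapping,  R^{1 x n}
    H_res = rng.standard_normal((n, n))      # stream-mixing mapping, R^{n x n}

    def f(h):                                # stand-in for the layer function F
        return np.tanh(h)

    layer_in = H_pre @ x                     # aggregate n streams -> (1, C) layer input
    layer_out = f(layer_in)                  # (1, C)
    x_next = H_res @ x + H_post.T @ layer_out  # Eq. (3): mix streams, write back
    print(x_next.shape)                      # (n, C)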
However, as the training scale increases, HC introduces potential risks of instability. The primary concern is that the unconstrained nature of HC compromises the identity mapping property when the architecture extends across multiple layers. In architectures comprising multiple parallel streams, an ideal identity mapping serves as a conservation mechanism. It ensures that the average signal intensity across streams remains invariant during both forward and backward propagation. Recursively extending HC to multiple layers via Eq. (3) yields:

$$\mathbf{x}_L = \left( \prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i} \right) \mathbf{x}_l + \sum_{i=l}^{L-1} \left( \prod_{j=1}^{L-1-i} \mathcal{H}^{\mathrm{res}}_{L-j} \right) (\mathcal{H}^{\mathrm{post}}_i)^\top \mathcal{F}(\mathcal{H}^{\mathrm{pre}}_i \mathbf{x}_i, \mathbf{W}_i), \tag{4}$$

where $L$ and $l$ represent a deeper layer and a shallower layer, respectively. In contrast to Eq. (2), the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i}$ in HC fails to preserve the global mean of the features. This discrepancy leads to unbounded signal amplification or attenuation, resulting in instability during large-scale training. A further consideration is that, while HC preserves computational efficiency in terms of FLOPs, the hardware efficiency concerning memory access costs for the widened residual stream remains unaddressed in the original design. These factors collectively restrict the practical scalability of HC and hinder its application in large-scale training.
To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), as shown in Fig. 1(c), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Specifically, mHC utilizes the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967) to entropically project $\mathcal{H}^{\mathrm{res}}_l$ onto the Birkhoff polytope. This operation effectively constrains the residual connection matrices within the manifold that is constituted by doubly stochastic matrices. Since the row and column sums of these matrices equal 1, the operation $\mathcal{H}^{\mathrm{res}}_l \mathbf{x}_l$ functions as a convex combination of the input features. This characteristic facilitates well-conditioned signal propagation where the feature mean is conserved and the signal norm is strictly regularized, effectively mitigating the risk of vanishing or exploding signals. Furthermore, due to the closure of matrix multiplication for doubly stochastic matrices, the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i}$ retains this conservation property. Consequently, mHC effectively maintains the stability of identity mappings between arbitrary depths. To ensure efficiency, we employ kernel fusion and develop mixed-precision kernels utilizing TileLang (Wang et al., 2025). Furthermore, we mitigate the memory footprint through selective recomputing and carefully overlap communication within the DualPipe schedule (Liu et al., 2024b). Extensive experiments on language model pretraining demonstrate that mHC exhibits exceptional stability and scalability while maintaining the performance advantages of HC. In-house large-scale training indicates that mHC supports training at scale and introduces only a 6.7% additional time overhead with expansion rate $n = 4$.
Related Works

Architectural advancements in deep learning can be primarily classified into micro-design and macro-design. Micro-design concerns the internal architecture of computational blocks, specifying how features are processed across spatial, temporal, and channel dimensions. In contrast, macro-design establishes the inter-block topological structure, thereby dictating how feature representations are propagated, routed, and merged across distinct layers.
Micro Design

Driven by parameter sharing and translation invariance, convolution initially dominated the processing of structured signals. While subsequent variations such as depthwise separable (Chollet, 2017) and grouped convolutions (Xie et al., 2017) optimized efficiency, the advent of Transformers (Vaswani et al., 2017) established Attention and Feed-Forward Networks (FFNs) as the fundamental building blocks of modern architecture. Attention mechanisms facilitate global information propagation, while FFNs enhance the representational capacity of individual features. To balance performance with the computational demands of LLMs, attention mechanisms have evolved towards efficient variants such as Multi-Query Attention (MQA) (Shazeer, 2019), Grouped-Query Attention (GQA) (Ainslie et al., 2023), and Multi-Head Latent Attention (MLA) (Liu et al., 2024a). Simultaneously, FFNs have been generalized into sparse computing paradigms via Mixture-of-Experts (MoE) (Fedus et al., 2022; Lepikhin et al., 2020; Shazeer et al., 2017), allowing for massive parameter scaling without proportional computational costs.
Macro Design

Macro-design governs the global topology of the network (Srivastava et al., 2015). Following ResNet (He et al., 2016a), architectures such as DenseNet (Huang et al., 2017) and FractalNet (Larsson et al., 2016) aimed to enhance performance by increasing topological complexity through dense connectivity and multi-path structures, respectively. Deep Layer Aggregation (DLA) (Yu et al., 2018) further extended this paradigm by recursively aggregating features across various depths and resolutions.

More recently, the focus of macro-design has shifted toward expanding the width of the residual stream (Chai et al., 2020; Fang et al., 2023; Heddes et al., 2025; Mak and Flanigan, 2025; Menghani et al., 2025; Pagliardini et al., 2024; Xiao et al., 2025; Xie et al., 2023; Zhu et al., 2024). Hyper-Connections (HC) (Zhu et al., 2024) introduced learnable matrices to modulate connection strengths among features at varying depths, while the Residual Matrix Transformer (RMT) (Mak and Flanigan, 2025) replaced the standard residual stream with an outer-product memory matrix to facilitate feature storage. Similarly, MUDDFormer (Xiao et al., 2025) employs multiway dynamic dense connections to optimize cross-layer information flow. Despite their potential, these approaches compromise the inherent identity mapping property of the residual connection, thereby introducing instability and hindering scalability. Furthermore, they incur significant memory access overhead due to expanded feature widths. Building upon HC, the proposed mHC restricts the residual connection space onto a specific manifold to restore the identity mapping property, while also incorporating rigorous infrastructure optimizations to ensure efficiency. This approach enhances stability and scalability while maintaining the topological benefits of expanded connections.
Preliminary

We first establish the notation used in this work. In the HC formulation, the input to the $l$-th layer, $\mathbf{x}_l \in \mathbb{R}^{1 \times C}$, is expanded by a factor of $n$ to construct a hidden matrix $\mathbf{x}_l = (\mathbf{x}_{l,0}^\top, \ldots, \mathbf{x}_{l,n-1}^\top)^\top \in \mathbb{R}^{n \times C}$, which can be viewed as an $n$-stream residual. This operation effectively broadens the width of the residual stream. To govern the read-out, write-in, and updating processes of this stream, HC introduces three learnable linear mappings: $\mathcal{H}^{\mathrm{pre}}_l, \mathcal{H}^{\mathrm{post}}_l \in \mathbb{R}^{1 \times n}$, and $\mathcal{H}^{\mathrm{res}}_l \in \mathbb{R}^{n \times n}$. These mappings modify the standard residual connection shown in Eq. (1), resulting in the formulation given in Eq. (3).
In the HC formulation, learnable mappings are composed of two parts of coefficients: the input-dependent one and the global one, referred to as dynamic mappings and static mappings, respectively. Formally, HC computes the coefficients as follows:

$$\begin{aligned}
\tilde{\mathbf{x}}_l &= \mathrm{RMSNorm}(\mathbf{x}_l) \\
\mathcal{H}^{\mathrm{pre}}_l &= \alpha^{\mathrm{pre}}_l \cdot \tanh(\theta^{\mathrm{pre}}_l \tilde{\mathbf{x}}_l^\top) + \mathbf{b}^{\mathrm{pre}}_l \\
\mathcal{H}^{\mathrm{post}}_l &= \alpha^{\mathrm{post}}_l \cdot \tanh(\theta^{\mathrm{post}}_l \tilde{\mathbf{x}}_l^\top) + \mathbf{b}^{\mathrm{post}}_l \\
\mathcal{H}^{\mathrm{res}}_l &= \alpha^{\mathrm{res}}_l \cdot \tanh(\theta^{\mathrm{res}}_l \tilde{\mathbf{x}}_l^\top) + \mathbf{b}^{\mathrm{res}}_l,
\end{aligned} \tag{5}$$

where $\mathrm{RMSNorm}(\cdot)$ (Zhang and Sennrich, 2019) is applied to the last dimension, and the scalars $\alpha^{\mathrm{pre}}_l, \alpha^{\mathrm{post}}_l, \alpha^{\mathrm{res}}_l \in \mathbb{R}$ are learnable gating factors initialized to small values. The dynamic mappings are derived via linear projections parameterized by $\theta^{\mathrm{pre}}_l, \theta^{\mathrm{post}}_l \in \mathbb{R}^{1 \times C}$ and $\theta^{\mathrm{res}}_l \in \mathbb{R}^{n \times C}$, while the static mappings are represented by learnable biases $\mathbf{b}^{\mathrm{pre}}_l, \mathbf{b}^{\mathrm{post}}_l \in \mathbb{R}^{1 \times n}$ and $\mathbf{b}^{\mathrm{res}}_l \in \mathbb{R}^{n \times n}$.

It is worth noting that the introduction of these mappings ($\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, and $\mathcal{H}^{\mathrm{res}}_l$) incurs negligible computational overhead, as the typical expansion rate $n$, e.g. 4, is much smaller than the input dimension $C$. With this design, HC effectively decouples the information capacity of the residual stream from the layer's input dimension, which is strongly correlated with the model's computational complexity (FLOPs). Consequently, HC offers a new avenue for scaling by adjusting the residual stream width, complementing the traditional scaling dimensions of model FLOPs and training data size discussed in pre-training scaling laws (Hoffmann et al., 2022).
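A minimal sketch of Eq. (5) for the pre mapping follows (the post and res mappings are computed analogously); the toy sizes, the uniform bias initialization, and the small gating value are assumptions made for illustration.

    import numpy as np

    n, C = 4, 8
    rng = np.random.default_rng(1)

    x = rng.standard_normal((n, C))          # n-stream hidden matrix
    theta_pre = rng.standard_normal((1, C))  # dynamic projection for H^pre
    b_pre = np.full((1, n), 1.0 / n)         # static bias (toy initialization)
    alpha_pre = 0.01                         # gating factor, initialized small

    def rms_norm(h, eps=1e-6):               # RMSNorm over the last dimension
        return h / np.sqrt((h ** 2).mean(axis=-1, keepdims=True) + eps)

    x_tilde = rms_norm(x)                                        # (n, C)
    H_pre = alpha_pre * np.tanh(theta_pre @ x_tilde.T) + b_pre   # Eq. (5): (1, n)
    print(H_pre)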
Although HC necessitates three mappings to manage the dimensional mismatch between the residual stream and the layer input, preliminary experiments presented in Tab. 1 indicate that the residual mapping $\mathcal{H}^{\mathrm{res}}_l$ yields the most significant performance gain. This finding underscores the critical importance of effective information exchange within the residual stream.
Table 1 | Ablation Study of HC Components. When a specific mapping ($\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, or $\mathcal{H}^{\mathrm{res}}_l$) is disabled, we employ a fixed mapping to maintain dimensional consistency: uniform weights of $1/n$ for $\mathcal{H}^{\mathrm{pre}}_l$, uniform weights of ones for $\mathcal{H}^{\mathrm{post}}_l$, and the identity matrix for $\mathcal{H}^{\mathrm{res}}_l$.

| $\mathcal{H}^{\mathrm{res}}_l$ | $\mathcal{H}^{\mathrm{pre}}_l$ | $\mathcal{H}^{\mathrm{post}}_l$ | Absolute Loss Gap |
| --- | --- | --- | --- |
|   |   |   | 0.0 |
| ✓ |   |   | -0.022 |
| ✓ | ✓ |   | -0.025 |
| ✓ | ✓ | ✓ | -0.027 |
Numerical Instability

While the residual mapping $\mathcal{H}^{\mathrm{res}}_l$ is instrumental for performance, its sequential application poses a significant risk to numerical stability. As detailed in Eq. (4), when HC is extended across multiple layers, the effective signal propagation from layer $l$ to $L$ is governed by the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i}$. Since the learnable mapping $\mathcal{H}^{\mathrm{res}}_l$ is unconstrained, this composite mapping inevitably deviates from the identity mapping. Consequently, the signal magnitude is prone to explosion or vanishing during both the forward pass and backpropagation. This phenomenon undermines the fundamental premise of residual learning, which relies on unimpeded signal flow, thereby destabilizing the training process in deeper or larger-scale models.

Empirical evidence supports this analysis. We observe unstable loss behavior in large-scale experiments, as illustrated in Fig. 2. Taking mHC as the baseline, HC exhibits an unexpected loss surge around the 12k step, which is highly correlated with the instability in the gradient norm. Furthermore, the analysis on $\mathcal{H}^{\mathrm{res}}_l$ validates the mechanism of this instability. To quantify how the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i}$ amplifies signals along the residual stream, we utilize two metrics. The first, based on the maximum absolute value of the row sums of the composite mapping, captures the worst-case expansion in the forward pass. The second, based on the maximum absolute column sum, corresponds to the backward pass. We refer to these metrics as the Amax Gain Magnitude of the composite mapping. As shown in Fig. 3(b), the Amax Gain Magnitude yields extreme values with peaks of 3000, a stark divergence from 1 that confirms the presence of exploding residual streams.
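The Amax Gain Magnitude is straightforward to compute from a (composite) residual mapping. A small NumPy sketch follows, where the unconstrained mappings (identity plus noise) are a toy stand-in used only to show how the gain of a deep composite drifts away from 1.

    import numpy as np

    def amax_gain_magnitude(H):
        # Forward gain:  max absolute row sum    (worst-case signal expansion)
        # Backward gain: max absolute column sum (worst-case gradient expansion)
        forward = np.abs(H.sum(axis=1)).max()
        backward = np.abs(H.sum(axis=0)).max()
        return forward, backward

    rng = np.random.default_rng(2)
    n, depth = 4, 60
    # Composite of `depth` unconstrained residual mappings (identity + noise).
    composite = np.eye(n)
    for _ in range(depth):
        composite = (np.eye(n) + 0.2 * rng.standard_normal((n, n))) @ composite

    print(amax_gain_magnitude(composite))   # typically drifts far from 1 with depth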
[Figure 2: (a) absolute training loss gap of HC relative to mHC over 50k training steps; (b) gradient norm of HC and mHC over the same steps.]
Figure 2 | Training Instability of Hyper-Connections (HC). This figure illustrates (a) the absolute loss gap of HC relative to mHC, and (b) the comparisons of gradient norms. All results are based on 27B models.
[Figure 3: Amax Gain Magnitude (log scale) vs. layer index l for (a) the single-layer mapping and (b) the composite mapping, showing forward signal gain and backward gradient gain.]
Figure 3 | Propagation Instability of Hyper-Connections (HC). This figure illustrates the propagation dynamics of (a) the single-layer mapping $\mathcal{H}^{\mathrm{res}}_l$ and (b) the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i}$ within the 27B model. The layer index $l$ (x-axis) unrolls each standard Transformer block into two independent layers (Attention and FFN). The Amax Gain Magnitude (y-axis) is calculated as the maximum absolute row sum (for the forward signal) and column sum (for the backward gradient), averaged over all tokens in a selected sequence.
System Overhead

While the computational complexity of HC remains manageable due to the linearity of the additional mappings, the system-level overhead presents a non-negligible challenge. Specifically, memory access (I/O) costs often constitute one of the primary bottlenecks in modern model architectures, which is widely referred to as the "memory wall" (Dao et al., 2022). This bottleneck is frequently overlooked in architectural design, yet it decisively impacts runtime efficiency.

Focusing on the widely adopted pre-norm Transformer (Vaswani et al., 2017) architecture, we analyze the I/O patterns inherent to HC. Tab. 2 summarizes the per-token memory access overhead in a single residual layer introduced by the $n$-stream residual design. The analysis reveals that HC increases the memory access cost by a factor approximately proportional to $n$. This excessive I/O demand significantly degrades training throughput without the mitigation of fused kernels. Besides, since $\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, and $\mathcal{H}^{\mathrm{res}}_l$ involve learnable parameters, their intermediate activations are required for backpropagation. This results in a substantial increase in the GPU memory footprint, often necessitating gradient checkpointing to maintain feasible memory usage. Furthermore, HC requires $n$-fold more communication cost in pipeline parallelism (Qi et al., 2024), leading to larger bubbles and decreasing the training throughput.
Table 2 | Comparison of Memory Access Costs Per Token. This analysis accounts for the overhead introduced by the residual stream maintenance in the forward pass, excluding the internal I/O of the layer function $\mathcal{F}$.

| Method | Operation | Read (Elements) | Write (Elements) |
| --- | --- | --- | --- |
| Residual Connection | Residual Merge | $2C$ | $C$ |
| | Total I/O | $2C$ | $C$ |
| Hyper-Connections | Calculate $\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, $\mathcal{H}^{\mathrm{res}}_l$ | $nC$ | $n^2 + 2n$ |
| | $\mathcal{H}^{\mathrm{pre}}_l$ | $nC + n$ | $C$ |
| | $\mathcal{H}^{\mathrm{post}}_l$ | $C + n$ | $nC$ |
| | $\mathcal{H}^{\mathrm{res}}_l$ | $nC + n^2$ | $nC$ |
| | Residual Merge | $2nC$ | $nC$ |
| | Total I/O | $(5n+1)C + n^2 + 2n$ | $(3n+1)C + n^2 + 2n$ |
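As a quick sanity check on the totals in Tab. 2, the following snippet evaluates the per-token forward-pass I/O for expansion rate n = 4; the width C = 4096 is an illustrative value, and the small n² terms are included only for completeness.

    n, C = 4, 4096   # expansion rate and an illustrative feature width

    # Totals from Tab. 2 (elements moved per token, forward pass).
    hc_read = (5 * n + 1) * C + n ** 2 + 2 * n
    hc_write = (3 * n + 1) * C + n ** 2 + 2 * n
    rc_read, rc_write = 2 * C, C

    ratio = (hc_read + hc_write) / (rc_read + rc_write)
    print(hc_read + hc_write, rc_read + rc_write, round(ratio, 1))
    # For n = 4, stream maintenance moves roughly (8n + 2) / 3 ≈ 11.3x more
    # elements than a plain residual connection, motivating kernel fusion.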
Method

Manifold-Constrained Hyper-Connections
Drawing inspiration from the identity mapping principle (He et al., 2016b), the core premise of mHC is to constrain the residual mapping $\mathcal{H}^{\mathrm{res}}_l$ onto a specific manifold. While the original identity mapping ensures stability by enforcing $\mathcal{H}^{\mathrm{res}}_l = \mathbf{I}$, it fundamentally precludes information exchange within the residual stream, which is critical for maximizing the potential of multi-stream architectures. Therefore, we propose projecting the residual mapping onto a manifold that simultaneously maintains the stability of signal propagation across layers and facilitates mutual interaction among residual streams to preserve the model's expressivity. To this end, we restrict $\mathcal{H}^{\mathrm{res}}_l$ to be a doubly stochastic matrix, which has non-negative entries where both the rows and columns sum to 1. Formally, let $\mathcal{M}^{\mathrm{res}}$ denote the manifold of doubly stochastic matrices (also known as the Birkhoff polytope). We constrain $\mathcal{H}^{\mathrm{res}}_l$ to $\mathcal{P}_{\mathcal{M}^{\mathrm{res}}}(\mathcal{H}^{\mathrm{res}}_l)$, defined as:

$$\mathcal{P}_{\mathcal{M}^{\mathrm{res}}}(\mathcal{H}^{\mathrm{res}}_l) := \left\{ \mathcal{H}^{\mathrm{res}}_l \in \mathbb{R}^{n \times n} \;\middle|\; \mathcal{H}^{\mathrm{res}}_l \mathbf{1}_n = \mathbf{1}_n, \; \mathbf{1}_n^\top \mathcal{H}^{\mathrm{res}}_l = \mathbf{1}_n^\top, \; \mathcal{H}^{\mathrm{res}}_l \geq 0 \right\}, \tag{6}$$

where $\mathbf{1}_n$ represents the $n$-dimensional vector of all ones.
It is worth noting that when $n = 1$, the doubly stochastic condition degenerates to the scalar 1, thereby recovering the original identity mapping. The choice of double stochasticity confers several rigorous theoretical properties beneficial for large-scale model training, illustrated numerically in the sketch after this list:

- Norm Preservation: The spectral norm of a doubly stochastic matrix is bounded by 1 (i.e., $\|\mathcal{H}^{\mathrm{res}}_l\|_2 \leq 1$). This implies that the learnable mapping is non-expansive, effectively mitigating the gradient explosion problem.
- Compositional Closure: The set of doubly stochastic matrices is closed under matrix multiplication. This ensures that the composite residual mapping across multiple layers, $\prod_{i=1}^{L-l} \mathcal{H}^{\mathrm{res}}_{L-i}$, remains doubly stochastic, thereby preserving stability throughout the entire depth of the model.
- Geometric Interpretation via the Birkhoff Polytope: The set $\mathcal{M}^{\mathrm{res}}$ forms the Birkhoff polytope, which is the convex hull of the set of permutation matrices. This provides a clear geometric interpretation: the residual mapping acts as a convex combination of permutations. Mathematically, the repeated application of such matrices tends to increase the mixing of information across streams monotonically, effectively functioning as a robust feature fusion mechanism.
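These properties can be verified numerically; below is a minimal NumPy sketch, assuming toy doubly stochastic matrices sampled as convex combinations of permutation matrices (which, by Birkhoff's theorem, is exactly what such matrices are).

    import numpy as np

    rng = np.random.default_rng(3)
    n = 4

    def random_doubly_stochastic(n):
        # Convex combination of permutation matrices (Birkhoff's theorem).
        w = rng.dirichlet(np.ones(8))
        perms = [np.eye(n)[rng.permutation(n)] for _ in range(8)]
        return sum(wi * P for wi, P in zip(w, perms))

    H1, H2 = random_doubly_stochastic(n), random_doubly_stochastic(n)

    # Norm preservation: spectral norm bounded by 1.
    print(np.linalg.norm(H1, 2) <= 1 + 1e-9)                             # True

    # Compositional closure: the product keeps unit row/column sums.
    P = H1 @ H2
    print(np.allclose(P.sum(axis=0), 1), np.allclose(P.sum(axis=1), 1))  # True True

    # Mean conservation: H x preserves the per-feature mean across streams.
    x = rng.standard_normal((n, 5))
    print(np.allclose((H1 @ x).mean(axis=0), x.mean(axis=0)))            # True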
Additionally, we impose non-negativity constraints on the input mappings $\mathcal{H}^{\mathrm{pre}}_l$ and output mappings $\mathcal{H}^{\mathrm{post}}_l$. This constraint prevents signal cancellation arising from the composition of positive and negative coefficients, and can also be considered as a special manifold projection.
Parameterization and Manifold Projection

In this section, we detail the calculation process of $\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, and $\mathcal{H}^{\mathrm{res}}_l$ in mHC. Given the input hidden matrix $\mathbf{x}_l \in \mathbb{R}^{n \times C}$ at the $l$-th layer, we first flatten it into a vector $\vec{\mathbf{x}}_l = \mathrm{vec}(\mathbf{x}_l) \in \mathbb{R}^{1 \times nC}$ to preserve full context information. Then, we follow the original HC formulation to get the dynamic mappings and the static mappings as follows:

$$\begin{aligned}
\vec{\mathbf{x}}'_l &= \mathrm{RMSNorm}(\vec{\mathbf{x}}_l) \\
\tilde{\mathcal{H}}^{\mathrm{pre}}_l &= \alpha^{\mathrm{pre}}_l \cdot (\vec{\mathbf{x}}'_l \varphi^{\mathrm{pre}}_l) + \mathbf{b}^{\mathrm{pre}}_l \\
\tilde{\mathcal{H}}^{\mathrm{post}}_l &= \alpha^{\mathrm{post}}_l \cdot (\vec{\mathbf{x}}'_l \varphi^{\mathrm{post}}_l) + \mathbf{b}^{\mathrm{post}}_l \\
\tilde{\mathcal{H}}^{\mathrm{res}}_l &= \alpha^{\mathrm{res}}_l \cdot \mathrm{mat}(\vec{\mathbf{x}}'_l \varphi^{\mathrm{res}}_l) + \mathbf{b}^{\mathrm{res}}_l,
\end{aligned} \tag{7}$$

where $\varphi^{\mathrm{pre}}_l, \varphi^{\mathrm{post}}_l \in \mathbb{R}^{nC \times n}$ and $\varphi^{\mathrm{res}}_l \in \mathbb{R}^{nC \times n^2}$ are linear projections for dynamic mappings, and $\mathrm{mat}(\cdot)$ is a reshape function from $\mathbb{R}^{1 \times n^2}$ to $\mathbb{R}^{n \times n}$.
Then, the final constrained mappings are obtained via:

$$\begin{aligned}
\mathcal{H}^{\mathrm{pre}}_l &= \sigma(\tilde{\mathcal{H}}^{\mathrm{pre}}_l) \\
\mathcal{H}^{\mathrm{post}}_l &= 2\sigma(\tilde{\mathcal{H}}^{\mathrm{post}}_l) \\
\mathcal{H}^{\mathrm{res}}_l &= \text{Sinkhorn-Knopp}(\tilde{\mathcal{H}}^{\mathrm{res}}_l),
\end{aligned} \tag{8}$$

where $\sigma(\cdot)$ denotes the Sigmoid function. The Sinkhorn-Knopp operator first makes all elements positive via an exponent operator and then conducts an iterative normalization process that alternately rescales rows and columns to sum to 1. Specifically, given a positive matrix $\mathbf{M}^{(0)} = \exp(\tilde{\mathcal{H}}^{\mathrm{res}}_l)$ as the start point, the normalization iteration proceeds as:

$$\mathbf{M}^{(t)} = \mathcal{T}_r(\mathcal{T}_c(\mathbf{M}^{(t-1)})), \tag{9}$$

where $\mathcal{T}_r$ and $\mathcal{T}_c$ denote row and column normalization, respectively. This process converges to a doubly stochastic matrix $\mathcal{H}^{\mathrm{res}}_l = \mathbf{M}^{(t_{\max})}$ as $t_{\max} \to \infty$. We choose $t_{\max} = 20$ as a practical value in our experiments.
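A compact NumPy sketch of the projection in Eqs. (8)-(9) follows, using the same t_max = 20 as above; the random input matrix is an illustrative stand-in for $\tilde{\mathcal{H}}^{\mathrm{res}}_l$.

    import numpy as np

    def sinkhorn_knopp(H_tilde, t_max=20):
        # Entropic projection of an n x n matrix onto the Birkhoff polytope.
        M = np.exp(H_tilde)                       # positivity via exponent
        for _ in range(t_max):
            M = M / M.sum(axis=0, keepdims=True)  # column normalization T_c
            M = M / M.sum(axis=1, keepdims=True)  # row normalization    T_r
        return M

    rng = np.random.default_rng(4)
    n = 4
    H_res = sinkhorn_knopp(rng.standard_normal((n, n)))
    print(np.allclose(H_res.sum(axis=1), 1))      # rows sum to 1 (exact after T_r)
    print(np.abs(H_res.sum(axis=0) - 1).max())    # columns near 1, tightening with t_max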
Efficient Infrastructure Design

In this section, we detail the infrastructure design tailored for mHC. Through rigorous optimization, we implement mHC (with $n = 4$) in large-scale models with a marginal training overhead of only 6.7%.

Kernel Fusion
Observing that RMSNorm in mHC imposes significant latency when operating on the high-dimensional hidden state $\vec{\mathbf{x}}_l \in \mathbb{R}^{1 \times nC}$, we reorder the dividing-by-norm operation to follow the matrix multiplication, as sketched below. This optimization maintains mathematical equivalence while improving efficiency. Furthermore, we employ mixed-precision strategies to maximize numerical accuracy without compromising speed, and fuse multiple operations with shared memory access into unified compute kernels to reduce memory bandwidth bottlenecks. Based on the inputs and parameters detailed in Eq. (10) to (13), we implement three specialized mHC kernels to compute $\mathcal{H}^{\mathrm{pre}}_l$, $\mathcal{H}^{\mathrm{post}}_l$, and $\mathcal{H}^{\mathrm{res}}_l$. In these kernels, the biases and linear projections are consolidated into $\mathbf{b}_l$ and $\varphi_l$, and the RMSNorm weight is also absorbed in $\varphi_l$.
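The reordering works because RMSNorm rescales its input by a single scalar, so dividing by the norm commutes with the matrix multiplication once the learnable norm weight is absorbed into $\varphi_l$. A minimal NumPy check of this equivalence, assuming an unweighted RMSNorm and toy shapes:

    import numpy as np

    rng = np.random.default_rng(5)
    nC, out = 32, 24
    x = rng.standard_normal((1, nC))
    phi = rng.standard_normal((nC, out))  # projection with norm weight absorbed

    rms = np.sqrt((x ** 2).mean() + 1e-6)

    a = (x / rms) @ phi                   # normalize first, then project
    b = (x @ phi) / rms                   # project first, then divide by norm
    print(np.allclose(a, b))              # True: the two orderings are equivalent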
- Eq. (14) to (15): We develop a unified kernel that fuses two scans on $\vec{\mathbf{x}}_l$, leveraging matrix multiplication units to maximize memory bandwidth utilization. The backward pass, comprising two matrix multiplications, is similarly consolidated into a single kernel, eliminating redundant reloading of $\vec{\mathbf{x}}_l$. Both kernels feature a finely tuned pipeline (load, cast, compute, store) to efficiently handle mixed-precision processing.
- Eq. (16) to (18): These lightweight operations on small coefficients are opportunistically fused into a single kernel, significantly reducing kernel launch overhead.
- Eq. (19): We implement the Sinkhorn-Knopp iteration within a single kernel. For the backward pass, we derive a custom backward kernel that recomputes the intermediate results on-chip and traverses the entire iteration.
$$\varphi_l: \text{tfloat32} \quad [nC, \, n^2 + 2n] \tag{10}$$

$$\vec{\mathbf{x}}_l: \text{bfloat16} \quad [1, \, nC] \tag{11}$$

$$\alpha^{\mathrm{pre}}_l, \alpha^{\mathrm{post}}_l, \alpha^{\mathrm{res}}_l: \text{float32} \quad \text{Scalars} \tag{12}$$

$$\mathbf{b}_l: \text{float32} \quad [1, \, n^2 + 2n] \tag{13}$$

$$\tilde{\mathcal{H}}^{\mathrm{pre}}_l, \tilde{\mathcal{H}}^{\mathrm{post}}_l, \tilde{\mathcal{H}}^{\mathrm{res}}_l: \text{float32} = \vec{\mathbf{x}}_l \varphi_l \tag{14}$$