arXiv:2512.24880v1 [cs.CL] 31 Dec 2025

mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie*†, Yixuan Wei*, Huanqi Cao*,

Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang

DeepSeek-AI

Abstract

Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

[Figure 1: three schematic panels, (a) Residual Connection, (b) Hyper-Connections (HC), and (c) Manifold-Constrained HC (mHC), each tracing $\mathbf{x}_l \to \mathbf{x}_{l+1}$ through Layer $\mathcal{F}$. Panels (b) and (c) show the Pre, Res, and Post Mappings ($\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{res}}_l$, $\mathcal{H}^{\text{post}}_l$), with (c) applying the manifold projections $\mathcal{P}_{\mathcal{M}}(\mathcal{H})$.]

Figure 1 | Illustrations of Residual Connection Paradigms. This figure compares the structural design of (a) the standard Residual Connection, (b) Hyper-Connections (HC), and (c) our proposed Manifold-Constrained Hyper-Connections (mHC). Unlike the unconstrained HC, mHC optimizes the residual connection space by projecting the matrices onto a constrained manifold to ensure stability.

* Core contributors. † Corresponding author: xie.zhenda@

Contents

Introduction
Related Works
    Micro Design
    Macro Design
Preliminary
    Numerical Instability
    System Overhead
Method
    Manifold-Constrained Hyper-Connections
    Parameterization and Manifold Projection
    Efficient Infrastructure Design
        Kernel Fusion
        Recomputing
        Overlapping Communication in DualPipe
Experiments
    Experimental Setup
    Main Results
    Scaling Experiments
    Stability Analysis
Conclusion and Outlook
A Appendix
    Detailed Model Specifications and Hyper-parameters

Introduction

Deep neural network architectures have undergone rapid evolution since the introduction of ResNets (He et al., 2016a). As illustrated in Fig. 1(a), the structure of a single layer can be formulated as follows:

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathbf{W}_l), \tag{1}$$

where $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ denote the $C$-dimensional input and output of the $l$-th layer, respectively, and $\mathcal{F}$ represents the residual function. Although the residual function has evolved over the past decade to include various operations such as convolution, attention mechanisms, and feedforward networks, the paradigm of the residual connection has maintained its original form. Accompanying the progression of the Transformer (Vaswani et al., 2017) architecture, this paradigm has now established itself as a fundamental design element in large language models (LLMs) (Brown et al., 2020; Liu et al., 2024b; Touvron et al., 2023).

This success is primarily attributed to the concise form of the residual connection. More importantly, early research (He et al., 2016b) revealed that the identity mapping property of the residual connection maintains stability and efficiency during large-scale training. By recursively extending the residual connection across multiple layers, Eq. (1) yields:

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathbf{W}_i), \tag{2}$$

where $L$ and $l$ correspond to deeper and shallower layers, respectively. The term identity mapping refers to the component $\mathbf{x}_l$ itself, which emphasizes the property that the signal from the shallower layer maps directly to the deeper layer without any modification.
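As a sanity check on this unrolled form, a minimal NumPy sketch (ours; the toy residual function `f` stands in for attention/FFN blocks) confirms that the iterative view of Eq. (1) and the unrolled view of Eq. (2) coincide, with the input carried through untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
C, num_layers = 8, 6

# Toy residual functions F(x, W_i); stand-ins for attention/FFN blocks.
Ws = [rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(num_layers)]
f = lambda x, W: np.tanh(x @ W)

x_l = rng.normal(size=C)

# Layer by layer: x_{i+1} = x_i + F(x_i, W_i)   -- Eq. (1)
xs = [x_l]
for W in Ws:
    xs.append(xs[-1] + f(xs[-1], W))

# Unrolled: x_L = x_l + sum_i F(x_i, W_i)        -- Eq. (2)
x_L = x_l + sum(f(xs[i], Ws[i]) for i in range(num_layers))
assert np.allclose(xs[-1], x_L)  # the identity path carries x_l unchanged
```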

Recently, studies exemplified by Hyper-Connections (HC) (Zhu et al., 2024) have introduced a new dimension to the residual connection and empirically demonstrated its performance potential. The single-layer architecture of HC is illustrated in Fig. 1(b). By expanding the width of the residual stream and enhancing connection complexity, HC significantly increases topological complexity without altering the computational overhead of individual units in terms of FLOPs. Formally, single-layer propagation in HC is defined as:

$$\mathbf{x}_{l+1} = \mathcal{H}^{\text{res}}_l \mathbf{x}_l + (\mathcal{H}^{\text{post}}_l)^\top \mathcal{F}(\mathcal{H}^{\text{pre}}_l \mathbf{x}_l, \mathbf{W}_l), \tag{3}$$

where $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ denote the input and output of the $l$-th layer, respectively. Unlike the formulation in Eq. (1), the feature dimension of $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ is expanded from $C$ to $n \times C$, where $n$ is the expansion rate. The term $\mathcal{H}^{\text{res}}_l \in \mathbb{R}^{n \times n}$ represents a learnable mapping that mixes features within the residual stream. Also a learnable mapping, $\mathcal{H}^{\text{pre}}_l \in \mathbb{R}^{1 \times n}$ aggregates features from the $nC$-dim stream into a $C$-dim layer input, and conversely, $\mathcal{H}^{\text{post}}_l \in \mathbb{R}^{1 \times n}$ maps the layer output back onto the stream.
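For shape intuition, here is a minimal NumPy sketch of Eq. (3) using static coefficients only (the input-conditioned dynamic parts introduced in Eq. (5) are omitted; variable names are ours, and the fixed values mirror the fallback mappings described later under Table 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 4, 8                       # expansion rate, feature dimension

x = rng.normal(size=(n, C))       # n-stream residual state x_l
h_pre = np.full((1, n), 1.0 / n)  # H^pre_l: reads a C-dim layer input out
h_post = np.ones((1, n))          # H^post_l: writes the layer output back
h_res = np.eye(n)                 # H^res_l: mixes the n streams

f = lambda z: np.tanh(z)          # stand-in for the layer function F

layer_in = h_pre @ x                           # (1, C)
x_next = h_res @ x + h_post.T @ f(layer_in)    # (n, C) -- Eq. (3)
```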

However, as the training scale increases, HC introduces potential risks of instability. The primary concern is that the unconstrained nature of HC compromises the identity mapping property when the architecture extends across multiple layers. In architectures comprising multiple parallel streams, an ideal identity mapping serves as a conservation mechanism: it ensures that the average signal intensity across streams remains invariant during both forward and backward propagation. Recursively extending HC to multiple layers via Eq. (3) yields:

$$\mathbf{x}_L = \left( \prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i} \right) \mathbf{x}_l + \sum_{i=l}^{L-1} \left( \prod_{j=1}^{L-1-i} \mathcal{H}^{\text{res}}_{L-j} \right) (\mathcal{H}^{\text{post}}_i)^\top \mathcal{F}(\mathcal{H}^{\text{pre}}_i \mathbf{x}_i, \mathbf{W}_i), \tag{4}$$

where $L$ and $l$ represent a deeper layer and a shallower layer, respectively. In contrast to Eq. (2), the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i}$ in HC fails to preserve the global mean of the features. This discrepancy leads to unbounded signal amplification or attenuation, resulting in instability during large-scale training. A further consideration is that, while HC preserves computational efficiency in terms of FLOPs, the hardware efficiency concerning memory access costs for the widened residual stream remains unaddressed in the original design. These factors collectively restrict the practical scalability of HC and hinder its application in large-scale training.

To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), as shown in Fig. 1(c), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Specifically, mHC utilizes the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967) to entropically project $\mathcal{H}^{\text{res}}_l$ onto the Birkhoff polytope. This operation effectively constrains the residual connection matrices within the manifold constituted by doubly stochastic matrices. Since the row and column sums of these matrices equal 1, the operation $\mathcal{H}^{\text{res}}_l \mathbf{x}_l$ functions as a convex combination of the input features. This characteristic facilitates well-conditioned signal propagation where the feature mean is conserved and the signal norm is strictly regularized, effectively mitigating the risk of vanishing or exploding signals. Furthermore, due to the closure of matrix multiplication for doubly stochastic matrices, the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i}$ retains this conservation property. Consequently, mHC effectively maintains the stability of identity mappings between arbitrary depths. To ensure efficiency, we employ kernel fusion and develop mixed-precision kernels utilizing TileLang (Wang et al., 2025). Furthermore, we mitigate the memory footprint through selective recomputing and carefully overlap communication within the DualPipe schedule (Liu et al., 2024b). Extensive experiments on language model pretraining demonstrate that mHC exhibits exceptional stability and scalability while maintaining the performance advantages of HC. In-house large-scale training indicates that mHC supports training at scale and introduces only a 6.7% additional time overhead with expansion rate $n = 4$.

Related Works

Architectural advancements in deep learning can be primarily classified into micro-design and macro-design. Micro-design concerns the internal architecture of computational blocks, specifying how features are processed across spatial, temporal, and channel dimensions. In contrast, macro-design establishes the inter-block topological structure, thereby dictating how feature representations are propagated, routed, and merged across distinct layers.

Micro Design

Driven by parameter sharing and translation invariance, convolution initially dominated the processing of structured signals. While subsequent variations such as depthwise separable (Chollet, 2017) and grouped convolutions (Xie et al., 2017) optimized efficiency, the advent of Transformers (Vaswani et al., 2017) established Attention and Feed-Forward Networks (FFNs) as the fundamental building blocks of modern architecture. Attention mechanisms facilitate global information propagation, while FFNs enhance the representational capacity of individual features. To balance performance with the computational demands of LLMs, attention mechanisms have evolved towards efficient variants such as Multi-Query Attention (MQA) (Shazeer, 2019), Grouped-Query Attention (GQA) (Ainslie et al., 2023), and Multi-Head Latent Attention (MLA) (Liu et al., 2024a). Simultaneously, FFNs have been generalized into sparse computing paradigms via Mixture-of-Experts (MoE) (Fedus et al., 2022; Lepikhin et al., 2020; Shazeer et al., 2017), allowing for massive parameter scaling without proportional computational costs.

Macro Design

Macro-design governs the global topology of the network (Srivastava et al., 2015). Following ResNet (He et al., 2016a), architectures such as DenseNet (Huang et al., 2017) and FractalNet (Larsson et al., 2016) aimed to enhance performance by increasing topological complexity through dense connectivity and multi-path structures, respectively. Deep Layer Aggregation (DLA) (Yu et al., 2018) further extended this paradigm by recursively aggregating features across various depths and resolutions.

More recently, the focus of macro-design has shifted toward expanding the width of the residual stream (Chai et al., 2020; Fang et al., 2023; Heddes et al., 2025; Mak and Flanigan, 2025; Menghani et al., 2025; Pagliardini et al., 2024; Xiao et al., 2025; Xie et al., 2023; Zhu et al., 2024). Hyper-Connections (HC) (Zhu et al., 2024) introduced learnable matrices to modulate connection strengths among features at varying depths, while the Residual Matrix Transformer (RMT) (Mak and Flanigan, 2025) replaced the standard residual stream with an outer-product memory matrix to facilitate feature storage. Similarly, MUDDFormer (Xiao et al., 2025) employs multiway dynamic dense connections to optimize cross-layer information flow. Despite their potential, these approaches compromise the inherent identity mapping property of the residual connection, thereby introducing instability and hindering scalability. Furthermore, they incur significant memory access overhead due to expanded feature widths. Building upon HC, the proposed mHC restricts the residual connection space onto a specific manifold to restore the identity mapping property, while also incorporating rigorous infrastructure optimizations to ensure efficiency. This approach enhances stability and scalability while maintaining the topological benefits of expanded connections.

Preliminary

We first establish the notation used in this work. In the HC formulation, the input to the $l$-th layer, $\mathbf{x}_l \in \mathbb{R}^{1 \times C}$, is expanded by a factor of $n$ to construct a hidden matrix $\mathbf{x}_l = (\mathbf{x}_{l,0}^\top, \ldots, \mathbf{x}_{l,n-1}^\top)^\top \in \mathbb{R}^{n \times C}$, which can be viewed as an $n$-stream residual. This operation effectively broadens the width of the residual stream. To govern the read-out, write-in, and updating processes of this stream, HC introduces three learnable linear mappings: $\mathcal{H}^{\text{pre}}_l, \mathcal{H}^{\text{post}}_l \in \mathbb{R}^{1 \times n}$, and $\mathcal{H}^{\text{res}}_l \in \mathbb{R}^{n \times n}$. These mappings modify the standard residual connection shown in Eq. (1), resulting in the formulation given in Eq. (3).

In the HC formulation, the learnable mappings are composed of two parts of coefficients: the input-dependent one and the global one, referred to as dynamic mappings and static mappings, respectively. Formally, HC computes the coefficients as follows:

$$\begin{aligned}
\tilde{\mathbf{x}}_l &= \text{RMSNorm}(\mathbf{x}_l) \\
\mathcal{H}^{\text{pre}}_l &= \alpha^{\text{pre}}_l \cdot \tanh(\theta^{\text{pre}}_l \tilde{\mathbf{x}}_l^\top) + \mathbf{b}^{\text{pre}}_l \\
\mathcal{H}^{\text{post}}_l &= \alpha^{\text{post}}_l \cdot \tanh(\theta^{\text{post}}_l \tilde{\mathbf{x}}_l^\top) + \mathbf{b}^{\text{post}}_l \\
\mathcal{H}^{\text{res}}_l &= \alpha^{\text{res}}_l \cdot \tanh(\theta^{\text{res}}_l \tilde{\mathbf{x}}_l^\top) + \mathbf{b}^{\text{res}}_l,
\end{aligned} \tag{5}$$

where RMSNorm(·) (Zhang and Sennrich, 2019) is applied to the last dimension, and the scalars $\alpha^{\text{pre}}_l, \alpha^{\text{post}}_l, \alpha^{\text{res}}_l \in \mathbb{R}$ are learnable gating factors initialized to small values. The dynamic mappings are derived via linear projections parameterized by $\theta^{\text{pre}}_l, \theta^{\text{post}}_l \in \mathbb{R}^{1 \times C}$ and $\theta^{\text{res}}_l \in \mathbb{R}^{n \times C}$, while the static mappings are represented by learnable biases $\mathbf{b}^{\text{pre}}_l, \mathbf{b}^{\text{post}}_l \in \mathbb{R}^{1 \times n}$ and $\mathbf{b}^{\text{res}}_l \in \mathbb{R}^{n \times n}$.
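A minimal NumPy rendering of Eq. (5) (ours; the initialization values are illustrative assumptions, with the static biases set to the $1/n$, ones, and identity fallbacks described later under Table 1):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm over the last dimension (learnable scale omitted for brevity).
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
n, C = 4, 8
x = rng.normal(size=(n, C))                 # x_l
x_tilde = rms_norm(x)                       # x~_l

# Dynamic-mapping projections and static biases (illustrative values).
theta_pre, b_pre = rng.normal(size=(1, C)), np.full((1, n), 1.0 / n)
theta_post, b_post = rng.normal(size=(1, C)), np.ones((1, n))
theta_res, b_res = rng.normal(size=(n, C)), np.eye(n)
a_pre = a_post = a_res = 1e-2               # gating factors, small init

h_pre = a_pre * np.tanh(theta_pre @ x_tilde.T) + b_pre      # (1, n)
h_post = a_post * np.tanh(theta_post @ x_tilde.T) + b_post  # (1, n)
h_res = a_res * np.tanh(theta_res @ x_tilde.T) + b_res      # (n, n)
```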

It is worth noting that the introduction of these mappings ($\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{post}}_l$, and $\mathcal{H}^{\text{res}}_l$) incurs negligible computational overhead, as the typical expansion rate $n$ (e.g., 4) is much smaller than the input dimension $C$. With this design, HC effectively decouples the information capacity of the residual stream from the layer's input dimension, which is strongly correlated with the model's computational complexity (FLOPs). Consequently, HC offers a new avenue for scaling by adjusting the residual stream width, complementing the traditional scaling dimensions of model FLOPs and training data size discussed in pre-training scaling laws (Hoffmann et al., 2022).

Although HC necessitates three mappings to manage the dimensional mismatch between the residual stream and the layer input, preliminary experiments presented in Tab. 1 indicate that the residual mapping $\mathcal{H}^{\text{res}}_l$ yields the most significant performance gain. This finding underscores the critical importance of effective information exchange within the residual stream.

Table 1 | Ablation Study of HC Components. When a specific mapping ($\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{post}}_l$, or $\mathcal{H}^{\text{res}}_l$) is disabled, we employ a fixed mapping to maintain dimensional consistency: uniform weights of $1/n$ for $\mathcal{H}^{\text{pre}}_l$, uniform weights of ones for $\mathcal{H}^{\text{post}}_l$, and the identity matrix for $\mathcal{H}^{\text{res}}_l$.

| $\mathcal{H}^{\text{res}}_l$ | $\mathcal{H}^{\text{pre}}_l$ | $\mathcal{H}^{\text{post}}_l$ | Absolute Loss Gap |
|:---:|:---:|:---:|:---:|
|   |   |   | 0.0 |
| ✓ |   |   | -0.022 |
| ✓ | ✓ |   | -0.025 |
| ✓ | ✓ | ✓ | -0.027 |

Numerical Instability

While the residual mapping $\mathcal{H}^{\text{res}}_l$ is instrumental for performance, its sequential application poses a significant risk to numerical stability. As detailed in Eq. (4), when HC is extended across multiple layers, the effective signal propagation from layer $l$ to $L$ is governed by the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i}$. Since the learnable mapping $\mathcal{H}^{\text{res}}_l$ is unconstrained, this composite mapping inevitably deviates from the identity mapping. Consequently, the signal magnitude is prone to explosion or vanishing during both the forward pass and backpropagation. This phenomenon undermines the fundamental premise of residual learning, which relies on unimpeded signal flow, thereby destabilizing the training process in deeper or larger-scale models.

Empirical evidence supports this analysis. We observe unstable loss behavior in large-scale experiments, as illustrated in Fig. 2. Taking mHC as the baseline, HC exhibits an unexpected loss surge around the 12k step, which is highly correlated with the instability in the gradient norm. Furthermore, the analysis of $\mathcal{H}^{\text{res}}_l$ validates the mechanism of this instability. To quantify how the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i}$ amplifies signals along the residual stream, we utilize two metrics. The first, based on the maximum absolute value of the row sums of the composite mapping, captures the worst-case expansion in the forward pass. The second, based on the maximum absolute column sum, corresponds to the backward pass. We refer to these metrics as the Amax Gain Magnitude of the composite mapping. As shown in Fig. 3(b), the Amax Gain Magnitude reaches extreme values with peaks of 3000, a stark divergence from 1 that confirms the presence of exploding residual streams.
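The metric itself is a one-liner; the sketch below (ours, with random unconstrained matrices standing in for learned $\mathcal{H}^{\text{res}}$ coefficients) computes the Amax Gain Magnitude of a composite mapping and illustrates how it drifts away from 1 with depth:

```python
import numpy as np

def amax_gain(m):
    # Forward gain: max |row sum|; backward gain: max |column sum|.
    return np.abs(m.sum(axis=1)).max(), np.abs(m.sum(axis=0)).max()

rng = np.random.default_rng(0)
n, depth = 4, 60

# Random unconstrained per-layer mappings standing in for learned H^res.
maps = [np.eye(n) + 0.2 * rng.normal(size=(n, n)) for _ in range(depth)]

composite = np.eye(n)
for h in maps:
    composite = h @ composite   # prod_i H^res_{L-i}

fwd, bwd = amax_gain(composite)
print(f"forward gain: {fwd:.2f}, backward gain: {bwd:.2f}")
# Unconstrained products tend to drift exponentially away from 1, whereas
# doubly stochastic mappings (see Method) keep both gains pinned at 1.
```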

[Figure 2: two line plots over 50k training steps comparing HC and mHC, (a) "Absolute Training Loss Gap vs. Training Steps" and (b) "Gradient Norm vs. Training Steps".]

Figure 2 | Training Instability of Hyper-Connections (HC). This figure illustrates (a) the absolute loss gap of HC relative to mHC, and (b) the comparison of gradient norms. All results are based on 27B models.

[Figure 3: two log-scale plots of Amax Gain Magnitude against layer index $l$, (a) "Single-Layer Mapping" and (b) "Composite Mapping", each showing the forward signal gain and backward gradient gain.]

Figure 3 | Propagation Instability of Hyper-Connections (HC). This figure illustrates the propagation dynamics of (a) the single-layer mapping $\mathcal{H}^{\text{res}}_l$ and (b) the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i}$ within the 27B model. The layer index $l$ (x-axis) unrolls each standard Transformer block into two independent layers (Attention and FFN). The Amax Gain Magnitude (y-axis) is calculated as the maximum absolute row sum (for the forward signal) and column sum (for the backward gradient), averaged over all tokens in a selected sequence.

System Overhead

While the computational complexity of HC remains manageable due to the linearity of the additional mappings, the system-level overhead presents a non-negligible challenge. Specifically, memory access (I/O) costs often constitute one of the primary bottlenecks in modern model architectures, which is widely referred to as the "memory wall" (Dao et al., 2022). This bottleneck is frequently overlooked in architectural design, yet it decisively impacts runtime efficiency.

Focusing on the widely adopted pre-norm Transformer (Vaswani et al., 2017) architecture, we analyze the I/O patterns inherent to HC. Tab. 2 summarizes the per-token memory access overhead in a single residual layer introduced by the $n$-stream residual design. The analysis reveals that HC increases the memory access cost by a factor approximately proportional to $n$. This excessive I/O demand significantly degrades training throughput without the mitigation of fused kernels. Besides, since $\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{post}}_l$, and $\mathcal{H}^{\text{res}}_l$ involve learnable parameters, their intermediate activations are required for backpropagation. This results in a substantial increase in the GPU memory footprint, often necessitating gradient checkpointing to maintain feasible memory usage. Furthermore, HC requires $n$-fold more communication cost in pipeline parallelism (Qi et al., 2024), leading to larger bubbles and decreasing training throughput.

Table 2 | Comparison of Memory Access Costs Per Token. This analysis accounts for the overhead introduced by residual stream maintenance in the forward pass, excluding the internal I/O of the layer function $\mathcal{F}$.

| Method | Operation | Read (Elements) | Write (Elements) |
|---|---|---|---|
| Residual Connection | Residual Merge | 2C | C |
| | Total I/O | 2C | C |
| Hyper-Connections | Calculate $\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{post}}_l$, $\mathcal{H}^{\text{res}}_l$ | nC | n² + 2n |
| | $\mathcal{H}^{\text{pre}}_l$ Mapping | nC + n | C |
| | $\mathcal{H}^{\text{post}}_l$ Mapping | C + n | nC |
| | $\mathcal{H}^{\text{res}}_l$ Mapping | nC + n² | nC |
| | Residual Merge | 2nC | nC |
| | Total I/O | (5n+1)C + n² + 2n | (3n+1)C + n² + 2n |

Method

Manifold-Constrained Hyper-Connections

Drawing inspiration from the identity mapping principle (He et al., 2016b), the core premise of mHC is to constrain the residual mapping $\mathcal{H}^{\text{res}}_l$ onto a specific manifold. While the original identity mapping ensures stability by enforcing $\mathcal{H}^{\text{res}}_l = \mathbf{I}$, it fundamentally precludes information exchange within the residual stream, which is critical for maximizing the potential of multi-stream architectures. Therefore, we propose projecting the residual mapping onto a manifold that simultaneously maintains the stability of signal propagation across layers and facilitates mutual interaction among residual streams to preserve the model's expressivity. To this end, we restrict $\mathcal{H}^{\text{res}}_l$ to be a doubly stochastic matrix, which has non-negative entries where both the rows and columns sum to 1. Formally, let $\mathcal{M}^{\text{res}}$ denote the manifold of doubly stochastic matrices (also known as the Birkhoff polytope). We constrain $\mathcal{H}^{\text{res}}_l$ to $\mathcal{P}_{\mathcal{M}^{\text{res}}}(\mathcal{H}^{\text{res}}_l)$, defined as:

$$\mathcal{P}_{\mathcal{M}^{\text{res}}}(\mathcal{H}^{\text{res}}_l) \coloneqq \left\{ \mathcal{H}^{\text{res}}_l \in \mathbb{R}^{n \times n} \,\middle|\, \mathcal{H}^{\text{res}}_l \mathbf{1}_n = \mathbf{1}_n,\ \mathbf{1}_n^\top \mathcal{H}^{\text{res}}_l = \mathbf{1}_n^\top,\ \mathcal{H}^{\text{res}}_l \geq 0 \right\}, \tag{6}$$

where $\mathbf{1}_n$ represents the $n$-dimensional vector of all ones.

It is worth noting that when $n = 1$, the doubly stochastic condition degenerates to the scalar 1, thereby recovering the original identity mapping. The choice of double stochasticity confers several rigorous theoretical properties beneficial for large-scale model training:

- Norm Preservation: The spectral norm of a doubly stochastic matrix is bounded by 1 (i.e., $\|\mathcal{H}^{\text{res}}_l\|_2 \leq 1$). This implies that the learnable mapping is non-expansive, effectively mitigating the gradient explosion problem.
- Compositional Closure: The set of doubly stochastic matrices is closed under matrix multiplication. This ensures that the composite residual mapping across multiple layers, $\prod_{i=1}^{L-l} \mathcal{H}^{\text{res}}_{L-i}$, remains doubly stochastic, thereby preserving stability throughout the entire depth of the model. (A numerical check follows this list.)
- Geometric Interpretation via the Birkhoff Polytope: The set $\mathcal{M}^{\text{res}}$ forms the Birkhoff polytope, which is the convex hull of the set of permutation matrices. This provides a clear geometric interpretation: the residual mapping acts as a convex combination of permutations. Mathematically, the repeated application of such matrices tends to increase the mixing of information across streams monotonically, effectively functioning as a robust feature fusion mechanism.

Additionally, we impose non-negativity constraints on the input mappings $\mathcal{H}^{\text{pre}}_l$ and output mappings $\mathcal{H}^{\text{post}}_l$. This constraint prevents signal cancellation arising from the composition of positive and negative coefficients, and can also be considered a special manifold projection.
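The norm and closure properties are easy to verify numerically; a small sketch (ours) using a Sinkhorn-style projection of random positive matrices:

```python
import numpy as np

def project_doubly_stochastic(m, iters=20):
    # Alternately normalize columns and rows of a positive matrix so that
    # both sum to 1 (Sinkhorn-style projection toward the Birkhoff polytope).
    for _ in range(iters):
        m = m / m.sum(axis=0, keepdims=True)
        m = m / m.sum(axis=1, keepdims=True)
    return m

rng = np.random.default_rng(0)
n = 4
A = project_doubly_stochastic(rng.random((n, n)))
B = project_doubly_stochastic(rng.random((n, n)))

# Norm preservation: spectral norm of a doubly stochastic matrix is <= 1.
assert np.linalg.norm(A, 2) <= 1 + 1e-3

# Compositional closure: the product stays (numerically) doubly stochastic.
P = A @ B
assert np.allclose(P.sum(axis=0), 1, atol=1e-3)
assert np.allclose(P.sum(axis=1), 1, atol=1e-3)
```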

Parameterization and Manifold Projection

In this section, we detail the calculation process of $\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{post}}_l$, and $\mathcal{H}^{\text{res}}_l$ in mHC. Given the input hidden matrix $\mathbf{x}_l \in \mathbb{R}^{n \times C}$ at the $l$-th layer, we first flatten it into a vector $\vec{\mathbf{x}}_l = \text{vec}(\mathbf{x}_l) \in \mathbb{R}^{1 \times nC}$ to preserve full context information. Then, we follow the original HC formulation to get the dynamic mappings and the static mappings as follows:

$$\begin{aligned}
\mathbf{x}'_l &= \text{RMSNorm}(\vec{\mathbf{x}}_l) \\
\tilde{\mathcal{H}}^{\text{pre}}_l &= \alpha^{\text{pre}}_l \cdot (\mathbf{x}'_l \varphi^{\text{pre}}_l) + \mathbf{b}^{\text{pre}}_l \\
\tilde{\mathcal{H}}^{\text{post}}_l &= \alpha^{\text{post}}_l \cdot (\mathbf{x}'_l \varphi^{\text{post}}_l) + \mathbf{b}^{\text{post}}_l \\
\tilde{\mathcal{H}}^{\text{res}}_l &= \alpha^{\text{res}}_l \cdot \text{mat}(\mathbf{x}'_l \varphi^{\text{res}}_l) + \mathbf{b}^{\text{res}}_l,
\end{aligned} \tag{7}$$

where $\varphi^{\text{pre}}_l, \varphi^{\text{post}}_l \in \mathbb{R}^{nC \times n}$ and $\varphi^{\text{res}}_l \in \mathbb{R}^{nC \times n^2}$ are linear projections for the dynamic mappings, and mat(·) is a reshape function from $\mathbb{R}^{1 \times n^2}$ to $\mathbb{R}^{n \times n}$.

Then, the final constrained mappings are obtained via:

$$\begin{aligned}
\mathcal{H}^{\text{pre}}_l &= \sigma(\tilde{\mathcal{H}}^{\text{pre}}_l) \\
\mathcal{H}^{\text{post}}_l &= 2\sigma(\tilde{\mathcal{H}}^{\text{post}}_l) \\
\mathcal{H}^{\text{res}}_l &= \text{Sinkhorn-Knopp}(\tilde{\mathcal{H}}^{\text{res}}_l),
\end{aligned} \tag{8}$$

where $\sigma(\cdot)$ denotes the Sigmoid function. The Sinkhorn-Knopp operator first makes all elements positive via an exponent operator and then conducts an iterative normalization process that alternately rescales rows and columns to sum to 1. Specifically, given a positive matrix $\mathbf{M}^{(0)} = \exp(\tilde{\mathcal{H}}^{\text{res}}_l)$ as the start point, the normalization iteration proceeds as:

$$\mathbf{M}^{(t)} = \mathcal{T}_r\!\left(\mathcal{T}_c\!\left(\mathbf{M}^{(t-1)}\right)\right), \tag{9}$$

where $\mathcal{T}_r$ and $\mathcal{T}_c$ denote row and column normalization, respectively. This process converges to a doubly stochastic matrix $\mathcal{H}^{\text{res}}_l = \mathbf{M}^{(t_{\max})}$ as $t_{\max} \to \infty$. We choose $t_{\max} = 20$ as a practical value in our experiments.
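Putting Eqs. (7) to (9) together, a compact NumPy sketch (ours; parameter values are random placeholders, and the $t_{\max} = 20$ default follows the text):

```python
import numpy as np

def sinkhorn_knopp(h_tilde, t_max=20):
    # Eq. (9): exponentiate to get M^(0) > 0, then alternately apply
    # column normalization T_c and row normalization T_r.
    m = np.exp(h_tilde)
    for _ in range(t_max):
        m = m / m.sum(axis=0, keepdims=True)   # T_c
        m = m / m.sum(axis=1, keepdims=True)   # T_r
    return m

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, C = 4, 8
x = rng.normal(size=(n, C))                    # hidden matrix x_l
x_vec = x.reshape(1, n * C)                    # vec(x_l)
x_prime = x_vec / np.sqrt((x_vec ** 2).mean() + 1e-6)  # RMSNorm

# Projections phi^pre, phi^post, phi^res and biases (random placeholders).
phi_pre, b_pre = rng.normal(size=(n * C, n)) * 0.01, np.zeros((1, n))
phi_post, b_post = rng.normal(size=(n * C, n)) * 0.01, np.zeros((1, n))
phi_res, b_res = rng.normal(size=(n * C, n * n)) * 0.01, np.zeros((n, n))
a = 1.0  # gating factors, all set to 1 for illustration

h_pre = sigmoid(a * (x_prime @ phi_pre) + b_pre)          # Eq. (8): sigma
h_post = 2 * sigmoid(a * (x_prime @ phi_post) + b_post)   # Eq. (8): 2*sigma
h_res = sinkhorn_knopp(a * (x_prime @ phi_res).reshape(n, n) + b_res)

assert np.allclose(h_res.sum(axis=1), 1)              # rows sum to 1 exactly
assert np.allclose(h_res.sum(axis=0), 1, atol=1e-4)   # columns approximately
```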

Efficient Infrastructure Design

In this section, we detail the infrastructure design tailored for mHC. Through rigorous optimization, we implement mHC (with $n = 4$) in large-scale models with a marginal training overhead of only 6.7%.

Kernel Fusion

Observing that RMSNorm in mHC imposes significant latency when operating on the high-dimensional hidden state $\vec{\mathbf{x}}_l \in \mathbb{R}^{1 \times nC}$, we reorder the dividing-by-norm operation to follow the matrix multiplication. This optimization maintains mathematical equivalence while improving efficiency (a toy check follows below). Furthermore, we employ mixed-precision strategies to maximize numerical accuracy without compromising speed, and fuse multiple operations with shared memory access into unified compute kernels to reduce memory bandwidth bottlenecks. Based on the inputs and parameters detailed in Eq. (10) to (13), we implement three specialized mHC kernels to compute $\mathcal{H}^{\text{pre}}_l$, $\mathcal{H}^{\text{post}}_l$, and $\mathcal{H}^{\text{res}}_l$. In these kernels, the biases and linear projections are consolidated into $\mathbf{b}_l$ and $\varphi_l$, and the RMSNorm weight is also absorbed into $\varphi_l$.
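The reordering is valid because RMSNorm rescales $\vec{\mathbf{x}}_l$ by a single scalar, which commutes with the subsequent matrix multiplication; a toy check (ours, ignoring the learnable RMSNorm weight, which the kernels absorb into $\varphi_l$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 4, 8
x = rng.normal(size=(1, n * C))                  # high-dimensional hidden state
phi = rng.normal(size=(n * C, n * n + 2 * n))    # consolidated projection

rms = np.sqrt((x ** 2).mean() + 1e-6)

out_norm_first = (x / rms) @ phi   # RMSNorm, then project
out_norm_last = (x @ phi) / rms    # project, then divide by the scalar norm

assert np.allclose(out_norm_first, out_norm_last)  # mathematically equivalent
```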

- Eq. (14) to (15): We develop a unified kernel that fuses the two scans over $\vec{\mathbf{x}}_l$, leveraging matrix multiplication units to maximize memory bandwidth utilization. The backward pass, comprising two matrix multiplications, is similarly consolidated into a single kernel, eliminating redundant reloading of $\vec{\mathbf{x}}_l$. Both kernels feature a finely tuned pipeline (load, cast, compute, store) to efficiently handle mixed-precision processing.
- Eq. (16) to (18): These lightweight operations on small coefficients are opportunistically fused into a single kernel, significantly reducing kernel launch overhead.
- Eq. (19): We implement the Sinkhorn-Knopp iteration within a single kernel. For the backward pass, we derive a custom backward kernel that recomputes the intermediate results on-chip and traverses the entire iteration.

$$\begin{aligned}
\varphi_l &: \text{tfloat32} && [nC,\ n^2 + 2n] && (10) \\
\vec{\mathbf{x}}_l &: \text{bfloat16} && [1,\ nC] && (11) \\
\alpha^{\text{pre}}_l, \alpha^{\text{post}}_l, \alpha^{\text{res}}_l &: \text{float32} && \text{Scalars} && (12) \\
\mathbf{b}_l &: \text{float32} && [1,\ n^2 + 2n] && (13) \\
\tilde{\mathcal{H}}^{\text{pre}}_l, \tilde{\mathcal{H}}^{\text{post}}_l, \tilde{\mathcal{H}}^{\text{res}}_l &: \text{float32} && = \vec{\mathbf{x}}_l\, \varphi_l && (14)
\end{aligned}$$
