43-Very High-Speed Computing Systems.pdf_第1页
43-Very High-Speed Computing Systems.pdf_第2页
43-Very High-Speed Computing Systems.pdf_第3页
43-Very High-Speed Computing Systems.pdf_第4页
43-Very High-Speed Computing Systems.pdf_第5页
已阅读5页,还剩4页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

PROCEEDINGSOFTHEm,VOL.54,NO.12,DECEMBER,1966190123.H.A.Ernst,“TCS,anexperimentalmultiprogrammingsystemfortheIBM7090,”IBMCorp.,YorktownHeights,N.Y.,ResearchRept.RJ248,June1963,41pages.theIBM7090.”IBMResearchRept.RJ248,June1963,41pages.24.M.Lehman.R.Eshed,andZ.Netter,“SABRAC,atimesharinglow-costcomputer.”Commun.ACM,vol.6,pp.427-429,August1963.25.R.V.SmithandD.N.Senzig,“Computerorganizationforarrayprocessing,”IBMCorp.,YorktownHeights,N.Y.,ResearchRept.RC1330,December1964.26.A.S.Critchlow,“Generalizedmultiprocessingandmultiprogram-mingsystems,”1963AFIPSProc.FJCC,pp.107-126.27.M.E.Conway,“Amultiprocessorsystemdesign,”ibid.,pp.139-146.28.R.R.SeeberandA.B.Lindquist.“Associativelogicforhighlyparallelsystems,”ihid.,pp.489493.29.R.M.Meade,“604machinedescription,”IBMinternalmemo.,December1963.38pages.30.M.Lehman,R.Eshed,andZ.Netter,“SABRA-anewgenerationserialcomputer,”IEEETrans.onElectronicComputers,vol.EC-12.pp.618628,December1963.31.M.W.Allen,T.Pearcey.J.P.Penny,G.A.Rose,andJ.G.Sander-son.“CIRRUS,aneconomicalmultiprogramcomputerwithmicro-programcontrol,”ibid.,pp.663-671.32.W.F.MillerandR.A.Aschenbrenner,“TheGUSmulticomputersystem,”ibid.,pp.671-676.33.G.Estrin,B.Russenl,R.Turn,andJ.Bibb,“Parallelprocessinginarestructurablecomputersystem,”ibid.,pp.747-755.34.G.GregoryandR.McReynolds,“TheSolomoncomputer,ibid.,35.H.S.Bright,“APhilcomulti-processingsystem,”1964Proc.FJCC,pp.97-141.36.R.G.EwingandP.M.Davies,“Anassociativeprocessor,”1964AFIPSProc.FJCC,pp.147-158.37.H.A.Kinslow,“Thetime-sharingmonitorsystem,ibid.,pp.443-454.38.J.Nievergelt.“Parallelmethodsforintegratingordinarydifferentialequations,”Commun.ACM,vol.7,pp.731-733,December1964.39.W.H.Desmonde,RealTimeDuraProcessingSystems.EnglewoodCliffs,N.J.:Prentice-Hall,1964.40.M.Lehman.“Serialmodeoperationandhigh-speedparallelpro-cessing,”InformationProcessing,1965Proc.IFIP,pt.2.NewYork:Spartan,1966,pp.631-633.41.R.V.SmithandD.N.Senzig,“Ccessing.”1965Proc.FJCC,pp.117-128.42.E.W.Dijkstra,“Solutionofaprobleminconcurrentprogrammingcontrol.”Commun.ACM,vol.8,p.569.September1965.43.J.B.Dennis,“Segmentationandthedesignofmultiprogrammedcomputersystems,”J.ACM,vol12,pp.589-602,October1965.44.F.J.CorbatoandV.A.Vyssotsky,“Introductionandoverviewofthemulticssystem,”1965Proc.FJCC,pp.185-196.45.E.L.Glaser.J.Couleur,andG.Oliver,“Systemdesignofacom-puterfortimesharingapplications,”ibid.pp.197-202.46.V.A.Vyssotsky,F.J.Corbato,andR.M.Graham,“Structureofthemulticssupervisor,”ibid.pp.203-212.47.R.C.DaleyandP.G.Neumann,“Ageneral-purposefilesystemforsecondarystorage,”ibid.,pp.213-229.48.J.F.Ossanna,L.E.Mikus,andS.D.Dursten,“Communicationandinput-outputswitchinginamultiplecomputingsystem,”ihid.49.J.W.Forgie,“Atimeandmemorysharingexecutiveprogramforquick-responseon-lineapplications,”ibid.,pp.599610.50.J.D.McCullogh,K.H.Speiennan,andF.W.Zurcher,“Designforamultipleusermultiprocessingsystem,”ibid.,pp.611-618.51.W.T.Comfort,“Acomputingsystemdesignforuserdevice,”ibid.,pp.619-628.52.J.P.Anderson,“Programstructuresforparallelprocessing,”Commun.ACM,vol.8,pp.786788,December1965.53.B.W.Arden,B.A.Galler,T.C.D.OBrien,andF.H.Westervelt,“Programandaddressingstructureinatime-sharingenvironment,”J.ACM,vol.13,pp.1-16.January1966.54.J.H.Katz,“Simulationofamultiprocessorcomputersystem,”SR&DRept.LA-009,February1966,tobepublishedin1966Proc.SJCC.55.G.S.ShedlerandM.M.Lehman,“Parallelcomputationandthesolutionofpolynomialequations,”IBM,YorktownHeights,N.Y.,ResearchRept.RC1550,February1966.56.H.Hellerman,“Parallelprocessingofalgebraicexpressions,”IEEETrans.onElectronicComputers,vol.EC-15,pp.82-91,February1966.57.J.B.DennisandE.C.VanHorn,“Programmingsemanticsformulti-programmedcomputation,Commun.ACM,vol.9,pp.143-155,March1966.58.N.Wirth,“Anoteonprogramstructuresforparallelprogram-ming,”Commun.ACM,vol.9,pp.32G321,May1966.59.D.E.Knuth,“Additionalcommentsonproblemsinconcurrentpro-grammingcontrol,”ihid.,pp.321-322.pp.231-241.VeryHigh-speedComputingSystemsMICHAELJ.FLYNN;MEMBER,IEEEAbstract-Veryhigh-speedcomputersmaybeclnssifiedasfollows:1)SingleJktmctionStrdingleDataStream(SISD)2)SileImbnctioaStream-MultipleDataStream(SIMD)3)MultiplehstmcthStrePntSingleDataStream(MSD)4)MnltipleInstroctiooStream-MultipleDataStream(”D).“Stream,”asnsedhere,referstothesequenceofdataorirstructiomasseenbythemachinedaringtbeexecutionofaprogram.mecoastitaeatsofasystem:storage,exeation,andhtrudonbandhg(braoching)aredkcussdwitbregardtorecentdevelopmen6and/orsystemsLimitatim.TheCOlsMnentsaredkcuwdmtermofcoocarrentSEDManuscriptreceivedJune30,1966;revisedAugust16,1966.ThisworkwasperformedundertheauspicesoftheU.S.AtomicEnergyCommission.TheauthoriswithNorthwesternUniversity,Evanston,Ill.,andArgonneNationalLaboratory,Argonne,Ill.systems(CDC6600seriesand,ioparticnlar,IBMModd90series),sincemnltiplestreamorganizationsusuallydonotquireanymoreelaboratecompooents.Representativeorganizationsareselectedfromeachclassandthearrangementofthecomtitwntsisshown.INTRODUCTIONMANYSIGNIFICANTscientificproblemsrequiretheuseofprodigiousamountsofcomputingtime.Inordertohandletheseproblemsadequately,thelarge-scalescientificcomputerhasbeendeveloped.Thiscomputeraddressesitselftoaclassofproblemscharacter-izedbyhavingahighratioofcomputingrequirementtoinput/outputrequirements(apartiallydefactosituation1902PROCEEDINGSOFTHEIEEEDECEMBERcausedbytheunavailabilityofmatchinginput/outputequipment).Thecomplexityoftheseprocessors,coupledwiththeadvancementofthestateofthecomputingarttheyrepresent,hasfocusedattentiononscientificcom-puters.Insightthusgainedisfrequentlyapredictorofcom-puterdevelopmentsonamoreuniversalbasis.Thispaperisanattempttoexplorelargescientificcomputingequip-ment,reviewingpossibleorganizationsstartingwiththe“concurrent”organizationswhicharepresentlyinopera-tionandthenexaminingtheothertheoreticalorganiza-tionalpossibilities.ORGANIZATIONThecomputingprocess,initsessentialform,istheper-formanceofasequenceofinstructionsonasetofdata.Eachinstructionperformsacombinatorialmanipulation(although,foreconomy,.subsequencingisalsoinvolved)ononeortwoelementsofthedataset.Iftheelementwereasinglebitandonlyonemchbitcouldbemanipulatedatanyunitoftime,wewouldhaveavariationoftheTuringma-chine-thestrictlyserialsequmtialmachine.Thenaturalextensionofthisistointroduceadatasetwhoseelementsmorecloselycorrespondtoa“natural”dataquantum(character,integer,floatingpointnumber,etc.).Sincethesizeofdatumhasincreased,sotoohasthenumberofcombinatorialmanipulationsthatcanbeper-formed(manipulationsontwonbitargumentshave22”possibleoutcomes).Ofcourse,attentionisrestrictedtothoseoperationswhichhavearithmeticorlogical.sig-nificance.Aprogramconsistsofanorderedsetofinstructions.Theprogramhasconsiderablyfewerwritten(orstored)in-structionsthanthenumberofmachineinstructionstobeperformed.Thedifferenceisintherecursionsor“loops”whichareinherentintheprogram.Itishighlyadvantageousifthealgorithmbeingimplementedishighlyrecursive.Thebasicmechanismforsettinguptheloopsistheconditionalbranchinstructions.Forconvenienceweadopttwoworkingdefinitions:Zn-structionStreamisthesequenceofinstructionsasperformedbythemachine;DataStreamisthesequenceofdatacalledforbytheinstructionstream(includinginputandpartialortemporaryresults).Thesetwoconceptsarequiteusefulincategorizingcomputerorganizationsinanattempttoavoidtheubiquitousandambiguousterm“parallelism.”Orga-nizationswillbecharacterizedbythemultiplicityofthehardwareprovidedtoservicetheInstruction.andDataStreams.Themutiplicityistakenasthemaximumpossiblenumberofsimultaneousoperations(instructions)oroper-ands(data)beinginthesamephaseofexecutionatthemostconstrainedcomponentof.theorganization.Severalquestionsareimmediatelyevident:whatisaninstruction;whatisanoperand;howisthe“constrainingcomponentfound?Theseproblemscanbeansweredbetterbyestablishmentofareference.IftheIBM704.werecomparedtotheTuringmachine,the704wouldappearhighlyparallel.Ontheotherhand,ifadefinitionweremadeintermsofthe“natural”dataunitcalledforbyaproblem,thesituationwouldbeequallyuntenable,sinceinmanyproblemsonewouldconsideralargematrixofdataaunit.Thuswearbitrarilyselectareferenceorganization:theIBM704-70927090.Thisorganizationisthenregardedastheprototypeoftheclassofmachineswhichwelabel:1)SingleInstructionStream-SingleDataStream(SISD).Threeadditionalorganizationalclassesareevident.2)SingleInstructionStream-MultipleDataStream3)MultipleInstructionStream-SingleDataStream4)MultipleInstructionStream-MultipleDataStream(SIMD)(MISD)(MIMD).Beforecontinuing,wedefinetwoadditionalusefulnotions.Bandwidthisanexpressionoftime-rateofoccurrence.Inparticular,computationalorexecutionbandwidthisthenumberofinstructionsprocessedpersecondandstoragebandwidthistheretrievalrateofoperandandoperationmemorywords(words/second).Latencyorlatentperiodisthetotaltimeassociatedwiththeprocessing(fromexcitationtoresponse)ofaparticulardataunitataphaseinthecomputingprocess.Thusfar,categorizationhasdependedonthemultiplicityofsimultaneouseventsatthesystemscomponentwhichimposesthemostconstraints.Theratioofthenumberofsimultaneousinstructionsbeingprocessedtothiscon-strainedmultiplicityiscalledtheconfluence(orconcurrence)ofthesystem.ConfluenceisillustratedinFig.1foranSISDorganiza-tion.Itseffectistoincreasethecomputationalbandwidth(instructionsprocessed/second)bymaximizingtheutilityGENERATEAOORESSOFINSTRUCTION-FETCHINSTRUCTIONDECOOENSTRUCTIONGENERATEAOORESSOFOPERANO111FETCHOPERfDEXECUTEINSTRUCTIONINST.#2INST.#IL.4-INSTRUCTION#ISTARTSINSTRUCTION#2STARTSINSTRUCTION#3STARTSFig.1.Concurrencyandinstructionprocessing.1966FLYNN:VERYHIGH-SPEEDCOMPUTING1903oftheconstrainingcomponent(or“bottleneck”).Thepro-cessingofthefirstinstructionproceedsinthephasesshown.Inordertoincreasethecomputationalspeed,webeginprocessinginstruction2assoonasinstruction1completesitsfirstphase.Clearly,itisdesirabletominimizethetimeineachphase,butthereisnoadvantageinminimizingbelowthetimerequiredbyaparticularphaseormechanism.Suppose,foragivenorganization,thattheinstructionde-coderisanabsoluteserialmechanismwithresolutiontime,At.Iftheaverageinstructionpreparationandexecutiontimeist,thenthecomputationalbandwidthmaybeimprovedbyt,/At.Thusifonlyoneinstructioncanbehandledatatimeatthebottleneck,thenthemaximumachevableper-formanceisl/Ar.(1beingthemultiplicityoftheconstraint.)Inordertoprocessagivennumberofinstructionsinaparticularunitoftime,acertainbandwidthofmemorymustbeachievedtoinsureanamplesupplyofoperandsandoperationsintheformofinstructions.Similarly,thedatamustbeoperatedon(executed)atarateconsistentwiththedesiredcomputationalrate.CONSTITUENTSOFTHESYSTEMWewilltreatstorage,execution,andinstructionhandling(branching)asthemajorconstituentsofasystem.Sincemultiplestreamorganizationsusuallyrepresentmultipleattachmentsofoneormoreoftheaboveconstituents,welosenogeneralityindiscussingtheconstituentsofasys-temastheyhaveevolvedinSISDorganizations.Indeed,confluentSISDorganizations-bytheirnature-mustallowarbitraryinteractionbetweenelementsofthedatastreamorinstructionstream.Multiplestreamorganiza-tionmaylimitthisinteraction,thussimplifyingsomeoftheconsiderationsinanarea.Therefore,fornow,weshallbemainlyconcernedwithtechniquestoextendcomputationalperformanceinacontextofaconfluentSISDsystem.Wewillfurtherassumethattheserialmechanismwhichcon-strainstheorganizationisthedecodingofinstructions.Thus,ontheaverage,theprocessingofoneinstructionperdecodecyclewillbeanupperlimitonperformance.TherelationshipbetweentheconstituentsisshowninFig.2fortheSISDorganization.I/I71STORAGEjn-EXECUTIONBANDWIDTHINSTRUCTIONSTREAYINSTRUCTIONHANDLINGUNITOPERAUOSTREAYn.0(IFig.2.SISDOrganization.A.StorageTheinstructionanddatastreamsareassumedsequential.Thus,accessingtostoragewillalsobesequential.Iftherewereavailablestoragewhoseaccessmechanismcouldbeoperatedinonedecodecycleandthisaccessingcouldberepeatedeverycycle,thesystemwouldberelativelysimple.Theonlyinterferencewoulddevelopwhenoperandshadtobefetchedinapatternconflictingwiththeinstructions.Unfortunately,accessinginmoststoragemediaisconsider-ablyslowerthandecoding.Thismakestheuseofinter-leavingtechniquesnecessarytoachievetherequiredmemorybandwidth(seeFig.3).Intheinterleavingscheme,nmemoriesareused.Wordsaredistributedineachofthememoriessequentially(modulon).Inmemoryiisstoredwordi,n+i,and2n+i,etc.However,theaccessingmecha-nismdoesnotalterthelatencysituationforthesystem,andthsmustbeincludedinthedesign.Memorybandwidthmustbesufficienttohandletheaccessingofinstructionsandoperands,thestorageofresultsandinput-outputtraffic9.Theamountofinput-outputbandwidthisproblem-depen-dent.Largememoryrequirementswillnecessitatetransferofblocksofdatatoandfrommemory.Theresultingtrafficwillbeinverselyproportionaltotheoverallsizeofthememory.However,anincreaseinthesizeofthememorytominimizethisinterferencealsoactstoincreasethebulkofthememory,andusuallytheinterconnectiondistance.Theresult,then,maybeanincreaseinthelatencycausedbythesecommunicationsproblems.Figure3wasgeneratedaftertheworkofFlores9andassumescompletelyrandomaddressrequests.The“waitingtime”istheaverageadditionalamountoftime(overtheaccesstime)requiredtoretrieveanitemduetoconflictingACCESSREGENERATIONLATENCY-tatrII.1.InUNITSBUSYn.at(MEMORYCYCLE)ASSUMEDEMANDRATEOFnWORDSPERMEMORYCYCLEWAITINGTIMEASAFRACTIONOF1.0MEMORYCYCLEIRANDOMADDRESSREQUESTSI.oI.52:oNUMBEROFMEMORYUNITSDEMANDRATEFig.3.Interleavingmemoryunits.1904PROCEEDINGSOFTHEIEEEDECEMBERrequests.Sincememoryrequestsareusuallysequentialforinstructionsandatleastregularfordata,theresultisoverlypessimisticforallbutheavilybranch-dependentproblems.Typicalprogramswillexperienceabouthalfthewaitingtimeshown.Thislatencyinthesystemgreatlyincreasesthecomplexityofthecontrolmechanismforthestoragesystem.Thestorageunitmustnowincludequeuingmechanismstoorganize,fetch,andstorerequestswhichareinprocessorcouldnotbehonoredduetoconflicts.Thesequeuingregisters,containingoutstandingservicerequests,mustbecontinuallycomparedagainstnewrequestsforservice.Extensivecontrolinterlocksmustbemadeavailabletoserveatleastthefollowingfunctions2,3:3)4).directthefetchedwordtotheappropriaterequestor;preventout-of-sequencefetch-storeorstore-fetchinthesamememorylocation;eliminateduplicaterequestsofthesamememoryloca-tion(especiallywherethememorylocationcontainsmorethanoneinstructionordataunit);analyze“keys”or“boundaries”wherefetchorstorememoryprotectionareused.Inaddition,confluentcomputingsystemsfrequentlyem-ploybufferstominimizetrafficrequirementsonthememory2.Aninstructionbuffermightcontainthenextninstruc-tionsinthesequenceaswellasahistoryofminstructionstogetherwithseverallevelsofalternatepathsofbranch.Thepresenceofahistoricalpictureofinstructions(intheinstructionbuffer)allowsfortheopportunitytostoresmallloops,thusavoidingthepenaltyofreaccessinginstructions.Anoperandbufferisusedbytheexecutionunitasitsin-trinsicstoragemedium.Theinstructionunitwouldre-structureinstructionsthatcalledforoperandsfrommemoryintoinstructionswhichcallforthecontentsoftheseregisters.Theinstructionunitwould,afterdecodingtheinstruction,initiatetherequestsfortheappropriateoperandsanddirecttheirplacementintotheoperandbuffer.Thustheexecutionunitwouldactasanindependentcom-puter,whosestoragewouldbelimitedtothecontentsofthisbuffer,andwhoseinstructionswouldbeinashortenedformat.B.ExecutionWecenterourdiscussiononthefloating-pointinstruc-tionssincetheyarethemostwidelyusedinlargescientificcomputing,andrequirethemostsophisticatedexecution.Inordertoachievetheappropriatebandwidthlevels,wecanrepeatthedeploymentschemethatwasusedformemory,onlyheretheindependentunitswillbededicatedtoservic-ingoneclassofinstructions.Inadditiontodirectlyincreas-ingthebandwidthbyprovidinganumberofunits,thespecializationoftheunitaidsinexecutionefficiencyinseveralotherways.Inparticular,eachunitmightactasasmallinsularunitoflogic,henceminimizingintra-unitwirecommunicationproblems.Secondly,thededicatedunithasfewerlogicaldecisionstomakethanauniversalunit.Thirdly,theunitmaybeimplementedinamoreefficientfashion.Onemayalsoconsider“pipelining”techniques6toimprovethebandwidthwithinaparticularinsularunit,inadditiontooptimizinganalgorithmtominimizelatenttime.“Pipelining”isaprocesswhereinnaturalpointsinthedecision-makinghardwarearesoughttolatchupinter-mediateresultsandresuscitatetheuseoftheunit.Thelatchingmayincludeextradecisionelementsforstorage,butthesemayalsobeusedinaidingthecontroloftimeskew(differences)inthevariousparallelpaths.I)FloatingAdd-SubtractOperations:Floatingadd-sub-tractoperationsconsistofthreebasicparts.First,thejustificationofthefractionalpartsofthetwooperandsbytheamountofthedifferenceoftheexponents.Second,theaddingofthefractionalparts(orappropriatecomplement).Third,thepostnormalizationofthefractioniftheresulthasaleadingzeroinasignificantpositionoranoverflowoccurs.Becauseofthedecision-making(shifting)problemsasso-ciatedwiththeexponenthandling,anormallatchpointforpipeliningthetwoaddoperationsinoneunit(duplexingtheunit)isattheinterfacefromthepreshiftintotheadderorafterthefirstleveloftheadderstructure.Thefloatingaddclassofinstructionhasbeenreported5tooperateinthe120nanosecondrangeforaduplexedunitwith56-bitfrac-tion(Systems360format).2)FloatingMultiply5,lo,12,13:Theessentialdecisionprocessofthemultiplicationalgorithmistheaddi-tionofthemultiplicandtoitselfbyasmanytimesasareindicatedbythemultiplier.Iftherearenbitsinamultiplierandmultiplicand,thenthemultiplicandmustbeaddedtoitselfntimeswithaone-placeshiftbeforeeachaddition.Standardtechniquesexistforreducingthenumberofaddi-tions(e.g.,multiplierbits:1111.maybetreatedas+1oooO.-1.)requiredbyencodingthemultiplierintoalessernum-berofsignedbits,say42.Oncethisisdone,Wallace12suggeststhedirectadditionofthe42,n-bit,shiftedmulti-plicands.Thebasicdecisionelementinaddingofeachbitofthen/2operandsistheso-calledbinaryfulladder,whichtakesthreeinputsofequalsignificanceandproducestwooutputs,oneofthesamesignificance,thesum,andoneofhighersignificance,thecarry.Bytheassociativelaw,thecarryisinjectedintoanyavailablelowerleveladdstruc-tureoftheappropriatesignificance.(Theneteffectissome-timesreferredtoasaflushadder.)Thefinalresult,ofcourse,isthereductionofthen/2multiplesintotworesultsegments-thesumsandthecarries-whicharethenassimilatedbyaconventionalcarry-propagateadderwithananticipationmechanism.Thebasicproblemistheimplementationofthisalgorithm.Forlargen,theplanesegmentsthatpartitionthealgorithmareinconsistentwithconventionalpackagingsizesandinterconnectionlimitations(seebelow).ConsiderasimplervariationFig.4(b)SI:hereonlyn/mmultiplesofthemultiplicandareretiredperiteration(miterationsarerequired).Then/mmultiplesaredecodedinton/2madditionsofthemultiplier.Thesemultiplesare1966FLYNN:VERYHIGH-SPEEDCOMPUTING1905theninsertedintothefirstleveloftheaddertreeFig.4(b).Nowtheaddassimilationofbothcarriesandsumspro-ceedsasbeforewiththeintroductionofintermediatestag-ingpointswheretheresultsaretemporarilylatched.Thestoragepointsactasskew(relativetiming)controlandim-provetheoverallexecutionbandwidthefficiencyofthealgorithm.Afterthefirstsetofmultipleshasbeeninsertedintothefirstlevelofthetreeandassimilated,itislatchedatthefirstlatchpoint.Thestoragepointservestodecouplethefirstlevelfromsubsequentlevelsofthehardware;there-forethefirstsetofmultiplesmaynowproceedtobeab-sorbedinthesecondleveland,simultaneously,asecondsetofmultiplesmaybeinsertedintothefirstlevelofthetree.Theprocesscontinuesand,atthebottomofthetree,thecarriesandsumsareassimilatedinanadderwithcarrypropagation.Noticethatthelatencyinthisvariationmaybetwicethatoftheoriginalalgorithm.Alsonoticethati+lii-1,/J/1/1REGULARCUTTINGPLANEREGISTERSCONTAININGMULTIPLES(ASSUME61TREEOFLATCHINGFULLADDERSCARRYPRWAGATE(INCLUDESCARRYADDERANTICIPATION)UNITINTERCONNECTIONSARESHOWNONCUTTINGPLANE0000006ithbitMULTIPLEREGISTERSPARTIALSUMACCUMUL4TlONithREWRcumNPLANE(b)Fig.4.Outlineofamultipliertree.Li+lii-lithPL4NEPROFILEpotentialbandwidthinthisalgorithmhasnotsufferedsinceasecondmultiplicationmayproceedindependentlyassoonasthelastsetofmultiplesisinsertedandhaspassedthefirstlevelofthetree.Implementationsofthissecondschemehaveyieldedaperformanceof180nanosecondsforafloat-ing-pointmultiplyincludingpostshiftingandexponentupdating(56-bitfraction,Systems360format).Figure4,parts(a)and(b)arethree-dimensionalrepresen-tationsofthesecondvariationofthemultiplyalgorithm.Figure4(a)isanisometricprojectionandFig.4(b)isacross-section(or“regularcuttingplane”)andprofileview.Thisrepresentationwaschosentoillustratesomeofthedifficultiesofimplementinghigh-speedsystemsingeneral.Itiswellknown1thatpropagationdelayandnonuniformtransmissionlineloadingaremajorfactorsintheswitchmgspeedofalogicstage.Onewould,therefore,desireaphysicalpackageconsistentwiththe“natural”communica-tionpatternofthealgorithmsothattheimplementationcouldbeoptimized.Thedifficultiesofaplanarpackageareobvious.Communicationsbetween“regularcuttingplanes”profileview,Fig.4(b)mustbemadeaxiallyinthephysicalimplementation.Indeed,thereisnoassurancethatthecapacityofthephysicalpackage-planewillmatchthere-quirementsoftheregularcuttingplane-severalpackage-planesmayberequired.Noticethatthefirstvariationofthealgorithmhadmuchmoreextensiverequirementsperplane,thusitsimple-mentationwouldverylikelybelessefficientthanthesecondvariation.3)FloatingPointDivide5,IO,lI:Historically,dividehasbeenlimitedbythefactthattheiterativeprocesswasdependentonpreviouspartialresults.Recently,atten-tionhasbeengiventotechniqueswhichdonothavethislimitation.ThesetechniquesincludeuseofNewton-Raphsoniterationsforseriesapproximationstoquotients.AssumeaQuotient=-bLet11b1-(-X)-=1-x+x2-3+.fX“f.whichcanberewrittenas1b-=(1-x)(l+x2)(1+x4).(1+x2m)Thus:Quotient=a.(I-x).(l+x2).(1+x4).(1+x2”).Recallthatinbinarythefactor(1+x“)isrelatedto(1-x“)byacomplementationoperation.Thus,thedenominatorisrewrittenas(1+x),complementedtoform(1-x),theproductis(1-x2)whichiscomplementedtoform(1+x2),whichcontinuesthedevelopment.1906PROCEEDINGSOFTHEIEEEDECEMBERNoticethatthespeedissubstantiallyenhancedbythepresenceofahigh-speedmultiplier.Infact,asmanyastwomultiplicationsmightproceedsimultaneously.oneforthequotientdevelopmentandoneforthenextdenominatorterm,ifthehardwarepermits.Inforcingquickcon-vergenceoftheseries,usuallyatablelookuparrange-mentisusedtoprovidethefirstapproximationforthequotient.Thereafter,subsequentiterationsdevelopdoubletheprecisionofthefirstapproximation.Floating-pointdividesunder700nanosecondshavebeenreportedusingthisscheme5.4)AlgorithmsforAchievingMaximumEjiciencyintheConcurrentExecutionofIndependentUnits:Oneofthepurposesinhavingindependentunitsdedicatedtoin-dividualinstructiontypesistoimprovetheexecutionofeachoftheinstructionclasses.Theotheradvantageisthattheseindependentunitsmayoperateconcurrentlyandservetoincreasetheoverallbandwidthoftheexecutionunit.Tomasulo4describesanelegantalgorithmtoallowtheachievementofmaximumefficiencyintheconcurrentexecu-tionofunits(Fig.5,illustrationfromAmdahlI).Theoperationofthisalgorithmtakesadvantageofthelatenttimefollowingtheissuingoftherequesttoaccesstheoperandsfromstorageintothememoryintheexecutionarea(eithervirtualoraddressablebuffers).Whenanin-structionisforwardedtotheexecutionarea,awordintheexecutionunitmemoryisreservedforitsoperand,buttheoperandhasnotyetarrived.Forexample,iftheinstructiontobeperformedconsistedofamultiply,thenitcouldbeimmediatelyforwardedtothemultiplicationunit.TheLOADR,aMULTRI,pADDR,R2STORERLOADRI,(b)Fig.5.EffectofTomasuloconcurrencyalgorithm.(a)Sequencewithconventionaldependency.(b)Sequenceasperformedaftertagwasforwarded.multiplierwouldrequireatagrepresentingtheoperand.Theexecutionareaisprovidedwithacommondatabusandthetagrepresentingtheoperandisbroadcastonthebusonecycleearly,thentheexecutionfunctionalunitsexaminetheirqueuesofrequiredoperandsandgateinimmediatelyanappropriateonewithoutwaitingforittogotothebufferstorage.Thetagforwardingfreesthebufferstorageofhav-inganyresponsibilityforthisparticularoperand.Ofcourse,atthesametime,aloadcouldthenbeexecutedintothatsameword.Thus,thesecondloadinstructionandthemultiplyinstructioncouldproceedconcurrentlyinafashionimpossiblebefore.Results,astheybecomeavail-able,mightalsobeforwardedtounits(theadderinFig.5)requestingactionbyasimilarmechanism,ratherthanproceedingdirectlytostorage,whethervirtualoraddressed.Thisavoidsintermediatestopsandhenceimprovesoverallefficiencyofexecutionbyimprovingtheoverlapinthecon-currencysystem.C.BranchingAssumeforthemomentthatmemorybandwidthandexecutionbandwidthhavenowbeenarrangedsothattheymorethansatisfytherequirementofoneinstructionpro-cessedperdecodecycle.Whatthenwouldlimittheper-formanceofaSISDsystem?Ifitisassumedthatthereareafixednumberofdata-dependentbranchpointsinagivenprogram,then,byoperatinginaconfluentorotherhigh-speedmode,weessentiallybringthebranchpointsclosertogether.However,theresolutiontimeofthedata-depen-dentbranchpointisbasicallyfixedforasystemwithagivenexecutionand/oraccessinglatency.Thebasicretrogressivefactoristhepresenceofbranchdependenciesintheinstructionstream.Amongthemanytypesofbranchinstructions,wehave2,7:1)Execute:Theoperandwhichisfetchedistobetreatedasaninstruction(oraninstructioncounter).Thispresentssomeproblemssincetheaccesslatencyisin-sertedintotheinstructionstream.Itisessentiallythesameproblemasindirectaddressingoradoubleoperandfetch.Onecouldanticipatesomeofthisdifficultybykeepingaheadintheinstructionstream.However,thisparticularinstructiondoesnotposeaseriesdegradationproblembecauseitsuseisnormallyrestrictedtolinkages.2)BranchUnconditional:Theeffectiveaddresssogen-eratedbecomesthecontentsoftheinstructioncounter.Thissubclassofinstructionpresentsthesameproblemasexecute.3)BranchonIndex:Thecontentsofanindexregisterisdecrementedbyoneoneachiterationuntilthereg-isteriszero,uponwhichthealternatepathistaken.Onlytheaccesslatencyisafactorsincezerocanbeanticipated.4)Data-DependentBranch:Thepathsofthebrancharedeterminedbythecondition(sign,bit,status,etc.)ofsomedatacell.Invariably,thisconditionisdependentontheexecutionofapreviouslyissuedinstruction.1966FLYNN:VERYHIGH-SPEEDCOMPUTING1907Herethedegradationisseriousandunavoidable.Firsttheexecutionofthecondition-generatingdatamustbecompleted.Then,thetestismadeandthepathisselected.Now,theoperandscanbefetchedandconfluencycanberestored.Itispresumedthatbothinstructionpathswerepreviouslyfetched-butnotethatthisisdoneonlyattheexpenseofgreatermemorybandwidthrequirements.Anotherslightimprovementcanbemadeiftheoperandsare.fetchedforonealterna-tivepathsolongastheunresolvedbranchpathisnotexecuted(toavoidseriousrecoveryandreconstructionproblemsiftheguessproveswrong).Ofthetwo“loopclosing”orconditionalbranchinstruc-tions,BranchonIndexhaslessdegradationduetoresolu-tionoflatencythanData-DependentBranch.Whenanoptionexists(asintheperformanceofaknownnumberofiterations)theprogrammershouldselecttheformer.AssumingthattheData-DependentBranch-resolvinglatencyisaconstantforanyparticularmachine,wemayshowitsrelationship(Fig.6)toperformancebyassumingvariouspercentages(ofoccurrenceinexecutedcode)ofthistypeinstruction.Thelatencyrepresentsthesumofaverageexecutiontimeplusoperandaccesstimefrommemory.Figure6assumesthattheorganizationunderconsidcra-tionhasenoughconfluencytoperformoneinstruction/cycleontheaveragewithoutbranch-resolutioninterrup-tions.Toresolvethebranch,itisgenerallynecessarytofullyexecuteandtesttheresultoftheprecedinginstruction.Duringthistime,fetchingofinstructionsanddatamayproceedforonebranchpathandpossiblyafewinstructionsmaybefetchedforthealternate.Fetchingofbothpathsdoublesthebandwidthrequiredandincreasesthewaitingtime(Fig.3).Thusthelatencyincludesanaverageexecutiontime,testtime,and.a.percentage(wrongpathguesses)oftheoperand-accesstime.Noticethatwhilewehavestudieddegradation.dueton=NUMBEROFCYCLESOFEXECFTIONANDACCESSLATENCY“=II.o-WWJvuZtUVE0.5U=+Wu)=z0Fig.6.1020%PIIXNTASEOFOCEURREEYGEffCONDlTlOULBRANUIINSTRUCTIOISINImmnowsTRumDegradationduetodatadependentbranchinstructions.latencyfortheSJSDorganization,multiplestreamorgani-zationsmayexhibitbranchinduceddegradationduetoeitherthssamephenomenonorananalogspatialin-efficiencyinwhichonlyonepathactivatesordeterminestheoutcomeofadependency.CLASSESOFORGANIZATIONIntlussectionweshallconsidereachoftheorganizationalclasseslistedinthefirstsectionofthepaper.Wewillre-markonorillustraterepresentativesystemsthatfallintoeachclass.Thevariousconfigurationsarebynomeansexhaustive.A.ConJluentSISDTheconfluentSISDprocessor(IBMSTRETCH7,CDC6600series8,IBM360/90series2l-5)achievesitspowerbyoverlappingthevarioussequentialdecisionpro-cesseswhich-makeuptheexecutionoftheinstruction(Figs.1and2).Inspiteofthevariousschemesforachievingarbitrarilyhighmemorybandwidthandexecutionband-width,thereremainsanessentialconstraintinthistypeoforganization.Asweimpliedbefore,thisbottleneckisthedecodingofoneinstructioninaunittime,thus,nomorethanoneinstructioncanheretiredinthesametimequan-tum,.ontheaverage.Ifoneweretotrytoextend.thisorgan-izationbytakingtwo,three,orndifferentinstructionsinthesamedecodecycle,and-nolimitationswereplacedoninstructioninterdependence,thenumberofinstructiontypestobeclassifiedwouldbeincreasedbythecombina-torialamount(MdifferentinstructionstakennatatimerepresentsM”differentoutcomes)andthedecodingmechanismwouldbecorrespondinglyincreasedincom-plexity.Ontheotherhand,onecouldplacerestrictionsontheoccurrenceofeitherspecifiedtypesofinstructionsorin-structiondependencies.This,inturn,narrowstheclassofproblemsforwhichthemachineissuitableand/ordemandsrestrictiveprogrammingpractices.Indeed,thisisachar-acteristicofmultiplestreamorganizationssincethemulti-plicity(or“parallelism”)impliesindependentsimultaneousaction.B.SIMD14-18SIMD-typestructureshavebeenproposedbyUnger14,SlotnikI51(SOLOMON,ILLIACIV),CraneandGithens16,and,morerecently,byHellerman17.SOLOMONistheclassicSIMD.Therearenuniversalexecutionunitseachwithitsownaccesstooperandstorage.Thesingleinstructionstreamactssimultaneouslyonthenoperandswithoutusingconfluencetechniques.Increasedperformanceisgainedstrictlybyusingmoreunits.Com-municationbetweenunitsisrestrictedtoapredeterminedneighborhoodpatternandmustalsoproceedinauniversal,uniformfashionFig.7(a).(Note:SOLOMONhasbeensupersededbyILLIACIVasasystembeingactivelyde-veloped,whichisnolongercompletelySIMD.)ThedifficultieswithSIMDare:1)LatencyintheinstructionstreamforSIMDbranchesisnowreplacedbylatencyinthedatastreamcausedbyPROCEEDINGSOFTHEIEEEDECEMBER19082)operandcommunication(forwarding)problems.Presently,thenumberofclassesofproblemswhoseoperandstreamshavetherequiredcommunicationregularityisnotwellestablished.SIMDorganizationsareinconsistentwithstandardalgorithmictech-niques(including,andespecially,compilertech-niClUeS).3)Theuniversalityoftheexecutionunitsdeprivethemof-Imaximumefficiency.C.MISDThesestructureshavereceivedmuchlessattention18,19.AnexampleofsuchastructureisshowninFig.7(b).Itbasicallyemploysthehighbandwidthdedicatedexecu-tionunitasdescribedintheconfluentSISDsection.Thisunitisthensharedbynyirtualmachinesoperatingonprogramsequencesindependentofoneanother.Eachvirtualmachinehasaccesstotheexecutionhardwareoncepercycle.Eachvirtualmachine,ofcourse,hasitsownEXECUTION1UNITSUINSTRUCTIONUNIT-LIMITEDWMMUNICATION(4BUSESARETIMESHAREDMULTIPLYBUSIUNIT9INSTRUCTIONIIIIDAlAd*H*lSTORAGEEXECUTIONUNITEXECUTIONUNITEXECUTIONSOURCEDATASTREAYDERIVEDDATISTREAY(c)Fig.7.(a)SIMD.(b)MISD.ConvertstoMIMDifinstructionsanddataareprivatelymaintained(together).(c)MISD.privateinstructionmemoryandinteractionbetweenin-structionstreamsoccursonlyviathecommondatamemory.Presumably,ifthereareNinstructionunitsthentheband-widthofthecommondatastoragemustbeNtimesgreaterthantheindividualinstructionstorage.ThisrequirementcouldbesubstantiallyreducedbyuseofamodifiedversionofTomasulostag-forwardingalgorithmandaseparatecommonforwardingbus.AnotherversionofthisFig.7(c)wouldforceforward-ingofoperands.ThusthedatastreampresentedtoExecu-tionUnit2istheresultantofExecutionUnit1operatingitsinstructiononthesourcedatastream.Theinstructionthatanyunitperformsmaybefixed(specializedsuchthattheinterconnection(orsetup)ofunitsmustbeflexible),semi-fixed(suchthatthefunctionofanyunitisfixedforonepassofadatafile)orvariable(sothatthestreamofinstruc-tionsoperatesatanypointonthesingledatastream).Undersuchanarrangement,onlythefirstexecutionunitseesthesourcedatastreamandwhileitisprocessingtheithoperandtheithexecutionunitisprocessingtheithderivationofthefirstoperandofthesourcestream.D.MIMDIfwereconstructtheorganizationofFig.7(b)sothatthedataandinstructionstreamsaremaintainedtogetherinprivatememories,wehaveanexampleofMIMD.Thereisnointeractionorminimuminteractionallowedbetweenthesevirtualmachines.Solongasthelatenttimeforexecu-tionofanyoperationislessthanthememorycycle,noproblemsariseduetobranching.Thus,suchanarrange-mentallowsmaximumadvantageoftheallowableband-widthsofexecutionunitandmemoryunit.Ofcourse,suchanapproachmightwellbecriticizedonthebasisthatthere-quirementofindependenceoftheinstructionsequencesdoesnotaddressitselftotherequirementsoflargescientificproblems.Itwouldbebettersuitedtotheneedsofthetime-sharingand/orutilityenvironment.ThisrestrictedMIMDpointsuptheshortcomingofourorganizationaldefinitions.Thespecificationsfailtoincludeaclassdicationofhowthestreamsmayinteract,thusare-strictedMIMDmaybeorganizationallymuchsimplerthanaconfluentSISD.GeneralMIMDstructureshavebeenmorewidelyde-scribed20E23withlarge-scalemultiplicitybeingen-visionedbyHolland21andmorerestrictedimplementa-tionsbeingundertakenbyBurroughsandUnivac23.Inhisoriginalproposal,Hollandconsidersanarrayofprocessorseachwithaone-wordstorage.Themodulesareindependentandarecapableofconcurrentexecution.Theprocessorshavearithmeticabilitybutcommunicationislimitedbytheirneedto“buildapath”totheappropriateoperand.Inpathbuilding,thedisplacementanddirectionoftheoperandisspecifiedandtheinterveningmoduleprocessorsformavinculum.SuchanarrangementmightsolvesomeoftheessentialblockageproblemsofSEDandSIMD,sinceindependentprogramsegmentsproceedsimultaneously.Also,memorybandwidthisalwaysadequate.1966FLYNN:VERYHIGH-SPEEDCOMPUTING1909ThedifficultieswithsuchanarrangementofMIMDtime,andthismaybefurtherdegradedbytheoccurrenceofgenerallyincluder201:“conditionalbranch”instructions.1)interconnectionsbetweenunits,whichposeserious2)theuniversalnatureoftheindividualmodule,which3)theclassofproblemswhichcouldutilizeMIMDinterferenceproblemslimitsitsefficiencyorganization,whchispresentlysmall.SYSTEMSREQUIREMENTSTheeffectivenessoftheoverallcomputingprocessesmustbemeasuredonabasismuchlargerthanperformance(nanosecondsperinstruction)alone.Evenonthisprimi-tivebasis,itisobviousthatdifferentinstructionrepertoiremayimplysubstantiallydifferenteffectivenesswiththesameaveragenanosecondperinstructionratio.Despiteouroriginalassumptionthattheverylargeprob-lemdoesnotemploysignificantinput/output,itis.soonrealizedthatthis.requirementisunduly.restrictive.Pres-ently,high-speedsystemsremainsolimited.Itmaybesometimebeforebroadereffectiveutilizationmayberealizedthroughnewdevelopmentsininput/outputequip-ment.Ofcourse,theoverallmeasureofefficiencyofthepro-cessoristhenumberofcorrectlycompletedcomputationsandprogramsdoneoveranextendedperiodoftime.In-cludedinthismeasureisthereliability.andthemaintain-abilityofthesystem.Complexsystemswithverylargenum-bersofcomponentsarenaturallyverydifficulttomaintain,unlessfeatureswhichprovidefaultlocationareincluded.Amajorcomponentofsuchfeatureswouldbechecking(orfaultdetection).ofalloperationsanddatatransfers.Ondetectionoferror,hardware-aideddiagnosticsshouldbeprovidedsothatservicingandmaintenancemightbereadilyandeasilyaccomplished.Ifcheckingisnotincludedinthehardware,th

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论