会员注册 | 登录 | 微信快捷登录 QQ登录 微博登录 | 帮助中心 人人文库renrendoc.com美如初恋!
站内搜索 百度文库

热门搜索: 直缝焊接机 矿井提升机 循环球式转向器图纸 机器人手爪发展史 管道机器人dwg 动平衡试验台设计

   首页 人人文库网 > 资源分类 > PDF文档下载

43-Very High-Speed Computing Systems.pdf

  • 资源星级:
  • 资源大小:959.11KB   全文页数:9页
  • 资源格式: PDF        下载权限:注册会员/VIP会员
您还没有登陆,请先登录。登陆后即可下载此文档。
  合作网站登录: 微信快捷登录 支付宝快捷登录   QQ登录   微博登录
友情提示
2:本站资源不支持迅雷下载,请使用浏览器直接下载(不支持QQ浏览器)
3:本站资源下载后的文档和图纸-无水印,预览文档经过压缩,下载后原文更清晰   

43-Very High-Speed Computing Systems.pdf

PROCEEDINGSOFTHEm,VOL.54,NO.12,DECEMBER,1966190123.H.A.Ernst,TCS,anexperimentalmultiprogrammingsystemfortheIBM7090,IBMCorp.,YorktownHeights,N.Y.,ResearchRept.RJ248,June1963,41pages.theIBM7090.IBMResearchRept.RJ248,June1963,41pages.24.M.Lehman.R.Eshed,andZ.Netter,SABRAC,atimesharinglowcostcomputer.Commun.ACM,vol.6,pp.427429,August1963.25.R.V.SmithandD.N.Senzig,Computerorganizationforarrayprocessing,IBMCorp.,YorktownHeights,N.Y.,ResearchRept.RC1330,December1964.26.A.S.Critchlow,Generalizedmultiprocessingandmultiprogrammingsystems,1963AFIPSProc.FJCC,pp.107126.27.M.E.Conway,Amultiprocessorsystemdesign,ibid.,pp.139146.28.R.R.SeeberandA.B.Lindquist.Associativelogicforhighlyparallelsystems,ihid.,pp.489493.29.R.M.Meade,604machinedescription,IBMinternalmemo.,December1963.38pages.30.M.Lehman,R.Eshed,andZ.Netter,SABRAanewgenerationserialcomputer,IEEETrans.onElectronicComputers,vol.EC12.pp.618628,December1963.31.M.W.Allen,T.Pearcey.J.P.Penny,G.A.Rose,andJ.G.Sanderson.CIRRUS,aneconomicalmultiprogramcomputerwithmicroprogramcontrol,ibid.,pp.663671.32.W.F.MillerandR.A.Aschenbrenner,TheGUSmulticomputersystem,ibid.,pp.671676.33.G.Estrin,B.Russenl,R.Turn,andJ.Bibb,Parallelprocessinginarestructurablecomputersystem,ibid.,pp.747755.34.G.GregoryandR.McReynolds,TheSolomoncomputer,ibid.,35.H.S.Bright,APhilcomultiprocessingsystem,1964Proc.FJCC,pp.97141.36.R.G.EwingandP.M.Davies,Anassociativeprocessor,1964AFIPSProc.FJCC,pp.147158.37.H.A.Kinslow,Thetimesharingmonitorsystem,ibid.,pp.443454.38.J.Nievergelt.Parallelmethodsforintegratingordinarydifferentialequations,Commun.ACM,vol.7,pp.731733,December1964.39.W.H.Desmonde,RealTimeDuraProcessingSystems.EnglewoodCliffs,N.J.PrenticeHall,1964.40.M.Lehman.Serialmodeoperationandhighspeedparallelprocessing,InformationProcessing,1965Proc.IFIP,pt.2.NewYorkSpartan,1966,pp.631633.41.R.V.SmithandD.N.Senzig,Computerorganizationforarraypp.774755.processing.1965Proc.FJCC,pp.117128.42.E.W.Dijkstra,Solutionofaprobleminconcurrentprogrammingcontrol.Commun.ACM,vol.8,p.569.September1965.43.J.B.Dennis,Segmentationandthedesignofmultiprogrammedcomputersystems,J.ACM,vol12,pp.589602,October1965.44.F.J.CorbatoandV.A.Vyssotsky,Introductionandoverviewofthemulticssystem,1965Proc.FJCC,pp.185196.45.E.L.Glaser.J.Couleur,andG.Oliver,Systemdesignofacomputerfortimesharingapplications,ibid..pp.197202.46.V.A.Vyssotsky,F.J.Corbato,andR.M.Graham,Structureofthemulticssupervisor,ibid..pp.203212.47.R.C.DaleyandP.G.Neumann,Ageneralpurposefilesystemforsecondarystorage,ibid.,pp.213229.48.J.F.Ossanna,L.E.Mikus,andS.D.Dursten,Communicationandinputoutputswitchinginamultiplecomputingsystem,ihid..49.J.W.Forgie,Atimeandmemorysharingexecutiveprogramforquickresponseonlineapplications,ibid.,pp.599610.50.J.D.McCullogh,K.H.Speiennan,andF.W.Zurcher,Designforamultipleusermultiprocessingsystem,ibid.,pp.611618.51.W.T.Comfort,Acomputingsystemdesignforuserdevice,ibid.,pp.619628.52.J.P.Anderson,Programstructuresforparallelprocessing,Commun.ACM,vol.8,pp.786788,December1965.53.B.W.Arden,B.A.Galler,T.C.D.OBrien,andF.H.Westervelt,Programandaddressingstructureinatimesharingenvironment,J.ACM,vol.13,pp.116.January1966.54.J.H.Katz,Simulationofamultiprocessorcomputersystem,SRDRept.LA009,February1966,tobepublishedin1966Proc.SJCC.55.G.S.ShedlerandM.M.Lehman,Parallelcomputationandthesolutionofpolynomialequations,IBM,YorktownHeights,N.Y.,ResearchRept.RC1550,February1966.56.H.Hellerman,Parallelprocessingofalgebraicexpressions,IEEETrans.onElectronicComputers,vol.EC15,pp.8291,February1966.57.J.B.DennisandE.C.VanHorn,Programmingsemanticsformultiprogrammedcomputation,Commun.ACM,vol.9,pp.143155,March1966.58.N.Wirth,Anoteonprogramstructuresforparallelprogramming,Commun.ACM,vol.9,pp.32G321,May1966.59.D.E.Knuth,Additionalcommentsonproblemsinconcurrentprogrammingcontrol,ihid.,pp.321322.pp.231241.VeryHighspeedComputingSystemsMICHAELJ.FLYNNMEMBER,IEEEAbstractVeryhighspeedcomputersmaybeclnssifiedasfollows1SingleJktmctionStrdingleDataStreamSISD2SileImbnctioaStreamMultipleDataStreamSIMD3MultiplehstmcthStrePntSingleDataStreamMSD4MnltipleInstroctiooStreamMultipleDataStreamD.Stream,asnsedhere,referstothesequenceofdataorirstructiomasseenbythemachinedaringtbeexecutionofaprogram.mecoastitaeatsofasystemstorage,exeation,andhtrudonbandhgbraochingaredkcussdwitbregardtorecentdevelopmen6and/orsystemsLimitatim.TheCOlsMnentsaredkcuwdmtermofcoocarrentSEDManuscriptreceivedJune30,1966revisedAugust16,1966.ThisworkwasperformedundertheauspicesoftheU.S.AtomicEnergyCommission.TheauthoriswithNorthwesternUniversity,Evanston,Ill.,andArgonneNationalLaboratory,Argonne,Ill.systemsCDC6600seriesand,ioparticnlar,IBMModd90series,sincemnltiplestreamorganizationsusuallydonotquireanymoreelaboratecompooents.Representativeorganizationsareselectedfromeachclassandthearrangementofthecomtitwntsisshown.INTRODUCTIONMANYSIGNIFICANTscientificproblemsrequiretheuseofprodigiousamountsofcomputingtime.Inordertohandletheseproblemsadequately,thelargescalescientificcomputerhasbeendeveloped.Thiscomputeraddressesitselftoaclassofproblemscharacterizedbyhavingahighratioofcomputingrequirementtoinput/outputrequirementsapartiallydefactosituation1902PROCEEDINGSOFTHEIEEEDECEMBERcausedbytheunavailabilityofmatchinginput/outputequipment.Thecomplexityoftheseprocessors,coupledwiththeadvancementofthestateofthecomputingarttheyrepresent,hasfocusedattentiononscientificcomputers.Insightthusgainedisfrequentlyapredictorofcomputerdevelopmentsonamoreuniversalbasis.Thispaperisanattempttoexplorelargescientificcomputingequipment,reviewingpossibleorganizationsstartingwiththeconcurrentorganizationswhicharepresentlyinoperationandthenexaminingtheothertheoreticalorganizationalpossibilities.ORGANIZATIONThecomputingprocess,initsessentialform,istheperformanceofasequenceofinstructionsonasetofdata.Eachinstructionperformsacombinatorialmanipulationalthough,foreconomy,.subsequencingisalsoinvolvedononeortwoelementsofthedataset.Iftheelementwereasinglebitandonlyonemchbitcouldbemanipulatedatanyunitoftime,wewouldhaveavariationoftheTuringmachinethestrictlyserialsequmtialmachine.Thenaturalextensionofthisistointroduceadatasetwhoseelementsmorecloselycorrespondtoanaturaldataquantumcharacter,integer,floatingpointnumber,etc..Sincethesizeofdatumhasincreased,sotoohasthenumberofcombinatorialmanipulationsthatcanbeperformedmanipulationsontwonbitargumentshave22possibleoutcomes.Ofcourse,attentionisrestrictedtothoseoperationswhichhavearithmeticorlogical.significance.Aprogramconsistsofanorderedsetofinstructions.Theprogramhasconsiderablyfewerwrittenorstoredinstructionsthanthenumberofmachineinstructionstobeperformed.Thedifferenceisintherecursionsorloopswhichareinherentintheprogram.Itishighlyadvantageousifthealgorithmbeingimplementedishighlyrecursive.Thebasicmechanismforsettinguptheloopsistheconditionalbranchinstructions.ForconvenienceweadopttwoworkingdefinitionsZnstructionStreamisthesequenceofinstructionsasperformedbythemachineDataStreamisthesequenceofdatacalledforbytheinstructionstreamincludinginputandpartialortemporaryresults.Thesetwoconceptsarequiteusefulincategorizingcomputerorganizationsinanattempttoavoidtheubiquitousandambiguoustermparallelism.OrganizationswillbecharacterizedbythemultiplicityofthehardwareprovidedtoservicetheInstruction.andDataStreams.Themutiplicityistakenasthemaximumpossiblenumberofsimultaneousoperationsinstructionsoroperandsdatabeinginthesamephaseofexecutionatthemostconstrainedcomponentof.theorganization.SeveralquestionsareimmediatelyevidentwhatisaninstructionwhatisanoperandhowistheconstrainingcomponentfoundTheseproblemscanbeansweredbetterbyestablishmentofareference.IftheIBM704.werecomparedtotheTuringmachine,the704wouldappearhighlyparallel.Ontheotherhand,ifadefinitionweremadeintermsofthenaturaldataunitcalledforbyaproblem,thesituationwouldbeequallyuntenable,sinceinmanyproblemsonewouldconsideralargematrixofdataaunit.ThuswearbitrarilyselectareferenceorganizationtheIBM70470927090.Thisorganizationisthenregardedastheprototypeoftheclassofmachineswhichwelabel1SingleInstructionStreamSingleDataStreamSISD.Threeadditionalorganizationalclassesareevident.2SingleInstructionStreamMultipleDataStream3MultipleInstructionStreamSingleDataStream4MultipleInstructionStreamMultipleDataStreamSIMDMISDMIMD.Beforecontinuing,wedefinetwoadditionalusefulnotions.Bandwidthisanexpressionoftimerateofoccurrence.Inparticular,computationalorexecutionbandwidthisthenumberofinstructionsprocessedpersecondandstoragebandwidthistheretrievalrateofoperandandoperationmemorywordswords/second.Latencyorlatentperiodisthetotaltimeassociatedwiththeprocessingfromexcitationtoresponseofaparticulardataunitataphaseinthecomputingprocess.Thusfar,categorizationhasdependedonthemultiplicityofsimultaneouseventsatthesystemscomponentwhichimposesthemostconstraints.Theratioofthenumberofsimultaneousinstructionsbeingprocessedtothisconstrainedmultiplicityiscalledtheconfluenceorconcurrenceofthesystem.ConfluenceisillustratedinFig.1foranSISDorganization.Itseffectistoincreasethecomputationalbandwidthinstructionsprocessed/secondbymaximizingtheutilityGENERATEAOORESSOFINSTRUCTIONFETCHINSTRUCTIONDECOOENSTRUCTIONGENERATEAOORESSOFOPERANO111FETCHOPERfDEXECUTEINSTRUCTIONINST.2INST.IL...4INSTRUCTIONISTARTSINSTRUCTION2STARTSINSTRUCTION3STARTSFig.1.Concurrencyandinstructionprocessing.1966FLYNNVERYHIGHSPEEDCOMPUTING1903oftheconstrainingcomponentorbottleneck.Theprocessingofthefirstinstructionproceedsinthephasesshown.Inordertoincreasethecomputationalspeed,webeginprocessinginstruction2assoonasinstruction1completesitsfirstphase.Clearly,itisdesirabletominimizethetimeineachphase,butthereisnoadvantageinminimizingbelowthetimerequiredbyaparticularphaseormechanism.Suppose,foragivenorganization,thattheinstructiondecoderisanabsoluteserialmechanismwithresolutiontime,At.Iftheaverageinstructionpreparationandexecutiontimeist,,thenthecomputationalbandwidthmaybeimprovedbyt,/At.Thusifonlyoneinstructioncanbehandledatatimeatthebottleneck,thenthemaximumachevableperformanceisl/Ar.1beingthemultiplicityoftheconstraint.Inordertoprocessagivennumberofinstructionsinaparticularunitoftime,acertainbandwidthofmemorymustbeachievedtoinsureanamplesupplyofoperandsandoperationsintheformofinstructions.Similarly,thedatamustbeoperatedonexecutedatarateconsistentwiththedesiredcomputationalrate.CONSTITUENTSOFTHESYSTEMWewilltreatstorage,execution,andinstructionhandlingbranchingasthemajorconstituentsofasystem.Sincemultiplestreamorganizationsusuallyrepresentmultipleattachmentsofoneormoreoftheaboveconstituents,welosenogeneralityindiscussingtheconstituentsofasystemastheyhaveevolvedinSISDorganizations.Indeed,confluentSISDorganizationsbytheirnaturemustallowarbitraryinteractionbetweenelementsofthedatastreamorinstructionstream.Multiplestreamorganizationmaylimitthisinteraction,thussimplifyingsomeoftheconsiderationsinanarea.Therefore,fornow,weshallbemainlyconcernedwithtechniquestoextendcomputationalperformanceinacontextofaconfluentSISDsystem.Wewillfurtherassumethattheserialmechanismwhichconstrainstheorganizationisthedecodingofinstructions.Thus,ontheaverage,theprocessingofoneinstructionperdecodecyclewillbeanupperlimitonperformance.TherelationshipbetweentheconstituentsisshowninFig.2fortheSISDorganization.I\/I71STORAGEjnEXECUTIONBANDWIDTHINSTRUCTIONSTREAYINSTRUCTIONHANDLINGUNITOPERAUOSTREAYn...0IFig.2.SISDOrganization.A.StorageTheinstructionanddatastreamsareassumedsequential.Thus,accessingtostoragewillalsobesequential.Iftherewereavailablestoragewhoseaccessmechanismcouldbeoperatedinonedecodecycleandthisaccessingcouldberepeatedeverycycle,thesystemwouldberelativelysimple.Theonlyinterferencewoulddevelopwhenoperandshadtobefetchedinapatternconflictingwiththeinstructions.Unfortunately,accessinginmoststoragemediaisconsiderablyslowerthandecoding.ThismakestheuseofinterleavingtechniquesnecessarytoachievetherequiredmemorybandwidthseeFig.3.Intheinterleavingscheme,nmemoriesareused.Wordsaredistributedineachofthememoriessequentiallymodulon.Inmemoryiisstoredwordi,ni,and2ni,etc.However,theaccessingmechanismdoesnotalterthelatencysituationforthesystem,andthsmustbeincludedinthedesign.Memorybandwidthmustbesufficienttohandletheaccessingofinstructionsandoperands,thestorageofresultsandinputoutputtraffic9.Theamountofinputoutputbandwidthisproblemdependent.Largememoryrequirementswillnecessitatetransferofblocksofdatatoandfrommemory.Theresultingtrafficwillbeinverselyproportionaltotheoverallsizeofthememory.However,anincreaseinthesizeofthememorytominimizethisinterferencealsoactstoincreasethebulkofthememory,andusuallytheinterconnectiondistance.Theresult,then,maybeanincreaseinthelatencycausedbythesecommunicationsproblems.Figure3wasgeneratedaftertheworkofFlores9andassumescompletelyrandomaddressrequests.ThewaitingtimeistheaverageadditionalamountoftimeovertheaccesstimerequiredtoretrieveanitemduetoconflictingACCESSREGENERATIONLATENCYtatrII.1.InUNITSBUSYn.atMEMORYCYCLE\ASSUMEDEMANDRATEOFnWORDSPERMEMORYCYCLE\WAITINGTIMEASAFRACTIONOF1.0MEMORYCYCLEIRANDOMADDRESSREQUESTSI.oI.52oNUMBEROFMEMORYUNITSDEMANDRATEFig.3.Interleavingmemoryunits.1904PROCEEDINGSOFTHEIEEEDECEMBERrequests.Sincememoryrequestsareusuallysequentialforinstructionsandatleastregularfordata,theresultisoverlypessimisticforallbutheavilybranchdependentproblems.Typicalprogramswillexperienceabouthalfthewaitingtimeshown.Thislatencyinthesystemgreatlyincreasesthecomplexityofthecontrolmechanismforthestoragesystem.Thestorageunitmustnowincludequeuingmechanismstoorganize,fetch,andstorerequestswhichareinprocessorcouldnotbehonoredduetoconflicts.Thesequeuingregisters,containingoutstandingservicerequests,mustbecontinuallycomparedagainstnewrequestsforservice.Extensivecontrolinterlocksmustbemadeavailabletoserveatleastthefollowingfunctions2,334.directthefetchedwordtotheappropriaterequestorpreventoutofsequencefetchstoreorstorefetchinthesamememorylocationeliminateduplicaterequestsofthesamememorylocationespeciallywherethememorylocationcontainsmorethanoneinstructionordataunitanalyzekeysorboundarieswherefetchorstorememoryprotectionareused.Inaddition,confluentcomputingsystemsfrequentlyemploybufferstominimizetrafficrequirementsonthememory2.Aninstructionbuffermightcontainthenextninstructionsinthesequenceaswellasahistoryofminstructionstogetherwithseverallevelsofalternatepathsofbranch.Thepresenceofahistoricalpictureofinstructionsintheinstructionbufferallowsfortheopportunitytostoresmallloops,thusavoidingthepenaltyofreaccessinginstructions.Anoperandbufferisusedbytheexecutionunitasitsintrinsicstoragemedium.Theinstructionunitwouldrestructureinstructionsthatcalledforoperandsfrommemoryintoinstructionswhichcallforthecontentsoftheseregisters.Theinstructionunitwould,afterdecodingtheinstruction,initiatetherequestsfortheappropriateoperandsanddirecttheirplacementintotheoperandbuffer.Thustheexecutionunitwouldactasanindependentcomputer,whosestoragewouldbelimitedtothecontentsofthisbuffer,andwhoseinstructionswouldbeinashortenedformat.B.ExecutionWecenterourdiscussiononthefloatingpointinstructionssincetheyarethemostwidelyusedinlargescientificcomputing,andrequirethemostsophisticatedexecution.Inordertoachievetheappropriatebandwidthlevels,wecanrepeatthedeploymentschemethatwasusedformemory,onlyheretheindependentunitswillbededicatedtoservicingoneclassofinstructions.Inadditiontodirectlyincreasingthebandwidthbyprovidinganumberofunits,thespecializationoftheunitaidsinexecutionefficiencyinseveralotherways.Inparticular,eachunitmightactasasmallinsularunitoflogic,henceminimizingintraunitwirecommunicationproblems.Secondly,thededicatedunithasfewerlogicaldecisionstomakethanauniversalunit.Thirdly,theunitmaybeimplementedinamoreefficientfashion.Onemayalsoconsiderpipeliningtechniques6toimprovethebandwidthwithinaparticularinsularunit,inadditiontooptimizinganalgorithmtominimizelatenttime.Pipeliningisaprocesswhereinnaturalpointsinthedecisionmakinghardwarearesoughttolatchupintermediateresultsandresuscitatetheuseoftheunit.Thelatchingmayincludeextradecisionelementsforstorage,butthesemayalsobeusedinaidingthecontroloftimeskewdifferencesinthevariousparallelpaths.IFloatingAddSubtractOperationsFloatingaddsubtractoperationsconsistofthreebasicparts.First,thejustificationofthefractionalpartsofthetwooperandsbytheamountofthedifferenceoftheexponents.Second,theaddingofthefractionalpartsorappropriatecomplement.Third,thepostnormalizationofthefractioniftheresulthasaleadingzeroinasignificantpositionoranoverflowoccurs.Becauseofthedecisionmakingshiftingproblemsassociatedwiththeexponenthandling,anormallatchpointforpipeliningthetwoaddoperationsinoneunitduplexingtheunitisattheinterfacefromthepreshiftintotheadderorafterthefirstleveloftheadderstructure.Thefloatingaddclassofinstructionhasbeenreported5tooperateinthe120nanosecondrangeforaduplexedunitwith56bitfractionSystems360format.2FloatingMultiply5,lo,12,13Theessentialdecisionprocessofthemultiplicationalgorithmistheadditionofthemultiplicandtoitselfbyasmanytimesasareindicatedbythemultiplier.Iftherearenbitsinamultiplierandmultiplicand,thenthemultiplicandmustbeaddedtoitselfntimeswithaoneplaceshiftbeforeeachaddition.Standardtechniquesexistforreducingthenumberofadditionse.g.,multiplierbits1111.maybetreatedas1oooO.1.requiredbyencodingthemultiplierintoalessernumberofsignedbits,say42.Oncethisisdone,Wallace12suggeststhedirectadditionofthe42,nbit,shiftedmultiplicands.Thebasicdecisionelementinaddingofeachbitofthen/2operandsisthesocalledbinaryfulladder,whichtakesthreeinputsofequalsignificanceandproducestwooutputs,oneofthesamesignificance,thesum,andoneofhighersignificance,thecarry.Bytheassociativelaw,thecarryisinjectedintoanyavailablelowerleveladdstructureoftheappropriatesignificance.Theneteffectissometimesreferredtoasaflushadder.Thefinalresult,ofcourse,isthereductionofthen/2multiplesintotworesultsegmentsthesumsandthecarrieswhicharethenassimilatedbyaconventionalcarrypropagateadderwithananticipationmechanism.Thebasicproblemistheimplementationofthisalgorithm.Forlargen,theplanesegmentsthatpartitionthealgorithmareinconsistentwithconventionalpackagingsizesandinterconnectionlimitationsseebelow.ConsiderasimplervariationFig.4bSIhereonlyn/mmultiplesofthemultiplicandareretiredperiterationmiterationsarerequired.Then/mmultiplesaredecodedinton/2madditionsofthemultiplier.Thesemultiplesare1966FLYNNVERYHIGHSPEEDCOMPUTING1905theninsertedintothefirstleveloftheaddertreeFig.4b.Nowtheaddassimilationofbothcarriesandsumsproceedsasbeforewiththeintroductionofintermediatestagingpointswheretheresultsaretemporarilylatched.Thestoragepointsactasskewrelativetimingcontrolandimprovetheoverallexecutionbandwidthefficiencyofthealgorithm.Afterthefirstsetofmultipleshasbeeninsertedintothefirstlevelofthetreeandassimilated,itislatchedatthefirstlatchpoint.Thestoragepointservestodecouplethefirstlevelfromsubsequentlevelsofthehardwarethereforethefirstsetofmultiplesmaynowproceedtobeabsorbedinthesecondleveland,simultaneously,asecondsetofmultiplesmaybeinsertedintothefirstlevelofthetree.Theprocesscontinuesand,atthebottomofthetree,thecarriesandsumsareassimilatedinanadderwithcarrypropagation.Noticethatthelatencyinthisvariationmaybetwicethatoftheoriginalalgorithm.Alsonoticethatilii1,,/J/1/1REGULARCUTTINGPLANEREGISTERSCONTAININGMULTIPLESASSUME61TREEOFLATCHINGFULLADDERSCARRYPRWAGATEINCLUDESCARRYADDERANTICIPATIONUNITINTERCONNECTIONSARESHOWNONCUTTINGPLANE0000006ithbitMULTIPLEREGISTERSPARTIALSUMACCUMUL4TlONithREWRcumNPLANEbFig.4.Outlineofamultipliertree.LiliilithPL4NEPROFILEpotentialbandwidthinthisalgorithmhasnotsufferedsinceasecondmultiplicationmayproceedindependentlyassoonasthelastsetofmultiplesisinsertedandhaspassedthefirstlevelofthetree.Implementationsofthissecondschemehaveyieldedaperformanceof180nanosecondsforafloatingpointmultiplyincludingpostshiftingandexponentupdating56bitfraction,Systems360format.Figure4,partsaandbarethreedimensionalrepresentationsofthesecondvariationofthemultiplyalgorithm.Figure4aisanisometricprojectionandFig.4bisacrosssectionorregularcuttingplaneandprofileview.Thisrepresentationwaschosentoillustratesomeofthedifficultiesofimplementinghighspeedsystemsingeneral.Itiswellknown1thatpropagationdelayandnonuniformtransmissionlineloadingaremajorfactorsintheswitchmgspeedofalogicstage.Onewould,therefore,desireaphysicalpackageconsistentwiththenaturalcommunicationpatternofthealgorithmsothattheimplementationcouldbeoptimized.Thedifficultiesofaplanarpackageareobvious.Communicationsbetweenregularcuttingplanesprofileview,Fig.4bmustbemadeaxiallyinthephysicalimplementation.Indeed,thereisnoassurancethatthecapacityofthephysicalpackageplanewillmatchtherequirementsoftheregularcuttingplaneseveralpackageplanesmayberequired.Noticethatthefirstvariationofthealgorithmhadmuchmoreextensiverequirementsperplane,thusitsimplementationwouldverylikelybelessefficientthanthesecondvariation.3FloatingPointDivide5,IO,lIHistorically,dividehasbeenlimitedbythefactthattheiterativeprocesswasdependentonpreviouspartialresults.Recently,attentionhasbeengiventotechniqueswhichdonothavethislimitation.ThesetechniquesincludeuseofNewtonRaphsoniterationsforseriesapproximationstoquotients.AssumeaQuotientbLet11b1X1xx23...fXf...whichcanberewrittenas1b1xlx21x4...1x2mThusQuotienta.Ix.lx2.1x4...1x2.Recallthatinbinarythefactor1xisrelatedto1xbyacomplementationoperation.Thus,thedenominatorisrewrittenas1x,complementedtoform1x,theproductis1x2whichiscomplementedtoform1x2,whichcontinuesthedevelopment.1906PROCEEDINGSOFTHEIEEEDECEMBERNoticethatthespeedissubstantiallyenhancedbythepresenceofahighspeedmultiplier.Infact,asmanyastwomultiplicationsmightproceedsimultaneously.oneforthequotientdevelopmentandoneforthenextdenominatorterm,ifthehardwarepermits.Inforcingquickconvergenceoftheseries,usuallyatablelookuparrangementisusedtoprovidethefirstapproximationforthequotient.Thereafter,subsequentiterationsdevelopdoubletheprecisionofthefirstapproximation.Floatingpointdividesunder700nanosecondshavebeenreportedusingthisscheme5.4AlgorithmsforAchievingMaximumEjiciencyintheConcurrentExecutionofIndependentUnitsOneofthepurposesinhavingindependentunitsdedicatedtoindividualinstructiontypesistoimprovetheexecutionofeachoftheinstructionclasses.Theotheradvantageisthattheseindependentunitsmayoperateconcurrentlyandservetoincreasetheoverallbandwidthoftheexecutionunit.Tomasulo4describesanelegantalgorithmtoallowtheachievementofmaximumefficiencyintheconcurrentexecutionofunitsFig.5,illustrationfromAmdahlI.Theoperationofthisalgorithmtakesadvantageofthelatenttimefollowingtheissuingoftherequesttoaccesstheoperandsfromstorageintothememoryintheexecutionareaeithervirtualoraddressablebuffers.Whenaninstructionisforwardedtotheexecutionarea,awordintheexecutionunitmemoryisreservedforitsoperand,buttheoperandhasnotyetarrived.Forexample,iftheinstructiontobeperformedconsistedofamultiply,thenitcouldbeimmediatelyforwardedtothemultiplicationunit.TheLOADR,,aMULTRI,pADDR,,R2STORERLOADRI,bFig.5.EffectofTomasuloconcurrencyalgorithm.aSequencewithconventionaldependency.bSequenceasperformedaftertagwasforwarded.multiplierwouldrequireatagrepresentingtheoperand.Theexecutionareaisprovidedwithacommondatabusandthetagrepresentingtheoperandisbroadcastonthebusonecycleearly,thentheexecutionfunctionalunitsexaminetheirqueuesofrequiredoperandsandgateinimmediatelyanappropriateonewithoutwaitingforittogotothebufferstorage.Thetagforwardingfreesthebufferstorageofhavinganyresponsibilityforthisparticularoperand.Ofcourse,atthesametime,aloadcouldthenbeexecutedintothatsameword.Thus,thesecondloadinstructionandthemultiplyinstructioncouldproceedconcurrentlyinafashionimpossiblebefore.Results,astheybecomeavailable,mightalsobeforwardedtounitstheadderinFig.5requestingactionbyasimilarmechanism,ratherthanproceedingdirectlytostorage,whethervirtualoraddressed.Thisavoidsintermediatestopsandhenceimprovesoverallefficiencyofexecutionbyimprovingtheoverlapintheconcurrencysystem.C.BranchingAssumeforthemomentthatmemorybandwidthandexecutionbandwidthhavenowbeenarrangedsothattheymorethansatisfytherequirementofoneinstructionprocessedperdecodecycle.WhatthenwouldlimittheperformanceofaSISDsystemIfitisassumedthatthereareafixednumberofdatadependentbranchpointsinagivenprogram,then,byoperatinginaconfluentorotherhighspeedmode,weessentiallybringthebranchpointsclosertogether.However,theresolutiontimeofthedatadependentbranchpointisbasicallyfixedforasystemwithagivenexecutionand/oraccessinglatency.Thebasicretrogressivefactoristhepresenceofbranchdependenciesintheinstructionstream.Amongthemanytypesofbranchinstructions,wehave2,71ExecuteTheoperandwhichisfetchedistobetreatedasaninstructionoraninstructioncounter.Thispresentssomeproblemssincetheaccesslatencyisinsertedintotheinstructionstream.Itisessentiallythesameproblemasindirectaddressingoradoubleoperandfetch.Onecouldanticipatesomeofthisdifficultybykeepingaheadintheinstructionstream.However,thisparticularinstructiondoesnotposeaseriesdegradationproblembecauseitsuseisnormallyrestrictedtolinkages.2BranchUnconditionalTheeffectiveaddresssogeneratedbecomesthecontentsoftheinstructioncounter.Thissubclassofinstructionpresentsthesameproblemasexecute.3BranchonIndexThecontentsofanindexregisterisdecrementedbyoneoneachiterationuntiltheregisteriszero,uponwhichthealternatepathistaken.Onlytheaccesslatencyisafactorsincezerocanbeanticipated.4DataDependentBranchThepathsofthebrancharedeterminedbytheconditionsign,bit,status,etc.ofsomedatacell.Invariably,thisconditionisdependentontheexecutionofapreviouslyissuedinstruction.1966FLYNNVERYHIGHSPEEDCOMPUTING1907Herethedegradationisseriousandunavoidable.Firsttheexecutionoftheconditiongeneratingdatamustbecompleted.Then,thetestismadeandthepathisselected.Now,theoperandscanbefetchedandconfluencycanberestored.Itispresumedthatbothinstructionpathswerepreviouslyfetchedbutnotethatthisisdoneonlyattheexpenseofgreatermemorybandwidthrequirements.Anotherslightimprovementcanbemadeiftheoperandsare.fetchedforonealternativepathsolongastheunresolvedbranchpathisnotexecutedtoavoidseriousrecoveryandreconstructionproblemsiftheguessproveswrong.Ofthetwoloopclosingorconditionalbranchinstructions,BranchonIndexhaslessdegradationduetoresolutionoflatencythanDataDependentBranch.Whenanoptionexistsasintheperformanceofaknownnumberofiterationstheprogrammershouldselecttheformer.AssumingthattheDataDependentBranchresolvinglatencyisaconstantforanyparticularmachine,wemayshowitsrelationshipFig.6toperformancebyassumingvariouspercentagesofoccurrenceinexecutedcodeofthistypeinstruction.Thelatencyrepresentsthesumofaverageexecutiontimeplusoperandaccesstimefrommemory.Figure6assumesthattheorganizationunderconsidcrationhasenoughconfluencytoperformoneinstruction/cycleontheaveragewithoutbranchresolutioninterruptions.Toresolvethebranch,itisgenerallynecessarytofullyexecuteandtesttheresultoftheprecedinginstruction.Duringthistime,fetchingofinstructionsanddatamayproceedforonebranchpathandpossiblyafewinstructionsmaybefetchedforthealternate.FetchingofbothpathsdoublesthebandwidthrequiredandincreasesthewaitingtimeFig.3.Thusthelatencyincludesanaverageexecutiontime,testtime,and.a.percentagewrongpathguessesoftheoperandaccesstime.Noticethatwhilewehavestudieddegradation.duetonNUMBEROFCYCLESOFEXECFTIONANDACCESSLATENCYII.oWWJvuZtUVE0.5UWuz0Fig.6.1020PIIXNTASEOFOCEURREEYGEffCONDlTlOULBRANUIINSTRUCTIOISINImmnowsTRumDegradationduetodatadependentbranchinstructions.latencyfortheSJSDorganization,multiplestreamorganizationsmayexhibitbranchinduceddegradationduetoeitherthssamephenomenonorananalogspatialinefficiencyinwhichonlyonepathactivatesordeterminestheoutcomeofadependency.CLASSESOFORGANIZATIONIntlussectionweshallconsidereachoftheorganizationalclasseslistedinthefirstsectionofthepaper.Wewillremarkonorillustraterepresentativesystemsthatfallintoeachclass.Thevariousconfigurationsarebynomeansexhaustive.A.ConJluentSISDTheconfluentSISDprocessorIBMSTRETCH7,CDC6600series8,IBM360/90series2l5achievesitspowerbyoverlappingthevarioussequentialdecisionprocesseswhichmakeuptheexecutionoftheinstructionFigs.1and2.Inspiteofthevariousschemesforachievingarbitrarilyhighmemorybandwidthandexecutionbandwidth,thereremainsanessentialconstraintinthistypeoforganization.Asweimpliedbefore,thisbottleneckisthedecodingofoneinstructioninaunittime,thus,nomorethanoneinstructioncanheretiredinthesametimequantum,.ontheaverage.Ifoneweretotrytoextend.thisorganizationbytakingtwo,three,orndifferentinstructionsinthesamedecodecycle,andnolimitationswereplacedoninstructioninterdependence,thenumberofinstructiontypestobeclassifiedwouldbeincreasedbythecombinatorialamountMdifferentinstructionstakennatatimerepresentsMdifferentoutcomesandthedecodingmechanismwouldbecorrespondinglyincreasedincomplexity.Ontheotherhand,onecouldplacerestrictionsontheoccurrenceofeitherspecifiedtypesofinstructionsorinstructiondependencies.This,inturn,narrowstheclassofproblemsforwhichthemachineissuitableand/ordemandsrestrictiveprogrammingpractices.Indeed,thisisacharacteristicofmultiplestreamorganizationssincethemultiplicityorparallelismimpliesindependentsimultaneousaction.B.SIMD1418SIMDtypestructureshavebeenproposedbyUnger14,SlotnikI51SOLOMON,ILLIACIV,CraneandGithens16,and,morerecently,byHellerman17.SOLOMONistheclassicSIMD.Therearenuniversalexecutionunitseachwithitsownaccesstooperandstorage.Thesingleinstructionstreamactssimultaneouslyonthenoperandswithoutusingconfluencetechniques.Increasedperformanceisgainedstrictlybyusingmoreunits.Communicationbetweenunitsisrestrictedtoapredeterminedneighborhoodpatternandmustalsoproceedinauniversal,uniformfashionFig.7a.NoteSOLOMONhasbeensupersededbyILLIACIVasasystembeingactivelydeveloped,whichisnolongercompletelySIMD.ThedifficultieswithSIMDare1LatencyintheinstructionstreamforSIMDbranchesisnowreplacedbylatencyinthedatastreamcausedbyPROCEEDINGSOFTHEIEEEDECEMBER19082operandcommunicationforwardingproblems.Presently,thenumberofclassesofproblemswhoseoperandstreamshavetherequiredcommunicationregularityisnotwellestablished.SIMDorganizationsareinconsistentwithstandardalgorithmictechniquesincluding,andespecially,compilertechniClUeS.3TheuniversalityoftheexecutionunitsdeprivethemofImaximumefficiency.C.MISDThesestructureshavereceivedmuchlessattention18,19.AnexampleofsuchastructureisshowninFig.7b.ItbasicallyemploysthehighbandwidthdedicatedexecutionunitasdescribedintheconfluentSISDsection.Thisunitisthensharedbynyirtualmachinesoperatingonprogramsequencesindependentofoneanother.Eachvirtualmachinehasaccesstotheexecutionhardwareoncepercycle.Eachvirtualmachine,ofcourse,hasitsownEXECUTION1UNITSUINSTRUCTIONUNITLIMITEDWMMUNICATION4BUSESARETIMESHAREDMULTIPLYBUSIUNIT9INSTRUCTIONIIIIDAlAdHlSTORAGEEXECUTIONUNITEXECUTIONUNITEXECUTIONSOURCEDATASTREAYDERIVEDDATISTREAYcFig.7.aSIMD.bMISD.ConvertstoMIMDifinstructionsanddataareprivatelymaintainedtogether.cMISD.privateinstructionmemoryandinteractionbetweeninstructionstreamsoccursonlyviathecommondatamemory.Presumably,ifthereareNinstructionunitsthenthebandwidthofthecommondatastoragemustbeNtimesgreaterthantheindividualinstructionstorage.ThisrequirementcouldbesubstantiallyreducedbyuseofamodifiedversionofTomasulostagforwardingalgorithmandaseparatecommonforwardingbus.AnotherversionofthisFig.7cwouldforceforwardingofoperands.ThusthedatastreampresentedtoExecutionUnit2istheresultantofExecutionUnit1operatingitsinstructiononthesourcedatastream.Theinstructionthatanyunitperformsmaybefixedspecializedsuchthattheinterconnectionorsetupofunitsmustbeflexible,semifixedsuchthatthefunctionofanyunitisfixedforonepassofadatafileorvariablesothatthestreamofinstructionsoperatesatanypointonthesingledatastream.Undersuchanarrangement,onlythefirstexecutionunitseesthesourcedatastreamandwhileitisprocessingtheithoperandtheithexecutionunitisprocessingtheithderivationofthefirstoperandofthesourcestream.D.MIMDIfwereconstructtheorganizationofFig.7bsothatthedataandinstructionstreamsaremaintainedtogetherinprivatememories,wehaveanexampleofMIMD.Thereisnointeractionorminimuminteractionallowedbetweenthesevirtualmachines.Solongasthelatenttimeforexecutionofanyoperationislessthanthememorycycle,noproblemsariseduetobranching.Thus,suchanarrangementallowsmaximumadvantageoftheallowablebandwidthsofexecutionunitandmemoryunit.Ofcourse,suchanapproachmightwellbecriticizedonthebasisthattherequirementofindependenceoftheinstructionsequencesdoesnotaddressitselftotherequirementsoflargescientificproblems.Itwouldbebettersuitedtotheneedsofthetimesharingand/orutilityenvironment.ThisrestrictedMIMDpointsuptheshortcomingofourorganizationaldefinitions.Thespecificationsfailtoincludeaclassdicationofhowthestreamsmayinteract,thusarestrictedMIMDmaybeorganizationallymuchsimplerthanaconfluentSISD.GeneralMIMDstructureshavebeenmorewidelydescribed20E23withlargescalemultiplicitybeingenvisionedbyHolland21andmorerestrictedimplementationsbeingundertakenbyBurroughsandUnivac23.Inhisoriginalproposal,Hollandconsidersanarrayofprocessorseachwithaonewordstorage.Themodulesareindependentandarecapableofconcurrentexecution.Theprocessorshavearithmeticabilitybutcommunicationislimitedbytheirneedtobuildapathtotheappropriateoperand.Inpathbuilding,thedisplacementanddirectionoftheoperandisspecifiedandtheinterveningmoduleprocessorsformavinculum.SuchanarrangementmightsolvesomeoftheessentialblockageproblemsofSEDandSIMD,sinceindependentprogramsegmentsproceedsimultaneously.Also,memorybandwidthisalwaysadequate.1966FLYNNVERYHIGHSPEEDCOMPUTING1909ThedifficultieswithsuchanarrangementofMIMDtime,andthismaybefurtherdegradedbytheoccurrenceofgenerallyincluder201conditionalbranchinstructions.1interconnectionsbetweenunits,whichposeserious2theuniversalnatureoftheindividualmodule,which3theclassofproblemswhichcouldutilizeMIMDinterferenceproblemslimitsitsefficiencyorganization,whchispresentlysmall.SYSTEMSREQUIREMENTSTheeffectivenessoftheoverallcomputingprocessesmustbemeasuredonabasismuchlargerthanperformancenanosecondsperinstructionalone.Evenonthisprimitivebasis,itisobviousthatdifferentinstructionrepertoiremayimplysubstantiallydifferenteffectivenesswiththesameaveragenanosecondperinstructionratio.Despiteouroriginalassumptionthattheverylargeproblemdoesnotemploysignificantinput/output,itis.soonrealizedthatthis.requirementisunduly.restrictive.Presently,highspeedsystemsremainsolimited.Itmaybesometimebeforebroadereffectiveutilizationmayberealizedthroughnewdevelopmentsininput/outputequipment.Ofcourse,theoverallmeasureofefficiencyoftheprocessoristhenumberofcorrectlycompletedcomputationsandprogramsdoneoveranextendedperiodoftime.Includedinthismeasureisthereliability.andthemaintainabilityofthesystem.Complexsystemswithverylargenumbersofcomponentsarenaturallyverydifficulttomaintain,unlessfeatureswhichprovidefaultlocationareincluded.Amajorcomponentofsuchfeatureswouldbecheckingorfaultdetection.ofalloperationsanddatatransfers.Ondetectionoferror,hardwareaideddiagnosticsshouldbeprovidedsothatservicingandmaintenancemightbereadilyandeasilyaccomplished.Ifcheckingisnotincludedinthehardware,thenitis,ofcourse,incumbentontheusertoprogramathoroughcheckofhisresults.This,ofcourse,representsanoverheadwhichpenalizestheeffectiveperformanceandutilizationoftheequipment.Noticethatsomeofthesuggestedorganizationscater.torestrictiveproblemsetsparticularlytheSIMD,wheremultipleoperandsareexecutedbythesameinstructionstream.Suchorganizationsareclearlylimitedforgeneralpurposeprocessingwheretheremaybemanyinteractionsbetweenoperandelements.Similarly,withtheMISDorganizations,intheorganizationshownonFig.7b,anumberofindependentinstructionstreamsareinvolvedandtheirveryindependenceprecludesMISDuseononelargeproblemcomposedofessentialdependenciesthis,ofcourse,istypicaloflargescalescientificprogramming.CONCLUSIONSConventionalSingleInstructionStreamSingleDataStreamSISDprocessorsmaybeenhancedbyconcurrencyconfluenceofinstructionhandling,operandacquisition,andexecution.Thereis,however,alimitationoftheorderoftheexecutionofoneinstructionperinstructiondecodingBymultiplexingtheinstructionstreamorthedatastreamorboth,newclassesofprocessorswhichdonotnecessarilysharethislimitaredeveloped.Theireffectivenessdependsonthenatureoftheproblemanditisanopenquestionastowhethernewalgorithmswhichwillservetoextendtheirusefulnesscanbedeveloped.ACKNOWLEDGMENTTheauthorisindebtedtoD.JacobsohnandR.AschenbrenneroftheArgonneNationalLaboratoryforseveralvaluablediscussionsonsomeofthematerial.REFERENCESIG.M.AmdahlandM.J.Flynn,Engineeringaspectsoflargehghspeedcomputerdesign,Proc.Symp.onMicroelectronicsandLargeSystems.Washington,D.C.Spartan,1965,pp.7795.2D.W.Anderson,F.J.Sparacio,andR.M.Tomasulo,TheModel91machinephilosophyandinstructionhandling,IBMJ.Res.andDev.,November1966.3L.J.Boland,G.D.Granito,A.U.Marcotte,B.M.Messina,andJ.W.Smith,TheModel91storagesystem,ibid.4R.M.Tomasulo,Anefficientalgorithmforautomaticexploitationofmultipleexecutionunits,ibid.5S.F.Anderson,J.Earle,R.E.Goldschmidt,andD.M.Powers,TheModel91executionunit,ibid.6L.W.Cotton,Circuitimplementationofhighspeedpipelinesystems,1965Proc.AFIPSFJCC,p.489.7W.Buchholz,.Ed.,PlanningaComputerSystem.NewYorkMcCrawHill,1962.8J.E.Thomton,,ParalleloperationintheControlData6600,Proc.AFIPS1964FJCC,pt.11,pp.334.191I.Flores,Derivationofawaitingtimefactorforamultiplebankmemory,J.ACM,vol.1I,pp.265282,July1964.loM.Lehman,D.Senzig,andJ.Lee,Serialarithmetictechniques,Proc.AFIPSI965FJCC,pp.715725.llR.E.Goldschmidt,Analgorithmforhighspeeddivision,M.S.thesis,Mass.Inst.Tech.,Cambridge,June1965.12C.S.Wallace,Asuggestionfor4fastmultiplier,IEEETrans.onElectronicComputers,vol.EC13,pp.1417,February1964.13D.Jacobsohn,Asuggestionforafastmultiplier,IEEETlans.onElectronicComputersCorrespondence,vol.EC13,p.754,December1964.14S.H.Unger,Acomputerorientedtowardspatialproblems,Proc.IRE,pp.174,October1958.I51D.L.Slotnick,W.C.Borch,andR.C.McReynolds,TheSolomonComputerapreliminaryreport,Proc.I962WorkshoponComputerOrganization.Washington,D.C.Spartan,1963,pp.692.161B.A.CraneandJ.A.Githens,BulkprocessingindistributedlogicMemory,IEEETrans.onElectronicComputers,vol.EC14,pp.186196,April1965.17H.Hellerman,Parallelprocessingofalgebraicexpressions,IEEETrans.onElectronicComputers,vol.EC15,pp.8291,February1966.I81D.N.SenzigandR.V.Smith,ComputerorganizationforarrayprocessingProc.AFIPSI965FJCC,pp.117129.I91R.AschenbrennerandG.Robinson,Intrinsicmultiprocessing,ArgonneNationalLaboratory,Argonne,Ill.,ANLTech.Memo.121,June1964.20W.Comfort,Highlypamllelmachines,Proc.I962WorkshoponComputerOrganization.Washington,D.C.Spartan,1963,pp.126155.21J.H.Holland,Auniversalcomputercapableofexecutinganarbitrarynumberofsubprogramssimultaneously,I959Proc.EJCC,pp.108113.22R.Reiter,Astudyofamodelforparallelcomputation,UniversityofMichigan.AnnArbor,Tech.Rept.,ISL654,July1965.23D.R.LewisandG.E.Mellen,StretchingLARCscapabilityby1anewmultiprocessorsystem,presentedatthe1964Symp.onMicroelectronicsandLargeSystems,Washington,D.C.

注意事项

本文(43-Very High-Speed Computing Systems.pdf)为本站会员(baixue100)主动上传,人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知人人文库网([email protected]),我们立即给予删除!

温馨提示:如果因为网速或其他原因下载失败请重新下载,重复下载不扣分。

copyright@ 2015-2017 人人文库网网站版权所有
苏ICP备12009002号-5