Tutorial: Compilation System for Throughput-driven Multi-core Network Processors

MICRO-37 Tutorial: Compilation System for Throughput-driven Multi-core Network Processors
Michael K. Chen, Erik Johnson, Roy Ju
michael.k.chen, erik.j.johnson
Corporate Technology Group, Intel Corp.
December 5, 2004

Slide 2: Agenda
  • Project Overview
  • Domain-specific Language
  • High-level Optimizations
  • Code Generation and Optimizations
  • Performance Characterization
  • Runtime Adaptation
  • Summary

Project Overview
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Slide 4: Outline
  • Problem Statement
  • Overview of Shangri-la System
  • Status and Teams

Slide 5: The Problem
  • Packet processing application
  • State of the art:
      - Hand-tuned code gives maximal performance but is often error-prone and not scalable
      - Static resource allocation is often tailored to one particular workload and is not flexible to varying workloads and hardware

Slide 6: Shangri-La Overview
  • Mission: research an industry-leading programming environment for packet processing on Intel chip multiprocessor (CMP) silicon
  • Challenges:
      - Hide architectural details from programmers
      - Automate allocation of system resources
      - Adapt resource allocation to match dynamic traffic conditions
      - Achieve performance comparable to hand-tuned systems
  • Technology:
      - Language: enable portable packet processing applications
      - Compiler: automate code partitioning and optimizations
      - Run-time System: adapt to dynamic workloads

Slide 7: Architectural Features of Intel IXP Processor
  • Heterogeneous multi-core: Intel XScale processor (control) and MicroEngines (data)
  • Memory hierarchy
      - Local memory (LM): distributed on MEs
      - No HW cache
      - Scratch, SRAM, DRAM: shared
      - Long memory latency
  • MicroEngine:
      - Single issue; deferred slots
      - Lightweight HW multi-threading
      - Event signals to synchronize threads
      - Multiple register banks and constraints as operands in instructions
      - Limited code store

Slide 8: Packet Processing Applications
  • Types of apps: IPv4 forwarding, L3-Switch, MPLS (Multi-Protocol Label Switching), NAT (Network Address Translation), Firewall, QoS (Quality of Service)
  • Characteristics of packet processing apps:
      - Performance metric: throughput (vs. latency)
      - Mostly memory bound
      - Large number of packets without locality
      - Smaller instruction footprint
      - Execution paths tend to be predictable

Slide 9: Anatomy of Shangri-La
[Diagram: language(s), compiler stages, and execution environment]
  • Baker Programming Language: modular language (with C-like syntax) to express applications as a dataflow graph
  • Profiler: extract run-time characteristics by executing the application
  • Pi Compiler: compiler optimizations for pipeline construction and data structure mapping/caching
  • Aggregate Compiler: code generation and optimization for heterogeneous cores
  • Run-time System: dynamically adapt mapping to match traffic fluctuations
  • Other diagram labels: Baker Compiler, Front-end, Profiling, Inter-Procedural Opt., Global Optimizations, Loop/Memory Opt., Code Generation, General-purpose Compiler

Slide 10: Baker Language
  • Familiar to embedded systems programmers
      - Syntactically "feels like" C
  • Simplifies the development of packet-processing applications
      - Hides architectural details
      - Single level of memory
      - Implicit threading model
      - Modular programming encapsulation
  • Domain-specific
      - Data flow model
      - Actors and interconnects (PPFs and channels)
      - Built-in types, e.g. packet
  • Enables the compiler to generate efficient code on target CMP hardware

Slide 11: Shangri-la Example
[Diagram: L3 Switch with RX, L2 Cls, L3 Fwdr, L2 Bridge, Eth Encap, TX; Profiler]
  • Modular, simple description

    module l3_switch
      module eth_rx, eth_tx;   // Built-in
      ppf l2_clsfr;
      module eth_encap_mod, l3_fwdr, l2_bridge;
      wiring
        eth_rx.eth0 -> l3_switch.l2_clsfr.input_chnl;
        l3_fwdr.input_chnl …

[…]

Packet Access Combining Example
  Separate accesses:
    … prot (off=72b, sz=8b)
    t1 = pkt->ttl (off=64b, sz=8b)
  Combined access:
    b  = read pkt (off=64b, sz=16b)
    t1 = ( b >> 8 ) & 0xff
    t2 = b & 0xff
  • Analysis overview
      - Isolate packet accesses
      - Perform checks to guarantee that packet accesses are combined safely
      - Validate the range and size of the combined memory access
      - Replace combined accesses with accesses to/from Local Memory or transfer registers

Topics: Automatic Program Partitioning, Packet Handling Optimizations, Memory Hierarchy Optimizations

Slide 46: Packet Handling Optimizations (2/5): Static Offset and Alignment Resolution (SOAR)
  • Generic packet accesses
      - Can handle arbitrary layering of protocols and arbitrary field offsets
      - Clearly simplify the programmer's task
  • But dynamic offset and alignment determination adds significant overhead
      - Dynamic offset handling adds 20+ instructions per packet access
      - Dynamic alignment adds several instructions per packet access
[Diagram labels: offset( src_ip ) = 26B; offset( src_ip ) = ?; packet_encap; packet_decap]

Slide 47: Packet Handling Optimizations (3/5): Static Offset and Alignment Resolution (SOAR)
  • Statically resolved packet field alignment eliminates a few instructions
  • A packet field whose offset and alignment are both statically resolved can be accessed with a few instructions
  • Implemented using custom dataflow analysis
[Diagram: Rx and Tx around modules l3_switch.m, l2_bridge.m, l3_fwdr.m, eth_encap.m; PPFs l2_cls.p, bridge.p, lpm_lookup.p, options_processor.p, icmp_processor.p, encap.p, l3_cls.p, arp.p; header-stack labels: Eth IP, Copy Eth, IP Eth, New ICMP IP, Copy IP ICMP IP, Eth Arp; resolution counts: 2/2 resolved, 1/1 resolved, 18/18 resolved, 3/3 resolved]

Slide 48: Packet Handling Optimizations (4/5): Eliminate Unnecessary Packet Primitives in Code
  • Eliminate unnecessary packet_encap and packet_decap primitives
      - Balanced packet_encap and packet_decap in the same aggregate can be eliminated because they have no external effect
      - Works in conjunction with SOAR analysis results
  • Convert metadata accesses into local memory accesses when all uses are within the same aggregate
      - Private uses of metadata have no external effect
      - A metadata access is composed of 1+ SRAM accesses and 20+ instructions
      - Candidate accesses can be identified with def-use analysis

Slide 49: Packet Handling Optimizations (5/5): Global Data Memory Mapping
  • Collect dynamic access frequencies to shared global data structures
  • Map data structures to appropriate memory levels
      - Map small, frequently accessed data structures to Scratch Memory
      - Otherwise, place in SRAM
  • Pointers may point to objects in different levels of memory
      - Perform congruence analysis to allocate such objects to a common memory level

Slide 50: Memory Hierarchy Optimizations (1/6): Delayed-Update Software-Controlled Caches
  • Cache unprotected global data structures
      - Since these structures are not protected by locks, assume that they can tolerate delayed update
      - Delayed update result …
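
The packet access combining example (two 8-bit reads at bit offsets 64 and 72 served by one 16-bit read, then separated with a shift and a mask) can be modeled in a few lines. The following is only an illustrative Python sketch, not the Shangri-La compiler's implementation: the names `read_bits` and `combine_accesses`, the `(name, offset, size)` field representation, the 32-bit width limit, and the choice to measure offsets from the start of the byte buffer are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical field descriptor: (name, bit offset, bit size).
Field = Tuple[str, int, int]


def read_bits(pkt: bytes, off: int, size: int) -> int:
    """Read `size` bits starting at bit offset `off`, in network (big-endian) order."""
    total_bits = len(pkt) * 8
    if off < 0 or size <= 0 or off + size > total_bits:
        raise ValueError("access outside packet")
    as_int = int.from_bytes(pkt, "big")
    return (as_int >> (total_bits - off - size)) & ((1 << size) - 1)


def combine_accesses(pkt: bytes, fields: List[Field], max_bits: int = 32) -> Dict[str, int]:
    """Serve several field reads from one wide read, then shift and mask.

    `max_bits` stands in for the range/size validation step; the value 32
    is an assumption for this sketch, not a figure from the slides.
    """
    lo = min(off for _, off, _ in fields)
    hi = max(off + size for _, off, size in fields)
    width = hi - lo
    if width > max_bits:
        raise ValueError("combined access too wide")
    b = read_bits(pkt, lo, width)  # the single combined access
    out = {}
    for name, off, size in fields:
        shift = hi - (off + size)  # bits to the right of this field inside b
        out[name] = (b >> shift) & ((1 << size) - 1)
    return out


# Offsets taken from the slide: ttl at 64b, prot at 72b, 8 bits each.
pkt = bytes(8) + bytes([64, 6]) + bytes(10)
values = combine_accesses(pkt, [("ttl", 64, 8), ("prot", 72, 8)])
```

For the two fields above, the sketch issues one 16-bit read at offset 64b and extracts `ttl` with a shift of 8 and `prot` with a shift of 0, matching the `( b >> 8 ) & 0xff` and `b & 0xff` expressions on the slide.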
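
The global data memory mapping rules (profile access frequencies, place small and frequently accessed structures in Scratch, place the rest in SRAM, and keep objects that one pointer may reach on a common level) can be sketched as a simple greedy heuristic. This is a hypothetical illustration under stated assumptions, not the algorithm used in Shangri-La: the `GlobalObj` record, the `map_globals` function, the access-density threshold, the scratch capacity parameter, and the use of union-find to form congruence classes are all choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class GlobalObj:
    name: str
    size: int      # bytes, assumed > 0
    accesses: int  # profiled dynamic access count


def _find(parent: Dict[str, str], x: str) -> str:
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def map_globals(objs: List[GlobalObj],
                may_alias: Iterable[Tuple[str, str]],
                scratch_bytes: int,
                hot_per_byte: float = 1.0) -> Dict[str, str]:
    """Assign each object to "scratch" or "sram".

    Objects joined by `may_alias` (reachable through the same pointer) form
    one congruence class and always receive the same level.
    """
    parent = {o.name: o.name for o in objs}
    for a, b in may_alias:
        ra, rb = _find(parent, a), _find(parent, b)
        if ra != rb:
            parent[ra] = rb

    classes: Dict[str, List[GlobalObj]] = {}
    for o in objs:
        classes.setdefault(_find(parent, o.name), []).append(o)

    def density(cls: List[GlobalObj]) -> float:
        return sum(o.accesses for o in cls) / sum(o.size for o in cls)

    placement: Dict[str, str] = {}
    free = scratch_bytes
    # Hottest classes (accesses per byte) get first pick of Scratch.
    for cls in sorted(classes.values(), key=density, reverse=True):
        size = sum(o.size for o in cls)
        if size <= free and density(cls) >= hot_per_byte:
            level = "scratch"
            free -= size
        else:
            level = "sram"
        for o in cls:
            placement[o.name] = level
    return placement


# Made-up profile data for illustration only.
objs = [
    GlobalObj("ttl_table", 256, 100_000),
    GlobalObj("route_table", 1 << 20, 500_000),
    GlobalObj("stats_a", 64, 50_000),
    GlobalObj("stats_b", 64, 10),
]
placement = map_globals(objs, may_alias=[("stats_a", "stats_b")], scratch_bytes=16 * 1024)
```

In this made-up profile, `stats_b` is rarely accessed on its own, but it shares a congruence class with `stats_a`, so both land on the same level; the large `route_table` does not fit in the assumed Scratch budget and goes to SRAM.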
