




Micro-37 Tutorial
Compilation System for Throughput-driven Multi-core Network Processors
Michael K. Chen, Erik Johnson, Roy Ju
michael.k.chen, erik.j.johnson
Corporate Technology Group, Intel Corp.
December 5, 2004

Agenda
- Project Overview
- Domain-specific Language
- High-level Optimizations
- Code Generation and Optimizations
- Performance Characterization
- Runtime Adaptation
- Summary

Project Overview
Part of the Shangri-la Tutorial presented at MICRO-37, December 5, 2004

Outline
- Problem Statement
- Overview of Shangri-la System
- Status and Teams

The Problem
- Packet processing application
- State of the art:
  - Hand-tuned code for maximal performance, but often error-prone and not scalable
  - Static resource allocation, often tailored to one particular workload; not flexible to varying workloads and hardware

Shangri-La Overview
Mission: research an industry-leading programming environment for packet processing on Intel chip multiprocessor (CMP) silicon
Challenges:
- Hide architectural details from programmers
- Automate allocation of system resources
- Adapt resource allocation to match dynamic traffic conditions
- Achieve performance comparable to hand-tuned systems
Technology:
- Language: enable portable packet processing applications
- Compiler: automate code partitioning and optimizations
- Run-time System: adapt to dynamic workloads

Architectural Features of Intel IXP Processor
- Heterogeneous multi-cores: Intel XScale processor (control) and MicroEngines (data)
- Memory hierarchy
  - Local memory (LM): distributed on MEs
  - No HW cache
  - Scratch, SRAM, DRAM: shared
  - Long memory latency
- MicroEngine:
  - Single issue; deferred slots
  - Lightweight HW multi-threading
  - Event signals to synchronize threads
  - Multiple register banks and constraints as operands in instructions
  - Limited code store

Packet Processing Applications
- Types of apps: IPv4 forwarding, L3-Switch, MPLS (Multi-Protocol Label Switching), NAT (Network Address Translation), Firewall, QoS (Quality of Service)
- Characteristics of packet processing apps:
  - Performance metric: throughput (vs. latency)
  - Mostly memory bound
  - Large number of packets without locality
  - Smaller instruction footprint
  - Execution paths tend to be predictable

Anatomy of Shangri-La
[Figure: Shangri-La system diagram, shown alongside the stages of a general-purpose compiler. General-purpose compiler labels: Language(s), Front-end, Profiling, Inter-Procedural Opt., Global Optimizations, Loop/Memory Opt., Code Generation, Execution environment.]
- Baker Programming Language: modular language (with C-like syntax) to express applications as a dataflow graph
- Baker Compiler (front-end)
- Profiler: extract run-time characteristics by executing application
- Pi Compiler: compiler optimizations for pipeline construction and data structure mapping/caching
- Aggregate Compiler: code generation and optimization for heterogeneous cores
- Run-time System: dynamically adapt mapping to match traffic fluctuations

Baker Language
- Familiar to embedded systems programmers
  - Syntactically "feels like" C
- Simplifies the development of packet-processing applications
  - Hides architectural details
    - Single level of memory
    - Implicit threading model
  - Modular programming encapsulation
- Domain-specific
  - Data flow model
  - Actors and interconnects (PPFs and channels)
  - Built-in types, e.g. packet
- Enables compiler to generate efficient code on target CMP hardware

Shangri-la Example
[Figure: L3 Switch dataflow graph. Labels: L3 Switch, L2 Cls, L3 Fwdr, L2 Bridge, Eth Encap, RX, TX, Profiler.]
Modular, simple description:
    module l3_switch
        module eth_rx, eth_tx; // Built-in
        ppf l2_clsfr;
        module eth_encap_mod, l3_fwdr, l2_bridge;
        wiring
            eth_rx.eth0 -> l3_switch.l2_clsfr.input_chnl;
            l3_fwdr.input_chnl

[...]

Packet Access Combining Example
Section: Automatic Program Partitioning, Memory Hierarchy Optimizations, Packet Handling Optimizations (2/5)
Separate accesses:
    t1 = pkt->ttl (off=64b, sz=8b)
    t2 = pkt->prot (off=72b, sz=8b)
Combined access:
    b = read pkt (off=64b, sz=16b)
    t1 = ( b >> 8 ) & 0xff
    t2 = b & 0xff
Analysis overview
- Isolate packet accesses
- Perform checks to guarantee packet accesses are combined safely
- Validate range and size of combined memory access
- Replace combined accesses with accesses to / from Local Memory / transfer registers

Static Offset and Alignment Resolution (SOAR)
Section: Packet Handling Optimizations (3/5)
- Generic packet accesses
  - Can handle arbitrary layering of protocols and arbitrary field offsets
  - Clearly simplifies programmers' tasks
- But dynamic offset and alignment determination add significant overheads
  - Dynamic offset handling adds 20+ instructions per packet access
  - Dynamic alignment adds several instructions per packet access
[Figure labels: offset( src_ip ) = 26B; offset( src_ip ) = ?; packet_encap; packet_decap.]

Static Offset and Alignment Resolution (SOAR)
Section: Packet Handling Optimizations (4/5)
- Statically resolved packet field alignment eliminates a few instructions
- Statically resolved packet field offset and alignment can be accessed with a few instructions
- Implemented using custom dataflow analysis
[Figure: L3 switch module graph annotated with SOAR results. Modules: l3_switch.m, l2_bridge.m, l3_fwdr.m, eth_encap.m. PPFs: l2_cls.p, bridge.p, lpm_lookup.p, options_processor.p, icmp_processor.p, encap.p, l3_cls.p, arp.p, plus Rx and Tx. Protocol labels: Eth IP; Copy Eth; IP Eth; New ICMP IP; Copy IP ICMP IP; Eth Arp. Resolution counts: 2/2 resolved, 1/1 resolved, 18/18 resolved, 3/3 resolved.]

Eliminate Unnecessary Packet Primitives in Code
Section: Packet Handling Optimizations (5/5)
- Eliminate unnecessary packet_encap and packet_decap primitives
  - Balanced packet_encap and packet_decap in the same aggregate can be eliminated because they have no external effect
  - Works in conjunction with SOAR analysis results
- Convert metadata accesses into local memory accesses when all uses are within the same aggregate
  - Private uses of metadata have no external effect
  - Metadata accesses are composed of 1+ SRAM accesses and 20+ instructions
  - Candidate accesses can be identified with def-use analysis

Global Data Memory Mapping
Section: Memory Hierarchy Optimizations (1/6)
- Collect dynamic access frequencies to shared global data structures
- Map data structures to appropriate memory levels
  - Map small, frequently accessed data structures to Scratch Memory
  - Otherwise, place in SRAM
- Pointers may point to objects in different levels of memory
  - Perform congruence analysis to allocate such objects to a common memory level

Delayed-Update Software-Controlled Caches
- Cache unprotected global data structures
  - Since these structures are not protected by locks, assume that they can tolerate delayed update
- Delayed update result
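
The packet access combining example earlier in the deck (two 8-bit reads at bit offsets 64 and 72 replaced by one 16-bit read followed by a shift and masks) can be checked with a small Python sketch. The helper `read_bits`, the sample buffer, and the big-endian, byte-aligned bit numbering are assumptions made for illustration only; they are not part of Baker or of the Shangri-la compiler.

```python
def read_bits(pkt: bytes, off: int, sz: int) -> int:
    """Hypothetical helper: read `sz` bits at bit offset `off` in network order.

    Assumes byte-aligned offsets and sizes, which holds for this example.
    """
    assert off % 8 == 0 and sz % 8 == 0
    start = off // 8
    return int.from_bytes(pkt[start:start + sz // 8], "big")


def separate_accesses(pkt: bytes) -> tuple:
    # Two independent packet reads, as in the uncombined code.
    t1 = read_bits(pkt, off=64, sz=8)   # ttl
    t2 = read_bits(pkt, off=72, sz=8)   # prot
    return t1, t2


def combined_access(pkt: bytes) -> tuple:
    # One 16-bit read covering both fields, then shift and mask.
    b = read_bits(pkt, off=64, sz=16)
    t1 = (b >> 8) & 0xFF
    t2 = b & 0xFF
    return t1, t2


# Made-up 12-byte buffer: bytes 8 and 9 hold ttl=64 and prot=6.
sample = bytes(8) + bytes([64, 6]) + bytes(2)
print(separate_accesses(sample) == combined_access(sample))  # prints True
```

The sketch only demonstrates that the two forms are equivalent for adjacent fields. The safety checks listed in the analysis overview (range and size validation, no intervening packet modification) are not modeled here.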