Document summary

Computer Architecture and Program Optimization
Introduction to Intel 64 Architectures Optimization

Main Purpose
  Introduction to processor architecture
  Introduction to SIMD instructions (SSE /

max(A,B)
  With branches:
    cmp A, B          ; Condition
    jbe L30           ; Conditional branch
    mov ebx, A        ; ebx holds X
    jmp L31           ; Unconditional branch
  L30:
    mov ebx, B
  L31:
  Without branches:
    xor ebx, ebx      ; Clear ebx
    cmp A, B
    setle bl          ; ebx = 0 or 1 (or use the complement condition)
    sub ebx, 1        ; ebx = 11...11 or 00...00
    and ebx, A-B      ; ebx = A-B or 0
    add ebx, B        ; ebx = A or B

Branch Prediction
  Spin-Wait and Idle Loops
  All branch targets should be 16-byte aligned.
  Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the loop's execution time.

Fetch
  ... i < BUFF_SIZE; i++) sum += buff[i];

Sandy Bridge only
Traversing through pointers
L1D Cache Bank Conflict
L1D Cache Bank Conflict (continued)
Minimize Register Spills

Data Layout Optimizations
  Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.

Decomposing an Array

Locality Enhancement
  Optimization techniques such as blocking, loop interchange, loop skewing, and packing are best done by the compiler.
  Optimize data structures either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations in the compiler to enhance locality for nested loops.

Minimizing Bus Latency
  If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance.
  Software should favor data access patterns that result in higher concentrations of cache miss patterns.

Non-Temporal Store Bus Traffic
  The data transfer rate for bus write transactions is higher if 64 bytes are written out to the bus at a time.

Prefetching
  First-Level Data Cache Prefetching
  Avoid Fetching Unneeded Lines
  Prefetching for 2-Level Cache

1st-Level DCache Prefetching
Avoid Fetching Unneeded Lines

For L1 Hardware Prefetch
  Method 1: Organize the data so consecutive accesses can usually be found in the same 4-KByte page. Access the data in constant strides, forward or backward (IP prefetcher).
  Method 2: Organize the data in consecutive lines. Access the data in increasing addresses, in sequential cache lines.

Prefetching for 2-Level Cache
  Streamer: loads data or instructions from memory to the second-level cache. To use the streamer, organize the data in blocks of 128 bytes, aligned on 128 bytes.

Example of Latency Hiding
  Memory Access Latency and Execution Without Prefetch

Example of Latency Hiding
  Memory Access Latency and Execution With Prefetch

Spread Prefetch Instructions
  Rearranging PREFETCH instructions may yield a noticeable speedup for code that stresses the cache resource.

Multi-core 2950 Tick 48 bit; max latency 15000 tick

Using bit wizardry
  Matters Computational: Ideas, Algorithms, Source Code, Jörg Arndt
  Hacker's Delight, Henry S. Warren, Jr.
  HAKMEM (AIM-239), MIT
  QuadCore Intel Core 2 Quad Q9550, 2833 MHz
  Throughput 3.12 Gbit/s
  Break out throughput: 1090 Tick 288 bit; 212 Tick 48 bit; max latency 1200 tick

Look up table
  QuadCore Intel Core 2 Quad Q9550, 2833 MHz
  Throughput 19.1 Gbit/s
  Break out throughput: 280 Tick 288 bit; 68 Tick 48 bit; max latency 500 tick
  A Painless Guide to CRC Error Detection Algorithms, Index V3.00, Ross N. Williams

Decoder
  Viterbi Algorithm
  Original Program
  C Optimization
  SIMD Optimization

Viterbi Algorithm
Original Program
  QuadCore Intel Core 2 Quad Q9550
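The max(A,B) slide in the summary above replaces a conditional jump with a mask computed from the comparison result. A minimal C sketch of the same idea follows; the function names are hypothetical, and unsigned operands are an assumption (the slide does not state the operand types).

```c
#include <stdint.h>

/* Branching form: one conditional jump, as in the cmp/jbe sequence. */
static uint32_t max_branch(uint32_t a, uint32_t b) {
    return (a <= b) ? b : a;
}

/* Branch-free form: mirrors setle / sub 1 / and / add.
 * (a <= b) is 1 or 0; subtracting 1 gives 0 or all ones. */
static uint32_t max_branchless(uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)(a <= b) - 1u; /* 0 if a <= b, else 0xFFFFFFFF */
    return b + ((a - b) & mask);             /* b + 0, or b + (a - b) = a */
}
```

Modern compilers often emit a conditional move for the first form on their own, so the hand-written mask is mainly worth trying when profiling shows a poorly predicted branch.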
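The loop-unrolling guideline and the `sum += buff[i]` fragment in the summary above can be illustrated with a short sketch. The function names, the unroll factor of 4, and the use of separate accumulators are assumptions chosen for illustration, not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: one compare, one branch, and one increment per element. */
static uint64_t sum_simple(const uint32_t *buff, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buff[i];
    return sum;
}

/* Unrolled by 4: the loop overhead is paid once per four elements.
 * Four independent accumulators also shorten the dependency chain. */
static uint64_t sum_unrolled4(const uint32_t *buff, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += buff[i];
        s1 += buff[i + 1];
        s2 += buff[i + 2];
        s3 += buff[i + 3];
    }
    for (; i < n; i++)  /* remainder when n is not a multiple of 4 */
        s0 += buff[i];
    return s0 + s1 + s2 + s3;
}
```

Optimizing compilers may unroll or vectorize the simple loop themselves, so the manual version is best validated by measurement.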
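The "Decomposing an Array" item in the summary above refers to splitting an array of structures into per-field arrays, so that a loop touching one field reads only densely packed, useful bytes. The sketch below is illustrative; the struct layouts, field names, and `N_POINTS` are hypothetical.

```c
#include <stddef.h>

#define N_POINTS 1024  /* hypothetical capacity */

/* Array of structures: each x is followed by 12 bytes the loop does not need. */
struct point_aos {
    float x, y, z;
    int id;
};

/* Structure of arrays: all x values are contiguous in memory. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
    int id[N_POINTS];
};

/* Uses one quarter of every cache line it loads. */
static float sum_x_aos(const struct point_aos *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Uses every byte of every cache line it loads. */
static float sum_x_soa(const struct points_soa *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```

The decomposed layout also tends to suit SIMD loads, since consecutive elements of one field sit next to each other.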
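The latency-hiding slides in the summary above contrast execution with and without prefetch. A minimal sketch of software prefetching follows, assuming a GCC- or Clang-compatible compiler (for the `__builtin_prefetch` builtin); the function name and `PREFETCH_DISTANCE` are hypothetical, and the distance has to be tuned by measurement.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 16  /* hypothetical: elements to fetch ahead */

static uint64_t sum_with_prefetch(const uint32_t *buff, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Request a later element so the load overlaps with the current
         * work. Arguments: address, 0 = read access, 3 = high locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&buff[i + PREFETCH_DISTANCE], 0, 3);
        sum += buff[i];
    }
    return sum;
}
```

A purely sequential scan like this one is usually handled well by the hardware prefetchers described in the slides; explicit prefetches are more likely to pay off for irregular access patterns whose addresses are known in advance.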
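The summary above lists a "bit wizardry" variant and a "look up table" variant next to a CRC reference. Assuming the benchmarked routine is a CRC (the slides shown here do not say which polynomial or width they use), the two approaches can be sketched with the common reflected CRC-32, polynomial 0xEDB88320. All names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0xEDB88320u  /* assumption: reflected CRC-32 polynomial */

/* Bit-at-a-time: eight shift/xor steps per input byte, without branches. */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (CRC32_POLY & (0u - (crc & 1u)));
    }
    return ~crc;
}

static uint32_t crc_table[256];

/* Precompute the effect of the eight steps for every possible byte value. */
static void crc32_init_table(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (CRC32_POLY & (0u - (c & 1u)));
        crc_table[n] = c;
    }
}

/* Table-driven: one lookup, one shift, and one xor per input byte. */
static uint32_t crc32_lookup(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc_table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}
```

`crc32_init_table()` must run once before `crc32_lookup()` is called. The table costs 1 KByte of first-level cache in exchange for roughly one eighth of the per-byte arithmetic.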
