Overview of Intel Core 2 Architecture and Software Development Tools
May 2008

Overview of Architecture

Two Timings

Timing 1: the value of area starts at 11.667. Thread A adds 3.765, giving 15.432. Thread B then reads 15.432 and adds 3.563, giving 18.995.
Timing 2: the value of area starts at 11.667. Thread A adds 3.765, but Thread B reads the stale value 11.667 before Thread A stores 15.432. Thread B then adds 3.563 and stores 15.230.

Order of thread execution causes nondeterministic behavior in a data race.

The Private Clause

Reproduces the variable for each thread
Variables are uninitialized; a C++ object is default constructed
Any value external to the parallel region is undefined

Can you spot the race condition?

    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }

Fix shown on the slide: add private(x,y) to the pragma.

Scheduling Clause

The schedule clause affects how loop iterations are mapped onto threads.

schedule(static [,chunk])
    Blocks of iterations of size "chunk" are assigned to threads
    Round-robin distribution
schedule(dynamic [,chunk])
    Threads grab "chunk" iterations
    When done with its iterations, a thread requests the next set
schedule(guided [,chunk])
    Dynamic schedule starting with a large block
    The size of the blocks shrinks, but never below "chunk"

    #pragma omp parallel for private (gP) schedule (static, 8)
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) gP++;
    }

Lab 4 Mandelbrot Scheduling

Objective: create a parallel version of mandelbrot. Analyze with VTune to look for load imbalance. Modify the code to add OpenMP clauses that diminish the load imbalance and improve performance.
Follow the next Mandelbrot activity, called Mandelbrot Scheduling, in the student lab doc.

Work Queuing

Intel implementation; will be part of OpenMP 3.0 (slightly differently)
Independent tasks can execute concurrently
Create a queue of tasks
Works on recursive functions, linked lists, etc.

Serial:

    while (p != NULL) {
        do_work(p->data);
        p = p->next;
    }

Parallel:

    #pragma intel omp parallel taskq
    while (p != NULL) {
        #pragma intel omp task
        do_work(p->data);
        p = p->next;
    }

Optional Lab 5 Linked List Task Queue

Objective: use VTune to identify where to parallelize a pointer-chasing code, then modify the code to implement a task queue that parallelizes the application.
Follow the Linked List Task Queue activity, called LinkedListTaskQ, in the student lab doc.
Note: there is also a companion lab, LinkedListWorkSharing, that uses worksharing to solve the same problem. There are also taskq labs on recursive functions, for example quicksort [...]

Unit Stride Memory Access (C/C++)

    for (i = 0; i < NUM; i++)
        for (j = 0; j < NUM; j++)
            for (k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

    for (i = 0; i < NUM; i++)
        for (k = 0; k < NUM; k++)
            for (j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

Fast Loop Index
Non-unit-stride skipping through memory can cause cache thrashing, particularly for array sizes of 2^n.

Poor Cache Utilization - with Eggs

Pan ready to fry eggs
Carton represents a cache line
Refrigerator represents main memory
Table represents cache
When the table fills up, old cartons are evicted and most eggs are wasted.
A request for an egg not already on the table brings a new carton of eggs from the refrigerator, but the user fries only one egg from each carton. When the table fills up, an old carton is evicted.
User requests one specific egg
User requests a 2nd specific egg
User requests a 3rd egg; carton evicted

Good Cache Utilization - with Eggs

Previous user had used all eggs on the table
Carton eviction doesn't hurt us because we've already fried all the eggs in the cartons on the table, just like the previous user.
User requests eggs 1-8
User requests eggs 9-16
User eventually asks for all the eggs
A request for one egg brings a new carton of eggs from the refrigerator.
User specifically requests eggs from a carton already on the table.
User fries all eggs in a carton before an egg from the next carton is requested.

Lab 7 Matrix Multiply Cache Effects

Objective: explore the impact of poor cache utilization on performance with VTune Analyzer, and explore how to manipulate loops to achieve significantly better cache utilization and performance.
Follow the Matrix Multiply Cache Effects lab in the student lab doc. Set VTune Analyzer to collect samples on the counters MEM_LOAD_RETIRED.L2_MISS and RESOURCE_STALLS.

Optional Lab 8 False Sharing

Objective: explore false sharing with VTune Analyzer to learn which counters can be used to identify this issue. Manipulate the baseline code to remove the false sharing.
Follow the False Sharing activity in the student lab doc. Set VTune Analyzer up to collect samples on a ratio called "Modified Data Sharing Ratio".

BACKUP

Lab 6 Essentials of Vectorization

Objective: Explore how auto vectorization can dramatica
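
The Two Timings slide can be replayed without any threads by writing out each interleaving by hand. The sketch below is only an illustration: the function names and the split into read and store steps are assumptions, and only the numbers come from the slide.

```c
#include <assert.h>
#include <math.h>

/* Timing 1: Thread B reads area only after Thread A has stored its update. */
double timing_one(double area, double a_part, double b_part) {
    double a_local = area;       /* Thread A reads 11.667 */
    area = a_local + a_part;     /* Thread A stores 15.432 */
    double b_local = area;       /* Thread B reads 15.432 */
    area = b_local + b_part;     /* Thread B stores 18.995 */
    return area;
}

/* Timing 2: Thread B reads the stale value before Thread A stores. */
double timing_two(double area, double a_part, double b_part) {
    double a_local = area;       /* Thread A reads 11.667 */
    double b_local = area;       /* Thread B also reads 11.667 */
    area = a_local + a_part;     /* Thread A stores 15.432 */
    area = b_local + b_part;     /* Thread B overwrites it with 15.230 */
    return area;
}

/* Self-check against the values printed on the slide. */
void check_two_timings(void) {
    assert(fabs(timing_one(11.667, 3.765, 3.563) - 18.995) < 1e-9);
    assert(fabs(timing_two(11.667, 3.765, 3.563) - 15.230) < 1e-9);
}
```

In the second interleaving Thread A's contribution of 3.765 is lost, which is why the two results differ by exactly that amount.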
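
The race on the Private Clause slide comes from x and y being shared temporaries. A minimal sketch of the fix, assuming float arrays and a hypothetical wrapper function `add_arrays` (neither is given on the slide):

```c
#include <assert.h>

/* Element-wise sum c[i] = a[i] + b[i], using the slide's temporaries x and y. */
void add_arrays(const float *a, const float *b, float *c, int n) {
    float x, y;   /* declared outside the loop, so shared by default */
    int i;
    assert(n >= 0);
    /* private(x, y) gives every thread its own uninitialized copies,
       so no thread can overwrite another thread's temporaries. */
    #pragma omp parallel for private(x, y)
    for (i = 0; i < n; i++) {
        x = a[i];
        y = b[i];
        c[i] = x + y;
    }
}
```

The loop index i needs no clause because the index of a parallel for loop is made private automatically. Compiled without OpenMP support, the pragma is ignored and the function runs serially with the same result.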
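
The prime-counting loop on the Scheduling Clause slide can be turned into a complete function. This sketch makes two substitutions that are not on the slide: it uses reduction(+:gP) in place of private(gP) so that the per-thread counts are summed into the result, and it picks schedule(dynamic, 8) because later iterations test larger numbers and cost more. The body of `TestForPrime` is a hypothetical trial-division stand-in.

```c
#include <assert.h>

/* Hypothetical stand-in for the slide's TestForPrime(): trial division. */
static int TestForPrime(int n) {
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++) {
        if (n % d == 0) return 0;
    }
    return 1;
}

/* Counts the primes among start, start + 2, ... up to end (start must be odd). */
int count_odd_primes(int start, int end) {
    int gP = 0;
    assert(start % 2 != 0);
    /* Threads grab 8 iterations at a time; each thread counts into its own
       copy of gP and the copies are added together at the end of the loop. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:gP)
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) gP++;
    }
    return gP;
}
```

Swapping the clause for schedule(static, 8) or schedule(guided, 8) changes only how iterations are handed out, not the count returned.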
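
The Work Queuing slide notes that the taskq extension would appear in OpenMP 3.0 in a slightly different form. A minimal sketch of the linked-list loop using the OpenMP 3.0 task construct follows; the `node` type, the in-place squaring in `do_work`, and passing the node itself instead of p->data are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the slide does not define the type. */
typedef struct node {
    int data;
    struct node *next;
} node;

/* Hypothetical per-node work: square the payload in place. */
static void do_work(node *p) {
    p->data = p->data * p->data;
}

/* One thread walks the list and creates a task per node;
   the other threads in the team execute the tasks. */
void process_list(node *head) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (node *p = head; p != NULL; p = p->next) {
                /* firstprivate(p) captures the current node for this task */
                #pragma omp task firstprivate(p)
                do_work(p);
            }
        }
    }
}

/* Self-check on a three-node list. */
void check_process_list(void) {
    node n3 = {3, NULL};
    node n2 = {2, &n3};
    node n1 = {1, &n2};
    process_list(&n1);
    assert(n1.data == 1 && n2.data == 4 && n3.data == 9);
}
```

Each task touches only its own node, so no synchronization is needed beyond the implicit barrier at the end of the parallel region.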
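
The two loop nests on the Unit Stride Memory Access slide compute the same product and differ only in which index moves fastest. A small sketch that runs both orders and confirms they agree; the size NUM = 4 and the test data are arbitrary choices for illustration, and timing differences only become visible at much larger sizes.

```c
#include <assert.h>

#define NUM 4   /* small illustrative size, not taken from the lab */

/* i-j-k order: the inner loop reads b[k][j] down a column (stride NUM). */
void matmul_ijk(double a[NUM][NUM], double b[NUM][NUM], double c[NUM][NUM]) {
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            for (int k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* i-k-j order: the inner loop reads b[k][j] and c[i][j] along a row (unit stride). */
void matmul_ikj(double a[NUM][NUM], double b[NUM][NUM], double c[NUM][NUM]) {
    for (int i = 0; i < NUM; i++)
        for (int k = 0; k < NUM; k++)
            for (int j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Self-check: both orders accumulate into a zeroed c and must match. */
void check_orders_agree(void) {
    double a[NUM][NUM], b[NUM][NUM];
    double c1[NUM][NUM] = {{0}}, c2[NUM][NUM] = {{0}};
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++) {
            a[i][j] = i + 2 * j + 1;
            b[i][j] = i - j;
        }
    matmul_ijk(a, b, c1);
    matmul_ikj(a, b, c2);
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            assert(c1[i][j] == c2[i][j]);
}
```

Because C stores arrays row by row, the second order touches consecutive elements in its inner loop, which corresponds to frying every egg in a carton before fetching the next one.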
