ACM Word Template for SIG Site

ABSTRACT
As network speeds continue to grow, new challenges in network processing are emerging. In this paper we first study the progress of network processing from a hardware perspective and show that the I/O and memory systems have become the main bottlenecks to performance improvement. Based on this analysis, we conclude that conventional solutions for reducing I/O and memory access latencies are insufficient to address the problem. Motivated by these studies, we propose an improved DCA scheme combined with an INIC, which introduces an optimized architecture, an innovative I/O data transfer scheme, and improved cache policies. Experimental results show that our solution reduces cycles by 52.3% and 14.3% on average for receiving and transmitting, respectively, and that I/O and memory traffic are significantly decreased. Moreover, we investigate the behavior of the I/O and cache systems during network processing and present some conclusions about the DCA method.

1. INTRODUCTION
Recently, many researchers have found that the I/O system is becoming the bottleneck to network performance in modern computer systems [1][2][3]. Designed to support compute-intensive applications, the conventional I/O system has obvious disadvantages for fast network processing, in which bulk data transfer dominates. Its two main problems, the lack of locality support and high latency, have been widely discussed [2][4]. To overcome these limitations, Intel proposed an effective solution called Direct Cache Access (DCA) [1]: network packets are delivered from the Network Interface Card (NIC) into the cache instead of into memory, reducing data access latency. Although the solution is promising, DCA has been shown to be insufficient for reducing access latency and memory traffic because of several limitations [3][5]. Another effective solution is the Integrated Network Interface Card (INIC), which is used in many academic and industrial processor designs [6][7]. The INIC was introduced to reduce the heavy cost of I/O register accesses in network drivers and of interrupt handling, but a recent report [8] shows that the benefit of an INIC is insignificant for state-of-the-art 10GbE network systems.

In this paper, we focus on efficient I/O system design for network processing on general-purpose processors (GPPs). Based on an analysis of existing methods, we propose an improved DCA combined with an INIC to reduce I/O-related data transfer latency.

The key contributions of this paper are as follows:
- We review network processing from a hardware perspective and point out that the I/O system and the last-level memory system have become the obstacle to performance improvement.
- We propose an improved DCA combined with an INIC for the I/O subsystem, addressing the inefficiency of the conventional I/O system.
- We give a framework for the improved I/O system architecture and evaluate the proposed solution with micro-benchmarks.
- We investigate I/O and cache behavior during network processing on the proposed I/O system.
The paper is organized as follows. Section 2 presents the background and motivation. Section 3 describes the improved DCA combined with an INIC and gives a framework for the proposed I/O system implementation. Section 4 introduces the experimental environment and methods, and then analyzes the results. Section 5 surveys related work. Finally, Section 6 compares our solution with existing technologies and draws conclusions.

2. Background and Motivation
In this section we first review the progress of network processing and the main bottlenecks to network performance improvement today. We then analyze the network system from the perspective of computer architecture and present the motivation of this paper.

2.1 Network Processing Review
Figure 1 illustrates the progress of network processing. Packets from the physical line are sampled by the Network Interface Card (NIC). The NIC performs address filtering and flow control, sends the frames to the socket buffer, and notifies the OS by interrupt to invoke network stack processing. When the OS receives the interrupt, the network stack accesses the data in the socket buffer and calculates the checksum. Protocol-specific operations are performed layer by layer in stack processing. Finally, data is transferred from the socket buffer to the application's user buffer; this operation is commonly performed by memcpy, a system function of the OS.

Figure 1. Network processing flow.

The time cost of network processing can be broken down into the following parts: interrupt handling, NIC driver, stack processing, kernel routines, data copy, checksum calculation, and other overheads. The first four parts are considered packet cost, which scales with the number of network packets. The rest are considered bit cost (also called data-touch cost), which is proportional to the total I/O data size. The proportions of these costs depend strongly on the hardware platform and the nature of the application. There are many measurements and analyses of network processing costs [9][10]. Generally, the kernel routine cost ranges from 10% to 30% of the total cycles, the driver and interrupt handling costs from 15% to 35%, the stack processing cost from 7% to 15%, and the data-touch cost from 20% to 35%. With the development of high-speed networks (e.g., 10/40 Gbps Ethernet), an increasing share of kernel routine, driver, and interrupt handling costs has been observed [3].
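To make the packet-cost/bit-cost distinction concrete, the following minimal C sketch walks through the receive steps described above. Every function name is a hypothetical placeholder, not a real driver or stack API, and the checksum is a simple byte-sum stand-in for the real Internet checksum.

```c
/*
 * Simplified receive path, annotated with the cost classes of
 * Section 2.1. All names are illustrative placeholders.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct packet {
    uint8_t data[1514];   /* one standard Ethernet frame */
    size_t  len;
};

/* Per-packet costs: run once per packet, regardless of its size. */
static void handle_interrupt(void)          { /* interrupt handling */ }
static void driver_refill_descriptors(void) { /* NIC driver work   */ }
static void stack_process_headers(struct packet *p) { (void)p; /* stack processing */ }
static void kernel_wakeup_receiver(void)    { /* kernel routine    */ }

/* Bit (data-touch) cost: scales with the total number of bytes. */
static uint16_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];                 /* touches every byte once */
    return (uint16_t)sum;
}

static void receive_one_packet(struct packet *p, uint8_t *user_buf)
{
    handle_interrupt();                /* packet cost */
    driver_refill_descriptors();       /* packet cost */
    stack_process_headers(p);          /* packet cost */
    kernel_wakeup_receiver();          /* packet cost */
    (void)checksum(p->data, p->len);   /* bit cost    */
    memcpy(user_buf, p->data, p->len); /* bit cost: the final data copy */
}

int main(void)
{
    struct packet p = { .len = 64 };
    uint8_t user_buf[sizeof p.data];
    receive_one_packet(&p, user_buf);
    return 0;
}
```

The point of the sketch is the two cost classes: the first four calls execute once per packet, while the checksum and memcpy touch every byte and therefore grow with the I/O data size.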
2.2 Motivation
To reveal the relationships among the parts of network processing, we investigate the corresponding hardware operations. From the perspective of computer hardware architecture, network system performance is determined by three domains: CPU speed, memory speed, and I/O speed. Figure 2 depicts their relationship.

Figure 2. The relationship among the CPU, memory, and I/O domains of the network subsystem.

Obviously, the network subsystem can achieve its maximal performance only when the three domains are in balance, i.e., when the throughput or bandwidth of each hardware domain matches that of the others. In practice this is hard for hardware designers to achieve, because the characteristics and physical implementation technologies differ among CPU, memory, and I/O system (chipset) fabrication.

The speed gap between memory and CPU, a.k.a. "the memory wall," has received special attention for more than ten years, but it is still not well addressed. The disparity between the data throughput of the I/O system and the computing capacity of the CPU has also been reported in recent years [1][2]. Meanwhile, the major time costs of network processing mentioned above are all associated with I/O and memory speeds, e.g., driver processing, interrupt handling, and memory copy costs. The most important property of network processing is the "producer-consumer locality" between every two consecutive steps of the processing flow: data produced in one hardware unit is accessed by another unit almost immediately, e.g., data transported from the NIC into memory is soon accessed by the CPU. In conventional I/O and memory systems, however, the data transfer latency is high and this locality is not exploited.

Based on the analysis above, we observe that the I/O and memory systems are the limiting factors for network processing. Conventional DCA or INIC alone cannot address this problem, because each is inefficient in either I/O transfer latency or I/O data locality utilization (discussed in Section 5). To overcome these limitations, we present a combined DCA and INIC solution that not only takes the advantages of both methods but also improves the memory system policies and software strategies.

3. Design Methodologies
In this section we describe the proposed DCA combined with an INIC and give a framework for the implementation. We first present the improved DCA technology and discuss the key points of incorporating it into the I/O and memory system design. Then the important software data structures and the details of the DCA scheme are given. Finally, we introduce the system interconnection architecture and the integration of the NIC.

3.1 Improved DCA
To reduce data transfer latency and memory traffic, we present an improved Direct Cache Access solution. Unlike the conventional DCA scheme, our solution carefully considers the following points.

The first is cache coherence. Conventionally, data sent from a device by DMA is stored only in memory, while the cache may hold a different copy for the same address, which requires an additional coherence unit to perform snoop operations [11]. With DCA, I/O data and CPU data are both stored in the cache, with one copy per memory address, as shown in Figure 3. Our solution therefore modifies the cache policy to eliminate the snoop operations; coherence operations are performed by software when needed. This removes much of the memory traffic incurred by systems with coherence hardware [12].

Figure 3. With DCA, a single copy per memory address serves both I/O and CPU accesses.
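The paper leaves the software coherence mechanism unspecified. As one illustration of what "coherence performed by software" can look like, the sketch below assumes an x86-like platform and uses `_mm_clflush`/`_mm_sfence` from `immintrin.h` to keep a non-snooped DMA buffer coherent: dirty lines are written back before the device reads the buffer, and stale lines are evicted after the device writes it.

```c
/*
 * A minimal sketch of software-maintained coherence for non-DCA DMA
 * buffers, assuming an x86-like CPU (GCC/Clang, <immintrin.h>).
 * The paper does not prescribe an ISA; this is one possibility.
 */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* Write back and evict every cache line covering [buf, buf+len). */
static void cache_flush_range(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_sfence();  /* order the flushes before subsequent stores */
}

/* Before the device DMA-reads 'buf' from memory: push dirty lines out. */
static void prepare_dma_to_device(const void *buf, size_t len)
{
    cache_flush_range(buf, len);
}

/* After the device DMA-writes 'buf' to memory: evict stale lines so the
 * CPU's next loads fetch the fresh data from memory. */
static void complete_dma_from_device(const void *buf, size_t len)
{
    cache_flush_range(buf, len);
}

int main(void)
{
    static uint8_t buf[4096];
    prepare_dma_to_device(buf, sizeof buf);
    complete_dma_from_device(buf, sizeof buf);
    return 0;
}
```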
The second is cache pollution. DCA is a mixed blessing for the CPU: on one hand it accelerates data transfer; on the other hand it harms the locality of the other programs running on the CPU and causes cache pollution. Cache pollution depends heavily on the I/O data size, which is usually quite large. For example, one Ethernet packet carries a payload of at most 1492 bytes normally, and at most 65536 bytes with Large Segment Offload (LSO). For a common network buffer (usually 50 to 400 packets), this means that up to roughly 400 KB to 16 MB of data is sent into the cache. Data of this size causes cache performance to drop dramatically.

In this paper we carefully investigate the relationship between the size of the I/O data sent by DCA and the size of the cache system. To achieve the best cache performance, a DCA scheme is suggested in Section 4. Scheduling of the data sent with DCA is another effective way to improve performance, but it is beyond the scope of this paper.

The third is the DCA policy, i.e., the determination of when and which part of the data is transferred with DCA. The scheme is application specific and varies with different user targets. In this paper we reserve a specific memory address space in the system to receive the data transferred with DCA; the addresses of the data should be remapped to that area by the user or by the compiler.

3.2 DCA Scheme and Details
To accelerate network processing, many important software structures used in the NIC driver and the stack are coupled with DCA. NIC descriptors and the associated data buffers receive special attention in our solution: the former are the data transfer interface between DMA and CPU, and the latter contain the packets. For further analysis, each packet stored in a buffer is divided into its header and its payload. Normally the header is accessed frequently by the protocols, whereas the payload is accessed only once or twice (usually via memcpy) in a modern network stack and OS. Details of the related software data structures and of the network processing flow can be found in previous work [13].

The steps of transferring one packet from the NIC to the stack with the proposed solution are listed in Table 1. All access latency parameters in Table 1 are based on a state-of-the-art multi-core processor system [3]. Note that the cache access latency from I/O is nearly the same as that from the CPU, but the memory access latency from I/O is about 2/3 of that from the CPU, due to the complex hardware hierarchy above the main memory.

Table 1. Cost of transferring one packet from the NIC to the stack (cycles).

Step  Requestor  Operation  Data structure       Cache lines  DCA with INIC  DCA without INIC  DMA (no DCA)
1     CPU        write      descriptor           1            15             400               400
2     NIC        read       descriptor           1            15             400               1200
3     NIC        write      header               1            15             400               1200
4     NIC        write      payload              40           400            10500             48000
5     NIC        write      descriptor (status)  1            15             400               1200
6     CPU        read       descriptor (status)  1            15             400               400
7     CPU        read       header               1            15             400               400
8     CPU        read       payload              40           600            16000             16000
      Summary                                                 1090           28900             68800

* The status field in the descriptor indicates whether the DMA transfer completed successfully.

We can see that the DCA with INIC solution saves more than 95% of the cycles in theory (1,090 cycles versus 28,900 without the INIC and 68,800 with plain DMA) and avoids all traffic to the memory controller. In this paper we transfer the NIC descriptors and the data buffers, including both headers and payloads, with DCA to achieve the best performance. When the cache is small, transferring only the descriptors and the headers with DCA is an alternative.

DCA performance depends heavily on the system cache policy. Clearly, a write-back, write-allocate cache lets DCA achieve better performance than a write-through, write-no-allocate cache. Following the analysis in Section 3.1, we do not use snooping cache technology to maintain coherence with memory; cache coherence for other, non-DCA I/O data transfers is guaranteed by software.
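As a rough illustration of the structures coupled with DCA in this section, the C sketch below declares a descriptor ring and a packet buffer split into header and payload. All field names and sizes are assumptions made for illustration; the exact layouts are described in [13], not in this paper.

```c
/*
 * Hypothetical layout of the Section 3.2 software structures: a NIC
 * descriptor ring plus packet buffers split into header and payload.
 * Sizes and fields are illustrative assumptions only.
 */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE     256
#define HDR_BYTES     64      /* headers: accessed often -> send via DCA  */
#define PAYLOAD_BYTES 65536   /* LSO payload: touched once or twice      */

struct pkt_buffer {
    uint8_t header[HDR_BYTES];       /* delivered into cache via DCA      */
    uint8_t payload[PAYLOAD_BYTES];  /* via DCA, or memory if cache small */
};

/* One descriptor: the DMA<->CPU interface for a single packet. */
struct nic_descriptor {
    uint64_t buf_addr;  /* physical address of the pkt_buffer            */
    uint16_t length;    /* bytes actually received                       */
    uint8_t  status;    /* written by the NIC when done (step 5, Table 1)*/
    uint8_t  flags;
};

/* The ring: written by the CPU (step 1), read and written back by the
 * NIC (steps 2-5), then re-read by the CPU (step 6). */
struct nic_ring {
    struct nic_descriptor desc[RING_SIZE];
    uint32_t head;  /* next descriptor the NIC will consume  */
    uint32_t tail;  /* next descriptor the driver will refill */
};

int main(void)
{
    static struct pkt_buffer buf;  /* >64 KB: keep it off the stack */
    static struct nic_ring ring;
    /* A virtual address stands in here for the physical address a real
     * driver would program into the descriptor. */
    ring.desc[ring.tail].buf_addr = (uint64_t)(uintptr_t)&buf;
    ring.tail = (ring.tail + 1) % RING_SIZE;
    printf("descriptor 0 -> buffer at %p\n", (void *)&buf);
    return 0;
}
```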
3.3 On-chip Network and Integrated NIC

7. REFERENCES
[1] R. Huggahalli, R. Iyer, and S. Tetrick. Direct Cache Access for High Bandwidth Network I/O. In ISCA, 2005.
[2] D. Tang, Y. Bao, W. Hu, et al. DMA Cache: Using On-chip Storage to Architecturally Separate I/O Data from CPU Data for Improving I/O Performance. In HPCA, 2010.
[3] G. Liao, X. Zhu, and L. Bhuyan. A New Server I/O Architecture for High Speed Networks. In HPCA, 2011.
[4] E. A. Leon, K. B. Ferreira, and A. B. Maccabe. Reducing the Impact of the Memory Wall for I/O Using Cache Injection. In 15th IEEE Symposium on High-Performance Interconnects (HOTI'07), Aug. 2007.
[5] A. Kumar, R. Huggahalli