Relocation-Aware Communication Network for Circuits on Xilinx FPGAs

Adewale Adetomi, Godwin Enemali, and Tughrul Arslan
Institute for Integrated Micro and Nano Systems, School of Engineering, University of Edinburgh
Edinburgh EH9 3JL, United Kingdom
{a.adetomi, g.enemali, t.arslan}@ed.ac.uk

Abstract: The parallelism of hardware and the dynamic reconfigurability of FPGAs enable multiple hardware tasks to run concurrently, and also time-share resources by being swapped in and out of the device during runtime. More than ever before, these capabilities are being employed in systems with high-reliability requirements. To improve reliability, a method often used is circuit relocation. However, the static nature of conventional FPGA communication interconnects is a bane to flexible runtime relocation. This paper employs a novel network architecture to enable dynamic communication and thus improve the flexibility of circuit relocation. By using the clock infrastructure of the FPGA as the physical network links for tasks in a 4-node star network, we have shown that dynamic communication between relocatable circuits can be achieved without incurring any overheads of time and resources, save for only 32 slices used for the Network Interface.

Keywords: FPGA; bitstream relocation; CELOC; CERANoC; clock buffers; inter-task communication; network on chip

I. INTRODUCTION

The reprogrammable nature of FPGAs makes them a software-like alternative to ASICs. At the same time, their high performance, enabled by the capability to host multiple concurrent circuits, gives them a superior computational density compared to processors. The introduction of Dynamic Partial Reconfiguration (DPR) in Xilinx FPGAs allows parts of a device to be reprogrammed while other parts are operating [1]. With DPR, the flexibility and high performance of FPGAs can be combined to make the FPGA a platform for reliable and high-performance reconfigurable computing. This is all the more important as FPGAs are beginning to find more use in safety-critical applications like space research and compute-intensive applications like data centres [2]. More than ever before, the FPGA is expected to deliver on power, performance, usability, development time, and cost. In order to achieve these objectives, a hardware management layer in the form of a Reconfigurable Operating System (ROS) has been proposed to abstract application developers from the low-level intricacies of the FPGA's fabric while harnessing the benefits of the FPGA. Over the years, several ROSes like the ones in [3]-[5] have been proposed. At the core of the operation of most ROSes is bitstream relocation, which involves removing a circuit from one location on the FPGA and configuring it in another, usually at runtime. This is often in response to faulty or damaged resources.

Meanwhile, two major conditions have to be met for circuit relocation to be possible. First, the location to be relocated to must have similar resources (by type, number, and relative positions) as the circuit's original location (see Fig. 1). This requirement is the easiest to meet as FPGA resources are generally tiled in regular repeating patterns. Second, there must be provision for communication at the desired location both for preserving existing interconnections and making new ones. This requirement is generally more difficult to fulfil for runtime relocation because of the need to establish routes at runtime. The Xilinx Partial Reconfiguration (PR) flow [1] does not support relocation, thereby necessitating that all routes between circuits are statically determined (at compile time). For full arbitrary relocatability, a dynamic communication infrastructure is needed.

Fig. 1. Circuit relocation requires that a matching location is found in terms of resource type and layout. Task 2 can be moved to LOC 2 but not LOC 3, but the interconnection between Tasks 1 & 2 must be preserved

We propose a dynamic communication access mechanism that removes the restriction of the static interconnect links and allows the arbitrary relocation of circuits. This mechanism relies on the replacement of the interconnect links with FPGA clock buffers. We call this Clock-Enabled Relocation-Aware Network-on-Chip (CERANoC). Since the clock buffers do not use the general logic routing resources, the path from a transmitting circuit to a receiving circuit is free of logic interconnections. We make the following key contributions in this paper:

1) The use of FPGA clock buffers to interconnect nodes in an on-chip network
2) A network-on-chip for dynamic communication between relocatable circuits

The rest of this paper is organized as follows: In section II, we discuss the challenges of runtime circuit relocation. Section III presents existing techniques for managing runtime relocation and introduces the CERANoC solution, while section IV describes the use of clock buffers for networking. Section V discusses the implementation of CERANoC with a focus on the special considerations needed with regards to a traditional NoC design. Section VI presents some evaluation while section VII summarizes our approach, its limitations, and our plans for future work.

II. THE CHALLENGE WITH RUNTIME CIRCUIT RELOCATION

As stated earlier, for circuit relocation to be feasible, communication must be provided for the circuit being moved at the resource-matching destination. With respect to Fig. 1, the easiest way to provide this communication is to ensure that a route from Task 1 to LOC 2 is established at design time. This way, during runtime, Task 2 can be moved to LOC 2 while maintaining its communication link with Task 1. This is usually accomplished by using the PR flow [1], which allows designers to partition the chip area into fixed reconfigurable partitions (RPs) at design time (offline). Resources not meant for dynamic reconfiguration are left in the static region. Reconfigurable Modules (RMs), which must share the same fixed port interfaces, are defined for these partitions. The downside of this is that at runtime, a circuit can only be placed in or relocated to a partition to which it belongs as an RM. Since the decision of which RP an RM belongs to is made at design time, this limits the number of locations circuits can be relocated to in case of emergent permanent faults. This deteriorates the reliability figures of the device in operation.

One other challenge with relocation is that of existing routings in the target region which belong to circuits outside that region. For instance, in Fig. 1, there is a route between Tasks 1 and 3. If a new task were to use the entire LOC 1 region, there would be a conflict between the existing and new routing interconnects. This problem exists because FPGA implementation tools like Vivado allow circuits to use routing resources external to their confined regions even if they have no logic resources there. While the Vivado constraint CONTAIN_ROUTING ensures that a partially-reconfigurable partition does not use routing resources outside its region, it does not prevent routings from the static region from crossing the RPs [1].

III. RELATED WORKS

To overcome the limitations imposed by the PR flow, a technique that involves recomputing the routes by programming the Programmable Interconnect Points (PIPs) during runtime can be used, but this is computationally expensive, often requiring several thousands of clock cycles per net [6]. Moreover, determining the location of bits controlling the switch matrices and PIPs in the bitstream is non-trivial and there is no constancy in the bitstream format as one moves between FPGA families. With the increasing size of newer FPGA chips, bitstream formats and interconnects have become more complex, and runtime routing inevitably more costly and complicated.

A different approach to dynamic communication is taken in DyNoC, a dynamic network-on-chip architecture [7]. While several research works have been carried out on dynamic or reconfigurable NoCs, most do not actually consider the placement of a new task. Rather, they are mostly concerned with the runtime restructuring of the network topology or packet routing to meet changing communication needs, as seen in ReNoC [8] and Hoplite [9] respectively. On the other hand, DyNoC's approach to dynamic communication involves placing a new circuit over existing deactivated routers while leaving surrounding routers free for communication. With this arrangement, a new circuit can be placed anywhere on the mesh network with continued access to the network. However, we deem this approach to still have the challenges of static routes, as the authors do not seem to have provided details on how these are managed and their implementation diagram [7] shows routings crisscrossing the entire floorplan.

An ideal situation for dynamic communication is to have no static interconnects to deal with and no need to create routings on the fly. A step in that direction is taken in [10], where the authors present a communication mechanism that involves using the Internal Configuration Access Port (ICAP) of the FPGA to transfer data between arbitrarily-placed hardware tasks. This is done by connecting memory elements to the inputs and outputs of circuits and using the ICAP as a side channel to copy data from input memories to output memories, thereby avoiding static interconnects. However, there is a shortcoming in this with respect to reliability. The ICAP has a maximum theoretical bandwidth of 400 MB/s [1] and Xilinx recommends that more than 99% of this bandwidth should be dedicated to Soft Error Mitigation (SEM) [11] for the entire device. SEM is indispensable for reliable reconfigurable computing. Using at least 99% for SEM means that only a meagre 4 MB/s (1% of 400 MB/s) of the ICAP's bandwidth is available for other functions. In a system with frequent reconfigurations, this 4 MB/s is hardly sufficient for configuration-related functions, let alone any other function that requires the use of the ICAP.

Our solution to dynamic communication in CERANoC eliminates the static inter-circuit communication routings altogether. We achieve this by removing task interconnections and replacing them with clock buffers as shown in Fig. 2. The hypothetical layout of tasks here is the same as that in Fig. 1, except that the interfaces between Tasks 1 & 2, and between Tasks 1 & 3 have been removed. To provide communication, a clock buffer is used to transmit serial bits from Task 1 to Tasks 2 and 3. This signal also feeds LOC 2, so that if we now relocate Task 2 to LOC 2, the communication between it and Task 1 remains intact. At the same time, LOC 1 is now free of a crossing routing. Basically, the surface of the chip is generally freed of inter-circuit routings. In a more practical sense, we pre-route clock buffers between clock regions at design time so that during runtime, regardless of the clock region in which a task is placed, it is able to communicate with any task in any other clock region. This technique of using clock buffers for communication was first presented in [12] and termed Clock-Enabled Low-Overhead Communication (CELOC). The main concept behind the technique is to allow serial data to ride on a clock signal. In this paper, we investigate its use to provide communication support for relocatable circuits. The main advantage of CELOC for CERANoC is that the clock buffers use dedicated routings that are independent of the general logic interconnect. As such, there is no static routing to contend with when circuits are relocated.

Fig. 2. By removing the inter-circuit interfaces and replacing them with clock buffers, it is possible to achieve dynamic communication

IV. FPGA CLOCK BUFFERS FOR COMMUNICATION

While the clock buffers and nets are precious and are available in the chip for functions ranging from glitchless multiplexing between clock sources to clock frequency division, most FPGA designs contain several unused global and horizontal clock buffers [13]. CELOC repurposes these redundant resources for on-chip communication support. It involves a special adaptation of these clock buffers to serve as binary (0 or 1) signal transmitters and receivers on the FPGA. Fig. 3 shows this concept. By gating a free-running communication clock using a clock buffer, it is possible to send data from a transmitting (TX) task at any location on the device to a receiving (RX) task at any other location reachable by the buffer. The CELOC technique requires a receiving (RX) task to be fed with three clocks: task_clock, com_clock, and data_clock. Task_clock is used to clock the task while com_clock is used to generate data_clock, which carries serialized data from the source to the destination.

A. Data Transfer Mechanism

The parallel data from a transmitting (TX) task is serialized and shifted out bit-by-bit to the receiving (RX) task through the clock buffers. The Data Latch Controller latches the parallel data into registers for onward shifting to the clock enable (CE) of the buffer on the ce_cntrl line. Since the same register block is used for shifting out the serial bits, multiplexers are used to select between updating the registers with new data and shifting already latched data.

The ce_cntrl signal controls the output of the buffer by toggling its CE. A 1 allows the input of the buffer to pass through to the output, while a 0 ties the output to zero. Since com_clock (which can be the same as task_clock) and task_clock are synchronous, a 1 essentially allows a full clock cycle to pass through while a 0 blocks it. As an example, Fig. 3 also shows the theoretically expected signal transitions for transmitting 10011010 (binary). The RX task can detect a rising edge on data_clock as a 1. With respect to the distance between the TX and RX tasks, the clock buffers in the Xilinx FPGAs are designed for short propagation delays and very low skew [13]. This helps prevent the long propagation delays associated with shared buses. As a result, com_clock and data_clock can travel far with minimal loss of phase alignment, and thus ensure timing closure at the highest possible clock frequency.

An advantage of using a separate clock as the communication clock is that we are not limited to the frequency of the task clock; the communication engine can run at a much higher frequency. To avoid complications from Clock Domain Crossing (CDC), the two clocks are obtained from the same source, a single Phase-Locked Loop (PLL) clock generator, with the communication clock made as high as possible. Using this structure helps to prevent setup and hold timing violations by keeping both the transmission and the reception synchronous and in the same clock domain: no asynchronous clocks and no variable phase alignment. In our implementation, we have used the PLLE2_BASE available in the Xilinx 7 series FPGA. Two global clock buffers are then used to distribute the task_clock and the com_clock throughout the chip.

There are various types of clock buffers in the Xilinx FPGA and different combinations of the buffers can be used depending on the nature of the application at hand. We have verified a number of options in [12].

B. Data Recovery Mechanism

To ensure that no local routing enters the RX task, the data_clock input to the RX cannot be interfaced with a non-clocking input. Therefore, to recover data from the data clock, advantage is taken of the ability of global (BUFG) and horizontal (BUFH) clock buffers to drive not only CLK inputs of logic resources, but also the Set/Reset (SR) and CE inputs of registers. This enables us to feed a task with data_clock without having a general interconnect routing cross into the task. Specifically, we use the setup in Fig. 4(a), where data_clock feeds the Preset (PRE) input of an FDPE register primitive, ensuring no local (static) routing crosses the task boundary.

The FDPE is a D flip-flop with clock enable (CE) and asynchronous preset [14]. By connecting CE to a 1 and D to a 0, with the clock input fed by the same clock (com_clock) used to create the data clock at the transmitter, data_clock connected to the PRE input produces on Q signal level transitions corresponding to the rising edges of data_clock, as shown in Fig. 4 for the same 8-bit data 10011010 (binary) transmitted in Fig. 3. To understand how this works, we consider the truth table of the FDPE (see Fig. 4(b)). We observe that by setting CE to 1 and D to 0, Q follows PRE (data_clock) instead of D at every rising edge of C.

[Fig. 3 diagram. Labelled blocks: TX Task, Data Latch Controller, Data Serializer, Counter, Clock Buffer Block, RX Task, Data Deserializer, Serial to Parallel Converter. Signals: task_clock, com_clock, ce_cntrl, data_clock. Inset: an example showing the transmission of the 8-bit binary data 10011010; the data rides on data_clock.]

Fig. 3. Data can be sent from one task to another by using the clock enable of clock buffers

... utilization respectively of serial links over parallel links in NoCs.

Fig. 5 and Fig. 6 show, respectively, mesh and star network implementations of CERANoC. Other topologies can be formed by manipulating the clock buffers. The diagrams only show four nodes, but this can be easily extended as the dotted lines depict. Fig. 7 shows the global clock generation and distribution architecture. We identify the following buffers available in the Xilinx 7 series FPGA: global buffers (BUFG), horizontal buffers (BUFH), multi-region buffers (BUFMR), and regional buffers (BUFR). These are connected in-between nodes as shown in Fig. 5 and Fig. 6. A BUFG has global reach, while a BUFH can only connect two horizontal clock regions. On the other hand, a BUFMR can feed regions immediately above and below its own region. BUFGs do n...
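The CELOC data transfer and recovery described in Section IV can be illustrated with a small behavioural model. The following is only a sketch under stated assumptions, not the authors' RTL: it assumes most-significant-bit-first transmission (suggested by the 10011010 example), models each com_clock cycle as two half-cycle samples (high, then low), and abstracts the FDPE-based recovery into "one rising edge on data_clock within a com_clock cycle means a 1". All function names are hypothetical.

```python
def serialize(value, width=8):
    """Return the bits of value, most significant bit first (assumed order)."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def gate_clock(ce_bits):
    """Model a clock buffer with clock enable (CE).

    Each com_clock cycle is represented by two samples: high, then low.
    CE = 1 lets the whole cycle through; CE = 0 holds the output at 0.
    """
    data_clock = []
    for ce in ce_bits:
        data_clock += [1 & ce, 0]
    return data_clock


def recover(data_clock):
    """Recover one bit per com_clock cycle from data_clock.

    A rising edge inside a cycle is read as 1, no edge as 0. This stands in
    for the FDPE preset arrangement described in the text.
    """
    bits = []
    prev = 0
    for i in range(0, len(data_clock), 2):
        edge = 0
        for level in data_clock[i:i + 2]:
            if level == 1 and prev == 0:
                edge = 1
            prev = level
        bits.append(edge)
    return bits


def deserialize(bits):
    """Rebuild the parallel word from MSB-first serial bits."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


tx_word = 0b10011010
ce_cntrl = serialize(tx_word)      # drives the CE pin of the clock buffer
data_clock = gate_clock(ce_cntrl)  # the gated clock that carries the data
rx_word = deserialize(recover(data_clock))
print(ce_cntrl)                    # [1, 0, 0, 1, 1, 0, 1, 0]
print(rx_word == tx_word)          # True
```

The model makes the key property visible: the only signal crossing from the TX task to the RX task is a gated clock, so the receiver needs nothing beyond com_clock and data_clock to rebuild the word.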

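The first relocation condition from the Introduction (the target must match the original location in resource type, number, and relative position, as in Fig. 1) can be illustrated as a simple pattern search. This is a minimal sketch, assuming the fabric is described as a one-dimensional list of column types; the column layout, the task footprint, and the function name are all hypothetical and do not come from the paper or from any Xilinx tool.

```python
def matching_locations(fabric_columns, footprint):
    """Return every start column where the fabric matches the task footprint.

    A location matches only if the same resource types appear in the same
    relative order, which covers type, number, and relative position.
    """
    n = len(footprint)
    return [
        start
        for start in range(len(fabric_columns) - n + 1)
        if fabric_columns[start:start + n] == footprint
    ]


# Hypothetical column layout of a device and footprint of a task.
fabric = ["CLB", "CLB", "BRAM", "CLB", "CLB", "BRAM", "CLB", "DSP", "CLB"]
task2 = ["CLB", "CLB", "BRAM"]

print(matching_locations(fabric, task2))  # [0, 3]
```

Because FPGA resources are tiled in repeating patterns, such a search usually finds several candidates, which is why the paper treats this condition as the easier one. The harder condition, communication at the target location, is what CERANoC addresses.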