High-performance architectures for high-radix switches
Research Area: Switch Architectures
To reduce latency, power consumption, and cost, the optimal number of ports per switch has been increasing over time. However, traditional switch architectures either deliver poor performance or fail to scale with the number of ports. This thesis proposes a new switch architecture for high-radix switches, called Partitioned Crossbar Input Queued (PCIQ), which solves the problem of excessive memory requirements in high-radix designs and defines a new family of switch architectures. PCIQ is based on an intelligent partitioning of the crossbar into sub-crossbars; it requires fewer memory resources than other proposals for high-radix switches and achieves greater efficiency, thanks in part to a more efficient arbiter design. In particular, PCIQ uses two rotating-priority arbiters (one per sub-crossbar) whose cost grows linearly, and whose response time grows logarithmically, with the number of switch ports. Moreover, the cost of PCIQ (measured in terms of memory requirements, crossbar complexity, and arbitration complexity) is similar to or even lower than that of basic organizations such as Combined Input-Output Queued (CIOQ) switches, yet PCIQ achieves maximum efficiency under uniform traffic distributions.

Head-of-Line (HOL) blocking dramatically reduces switch performance. Traditional solutions to eliminate HOL blocking either do not scale with the number of ports or require complex architectures. This thesis also proposes a congestion control technique, called RECN-IQ, that eliminates HOL blocking. RECN-IQ is designed for switches with memories only at the input ports and is a highly efficient technique.
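The rotating-priority arbitration mentioned above can be illustrated with a small behavioral model. This is only a sketch of a generic round-robin arbiter, not PCIQ's actual hardware design (which achieves linear cost and logarithmic response time in logic); the class and method names are illustrative.

```python
class RotatingPriorityArbiter:
    """Behavioral model of a rotating-priority (round-robin) arbiter.

    In hardware such an arbiter is built with combinational logic;
    here we only model its grant behavior, one grant per cycle.
    """

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.priority = 0  # port with highest priority this cycle

    def arbitrate(self, requests):
        """requests: one bool per port. Returns the granted port, or None."""
        for offset in range(self.num_ports):
            port = (self.priority + offset) % self.num_ports
            if requests[port]:
                # Rotate priority past the winner so it is served last next time.
                self.priority = (port + 1) % self.num_ports
                return port
        return None

arb = RotatingPriorityArbiter(4)
print(arb.arbitrate([True, True, False, True]))  # grants port 0
print(arb.arbitrate([True, True, False, True]))  # grants port 1
print(arb.arbitrate([True, True, False, True]))  # grants port 3
print(arb.arbitrate([True, True, False, True]))  # grants port 0 again
```

After each grant the priority pointer moves past the winner, so under sustained contention every requesting port is eventually served, which is the fairness property rotating-priority arbiters provide.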
Efficient mechanisms to provide fault tolerance in interconnection networks for PC Clusters
Research Area: Fault Tolerance
Currently, clusters of PCs are considered a cost-effective alternative to large parallel computers. In these systems, thousands of components (processors and/or hard disks) are connected through high-performance interconnection networks. Among the high-performance network technologies currently available to build clusters, InfiniBand (IBA) has emerged as a new standard interconnection technology suitable for clusters. Indeed, it has been adopted by many of the most powerful systems currently built (Top500 list).
As the number of nodes increases in these systems, the interconnection network grows accordingly. Along with the increase in components, the probability of faults increases dramatically, and thus fault tolerance in the system in general, and in the interconnection network in particular, becomes a necessity. Unfortunately, most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied because routing and virtual channel transitions are deterministic in IBA, which prevents packets from avoiding the faults. Therefore, a new and effective fault-tolerant strategy is needed. Thus, this thesis focuses on providing mechanisms that achieve adequate levels of fault tolerance in the routing of PC clusters, especially tailored to IBA networks. We propose and evaluate several mechanisms suitable for cluster interconnection networks.
The first mechanism to provide fault tolerance in IBA (referred to as Transition Fault Tolerant Routing; TFTR) consists of using several disjoint paths between every source-destination pair of nodes and selecting the appropriate path at the source end node by means of the Automatic Path Migration (APM) mechanism provided by IBA, which migrates on the fly the paths affected by a failure to alternative fault-free paths. To this end, however, an efficient routing algorithm able to provide enough disjoint paths, while still guaranteeing deadlock freedom, is required. By an efficient routing algorithm we mean one that minimizes the required set of resources and is computed in a time-efficient manner. We address this issue in a second effort by proposing a scalable fault-tolerance methodology (referred to as SPFTR) for tori in IBA networks.
As a second contribution of this thesis, we propose a simple and effective fault-tolerant routing methodology (referred to as Reachability Based Fault Tolerant Routing; RFTR), which can be applied to any topology. RFTR builds new alternative paths by joining subpaths extracted from the set of already computed paths, thus being time-efficient.
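The idea of building a new path by joining precomputed subpaths can be sketched as follows. This is a simplified illustration under assumptions (paths as node lists, a single faulty link, first valid join wins); `build_alternative_path` is hypothetical, not the actual RFTR algorithm.

```python
def build_alternative_path(paths, src, dst, faulty_link):
    """Sketch of RFTR-style path construction: join a fault-free subpath
    from src to some intermediate node with a fault-free subpath from
    that node to dst, both extracted from the already-computed path set.
    Assumes src != dst; returns None if no two-piece join exists."""

    def subpaths_from(node):
        # Yield every fault-free subpath of an existing path starting at node.
        for path in paths:
            if node in path:
                i = path.index(node)
                for j in range(i + 1, len(path)):
                    if (path[j - 1], path[j]) == faulty_link:
                        break  # this hop crosses the fault; stop extending
                    yield path[i:j + 1]

    for first in subpaths_from(src):
        mid = first[-1]
        if mid == dst:
            return first
        for second in subpaths_from(mid):
            if second[-1] == dst:
                return first + second[1:]
    return None

# Hypothetical precomputed paths; link (1, 2) fails.
paths = [[0, 1, 2, 3], [0, 4, 2], [4, 2, 3]]
print(build_alternative_path(paths, 0, 3, faulty_link=(1, 2)))  # [0, 4, 2, 3]
```

The construction only reuses subpaths that were already computed, which is what makes the approach time-efficient compared to recomputing routes from scratch.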
In the last contribution, we focus on providing fault tolerance based on dynamic reconfiguration. We propose a simple and fast method for dynamic network reconfiguration, referred to as Epoch Based Reconfiguration (EBR). EBR guarantees a fast and deadlock-free reconfiguration, but instead of avoiding deadlocks our mechanism is based on regressive deadlock recoveries. EBR works in an asynchronous manner, does not require additional resources and can be applied on any topology.
Most of the proposals made in this thesis are suitable (with no hardware modification) to be implemented on commercial network technologies (mainly IBA networks) currently used in clusters of PCs, and are able to tolerate dynamically a reasonable number of faults as long as the network remains physically connected.
Efficient techniques to provide scalability for token-based cache coherence protocols
Cache coherence protocols based on tokens can provide low latency without relying on non-scalable interconnects thanks to the use of efficient unordered requests. However, when these unordered requests contend for the same memory block, they may cause protocol races. To resolve the races and ensure the completion of all cache misses, token protocols use a starvation prevention mechanism that is inefficient and non-scalable in terms of required storage structures and generated traffic. Besides, token protocols use non-silent invalidations, which increase the latency of write misses proportionally to the system size. All these problems make token protocols non-scalable. To overcome the main problems of token protocols and increase their scalability, we propose a new starvation prevention mechanism named Priority Requests. This mechanism resolves contention through an efficient, elegant, and flexible method based on ordered requests. Furthermore, thanks to Priority Requests, efficient techniques can be applied to limit the storage requirements of the starvation prevention mechanism, to reduce the total traffic generated for managing protocol races, and to reduce the latency of write misses. Thus, the main problems of token protocols can be solved, which, in turn, contributes to improving their efficiency and scalability.
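The token-counting invariant behind these protocols, and the way an ordered request can break a race, can be sketched in a toy model. The token counts, `priority_write`, and the immediate token forwarding below are illustrative assumptions, not the actual Priority Requests protocol.

```python
TOKENS_PER_BLOCK = 4  # token coherence invariant: the total token count is fixed

class Cache:
    def __init__(self, name):
        self.name = name
        self.tokens = 0

    def can_write(self):
        return self.tokens == TOKENS_PER_BLOCK  # a write needs all tokens

    def can_read(self):
        return self.tokens >= 1                 # a read needs at least one

def priority_write(requester, caches):
    """Sketch of a Priority-Request-style resolution: once a request has
    been ordered (e.g. through a serialization point), every other cache
    forwards its tokens to the requester, so the write is guaranteed to
    complete instead of endlessly retrying."""
    for cache in caches:
        if cache is not requester and cache.tokens:
            requester.tokens += cache.tokens
            cache.tokens = 0
    assert requester.can_write()

# Two writers racing: each has grabbed part of the tokens with unordered
# requests, so neither write miss can complete -- a protocol race.
a, b = Cache("A"), Cache("B")
a.tokens, b.tokens = 2, 2
print(a.can_write(), b.can_write())  # False False

priority_write(a, [a, b])            # A's ordered request wins the race
print(a.can_write())                 # True
```

The point of the sketch is the invariant: tokens are conserved, writes require all of them, and ordering one request is enough to guarantee forward progress.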
Cost Effective Routing Implementations for On-chip Networks
Research Area: Network-On-Chip
Current many-core architectures like Chip Multiprocessors (CMPs) and Multiprocessor Systems-on-Chip (MPSoCs) rely on the effectiveness of the on-chip network (NoC) for inter-core communication. An effective NoC has to be scalable while meeting tight power, area, and latency constraints. 2D mesh topologies are usually preferred for general-purpose NoC designs as they fit the chip layout. However, designers must address new emerging challenges: the increased probability of manufacturing defects, the need for an optimized use of resources to enhance application-level parallelism, or the need for efficient power-aware techniques may break the regularity of those topologies. In addition, collective communication support is a desired feature to effectively address the communication needs of cache coherence protocols. Under these conditions, efficient routing of messages becomes a challenge.
The objective of this dissertation is to lay the foundations of a new logic-based distributed routing architecture able to adapt to any irregular topology derived from a 2D mesh structure, thus providing full coverage for any topology pattern induced by the challenges mentioned above. This is done by starting from a concept idea, then following the evolution of several mechanisms, and finally arriving at an implementation that encompasses several modules accomplishing the objective stated before. This final implementation is named eLBDR (effective Logic-Based Distributed Routing), but the study spans from the first mechanism, LBDR, to the mechanisms that emerged progressively from it, describing them in detail together with evaluations and results that provide a cost/applicability trade-off analysis.
Regarding the full architecture, eLBDR presents area, latency, and power consumption requirements comparable to the most efficient routing mechanisms, such as Dimension-Order Routing (DOR), as reflected in real router implementations that are among the first to bring still-underutilised ideas, such as virtual cut-through switching, to the NoC domain. In NoC scenarios modelled after a link-variability analysis, the full mechanism achieves 100% coverage in every configured scenario, so it is fair to assume that eLBDR is prepared to face the new challenges of the NoC research field. eLBDR can be used as an effective fault-tolerant mechanism in CMP and MPSoC systems with defective components at the NoC level; aggressive power-down techniques that switch off entire irregularly shaped NoC regions can be designed, since the remaining network topology is still supported by eLBDR; virtualization of the chip (mapping applications to disjoint network regions) can also be achieved with eLBDR; and, finally, eLBDR allows the use of broadcast communication primitives to support effective cache coherence protocols, since broadcast is supported inside a region in eLBDR and its predecessors, thus implementing multicast support at the chip level.
In short, the objective of the concept idea is to offer an alternative to the use of routing tables (either at routers or at end-nodes). Although the use of routing tables at routers is extremely flexible, it does not scale in terms of latency, area, and power consumption. As described in the next chapters, all the mechanisms require only a small set of configuration bits, thus being more compact than large routing tables implemented in memories. More importantly, from the first mechanism to the most recently developed one, their requirements do not grow with system size, thus providing scalability.
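As an illustration of routing from a few configuration bits, the following sketch models LBDR-style logic for a 2D mesh: per-switch connectivity bits and routing (turn) bits replace a routing table. The exact bit semantics used here are a simplified assumption for illustration, not the precise LBDR specification.

```python
def lbdr_outputs(cur, dst, R, C):
    """Candidate output ports at switch `cur` for a packet to `dst`.
    R[(a, b)] says whether taking output `a` toward a destination that
    also lies in direction `b` is allowed (a turn bit); C[a] says whether
    output `a` is connected. Coordinates: x grows east, y grows north."""
    cx, cy = cur
    dx, dy = dst
    n, s = dy > cy, dy < cy
    e, w = dx > cx, dx < cx
    out = []
    if C['n'] and (n and not e and not w
                   or n and e and R[('n', 'e')]
                   or n and w and R[('n', 'w')]):
        out.append('n')
    if C['e'] and (e and not n and not s
                   or e and n and R[('e', 'n')]
                   or e and s and R[('e', 's')]):
        out.append('e')
    if C['s'] and (s and not e and not w
                   or s and e and R[('s', 'e')]
                   or s and w and R[('s', 'w')]):
        out.append('s')
    if C['w'] and (w and not n and not s
                   or w and n and R[('w', 'n')]
                   or w and s and R[('w', 's')]):
        out.append('w')
    return out

# XY routing on a fault-free mesh encoded in 8 turn bits + 4 connectivity
# bits: turns from E/W toward N/S are allowed, the opposite ones are not.
R = {('n', 'e'): False, ('n', 'w'): False, ('s', 'e'): False, ('s', 'w'): False,
     ('e', 'n'): True, ('e', 's'): True, ('w', 'n'): True, ('w', 's'): True}
C = {'n': True, 'e': True, 's': True, 'w': True}
print(lbdr_outputs((0, 0), (2, 2), R, C))  # ['e']
```

A dozen bits per switch encode the whole routing function, and clearing connectivity or turn bits adapts the same logic to irregular topologies, which is the scalability argument made above.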
Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology
Currently, clusters of PCs are considered a cost-effective alternative to large parallel computers. In these systems, thousands of computing nodes are connected through a high-performance interconnection network. The interconnection network must be carefully designed, since it heavily impacts the performance of the whole system. Two of the main design parameters of the interconnection networks are topology and routing. Topology defines the interconnection of the elements of the network among themselves, and between them and the computing nodes. Routing defines the paths followed by the packets through the interconnection network.
Performance has traditionally been the main metric to evaluate the interconnection network. However, we have to consider two additional metrics nowadays: cost and fault-tolerance. Interconnection networks have to scale in terms of cost, in addition to scale in performance. That is, they not only need to maintain their performance as the system size is increased, but also without heavily increasing their cost. On the other hand, as the number of nodes increases in cluster-based machines, the interconnection network grows accordingly. This increase in the number of elements of the interconnection network raises the probability of faults, and thus, fault-tolerance has become mandatory for current interconnection networks.
This dissertation focuses on the fat-tree topology, which is one of the most commonly used topologies for clusters. Our aim is to exploit its characteristics to provide both fault-tolerance and a load-balanced routing algorithm that offers a good cost/performance tradeoff.
First, we focus on the fault-tolerance of the fat-tree topology. Most works in the literature provide fault-tolerance at the cost of adding resources to the network, either switches, links, or virtual channels. In contrast to these works, we provide the same degree of fault-tolerance without increasing the resources of the network, by taking advantage of the abundance of equivalent paths in the fat-tree. In particular, we define a mechanism that avoids the use of the paths that lead to the faulty elements, in such a way that packets that would cross the faults are deviated through non-faulty paths to their destination. As all the non-faulty paths can be used without any restriction, the mechanism suffers low performance degradation due to faults while providing the maximum degree of fault-tolerance achievable without adding new resources to the network.
Next, we take on the challenge of designing a deterministic routing algorithm that can compete in performance with the adaptive routing algorithms used for the fat-tree topology. Although adaptive routing algorithms need more resources to be implemented than deterministic ones, they usually outperform the latter. With our work, we present a deterministic routing algorithm that obtains similar, or even better, performance than the best adaptive routing algorithm, but at a lower cost, since it does not require a selection function in each switch. Furthermore, when in-order delivery of packets is required, adaptive routing algorithms need additional mechanisms, whereas deterministic routing algorithms preserve the delivery order of packets by design. The results will show that our proposed deterministic routing algorithm clearly outperforms adaptive routing when in-order delivery of packets is required.
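The flavor of such a deterministic, load-balanced fat-tree routing can be sketched as follows. This is a hypothetical illustration for a k-ary n-tree in which the up-link at each stage is taken from a digit of the destination, so destinations spread evenly over the up-links and all traffic to a given destination converges onto a single down path; it is not necessarily the exact algorithm proposed in the thesis.

```python
def to_digits(x, k, n):
    """k-ary representation of x with n digits, most significant first."""
    digits = []
    for _ in range(n):
        digits.append(x % k)
        x //= k
    return digits[::-1]

def deterministic_route(src, dst, k, n):
    """Destination-based deterministic route on a k-ary n-tree (sketch).
    Ascend to the stage of the nearest common ancestors, picking at each
    hop the up-link given by the matching digit of the destination, then
    descend following the destination's digits. Assumes src != dst."""
    s, d = to_digits(src, k, n), to_digits(dst, k, n)
    top = next(i for i in range(n) if s[i] != d[i])  # stage where paths must merge
    up_links = [d[i] for i in range(n - 1, top, -1)]
    down_links = d[top:]
    return up_links, down_links

print(deterministic_route(0, 5, k=2, n=3))  # ([1, 0], [1, 0, 1])
```

Because the up-link choice depends only on the destination, two packets to the same destination never take different paths, which is what makes in-order delivery come for free.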
Finally, we take advantage of the particular characteristics of the proposed deterministic routing algorithm to simplify the fat-tree topology. The resulting topology almost halves the resources required by the fat-tree, achieving in many cases the same performance. In general, the cost/performance ratio of the new topology is almost half that of the fat-tree.
Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors
Research Area: Processor Architecture
Current superscalar processors use a reorder buffer (ROB) to track the instructions in flight. The ROB is implemented as a FIFO queue where instructions are inserted in program order after being decoded, and from which they are extracted when they commit, also in program order. This hardware structure provides simple support for speculation, precise exceptions, and register reclamation. However, retiring instructions in program order may lead to significant performance degradation if a long-latency operation blocks the ROB head. Several proposals have been published to deal with this problem. Most of them allow instructions to be retired out of order in a speculative manner, so they require checkpoints in order to roll the processor back to a precise state when speculation fails. Checkpoint management usually involves costly hardware and causes an enlargement of other major processor structures, which in turn might impact the processor cycle time. This problem affects most state-of-the-art microprocessors, regardless of whether they are single- or multithreaded, or whether they implement one or multiple cores. This thesis studies non-speculative out-of-order retirement of instructions in superscalar, multithreaded, and multicore processors.
First, the Superscalar Validation Buffer architecture is proposed as a processor pipeline design where instructions are retired out of program order in a non-speculative manner, hence without checkpoints. The ROB is replaced with a smaller FIFO queue, called Validation Buffer (VB), which instructions can leave as soon as they are classified either as non-speculative or as mispeculated, irrespective of their execution state. The management of the VB is complemented with an aggressive register reclamation technique that decouples physical register release from instruction retirement. The VB architecture largely alleviates the ROB performance bottleneck and reduces the complexity of other processor structures. For example, a VB half as large as a ROB can outperform it while decreasing hardware cost.
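The effect of decoupling retirement from the queue head can be seen in a toy model. Under simplifying assumptions (one instruction per entry, validation assumed immediate so a validated instruction frees its entry at its completion cycle at the latest), it compares the cycle at which each instruction frees its entry under in-order (ROB-like) and validation-based (VB-like) retirement; `retire_cycles` is illustrative, not the VB microarchitecture.

```python
from itertools import accumulate

def retire_cycles(completion, in_order):
    """Cycle at which each instruction can free its queue entry, given
    its completion cycle. In-order (ROB-like) retirement waits for
    everything ahead, so a long-latency instruction propagates its
    delay to all younger entries; validation-based (VB-like) retirement
    frees each entry independently once the instruction is validated."""
    if in_order:
        return list(accumulate(completion, max))  # prefix maximum
    return list(completion)

# A 100-cycle cache miss at the head, followed by fast instructions:
miss_then_hits = [100, 2, 3, 4, 5]
print(retire_cycles(miss_then_hits, in_order=True))   # [100, 100, 100, 100, 100]
print(retire_cycles(miss_then_hits, in_order=False))  # [100, 2, 3, 4, 5]
```

With in-order retirement every entry stays occupied until cycle 100, which is exactly the full-ROB stall described above; freeing entries out of order keeps the queue small.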
Second, the Multithreaded Validation Buffer architecture extends the proposal to different multithreading organizations, namely coarse-grain, fine-grain, and simultaneous multithreading. Multithreaded processors became popular as an evolution of superscalar processors to increase issue bandwidth utilization. Likewise, out-of-order retirement of instructions contributes to reducing issue waste by avoiding frequent pipeline stalls due to a full ROB. The evaluation of the VB architecture on multithreaded processors again shows significant performance gains and/or a reduction in complexity. For example, the number of supported hardware threads can be reduced, or the multithreading paradigm can be simplified, without affecting performance.
Finally, the Multicore Validation Buffer architecture is presented as an out-of-order retirement approach for multicore processors, which define the dominant trend in the current market. Wide instruction windows are very beneficial to multiprocessors that implement a strict memory model, especially when both loads and stores encounter long latencies due to cache misses, whose stalls must be overlapped with instruction execution to overcome the memory gap. The extension of the VB architecture to a multiprocessor environment allows core pipelines to retire instructions out of program order while still enforcing sequential consistency. This proposal provides performance similar to that of ROB-based multiprocessor architectures implementing a relaxed memory model, and it outperforms sequentially consistent multiprocessors with in-order retirement.