Contact

Position:
Full Professor
Address:
Valencia
Phone:
+34 963877007 (ext. 75753)


Position

As of January 2010, full-time associate professor at the Universidad Politécnica de Valencia (UPV), in the Parallel Architectures Group of the School of Engineering in Computer Science.

Research Topics

Networks on chip:
  • Routing algorithms and their implementations that address new challenges in building on-chip networks, including fault tolerance, power management, and virtualization
  • New router architectures and topologies for on-chip networks
  • Interaction between cache coherence protocols and the on-chip network in tile-based CMP systems
  • Congestion management in on-chip networks
  • Router designs for efficient on-chip interconnects
  • On-chip networks for embedded systems (addressing heterogeneity)

High-performance (off-chip) interconnects:
  • InfiniBand-like networks, including routing algorithms, congestion management techniques, and fault-tolerant algorithms
  • Quality of service

Much of this research has been carried out within funded national and international research projects such as NaNoC, COMCAS, Consolider-Ingenio 2010, and CICYT.

Current and former PhD students advised:

  • Teresa Nachiondo Farinós, Assistant Professor at UPV
  • José Miguel Montañana Aliaga
  • Andrés Mejía Gómez, currently at Intel Santa Clara
  • Gaspar Mora Porta, currently at Intel Santa Clara
  • Samuel Rodrigo Mocholí
  • Jesús Camacho Villanueva
  • Toni Roca
  • José Cano Reyes

Other students advised within research projects:

  • José María Martínez

Publications

  1. Tomas Picornell, Carles Hernández, Jose Flich and Jose Duato. Enforcing Predictability of Many-cores with DCFNoC. IEEE Transactions on Computers, 2020.

    @article{10.1109/TC.2020.2987797,
    	author = "Picornell, Tomas and Hern{\'a}ndez, Carles and Flich, Jose and Duato, Jose",
    	abstract = "The ever need for higher performance forces industry to include technology based on multi-processors system on chip (MPSoCs) in their safety-critical embedded systems. MPSoCs include a network-on-chip (NoC) to interconnect the cores between them and with memory and the rest of shared resources. Unfortunately, the inclusion of NoCs compromises guaranteeing time predictability as network-level conflicts may occur. To overcome this problem, in this paper we propose DCFNoC, a new time-predictable NoC design paradigm where conflicts within the network are eliminated by design. This new paradigm builds on top of the Channel Dependency Graph (CDG) in order to deterministically avoid network conflicts. The network guarantees predictability to applications and is able to naturally inject messages using a TDM period equal to the optimal theoretical bound without the need of using a computationally demanding offline process. DCFNoC is integrated in a tile-based many-core system and adapted to its memory hierarchy. Our results show that DCFNoC guarantees time predictability avoiding network interference among multiple running applications. DCFNoC always guarantees performance and also improves wormhole performance in a 4 × 4 setting by a factor of 3.7× when interference traffic is injected. For a 8 × 8 network differences are even larger. In addition, DCFNoC obtains a total area saving of 10.79% over a standard wormhole implementation.",
    	journal = "IEEE Transactions on Computers",
    	title = "{E}nforcing {P}redictability of {M}any-cores with {DCFN}o{C}",
    	year = 2020
    }
    
  2. Miguel Gorgues and Jose Flich. A Low-Latency and Flexible TDM NoC for Strong Isolation in Security-Critical Systems. 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2019.

    @inproceedings{10.1109/MCSoC.2019.00029,
    	author = "Gorgues, Miguel and Flich, Jose",
    	booktitle = "2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)",
    	title = "{A} {L}ow-{L}atency and {F}lexible {TDM} {N}o{C} for {S}trong {I}solation in {S}ecurity-{C}ritical {S}ystems",
    	year = 2019
    }
    
  3. Tomas Picornell, Carles Hernández, Jose Duato and Jose Flich. DCFNoC: A Delayed Conflict-Free Time Division Multiplexing Network on Chip. 56th Annual Design Automation Conference (DAC), 2019.

    @inproceedings{10.1145/3316781.3317794,
    	author = "Picornell, Tomas and Hern{\'a}ndez, Carles and Duato, Jose and Flich, Jose",
    	abstract = "The adoption of many-cores in safety-critical systems requires real-time capable networks on chip (NoC). In this paper we propose a new time-predictable NoC design paradigm where contention within the network is eliminated. This new paradigm builds on the Channel Dependency Graph (CDG) and guarantees by design the absence of contention. Our delayed conflict-free NoC (DCFNoC) is able to naturally inject messages using a TDM period equal to the optimal theoretical bound and without the need of using a computationally demanding offline process. Results show that DCFNoC guarantees time predictability with very low implementation cost.",
    	booktitle = "56th Annual Design Automation Conference 2019",
    	title = "{DCFN}o{C}: {A} {D}elayed {C}onflict-{F}ree {T}ime {D}ivision {M}ultiplexing {N}etwork on {C}hip",
    	year = 2019
    }
    
  4. Jose Flich, Rafael Tornero, Jose Maria Martínez and Carles Hernández. Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems. 2018 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulations (SAMOS XVIII), 2018.

    @inproceedings{10.1145/3229631.3239368,
    	author = "Flich, Jose and Tornero, Rafael and Mart{\'i}nez, Jose Maria and Hern{\'a}ndez, Carles",
    	abstract = {The transition to Exascale computing is going to be characterised by an increased range of application classes. In addition to traditional massively parallel "number crunching" applications, new classes are emerging such as real-time HPC and data-intensive scalable computing. Furthermore, Exascale computing is characterised by a "democratisation" of HPC: to fully exploit the capabilities of Exascale-level facilities, HPC is moving towards enabling access to its resources to a wider range of new players, including SMEs, through cloud-based approaches [1]. Finally, the need for much higher energy efficiency is pushing towards deep heterogeneity, widening the range of options for acceleration, moving from the traditional CPU-only organization, to the CPU plus GPU which currently dominates the Green500¹, to more complex options including programmable accelerators and even (reconfigurable) hardware accelerators [2].},
    	booktitle = "2018 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulations (SAMOS XVIII)",
    	title = "{R}eliable power and time-constraints-aware predictive management of heterogeneous exascale systems",
    	year = 2018
    }
    
  5. Jose Flich. Exploring Manycore Architectures for Next-Generation HPC Systems through the MANGO Approach. Microprocessors and Microsystems, 2018.

    @article{10.1016/j.micpro.2018.05.011,
    	author = "Flich, Jose",
    	abstract = "The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.",
    	journal = "Microprocessors and Microsystems",
    	title = "{E}xploring {M}anycore {A}rchitectures for {N}ext-{G}eneration {HPC} {S}ystems through the {MANGO} {A}pproach",
    	year = 2018
    }
    
  6. Jose Flich. MANGO: Exploring Manycore Architectures for Next-GeneratiOn HPC Systems. 2017 Euromicro Conference on Digital System Design (DSD), 2017.

    @inproceedings{10.1109/DSD.2017.51,
    	author = "Flich, Jose",
    	abstract = "The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.",
    	booktitle = "2017 Euromicro Conference on Digital System Design (DSD)",
    	title = "{MANGO}: {E}xploring {M}anycore {A}rchitectures for {N}ext-{G}enerati{O}n {HPC} {S}ystems",
    	year = 2017
    }
    
  7. Jose Vicente Escamilla Lopez and Jose Flich. ICARO-PAPM: Congestion Management with Selective Queue Power-Gating. 2017 International Conference on High Performance Computing & Simulation (HPCS), 2017.

    @inproceedings{10.1109/HPCS.2017.47,
    	author = "Escamilla Lopez, Jose Vicente and Flich, Jose",
    	booktitle = "2017 International Conference on High Performance Computing {\&} Simulation (HPCS)",
    	title = "{ICARO}-{PAPM}: {C}ongestion {M}anagement with {S}elective {Q}ueue {P}ower-{G}ating",
    	year = 2017
    }
    
  8. Jose Vicente Escamilla Lopez and Jose Flich. Increasing the Efficiency of Latency-Driven DVFS with a Smart NoC Congestion Management Strategy. IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2016.

    @inproceedings{10.1109/MCSoC.2016.42,
    	author = "Escamilla Lopez, Jose Vicente and Flich, Jose",
    	abstract = "Dynamic Voltage and Frequency Scaling (DVFS) can be a very effective power management strategy not only for on-chip processing elements but also for the network-on-chip (NoC). In this paper we propose a new approach to DVFS in NoC, which combines a congestion management strategy with a feedback-loop controller. The controller sets frequency and voltage to the lowest values that keep the NoC latency below a predetermined threshold. To cope with burstiness and hotspot patterns, which may lead the controller to overdrive the NoC with too high frequencies and voltages, leading to excessive power consumption, the congestion management strategy promptly identifies the flows that caused the abnormal traffic situation and eliminates them from the latency calculation, leading to a significantly higher power saving. Compared to a baseline DVFS strategy without congestion management, our results show that our proposal saves up to 53% more power when bursty or hotspot-based traffic patterns are detected. In addition, since we also apply power-gating to make an efficient use of the network buffers, we achieve an improvement of up to 38% in power savings when no bursts or hotspots are present.",
    	booktitle = "IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)",
    	keywords = "NoC",
    	title = "{I}ncreasing the {E}fficiency of {L}atency-{D}riven {DVFS} with a {S}mart {N}o{C} {C}ongestion {M}anagement {S}trategy",
    	year = 2016
    }
    
  9. Carles Hernández, Antoni Roca, Federico Silla, Jose Flich and Jose Duato. On the Impact of Within-Die Process Variation in GALS-Based NoC Performance. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31(2):294–307, 2012.

    @article{DBLP:journals/tcad/HernandezRSFD12,
    	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Silla, Federico and Flich, Jose and Duato, Jose",
    	journal = "IEEE Trans. on CAD of Integrated Circuits and Systems",
    	number = 2,
    	pages = "294--307",
    	title = "{O}n the {I}mpact of {W}ithin-{D}ie {P}rocess {V}ariation in {GALS}-{B}ased {N}o{C} {P}erformance",
    	volume = 31,
    	year = 2012
    }
    
  10. Antoni Roca, Carles Hernández, Jose Flich, Federico Silla and Jose Duato. A Distributed Switch Architecture for On-Chip Networks. In 2011 International Conference on Parallel Processing (ICPP). 2011, 21–30.

    @conference{6047169,
    	author = "Roca, Antoni and Hern{\'a}ndez, Carles and Flich, Jose and Silla, Federico and Duato, Jose",
    	abstract = "It is well-known that current Chip Multiprocessor (CMP) and high-end MultiProcessor System-on-Chip (MPSoC) designs are growing in their number of components. Networks-on-Chip (NoC) provide the required connectivity for such CMP and MPSoC designs at reasonable costs. However, as technology advances, links become the critical component in the NoC. First, because the power consumption of the link is extremely high with respect the power consumption of the rest of components (mainly switches), becoming unacceptable for long global interconnects. Second, the delay of a link does not scale with technology, thus, degrading the performance of the network. To solve both problems, several solutions have been previously proposed. In this paper, we present a new switch architecture that reduces the negative impact of links on the NoC. We call our proposal distributed switch. The distributed switch moves the circuitry of a standard switch onto the links. Then, packets are buffered, routed, and forwarded at the same time they are crossing the link. Distributing a standard switch onto the link improves the trade off between the power consumption and the operating frequency of the entire network. In contrast, area requirements are increased. The distributed switch reduces up to 14.8% the peak power consumption while increases its area up to 22%. Furthermore, the distributed switch is able to increase the maximum achievable frequency with respect to the standard switch. In particular, the maximum operating frequency of the distributed switch can be increased up to 14.3%.",
    	booktitle = "2011 International Conference on Parallel Processing (ICPP)",
    	doi = "10.1109/ICPP.2011.28",
    	issn = "0190-3918",
    	month = sep,
    	pages = "21--30",
    	title = "{A} {D}istributed {S}witch {A}rchitecture for {O}n-{C}hip {N}etworks",
    	year = 2011
    }
    
  11. Jesus Escudero-Sahuquillo, Ernst Gunnar Gran, Pedro Javier Garcia, Jose Flich, Tor Skeie, Olav Lysne, Francisco Jose Quiles and Jose Duato. Combining Congested-Flow Isolation and Injection Throttling in HPC Interconnection Networks. In 2011 International Conference on Parallel Processing (ICPP). 2011, 662–672.

    @conference{6047234,
    	author = "Jesus Escudero-Sahuquillo and Ernst Gunnar Gran and Pedro Javier Garcia and Flich, Jose and Tor Skeie and Olav Lysne and Francisco Jose Quiles and Duato, Jose",
    	abstract = "Existing congestion control mechanisms in interconnects can be divided into two general approaches. One is to throttle traffic injection at the sources that contribute to congestion, and the other is to isolate the congested traffic in specially designated resources. These two approaches have different, but non-overlapping weaknesses. In this paper we present in detail a method that combines injection throttling and congested-flow isolation. Through simulation studies we first demonstrate the respective flaws of the injection throttling and of flow isolation. Thereafter we show that our combined method extracts the best of both approaches in the sense that it gives fast reaction to congestion, it is scalable and it has good fairness properties with respect to the congested flows.",
    	booktitle = "2011 International Conference on Parallel Processing (ICPP)",
    	doi = "10.1109/ICPP.2011.80",
    	issn = "0190-3918",
    	month = sep,
    	pages = "662--672",
    	title = "{C}ombining {C}ongested-{F}low {I}solation and {I}njection {T}hrottling in {HPC} {I}nterconnection {N}etworks",
    	year = 2011
    }
    
  12. , Jose Flich, Antoni Roca and Jose Duato. PC-Mesh: A Dynamic Parallel Concentrated Mesh. In 2011 International Conference on Parallel Processing (ICPP). 2011, 642–651.

    @conference{6047232,
    	author = ", and Flich, Jose and Roca, Antoni and Duato, Jose",
    	abstract = "We present a novel network on-chip topology, PC-Mesh (Parallel Concentrated Mesh), suitable for tiled CMP systems. The topology is built using four concentrated mesh (C-Mesh) networks and a new network interface able to inject packets through different networks. The goal of the new combined topology is to minimize the power consumption of the network when running applications exhibiting low traffic rates and maximize throughput when applications require high traffic rates. Thus, the topology is dynamically adjusted (switching on and off network components) with a proper injection algorithm, adapting itself to the network on-chip traffic requirements. The PC-Mesh network performs as a C-Mesh network (using one sub network) when the traffic is low obtaining large savings in power consumption. When the load network increases, new sub networks are opened and thus higher traffic rates are supported, thus providing comparable results as the mesh network. Additional benefits of the PC-Mesh network is its fault tolerance degree and the lower latency in terms of hops. An alternative PC-Mesh version is provided to optimize the fault-tolerance degree. Comparative results with detailed evaluations (in area, power, and delay) are provided both for the network interface and switches. Results demonstrate PC-Mesh is able to dynamically adapt to the current traffic situations. Experimental results with a system-level simulation platform (including the application being run and the operating system) are provided. Results show how the PC-Mesh network achieves the same results as the C-Mesh topology reducing execution time of applications by 20% as well as energy consumption by also 20%, when compared with the 2D-Mesh network topology. However, when challenged with higher traffic demands, PC-Mesh outperforms the C-Mesh network by achieving much lower execution time of applications and lower energy consumption. In some scenarios, execution time is reduced by a factor of 2 and power consumption by 50%.",
    	booktitle = "2011 International Conference on Parallel Processing (ICPP)",
    	doi = "10.1109/ICPP.2011.21",
    	issn = "0190-3918",
    	month = sep,
    	pages = "642--651",
    	title = "{PC}-{M}esh: {A} {D}ynamic {P}arallel {C}oncentrated {M}esh",
    	year = 2011
    }
    
  13. Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Cost-Efficient On-Chip Routing Implementations for CMP and MPSoC Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30(4):534–547, April 2011.

    @article{5737867,
    	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
    	abstract = "The high-performance computing domain is enriching with the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, thus, efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof concept.",
    	doi = "10.1109/TCAD.2011.2119150",
    	issn = "0278-0070",
    	journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
    	keywords = "Fault-tolerance, logic design, networks-on-chip, routing",
    	month = "april",
    	number = 4,
    	pages = "534--547",
    	title = "{C}ost-{E}fficient {O}n-{C}hip {R}outing {I}mplementations for {CMP} and {MPS}o{C} {S}ystems",
    	url = "http://dx.doi.org/10.1109/TCAD.2011.2119150",
    	volume = 30,
    	year = 2011
    }
    
  14. Carles Hernández, Antoni Roca, Jose Flich, Federico Silla and Jose Duato. Fault-Tolerant Vertical Link Design for Effective 3D Stacking. IEEE Computer Architecture Letters 99(RapidPosts), 2011.

    @article{10.1109/L-CA.2011.17,
    	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
    	address = "Los Alamitos, CA, USA",
    	doi = "10.1109/L-CA.2011.17",
    	issn = "1556-6056",
    	journal = "IEEE Computer Architecture Letters",
    	number = "RapidPosts",
    	publisher = "IEEE Computer Society",
    	title = "{F}ault-{T}olerant {V}ertical {L}ink {D}esign for {E}ffective 3{D} {S}tacking",
    	url = "http://doi.ieeecomputersociety.org/10.1109/L-CA.2011.17",
    	volume = 99,
    	year = 2011
    }
    
  15. Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Cost-efficient on-chip routing implementations for CMP and MPSoC systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30(4):534–547, 2011.

    @article{20111313880819,
    	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
    	abstract = "The high-performance computing domain is enriching with the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, thus, efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof concept. © 2006 IEEE.",
    	address = "445 Hoes Lane / P.O. Box 1331, Piscataway, NJ 08855-1331, United States",
    	doi = "10.1109/TCAD.2011.2119150",
    	issn = "0278-0070",
    	journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
    	key = "Fault tolerance",
    	keywords = "Computer software selection and evaluation;Logic design;Microprocessor chips;Quality assurance;Telecommunication networks;Topology;",
    	note = "Cost-efficient;Distributed routing;Efficient routing;High-performance computing;Irregular topology;Key component;Latency constraints;Many-core;Networks on chips;networks-on-chip;On chips;Performance costs;Power Consumption;Power-aware;Router model;routing;Routing table;Universal logic;",
    	number = 4,
    	pages = "534--547",
    	title = "{C}ost-efficient on-chip routing implementations for {CMP} and {MPS}o{C} systems",
    	url = "http://dx.doi.org/10.1109/TCAD.2011.2119150",
    	volume = 30,
    	year = 2011
    }
    
  16. J Escudero-Sahuquillo, P J Garcia, F J Quiles, Jose Flich and Jose Duato. Cost-effective queue schemes for reducing head-of-line blocking in fat-trees. Concurrency and Computation: Practice and Experience 12(15), 2011.

    @article{IP51411971,
    	author = "J. Escudero-Sahuquillo and P.J. Garcia and F.J. Quiles and Flich, Jose and Duato, Jose",
    	abstract = "The fat-tree is one of the most common topologies among the interconnection networks of the systems currently used for high-performance parallel computing. Among other advantages, fat-trees allow the use of simple but very efficient routing schemes. One of them is a deterministic routing algorithm that has been recently proposed, offering a similar (or better) performance than adaptive routing while reducing complexity and guaranteeing in-order packet delivery. However, as other deterministic routing proposals, this deterministic routing algorithm cannot react when high traffic loads or hot-spot traffic scenarios produce severe contention for the use of network resources, leading to the appearance of Head-of-Line (HoL) blocking, which spoils the network performance. In that sense, we describe in this paper two simple, cost-effective strategies for dealing with the HoL-blocking problem that may appear in fat-trees with the aforementioned deterministic routing algorithm. From the results presented in the paper, we conclude that, in the mentioned environment, these proposals considerably reduce HoL-blocking without significantly increasing switch complexity and the required silicon area. © 2011 John Wiley {\&} Sons, Ltd.",
    	doi = "10.1002/cpe.1764",
    	issn = "1532-0626",
    	journal = "Concurrency and Computation: Practice and Experience",
    	key = "Trees (mathematics)",
    	keywords = "Cost effectiveness;Network performance;Packet networks;Parallel architectures;Routing algorithms;",
    	note = "Adaptive routing;Deterministic routing;Deterministic routing algorithms;Efficient routing;Head of line blocking;Hot-spot traffic;In-order packet delivery;Network resource;Parallel Computing;Silicon area;Switch complexity;Traffic loads;",
    	number = 15,
    	title = "{C}ost-effective queue schemes for reducing head-of-line blocking in fat-trees",
    	url = "http://dx.doi.org/10.1002/cpe.1764",
    	volume = 12,
    	year = 2011
    }
    
  17. F Trivino, J Sanchez, F J Alfaro and Jose Flich. Virtualizing network-on-chip resources in chip-multiprocessors. Microprocessors and Microsystems 35(2):230–245, 2011.

    @article{11839233,
    	author = "F. Trivino and J. Sanchez and F.J. Alfaro and Flich, Jose",
    	abstract = "The number of cores on a single silicon chip is rapidly growing and chips containing tens or even hundreds of identical cores are expected in the future. To take advantage of multicore chips, multiple applications will run simultaneously. As a consequence, the traffic interferences between applications increases and the performance of individual applications can be seriously affected. In this paper, we improve the individual application performance when several applications are simultaneously running. This proposal is based on the virtualization concept and allows us to reduce execution time and network latency in a significant percentage. [All rights reserved Elsevier].",
    	address = "Netherlands",
    	doi = "10.1016/j.micpro.2010.10.001",
    	issn = "0141-9331",
    	journal = "Microprocessors and Microsystems",
    	keywords = "multiprocessing systems;network-on-chip;",
    	note = "virtualizing network-on-chip resources;chip multiprocessors;single silicon chip;identical cores;multicore chips;traffic interferences;virtualization concept;",
    	number = 2,
    	pages = "230--245",
    	title = "{V}irtualizing network-on-chip resources in chip-multiprocessors",
    	url = "http://dx.doi.org/10.1016/j.micpro.2010.10.001",
    	volume = 35,
    	year = 2011
    }
    
  18. Carles Hernández, Antoni Roca, Jose Flich, Federico Silla and Jose Duato. Characterizing the impact of process variation on 45 nm NoC-based CMPs. Journal of Parallel and Distributed Computing 71(5):651–663, 2011.

    @article{20111413888254,
    	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
    	abstract = "Current integration scales make possible to design chip multiprocessors with a large amount of cores interconnected by a NoC. Unfortunately, they also bring process variation, posing a new burden to processor manufacturers. Regarding the NoC, variability causes that the delays of links and routers do not match those initially established at design time. In this paper we analyze how variability affects the NoC by applying a new variability model to 100 instances of an 8 × 8 mesh NoC synthesized using 45 nm technology. We also show that GALS-based NoCs present communication bottlenecks due to the slower components of the network, which cause congestion, thus reducing performance. This performance reduction finally affects the applications being executed in the CMP because they may be mapped to slower areas of the chip. In this paper we show that using a mapping algorithm that considers variability data may improve application execution time up to 50%. © 2010 Elsevier Inc. All rights reserved.",
    	address = "6277 Sea Harbor Drive, Orlando, FL 32887-4900, United States",
    	doi = "10.1016/j.jpdc.2010.09.006",
    	issn = "0743-7315",
    	journal = "Journal of Parallel and Distributed Computing",
    	key = "Routers",
    	keywords = "Conformal mapping;Design;Microprocessor chips;Multiprocessing systems;Servers;Systems analysis;VLSI circuits;",
    	note = "Chip Multiprocessor;NoC (or Network-on-Chip);Process mapping;Process variations;Router design;",
    	number = 5,
    	pages = "651--663",
    	title = "{C}haracterizing the impact of process variation on 45 nm {N}o{C}-based {CMP}s",
    	url = "http://dx.doi.org/10.1016/j.jpdc.2010.09.006",
    	volume = 71,
    	year = 2011
    }
    
  19. Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Cost-Efficient On-Chip Routing Implementations for CMP and MPSoC Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30(4):534–547, 2011.

    @article{11874902,
    	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
    	abstract = "The high-performance computing domain is enriching with the inclusion of networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area, and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism, or power-aware techniques may break topology regularity, thus, efficient routing becomes a challenge. This paper presents universal logic-based distributed routing (uLBDR), an efficient logic-based mechanism that adapts to any irregular topology derived from 2-D meshes, instead of using routing tables. uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the tradeoff between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the tradeoff between fault tolerance and performance. Power consumption, area, and delay estimates are also provided highlighting the efficiency of the mechanism. To do this, different router models (one for CMPs and one for MPSoCs) have been designed as a proof concept.",
    	address = "USA",
    	doi = "10.1109/TCAD.2011.2119150",
    	issn = "0278-0070",
    	journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
    	keywords = "microprocessor chips;network routing;network-on-chip;",
    	note = "cost-efficient on-chip routing implementations;chip multiprocessors;CMP;MPSoC Systems;many-core system-on-chip;networks-on-chip;communication scalability;latency constraints;area constraints;power constraints;application-level parallelism;power-aware techniques;topology regularity;universal logic-based distributed routing;logic-based mechanism;2D meshes;fault tolerance;fault performance;power consumption;",
    	number = 4,
    	pages = "534--547",
    	title = "{C}ost-{E}fficient {O}n-{C}hip {R}outing {I}mplementations for {CMP} and {MPS}o{C} {S}ystems",
    	url = "http://dx.doi.org/10.1109/TCAD.2011.2119150",
    	volume = 30,
    	year = 2011
    }
    
  20. , Jose Flich, Jose Duato, H Eberle and W Olesinski. A power-efficient network on-chip topology. In Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip. 2011, 23–26. URL, DOI BibTeX

    @conference{Camacho:2011:PNO:1930037.1930044,
    	author = ", and Flich, Jose and Duato, Jose and H. Eberle and W. Olesinski",
    	abstract = "NoCs have become a critical component in many-core architectures. Usually, the preferred topology is the 2D-Mesh as it enables a tile-based layout significantly reducing the design effort. However, new emerging challenges such as power consumption need to be addressed. Looking at the NoC, routers and links not being used must be switched off, thus achieving large power savings. Topology and routing algorithm must be carefully designed as they may lack enough flexibility to switch off components for long periods of time. We present the NR-Mesh (Nearest neighboR Mesh) topology. It gives an end node the choice to inject a message through different neighboring routers, thereby reducing hop count and saving latency. At the receiver side, a message may be delivered to the end node through different routers, thus reducing hop count further and increasing flexibility. When allowing links and routers to switch off and combined with adaptive routing, the power management technique is able to achieve significant power savings (up to 36% savings in static power consumed at routers). When compared with the 2D-Mesh, NR-Mesh reduces execution time by 23% and power consumption at routers by 47%.",
    	address = "New York, NY, USA",
    	booktitle = "Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip",
    	doi = "10.1145/1930037.1930044",
    	isbn = "978-1-4503-0272-2",
    	keywords = "Network-on-Chip; Power Efficient Chip Technology; Chip Topology; Routing Algorithms;",
    	pages = "23--26",
    	publisher = "ACM",
    	series = "INA-OCMC '11",
    	title = "{A} power-efficient network on-chip topology",
    	url = "http://doi.acm.org/10.1145/1930037.1930044",
    	year = 2011
    }
    
  21. , Jose Flich, Jose Duato, H Eberle and W Olesinski. Towards an Efficient NoC Topology through Multiple Injection Ports. In Digital System Design (DSD), 2011 14th Euromicro Conference on. 2011, 165–172. DOI BibTeX

    @conference{6037406,
    	author = ", and Flich, Jose and Duato, Jose and H. Eberle and W. Olesinski",
    	abstract = "In this paper, we present a flexible network on-chip topology: NR-Mesh (Nearest neighbor Mesh). The topology gives an end node the choice to inject a message through different neighboring routers, thereby reducing hop count and saving latency. At the receiver side, a message may be delivered to the end node through different routers, thus reducing hop count further and increasing flexibility when routing messages. This flexibility allows for maximizing network components to be in switch off mode, thus enabling power aware routing algorithms. Additional benefits are reduced congestion/contention levels in the network, support for efficient broadcast operations, savings in power consumption, and partial fault-tolerance. Our second contribution is a power management technique for the adaptive routing. This technique turns router ports and their attached links on and off depending on traffic conditions. The power management technique is able to achieve significant power savings when there is low traffic in the network. We further compare the new topology with the 2D-Mesh, using either deterministic or adaptive routing. When compared with the 2D-Mesh using deterministic routing, executing real applications in a full system simulation platform, the NR-Mesh topology using adaptive routing is able to obtain significant savings, 7% of reduction in execution time and 75% in energy consumption at the network on average for a 16-Node CMP System. Similar numbers are achieved for a 32-Node CMP system.",
    	booktitle = "Digital System Design (DSD), 2011 14th Euromicro Conference on",
    	doi = "10.1109/DSD.2011.25",
    	keywords = "CMP system;NR-mesh topology;NoC topology;adaptive routing;broadcast operation;congestion level;contention level;deterministic routing;energy consumption;fault-tolerance;flexible network on-chip topology;hop count;injection port;nearest neighbor mesh;neigh",
    	month = "aug. 31--sept. 2",
    	pages = "165--172",
    	title = "{T}owards an {E}fficient {N}o{C} {T}opology through {M}ultiple {I}njection {P}orts",
    	year = 2011
    }
    
  22. Jesus Escudero-Sahuquillo, Pedro J Garcia, Francisco J Quiles, Jose Flich and Jose Duato. Cost-Effective Congestion Management for Interconnection Networks Using Distributed Deterministic Routing. In 16th International Conference on Parallel and Distributed Systems (ICPADS 2010). December 2010. BibTeX

    @conference{icpads2010,
    	author = "Jesus Escudero-Sahuquillo and Pedro J. Garcia and Francisco J. Quiles and Flich, Jose and Duato, Jose",
    	abstract = "Interconnection networks are essential elements in current computing systems. For this reason, achieving the best network performance, even in congestion situations, has been a primary goal in recent years. In that sense, there exist several techniques focused on eliminating the main negative effect of congestion: the Head of Line (HOL) blocking. One of the most successful HOL blocking elimination techniques is RECN, which can be applied in source routing networks. FBICM follows the same approach as RECN, but it has been developed for distributed deterministic routing networks. Although FBICM effectively eliminates HOL blocking, it requires too many resources to be implemented. In this paper we present a new FBICM version, based on a new organization of switch memory resources, that significantly reduces the required silicon area, complexity and cost. Moreover, we present new results about FBICM in network topologies not yet analyzed. From the experimental results we can conclude that a far less complex and feasible FBICM implementation can be achieved by using the proposed improvements, without losing efficiency.",
    	address = "Shanghai, China",
    	booktitle = "16th International Conference on Parallel and Distributed Systems (ICPADS 2010)",
    	keywords = "Deterministic Routing; Congestion Management; Head-Of-Line Blocking;",
    	month = "December",
    	title = "{C}ost-{E}ffective {C}ongestion {M}anagement for {I}nterconnection {N}etworks {U}sing {D}istributed {D}eterministic {R}outing",
    	year = 2010
    }
    
  23. Antoni Roca, Jose Flich, Federico Silla and Jose Duato. VCTlite: Towards an Efficient Implementation of Virtual Cut-Through Switching in On-Chip Networks. In 17th Int'l Conference on High Performance Computing (HiPC). In press, December 2010. BibTeX

    @conference{roca-hipc10,
    	author = "Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
    	address = "Goa, India",
    	booktitle = "17th Int'l Conference on High Performance Computing (HiPC)",
    	keywords = "on-chip networks; switching;",
    	month = "December",
    	title = "{VCT}lite: {T}owards an {E}fficient {I}mplementation of {V}irtual {C}ut-{T}hrough {S}witching in {O}n-{C}hip {N}etworks",
    	note = "In press",
    	year = 2010
    }
    
  24. J Flich and D Bertozzi (eds.). Designing Network On-Chip Architectures in the Nanoscale Era. CRC Press, December 2010. URL BibTeX

    @book{365336,
    	author = "Gilabert, Francisco and Silla, Federico and Gomez, Maria E. and Lodde, Mario and Roca, Antoni and Flich, Jose and Duato, Jose and Hern{\'a}ndez, Carles and Rodrigo, Samuel",
    	abstract = "Going beyond isolated research ideas and design experiences, Designing Network On-Chip Architectures in the Nanoscale Era covers the foundations and design methods of network on-chip (NoC) technology. The contributors draw on their own lessons learned to provide strong practical guidance on various design issues. Exploring the design process of the network, the first part of the book focuses on basic aspects of switch architecture and design, topology selection, and routing implementation. In the second part, contributors discuss their experiences in the industry, offering a roadmap to recent products. They describe Tilera’s TILE family of multicore processors, novel Intel products and research prototypes, and the TRIPS operand network (OPN). The last part reveals state-of-the-art solutions to hardware-related issues and explains how to efficiently implement the programming model at the network interface. In the appendix, the microarchitectural details of two switch architectures targeting multiprocessor system-on-chips (MPSoCs) and chip multiprocessors (CMPs) can be used as an experimental platform for running tests. A stepping stone to the evolution of future chip architectures, this volume provides a how-to guide for designers of current NoCs as well as designers involved with 2015 computing platforms. It cohesively brings together fundamental design issues, alternative design paradigms and techniques, and the main design tradeoffs—consistently focusing on topics most pertinent to real-world NoC designers.",
    	editor = "Flich, J. and Bertozzi, D.",
    	isbn = 9781439837108,
    	keywords = "Network on chip;Chip Architectures;",
    	month = "December",
    	publisher = "CRC Press",
    	title = "{D}esigning {N}etwork {O}n-{C}hip {A}rchitectures in the {N}anoscale {E}ra",
    	url = "http://www.crcpress.com/product/isbn/9781439837108",
    	year = 2010
    }
    
  25. Antoni Roca, Jose Flich, Federico Silla and Jose Duato. A Latency-Efficient Router Architecture for CMP Systems. In Digital System Design: Architectures, Methods and Tools (DSD), 2010 13th Euromicro Conference on. 2010, 165–172. URL, DOI BibTeX

    @conference{5615623,
    	author = "Roca, Antoni and Flich, Jose and Silla, Federico and Duato, Jose",
    	abstract = "As technology advances, the number of cores in Chip Multi Processor systems (CMPs) and Multi Processor Systems-on-Chips (MPSoCs) keeps increasing. Current test chips and products reach tens of cores, and it is expected to reach hundreds of cores in the near future. Such complexity demands an efficient network-on-chip (NoC). The common choice to build such networks is the 2D mesh topology (as it matches the regular tile-based design) and the Dimension-Order Routing (DOR) algorithm (because of its simplicity). The network in such systems must provide sustained throughput and ultra low latencies. One of the key components in the network is the router, and thus, it plays a major role when designing for such performance levels. In this paper we propose a new pipelined router design focused on reducing the router latency. As a first step we identify the router components that take most of the critical path, and thus limit the router frequency. In particular, the arbiter is the one limiting the performance of the router. Based on this fact, we simplify the arbiter logic by using multiple smaller arbiters. The initial set of requests in the initial arbiter is then distributed over the smaller arbiters that operate in parallel. With this design procedure, and with a proper internal router organization, different router architectures are evolved. All of them enable the use of smaller arbiters in parallel by replicating ports and assuming the use of the DOR algorithm. The net result of such changes is a faster router. Preliminary results demonstrate a router latency reduction ranging from 10% to 21% with an increase in the router area. Network latency is reduced in a range from 11% to 15%.",
    	booktitle = "Digital System Design: Architectures, Methods and Tools (DSD), 2010 13th Euromicro Conference on",
    	doi = "10.1109/DSD.2010.42",
    	isbn = "978-1-4244-7839-2",
    	keywords = "arbiter design;low latency router;network-on-chip;router architecture;router design",
    	month = "sept.",
    	pages = "165--172",
    	title = "{A} {L}atency-{E}fficient {R}outer {A}rchitecture for {CMP} {S}ystems",
    	url = "http://dx.doi.org/10.1109/DSD.2010.42",
    	year = 2010
    }
    
  26. Teresa Nachiondo, Jose Flich and Jose Duato. Buffer Management Strategies to Reduce HoL Blocking. Parallel and Distributed Systems, IEEE Transactions on 21(6):739–753, June 2010. URL, DOI BibTeX

    @article{4815231,
    	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose",
    	abstract = "Congestion management is likely to become a critical issue in interconnection networks, as increasing power consumption and cost concerns lead designers to improve the efficiency of network resources. In previous configurations, networks were usually overdimensioned and underutilized. In a smaller network, however, contention is more likely to happen and blocked packets introduce head-of-line (HoL) blocking to the rest of the packets, spreading congestion quickly. The best-known solution to HoL blocking is Virtual Output Queues (VOQs). However, the cost of implementing VOQs increases quadratically with the number of output ports in the network, thus being impractical. Therefore, a more scalable and cost-effective solution is required to reduce or eliminate HoL blocking. In this paper, we present methodologies, referred to as Destination-Based Buffer Management (DBBM), to reduce/eliminate the HoL blocking effect on interconnection networks. DBBM efficiently uses the resources (mainly memory queues) of the network. These methodologies are comprehensively evaluated in terms of throughput, scalability and fairness. Results show that the use of the DBBM strategy with a reduced number of queues at each switch is able to obtain roughly the same throughput as the VOQ mechanism. Moreover, all the proposed strategies are designed in such a way that they can be used in any switch architecture. We compare DBBM with RECN, a sophisticated mechanism that eliminates HoL blocking in congestion situations. Our mechanism is able to achieve almost the same performance with very low logic requirements (in contrast with RECN).",
    	doi = "10.1109/TPDS.2009.63",
    	issn = "1045-9219",
    	journal = "Parallel and Distributed Systems, IEEE Transactions on",
    	keywords = "computer network management; quality of service; storage management; telecommunication congestion control",
    	month = "June",
    	number = 6,
    	pages = "739--753",
    	title = "{B}uffer {M}anagement {S}trategies to {R}educe {H}o{L} {B}locking",
    	url = "http://dx.doi.org/10.1109/TPDS.2009.63",
    	volume = 21,
    	year = 2010
    }
    
  27. Samuel Rodrigo, Jose Flich, Antoni Roca, S Medardoni, D Bertozzi, , Federico Silla and Jose Duato. Addressing Manufacturing Challenges with Cost-Efficient Fault Tolerant Routing. In Networks-on-Chip (NOCS), 2010 Fourth ACM/IEEE International Symposium on. May 2010, 25–32. URL, DOI BibTeX

    @conference{5507564,
    	author = "Rodrigo, Samuel and Flich, Jose and Roca, Antoni and S. Medardoni and D. Bertozzi and , and Silla, Federico and Duato, Jose",
    	abstract = "The high-performance computing domain is being enriched by the inclusion of Networks-on-chip (NoCs) as a key component of many-core (CMPs or MPSoCs) architectures. NoCs face the communication scalability challenge while meeting tight power, area and latency constraints. Designers must address new challenges that were not present before. Defective components, the enhancement of application-level parallelism or power-aware techniques may break topology regularity, thus, efficient routing becomes a challenge. In this paper, uLBDR (Universal Logic-Based Distributed Routing) is proposed as an efficient logic-based mechanism that adapts to any irregular topology derived from 2D meshes, being an alternative to the use of routing tables (either at routers or at end-nodes). uLBDR requires a small set of configuration bits, thus being more practical than large routing tables implemented in memories. Several implementations of uLBDR are presented highlighting the trade-off between routing cost and coverage. The alternatives span from the previously proposed LBDR approach (with 30% of coverage) to the uLBDR mechanism achieving full coverage. This comes with a small performance cost, thus exhibiting the trade-off between fault tolerance and performance.",
    	booktitle = "Networks-on-Chip (NOCS), 2010 Fourth ACM/IEEE International Symposium on",
    	doi = "10.1109/NOCS.2010.12",
    	keywords = "NoC;addressing manufacturing challenges;application level parallelism;cost efficient fault tolerant routing;logic based mechanism;networks-on-chip;power aware techniques;universal logic based distributed routing;network routing;network topology;network-on",
    	month = "may",
    	pages = "25--32",
    	title = "{A}ddressing {M}anufacturing {C}hallenges with {C}ost-{E}fficient {F}ault {T}olerant {R}outing",
    	url = "http://dx.doi.org/10.1109/NOCS.2010.12",
    	year = 2010
    }
    
  28. Carles Hernández, Antoni Roca, Federico Silla, Jose Flich and Jose Duato. Improving the Performance of GALS-Based NoCs in the Presence of Process Variation. In 2010 ACM/IEEE International Symposium on Networks-on-Chip (NOCS). May 2010, 35–42. URL, DOI BibTeX

    @conference{11416504,
    	author = "Hern{\'a}ndez, Carles and Roca, Antoni and Silla, Federico and Flich, Jose and Duato, Jose",
    	abstract = "Current integration scales allow designing chip multiprocessors (CMP) where cores are interconnected by means of a network-on-chip (NoC). Unfortunately, the small feature size of current integration scales causes some unpredictability in manufactured devices because of process variation. In NoCs, variability may affect links and routers so that they do not match the parameters established at design time. In this paper we first analyze the way that manufacturing deviations affect the components of a NoC by applying a comprehensive and detailed variability model to 200 instances of an 8×8 mesh NoC synthesized using 45 nm technology. A second contribution of this paper is showing that GALS-based NoCs present communication bottlenecks under process variation. To overcome this performance reduction we draft a novel approach, called performance domains, intended to reduce the negative impact of variability on application execution time. This mechanism is suitable when several applications are simultaneously running in the CMP chip.",
    	address = "Grenoble, France",
    	booktitle = "2010 ACM/IEEE International Symposium on Networks-on-Chip (NOCS)",
    	doi = "10.1109/NOCS.2010.13",
    	journal = "2010 ACM/IEEE International Symposium on Networks-on-Chip (NOCS)",
    	keywords = "integrated circuit design;large scale integration;network-on-chip;performance evaluation;",
    	month = "May",
    	note = "GALS-based NoCs;chip multiprocessors;network-on-chip;manufacturing deviations;process variation;performance domains;integration scales;",
    	pages = "35--42",
    	publisher = "ACM",
    	title = "{I}mproving the {P}erformance of {GALS}-{B}ased {N}o{C}s in the {P}resence of {P}rocess {V}ariation",
    	url = "http://dx.doi.org/10.1109/NOCS.2010.13",
    	year = 2010
    }
    
  29. Francisco Triviño, José L Sánchez, Francisco J Alfaro and Jose Flich. Virtualizing network-on-chip resources in chip-multiprocessors. Microprocessors and Microsystems, in press (uncorrected proof), 2010. URL, DOI BibTeX

    @article{Trivino2010,
    	author = "Francisco Trivi{\~n}o and Jos{\'e} L. S{\'a}nchez and Francisco J. Alfaro and Flich, Jose",
    	abstract = "The number of cores on a single silicon chip is rapidly growing and chips containing tens or even hundreds of identical cores are expected in the future. To take advantage of multicore chips, multiple applications will run simultaneously. As a consequence, traffic interference between applications increases and the performance of individual applications can be seriously affected. In this paper, we improve the individual application performance when several applications are simultaneously running. This proposal is based on the virtualization concept and allows us to reduce execution time and network latency by a significant percentage.",
    	doi = "10.1016/j.micpro.2010.10.001",
    	issn = "0141-9331",
    	journal = "Microprocessors and Microsystems",
    	keywords = "Network-on-chip",
    	note = "In press, uncorrected proof",
    	title = "{V}irtualizing network-on-chip resources in chip-multiprocessors",
    	url = "http://www.sciencedirect.com/science/article/B6V0X-518TDT0-1/2/a0626334a6df097c5980c108d5606b62",
    	year = 2010
    }
    
  30. Samuel Rodrigo, Carles Hernández, Jose Flich, Federico Silla, Jose Duato, S Medardoni, D Bertozzi, D Dai and . Yield-oriented evaluation methodology of network-on-chip routing implementations. In System-on-Chip, 2009. SOC 2009. International Symposium on. 2009, 100–105. URL, DOI BibTeX

    @conference{5335667,
    	author = "Rodrigo, Samuel and Hern{\'a}ndez, Carles and Flich, Jose and Silla, Federico and Duato, Jose and S. Medardoni and D. Bertozzi and D. Dai and ,",
    	abstract = "Network-on-Chip technology is gaining wide popularity for the interconnection of an increasing number of processor cores on the same silicon die. However, growing process variations cause interconnect malfunction or prevent the network from working at the intended frequency, directly impacting yield and manufacturing cost. Topology agnostic routing algorithms have the potential to tolerate process variations without degrading performance. We propose a three step methodology for evaluating routing algorithms in their ability to deal with variability. Using yield enhancement and operation speed preservation as the criteria, we demonstrate how this methodology can be used to select the best design choice among several plausible combinations of routing algorithms and implementations. Also, we show how an efficient table-less routing implementation can be used to minimise the impact of variability on manufacturing and operating frequency.",
    	booktitle = "System-on-Chip, 2009. SOC 2009. International Symposium on",
    	doi = "10.1109/SOCC.2009.5335667",
    	keywords = "Si;interconnect malfunction;network-on-chip routing;processor core interconnection;silicon die;yield enhancement;yield operation;yield oriented evaluation;integrated circuit interconnections;integrated circuit yield;microprocessor chips;network-on-chip;si",
    	month = "oct.",
    	pages = "100--105",
    	title = "{Y}ield-oriented evaluation methodology of network-on-chip routing implementations",
    	url = "http://dx.doi.org/10.1109/SOCC.2009.5335667",
    	year = 2009
    }
    
  31. Samuel Rodrigo, S Medardoni, Jose Flich, D Bertozzi and Jose Duato. Efficient implementation of distributed routing algorithms for NoCs. Computers & Digital Techniques, IET 3(5):460–475, September 2009. DOI BibTeX

    @article{5200571,
    	author = "Rodrigo, Samuel and S. Medardoni and Flich, Jose and D. Bertozzi and Duato, Jose",
    	abstract = "Chip multiprocessors (CMPs) are gaining momentum in the high-performance computing domain. Networks-on-chip (NoCs) are key components of CMP architectures, in that they have to deal with the communication scalability challenge while meeting tight power, area and latency constraints. 2D mesh topologies are usually preferred by designers of general purpose NoCs. However, manufacturing faults may break their regularity. Moreover, resource management frameworks may require the segmentation of the network into irregular regions. Under these conditions, efficient routing becomes a challenge. Although the use of routing tables at switches is flexible, it does not scale in terms of latency and area due to its memory requirements. Logic-based distributed routing (LBDR) is proposed as a new routing method that removes the need for routing tables altogether. LBDR enables the implementation of many routing algorithms on most of the practical topologies we may find in the near future in a multi-core system. From an initial topology and routing algorithm, a set of three bits per switch/output port is computed. Evaluation results show that, by using a small logic, LBDR mimics the performance of routing algorithms when implemented with routing tables, both in regular and irregular topologies. LBDR implementation in a real NoC switch is also explored, proving its smooth integration in the architecture and its negligible hardware and performance overhead.",
    	doi = "10.1049/iet-cdt.2008.0092",
    	issn = "1751-8601",
    	journal = "Computers \& Digital Techniques, IET",
    	keywords = "2D mesh topologies;chip multiprocessors;communication scalability;distributed routing algorithms;high-performance computing domain;logic-based distributed routing;manufacturing faults;multicore system;network segmentation;networks-on-chip;resource managem",
    	month = "september",
    	number = 5,
    	pages = "460 -475",
    	title = "{E}fficient implementation of distributed routing algorithms for {N}o{C}s",
    	volume = 3,
    	year = 2009
    }
    
  32. Vicente Chirivella, Rosa Alcover, Jose Flich and Jose Duato. Dependability analysis of a fault-tolerant network reconfiguring strategy. In Henk Sips, Dick Epema and Hai-Xiang Lin (eds.). Euro-Par 2009 Parallel Processing, LNCS 5704. August 2009, 1040–1051. URL, DOI BibTeX

    @conference{20094612441323,
    	author = "Chirivella, Vicente and Alcover, Rosa and Flich, Jose and Duato, Jose",
    	abstract = "Fault tolerance mechanisms become indispensable as the number of processors increases in large systems. Measuring the effectiveness of such mechanisms before their implementation becomes mandatory. Research toward understanding the effects of different network parameters on the dependability parameters, like mean time to network failure or availability, becomes necessary. In this paper we analyse in detail such effects with a methodology proposed previously by us. This methodology is based on Markov chains and Analysis of Variance techniques. As a case study we analyse the effects of network size, mean time to node failure, mean time to node repair, mean time to network repair and coverage of the failure when using a 2D mesh network with a fault-tolerant mechanism (similar to the one used in the BlueGene/L system), that is able to remove rows and/or columns in the presence of failures.",
    	address = "Delft, Netherlands",
    	booktitle = "Euro-Par 2009 Parallel Processing",
    	doi = "10.1007/978-3-642-03869-3_96",
    	editor = "Henk Sips and Dick Epema and Hai-Xiang Lin",
    	isbn = "978-3-642-03869-3",
    	issn = "0302-9743",
    	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    	key = "Fault tolerant computer systems",
    	keywords = "Artificial intelligence;Bioinformatics;Fault tolerance;Markov processes;Quality assurance;Regression analysis;",
    	month = "Aug",
    	note = "BlueGene/L systems;Dependability analysis;Fault tolerance mechanisms;Fault-tolerant mechanism;Fault-tolerant networks;Large system;Markov Chain;Mesh network;Network failure;Network parameters;Network size;Node failure;",
    	pages = "1040--1051",
    	publisher = "Springer",
    	series = "Lecture Notes in Computer Science",
    	title = "{D}ependability analysis of a fault-tolerant network reconfiguring strategy",
    	url = "http://dx.doi.org/10.1007/978-3-642-03869-3_96",
    	volume = 5704,
    	year = 2009
    }
    
  33. , M Palesi, Jose Flich, S Kumar, Pedro Lopez, R Holsmark and Jose Duato. Region-Based Routing: A Mechanism to Support Efficient Routing Algorithms in NoCs. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 17(3):356–369, March 2009. URL, DOI BibTeX

    @article{4804124,
    	author = ", and M. Palesi and Flich, Jose and S. Kumar and Lopez, Pedro and R. Holsmark and Duato, Jose",
    	abstract = "An efficient routing algorithm is important for large on-chip networks [network-on-chip (NoC)] to provide the required communication performance to applications. Implementing NoC using table-based switches provides many advantages, including possibility of changing routing algorithms and fault tolerance, due to the option of table reconfigurations. However, table-based switches have been considered unsuitable for NoCs due to their perceived high area and power consumption. In this paper, we describe the region-based routing (RBR) mechanism which groups destinations into network regions allowing an efficient implementation with logic blocks. RBR can also be viewed as a mechanism to reduce the number of entries in routing tables. RBR is general and can be used in conjunction with any adaptive routing algorithm. In particular, we have evaluated the proposed scheme in conjunction with a general routing algorithm, namely segment-based routing (SR) and an application specific routing algorithm (APSRA) using regular and irregular mesh topologies. Our study shows that the number of entries in the table is significantly reduced, especially for large networks. Evaluation results show that RBR requires only four regions to support several routing algorithms in a 2-D mesh with no performance degradation. Considering link failures, our results indicate that RBR combined with SR is able to tolerate up to 7 link failures in an 8×8 mesh. RBR also reduces area and power dissipation of an equivalent table-based implementation by factors of 8 and 10, respectively. Moreover, the degradation in performance of the network is insignificant when using APSRA combined with RBR.",
    	doi = "10.1109/TVLSI.2008.2012010",
    	issn = "1063-8210",
    	journal = "Very Large Scale Integration (VLSI) Systems, IEEE Transactions on",
    	keywords = "adaptive routing algorithm;application specific routing algorithm;fault tolerance;large on-chip networks;network-on-chip;region-based routing mechanism;segment-based routing;table-based switches;network topology;network-on-chip;",
    	month = "march",
    	number = 3,
    	pages = "356--369",
    	title = "{R}egion-{B}ased {R}outing: {A} {M}echanism to {S}upport {E}fficient {R}outing {A}lgorithms in {N}o{C}s",
    	url = "http://dx.doi.org/10.1109/TVLSI.2008.2012010",
    	volume = 17,
    	year = 2009
    }
    
  34. A Martinez, P J Garcia, F J Alfaro, J L Sanchez, Jose Flich, F J Quiles and Jose Duato. A Switch Architecture Guaranteeing QoS Provision and HOL Blocking Elimination. Parallel and Distributed Systems, IEEE Transactions on 20(1):13–24, 2009. DOI BibTeX

    @article{4497190,
    	author = "A. Martinez and P.J. Garcia and F.J. Alfaro and J.L. Sanchez and Flich, Jose and F.J. Quiles and Duato, Jose",
    	abstract = "Both QoS support and congestion management techniques become essential to achieve good network performance in current high-speed interconnection networks. The most effective techniques traditionally considered for both issues, however, require too many resources to be implemented. In this paper we propose a new cost-effective switch architecture able to face the challenges of congestion management and, at the same time, to provide QoS. The efficiency of our proposal is based on using the resources (queues) used by RECN (an efficient Head-Of-Line blocking elimination technique) also for QoS support, without increasing queue requirements. The results show that the new switch architecture is able to guarantee QoS levels without any degradation due to congestion situations.",
    	doi = "10.1109/TPDS.2008.62",
    	issn = "1045-9219",
    	journal = "Parallel and Distributed Systems, IEEE Transactions on",
    	keywords = "HOL blocking elimination;QoS provision;congestion management;high-speed interconnection networks;network performance;switch architecture;quality of service;telecommunication congestion control;telecommunication network management;telecommunication switchi",
    	month = "jan.",
    	number = 1,
    	pages = "13 -24",
    	title = "{A} {S}witch {A}rchitecture {G}uaranteeing {Q}o{S} {P}rovision and {HOL} {B}locking {E}limination",
    	volume = 20,
    	year = 2009
    }
    
  35. T Skeie, F O Sem-Jacobsen, Samuel Rodrigo, Jose Flich, D Bertozzi and S Medardoni. Flexible DOR routing for virtualization of multicore chips. In System-on-Chip, 2009. SOC 2009. International Symposium on. 2009, 073 -076. DOI BibTeX

    @conference{5335673,
    	author = "T. Skeie and F.O. Sem-Jacobsen and Rodrigo, Samuel and Flich, Jose and D. Bertozzi and S. Medardoni",
    	abstract = "The expected increase in the number of cores on a single chip leads to the need for high-performance on-chip interconnects (NoC). Furthermore, in order to fully utilize the abundance of cores, the chip is expected to support a number of applications running on the chip simultaneously. It is therefore necessary to partition the chip to support numerous applications without any risk of interference between them. The success of this depends on the flexibility of the underlying routing algorithm. This paper presents a flexible routing algorithm based on dimension ordered routing, which supports a large variety of irregular (2-D and 3-D) mesh topologies. The algorithm provides high efficiency at very low additional complexity, as is confirmed by experimental results.",
    	booktitle = "System-on-Chip, 2009. SOC 2009. International Symposium on",
    	doi = "10.1109/SOCC.2009.5335673",
    	keywords = "dimension order routing;mesh topologies;multicore chips virtualization;on chip interconnects;routing algorithm;integrated circuit interconnections;network-on-chip;",
    	month = "5-7",
    	pages = "073 -076",
    	title = "{F}lexible {DOR} routing for virtualization of multicore chips",
    	year = 2009
    }
    
  36. , Jose Flich, Jose Duato, H Eberle, N Gura and W Olesinski. A performance evaluation of 2D-mesh, ring, and crossbar interconnects for chip multi-processors. In Network on Chip Architectures, 2009. NoCArc 2009. 2nd International Workshop on. 2009, 51 -56. BibTeX

    @conference{5375715,
    	author = ", and Flich, Jose and Duato, Jose and H. Eberle and N. Gura and W. Olesinski",
    	abstract = "As the number of processing nodes on chip multi-processors (CMPs) keeps increasing, providing efficient communication with the on-chip interconnect becomes increasingly critical. With 32-core CMP designs on the drawing table of engineers, there is a demand for accurate simulation models that capture all the complexities and interactions of the different design layers including the application, operating system, cache hierarchy, coherency protocol, and other on-chip resources. These components cannot be modeled anymore in isolation as unpredicted performance anomalies may arise once all the system variables are taken into account. In this paper, we present a simulation framework for CMP systems, focusing our attention on the on-chip network. We show preliminary results for the choice of key network parameters (topology, flit size) with respect to the behavior and performance of applications running on top of different network configurations. This paper tries to convey the need for an overall CMP system simulator as a way to accurately characterize the actual behavior of the on-chip network.",
    	booktitle = "Network on Chip Architectures, 2009. NoCArc 2009. 2nd International Workshop on",
    	keywords = "2D-mesh interconnects;32-core CMP designs;cache hierarchy;chip multi-processors;coherency protocol;crossbar interconnects;on-chip network;operating system;processing nodes;ring interconnects;integrated circuit design;integrated circuit interconnections;mi",
    	month = "12-12",
    	pages = "51 -56",
    	title = "{A} performance evaluation of 2{D}-mesh, ring, and crossbar interconnects for chip multi-processors",
    	year = 2009
    }
    
  37. O Lysne, J M Montañana, Jose Flich, Jose Duato, T M Pinkston and T Skeie. An Efficient and Deadlock-Free Network Reconfiguration Protocol. Computers, IEEE Transactions on 57(6):762 -779, June 2008. URL, DOI BibTeX

    @article{4459311,
    	author = "O. Lysne and Monta{\~n}ana, J. M. and Flich, Jose and Duato, Jose and T.M. Pinkston and T. Skeie",
    	abstract = {Component failures and planned component replacements cause changes in the topology and routing paths supplied by the interconnection network of a parallel processor system over time. Such changes may require the network to be reconfigured such that the existing routing function is replaced by one that enables packets to reach their intended destinations amid the changes. Efficient reconfiguration methods are desired which allow the network to function uninterruptedly over the course of the reconfiguration process while remaining free from deadlocking behavior. In this paper, we propose, evaluate, and prove the deadlock freedom of a new network reconfiguration protocol that overlaps various phases of "static" reconfiguration processes traditionally used in commercial and research systems to provide performance efficiency on par with that of recently proposed "dynamic" reconfiguration processes but without their complexity. Simulation results show that the proposed Overlapping Static Reconfiguration protocol can reduce reconfiguration time by up to 50 percent, reduce packet latency by several orders of magnitude, reduce packet dropping by an order of magnitude, and provide unhalted packet injection as compared to traditional static reconfiguration while allowing network throughput similar to dynamic reconfiguration.},
    	doi = "10.1109/TC.2008.31",
    	issn = "0018-9340",
    	journal = "Computers, IEEE Transactions on",
    	keywords = "deadlock freedom;dynamic reconfiguration processes;interconnection network;network reconfiguration protocol;overlapping static reconfiguration protocol;parallel processor system;reduce packet latency;static reconfiguration processes;multiprocessor interco",
    	month = "june",
    	number = 6,
    	pages = "762 -779",
    	title = "{A}n {E}fficient and {D}eadlock-{F}ree {N}etwork {R}econfiguration {P}rotocol",
    	url = "http://dx.doi.org/10.1109/TC.2008.31",
    	volume = 57,
    	year = 2008
    }
    
  38. J M Montañana, Jose Flich and Jose Duato. Epoch-based reconfiguration: Fast, simple, and effective dynamic network reconfiguration. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. April 2008, 1 -12. URL, DOI BibTeX

    @conference{4536298,
    	author = "Monta{\~n}ana, J. M. and Flich, Jose and Duato, Jose",
    	abstract = "Dynamic network reconfiguration is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is to avoid deadlocks and reduce packet dropping rate while keeping network service. Current approaches either require the existence of extra network resources such as virtual channels, their complexity is so high that their practical applicability is limited, or they affect the performance of the network during the reconfiguration process. In this paper we present EBR, a simple and fast method for dynamic network reconfiguration. EBR guarantees a fast and deadlock-free reconfiguration, but instead of avoiding deadlocks our mechanism is based on regressive deadlock recoveries. Thus, EBR allows cycles to be formed, and in the situation of a deadlock some packets may be dropped. However, as demonstrated, no packets need to be dropped in the working zone of the system. Also, the mechanism works in an asynchronous manner, does not require additional resources and works on any topology. In order to minimize the number of dropped packets, EBR uses an epoch marking system that guarantees that only packets potentially leading to a deadlock will be removed. Evaluation results show that EBR works efficiently in different topologies and with different routing algorithms. When compared with current proposals, EBR always gets the best numbers in all the analyzed parameters (dropped packets, latency, throughput, reconfiguration time and resources required), thus achieving the good properties of all mechanisms.",
    	booktitle = "Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on",
    	doi = "10.1109/IPDPS.2008.4536298",
    	isbn = "978-1-4244-1693-6",
    	issn = "1530-2075",
    	keywords = "deadlock-free reconfiguration;dynamic network reconfiguration;epoch-based reconfiguration;network resource;network service;packet dropping rate;regressive deadlock recovery;routing algorithm;routing function;topology;computer networks;telecommunication ne",
    	month = "april",
    	pages = "1 -12",
    	title = "{E}poch-based reconfiguration: {F}ast, simple, and effective dynamic network reconfiguration",
    	url = "http://dx.doi.org/10.1109/IPDPS.2008.4536298",
    	year = 2008
    }
    
  39. Jose Flich and Jose Duato. Logic-Based Distributed Routing for NoCs. Computer Architecture Letters 7(1):13 -16, 2008. DOI BibTeX

    @article{4407676,
    	author = "Flich, Jose and Duato, Jose",
    	abstract = "The design of scalable and reliable interconnection networks for multicore chips (NoCs) introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are usually proposed for NoCs, heterogeneous cores, manufacturing defects, hard failures, and chip virtualization may lead to irregular topologies. In this context, efficient routing becomes a challenge. Although switches can be easily configured to support most routing algorithms and topologies by using routing tables, this solution does not scale in terms of latency and area. We propose a new circuit that removes the need for using routing tables. The new mechanism, referred to as logic-based distributed routing (LBDR), enables the implementation in NoCs of many routing algorithms for most of the practical topologies we might find in the near future in a multicore chip. From an initial topology and routing algorithm, a set of three bits per switch output port is computed. By using a small logic block, LBDR mimics (demonstrated by evaluation) the behavior of routing algorithms implemented with routing tables. This result is achieved both in regular and irregular topologies. Therefore, LBDR removes the need for using routing tables for distributed routing, thus enabling flexible, fast and power-efficient routing in NoCs.",
    	doi = "10.1109/L-CA.2007.16",
    	issn = "1556-6056",
    	journal = "Computer Architecture Letters",
    	keywords = "NoC;chip virtualization;heterogeneous cores;interconnection network reliability;logic-based distributed routing;manufacturing defects;networks for multicore chips;circuit reliability;interconnections;logic circuits;network routing;network topology;network",
    	month = "january-june",
    	number = 1,
    	pages = "13 -16",
    	title = "{L}ogic-{B}ased {D}istributed {R}outing for {N}o{C}s",
    	volume = 7,
    	year = 2008
    }
    
  40. Tor Skeie, Daniel Ortega, Jose Flich and Raimir Holanda. Topic 13: High-performance networks. In Euro-Par 2008 – Parallel Processing 5168 LNCS. 2008, 898 -. URL BibTeX

    @conference{20083911589414,
    	author = "Tor Skeie and Daniel Ortega and Flich, Jose and Raimir Holanda",
    	abstract = "The communication network is the key component of every parallel and distributed system. The trend of always aiming at bigger and more complex cores has shifted towards having many simpler cores, sharing yet another complex communication layer at the chip level. Moreover, advancements on scaling out at the cluster level have pushed communication and storage networks to new limits. All these technological opportunities bring out new and exciting research challenges. © 2008 Springer-Verlag Berlin Heidelberg.",
    	address = "Las Palmas de Gran Canaria, Spain",
    	booktitle = "Euro-Par 2008 – Parallel Processing",
    	issn = "0302-9743",
    	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    	keywords = "Data storage equipment;",
    	note = "Chip-level;Communication networks;Complex communication;Complex cores;Distributed systems;High-performance networks;Parallel processing;Research challenges;Storage networks;Technological opportunities;",
    	pages = "898 -",
    	title = "{T}opic 13: {H}igh-performance networks",
    	url = "http://dx.doi.org/10.1007/978-3-540-85451-7_96",
    	volume = "5168 LNCS",
    	year = 2008
    }
    
  41. Jesus Escudero-Sahuquillo, Pedro Garcia, Francisco Quiles, Jose Flich and Jose Duato. FBICM: Efficient congestion management for high-performance networks using distributed deterministic routing. In High Performance Computing - HiPC 2008 5374 LNCS. 2008, 503 - 517. URL, DOI BibTeX

    @conference{20090511881191,
    	author = "Jesus Escudero-Sahuquillo and Pedro Garcia and Francisco Quiles and Flich, Jose and Duato, Jose",
    	abstract = "As the number of components in cluster-based systems increases, cost and power consumption also increase. One way to reduce both problems is using smaller networks with adequate congestion management mechanisms. Recent successful proposals (RECN) eliminate the negative effects of congestion, the Head-of-Line (HOL) blocking, leaving congestion harmless. RECN relies on source-based network architectures, where the entire route is placed at packet headers before injection. Unfortunately, distributed table-based routing is also common in cluster-based networks, InfiniBand being the most prominent example. We propose a novel congestion management technique for distributed table-based routing. The mechanism relies on additional congestion information located at routing tables. With this information HOL blocking is minimized by smartly using switch queues. Detailed memory organization and the way congestion information is updated/propagated is described. Preliminary results indicate that with modest resource requirements maximum network performance is kept regardless of congestion. © 2008 Springer Berlin Heidelberg.",
    	address = "Bangalore, India",
    	booktitle = "High Performance Computing - HiPC 2008",
    	doi = "10.1007/978-3-540-89894-8_44",
    	issn = "0302-9743",
    	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    	key = "Network management",
    	keywords = "Industrial management;Network performance;Parallel processing systems;Systems engineering;",
    	note = "Congestion control;Congestion management;Distributed routing;Head of line (HOL) blocking;High-performance interconnects;InfiniBand (CO);Memory organizations;Negative effects;Network performances;One way;Packet headers;Power consumption (CE);Resource requirements;",
    	pages = "503 - 517",
    	title = "{FBICM}: {E}fficient congestion management for high-performance networks using distributed deterministic routing",
    	url = "http://dx.doi.org/10.1007/978-3-540-89894-8_44",
    	volume = "5374 LNCS",
    	year = 2008
    }
    
  42. Samuel Rodrigo, Jose Flich, Jose Duato and M Hummel. Efficient unicast and multicast support for CMPs. In 2008 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41). 2008, 364 -375. URL BibTeX

    @conference{10428961,
    	author = "Rodrigo, Samuel and Flich, Jose and Duato, Jose and M. Hummel",
    	abstract = "Beyond a certain number of cores, multi-core processing chips will require a network-on-chip (NoC) to interconnect the cores and overcome the limitations of a bus. NoCs must be carefully designed to meet constraints like power consumption, area, and ultra low latencies. Although 2D meshes with DOR (dimension-order-routing) meet these constraints, the need for partitioning (e.g. virtual machines, coherency domains) and traffic isolation may prevent the use of DOR routing. Also, core heterogeneity and manufacturing and run-time faults may lead to partially irregular topologies. Routing in these topologies is complex, and previously proposed solutions required routing tables, which drastically increase power consumption, area, and latency. The exception is LBDR (logic-based distributed routing), a flexible routing method for irregular topologies that removes the need for using routing tables (both at end-nodes and switches), thus achieving large savings in chip area and power consumption. But LBDR lacks support for multicast and broadcast, which are required to efficiently support cache coherence protocols both for single and multiple coherence domains. In this paper we propose bLBDR, an efficient multicast and broadcast mechanism built on top of LBDR. bLBDR performs multicast operations using a logic-based broadcast within a domain (a region with bounds). This allows us to isolate the traffic into different domains, thus enabling the concept of virtualization at the NoC level. Also, bLBDR extends the concept of routing regions in LBDR by providing a mechanism that allows the flexible definition of multiple domains, sets of network resources. bLBDR fulfills all the practical requirements, including not only low latency and power and area efficiency, but also support for virtualization, partitionability, fault-tolerance, traffic isolation and broadcast across the entire network as well as constrained to coherency domains or regions. All this is achieved by a small and power-efficient routing logic (7$\times$ area savings and 17$\times$ power reduction when compared to a routing table in an 8$\times$8 mesh network).",
    	address = "Piscataway, NJ, USA",
    	booktitle = "2008 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41)",
    	journal = "2008 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41)",
    	keywords = "microprocessor chips;network topology;network-on-chip;power consumption;protocols;",
    	note = "CMP;chip multiprocessors;multicore processing chips;network-on-chip;power consumption;dimension-order-routing;logic-based distributed routing;routing tables;cache coherence protocols;",
    	pages = "364 -375",
    	title = "{E}fficient unicast and multicast support for {CMP}s",
    	url = "http://dx.doi.org/10.1109/MICRO.2008.4771805",
    	year = 2008
    }
    
  43. Scott Pakin, Craig Stunkel, Jose Flich, Francisco Alfaro, Gheorghe Almasi, Angelos Bilas, Ron Brightwell, Darius Buntinas, Wu-Chun Feng, Mitchell Gusat, Nectarios Koziris, Pedro Lopez, Andrew Lumsdaine, Jarek Nieplocha, Greg Pfister, Jamie Riotto, Vikram Saletore, Evan Speight, Pete Wyckoff, D K Panda, Jose Duato and Mazin Yousif. Workshop 9 Introduction: The Workshop on Communication Architecture for Clusters - CAC 2008. IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM. IEEE Computer Society, 2008. URL BibTeX

    @article{20083711535136,
    	author = "Scott Pakin and Craig Stunkel and Flich, Jose and Francisco Alfaro and Gheorghe Almasi and Angelos Bilas and Ron Brightwell and Darius Buntinas and Wu-Chun Feng and Mitchell Gusat and Nectarios Koziris and Lopez, Pedro and Andrew Lumsdaine and Jarek Nieplocha and Greg Pfister and Jamie Riotto and Vikram Saletore and Evan Speight and Pete Wyckoff and D.K. Panda and Duato, Jose and Mazin Yousif",
    	abstract = "No abstract available",
    	address = "Miami, FL, United States",
    	journal = "IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM",
    	publisher = "IEEE Computer Society",
    	title = "{W}orkshop 9 {I}ntroduction: {T}he {W}orkshop on {C}ommunication {A}rchitecture for {C}lusters - {CAC} 2008",
    	url = "http://dx.doi.org/10.1109/IPDPS.2008.4536118",
    	year = 2008
    }
    
  44. , Jose Flich and Jose Duato. On the Potentials of Segment-Based Routing for NoCs. In Parallel Processing, 2008. ICPP '08. 37th International Conference on. 2008, 594 -603. URL, DOI BibTeX

    @conference{4625898,
    	author = ", and Flich, Jose and Duato, Jose",
    	abstract = "The topology, the routing algorithm and the way the traffic pattern is distributed over the network influence the ultimate performance of the interconnection network. Off-chip high-performance interconnects provide mechanisms to support irregular topologies, whereas in on-chip networks the topology is fixed at design time. The continuing trend toward device miniaturization and high-volume manufacturing increases the probability of faults in embedded systems, leading to irregular topologies. Also, partitionability and virtualization of the entire on-chip network is envisioned for future systems. These trends lead to the need for routing algorithms that adapt to the static or dynamic changes in irregular topologies. In this paper we analyze the benefits of the reconfiguration at the routing algorithm level in order to allow topology changes. That is, support topology changes that appear on the network due to different reasons including switch or link failures, energy reduction decisions or design and manufacturing issues. We perform an exhaustive analysis of the performance impact of the routing algorithm in a NoC system. Our aim is to enable the possibility of reconfiguration of the routing algorithm. We take advantage of the flexibility offered by the segment-based routing methodology that allows a fast computation of many deadlock-free routing algorithms by obtaining different segmentation processes and routing restriction policies. This study analyzes the potential offered by SR. Results show that the choice of the routing algorithm may greatly affect the final performance of the network. Additionally, we propose an organized segmentation process that achieves reliable performance with low variability for all topologies studied under uniform traffic conditions. These results encourage us to search for a dynamic mechanism that adapts the routing algorithm to the traffic.",
    	booktitle = "Parallel Processing, 2008. ICPP '08. 37th International Conference on",
    	doi = "10.1109/ICPP.2008.56",
    	issn = "0190-3918",
    	keywords = "NoC;deadlock-free routing algorithms;embedded systems;interconnection network;off-chip high-performance interconnects;routing algorithm;segment-based routing;segment-based routing methodology;traffic pattern;uniform traffic conditions;interconnections;net",
    	month = "9-12",
    	pages = "594 -603",
    	title = "{O}n the {P}otentials of {S}egment-{B}ased {R}outing for {N}o{C}s",
    	url = "http://dx.doi.org/10.1109/ICPP.2008.56",
    	year = 2008
    }
    
  45. Jose Flich, Samuel Rodrigo, Jose Duato, T Sodring, A G Solheim, T Skeie and O Lysne. On the Potential of NoC Virtualization for Multicore Chips. In Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. International Conference on. 2008, 801 -807. DOI BibTeX

    @conference{4606771,
    	author = "Flich, Jose and Rodrigo, Samuel and Duato, Jose and T. Sodring and A.G. Solheim and T. Skeie and O. Lysne",
    	abstract = "As the end of Moore's law is on the horizon, power becomes a limiting factor to continuous increases in performance gains for single-core processors. Processor engineers have shifted to the multicore paradigm and many-core processors are a reality. Within the context of these multi-core chips, three key metrics stand out as being of major importance: performance, fault-tolerance (including yield), and power consumption. A solution that optimizes all three of these metrics is challenging. As the number of cores increases the importance of the interconnection network-on-chip (NoC) grows as well, and chip designers should aim to optimize these three key metrics in the NoC context as well. In this paper we identify and discuss the main properties that a NoC must exhibit in order to enable such optimizations. In particular, we propose the use of virtualization techniques at the NoC level. As a major finding, we identify the implementation of routing algorithms as a key design parameter for achieving an effective virtualization of the chip while also supporting broadcast within the virtualized context. The intention behind this paper is for it to serve as a position paper on the topic of virtualization for NoC and the challenges that should be met at the routing layer in order to maximize performance, fault-tolerance and power consumption in multicore chips.",
    	booktitle = "Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. International Conference on",
    	doi = "10.1109/CISIS.2008.97",
    	keywords = "Moores-law;NoC virtualization;interconnection network-on-chip;many-core processors;multicore chips;routing algorithms;single-core processors;microprocessor chips;multiprocessor interconnection networks;network-on-chip;",
    	month = "4-7",
    	pages = "801 -807",
    	title = "{O}n the {P}otential of {N}o{C} {V}irtualization for {M}ulticore {C}hips",
    	year = 2008
    }
    
  46. H Eberle, P J Garcia, Jose Flich, Jose Duato, R Drost, N Gura, D Hopkins and W Olesinski. High-radix crossbar switches enabled by Proximity Communication. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for. 2008, 1 -12. DOI BibTeX

    @conference{5219754,
    	author = "H. Eberle and P.J. Garcia and Flich, Jose and Duato, Jose and R. Drost and N. Gura and D. Hopkins and W. Olesinski",
    	abstract = "We describe a novel way to implement high-radix crossbar switches. Our work is enabled by a new chip interconnect technology called proximity communication (PxC) that offers unparalleled chip IO density. First, we show how a crossbar architecture is topologically mapped onto a PxC-enabled multi-chip module (MCM). Then, we describe a first prototype implementation of a small-scale switch based on a PxC MCM. Finally, we present a performance analysis of two large-scale switch configurations with 288 ports and 1,728 ports, respectively, contrasting a 1-stage PxC-enabled switch and a multi-stage switch using conventional technology. Our simulation results show that (a) arbitration delays in a large 1-stage switch can be considerable, (b) multi-stage switches are extremely susceptible to saturation under non-uniform traffic, a problem that becomes worse for higher radices (1-stage switches, in contrast, are not affected by this problem).",
    	booktitle = "High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for",
    	doi = "10.1109/SC.2008.5219754",
    	keywords = "PxC-enabled switch;chip interconnect technology;crossbar architecture;high-radix crossbar switches;multichip module;multistage switch;proximity communication;small-scale switch;unparalleled chip IO density;multichip modules;multiprocessor interconnection",
    	month = "15-21",
    	pages = "1 -12",
    	title = "{H}igh-radix crossbar switches enabled by {P}roximity {C}ommunication",
    	year = 2008
    }
    
  47. R Tornero, J M Orduña, , Jose Flich and Jose Duato. CART: Communication-Aware Routing Technique for Application-Specific NoCs. In Digital System Design Architectures, Methods and Tools, 2008. DSD '08. 11th EUROMICRO Conference on. 2008, 26 -31. URL, DOI BibTeX

    @conference{4669215,
    	author = "R. Tornero and J.M. Ordu{\~n}a and , and Flich, Jose and Duato, Jose",
    	abstract = "Networks on Chip (NoCs) have been shown to be an efficient solution to the complex on-chip communication problems derived from the increasing number of processor cores. One of the key issues in the design of NoCs is the reduction of both area and power dissipation. As a result, two-dimensional meshes have become the preferred topology, since they offer low and constant link delay. Unfortunately, manufacturing defects or even real-time failures often make the resulting topology irregular, preventing the use of traditional routing algorithms. This scenario shows the need for topology-agnostic routing algorithms that provide a valid routing solution when applied over any topology. Moreover, in order to deal with run-time failures, the routing algorithm should be able to fit runtime constraints. This paper proposes a new communication-aware routing technique, referred to as CART, that optimizes the network performance for application-specific NoCs. CART combines a flexible, topology-agnostic routing algorithm with a communication-aware mapping technique that matches the traffic generated by the application with the available network bandwidth. Since the mapping technique can be pruned as needed in order to fit either quality function values or time constraints, CART can be adapted to fit with different computational costs. The evaluation results show that CART significantly improves network performance in terms of both latency and power consumption.",
    	booktitle = "Digital System Design Architectures, Methods and Tools, 2008. DSD '08. 11th EUROMICRO Conference on",
    	doi = "10.1109/DSD.2008.19",
    	isbn = "978-0-7695-3277-6",
    	keywords = "CART;application-specific NoC;communication-aware mapping technique;communication-aware routing technique;complex on-chip communication problems;network-on-chip;power dissipation;topology-agnostic routing algorithms;two-dimensional meshes;network routing;",
    	month = "3-5",
    	pages = "26 -31",
    	title = "{CART}: {C}ommunication-{A}ware {R}outing {T}echnique for {A}pplication-{S}pecific {N}o{C}s",
    	url = "http://dx.doi.org/10.1109/DSD.2008.19",
    	year = 2008
    }
    
  48. Jose Flich, Samuel Rodrigo and Jose Duato. An Efficient Implementation of Distributed Routing Algorithms for NoCs. In Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on. 2008, 87 -96. DOI BibTeX

    @conference{4492728,
    	author = "Flich, Jose and Rodrigo, Samuel and Duato, Jose",
    	abstract = "The design of NoCs for multi-core chips introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are preferred, heterogeneous blocks, fabrication faults, reliability issues, and chip virtualization may lead to the need of irregular topologies or regions. In this situation, efficient routing becomes a challenge. Although the use of routing tables at switches is flexible, it does not scale in terms of latency and area due to its memory requirements. LBDR (logic-based distributed routing) is proposed as a new routing method that removes the need of using routing tables at all. LBDR enables the implementation of many routing algorithms on most of the practical topologies we might find in the near future in a multi-core system. From an initial topology and routing algorithm, a set of three bits per switch/output port is computed. Evaluation results show that, by using a small logic, LBDR mimics the performance of routing algorithms when implemented with routing tables, both in regular and irregular topologies.",
    	booktitle = "Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on",
    	doi = "10.1109/NOCS.2008.4492728",
    	keywords = "NoC;distributed routing algorithm;logic-based distributed routing;multicore chip;network-on-chip;routing tables;network routing;network-on-chip;",
    	month = "7-10",
    	pages = "87--96",
    	title = "{A}n {E}fficient {I}mplementation of {D}istributed {R}outing {A}lgorithms for {N}o{C}s",
    	year = 2008
    }
    
  49. , Jose Flich, Jose Duato, Sven-Arne Reinemo and Tor Skeie. Boosting Ethernet Performance by Segment-Based Routing. In Parallel, Distributed and Network-Based Processing, 2007. PDP '07. 15th EUROMICRO International Conference on. 2007, 55-62.

    @conference{4135259,
    	author = ", and Flich, Jose and Duato, Jose and Sven-Arne Reinemo and Tor Skeie",
    	abstract = "Ethernet is turning out to be a cost-effective solution for building cluster networks offering compatibility, simplicity, high bandwidth, scalability and a good performance-to-cost ratio. Nevertheless, Ethernet still makes inefficient use of network resources (links) and suffers from long failure recovery time due to the lack of a suitable routing algorithm. In this paper we embed an efficient routing algorithm into 802.3 Ethernet technology, making it possible to use off-the-shelf equipment to build high-performance and cost-effective Ethernet clusters, with an efficient use of link bandwidth and with fault tolerant capabilities. The algorithm, referred to as segment-based routing (SR), is a deterministic routing algorithm that achieves high performance without the need for virtual channels (not available in Ethernet). Moreover, SR is topology agnostic, meaning it can be applied to any topology, and tolerates any combination of faults derived from the original topology when combined with static reconfiguration. Through simulations we verify an overall improvement in throughput by a factor of 1.2 to 10.0 when compared to the conventional Ethernet routing algorithm, the spanning tree protocol (STP), and other topology agnostic routing algorithms such as Up*/Down* and tree-based turn-prohibition, the last one being recently proposed for Ethernet",
    	booktitle = "Parallel, Distributed and Network-Based Processing, 2007. PDP '07. 15th EUROMICRO International Conference on",
    	doi = "10.1109/PDP.2007.28",
    	issn = "1066-6192",
    	keywords = "Ethernet technology;Ethernet clusters;cluster networks;fault tolerant capability;off-the-shelf equipment;routing algorithm;segment-based routing;spanning tree protocol;static reconfiguration;topology agnostic routing algorithms;tree-based turn-prohi",
    	month = "feb.",
    	pages = "55--62",
    	title = "{B}oosting {E}thernet {P}erformance by {S}egment-{B}ased {R}outing",
    	url = "http://dx.doi.org/10.1109/PDP.2007.28",
    	year = 2007
    }
    
  50. A Martinez-Vicente, P J Garcia, F J Alfaro, J L Sanchez, Jose Flich, F J Quiles and Jose Duato. Integrated QoS provision and congestion management for interconnection networks. In Euro-Par 2007. Parallel Processing. Proceedings 13th International Euro-Par Conference. LNCS 4641. 2007, 837-847.

    @conference{9689023,
    	author = "A. Martinez-Vicente and P.J. Garcia and F.J. Alfaro and J.L. Sanchez and Flich, Jose and F.J. Quiles and Duato, Jose",
    	abstract = "Both QoS support and congestion management techniques have become essential for achieving good performance in current highspeed interconnection networks. However, traditional techniques proposed for both issues require too many resources for being implemented. In this paper we propose a new switch architecture that efficiently uses the same resources to offer both congestion management and QoS provision. It is as effective as previous proposals, but much more cost-effective.",
    	address = "Berlin, Germany",
    	booktitle = "Euro-Par 2007. Parallel Processing. Proceedings 13th International Euro-Par Conference.",
    	journal = "Euro-Par 2007. Parallel Processing. Proceedings 13th International Euro-Par Conference. (Lecture Notes in Computer Science vol. 4641)",
    	keywords = "computer network management;multistage interconnection networks;quality of service;queueing theory;telecommunication congestion control;",
    	note = "switch architecture;interconnection network;quality of service;QoS support;congestion management technique;",
    	pages = "837--847",
    	title = "{I}ntegrated {Q}o{S} provision and congestion management for interconnection networks",
    	volume = "LNCS 4641",
    	year = 2007
    }
    
  51. Jose Flich, , Pedro Lopez and Jose Duato. Region-Based Routing: An Efficient Routing Mechanism to Tackle Unreliable Hardware in Network on Chips. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. 2007, 183-194.

    @conference{4209007,
    	author = "Flich, Jose and , and Lopez, Pedro and Duato, Jose",
    	abstract = "The design of scalable and reliable interconnection networks for system on chips (SoCs) introduce new design constraints not present in current multicomputer systems. Although regular topologies are preferred for building NoCs, heterogeneous blocks, fabrication faults and reliability issues derived from the high integration scale may lead to irregular topologies. In this situation, efficient routing becomes a challenge. Although table-based routing allows the use of most routing algorithms on any topology, it does not scale in terms of latency and area. In this paper we propose the region-based routing mechanism that avoids the scalability problems of table-based solutions. From an initial topology and routing algorithm, the mechanism groups, at every switch, destinations into different regions based on the output ports. By doing this, redundant routing information typically found in routing tables is eliminated. Evaluation results show that the mechanism requires only four regions to support several routing algorithms in a 2D mesh with no performance degradation. Moreover, when dealing with link failures, our results indicate that the mechanism combined with the segment-based routing algorithm is able to pack all the routing information into eight regions providing high throughput. The paper provides also a simple and efficient hardware implementation of the mechanism requiring only 240 logic gates per switch to support eight regions in a 2D mesh topology",
    	booktitle = "Networks-on-Chip, 2007. NOCS 2007. First International Symposium on",
    	doi = "10.1109/NOCS.2007.39",
    	keywords = "2D mesh topology;interconnection networks;multicomputer systems;network on chips;region-based routing;segment-based routing algorithm;system on chips;table-based routing;integrated circuit interconnections;logic design;microprocessor chips;network routing",
    	month = "may",
    	pages = "183--194",
    	title = "{R}egion-{B}ased {R}outing: {A}n {E}fficient {R}outing {M}echanism to {T}ackle {U}nreliable {H}ardware in {N}etwork on {C}hips",
    	url = "http://dx.doi.org/10.1109/NOCS.2007.39",
    	year = 2007
    }
    
  52. Gaspar Mora, P J Garcia, Jose Flich and Jose Duato. RECN-IQ: A Cost-Effective Input-Queued Switch Architecture with Congestion Management. In Parallel Processing, 2007. ICPP 2007. International Conference on. 2007, 74.

    @conference{4343881,
    	author = "Mora, Gaspar and P.J. Garcia and Flich, Jose and Duato, Jose",
    	abstract = "As the number of computing and storage nodes keeps increasing, the interconnection network is becoming a key element of many computing and communication systems, where the overall performance directly depends on network performance. This performance may dramatically drop during congestion situations. Although congestion may be avoided by overdimensioning the network, the current trend is to reduce overall cost and power consumption by reducing the number of network components. Thus, the network will be prone to congestion, thereby becoming mandatory the use of congestion management techniques. In that sense, the technique known as Regional Explicit Congestion Notification (RECN) completely eliminates the Head-of-Line (HOL) blocking produced by congested packets, turning congestion harmless. However, RECN has been designed for switches with queues at input and output ports (CIOQ switches), thus it cannot be directly applied to other types of switches. Additionally, the method RECN uses for detecting congestion requires several detection queues that increase the memory requirements and thus switch cost. Thus, we completely redefine the RECN mechanism in order to achieve different goals. First, we adapt RECN to a switch organization with queues only at input ports (IQ switches). These switches are simpler and cheaper to produce than CIOQ ones. Second, we propose a new method for detecting congestion that does not require several detection queues, thereby reducing RECN memory requirements. These improvements lead to achieve a cost-effective switch organization that derive maximum performance even in the presence of congestion. Also, we present in detail a realistic switch architecture supporting the new mechanism. Results demonstrate that the new RECN version in an IQ switch achieves maximum network performance in all the analyzed situations. These results have been obtained with a reduction factor of 5 in data memory requirements with respect to the previous RECN mechanism in CIOQ switches.",
    	booktitle = "Parallel Processing, 2007. ICPP 2007. International Conference on",
    	doi = "10.1109/ICPP.2007.71",
    	issn = "0190-3918",
    	keywords = "RECN-IQ memory requirement;cost-effective input-queued switch architecture;head-of-line blocking;interconnection network;packet congestion management technique;power consumption;regional explicit congestion notification;computer architecture;multiprocesso",
    	month = "sept.",
    	pages = "74",
    	title = "{RECN}-{IQ}: {A} {C}ost-{E}ffective {I}nput-{Q}ueued {S}witch {A}rchitecture with {C}ongestion {M}anagement",
    	url = "http://dx.doi.org/10.1109/ICPP.2007.71",
    	year = 2007
    }
    
  53. P J Garcia, Jose Flich, Jose Duato, I Johnson, F J Quiles and F Naven. Decongestants for clogged networks. IEEE Potentials 26(6):36-41, 2007.

    @article{9732590,
    	author = "P.J. Garcia and Flich, Jose and Duato, Jose and I. Johnson and F.J. Quiles and F. Naven",
    	abstract = {Interconnection networks are a key element in a wide variety of systems: massive parallel processors, local and system area networks, clusters of PCs and workstations, and Internet Protocol routers. They are essential to high performance in the form of high-bandwidth communications, with low latency, "quality of service" (guaranteed service levels), efficient switching, and flexibility of network topology, as embodied in Myrinet, InfiniBand, Quadrics, Advanced Switching, and similar interconnects. But, despite all the advances that modern interconnects offer, congestion is a growing problem as "lossless" interconnection networks (those that do not allow data packets to be discarded) come to the fore.},
    	address = "USA",
    	issn = "0278-6648",
    	journal = "IEEE Potentials",
    	keywords = "multistage interconnection networks;quality of service;",
    	note = "decongestant;clogged network;interconnection network;massive parallel processor;quality of service;network topology;Internet protocol router;",
    	number = 6,
    	pages = "36--41",
    	title = "{D}econgestants for clogged networks",
    	volume = 26,
    	year = 2007
    }
    
  54. P J Garcia, F J Quiles, Jose Flich, Jose Duato, I Johnson and F Naven. Efficient, Scalable Congestion Management for Interconnection Networks. IEEE Micro 26(5):52-66, 2006.

    @article{1709823,
    	author = "P.J. Garcia and F.J. Quiles and Flich, Jose and Duato, Jose and I. Johnson and F. Naven",
    	abstract = "Compared to the overdimensioned designs of the past, current interconnection networks operate closer to the point of saturation and run a higher risk of congestion. Among proposed strategies for congestion management, only the regional explicit congestion notification (RECN) mechanism achieves both the required efficiency and the scalability that emerging systems demand",
    	doi = "10.1109/MM.2006.88",
    	issn = "0272-1732",
    	journal = "IEEE Micro",
    	keywords = "RECN mechanism;interconnection networks;regional explicit congestion notification;scalable congestion management;multiprocessor interconnection networks;",
    	month = "sept.-oct.",
    	number = 5,
    	pages = "52--66",
    	title = "{E}fficient, {S}calable {C}ongestion {M}anagement for {I}nterconnection {N}etworks",
    	volume = 26,
    	year = 2006
    }
    
  55. Maria E Gomez, N A Nordbotten, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, T Skeie and O Lysne. A routing methodology for achieving fault tolerance in direct networks. IEEE Transactions on Computers 55(4):400-415, April 2006.

    @article{1608003,
    	author = "Gomez, Maria E. and N.A. Nordbotten and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and T. Skeie and O. Lysne",
    	abstract = "Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.",
    	doi = "10.1109/TC.2006.46",
    	issn = "0018-9340",
    	journal = "IEEE Transactions on Computers",
    	keywords = "adaptive routing; checkpoint-restart mechanism; direct networks; fault-tolerant routing methodology; interconnection network; parallel computing system; fault tolerant computing; multiprocessor interconnection networks; network routing; parallel processi",
    	month = "april",
    	number = 4,
    	pages = "400--415",
    	title = "{A} routing methodology for achieving fault tolerance in direct networks",
    	url = "http://dx.doi.org/10.1109/TC.2006.46",
    	volume = 55,
    	year = 2006
    }
    
  56. , Jose Flich, Jose Duato, S-A Reinemo and T Skeie. Segment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. April 2006, 10 pp.

    @conference{1639341,
    	author = ", and Flich, Jose and Duato, Jose and S.-A. Reinemo and T. Skeie",
    	abstract = "Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either, massively parallel computers, or clusters of computers. Common for both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows so does the probability for failure, thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as segment-based routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing",
    	booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
    	doi = "10.1109/IPDPS.2006.1639341",
    	keywords = "deterministic routing;fault-tolerant routing;interconnection networks;meshes;segment-based routing;tori;fault tolerant computing;multiprocessor interconnection networks;telecommunication network routing;telecommunication network topology;",
    	month = "april",
    	pages = "10 pp.",
    	title = "{S}egment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori",
    	url = "http://dx.doi.org/10.1109/IPDPS.2006.1639341",
    	year = 2006
    }
    
  57. Gaspar Mora, Jose Flich, Jose Duato, Pedro Lopez, Elvira Baydal and O Lysne. Towards an efficient switch architecture for high-radix switches. In ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2006). 2006, 11-20.

    @conference{10091275,
    	author = "Mora, Gaspar and Flich, Jose and Duato, Jose and Lopez, Pedro and Baydal, Elvira and O. Lysne",
    	abstract = "The interconnection network plays a key role in the overall performance achieved by high performance computing systems, also contributing an increasing fraction of its cost and power consumption. Current trends in interconnection network technology suggest that high-radix switches will be preferred as networks will become smaller (in terms of switch count) with the associated savings in packet latency, cost, and power consumption. Unfortunately, current switch architectures have scalability problems that prevent them from being effective when implemented with a high number of ports. In this paper, an efficient and cost-effective architecture for high-radix switches is proposed. The architecture, referred to as partitioned crossbar input queued (PCIQ), relies on three key components: a partitioned crossbar organization that allows the use of simple arbiters and crossbars, a packet-based arbiter, and a mechanism to eliminate the switch-level HOL blocking. Under uniform traffic, maximum switch efficiency is achieved. Furthermore, switch-level HOL blocking is completely eliminated under hot-spot traffic, again delivering maximum throughput. Additionally, PCIQ inherently implements an efficient congestion management technique that eliminates all the network-wide HOL blocking. On the contrary, the previously proposed architectures either show poor performance or they require significantly higher costs than PCIQ (in both components and complexity).",
    	address = "Piscataway, NJ, USA",
    	doi = "10.1109/ANCS.2006.4579519",
    	booktitle = "ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS 2006)",
    	keywords = "multistage interconnection networks;",
    	note = "high-radix switch architecture;interconnection network;power consumption;partitioned crossbar input queued;switch-level head-of-line block elimination;congestion management technique;",
    	pages = "11--20",
    	title = "{T}owards an efficient switch architecture for high-radix switches",
    	url = "http://dx.doi.org/10.1109/ANCS.2006.4579519",
    	year = 2006
    }
    
  58. , Jose Flich, Jose Duato, S-A Reinemo and T Skeie. Segment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. 2006, 10 pp.

    @conference{8969869,
    	author = ", and Flich, Jose and Duato, Jose and S.-A. Reinemo and T. Skeie",
    	abstract = "Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either, massively parallel computers, or clusters of computers. Common for both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows so does the probability for failure, thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as segment-based routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing",
    	address = "Piscataway, NJ, USA",
    	booktitle = "Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International",
    	doi = "10.1109/IPDPS.2006.1639341",
    	journal = "Proceedings. 20th International Parallel and Distributed Processing Symposium (IEEE Cat. No.06TH8860)",
    	keywords = "fault tolerant computing;multiprocessor interconnection networks;telecommunication network routing;telecommunication network topology;",
    	note = "segment-based routing;fault-tolerant routing;meshes;tori;interconnection networks;deterministic routing;",
    	pages = "10 pp.",
    	title = "{S}egment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori",
    	url = "http://dx.doi.org/10.1109/IPDPS.2006.1639341",
    	year = 2006
    }
    
  59. Teresa Nachiondo, Jose Flich and Jose Duato. Destination-based HoL blocking elimination. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on. Vol. 1. 2006, 10 pp.

    @conference{9077844,
    	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose",
    	abstract = "In future interconnection networks, congestion management is likely to become a critical issue owing to increasing power consumption and cost concerns. As congested packets introduce head-of-line (HoL) blocking to the rest of packets, congestion spreads quickly. The best-known solution to HoL blocking, virtual output queues (VOQs), is not scalable at all or too costly when implemented in large networks. In previous works, we proposed an efficient and cost-effective solution, referred to as destination-based buffer management (DBBM). DBBM groups destinations into different sets, and packets addressed to destinations in the same set are mapped to the same queue. DBBM eliminates most of the HoL blocking (among packets addressed to different sets). It achieves very good results in terms of scalability, throughput, and robustness. However, depending on the distribution of packet destinations, it may introduce an uncertain degree of unfairness among packets mapped on the same queue. In order to overcome this problem, we propose the dynamic DBBM mechanism (DDBBM). DDBBM dynamically eliminates completely the HoL blocking. Performance results show that DDBBM keeps (and in some cases improves) the good results achieved by DBBM in terms of throughput and scalability. Moreover, DDBBM solves the unfairness introduced by DBBM. As an example of applicability, in this paper we show that DDBBM can be applied to InfiniBand with no hardware modification",
    	booktitle = "Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on",
    	doi = "10.1109/ICPADS.2006.34",
    	isbn = "0-7695-2612-8",
    	issn = "1521-9097",
    	journal = "12th International Conference on Parallel and Distributed Systems",
    	keywords = "buffer storage;computer network management;packet switching;queueing theory;telecommunication congestion control;",
    	note = "destination-based HoL blocking elimination;interconnection network;network congestion management;head-of-line blocking;virtual output queues;dynamic destination-based buffer management;packet destination distribution;InfiniBand;",
    	pages = "10 pp.",
    	title = "{D}estination-based {H}o{L} blocking elimination",
    	url = "http://dx.doi.org/10.1109/ICPADS.2006.34",
    	volume = 1,
    	year = 2006
    }
    
  60. Maria E Gomez, N A Nordbotten, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, T Skeie and O Lysne. A routing methodology for achieving fault tolerance in direct networks. IEEE Transactions on Computers 55(4):400-415, 2006.

    @article{8935111,
    	author = "Gomez, Maria E. and N.A. Nordbotten and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and T. Skeie and O. Lysne",
    	abstract = "Massively parallel computing systems are being built with thousands of nodes. The interconnection network plays a key role for the performance of such systems. However, the high number of components significantly increases the probability of failure. Additionally, failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism to keep the system running, even in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to avoid faults, for some source-destination pairs, packets are first sent to an intermediate node and then from this node to the destination node. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. However, there are scenarios where the faults cannot be avoided solely by using an intermediate node. Thus, we also provide some extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis. We also propose the use of more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance",
    	address = "USA",
    	doi = "10.1109/TC.2006.46",
    	issn = "0018-9340",
    	journal = "IEEE Transactions on Computers",
    	keywords = "fault tolerant computing;multiprocessor interconnection networks;network routing;parallel processing;",
    	note = "direct networks;parallel computing system;interconnection network;fault-tolerant routing methodology;adaptive routing;checkpoint-restart mechanism;",
    	number = 4,
    	pages = "400--415",
    	title = "{A} routing methodology for achieving fault tolerance in direct networks",
    	url = "http://dx.doi.org/10.1109/TC.2006.46",
    	volume = 55,
    	year = 2006
    }
    
  61. J M Montañana, Jose Flich, Antonio Robles and Jose Duato. Reachability-based fault-tolerant routing. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on. Vol. 1. 2006, 10 pp.

    @conference{1655699,
    	author = "Monta{\~n}ana, J. M. and Flich, Jose and Robles, Antonio and Duato, Jose",
    	abstract = "Clusters of PCs are being used as cost-effective alternative to large parallel computers. In most of them it is critical to keep the system running even in the presence of faults. As the number of nodes increases in these systems, the interconnection network grows accordingly. Along with the increase in components the probability of faults increases dramatically, and thus, fault-tolerance in the system, in general, and in the interconnection network, in particular, plays a key role. An interesting approach to provide fault-tolerance consists of migrating on fly the paths affected by the failure to new fault-free paths. In this paper, we propose a simple and effective fault-tolerant routing methodology, referred to as reachability based fault tolerant routing (RFTR), that can be applied to any topology. RFTR builds new alternative paths by joining subpaths extracted from the set of already computed paths, thus being time-efficient. In order to avoid deadlocks, RFTR performs, if required, a virtual channel transition on the subpath union. As an example of applicability, in this paper we apply RFTR to InfiniBand. Evaluation results on tori show that RFTR exhibits a low computation cost and does not degrade performance significantly",
    	booktitle = "Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on",
    	doi = "10.1109/ICPADS.2006.89",
    	isbn = "0-7695-2612-8",
    	issn = "1521-9097",
    	keywords = "PC clusters;interconnection network;parallel computers;reachability-based fault-tolerant routing;virtual channel transition;fault tolerant computing;reachability analysis;telecommunication network routing;workstation clusters;",
    	pages = "10 pp.",
    	title = "{R}eachability-based fault-tolerant routing",
    	url = "http://dx.doi.org/10.1109/ICPADS.2006.89",
    	volume = 1,
    	year = 2006
    }
    
  62. A Martinez, P J Garcia, F J Alfaro, J L Sanchez, Jose Flich, F J Quiles and Jose Duato. Towards a cost-effective interconnection network architecture with QoS and congestion management support. In Euro-Par 2006 Parallel Processing. 12th International Euro-Par Conference. Proceedings. LNCS 4128. 2006, 884-895.

    @conference{9112994,
    	author = "A. Martinez and P.J. Garcia and F.J. Alfaro and J.L. Sanchez and Flich, Jose and F.J. Quiles and Duato, Jose",
    	abstract = "Congestion management and quality of service (QoS) provision are two important issues in current network design. The most popular techniques proposed for both issues require the existence of specific resources in the interconnection network, usually a high number of separate queues at switch ports. Therefore, the implementation of these techniques is expensive or even infeasible. However, two novel, efficient, and cost-effective techniques for provision of QoS and for congestion management have been proposed recently. In this paper, we combine those techniques to build a single interconnection network architecture, providing an excellent performance while reducing the number of required resources",
    	address = "Berlin, Germany",
    	booktitle = "Euro-Par 2006 Parallel Processing. 12th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 4128)",
    	keywords = "interconnections;quality of service;telecommunication congestion control;",
    	note = "cost-effective interconnection network;quality of service;congestion management;switch port;",
    	pages = "884 - 895",
    	title = "{T}owards a cost-effective interconnection network architecture with {Q}o{S} and congestion management support",
    	year = 2006
    }
    
  63. P J Garcia, F J Quiles, Jose Flich, Jose Duato and I Johnson. RECN-DD: A Memory-Efficient Congestion Management Technique for Advanced Switching. In Parallel Processing, 2006. ICPP 2006. International Conference on. 2006, 23 - 32. DOI BibTeX

    @conference{1690602,
    	author = "P.J. Garcia and F.J. Quiles and Flich, Jose and Duato, Jose and I. Johnson",
    	abstract = "As VLSI technology advances, the interconnection network represents a larger percentage of the total system cost and power consumption. In fact, a current trend in network design is to reduce the number of components. However, this leads to systems working closer to saturation point, and therefore an efficient congestion management technique is required. In that sense, RECN has been recently proposed for advanced switching (AS). RECN detects the formation of congestion trees and dynamically allocates queues for storing congested packets, thus eliminating the HOL blocking introduced by congestion trees. These queues are deallocated when congestion vanishes. We have identified two shortcomings that may affect RECN scalability and implementation. Firstly, although RECN allocates queues in an efficient way, resource deallocation is performed in-order, thus losing efficiency and wasting resources. This leads to an excessive requirement of memory at switch ports. Secondly, both allocation and deallocation mechanisms involve the use of specific control packets not supported by the AS standard, thus preventing RECN implementation. In this sense, we provide a detailed description of the current RECN deallocation mechanism. In this paper, we present an enhanced RECN version (RECN-DD) where these problems have been eliminated. Specifically, we propose a new distributed queue deallocation mechanism that reduces the number of required resources and does not require the use of control packets. Moreover, we propose a new congestion notification mechanism that does not require non-standard AS packets. Instead, flow control packets are used to notify congestion, thus simplifying the implementation of RECN-DD in AS",
    	booktitle = "Parallel Processing, 2006. ICPP 2006. International Conference on",
    	doi = "10.1109/ICPP.2006.62",
    	issn = "0190-3918",
    	keywords = "distributed queue deallocation;flow control packet;memory-efficient congestion management;regional explicit congestion notification;resource deallocation;multiprocessor interconnection networks;packet switching;queueing theory;telecommunication congestion",
    	month = "Aug",
    	pages = "23 - 32",
    	title = "{RECN}-{DD}: {A} {M}emory-{E}fficient {C}ongestion {M}anagement {T}echnique for {A}dvanced {S}witching",
    	year = 2006
    }
    
  64. Teresa Nachiondo, Jose Flich, Jose Duato and M Gusat. Cost/performance trade-offs and fairness evaluation of queue mapping policies. In José C Cunha and Pedro D Medeiros (eds.). Euro-Par 2005 Parallel Processing, Lecture Notes in Computer Science 3648. August 2005, 1024 - 1034. URL, DOI BibTeX

    @conference{8746125,
    	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose and M. Gusat",
    	abstract = "Whereas the established interconnection networks (ICTN) achieve low latency by operating in the linear region, i.e. oversizing the fabric, the strict cost and power constraints demand more efficient utilization of future networks. Increasing the utilization of lossless ICTNs may, however, lead to saturation and performance degradation owing to HOL-blocking. The current solution to HOL-blocking consists of using virtual output queueing (VOQ), whose quadratic scalability is expensive in large networks. To improve VOQ's scalability we have proposed the destination-based buffer management (DBBM), a scheme that compares well with VOQ. Whereas previously we have analyzed DBBM's basic operation and performance, in this paper we have set two different goals. First we focus on how the different DBBM mappings can impact the cost/performance of multistage ICTNs. Next, because DBBM can introduce unfairness, this constitutes the second theme of our paper. The new results show that DBBM with modulo-4/8 mapping performs very well for only a fraction of the VOQ cost. Also in terms of fairness DBBM shows promise, because it (i) keeps the unfairness degree independent of both topology and routing, while (ii) minimizing the number of flows affected by unfairness",
    	booktitle = "Euro-Par 2005 Parallel Processing",
    	doi = "10.1007/11549468_112",
    	editor = "Jos{\'e} C. Cunha and Pedro D. Medeiros",
    	isbn = "978-3-540-28700-1",
    	journal = "Euro-Par 2005 Parallel Processing. 11th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 3648)",
    	keywords = "buffer storage;multistage interconnection networks;performance evaluation;queueing theory;",
    	month = "Aug",
    	note = "fairness evaluation;queue mapping policies;interconnection networks;destination-based buffer management;multistage ICTN;",
    	pages = "1024 - 1034",
    	series = "Lecture Notes in Computer Science",
    	title = "{C}ost/performance trade-offs and fairness evaluation of queue mapping policies",
    	url = "http://dx.doi.org/10.1007/11549468_112",
    	volume = 3648,
    	year = 2005
    }
    
  65. Teresa Nachiondo, Jose Flich and Jose Duato. Efficient reduction of HOL blocking in multistage networks. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. April 2005, 8 pp. URL, DOI BibTeX

    @conference{1420115,
    	author = "Nachiondo, Teresa and Flich, Jose and Duato, Jose",
    	abstract = "Head-of-line blocking is one of the main problems arising in input-buffered switches. The best-known solution to this problem consists of using virtual output queues (VOQs). However, this strategy is not scalable. Its implementation cost increases quadratically with the number of ports in the switch. Taking into account current trends, the demand for a larger number of ports in high-performance switches is likely to increase very rapidly in the future. Therefore, a scalable and cost-effective solution is required. In this paper we propose an efficient and cost-effective strategy (belonging to a family of strategies previously proposed, referred to as destination-based buffer management (DBBM)), to reduce HOL blocking in single-stage and multistage networks. The proposed strategy is based on allowing certain destinations to share the same queue. Its main purpose is to maximize network throughput while keeping HOL blocking at negligible values. In this paper, we apply the strategy at every switch included in a bidirectional multistage network (BMIN). We have evaluated DBBM, VOQ, and alternative strategies in different BMIN sizes and with different traffic conditions (synthetic traffic, IP traces, and I/O traces). Results show that DBBM with a reduced number of queues at each switch obtains roughly the same throughput as the VOQ mechanism. Moreover, VOQ at the switch level (as many queues as output ports at every switch) has also been analyzed. Results demonstrate that it does not scale. As the number of stages in the network increases, the VOQ solution at the switch level introduces more HOL blocking that leads to a severe degradation in network throughput. With the DBBM using 16 queues, maximum network throughput is sustained for all the traffic cases analyzed. Moreover, as the network size increases (up to a 2048 x 2048 BMIN), DBBM keeps roughly the same performance with the same number of queues.",
    	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
    	doi = "10.1109/IPDPS.2005.193",
    	isbn = "0-7695-2312-9",
    	keywords = "bidirectional multistage network; destination-based buffer management; head-of-line blocking; high-performance switch; virtual output queue; multistage interconnection networks; queueing theory; storage management; telecommunication traffic;",
    	month = "April",
    	pages = "8 pp.",
    	title = "{E}fficient reduction of {HOL} blocking in multistage networks",
    	url = "http://dx.doi.org/10.1109/IPDPS.2005.193",
    	year = 2005
    }
    
  66. R Martinez, J L Sanchez, F J Alfaro, Vicente Chirivella and Jose Flich. Studying the effect of the design parameters on the interconnection network performance in NOWs. In Parallel, Distributed and Network-Based Processing, 2005. PDP 2005. 13th Euromicro Conference on. February 2005, 102 - 109. URL, DOI BibTeX

    @conference{1386048,
    	author = "R. Martinez and J.L. Sanchez and F.J. Alfaro and Chirivella, Vicente and Flich, Jose",
    	abstract = "With the increasing use of networks of workstations (NOWs) as an alternative to huge parallel computers, it has become essential to design high-performance interconnection networks for the communication between the nodes of these clusters. A large number of studies have been carried out to achieve this objective. Most of them propose a new technique that affects one of the parameters that characterize the interconnection network. These techniques are either completely new or inspired by techniques previously used in multiprocessor systems. The impact of the proposal is studied (in most cases using simulation), and the effect of the new technique on system performance is compared with that of existing techniques. In this kind of study most of the network parameters are fixed and usually only a few parameters are varied. This paper presents a more general study of the interconnection network performance. This study shows the effect of different design parameters on network performance, and the interaction between them. This study would not be viable with the traditional techniques due to the number of simulations required. Experimental design is used instead to carry out the study.",
    	booktitle = "Parallel, Distributed and Network-Based Processing, 2005. PDP 2005. 13th Euromicro Conference on",
    	doi = "10.1109/EMPDP.2005.40",
    	isbn = "0-7695-2280-7",
    	issn = "1066-6192",
    	keywords = "interconnection network performance; multiprocessor systems; network of workstations; network parameters; multiprocessor interconnection networks; performance evaluation; workstation clusters;",
    	month = "Feb",
    	organization = "IEEE",
    	pages = "102 - 109",
    	title = "{S}tudying the effect of the design parameters on the interconnection network performance in {NOW}s",
    	url = "http://dx.doi.org/10.1109/EMPDP.2005.40",
    	year = 2005
    }
    
  67. Jose Duato, I Johnson, Jose Flich, F Naven, P Garcia and Teresa Nachiondo. A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. February 2005, 108 - 119. URL, DOI BibTeX

    @conference{1385933,
    	author = "Duato, Jose and I. Johnson and Flich, Jose and F. Naven and P. Garcia and Nachiondo, Teresa",
    	abstract = "In this paper, we propose a new congestion management strategy for lossless multistage interconnection networks that scales as network size and/or link bandwidth increase. Instead of eliminating congestion, our strategy avoids performance degradation beyond the saturation point by eliminating the HOL blocking produced by congestion trees. This is achieved in a scalable manner by using separate queues for congested flows. These are dynamically allocated only when congestion arises, and deallocated when congestion subsides. Performance evaluation results show that our strategy responds to congestion immediately and completely eliminates the performance degradation produced by HOL blocking while using only a small number of additional queues.",
    	booktitle = "High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on",
    	doi = "10.1109/HPCA.2005.1",
    	isbn = "0-7695-2275-0",
    	issn = "1530-0897",
    	keywords = "HOL blocking; congestion management; congestion trees; lossless multistage interconnection networks; network queue; computer network management; multistage interconnection networks; queueing theory; telecommunication congestion control;",
    	month = "Feb",
    	pages = "108 - 119",
    	title = "{A} new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks",
    	url = "http://dx.doi.org/10.1109/HPCA.2005.1",
    	year = 2005
    }
    
  68. Michihiro Koibuchi, Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Enforcing in-order packet delivery in system area networks with adaptive routing. Journal of Parallel and Distributed Computing 65(10):1223 - 1236, 2005. URL BibTeX

    @article{2005379355213,
    	author = "Michihiro Koibuchi and Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "Adaptive routing, which dynamically selects the route of packets, has been widely studied for interconnection networks in massively parallel computers and system area networks. Although adaptive routing has the advantage of providing high bandwidth, it may deliver packets out-of-order, which some message passing libraries do not accept. In this paper, we propose two mechanisms called (1) FIFO transmission and (2) couple limitation to guarantee in-order packet delivery in adaptive routing. Both of them limit packet injection at source hosts. The FIFO transmission completely avoids packet sorting at destination hosts, while the couple limitation uses a few buffers to sort packets at destination hosts. Evaluation results show that the FIFO transmission and the couple limitation achieve a similar throughput to that of a method equipped with huge (infinite) buffers large enough to store all out-of-order packets at destination hosts under both synthetic traffic and NAS Parallel Benchmarks. © 2005 Elsevier Inc. All rights reserved.",
    	issn = "0743-7315",
    	journal = "Journal of Parallel and Distributed Computing",
    	key = "Packet networks",
    	keywords = "Bandwidth;Benchmarking;Interconnection networks;Routers;Telecommunication traffic;",
    	note = "Adaptive routing;In-order packet delivery;PC clusters;System area networks;",
    	number = 10,
    	pages = "1223 - 1236",
    	title = "{E}nforcing in-order packet delivery in system area networks with adaptive routing",
    	url = "http://dx.doi.org/10.1016/j.jpdc.2005.04.007",
    	volume = 65,
    	year = 2005
    }
    
  69. P J Garcia, Jose Flich, Jose Duato, F J Quiles, I Johnson and F Naven. On the correct sizing on meshes through an effective congestion management strategy. In Euro-Par 2005 Parallel Processing, Lecture Notes in Computer Science 3648. 2005, 1035 - 1045. BibTeX

    @conference{8746126,
    	author = "P.J. Garcia and Flich, Jose and Duato, Jose and F.J. Quiles and I. Johnson and F. Naven",
    	abstract = "Interconnection networks used in clusters of PCs are often dimensioned with certain restrictions. One restriction could be the reduction of power consumption and overall cost. In this sense, the network size must be reduced. Another restriction is to guarantee that the system offers a minimum bandwidth. In this case, the network size must be increased. In both cases, the head-of-line (HOL) blocking effect (related to network congestion) may appear, degrading network performance and thus, preventing the correct sizing of the network. Therefore, some mechanisms should be implemented for reducing or eliminating this problem, in order to dimension the network as desired while keeping network performance at maximum. In this paper we analyze the impact on network performance when using different mechanisms for handling HOL blocking when interconnection networks with mesh topology are dimensioned in several ways. We show that the previously proposed RECN congestion control mechanism is key in order to efficiently eliminate HOL blocking in meshes and, therefore, it allows the correct network sizing",
    	address = "Berlin, Germany",
    	booktitle = "Euro-Par 2005 Parallel Processing. 11th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol. 3648)",
    	keywords = "computer network management;multiprocessor interconnection networks;performance evaluation;telecommunication congestion control;",
    	note = "mesh network sizing;congestion management;interconnection networks;head-of-line blocking reduction;HOL blocking handling;RECN congestion control;",
    	pages = "1035 - 1045",
    	title = "{O}n the correct sizing on meshes through an effective congestion management strategy",
    	year = 2005
    }
    
  70. Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez, Jose Duato and M Koibuchi. In-Order Packet Delivery in Interconnection Networks using Adaptive Routing. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. 2005, 101 - 101. DOI BibTeX

    @conference{1419928,
    	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose and M. Koibuchi",
    	abstract = "Most commercial switch-based network technologies for PC clusters use deterministic routing. Alternatively, adaptive routing could be used to improve network performance. In this case, switches decide the path to reach the destination by using local information about the state of the possible outgoing links. However, there are two drawbacks that discourage adaptive routing from being applied to commercial interconnects. The first one concerns the possible switch complexity increase with respect to deterministic routing. The second drawback is due to the fact that adaptive routing may introduce out-of-order packet delivery, which is not acceptable for some applications. To the best of our knowledge, there are no works that analyze the degree of out-of-order packet delivery caused by different network and traffic conditions. In this paper, we take on such a challenge. We show that out-of-order delivery is introduced only under high traffic conditions (reaching saturation). Moreover, by using small buffers and simple sorting mechanisms at destination, we show that high network throughput can be obtained while packets are still delivered in order. Thus, the paper demonstrates that it is possible to use adaptive routing, while still guaranteeing in-order packet delivery, without using large buffer resources or significantly degrading its performance.",
    	booktitle = "Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International",
    	doi = "10.1109/IPDPS.2005.255",
    	keywords = "PC clusters; adaptive routing; deterministic routing; interconnection networks; out-of-order packet delivery; sorting mechanisms; switch-based network technologies; multiprocessor interconnection networks; network routing; packet switching; sorting; work",
    	month = "April",
    	pages = "101 - 101",
    	title = "{I}n-{O}rder {P}acket {D}elivery in {I}nterconnection {N}etworks using {A}daptive {R}outing",
    	year = 2005
    }
    
  71. P J Garcia, Jose Flich, Jose Duato, I Johnson, F J Quiles and F Naven. Dynamic evolution of congestion trees: Analysis and impact on switch architecture. Lecture Notes in Computer Science 3793. 2005, 266 - 285. BibTeX

    @conference{2006229908739,
    	author = "P.J. Garcia and Flich, Jose and Duato, Jose and I. Johnson and F.J. Quiles and F. Naven",
    	abstract = "Designers of large parallel computers and clusters are becoming increasingly concerned with the cost and power consumption of the interconnection network. A simple way to reduce them consists of reducing the number of network components and increasing their utilization. However, doing so without a suitable congestion management mechanism may lead to dramatic throughput degradation when the network enters saturation. Congestion management strategies for lossy networks (computer networks) are well known, but relatively little effort has been devoted to congestion management in lossless networks (parallel computers, clusters, and on-chip networks). Additionally, congestion is much more difficult to solve in this context due to the formation of congestion trees. In this paper we study the dynamic evolution of congestion trees. We show that, contrary to the common belief, trees do not only grow from the root toward the leaves. There exist cases where trees grow from the leaves to the root, cases where several congestion trees grow independently and later merge, and even cases where some congestion trees completely overlap while being independent. This complex evolution and its implications on switch architecture are analyzed, proposing enhancements to a recently proposed congestion management mechanism and showing the impact on performance of different design decisions. © Springer-Verlag Berlin Heidelberg 2005.",
    	address = "Barcelona, Spain",
    	issn = "0302-9743",
    	journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
    	key = "Trees (mathematics)",
    	keywords = "Computer networks;Congestion control (communication);Interconnection networks;Network components;Switching theory;Throughput;",
    	note = "Congestion management;Congestion trees;Lossless networks;Throughput degradation;",
    	pages = "266 - 285",
    	title = "{D}ynamic evolution of congestion trees: {A}nalysis and impact on switch architecture",
    	volume = "3793 LNCS",
    	year = 2005
    }
    
  72. Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, N A Nordbotten, O Lysne and T Skeie. An effective fault-tolerant routing methodology for direct networks. In Parallel Processing, 2004. ICPP 2004. International Conference on. 2004, 222 - 231 vol.1. URL, DOI BibTeX

    @conference{1327925,
    	author = "Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and N.A. Nordbotten and O. Lysne and T. Skeie",
    	abstract = "Current massively parallel computing systems are being built with thousands of nodes, which significantly affect the probability of failure. M. E. Gomez proposed a methodology to design fault-tolerant routing algorithms for direct interconnection networks. The methodology uses a simple mechanism: for some source-destination pairs, packets are first forwarded to an intermediate node, and later, from this node to the destination node. Minimal adaptive routing is used along both subpaths. For those cases where the methodology cannot find a suitable intermediate node, it combines the use of intermediate nodes with two additional mechanisms: disabling adaptive routing and using misrouting on a per-packet basis. While the combination of these three mechanisms tolerates a large number of faults, each one requires adding some hardware support in the network and also introduces some overhead. In this paper, we perform an in-depth detailed analysis of the impact of these mechanisms on network behaviour. We analyze the impact of the three mechanisms separately and combined. The ultimate goal of this paper is to obtain a suitable combination of mechanisms that is able to meet the trade-off between fault-tolerance degree, routing complexity, and performance.",
    	booktitle = "Parallel Processing, 2004. ICPP 2004. International Conference on",
    	doi = "10.1109/ICPP.2004.1327925",
    	issn = "0190-3918",
    	keywords = "direct networks; fault-tolerant routing algorithm; in-depth detailed analysis; interconnection networks; minimal adaptive routing; parallel computing system; communication complexity; fault tolerant computing; multiprocessor interconnection networks; par",
    	month = "Aug",
    	pages = "222 - 231 vol.1",
    	title = "{A}n effective fault-tolerant routing methodology for direct networks",
    	url = "http://dx.doi.org/10.1109/ICPP.2004.1327925",
    	year = 2004
    }
    
  73. J M Montañana, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. A transition-based fault-tolerant routing methodology for InfiniBand networks. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. April 2004, 186. URL, DOI BibTeX

    @conference{1303198,
    	author = "Monta{\~n}ana, J. M. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "Summary form only given. Currently, clusters of PCs are considered a cost-effective alternative to large parallel computers. As the number of elements increases in these systems, the probability of faults increases dramatically. Therefore, it is critical to keep the system running even in the presence of faults. The interconnection network plays a key role in its performance. InfiniBand (IBA) is a new standard interconnect suitable for clusters. Most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied to IBA because routing and virtual channel transitions are deterministic, which prevents packets from avoiding the faults. A possible approach to provide fault-tolerance in IBA consists of using several disjoint paths between every source-destination pair of nodes and selecting the appropriate path at the source host. However, to this end, a routing algorithm able to provide enough disjoint paths, while still guaranteeing deadlock freedom, is required. We propose a simple and effective fault-tolerant methodology for IBA networks that can be applied to any network topology and meets the trade-off between fault-tolerance degree and the number of network resources devoted to it. Preliminary results show that the proposed methodology scales well and supports up to three faults in 2D and five in 3D tori using only two virtual channels.",
    	booktitle = "Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International",
    	doi = "10.1109/IPDPS.2004.1303198",
    	isbn = "0-7695-2132-0",
    	keywords = "fault tolerant computing;multiprocessor interconnection networks;network topology;parallel machines;telecommunication network routing;workstation clusters;",
    	month = "April",
    	pages = 186,
    	title = "{A} transition-based fault-tolerant routing methodology for {I}nfini{B}and networks",
    	url = "http://dx.doi.org/10.1109/IPDPS.2004.1303198",
    	year = 2004
    }
    
  74. Jose Duato, Jose Flich and Teresa Nachiondo. A cost-effective technique to reduce HOL blocking in single-stage and multistage switch fabrics. In Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on. February 2004, 48 - 53. URL, DOI BibTeX

    @conference{1271426,
    	author = "Duato, Jose and Flich, Jose and Nachiondo, Teresa",
    	abstract = "Head-of-line (HOL) blocking is one of the main problems arising in input-buffered switches. The best-known solution to this problem consists of using virtual output queues (VOQs). However, this strategy is not scalable at all. Its implementation cost increases quadratically with the number of ports in the switch. Taking into account current trends, the demand for a larger number of ports in high-performance switches is likely to increase very rapidly in the near future. Therefore, a more scalable and cost-effective solution is required. We propose a very efficient and cost-effective technique, referred to as destination-based buffer management (DBBM), to reduce HOL blocking in single-stage and multistage switch fabrics. Results show that the use of the DBBM technique with a reduced number of queues at each IA is able to obtain roughly the same throughput as the VOQ mechanism. In particular, the number of queues can be reduced by a factor of up to 8 with the DBBM technique.",
    	booktitle = "Parallel, Distributed and Network-Based Processing, 2004. Proceedings. 12th Euromicro Conference on",
    	doi = "10.1109/EMPDP.2004.1271426",
    	isbn = "0-7695-2083-9",
    	issn = "1066-6192",
    	keywords = "cost-effective technique; destination-based buffer management; head-of-line blocking; input-buffered switches; multistage switch fabrics; single-stage switch fabrics; virtual output queues; IP networks; buffer storage; packet switching; queueing theory;",
    	month = "Feb",
    	pages = "48 - 53",
    	title = "{A} cost-effective technique to reduce {HOL} blocking in single-stage and multistage switch fabrics",
    	url = "http://dx.doi.org/10.1109/EMPDP.2004.1271426",
    	year = 2004
    }
    
  75. J M Stine, N P Carter and Jose Flich. Comparing Adaptive Routing and Dynamic Voltage Scaling for Link Power Reduction. Computer Architecture Letters 3(1):4 - 4, 2004. DOI BibTeX

    @article{1650125,
    	author = "J.M. Stine and N.P. Carter and Flich, Jose",
    	abstract = "We compare techniques that dynamically scale the voltage of individual network links to reduce power consumption with an approach in which all links in the network are set to the same voltage and adaptive routing is used to distribute load across the network. Our results show that adaptive routing with static network link voltages outperforms dimension-order routing with dynamic link voltages in all cases, because the adaptive routing scheme can respond more quickly to changes in network demand. Adaptive routing with static link voltages also outperforms adaptive routing with dynamic link voltages in many cases, although dynamic link voltage scaling gives better behavior as the demand on the network grows.",
    	doi = "10.1109/L-CA.2004.5",
    	issn = "1556-6056",
    	journal = "Computer Architecture Letters",
    	month = "january-december",
    	number = 1,
    	pages = "4 - 4",
    	title = "{C}omparing {A}daptive {R}outing and {D}ynamic {V}oltage {S}caling for {L}ink {P}ower {R}eduction",
    	volume = 3,
    	year = 2004
    }
    
  76. Maria E Gomez, Jose Duato, Jose Flich, Pedro Lopez, Antonio Robles, N A Nordbotten, O Lysne and T Skeie. An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori. Computer Architecture Letters 3(1):3 - 3, 2004. URL, DOI BibTeX

    @article{1650124,
    	author = "Gomez, Maria E. and Duato, Jose and Flich, Jose and Lopez, Pedro and Robles, Antonio and N.A. Nordbotten and O. Lysne and T. Skeie",
    	abstract = "In this paper we present a methodology to design fault-tolerant routing algorithms for regular direct interconnection networks. It supports fully adaptive routing, does not degrade performance in the absence of faults, and supports a reasonably large number of faults without significantly degrading performance. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, at this node, without being ejected, they are adaptively forwarded to their destinations. In order to allow deadlock-free minimal adaptive routing, the methodology requires only one additional virtual channel (for a total of three), even for tori. Evaluation results for a 4 x 4 x 4 torus network show that the methodology is 5-fault tolerant. Indeed, for up to 14 link failures, the percentage of fault combinations supported is higher than 99.96%. Additionally, network throughput degrades by less than 10% when injecting three random link faults without disabling any node. In contrast, a mechanism similar to the one proposed in the BlueGene/L, that disables some network planes, would strongly degrade network throughput by 79%.",
    	doi = "10.1109/L-CA.2004.1",
    	issn = "1556-6056",
    	journal = "Computer Architecture Letters",
    	month = "january-december",
    	number = 1,
    	pages = "3 - 3",
    	title = "{A}n {E}fficient {F}ault-{T}olerant {R}outing {M}ethodology for {M}eshes and {T}ori",
    	url = "http://dx.doi.org/10.1109/L-CA.2004.1",
    	volume = 3,
    	year = 2004
    }
    
  77. Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, Jose Duato, N A Nordbotten, O Lysne and T Skeie. An effective fault-tolerant routing methodology for direct networks. In 2004 International Conference on Parallel Processing. 2004, 222 - 231.

    @conference{8279975,
    	author = "Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose and N.A. Nordbotten and O. Lysne and T. Skeie",
    	abstract = "Current massively parallel computing systems are being built with thousands of nodes, which significantly affect the probability of failure. M. E. Gomez proposed a methodology to design fault-tolerant routing algorithms for direct interconnection networks. The methodology uses a simple mechanism: for some source-destination pairs, packets are first forwarded to an intermediate node, and later, from this node to the destination node. Minimal adaptive routing is used along both subpaths. For those cases where the methodology cannot find a suitable intermediate node, it combines the use of intermediate nodes with two additional mechanisms: disabling adaptive routing and using misrouting on a per-packet basis. While the combination of these three mechanisms tolerates a large number of faults, each one requires adding some hardware support in the network and also introduces some overhead. In this paper, we perform an in-depth detailed analysis of the impact of these mechanisms on network behaviour. We analyze the impact of the three mechanisms separately and combined. The ultimate goal of this paper is to obtain a suitable combination of mechanisms that is able to meet the trade-off between fault-tolerance degree, routing complexity, and performance",
    	address = "Los Alamitos, CA, USA",
    	booktitle = "2004 International Conference on Parallel Processing",
    	keywords = "communication complexity;fault tolerant computing;multiprocessor interconnection networks;parallel processing;",
    	note = "parallel computing system;fault-tolerant routing algorithm;interconnection networks;minimal adaptive routing;in-depth detailed analysis;direct networks;",
    	pages = "222 - 231",
    	title = "{A}n effective fault-tolerant routing methodology for direct networks",
    	volume = 1,
    	year = 2004
    }
    
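  78. Maria E Gomez, Jose Duato, Jose Flich, Pedro Lopez, Antonio Robles, N A Nordbotten, T Skeie and O Lysne. A new adaptive fault-tolerant routing methodology for direct networks. In High Performance Computing-HiPC 2004. 11th International Conference (Lecture Notes in Computer Science Vol.3296). 2004, 462 - 473.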

    @conference{8426282,
    	author = "Gomez, Maria E. and Duato, Jose and Flich, Jose and Lopez, Pedro and Robles, Antonio and N.A. Nordbotten and T. Skeie and O. Lysne",
    	abstract = "Interconnection networks play a key role in the fault tolerance of massively parallel computers, since faults may isolate a large fraction of the machine containing many healthy nodes. In this paper, we present a methodology to design fully adaptive fault-tolerant routing algorithms for direct interconnection networks that can be applied to different regular topologies. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, from this node, they are adaptively forwarded to their destination. This methodology requires only one additional virtual channel, even for tori. Evaluation results show that the methodology is 7-fault tolerant, and for up to 14 faults, more than 99% of the combinations are tolerated, also without significantly degrading performance in the presence of faults",
    	address = "Berlin, Germany",
    	booktitle = "High Performance Computing-HiPC 2004. 11th International Conference (Lecture Notes in Computer Science Vol.3296)",
    	keywords = "fault tolerant computing;multiprocessor interconnection networks;parallel processing;telecommunication network routing;telecommunication network topology;",
    	note = "adaptive fault-tolerant routing;direct interconnection networks;massively parallel computers;",
    	pages = "462 - 473",
    	title = "{A} new adaptive fault-tolerant routing methodology for direct networks",
    	year = 2004
    }
    
  79. T Skeie, O Lysne, Jose Flich, Pedro Lopez, Antonio Robles and Jose Duato. LASH-TOR: a generic transition-oriented routing algorithm. In Parallel and Distributed Systems, 2004. ICPADS 2004. Proceedings. Tenth International Conference on. 2004, 595 - 604.

    @conference{1316144,
    	author = "T. Skeie and O. Lysne and Flich, Jose and Lopez, Pedro and Robles, Antonio and Duato, Jose",
    	abstract = "Cluster networks are seen as the future access networks for multimedia streaming, e-commerce, network storage, etc. For these applications, performance and high availability are particularly crucial. Regular topologies are preferred when performance is the primary concern. However, due to spatial constraints or fault-related issues, the network structure may become irregular, which makes it more difficult to find deadlock-free minimal paths. Over the recent years, several solutions have been proposed. One of them is the LASH routing, which enables minimal routing by assigning paths to different virtual layers. In this paper, we propose an extension of LASH in order to reduce the number of required virtual layers by allowing transitions between virtual layers. Evaluation results show that the new routing scheme (LASH-TOR) is able to obtain full minimal routing with a reduced number of virtual channels. For torus and mesh networks, with only two virtual channels, LASH throughput is increased by an average factor of improvement of 3.30 for large networks. For regular networks with some unconnected (faulty) links, equal performance improvements are achieved. Even for highly irregular networks of size up to 128 switches the new routing scheme only needs three virtual channels for guaranteeing minimal routing. Besides, LASH-TOR performs well compared to dimension order routing for mesh and torus networks.",
    	booktitle = "Parallel and Distributed Systems, 2004. ICPADS 2004. Proceedings. Tenth International Conference on",
    	doi = "10.1109/ICPADS.2004.1316144",
    	isbn = "0-7695-2152-5",
    	issn = "1521-9097",
    	keywords = "LASH routing; LASH-TOR; access networks; cluster networks; deadlock-free minimal paths; e-commerce; mesh network; multimedia streaming; network storage; network structure; spatial constraints; torus network; transition-oriented routing algorithm; virtual",
    	month = "7-9",
    	pages = "595 - 604",
    	title = "{LASH}-{TOR}: a generic transition-oriented routing algorithm",
    	url = "http://dx.doi.org/10.1109/ICPADS.2004.1316144",
    	year = 2004
    }
    
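  80. N A Nordbotten, Maria E Gomez, Jose Flich, Pedro Lopez, Antonio Robles, T Skeie, O Lysne and Jose Duato. A fully adaptive fault-tolerant routing methodology based on intermediate nodes. In Network and Parallel Computing. IFIP International Conference, NPC 2004. Proceedings (Lecture Notes in Computer Science Vol.3222). 2004, 341 - 356.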

    @conference{8322959,
    	author = "N.A. Nordbotten and Gomez, Maria E. and Flich, Jose and Lopez, Pedro and Robles, Antonio and T. Skeie and O. Lysne and Duato, Jose",
    	abstract = "Massively parallel computing systems are being built with thousands of nodes. Because of the high number of components, it is critical to keep these systems running even in the presence of failures. Interconnection networks play a key-role in these systems, and this paper proposes a fault-tolerant routing methodology for use in such networks. The methodology supports any minimal routing function (including fully adaptive routing), does not degrade performance in the absence of faults, does not disable any healthy node, and is easy to implement both in meshes and tori. In order to avoid network failures, the methodology uses a simple mechanism: for some source-destination pairs, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network). The methodology is shown to tolerate a large number of faults (e.g., five/nine faults when using two/three intermediate nodes in a 3D torus). Furthermore, the methodology offers a graceful performance degradation: in an 8 × 8 × 8 torus network with 14 faults the throughput is only decreased by 6.49%",
    	address = "Berlin, Germany",
    	booktitle = "Network and Parallel Computing. IFIP International Conference, NPC 2004. Proceedings (Lecture Notes in Computer Science Vol.3222)",
    	keywords = "fault tolerant computing;multiprocessor interconnection networks;packet switching;parallel processing;telecommunication network routing;",
    	note = "fully adaptive fault-tolerant routing;intermediate nodes;massively parallel computing systems;interconnection networks;minimal routing function;network failures;source-destination pairs;",
    	pages = "341 - 356",
    	title = "{A} fully adaptive fault-tolerant routing methodology based on intermediate nodes",
    	year = 2004
    }
    
  81. Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Applying in-transit buffers to boost the performance of networks with source routing. Computers, IEEE Transactions on 52(9):1134 - 1153, 2003.

    @article{1228510,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
    	abstract = "In this paper, we analyze in depth the effect of using ITB in the network, showing that they not only serve for guaranteeing minimal routing, but also that they are a powerful mechanism able to balance network traffic and reduce network contention. To demonstrate these capabilities, we apply the ITB mechanism to improved routing schemes, such as DFS and smart-routing. These routing algorithms (without ITB) are able to improve the performance of up*/down* by 30 percent and 90 percent, respectively, for a 32-switch network. The evaluation results show that, when ITB are used together with these improved routing algorithms, network throughput achieved by DFS and smart-routing can still be improved by 56 percent and 23 percent, respectively. However, smart-routing requires a time to compute the routing tables that rapidly grows with network size, it being impossible in practice to build networks with more than 32 switches. This high computational cost is mainly motivated by the need of obtaining deadlock-free routing tables. However, when ITB are used, one can decouple the stages of computing routing tables and breaking cycles. Moreover, as stated above, ITB can be used to reduce network contention. In this way, in this paper, we also propose a completely new routing algorithm that tries to balance network traffic by using a simple and low time consuming strategy. The proposed algorithm guarantees deadlock freedom and reduces network contention with the use of ITB. The evaluation results show that our algorithm obtains unprecedented throughputs in 32-switch networks, tripling the original up*/down* and almost doubling smart-routing.",
    	doi = "10.1109/TC.2003.1228510",
    	issn = "0018-9340",
    	journal = "Computers, IEEE Transactions on",
    	keywords = "32-switch network; DFS; ITB; NOW; breaking cycles; deadlock-free routing tables; in-transit buffers; minimal routing; network contention reduction; network performance; network throughput; network traffic balancing; networks of workstations; performance;",
    	month = "sept.",
    	number = 9,
    	pages = "1134 - 1153",
    	title = "{A}pplying in-transit buffers to boost the performance of networks with source routing",
    	volume = 52,
    	year = 2003
    }
    
  82. Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting fully adaptive routing in InfiniBand networks. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. April 2003, 10 pp.

    @conference{1213130,
    	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use in IBA of fully adaptive routing algorithms without using additional network resources to improve network performance. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance. In particular, network throughput increases up to an average factor of 3.9.",
    	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
    	doi = "10.1109/IPDPS.2003.1213130",
    	issn = "1530-2075",
    	keywords = "InfiniBand networks; distributed routing; fully adaptive routing; interprocessor communication; network performance; network throughput; processing nodes; computer networks; multiprocessor interconnection networks; performance evaluation;",
    	month = "april",
    	pages = "10 pp.",
    	title = "{S}upporting fully adaptive routing in {I}nfini{B}and networks",
    	url = "http://dx.doi.org/10.1109/IPDPS.2003.1213130",
    	year = 2003
    }
    
  83. Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting adaptive routing in InfiniBand networks. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on. 2003, 165 - 172.

    @conference{1183583,
    	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed deterministic routing because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper we propose a simple strategy to implement forwarding tables for IBA switches that supports adaptive routing while still maintaining compatibility with the IBA specifications. Adaptive routing can be individually enabled or disabled for each packet at the source node. The proposed strategy enables the use in IBA of any adaptive routing algorithm with an acyclic channel dependence graph. In this paper, we have taken advantage of the partial adaptivity provided by the well-known up*/down* routing algorithm. Evaluation results show that extending IBA switch capabilities with adaptive routing may noticeably increase network performance. In particular network throughput improvement can be, on average, as high as 46%.",
    	booktitle = "Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on",
    	doi = "10.1109/EMPDP.2003.1183583",
    	issn = "1066-6192",
    	keywords = "I-O devices; IBA switches; InfiniBand Architecture; InfiniBand networks; acyclic channel dependence graph; adaptive routing; deterministic routing; forwarding tables; interprocessor communication; network performance; network throughput; processing node",
    	month = "feb.",
    	pages = "165 - 172",
    	title = "{S}upporting adaptive routing in {I}nfini{B}and networks",
    	url = "http://dx.doi.org/10.1109/EMPDP.2003.1183583",
    	year = 2003
    }
    
  84. Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting fully adaptive routing in InfiniBand networks. In Proceedings International Parallel and Distributed Processing Symposium. 2003, 10 pp.

    @conference{7891311,
    	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed routing. However, routing in IBA is deterministic because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that support adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be enabled or disabled individually for each packet at the source node. Also, the proposed strategy enables the use in IBA of fully adaptive routing algorithms without using additional network resources to improve network performance. Evaluation results show that extending IBA switch capabilities with fully adaptive routing noticeably increases network performance. In particular, network throughput increases up to an average factor of 3.9",
    	address = "Los Alamitos, CA, USA",
    	booktitle = "Proceedings International Parallel and Distributed Processing Symposium",
    	keywords = "computer networks;multiprocessor interconnection networks;performance evaluation;",
    	note = "fully adaptive routing;InfiniBand networks;processing nodes;interprocessor communication;distributed routing;network performance;network throughput;",
    	pages = "10 pp.",
    	title = "{S}upporting fully adaptive routing in {I}nfini{B}and networks",
    	url = "http://dx.doi.org/10.1109/IPDPS.2003.1213130",
    	year = 2003
    }
    
  85. Maria E Gomez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. VOQSW: a methodology to reduce HOL blocking in InfiniBand networks. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. 2003, 10 pp.

    @conference{1213134,
    	author = "Gomez, Maria E. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "InfiniBand is a new switch-based standard interconnect for communication between processor nodes and I/O devices as well as for interprocessor communication. InfiniBand architecture allows switches to support up to 15 virtual lanes per port for data traffic. To route packets through a given virtual lane (VL), packets are labeled with a certain service level (SL) at injection time, and SLtoVL mapping tables are used at each switch to determine the VL to be used. Many previous works in the literature have shown that separate virtual lanes are able to reduce the influence of the well-known head-of-line (HOL) blocking effect on network performance. However, using virtual lanes to form separate virtual networks is not enough to eliminate the HOL blocking problem. Alternative solutions such as Virtual Output Queuing (VOQ) are able to eliminate it at the expense of modifying the switch buffer organization. In this paper, we propose an effective strategy to implement the VOQ scheme in IBA switches by using virtual lanes. This strategy does not require modifying the switch architecture; the SL to VL tables simply must be properly filled. Evaluation results show that our proposed VOQ scheme is able to outperform the results obtained with the virtual network approach using the same number of resources. Moreover, the methodology proposed to implement the VOQ scheme in IBA only requires a small number of resources in order to significantly improve network throughput.",
    	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
    	doi = "10.1109/IPDPS.2003.1213134",
    	keywords = "HOL blocking; InfiniBand networks; SL to VL mapping tables; head-of-line blocking effect; interprocessor communication; network performance; network throughput; switch buffer organization; switch-based standard interconnect; virtual lane; virtual output",
    	month = "22-26",
    	pages = "10 pp.",
    	title = "{VOQSW}: a methodology to reduce {HOL} blocking in {I}nfini{B}and networks",
    	year = 2003
    }
    
  86. JC Sancho, Antonio Robles, Pedro Lopez, Jose Flich and Jose Duato. Routing in InfiniBand (TM) torus network topologies. In P Sadayappan and CS Yang (eds.). 2003 International Conference on Parallel Processing, Proceedings. 2003, 509-518.

    @conference{ISI:000186828800056,
    	author = "JC Sancho and Robles, Antonio and Lopez, Pedro and Flich, Jose and Duato, Jose",
    	abstract = "InfiniBand is an interconnect standard for communication between processing nodes and I/O devices as well as for interprocessor communication (NOWs). The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology can be established by the customer. When performance is the primary concern, regular topologies are preferred. Low-dimensional tori (2D and 3D) are some of the regular topologies most widely used in commercial parallel computers. Routing in tori requires the use of virtual channels. Although InfiniBand provides support for deterministic routing and virtual channels, virtual channels are selected at each switch by service level (SL) identifiers associated with packets and do not depend on packet destination. This makes routing algorithm implementation more complex. In particular, a large number of SLs may be required, and SLs are a scarce resource. In this paper we analyze how several routing strategies can be applied to InfiniBand torus networks, also evaluating their resource requirements. In particular, we analyze and compare the well-known e-cube and up*/down* routing algorithms and the recently proposed Flexible routing algorithm.",
    	booktitle = "2003 International Conference on Parallel Processing, Proceedings",
    	editor = "Sadayappan, P and Yang, CS",
    	isbn = 0769520170,
    	note = "International Conference on Parallel Processing, Kaohsiung, Taiwan, Oct 06-09, 2003",
    	pages = "509-518",
    	title = "{R}outing in {I}nfini{B}and ({TM}) torus network topologies",
    	year = 2003
    }
    
  87. Juan Carlos Martinez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Supporting adaptive routing in IBA switches. Journal of Systems Architecture 49(10-11):441 - 456, 2003.

    @article{2003487758791,
    	author = "Martinez, Juan Carlos and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "InfiniBand is a new standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports distributed deterministic routing because forwarding tables store a single output port per destination ID. This prevents packets from using alternative paths when the requested output port is busy. Despite the fact that alternative paths could be selected at the source node to reach the same destination node, this is not effective enough to improve network performance. However, using adaptive routing could help to circumvent the congested areas in the network, leading to an increment in performance. In this paper, we propose a simple strategy to implement forwarding tables for IBA switches that supports adaptive routing while still maintaining compatibility with the IBA specs. Adaptive routing can be individually enabled or disabled for each packet at the source node. The proposed strategy enables the use in IBA of any adaptive routing algorithm with an acyclic channel dependence graph. In this paper, we have taken advantage of the partial adaptivity provided by the well-known up*/down* routing algorithm. Evaluation results show that extending IBA switch capabilities with adaptive routing may noticeably increase network performance. In particular, network throughput improvement can be, on average, as high as 66%.",
    	issn = "1383-7621",
    	journal = "Journal of Systems Architecture",
    	key = "Systems engineering",
    	keywords = "Algorithms;Communication;Information technology;Switches;Telecommunication networks;",
    	note = "Adaptive routing;",
    	number = "10-11",
    	pages = "441 - 456",
    	title = "{S}upporting adaptive routing in {IBA} switches",
    	url = "http://dx.doi.org/10.1016/S1383-7621(03)00103-6",
    	volume = 49,
    	year = 2003
    }
    
  88. J C Sancho, Juan Carlos Martinez, Antonio Robles, Pedro Lopez, Jose Flich and Jose Duato. Performance evaluation of COWS under real parallel applications. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International. 2003, 10 pp.

    @conference{1213371,
    	author = "J.C. Sancho and Martinez, Juan Carlos and Robles, Antonio and Lopez, Pedro and Flich, Jose and Duato, Jose",
    	abstract = "Clusters of workstations (COWS) are often arranged as a switch-based network with irregular topology. Usually, the evaluation of interconnection networks for COWS has been carried out by simulation using synthetic traffic and by traces from real parallel applications. Although both types of traffic are used as a first approximation of the behavior of the system, a more accurate behavior can be obtained by using real parallel applications. In this paper, a new simulation framework has been developed in order to evaluate interconnection networks under real parallel applications by using an execution-driven simulator. Moreover, the new simulator can be used to evaluate the impact on the performance of the whole system of several design parameters in addition to the interconnection network. Evaluation results show that the execution time of real parallel applications can be reduced by using an effective routing algorithm. Moreover, in some cases, the achieved improvements are higher than the ones achieved by improving other design issues, such as the processor instruction issue rate, the cache size or the network bandwidth.",
    	booktitle = "Parallel and Distributed Processing Symposium, 2003. Proceedings. International",
    	doi = "10.1109/IPDPS.2003.1213371",
    	issn = "1530-2075",
    	keywords = "COWS; cache size; clusters of workstations; execution-driven simulator; interconnection networks; network bandwidth; performance evaluation; processor instruction issue rate; simulation framework; switch-based network; discrete event simulation; performa",
    	month = "22-26",
    	pages = "10 pp.",
    	title = "{P}erformance evaluation of {COWS} under real parallel applications",
    	year = 2003
    }
    
  89. Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Boosting the performance of Myrinet networks. Parallel and Distributed Systems, IEEE Transactions on 13(11):1166 - 1182, November 2002.

    @article{1058099,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because it is a well-known commercial product and its behavior can be controlled by the software running in network interfaces (Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. We propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on both networks with regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by only modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.",
    	doi = "10.1109/TPDS.2002.1058099",
    	issn = "1045-9219",
    	journal = "Parallel and Distributed Systems, IEEE Transactions on",
    	keywords = "In-Transit Buffer; Myrinet network; irregular topologies; network interfaces; network performance boosting; network traffic; parallel computers; performance evaluation; scalability; simulation; throughput; up down source routing; workstation networks; wo",
    	month = "nov",
    	number = 11,
    	pages = "1166 - 1182",
    	title = "{B}oosting the performance of {M}yrinet networks",
    	url = "http://dx.doi.org/10.1109/TPDS.2002.1058099",
    	volume = 13,
    	year = 2002
    }
    
  90. Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Boosting the performance of Myrinet networks. Parallel and Distributed Systems, IEEE Transactions on 13(7):693 - 709, July 2002.

    @article{1019859,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because they are a well-known commercial product and their behavior can be controlled by the software running on the network interfaces (the Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. In this paper, we propose an in-transit buffer (ITB) mechanism to improve the network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like the Myrinet, analyzing its behavior on networks with both regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by simply modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. The results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network",
    	doi = "10.1109/TPDS.2002.1019859",
    	issn = "1045-9219",
    	journal = "IEEE Transactions on Parallel and Distributed Systems",
    	keywords = "Myrinet Control Program;Myrinet network performance;in-transit buffer mechanism;incremental expansion capability;irregular topologies;minimal routing;network interfaces;network throughput;network traffic patterns;performance evaluation;regular topologies;",
    	month = "jul",
    	number = 7,
    	pages = "693--709",
    	title = "{B}oosting the performance of {M}yrinet networks",
    	url = "http://dx.doi.org/10.1109/TPDS.2002.1019859",
    	volume = 13,
    	year = 2002
    }
    
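  91. Maria E Gomez, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Evaluation of routing algorithms for InfiniBand networks. In Euro-Par 2002 Parallel Processing. 8th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2400). 2002, 775-780. BibTeX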

    @conference{7568237,
    	author = "Gomez, Maria E. and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	abstract = "Storage area networks (SAN) provide the scalability required by the IT servers. The InfiniBand (IBA) interconnect is very likely to become the de facto standard for SAN as well as for NOW. The routing algorithm is a key design issue in irregular networks. Moreover, as several virtual lanes can be used and different network issues can be considered, the performance of the routing algorithms may be affected. In this paper we evaluate three existing routing algorithms (up*/down*, DFS, and smart-routing) suitable for being applied to IBA. Evaluation has been performed by simulation under different synthetic traffic patterns and I/O traces. Simulation results show that the smart-routing algorithm achieves the highest performance",
    	address = "Berlin, Germany",
    	booktitle = "Euro-Par 2002 Parallel Processing. 8th International Euro-Par Conference. Proceedings (Lecture Notes in Computer Science Vol.2400)",
    	keywords = "parallel algorithms;performance evaluation;telecommunication network routing;telecommunication standards;telecommunication traffic;workstation clusters;",
    	note = "routing algorithms;InfiniBand networks;storage area networks;SAN;scalability;de facto standard;IBA interconnect;NOW;irregular networks;virtual lanes;performance;up*/down* routing;DFS routing;smart routing;synthetic traffic patterns;I/O traces;simulation;IT servers;",
    	pages = "775--780",
    	title = "{E}valuation of routing algorithms for {I}nfini{B}and networks",
    	year = 2002
    }
    
  92. Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Removing the latency overhead of the ITB mechanism in COWs with source routing. In Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing. 2002, 463-470. URL BibTeX

    @conference{7205122,
    	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
    	abstract = "Clusters of workstations (COWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. The in-transit buffer (ITB) mechanism can improve network performance when applied to COWs with irregular topology and source routing. This mechanism considerably improves the performance of this kind of network when compared to current source routing algorithms; however, it introduces a latency penalty. An implementation of this mechanism was performed, showing that the latency overhead of the mechanism may be noticeable, especially for short messages and at low network loads. In this paper, we analyze in detail the latency overhead of ITBs, proposing several mechanisms to reduce, hide and remove it. Firstly, we show, by simulation, the effect of an ITB implementation that is much slower than the one implemented. Then we propose three mechanisms that try to overcome the latency penalty. All the mechanisms are simple and can be easily implemented; also, they are out of the critical path of the ITB packet-processing procedure. The results show very good behaviour of the proposed mechanisms, considerably reducing or even completely removing the latency overhead",
    	address = "Los Alamitos, CA, USA",
    	booktitle = "Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing",
    	keywords = "buffer storage;delays;performance evaluation;telecommunication network routing;workstation clusters;",
    	note = "latency overhead removal;in-transit buffer mechanism;workstation clusters;source routing;network performance;irregular network topology;short messages;network loads;simulation;latency penalty;critical path;packet processing procedure;",
    	pages = "463--470",
    	title = "{R}emoving the latency overhead of the {ITB} mechanism in {COW}s with source routing",
    	url = "http://dx.doi.org/10.1109/EMPDP.2002.994334",
    	year = 2002
    }
    
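  93. Jose Flich, Pedro Lopez, J C Sancho, Antonio Robles and Jose Duato. Improving InfiniBand routing through multiple virtual networks. In High Performance Computing. 4th International Symposium, ISHPC 2002. Proceedings (Lecture Notes in Computer Science Vol.2327). 2002, 49-63. BibTeX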

    @conference{7387421,
    	author = "Flich, Jose and Lopez, Pedro and J.C. Sancho and Robles, Antonio and Duato, Jose",
    	abstract = "InfiniBand is very likely to become the de facto standard for communication between nodes and I/O devices as well as for interprocessor communication. Often, the interconnection pattern is irregular. Up*/down* is the most popular routing scheme currently used in NOWs with irregular topologies. However, the main drawbacks of up*/down* routing are the unbalanced channel utilization and the difficulties to route most packets through minimal paths, which negatively affects network performance. Using additional virtual lanes can improve up*/down* routing performance by reducing the head-of-line blocking effect, but its use is not aimed to remove its main drawbacks. We propose a methodology that uses a reduced number of virtual lanes in an efficient way to achieve a better traffic balance and a higher number of minimal paths. This methodology is based on routing packets simultaneously through several properly selected up*/down* trees. To guarantee deadlock freedom, each up*/down* tree is built over a different virtual network. Simulation results show that the proposed methodology increases throughput up to an average factor ranging from 1.18 to 2.18 for 8, 16, and 32-switch networks by using only two virtual lanes. For larger networks with an additional virtual lane, network throughput is tripled, on average",
    	address = "Berlin, Germany",
    	booktitle = "High Performance Computing. 4th International Symposium, ISHPC 2002. Proceedings (Lecture Notes in Computer Science Vol.2327)",
    	keywords = "multiplexing;multiprocessor interconnection networks;telecommunication network routing;workstation clusters;",
    	note = "InfiniBand routing;networks of workstations;multiple virtual networks;interprocessor communication;NOWs;switch-based network;point-to-point links;up*/down* routing;head-of-line blocking effect;deadlock freedom;",
    	pages = "49--63",
    	title = "{I}mproving {I}nfini{B}and routing through multiple virtual networks",
    	year = 2002
    }
    
  94. P J Garcia, M D Mora, F J Alfaro, J L Sanchez and Jose Flich. Evaluation of alternative arbitration policies for Myrinet switches. In Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM. 2002, 162-169. BibTeX

    @conference{1016560,
    	author = "P.J. Garcia and M.D. Mora and F.J. Alfaro and J.L. Sanchez and Flich, Jose",
    	booktitle = "Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM",
    	pages = "162--169",
    	title = "{E}valuation of alternative arbitration policies for {M}yrinet switches",
    	year = 2002
    }
    
  95. J C Sancho, Antonio Robles, Jose Flich, Pedro Lopez and Jose Duato. Effective methodology for deadlock-free minimal routing in InfiniBand networks. In Parallel Processing, 2002. Proceedings. International Conference on. 2002, 409-418. DOI BibTeX

    @conference{1040897,
    	author = "J.C. Sancho and Robles, Antonio and Flich, Jose and Lopez, Pedro and Duato, Jose",
    	abstract = "The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology is arbitrarily established by the customer. We propose a simple and effective methodology for designing deadlock-free routing strategies that are able to route packets through minimal paths in InfiniBand networks. This methodology can meet the trade-off between network performance and the number of resources dedicated to deadlock avoidance. Evaluation results show that the resulting routing strategies significantly outperform up*/down* routing. In particular, throughput improvement ranges, on average, from 1.33 for small networks to 4.05 for large networks. Also, it is shown that just two virtual lanes and three service levels are enough to achieve more than 80% of the throughput improvement achieved by the best proposed routing strategy (the one that always provides minimal paths without limiting the number of resources).",
    	booktitle = "Parallel Processing, 2002. Proceedings. International Conference on",
    	doi = "10.1109/ICPP.2002.1040897",
    	issn = "0190-3918",
    	keywords = "InfiniBand architecture; InfiniBand networks; NOWs; deadlock-free minimal routing; interconnection pattern; minimal paths; network performance; packet routing; point-to-point links; service levels; switch-based network; throughput improvement; up*/down*",
    	pages = "409--418",
    	title = "{E}ffective methodology for deadlock-free minimal routing in {I}nfini{B}and networks",
    	year = 2002
    }
    
  96. Jose Flich, Pedro Lopez, M Perez Malumbres and Jose Duato. Boosting the performance of Myrinet networks. IEEE Transactions on Parallel and Distributed Systems 13(7):693-709, 2002. URL BibTeX

    @article{2002367073594,
    	author = "Flich, Jose and Lopez, Pedro and M. Perez Malumbres and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. These networks allow the customer to connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Some of these networks use source routing and wormhole switching. In particular, we are interested in Myrinet networks because it is a well-known commercial product and its behavior can be controlled by the software running in network interfaces (Myrinet Control Program, MCP). Usually, the Myrinet network uses up*/down* routing for computing the paths for every source-destination pair. In this paper, we propose the In-Transit Buffer (ITB) mechanism to improve network performance. We apply the ITB mechanism to NOWs with up*/down* source routing, like Myrinet, analyzing its behavior on both networks with regular and irregular topologies. The proposed scheme can be implemented on Myrinet networks by only modifying the MCP, without changing the network hardware. We evaluate by simulation several networks with different traffic patterns using timing parameters taken from the Myrinet network. Results show that the current routing schemes used in Myrinet networks can be strongly improved by applying the ITB mechanism. In general, our proposed scheme is able to double the network throughput on medium and large NOWs. Finally, we present a first implementation of the ITB mechanism on a Myrinet network.",
    	issn = "1045-9219",
    	journal = "IEEE Transactions on Parallel and Distributed Systems",
    	key = "Computer networks",
    	keywords = "Buffer storage;Computer hardware;Computer simulation;Computer workstations;Interfaces;Parallel processing systems;Program processors;Routers;Telecommunication traffic;Topology;",
    	note = "Myrinet networks;",
    	number = 7,
    	pages = "693--709",
    	title = "{B}oosting the performance of {M}yrinet networks",
    	url = "http://dx.doi.org/10.1109/TPDS.2002.1019859",
    	volume = 13,
    	year = 2002
    }
    
  97. J C Sancho, Jose Flich, Antonio Robles, Pedro Lopez and Jose Duato. Analyzing the influence of virtual lanes on the performance of InfiniBand networks. In Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM. 2002, 166-175. BibTeX

    @conference{1016568,
    	author = "J.C. Sancho and Flich, Jose and Robles, Antonio and Lopez, Pedro and Duato, Jose",
    	booktitle = "Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM",
    	pages = "166--175",
    	title = "{A}nalyzing the influence of virtual lanes on the performance of {I}nfini{B}and networks",
    	year = 2002
    }
    
  98. Salvador Coll, Jose Flich, M P Malumbres, Pedro Lopez, Jose Duato and F J Mora. A first implementation of in-transit buffers on Myrinet GM software. In Parallel and Distributed Processing Symposium, Proceedings 15th International. April 2001, 1640-1647. URL, DOI BibTeX

    @conference{925150,
    	author = "Coll, Salvador and Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose and F.J. Mora",
    	abstract = "Clusters of workstations (COWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these systems, the interconnection network connects hosts using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Myrinet is the most popular network used to build COWs. It uses source routing with the up*/down* routing algorithm. In previous papers we proposed the In-Transit Buffer (ITB) mechanism that improves network performance by allowing minimal routing, balancing network traffic, and reducing network contention. The mechanism is based on ejecting packets at some intermediate hosts and later re-injecting them into the network. Moreover, the ITB mechanism does not require additional hardware as it can be implemented on the software running at Myrinet network adapters. In this paper, we present a first implementation of the ITB mechanism on Myrinet GM software. We show the changes required in packet format and the modifications performed in the Myrinet Control Program (MCP). In addition, both the overhead introduced by the new code and the cost of extracting and re-injecting packets are measured. Results show that, even for this simple implementation, code overhead is only about 125 ns per packet and the message latency increase for messages that use the ITB mechanism is around 1.3 $\mu$s per ITB. This is the first attempt to implement this mechanism, showing that a real implementation of ITBs is feasible on Myrinet COWs, and the associated overhead does not restrict the potential benefits of this mechanism.",
    	booktitle = "Parallel and Distributed Processing Symposium, Proceedings 15th International",
    	doi = "10.1109/IPDPS.2001.925150",
    	isbn = "0-7695-0990-8",
    	issn = "1530-2075",
    	month = "apr",
    	pages = "1640--1647",
    	title = "{A} first implementation of in-transit buffers on {M}yrinet {GM} software",
    	url = "http://dx.doi.org/10.1109/IPDPS.2001.925150",
    	year = 2001
    }
    
  99. Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Improving network performance by reducing network contention in source-based COWs with a low path-computation overhead. In Parallel and Distributed Processing Symposium, Proceedings 15th International. April 2001, 8 pp. DOI BibTeX

    @conference{925016,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
    	abstract = "In previous papers, we have proposed the in-transit buffer mechanism (ITB) to improve network performance in COWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependences between channels by storing and later re-injecting packets at some intermediate hosts. However, it also has two additional features that can further improve network performance. First, the ITB mechanism reduces network contention because some messages are ejected from the network, freeing network links. Second, the ITB mechanism allows the use of any path between each source-destination pair, improving traffic balance. In this paper we present a new routing algorithm that takes advantage of ITB by exploiting both issues: traffic balance and network contention reduction. The evaluation results show that network throughput can be considerably improved. On average, network throughput increases with respect to up*/down* by factors of 2.51 and 3.77 in 32 and 64-switch networks, respectively",
    	booktitle = "Parallel and Distributed Processing Symposium, Proceedings 15th International",
    	doi = "10.1109/IPDPS.2001.925016",
    	keywords = "in-transit buffer mechanism;network contention;network performance;network throughput;source routing;source-based COWS;traffic balance;performance evaluation;workstation clusters;",
    	month = "apr",
    	pages = "8 pp.",
    	title = "{I}mproving network performance by reducing network contention in source-based {COW}s with a low path-computation overhead",
    	year = 2001
    }
    
  100. Pedro Lopez, Jose Flich and Jose Duato. Deadlock-free routing in InfiniBand™ through destination renaming. In Parallel Processing, International Conference on, 2001. 2001, 427-434. DOI BibTeX

    @conference{952089,
    	author = "Lopez, Pedro and Flich, Jose and Duato, Jose",
    	abstract = "The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links that supports any topology defined by the user including irregular ones, in order to provide flexibility and incremental expansion capability. Routing in IBA is distributed, based on forwarding tables, and only considers the packet destination ID for routing within subnets in order to drastically reduce forwarding table size. Unfortunately, the forwarding tables for most of the previously proposed routing algorithms for irregular topologies consider both the destination ID and the input channel. Therefore, these popular routing algorithms for irregular topologies may not be usable in InfiniBand networks because they do not conform to the IBA specifications. In this paper we propose an easy-to-implement strategy to adapt the forwarding tables already computed following any routing algorithm that considers the destination ID and the input channel into the required IBA forwarding table format. The resulting routing algorithm is deadlock-free on IBA. Indeed, the originally computed paths are not modified at all. Hence, the proposed strategy does not degrade performance with respect to the original routing scheme.",
    	booktitle = "Parallel Processing, International Conference on, 2001.",
    	doi = "10.1109/ICPP.2001.952089",
    	keywords = "InfiniBand Architecture; deadlock-free; destination renaming; packet destination; routing algorithms; switch-based network; multiprocessor interconnection networks; network routing;",
    	month = "sep",
    	pages = "427--434",
    	title = "{D}eadlock-free routing in {I}nfini{B}and\texttrademark{} through destination renaming",
    	year = 2001
    }
    
  101. Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving routing performance in Myrinet networks. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International. 2000, 27-32. URL, DOI BibTeX

    @conference{845961,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, packets are delivered using source routing. Due to the irregular topology, the routing scheme is often non-minimal. In this paper we analyze the routing scheme used in Myrinet networks in order to improve its performance. We propose new routing algorithms that balance the utilization of the available routes and always use minimal paths. We show through simulation that the current routing schemes used in Myrinet networks can be improved by modifying only the routing software without increasing the software overhead significantly. The overall throughput can be doubled without modifying the network hardware",
    	booktitle = "Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International",
    	doi = "10.1109/IPDPS.2000.845961",
    	keywords = "Myrinet networks;NOWs;networks of workstations;routing performance;routing scheme;network routing;workstation clusters;",
    	pages = "27--32",
    	title = "{I}mproving routing performance in {M}yrinet networks",
    	url = "http://dx.doi.org/10.1109/IPDPS.2000.845961",
    	year = 2000
    }
    
  102. Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving the performance of regular networks with source routing. In Parallel Processing, 2000. Proceedings. 2000 International Conference on. 2000, 353-361. URL, DOI BibTeX

    @conference{876151,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these machines, the network connects processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Also, when performance is the primary concern, these network products are being used to build large commodity clusters with regular topologies. In previous papers, we have proposed the in-transit buffer mechanism to improve network performance, applying it to NOWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependencies between channels by storing and later re-injecting packets at some intermediate hosts. In this paper we apply the in-transit buffer mechanism to regular networks with source routing in order to improve their performance. Also, two path selection policies are evaluated. The first one will always choose the same minimal path from source to destination, whereas the second one will choose from different alternative minimal paths in a round-robin fashion. The evaluation results show that the overall network throughput can be doubled for large networks",
    	booktitle = "Parallel Processing, 2000. Proceedings. 2000 International Conference on",
    	doi = "10.1109/ICPP.2000.876151",
    	keywords = "NOWs;networks of workstations;parallel computers;path selection policies;regular networks;round-robin;source routing;buffer storage;network routing;performance evaluation;workstation clusters;",
    	pages = "353--361",
    	title = "{I}mproving the performance of regular networks with source routing",
    	url = "http://dx.doi.org/10.1109/ICPP.2000.876151",
    	year = 2000
    }
    
  103. Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Performance evaluation of a new routing strategy for irregular networks with source routing. In Conference Proceedings of the 2000 International Conference on Supercomputing. 2000, 34-43. URL BibTeX

    @conference{7144248,
    	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, messages are delivered using the up*/down* routing algorithm. However, the up*/down* routing scheme is often non-minimal. Also, some of these networks use source routing. With this technique, the entire path to destination is generated at the source host before the message is sent. In this paper we develop a new mechanism in order to improve the performance of irregular networks with source routing, increasing overall throughput. With this mechanism, messages always use minimal paths. To avoid possible deadlocks, when necessary, routes between a pair of hosts are divided into sub-routes, and a special kind of virtual cut-through is performed at some intermediate hosts. We evaluate the new mechanism by simulation using parameters taken from the Myrinet network. We show that the current routing schemes used in Myrinet can be improved by modifying only the routing software without increasing its overhead significantly and, most importantly, without modifying the network hardware. The benefits of using the new routing scheme are noticeable for networks with 16 or more switches, and increase with network size. For 32 and 64-switch networks, throughput is increased on average by a factor ranging from 1.3 to 3.3",
    	address = "New York, NY, USA",
    	booktitle = "Conference Proceedings of the 2000 International Conference on Supercomputing",
    	keywords = "multiprocessor interconnection networks;network routing;performance evaluation;",
    	note = "performance evaluation;routing strategy;irregular networks;source routing;networks of workstations;deadlocks;virtual cut-through;Myrinet network;routing software;wormhole switching;minimal routing;",
    	pages = "34--43",
    	title = "{P}erformance evaluation of a new routing strategy for irregular networks with source routing",
    	url = "http://dx.doi.org/10.1145/335231.335235",
    	year = 2000
    }
    
  104. Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Improving the performance of regular networks with source routing. In Proceedings 2000 International Conference on Parallel Processing. 2000, 353-361. URL BibTeX

    @conference{6742420,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. In these machines, the network connects processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Also, when performance is the primary concern, these network products are being used to build large commodity clusters with regular topologies. In previous papers, we have proposed the in-transit buffer mechanism to improve network performance, applying it to NOWs with irregular topology and source routing. This mechanism allows the use of minimal paths among all hosts, breaking cyclic dependencies between channels by storing and later re-injecting packets at some intermediate hosts. In this paper we apply the in-transit buffer mechanism to regular networks with source routing in order to improve their performance. Also, two path selection policies are evaluated. The first one will always choose the same minimal path from source to destination, whereas the second one will choose from different alternative minimal paths in a round-robin fashion. The evaluation results show that the overall network throughput can be doubled for large networks",
    	address = "Los Alamitos, CA, USA",
    	booktitle = "Proceedings 2000 International Conference on Parallel Processing",
    	keywords = "buffer storage;network routing;performance evaluation;workstation clusters;",
    	note = "regular networks;source routing;networks of workstations;NOWs;parallel computers;path selection policies;round-robin;",
    	pages = "353--361",
    	title = "{I}mproving the performance of regular networks with source routing",
    	url = "http://dx.doi.org/10.1109/ICPP.2000.876151",
    	year = 2000
    }
    
  105. Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Improving routing performance in Myrinet networks. In Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000. 2000, 27-32. URL BibTeX

    @conference{6590291,
    	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. In some of these networks, packets are delivered using source routing. Due to the irregular topology, the routing scheme is often non-minimal. In this paper we analyze the routing scheme used in Myrinet networks in order to improve its performance. We propose new routing algorithms that balance the utilization of the available routes and always use minimal paths. We show through simulation that the current routing schemes used in Myrinet networks can be improved by modifying only the routing software without increasing the software overhead significantly. The overall throughput can be doubled without modifying the network hardware",
    	address = "Los Alamitos, CA, USA",
    	booktitle = "Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000",
    	keywords = "network routing;workstation clusters;",
    	note = "routing performance;Myrinet networks;NOWs;networks of workstations;routing scheme;",
    	pages = "27--32",
    	title = "{I}mproving routing performance in {M}yrinet networks",
    	url = "http://dx.doi.org/10.1109/IPDPS.2000.845961",
    	year = 2000
    }
    
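  106. Jose Flich, Pedro Lopez, M P Malumbres, Jose Duato and T Rokicki. Combining in-transit buffers with optimized routing schemes to boost the performance of networks with source routing. In High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940). 2000, 300-309. BibTeX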

    @conference{6977557,
    	author = "Flich, Jose and Lopez, Pedro and M.P. Malumbres and Duato, Jose and T. Rokicki",
    	abstract = "In previous papers we proposed the ITB mechanism to improve the performance of up*/down* routing in irregular networks with source routing. With this mechanism, both minimal routing and a better use of network links are guaranteed, resulting in an overall network performance improvement. In this paper, we show that the ITB mechanism can be used with any source routing scheme in the NOW environment. In particular, we apply ITB to DFS and Smart routing algorithms, which provide better routes than up*/down* routing. Results show that ITB strongly improves DFS (by 63%, for 64-switch networks) and Smart throughput (23%, for 32-switch networks)",
    	address = "Berlin, Germany",
    	booktitle = "High Performance Computing. Third International Symposium, ISHPC 2000. Proceedings (Lecture Notes in Computer Science Vol.1940)",
    	keywords = "buffer storage;network routing;performance evaluation;workstation clusters;",
    	note = "in-transit buffers;optimized routing schemes;network performance;source routing;ITB mechanism;NOW;Smart routing algorithm;DFS routing algorithm;",
    	pages = "300--309",
    	title = "{C}ombining in-transit buffers with optimized routing schemes to boost the performance of networks with source routing",
    	year = 2000
    }
    
  107. Jose Flich, M P Malumbres, Pedro Lopez and Jose Duato. Performance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation. In Parallel Processing, 1999. Proceedings. 1999 International Conference on. 1999, 146-153. DOI BibTeX

    @conference{797399,
    	author = "Flich, Jose and M.P. Malumbres and Lopez, Pedro and Duato, Jose",
    	abstract = "Networks of workstations (NOWs) are becoming increasingly popular as a cost-effective alternative to parallel computers. Typically, these networks connect processors using irregular topologies, providing the wiring flexibility, scalability, and incremental expansion capability required in this environment. Similar to the evolution of parallel computers, NOWs are also evolving from distributed memory to shared memory programming model. However, physical distances between processors are longer in NOWs than in tightly-coupled distributed shared-memory multiprocessors (DSMs), leading to higher message latency and lower network bandwidth. Therefore, the network may be a bottleneck when executing some parallel applications in a NOW supporting a shared-memory programming paradigm. In this paper we analyze whether the interconnection network is able to efficiently handle the traffic generated in a NOW with the shared memory model. In particular, we are interested in analyzing the influence of the routing mechanism in the performance of the system. We evaluate the behavior of a NOW with irregular topology by means of an execution-driven simulator using SPLASH-2 applications as the input load. The results show that the routing algorithm can considerably reduce the total execution time of applications. In particular routing adaptivity can reduce the total execution time by 58% in some applications. These results confirm the behavior observed in previous works using synthetic traffic loads",
    	booktitle = "Parallel Processing, 1999. Proceedings. 1999 International Conference on",
    	doi = "10.1109/ICPP.1999.797399",
    	keywords = "SPLASH-2;distributed shared-memory multiprocessors;execution-driven simulation;execution-driven simulator;hardware shared memory model;incremental expansion capability;interconnection network;irregular topologies;message latency;networks of workstations;p",
    	pages = "146--153",
    	title = "{P}erformance evaluation of networks of workstations with hardware shared memory model using execution-driven simulation",
    	year = 1999
    }
    
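  108. Jose Flich, Pedro Lopez, M P Malumbres and Jose Duato. Edinet: an execution driven interconnection network simulator for DSM systems. In Computer Performance Evaluation. Modelling Techniques and Tools. 10th International Conference, Tools'98. Proceedings. 1998, 336-339. BibTeX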

    @conference{6161583,
    	author = "Flich, Jose and Lopez, Pedro and Malumbres, M. P. and Duato, Jose",
    	abstract = "Evaluation studies on interconnection networks for distributed memory multiprocessors usually assume synthetic or trace-driven workloads. However, when the final design choices must be made, a more precise evaluation study should be performed. In this paper, we describe a new execution-driven simulation tool to evaluate interconnection networks for distributed memory multiprocessors using real application workloads. As an example, we have developed an NCC-NUMA memory model and obtained some simulation results from the SPLASH-2 suite, using different network routing algorithms.",
    	address = "Berlin, Germany",
    	booktitle = "Computer Performance Evaluation. Modelling Techniques and Tools. 10th International Conference, Tools'98. Proceedings",
    	keywords = "discrete event simulation;distributed shared memory systems;multiprocessor interconnection networks;performance evaluation;",
    	note = "Edinet;execution driven interconnection network simulator;distributed memory multiprocessors;trace-driven workloads;execution-driven simulation tool;NCC-NUMA memory model;simulation results;SPLASH-2 suite;network routing algorithms;",
    	pages = "336--339",
    	title = "{E}dinet: an execution driven interconnection network simulator for {DSM} systems",
    	year = 1998
    }
    

Theses

Head-of-Line Blocking Reduction in Power-Efficient Networks-on-Chip. Jose Flich (Network-On-Chip)

Improving Network-on-Chip Performance in Multi-Core Systems. Jose Flich (Network-On-Chip)

Cost Effective Routing Implementations for On-chip Networks. Jose Flich (Network-On-Chip)

High Performance and Power Efficient On-Chip Network Designs through Multiple Injection Ports. Jose Flich (Network-On-Chip)

Floorplan-Aware High Performance NoC Design. Jose Flich, Federico Silla (Network-On-Chip)

Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors. Jose Flich (Network-On-Chip)

High-performance arch. for high-radix switches. Jose Flich, Jose Duato (Switch Architectures)

Design and Implementation of Efficient Topology Agnostic Routing Algorithms for Interconnection Networks. Jose Flich (Routing Algorithms)

Efficient mechanisms to provide fault tolerance in interconnection networks for PC Clusters. Jose Flich, Antonio Robles (Fault Tolerance)