J Supercomput (2015) 71:2339–2364 DOI 10.1007/s11227-014-1303-x
A HoL-blocking aware mechanism for selecting the upward path in fat-tree topologies C. Gómez · F. Gilabert · M. E. Gómez · P. López · J. Duato
Published online: 16 June 2015 © Springer Science+Business Media New York 2015
Abstract Large cluster-based machines require efficient high-performance interconnection networks. Routing is a key design issue of interconnection networks. Adaptive routing usually outperforms deterministic routing at the expense of introducing out-of-order packet delivery. Many of the commodity interconnects for clusters are based on fat-trees. The adaptive routing algorithm commonly used in fat-trees is composed of a fully adaptive upward subpath, followed by a deterministic downward subpath. As the latter is determined by the former, choosing the most adequate upward path for each packet is critical in fat-trees to achieve good performance. In this paper, we present a mechanism for selecting the upward path in fat-trees, which enables optimum use of the available network resources to achieve a high network throughput. The proposed path selection is destination based, which allows reducing the head-of-line blocking effect. Indeed, the proposed mechanism can be used either as a selection function (the provided path is used as the preferred one), or as a deterministic routing algorithm (the path is the only possible one). The results show that the resulting selection function outperforms any other known one. Moreover, the proposed deterministic routing algorithm can achieve a similar, or even higher, level of performance than adaptive routing, while providing in-order packet delivery and a simpler switch implementation. Keywords Regular indirect topologies · Fat-trees · Adaptive routing · Deterministic routing · In-order delivery of packets
C. Gómez (B) · F. Gilabert · M. E. Gómez · P. López · J. Duato Departamento de Informática de Sistemas y Computación, Universitat Politècnica de Valencia, Valencia, Spain e-mail:
[email protected]
1 Introduction
Clusters of PCs have become very popular in recent years for building high-performance computers, due to their excellent cost–performance ratio. These machines use commodity PCs linked by a high-performance interconnection network. The interconnection network plays a critical role in achieving high performance. Routing is one of the most important design issues of interconnection networks [6]. The routing strategy determines the path that each packet follows between a source–destination pair. In deterministic routing schemes, an injected packet traverses a fixed, predetermined path between its source and its destination, whereas with adaptive routing the packet may use any of several alternative paths. An adaptive routing algorithm is composed of a routing function and a selection function [6]. For each packet to be routed, the routing function supplies the set of routing options, and the selection function picks one of them, usually taking into account the status of the network. This information may include the status of links or the queue slot availability. This allows adaptive routing, in most cases, to better balance network traffic and to obtain a higher throughput [6]. However, with adaptive routing, in-order packet delivery cannot be ensured, which is mandatory for some applications. This is the case, for example, for certain cache coherence protocols, some communication libraries and network technologies [17]. On the other hand, deterministic routing algorithms usually do a poorer job of balancing traffic among the network links, due to the lack of path diversity, but they are simpler to implement and guarantee in-order packet delivery by design. In general, adaptive routing outperforms deterministic routing [6], since it takes advantage of network status information. Routing strategies can also be classified according to where routing decisions are taken.
In source routing, the source node computes the path prior to packet injection and stores it in the packet header. Source routing has been used in some networks because it allows routers to be very simple and fast [1,26]. In distributed routing, each switch computes the next link to be used by the packet while the packet travels across the network. The packet header only contains the destination node, which is shorter than the complete path required in source routing. Distributed routing can be implemented either by fixed hardware specific to a routing function on a given topology, or by using forwarding tables. The second option is very flexible, but suffers from a lack of scalability, as the forwarding table size grows linearly with network size and, most importantly, the time required to access the table usually also depends on its size, which is a limitation considering the increasing size of current machines. Concerning topology, cluster-based machines usually choose either regular direct networks (tori and meshes) or multistage indirect networks (MINs). Fat-trees have risen in popularity in the past few years, since most of the commonly used interconnect technologies (i.e., Myrinet [26], InfiniBand [17], Quadrics [28]) provide support for this topology. Moreover, some of the most powerful machines ever built use a fat-tree topology, such as the CM-5 [15], the Cray BlackWidow [29], or some machines in the topmost positions of the Top500 list [32], such as the Tianhe-1A [31]. For this reason, we focus on routing in fat-trees. In particular, this paper proposes a routing mechanism that makes efficient use of the rich connectivity of fat-trees.
In fat-trees, routing paths followed by packets have two different subpaths. First, packets follow an upward fully adaptive subpath, and then a unique downward deterministic subpath that is determined by the path selected in the upward subpath. So, in fat-trees, unlike in other topologies, the decisions made in the upward subpath can be critical and have a strong impact on network performance, since once the ascending path is completed, a unique descending path is available for the packet. In particular, in this paper, to improve the fat-tree routing algorithm and therefore its performance, we propose reducing the head-of-line (HoL) blocking effect as much as possible. To do this, we propose a mechanism to select the upward paths in fat-trees in such a way that packets sent to each destination have an exclusive down subpath, therefore avoiding interference with packets destined to other nodes. So, links and queues in the downward paths are only used by a single destination. This means that packets in the down path can only be blocked by other packets with the same destination node, but not by packets with other destinations. As a consequence, the HoL blocking effect is completely eliminated in the downward subpath. Avoiding the HoL blocking effect is important, as it may limit the throughput of the switch to about 58 % of its peak value [20]. Basically, HoL blocking occurs when a packet at the head of a queue blocks, preventing the rest of the packets in that queue from advancing, even if they request available output ports. As a consequence, average packet latency eventually increases and network throughput decreases. This effect is strongly related to high traffic loads and congestion situations [9], although it may also happen in other traffic scenarios. The proposed routing mechanism can be applied to both source and distributed routing, and although it could be implemented using forwarding tables, we propose a much simpler implementation.
Selecting only one component of the destination identifier is enough for routing, both in the up and down subpaths. Therefore, this implementation combines the advantages of both approaches: it provides a small routing delay at the switches, as source routing does, and a small header overhead, as distributed routing does. The mechanism can be used in two different ways. It can be used to implement a deterministic routing algorithm, if the output port provided by our mechanism at each switch is the unique routing option. Alternatively, it can be used with adaptive routing to implement the selection function. With adaptive routing, in the upward subpath, at a given switch, all the up output ports of the switch are considered for routing, and the output port provided by our mechanism is considered the preferred one. We will show that the proposed routing mechanism is very simple, but at the same time improves network throughput and reduces message latency. This is because it is able to balance network traffic and, most importantly, unlike previously proposed approaches, it avoids the HoL blocking effect, reducing congestion in the network as much as possible. To sum up, the proposed mechanism can provide (i) a selection function, in the case of adaptive routing, which outperforms other selection functions, and (ii) a deterministic routing algorithm that outperforms previously proposed deterministic and even adaptive routing algorithms, while keeping all the advantages of deterministic routing, such as in-order packet delivery and a simple implementation. The rest of the paper is organized as follows. Section 2 presents some related work. Section 3 revisits the fat-tree topology and presents the notation and assumptions used
in the following sections. Section 4 describes the adaptive routing algorithm commonly used in fat-trees. Section 5 presents the proposed output port selection mechanism for fat-trees, together with a very simple and compact implementation. Section 6 evaluates the performance of our proposal and compares it to other routing algorithms. Finally, some conclusions are drawn.
2 Related work
Several previous works deal with routing in fat-trees. The authors in [22] discuss adaptive versus oblivious routing in fat-trees and conclude that adaptive routing achieves a higher throughput. Deterministic routing is oblivious to the network state, and that paper shows that using information about the network helps to improve performance. Our paper, on the contrary, concludes that this is not always true if the routing paths are properly selected. In [12], the authors propose a randomized algorithm for routing messages in fat-trees. The algorithm repeatedly attempts to deliver a random subset of messages. As commented in the paper, it is difficult to implement, since it uses several parameters to select the messages to be sent that must be properly chosen to get good results, which entails a multidimensional search in a large space. The work in [27] has a different focus from our paper, since it analyzes the use of virtual channels with adaptive routing in a wormhole-routed fat-tree. In [18], a proposal for InfiniBand networks is made with the aim of reducing network contention. The paper proposes dynamically modifying the InfiniBand routing tables to adjust them to particular traffic patterns. From our point of view, this can be a complex mechanism, especially in large systems, as routing tables must be updated each time the traffic pattern changes. Moreover, none of the different mechanisms described in the paper to configure the forwarding tables leads to the routing algorithm proposed in this paper. In [10], a routing algorithm for multistage networks is proposed, where only one route is available for each source–destination pair (although the way to choose it is not described in the paper) when contention is not sensed; however, the route is changed once the number of blocked packets exceeds a specified threshold.
To change the route only when the new one is known to be contention free, the algorithm probes the alternate route; this is repeated until an unloaded path is finally found or until all possible routes have been probed multiple times. Again, the proposal is more complex than the one we present in this paper, since it needs the definition of a dispersion threshold and the use of probes until a contention-free route is found, which can affect network performance. In [19], the authors propose a strategy to implement adaptive and deterministic routing in an NoC fat-tree. Nevertheless, the selection policy among free channels is not clearly specified. In [34], the authors propose balancing the number of destinations in each switch port to obtain an optimized routing algorithm for a fat-tree in InfiniBand. However, balancing the number of destinations is not enough to achieve an optimal routing algorithm; this has to be done by balancing the routes that use each port, as our proposal does. Moreover, the authors do not clearly state whether they provide minimal routing, as we do. This is also the basic idea of the proposals in [16]. However, unlike ours, their route-balancing algorithm is agnostic to the topology, which results in extremely high complexity.
A mechanism similar to our proposal in the context of InfiniBand can be found in [23]. It proposes using multiple different paths to the same destination, selecting them depending on the source node identifier. This results in source-based load-balancing routing, as opposed to our proposal, which is destination based. As a consequence, in [23] the congestion due to a single congested physical destination is spread all over the network. On the contrary, our proposal reduces the HoL blocking effect by limiting, as much as possible, the links and ports used by packets destined to a single node. In [23], multiple virtual identifiers are used for the same destination (using the LMC InfiniBand mechanism). This routing algorithm was finally implemented in [33], but as commented in [10], it has several technical problems, for example the requirement of large routing tables (a different entry is required for each virtual identifier), which renders this approach rather impractical for medium and large networks. On the contrary, our proposal does not need to be implemented using forwarding tables, as shown below. This paper extends the work by the same authors in [14] and [11], and several subsequent works proposed routing mechanisms based on those papers, for instance [3,4,7].

3 Fat-tree topology
The k-ary n-trees are a parametric family of regular multistage topologies which enables a feasible implementation of the fat-tree topology. The number of stages is n, and k is the arity, i.e., the number of links of a switch that connect to the previous or to the next stage (so the switch degree is 2k). A k-ary n-tree is able to connect N = k^n processing nodes using n·k^(n−1) switches. Each processing node is represented as an n-tuple in {0, 1, …, k − 1}^n, and each switch is defined as a pair ⟨s, o⟩, where s is the stage where the switch is located, s ∈ {0, …, n − 1}, and o is an (n − 1)-tuple in {0, 1, …, k − 1}^(n−1) which identifies the switch inside the stage.
Figure 1 shows a 2-ary 4-tree, with 16 processing nodes and 32 switches. In a fat-tree, two switches ⟨s, o_{n−2}, …, o_1, o_0⟩ and ⟨s′, o′_{n−2}, …, o′_1, o′_0⟩ are connected by a link if s′ = s + 1 and o_i = o′_i for all i ≠ s. On the other hand, there is a link between the switch ⟨0, o_{n−2}, …, o_1, o_0⟩ and the processing node p_{n−1}, …, p_1, p_0 if o_i = p_{i+1} for all i ∈ {n − 2, …, 1, 0}. This link is labeled with p_0. In what follows, we will assume that descending links in each switch are labeled from 0 to k − 1, and ascending links from k to 2k − 1.

4 Adaptive routing in fat-trees
In k-ary n-trees, minimal routing from a source (p_{n−1}, …, p_1, p_0) to a destination (p′_{n−1}, …, p′_1, p′_0) can be accomplished by sending packets upward to one of the nearest common ancestors of the source and destination nodes and then, from there, downward to the destination. When crossing stages in the upward direction, several paths are possible, thus providing adaptive routing. In fact, each switch can select any of its k up output ports. However, in the downward subpath, just a single path is available.
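As a quick sanity check of the connectivity rules above, the following sketch (illustrative only; the function name and tuple encoding are ours, not from the paper) enumerates the switches of a k-ary n-tree and derives the inter-switch links from the rule s′ = s + 1, o_i = o′_i for all i ≠ s:

```python
from itertools import product

def ktree_links(k, n):
    """Enumerate the inter-switch links of a k-ary n-tree.

    Switches are pairs (s, o) with stage s in {0..n-1} and o an
    (n-1)-tuple over {0..k-1}; switches <s, o> and <s+1, o'> are
    linked iff o and o' agree on every component except index s.
    """
    switches = [(s, o) for s in range(n)
                for o in product(range(k), repeat=n - 1)]
    links = []
    for (s, o) in switches:
        if s == n - 1:           # top-stage switches have no up links
            continue
        for digit in range(k):
            o2 = list(o)
            o2[s] = digit        # only component s may differ
            links.append(((s, o), (s + 1, tuple(o2))))
    return switches, links

switches, links = ktree_links(2, 4)
print(len(switches))  # -> 32 switches = n * k^(n-1)
print(len(links))     # -> 48: k up links per switch in stages 0..n-2
```

For the 2-ary 4-tree of Fig. 1, this yields the expected 32 switches and 48 inter-switch links.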
Fig. 1 A 2-ary 4-tree, with 16 processing nodes and 32 switches. Adaptive routing is implemented using IR
The stage up to which the packet must be forwarded is obtained by comparing the source and destination components beginning from the most significant ones (p_{n−1} and p′_{n−1}). The first pair of components that differs indicates the last stage to which the packet must be forwarded upward. Once in the descending path, at each stage, the descending link to choose is indicated by the component corresponding to that stage in the destination n-tuple. That is, from stage i, the packet must be forwarded through link p′_i, from stage i − 1 through link p′_{i−1}, and so on.
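The stage computation just described can be sketched as follows (a minimal illustration; the helper names and the digit encoding are ours):

```python
def digits(node, k, n):
    """Destination as a component list (p_{n-1}, ..., p_1, p_0)."""
    out = []
    for _ in range(n):
        out.append(node % k)
        node //= k
    return out[::-1]            # most significant component first

def last_up_stage(src, dst, k, n):
    """Stage of the nearest common ancestors: given by the most
    significant component in which source and destination differ."""
    ps, pd = digits(src, k, n), digits(dst, k, n)
    for i in range(n):          # i = 0 is the most significant component
        if ps[i] != pd[i]:
            return n - 1 - i    # stage index of that component
    return -1                   # src == dst: no ascent needed

# In a 2-ary 4-tree, nodes 1 (0001) and 4 (0100) first differ in p_2,
# so packets must climb up to stage 2 before descending.
print(last_up_stage(1, 4, 2, 4))  # -> 2
```

This matches the example routed below, where a packet from node 1 to node 4 ascends to a stage-2 switch before taking its unique descending path.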
4.1 Adaptive routing implementation
This adaptive routing algorithm can be implemented in a very simple and compact way using interval routing (IR) [2]. In IR, each switch output port has one associated interval. Each packet is forwarded through the output port whose interval contains the destination of the packet. The interval associated with each output port may be cyclic (if its upper bound has a lower value than its lower bound) and is implemented with two registers. We will refer to these two registers as first interval (FI) and last interval (LI). Moreover, this scheme requires simple hardware, at most a pair of comparators per output link; therefore it is also very fast. IR is an alternative implementation of distributed routing, halfway between forwarding tables and fixed hardware. It is more memory efficient than forwarding tables, since it only requires storing 4k·log2(N) bits per switch, with k being the switch arity and N the number of destination nodes, and it is more flexible than fixed hardware. IR is especially interesting for fat-trees, since, as we will show, it allows implementing in a very simple and compact way the adaptive routing for fat-trees described above in this section. Moreover, with a small variation, as will be shown in Sect. 5.1.1, it also allows implementing the proposal made in this paper. Figure 1 presents an example configuration of the IR registers for a 16-node 2-ary 4-tree. As shown, some intervals must be cyclic. As an example, we describe how to route a packet from node 1 to node 4. Switch 0 can use both ascending links (links 2 and 3) to route the packet, since destination 4 is included in their associated intervals. Assuming that the selection function selects link 3, switch 9 is reached. Switch 9 can route the packet through any of its ascending links. Assume link 2 is selected; then the packet reaches switch 17, where the downward subpath begins.
At switch 17, only link 1 can be used to route the packet destined to node 4, so the packet arrives at switch 11. At this switch, link 0 is the only routing option, reaching switch 2. Finally, switch 2 delivers the packet to node 4 through link 0. Next, we present an algorithm to fill the FI and LI registers. Figure 2 shows the generic FI and LI configuration for the links of a switch of a k-ary n-tree. We first identify the reachable destinations through the descending links. Those destinations not included in this set are reachable through any ascending link.
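The interval membership test at the core of IR, including the cyclic (wrap-around) case, can be sketched in a few lines (the function name is ours; register widths are abstracted away):

```python
def in_interval(dest, fi, li):
    """True if dest falls in the (possibly cyclic) interval [fi, li].

    The interval is cyclic when li < fi; e.g. [14, 3] in a 16-node
    network covers destinations 14, 15, 0, 1, 2 and 3.
    """
    if fi <= li:
        return fi <= dest <= li
    return dest >= fi or dest <= li   # wrap-around case

# Plain interval
print(in_interval(4, 2, 5))    # -> True
# Cyclic interval [14, 3]
print(in_interval(1, 14, 3))   # -> True
print(in_interval(8, 14, 3))   # -> False
```

In hardware this is exactly the pair of comparators per output link mentioned above.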
Fig. 2 Register configuration for adaptive routing
Algorithm 1 IR configuration algorithm for a switch ⟨s, o_{n−2}, …, o_1, o_0⟩ at stage s of a k-ary n-tree

  {Configure the FI and LI registers of the down links}
  for i = 0 to k − 1 do
    FI_i ← o_{n−2}, …, o_s, i, 0 … 0
    LI_i ← o_{n−2}, …, o_s, i, k−1 … k−1
  end for
  {Configure the FI and LI registers of the up links}
  for i = k to 2k − 1 do
    FI_i ← (LI_{k−1} + 1) mod k^n
    LI_i ← (FI_0 − 1) mod k^n
  end for
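Algorithm 1 can be transcribed into a short executable sketch (the function name and the encoding of register contents as node numbers are ours, chosen for illustration):

```python
def ir_registers(switch_o, stage, k, n):
    """FI/LI register contents (as node numbers) for one switch,
    following Algorithm 1. switch_o is the tuple (o_{n-2}, ..., o_0)."""
    def value(components):
        v = 0
        for c in components:
            v = v * k + c
        return v

    prefix = list(switch_o[:n - 1 - stage])    # components o_{n-2} ... o_s
    regs = {}
    for i in range(k):                         # descending links 0 .. k-1
        regs[i] = (value(prefix + [i] + [0] * stage),
                   value(prefix + [i] + [k - 1] * stage))
    fi_up = (regs[k - 1][1] + 1) % k ** n      # next after largest down dest
    li_up = (regs[0][0] - 1) % k ** n          # before smallest down dest
    for i in range(k, 2 * k):                  # ascending links k .. 2k-1
        regs[i] = (fi_up, li_up)
    return regs

# Stage-0 switch <0, 001> of a 2-ary 4-tree: down links serve nodes 2
# and 3, and the up links get the cyclic interval [4, 1].
print(ir_registers((0, 0, 1), 0, 2, 4))
# -> {0: (2, 2), 1: (3, 3), 2: (4, 1), 3: (4, 1)}
```

Note how the ascending interval [4, 1] wraps around, covering every destination except 2 and 3, as the text describes.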
The nodes p_{n−1}, …, p_1, p_0 that are reachable through the descending switch links can be easily computed from the switch components. In particular, p_i = o_{i−1} for i ∈ {n − 1, …, s + 1}. This set of destinations is split into several subsets that can be reached through each descending link depending on the p_s component, with s being the switch stage. The subset of nodes with p_s = 0 is reachable through link 0; the subset with p_s = 1 through link 1, and so on. As an example, link 0 of switch ⟨s, o_{n−2}, …, o_1, o_0⟩ forwards packets destined to nodes o_{n−2}, …, o_s, 0, X … X. That is, FI_linkdesc = o_{n−2}, …, o_s, 0, 0 … 0 and LI_linkdesc = o_{n−2}, …, o_s, 0, k−1 … k−1, with link_desc ∈ {0, …, k − 1}. As commented above, the destinations that are not reachable through the descending links are reachable through any of the k ascending links. The FI register of the ascending links must store the destination following the largest one reachable through the descending links (i.e., the destination following the one stored in the LI register of descending link k − 1). Likewise, the LI register of the ascending links must store the destination preceding the smallest one reachable through the descending links. That is, FI_linkasc = (LI_link(k−1) + 1) mod k^n and LI_linkasc = (FI_link0 − 1 + k^n) mod k^n, with link_asc ∈ {k, …, 2k − 1} and k^n being the number of nodes in the network. We add k^n to avoid a negative destination value. This interval is the same for all the ascending links of the switch and can result in a cyclic interval. The algorithm to configure the IR intervals is shown in Algorithm 1.

5 A mechanism for selecting the output port
The routing algorithm described in Sect. 4 is adaptive in the upward subpath and deterministic in the downward subpath. In the upward subpath, at each traversed switch, the k up output ports of the switch are returned by the routing function.
A selection function is used to select the output port finally used. The downward path is determined by the switch reached in the upward subpath and, at each switch, the unique down output port to be used is given by the component of the packet destination identifier corresponding to the stage of the current switch. In this section, we present a mechanism to select one of the k output ports available at each switch in the upward subpath, with the aim of distributing packets in such a way
that the HoL blocking effect is avoided, which, as commented, is a key contributor to network throughput degradation. Reducing the HoL blocking effect allows maximizing the use of network resources and obtaining the maximum network performance. With our proposal, all packets destined to a particular destination are contained in a subtree in the upward subpath (if they reach the last stage, they arrive at the same switch), and each destination has a unique and exclusive down path without interference from packets destined to other nodes (see Fig. 4b). The upward phases from every source node are shown in red. As can be seen, packets destined to node 7 are confined in a network subtree instead of using all the network links as in adaptive routing. The unique downward phase is shown in green. Only one path is possible, and it is only used by packets destined to node 7. Thus, the HoL blocking effect is completely removed in the down subpaths and reduced as much as possible in the up subpaths. This can only be achieved if the destination identifier is used for routing in the upward subpaths. In particular, the mechanism proposed in this paper classifies packets in the upward subpath depending on their destinations, isolating different destinations as much as possible with the aim of reducing the sharing of network resources among different destinations. We will explain how our mechanism selects the output ports in the upward subpaths of a fat-tree using Fig. 3, where a 2-ary 3-tree is considered. In the first stage, our mechanism shuffles upward consecutive destinations among the k (2 in this example) up output ports of each switch, thus reaching different switches in the next stage. To do that, the least significant component of the packet destination address (the LSB in this example) is used to select the ascending output port.
That is, packets that must be forwarded upward select the ascending output port according to the least significant component of the packet destination (p_0). Therefore, consecutive destinations are sent to different switches in the next stage. At the second stage, the destinations of all packets that reach a given switch have the same least significant component. Hence, the component to consider in the selection of the up output port at this stage is the next one in the destination address. For instance, switch 4 is only reached by packets destined to nodes 0, 2, 4 and 6. These nodes have the same least significant component, which is 0. Of these destinations, only packets destined to nodes 4 (100) and 6 (110) must be forwarded upward. Packets destined to node 4 will select the first uplink, and packets destined to node 6 the other one. Considering all the switches of stage 1 (the second stage), packets destined to nodes 0 (000), 1 (001), 4 (100) and 5 (101) use the first up output port of the switches, and those packets destined to nodes 2 (010), 3 (011), 6 (110) and 7 (111) use the second output port. That is, the second least significant component (p_1) of the packet destination is used to select the output port. It might seem that packets destined to adjacent nodes like 2 and 3 will be sent to the same switch in the next stage, since they use the same up output port. However, notice that adjacent nodes always have a different least significant component, so they have already been sent to different switches at this stage, and they will be sent up to different switches in the next stage. In general, the selection of the output port made by our mechanism is based on the destination and the stage at which the packet is currently located. In particular, it considers the component of the packet destination corresponding to that stage (i.e., a switch located at stage s considers the sth component of the destination
Fig. 3 Deterministic routing in a 2-ary 3-tree with FIR
address, p_s). Therefore, at switch ⟨s, o_{n−2}, …, o_1, o_0⟩, the selected output port for a packet with destination p_{n−1}, …, p_1, p_0 will be k + p_s. At the last stage (stage 2 in our example), each downward port only allows sending packets to exactly one destination. The downward path is selected in the same way as with adaptive routing. Figure 3 shows a 2-ary 3-tree using the proposed mechanism, where each link has been labeled (in red bold-italic) with the destinations whose packets will be forwarded through it. For instance, link 2 at switch 0 will forward packets to nodes 2, 4 and 6, whereas link 0 at switch 4 will forward packets only to node 0. In addition, the FIR configuration of the routing algorithm at each port is shown, which will be explained later in Sect. 5.1.1. As can be noticed, the mechanism is very simple and easy to implement, but it has two important features. First, traffic in the network is completely balanced. The bottom of Fig. 3 shows the number of paths (source–destination pairs) that use each link at each stage. Indeed, all the ascending and descending links of a given stage are used by the same number of paths. This is also achieved by other previously proposed routing algorithms such as [23] (see the related work section). But a distinctive feature of our mechanism is that congestion is contained as much as possible,
(a) Source-based routing. (b) Proposed mechanism.
Fig. 4 Routes from all the source nodes to destination node 7
in contrast to [23]. As shown, packets destined to the same node are contained in a subtree, reaching the same switch at the last stage (if they have to reach the last stage) regardless of their source node. Moreover, each switch of the last stage receives packets addressed only to destinations whose identifiers differ only in the most significant component, and packets are forwarded depending on that destination component; so, packets destined to different nodes will be sent through different descending links in the last stage. Hence, each destination has a different descending path that does not share any link with packets destined to other nodes. This means that the HoL blocking effect is eliminated by our routing mechanism. As a consequence, in the downward subpath, congestion of one destination does not affect packets destined to other nodes. At the same time, the ascending paths of packets destined to different nodes are as disjoint as possible. Therefore, in both the ascending and the descending subpaths, the HoL blocking effect is strongly reduced, which results in a better utilization of the network resources. On the contrary, in [23] the output port in the upward subpath is selected depending on the packet source identifier and the stage of the switch. So packets with the same destination but different source nodes use different ascending paths, and therefore different descending paths. In this way, a possible congestion of a single destination node spreads all over the network. This is shown in Fig. 4, in which all the paths to destination 7 are shown for a source-based routing as in [23] (Fig. 4a) and for our proposed mechanism (Fig. 4b). In Fig. 4a, 16 network links are used by packets destined to node 7, whereas in Fig. 4b, as upward subpaths to a particular destination are contained in a subtree, only 6 network links are used.
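The destination- and stage-based upward selection described above reduces to a one-line rule, port = k + p_s, which can be sketched as follows (the function name is ours):

```python
def destro_up_port(dest, stage, k, n):
    """Upward port selection of the proposed mechanism: a switch at
    stage s picks up port k + p_s, where p_s is the stage-s component
    of the destination. (Down links are 0..k-1, up links k..2k-1.)"""
    p_s = (dest // k ** stage) % k     # stage-s digit of the destination
    return k + p_s

# 2-ary 3-tree: up-port choice for all eight destinations at stages 0 and 1
print([destro_up_port(d, 0, 2, 3) for d in range(8)])
# -> [2, 3, 2, 3, 2, 3, 2, 3]
print([destro_up_port(d, 1, 2, 3) for d in range(8)])
# -> [2, 2, 3, 3, 2, 2, 3, 3]
```

At stage 0 consecutive destinations are shuffled across the two up ports, and at stage 1 destinations 0, 1, 4 and 5 take the first up port while 2, 3, 6 and 7 take the second, exactly as in the Fig. 3 walk-through.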
The up output port selected at each switch by our mechanism can be viewed either as the unique routing option, thus yielding a deterministic routing algorithm, or as the preferred routing option with adaptive routing. In the latter case, the routing function returns the k up output ports of the switch and the proposed mechanism implements the selection function, prioritizing the same output port that would have been used with deterministic routing. If this up output port is busy, another up output port is selected from the ones returned by the routing function. However, when the proposed mechanism is used as a selection function, it does not confine congestion as well as the deterministic routing algorithm, since the output
port provided by the mechanism is only considered the preferred one; if it cannot be used, another one is chosen. Nevertheless, as we will see in Sect. 6, this selection function is the one that obtains the best performance results among the considered adaptive routing algorithms. The deterministic routing algorithm resulting from applying the mechanism proposed in this paper is referred to as Deterministic dEstination and STage based ROuting (DESTRO). The selection function obtained by applying the mechanism is referred to as Stage and Destination Priority (SADP).

5.1 Mechanism implementation

The mechanism proposed in this paper can be easily implemented using forwarding tables. However, we will show that both DESTRO and SADP can also be implemented in a very simple and efficient way.

5.1.1 DESTRO implementation

DESTRO can be easily implemented by using flexible interval routing (FIR) [13]. To make this paper self-contained, we briefly summarize FIR. In FIR, as in IR, each output port also has an associated routing interval, which can be cyclic and is implemented with the FI and LI registers. But, to add flexibility, additional registers are associated with the output ports. In particular, each output port has a mask register (MR), which indicates which bits of the packet destination address will actually be compared with the output port interval. Moreover, FIR allows applying some routing restrictions (originally intended to enforce deadlock freedom in meshes and tori) by means of an additional register associated with each output port, the routing restrictions register (RRR), which defines, for each output port, which other output ports of the switch should be used prior to this one. This register has one bit per output port. For a given output port i, the jth bit in the RRR indicates whether output port j has preference over output port i (bit set to 1) or not (bit set to 0).
Thus, the final routing decision for an output port i is obtained not only by comparing the masked destination with the interval bounds, but also by checking the bits in its RRR. Now, we show how the FIR registers can be configured to route packets in fat-trees following the proposed deterministic routing algorithm. Notice that the adaptive routing algorithm described in Sect. 4 can also be implemented with FIR, since IR is a subset of FIR. We start by explaining how to configure the ascending links. They are configured in a very different way from the adaptive routing case, since the number of ascending paths is reduced to one for each source–destination pair. At each switch, the ascending link to be used is obtained from the component of the packet destination corresponding to the stage at which the switch is located. To select the proper component of the destination identifier for the switch stage, we use the mask register: the MR of each ascending link sets to 1 the bits corresponding to the component associated with the switch stage. Figure 3 shows the FIR register configuration for a 2-ary 3-tree. In the first stage, MR is set to 001, as only the least significant component (or bit in this case)
Algorithm 2 Register configuration in the ascending links with deterministic routing.
  s ← current_stage
  r ← log2(k)
  for L = k → 2k − 1 do
    {Give preference to the descending links}
    RRR_L ← 0^k 1^k
    {Select the bits of the packet destination corresponding to the current stage}
    MR_L ← 0^(n·r−(s+1)·r) 1^r 0^(s·r)
    {The bits of the registers corresponding to the current stage are given by the link identifier: they must be equal to L − k, represented with r bits}
    FI_L ← 0^(n·r−(s+1)·r) (L − k) 0^(s·r)
    LI_L ← 0^(n·r−(s+1)·r) (L − k) 0^(s·r)
  end for
is selected to be compared with FI and LI, so in this stage packets will be forwarded through the ascending output port indicated by the least significant component of their destination. At the second stage, the next component (also a single bit in this network) is considered (MR is set to 010), and so on. In k-ary n-trees with k > 2, components have more than one bit and, thus, more than one bit will be set to 1 in the MR to select the component given by the switch stage. Descending links store the same values in FI and LI as with adaptive routing, since the path reduction is only done in the upward subpath. As the MR is not used in the downward subpath, it is set to all 1s to select all the bits in the destination address. Notice that the same downward paths valid in the adaptive routing case are also valid in the deterministic case, but only one is actually used. In Fig. 3, the destinations reachable through the descending links of a switch (for instance, destinations 0 and 1 at switch 0) are also included in the ascending intervals, so packets destined to these nodes could be incorrectly forwarded through those upward links. To avoid this problem, the RRR register gives preference to the descending links over the ascending ones, thus guaranteeing a minimal path. In the RRR, the k least significant bits correspond to the descending links. Therefore, in the ascending links (links 2 and 3), the RRR stores 0011, to give preference to links 0 and 1. In this way, when routing a packet to destination 0 at switch 0, both ports 0 and 2 may be allowed, since this destination, after being masked, is included in the intervals associated with both output ports; but, as output port 2 gives preference to output port 0, output port 2 is not finally returned by the routing function. Algorithm 2 shows how to configure the FIR registers of the ascending links. Their RRRs have the k least significant bits set to 1 to give preference to the descending links.
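As an illustration of the configuration above, the FIR lookup can be modeled as follows (our sketch, not the authors' implementation; the descending-port intervals are simplified placeholders valid only for switch 0 of a 2-ary 3-tree):

```python
# Hedged sketch of a FIR lookup for switch 0, stage 0, of a 2-ary 3-tree
# (k = 2, n = 3, r = 1). Ascending link L (L = 2, 3) matches destinations
# whose stage component equals L - k; its RRR prefers the descending ports.

K, N, R = 2, 3, 1
BITS = N * R                        # destination address width in bits

def comp_mask(stage):
    """MR value selecting the r bits of the stage-th component."""
    return ((1 << R) - 1) << (stage * R)

def make_regs(stage):
    """FIR registers (MR, FI, LI, RRR) for the 2k ports of one switch.
    Descending intervals are simplified placeholders valid for switch 0."""
    regs = {}
    for port in range(K):           # descending links: full-address compare
        regs[port] = dict(MR=(1 << BITS) - 1, FI=port, LI=port, RRR=0)
    for L in range(K, 2 * K):       # ascending links (Algorithm 2)
        val = (L - K) << (stage * R)
        regs[L] = dict(MR=comp_mask(stage), FI=val, LI=val,
                       RRR=(1 << K) - 1)    # prefer descending ports 0..k-1
    return regs

def matches(dest, reg):
    masked = dest & reg["MR"]
    return reg["FI"] <= masked <= reg["LI"]

def route(dest, regs):
    """Ports whose interval matches, after applying the RRR preferences."""
    hits = [p for p in regs if matches(dest, regs[p])]
    return [p for p in hits
            if not any((regs[p]["RRR"] >> q) & 1 for q in hits if q != p)]

regs = make_regs(stage=0)
print(route(0, regs))   # dest 0: ports 0 and 2 match, RRR keeps only port 0
print(route(7, regs))   # dest 7 = 111b: only ascending port 3
```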
Notice that, in general, the MRs will have as many bits set to 1 as the number of bits needed to represent a component of the destination address, that is, log2(k) bits. These 1s will be located in the positions corresponding to the switch stage and will select those bits in the destination identifiers. The rest of the MR bits will be set to 0. On the other hand, as explained above, the FI and LI registers of an ascending link select the destinations whose sth component (s being the switch stage) is assigned to this link. Since ascending links are labeled from k to 2k − 1 and destination identifier components vary from 0 to k − 1, the value to match in the destination identifier component is L − k, with L being the link identifier. The configuration for the descending links is not shown because it is very simple. The FI and LI registers are the same as in the adaptive
routing case, and the MR is set to all 1s so that all the bits in the destination address are compared with FI and LI. The RRR is set to all 0s, since no preference is given to the other output ports. As shown in [13], the routing delay of the FIR scheme, which affects the switch critical path, is very low. On the contrary, accessing a forwarding table may take longer, especially in large systems. Moreover, the required storage capacity is also much smaller using FIR.

5.1.2 Selection function implementation

When adaptive routing is used, the routing function supplies a set of output channels, and the selection function selects the output port that will finally be used. The adaptive routing implementation with interval routing shown in Sect. 4.1 corresponds to the routing function. In this section, we focus on the SADP selection function implementation. The selection function returns a preferred output link and only considers using other links if that one is not available. In that case, a subselection function is used, for instance a very simple one such as a linear rotative search.1 Therefore, the selection function is implemented in two steps. The first one obtains the preferred link (i). If this link is not free or does not have enough free space or credits (other criteria can also be applied), a linear rotative search is performed, starting at link i + 1, until a link that matches the criteria is found, if any. A possible implementation of this linear rotative search is a programmable priority encoder that receives as inputs the links returned by the routing function that are free, plus an extra input with the preferred link, so that the search starts from that link. If the preferred link is free, it is the final choice. Otherwise, the next one that is free is used.
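The two steps just described can be sketched as follows (our illustration; the `free` predicate and names are hypothetical, and the rotative search runs only over the k up ports returned by the routing function):

```python
# Sketch of the two-step SADP selection (our illustration; `free` stands
# for "this up port currently has buffer space/credits").

def sadp_select(dest, stage, k, free):
    """Return the chosen up port in k..2k-1, or None if none is free.

    Step 1: the preferred up port is given by the destination component
    of the current stage. Step 2: a linear rotative search (a programmable
    priority encoder in hardware) starts from the preferred port."""
    comp = (dest // k ** stage) % k      # stage-th destination component
    for i in range(k):                   # rotative search over the up ports
        port = k + (comp + i) % k
        if free[port]:
            return port
    return None

# Example with k = 4: the preferred port for component 2 is port 6;
# if port 6 is busy, the search wraps to the next free up port.
free = {4: True, 5: True, 6: False, 7: True}
print(sadp_select(dest=2, stage=0, k=4, free=free))   # port 6 busy -> 7
```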
Concerning the first step of the selection function, in SADP, the preferred link for a given packet in a particular switch is provided by the component of the packet destination corresponding to the stage at which the switch is located. This can be implemented with a multiplexer that selects a different component of the packet destination at each stage. The output of this multiplexer is the preferred uplink and is connected to the aforementioned programmable priority encoder. As can be noticed, both the routing function (based on IR) and the implementation of the selection function have a delay of a few gates, so the overall routing delay is small.

1 Other options can also be considered. However, we have found that the choice is not significant in the obtained results.

6 Performance evaluation

6.1 Network model

To evaluate the routing algorithms proposed above, a detailed event-driven simulator has been implemented. The simulator models a k-ary n-tree with FIR routing and virtual cut-through switching. Each switch has a full crossbar with queues of two
packets both at the input and output ports. We assumed that it takes 20 clock cycles to apply the routing algorithm, that switch and link bandwidth are one flit per clock cycle, and that fly time is 8 clock cycles. These values were used to model Myrinet networks in [8]. Credits are used to implement the flow control mechanism. We have evaluated both synthetic traffic and I/O traces. For synthetic traffic, we have considered four different patterns: uniform, uniform with hot-spot, bit-reversal and complement. In the uniform traffic pattern, the message destination is randomly chosen among all the processors. In the hot-spot traffic pattern, a percentage of the traffic is sent to a single node and the rest is uniformly distributed. In the bit-reversal traffic pattern, each node sends packets to the destination obtained by taking the bits of its identifier in reverse order. In the complement traffic pattern, each processor sends all its messages to the opposite node; thus, in a network with N processors, processor i sends messages to processor N − i − 1. The complement traffic pattern has two interesting properties in fat-tree networks. The first one is that all the packets have to reach the upper stage to arrive at their destination; hence, the uplink selection mechanism is applied the maximum possible number of times. The second one is that each processor node only sends messages to one destination (this also happens in bit-reversal traffic). All results shown in this paper have been obtained using a packet size of 8 KB. However, for other packet sizes (not shown) the same conclusions can be drawn. Each simulation point for synthetic traffic patterns represents the average result of several simulations with a 95 % confidence interval.
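For reference, the four synthetic patterns can be generated as follows (an illustrative sketch; the parameter names are ours, not the simulator's):

```python
# Illustrative destination generators for the four synthetic patterns:
# uniform, hot-spot, bit-reversal and complement.

import random

def uniform(src, n_nodes):
    return random.randrange(n_nodes)

def hotspot(src, n_nodes, hot, fraction=0.05):
    # a `fraction` of the traffic goes to the hot node, the rest is uniform
    return hot if random.random() < fraction else random.randrange(n_nodes)

def complement(src, n_nodes):
    # processor i sends to processor N - i - 1
    return n_nodes - src - 1

def bit_reversal(src, n_bits):
    # destination is the source identifier with its bits in reverse order
    return int(format(src, f"0{n_bits}b")[::-1], 2)

print(complement(0, 16))    # 15
print(bit_reversal(1, 9))   # 256, i.e. <0,0,1> -> <4,0,0> in an 8-ary 3-tree
```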
6.2 Evaluation results

Next, we first compare the SADP selection function against other selection functions, and then DESTRO against other deterministic and adaptive routing algorithms. With synthetic traffic, we have evaluated a wide range of k-ary n-tree topologies, with k being 2, 4, 8, 16 and 32 and n being up to 8. Due to space limitations, we show here only a subset of the most representative simulations. To analyze the performance of our proposal under a more realistic traffic pattern, we have used a set of I/O traces obtained from Hewlett-Packard Labs. They include the I/O activity generated in early 1999 at the disk interface of the cello system, a time-sharing system with a storage subsystem of 23 disks and 103 client nodes, for a total of 126 nodes in the network. The traces provide information both for the requests generated by the clients and for the answers generated by the disks. A detailed description of similar traces from 1992, collected in the same system, can be found in [30]. We use packets with a payload equal to the size specified in the trace for the I/O accesses, but if an access is larger than 8 KB, we split it into packets with, at most, an 8 KB payload. To simulate the cello system, we have simulated a 2-ary 7-tree in which the 23 disks are attached to 23 randomly selected leaves of the network. The remaining leaves are used to attach the clients.
6.2.1 Selection function evaluation
In this section, we compare SADP against the following selection functions when using adaptive routing:

First Free (FF). The FF selection function selects the first physical link that has free space. It uses a linear search, starting from the first ascending physical link (the kth link, according to our notation).

Cyclic Priority (CP). The CP selection function uses a round-robin algorithm to choose a different physical link each time a packet is forwarded.

More Credits (MC). Since we use credit-based flow control, the MC selection function [21] selects the link with the highest credit availability. MC is the most complex selection function, as it needs several comparators to select the link with the most available credits.

Adap-Lin. This selection function selects as preferred link the one used by the routing algorithm proposed in [23]. For each packet, the preferred link is selected depending on the packet source identifier and the stage of the switch.

Synthetic Traffic. Figure 5a shows results for a very small network (4-ary 2-tree, 16 nodes) with uniform traffic. The behavior of the different selection functions is not very different, with the exception of FF, which always returns the same preferred
Fig. 5 Average message latency versus accepted traffic for adaptive routing with different selection functions. Uniform traffic pattern. a 4-ary-2-tree. b 4-ary-4-tree. c 4-ary-6-tree
ascending link. Therefore, it saturates an ascending path before selecting another one. Hence, link utilization is unbalanced, as the links that belong to the preferred ascending paths always have a higher utilization than the others, resulting in a larger packet latency. On the other hand, as the network has only two stages, the rest of the selection functions achieve almost the same performance, because there is a low number of different paths to choose from to reach any destination. Figure 5b, c shows the results for a higher number of stages (4 and 6, respectively) and k = 4. As can be seen, the more stages in the network, the larger the differences among the selection functions. This is due to the fact that, with more stages, there are more ascending paths to choose from; therefore, there are more opportunities to balance traffic, since the selection function is applied more times. MC and CP provide better performance than FF, but SADP clearly outperforms them. Adap-Lin obtains lower latencies than the previous selection functions because it balances link utilization better, but SADP achieves the best performance because, besides balancing traffic very well, it ensures that, assuming the preferred ascending path of each packet is free, the links shared by different destinations are as few as possible, reducing the effects of HoL blocking as much as possible. The arity (k) of the tree also has some effect on performance. As k increases, there are more routing options to choose from, and selecting the best one leads to performance improvements. Figure 6 shows results for three networks with k = 8. As an example, comparing Fig. 5a with Fig. 6a and Fig. 5b with Fig. 6c, it can be observed that a higher k tends to increase the differences in latency among selection functions. Figure 7 shows the results for complement traffic in two networks, a 4-ary 4-tree (256 nodes) and an 8-ary 3-tree (512 nodes).
FF has the worst performance, while Adap-Lin and SADP have the best. Both selection functions behave similarly because each source node sends packets to a single, distinct destination node. Remember that Adap-Lin tries to provide disjoint paths for different source nodes and SADP tries to do the same for different destinations, which, for this traffic pattern, is equivalent. In Fig. 8 we can see how the different selection functions manage the congestion caused by congested destinations. It shows the performance of the selection functions for a 4-ary 4-tree with hot-spot traffic. In particular, one random destination receives an extra 5 % of traffic. For this traffic pattern, the worst throughput is obtained with Adap-Lin, because it spreads packets destined to the congested node all over the network. FF and MC are the next worst. Again, SADP obtains the best performance, as it confines the congestion due to packets destined to the hot-spot node to the minimum number of links and ports. Results for bit-reversal traffic and some of the selection functions will be shown in the DESTRO evaluation (see Sect. 6.2.2). Again, SADP is the best selection function.

I/O Traces. Figure 9 shows the cumulative average latency versus the elapsed time. As shown, the total time required to deliver all the packets included in the traces is almost the same for all the selection functions, because the network is able to deal with the injected traffic. However, the packet latency, as expected, varies greatly depending on the selection function. The relative differences in packet latency among the analyzed selection functions are similar to the ones shown with synthetic traffic. FF is the worst selection function, and SADP is the best one. Adap-Lin shows an
Fig. 6 Average message latency versus accepted traffic for adaptive routing with different selection functions. Uniform traffic pattern. a 8-ary-2-tree. b 8-ary-3-tree. c 8-ary-4-tree
Fig. 7 Average message latency versus accepted traffic for adaptive routing with different selection functions. Complement traffic pattern. a A 4-ary 4-tree. b An 8-ary 3-tree
intermediate behavior. CP and MC are not shown for the sake of clarity, since they obtain latencies similar to SADP. However, for some intervals of the simulation, SADP average latency was still lower.
Fig. 8 Average message latency versus accepted traffic for adaptive routing with different selection functions. 5 % hot-spot traffic. 4-ary-4-tree
Fig. 9 Cumulative average message latency versus simulation time for I/O traces. Adaptive routing with different selection functions. 2-ary-7-tree
6.2.2 DESTRO evaluation

In this section, we evaluate the proposed mechanism applied as a deterministic routing algorithm for fat-trees. To do so, we have compared DESTRO against the deterministic routing algorithm proposed in [23] (referred to as DET-Lin, see Sect. 2) and against some adaptive routing algorithms. For the sake of clarity, only FF and SADP are shown, that is, the worst and the best selection functions. Since deterministic routing provides in-order packet delivery, to perform a fair comparison, we have implemented a simple approach that ensures in-order delivery with adaptive routing. It is similar to the one proposed in [25], which uses a reorder buffer at the destination node NIC to store packets that arrive out of order. Every time a packet is injected, its sequence number is included in its header. When a packet arrives out of order at the destination, it is stored in the reorder buffer to wait for all packets with smaller sequence numbers. On the other hand, to prevent unnecessary packet retransmissions, the source node does not inject packets if the destination buffer does not have enough free space to store all the packets that have been sent previously. As shown, the mechanism to guarantee in-order delivery complicates the adaptive routing implementation, as a reorder buffer is required at the destination and an end-to-end
Fig. 10 Average message latency versus accepted traffic for DESTRO and adaptive routing with different selection functions. Uniform traffic pattern. a A 4-ary 4-tree. b An 8-ary 3-tree
flow control is needed. Notice, though, that in the results shown we assume that this reorder mechanism for adaptive routing has no cost (i.e., we assume that the ack control messages do not consume network bandwidth, and that a node can read any number of packets from the reorder buffer in one clock cycle), which benefits adaptive routing. Thus, the differences shown in the results are only due to the extra delay introduced by packets that arrive out of order.

Synthetic Traffic. Figure 10a plots the average network packet latency versus accepted traffic for a 4-ary 4-tree with a uniform traffic pattern. For the adaptive routing algorithms, the figure shows results without guaranteeing in-order packet delivery and guaranteeing it with different reorder buffer sizes. In particular, we have tested sizes ranging from 20 to 600 packets per source at each destination NIC. Remember that with deterministic routing, in-order packet delivery is ensured by design. As can be observed, DESTRO strongly outperforms adaptive routing with the FF selection function and obtains roughly the same throughput as adaptive routing with the SADP selection function, which was the one that provided the best results. For the uniform traffic pattern, almost all the packets are delivered in the correct order, because traffic is distributed among all destinations. This explains why similar performance results are obtained for all the reorder buffer sizes with adaptive routing. However, DESTRO obtains a slightly higher latency than its adaptive counterpart (SADP), because deterministic routing can only use one ascending link for a given packet, forcing the packet to wait until that link is available, whereas adaptive routing would select another one. On the other hand, near the saturation point, the network latency is smaller for DESTRO because it better classifies packet destinations, thus reducing the HoL blocking effect.
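The reorder-buffer mechanism used here to provide in-order delivery with adaptive routing can be sketched as follows (our minimal illustration; it ignores the end-to-end flow control and the ack messages):

```python
# Minimal sketch of the destination-side reorder buffer: packets carry a
# sequence number; out-of-order arrivals are held until every smaller
# sequence number has been delivered.

class ReorderBuffer:
    def __init__(self):
        self.expected = 0       # next sequence number to deliver
        self.pending = {}       # seq -> packet, out-of-order arrivals

    def receive(self, seq, packet):
        """Return the list of packets that can now be delivered in order."""
        self.pending[seq] = packet
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

buf = ReorderBuffer()
print(buf.receive(1, "B"))   # out of order: held -> []
print(buf.receive(0, "A"))   # releases both -> ['A', 'B']
```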
Concerning the comparison against the other deterministic routing algorithm, DESTRO obtains a lower latency and a higher throughput than DET-Lin due to its ability to reduce the HoL blocking effect. Moreover, in Fig. 11 we show the distribution of link utilization for each routing algorithm at the saturation point. As can be seen, DESTRO and SADP achieve a higher and more uniform link utilization than the other routing algorithms due to their ability to balance network traffic and reduce the HoL blocking effect, which, in the end, is the reason for their better performance over DET-Lin. Figure 10b
Fig. 11 Link utilization at saturation in a 4-ary 4-tree for DESTRO and adaptive routing. Uniform traffic
Fig. 12 Average message latency versus accepted traffic for DESTRO and adaptive routing with different selection functions. Complement traffic. 8-ary 3-tree
shows the performance evaluation results for an 8-ary 3-tree, but only for the small reorder buffer (20 packets per source node) since, as shown previously, the buffer size is not significant. As can be observed, the same conclusions hold for a larger network. Figure 12 shows the performance results for complement traffic in an 8-ary 3-tree. DESTRO is able to achieve six times more throughput than FF. Moreover, it also slightly outperforms the SADP selection function. For this traffic pattern, all packets reach the last stage; that is, all packets have a long network path, and reducing the HoL blocking effect becomes more important. Regarding the comparison against DET-Lin, both obtain the same performance for this traffic, as happened with adaptive routing, because each source only sends packets to a single, distinct destination. Figure 13 shows the performance results for a hot-spot traffic pattern in a 4-ary 4-tree. One random destination receives an extra 5 % of the traffic in the network. DESTRO strongly outperforms the DET-Lin deterministic routing algorithm, since DET-Lin spreads congestion all over the network. It also outperforms nearly all the adaptive routing algorithms, because the congestion of the hot-spot node is limited as much as possible. Only when in-order delivery is not guaranteed, or when a huge reorder buffer (600 packets per source node in each destination NIC) is available, does the SADP adaptive routing algorithm obtain results similar to DESTRO. The benefits obtained
Fig. 13 Average message latency versus accepted traffic for DESTRO and adaptive routing with different selection functions. 5 % hot-spot traffic. 4-ary 4-tree
with SADP and DESTRO are due to the fact that congestion of the hot spot is not spread, since our mechanism shares as few links as possible between destinations. In this way, the congestion due to the hot-spot node is contained, affecting packets destined to other nodes as little as possible. In the case of the other adaptive routing algorithms, the congestion of a destination impacts the rest of the destinations, since the preferred paths share more links with other destinations. Nevertheless, when using SADP with adaptive routing, if the preferred ascending link of a packet is not available, the packet is routed through another link, actually spreading the congestion. However, with DESTRO, when this ascending link has no buffer space left or is not available, the packet is stopped, so congestion is not spread. Other traffic patterns and network sizes have also been analyzed, and the overall results are qualitatively similar. Additionally, we have analyzed the worst traffic pattern that we can conceive for DESTRO. In this traffic pattern, messages from all the nodes attached to the same switch of the first stage are sent only to processing nodes whose identifier has the same last component. For instance, in a 2-ary 2-tree, if all destinations have the last component equal to 0, then the active destinations will be only nodes 0 and 2 out of the four destinations (nodes 0, 1, 2 and 3) of the network. For this traffic pattern, the effective bandwidth of DESTRO is half the total bandwidth (in the generic case, it would be 1/k), because only the links labeled as 2 will be used in the ascending subpath. This is the case of the bit-reversal traffic pattern. Figure 14 shows the performance of an 8-ary 3-tree with bit-reversal traffic. In an 8-ary 3-tree, the nodes attached to switch 0, nodes ⟨0, 0, 0⟩, ⟨0, 0, 1⟩, ⟨0, 0, 2⟩, . . . , ⟨0, 0, 7⟩, send all their messages to nodes ⟨0, 0, 0⟩, ⟨4, 0, 0⟩, ⟨2, 0, 0⟩, . . . , ⟨7, 0, 0⟩, respectively.
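This worst case can be checked with a short sketch (ours): under DESTRO's stage-component rule, the bit-reversal destinations of the eight nodes behind first-stage switch 0 all select the same up port at the first two stages, so only one of the k uplinks carries their traffic.

```python
# Worst-case check for an 8-ary 3-tree (k = 8, 9-bit node identifiers):
# the bit-reversal destinations of nodes 0..7 share their two last (most
# significant) components, so DESTRO picks one up port at stages 0 and 1.

def bit_reversal(src, n_bits=9):
    return int(format(src, f"0{n_bits}b")[::-1], 2)

def up_port(dest, stage, k=8):
    return k + (dest // k ** stage) % k    # DESTRO rule: stage-th component

dests = [bit_reversal(src) for src in range(8)]    # nodes <0,0,0>..<0,0,7>
stage0 = {up_port(d, 0) for d in dests}
stage1 = {up_port(d, 1) for d in dests}
print(stage0, stage1)   # one up port at both stages -> 1/k of the bandwidth
```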
As can be observed, all these destinations share their two last components, so their packets share the same link in the first two stages of their ascending paths. As expected, the performance of DESTRO is poorer than that of adaptive routing with SADP, since DESTRO uses only one of the k uplinks in two stages. DET-Lin also provides poor results with this traffic, because a similar effect arises, in this case in the downward subpath. Moreover, even Adap-Lin is not able to overcome this effect, also providing poor results, similar to those of the deterministic routing algorithms.
Fig. 14 Average message latency versus accepted traffic for DESTRO and adaptive routing with different selection functions. Bit-reversal traffic. 8-ary 3-tree
Fig. 15 Cumulative average message latency versus simulation time for I/O traces. Deterministic and adaptive routing algorithms. 2-ary 7-tree
I/O Traces. First, we compare DESTRO against DET-Lin and against the adaptive routing algorithms without guaranteeing in-order delivery (see Fig. 15). As stated before, the time required by the different routing algorithms to deliver all the packets included in the trace is almost the same for all of them, because the network is able to deal with the injected traffic regardless of the routing algorithm used. Nevertheless, as can be seen in Fig. 15, the packet latency is quite different depending on the routing algorithm. In particular, the deterministic routing algorithms outperform the adaptive ones with any selection function. Moreover, DESTRO is the one that provides the lowest latency. This is due to its ability to classify packets in the network and keep congestion bounded. When in-order delivery of packets is considered, the time required to deliver all the packets included in the trace is strongly affected by the routing algorithm used. Figure 16a shows the results obtained using a reorder buffer size of 20 packets. In this figure, only the first 10 % of the total number of packets in the trace was simulated, since simulating the in-order delivery mechanism was very time consuming. As can be observed, the differences between the deterministic routing algorithms and the adaptive ones are considerable. The deterministic routing algorithms (both DESTRO
123
C. Gómez et al. Average Message Latency (cycles)
2362 SADP FF 1e+06 DESTRO DET-Lin 100000
10000
1000 5000 10000 15000 20000 25000 30000 35000
Injected Messages
(a)
(b)
Fig. 16 I/O traces are used in a 2-ary-7-tree. In-order delivery is guaranteed using a 20-packet buffer size. a Total trace delivery time. b Cumulative average message latency versus injected packets
and Det-Lin) reduces execution time by a factor of 4 over the adaptive one using FF as selection function due to its good packet classification. If SADP selection function is used, execution time is reduced by a factor of 3. SADP does not bound congestion as well as DESTRO and, in this case, the overhead of in-order delivery of packets does not compensate the flexibility provided by adaptive routing. Moreover, Fig 16b shows the cumulative packet latency when in-order delivery is guaranteed, but now versus the number of injected packets since the simulation times are very different for each curve. In this case, the latencies are much different than the ones obtained in Fig. 15. Notice that the average latency is represented using a logarithmic scale and only the first 10 % of the trace is simulated. As it can be seen, the latencies for the deterministic routing algorithms remain much smaller than the ones obtained for adaptive routing with any of the selections functions. Notice that the differences between the two deterministic routing algorithms can be better seen in Fig. 15 where a linear scale is used.
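To make the overhead of in-order delivery concrete, the following is a minimal sketch of a sequence-number reorder buffer like the one sized at 20 packets in the experiments above. The class name, interface, and stall behavior are assumptions for illustration; only the 20-packet capacity comes from the text. Under adaptive routing, packets to the same destination may arrive out of order, occupy buffer slots, and eventually stall the receiver, which is where the measured slowdown comes from.

```python
class ReorderBuffer:
    """Per-source reorder buffer (illustrative sketch, not the paper's
    implementation): packets carry sequence numbers and are handed to the
    consumer strictly in order."""

    def __init__(self, capacity=20):
        self.capacity = capacity   # 20 packets, as in the experiments
        self.expected = 0          # next sequence number to deliver
        self.pending = {}          # out-of-order packets, keyed by seq

    def push(self, seq, packet):
        """Accept one arriving packet; return the packets that become
        deliverable in order. An out-of-order arrival into a full buffer
        stalls the receiver (modeled here as an exception)."""
        delivered = []
        if seq == self.expected:
            delivered.append(packet)
            self.expected += 1
            # Drain any buffered packets that are now in sequence.
            while self.expected in self.pending:
                delivered.append(self.pending.pop(self.expected))
                self.expected += 1
        else:
            if len(self.pending) >= self.capacity:
                raise BufferError("reorder buffer full: receiver stalls")
            self.pending[seq] = packet
        return delivered
```

With deterministic routing every packet arrives with `seq == expected`, the buffer stays empty, and delivery proceeds without stalls; with adaptive routing the `pending` dictionary fills and the stall case is reached, which matches the factor-of-3 to factor-of-4 slowdowns reported above.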
7 Conclusions
Adaptive routing in fat-trees is commonly implemented as an adaptive upward subpath, in which any of the k up output ports of each traversed switch can be selected, followed by a deterministic downward subpath. The downward subpath to be followed depends only on the upward subpath taken. In this paper, we have proposed a mechanism to select the output port among the k available ones at each switch in the upward subpath of a fat-tree. This mechanism balances traffic across the network and, more importantly, reduces network congestion by avoiding the HoL blocking effect. The mechanism works with both source and distributed routing, so it can be deployed in any of the commercial high-performance interconnection networks. Moreover, the output port provided at each switch can be used either as a preferred one, thereby serving as a selection function for adaptive routing, or as the only routing option, yielding a deterministic routing algorithm.
The evaluation results show that the proposed selection function (SADP) can remarkably improve packet latency, both for synthetic traffic patterns and for I/O traces. Moreover, the hardware implementation of SADP is no more complex than that of other selection functions. Concerning deterministic routing, our proposal (DESTRO) is the best of the deterministic routing algorithms evaluated, both for synthetic traffic and I/O traces. In particular, congestion due to hot destinations is contained as much as possible. Compared with adaptive routing on synthetic traffic patterns, it obtains performance similar to that of adaptive routing with SADP, with the exception of some particular traffic patterns where destinations share network components. When we analyze the performance of the considered routing algorithms using traffic generated from real applications (I/O traces), we observe that, when in-order delivery is enforced, DESTRO strongly outperforms any adaptive routing algorithm. In particular, DESTRO improves the time to deliver all the messages of the I/O traces by a factor of 3 over adaptive routing with SADP. Moreover, DESTRO allows a simpler switch implementation, since neither a selection function nor additional hardware resources to ensure in-order delivery are required.
Acknowledgments This work was supported by the Spanish Ministerio de Ciencia e Innovación (MICINN) and jointly financed with Plan E funds, under Grant TIN2009-14475-C04, as well as by Consolider-Ingenio 2010 under Grant CSD2006-00046.
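The key property of the mechanism is that the upward port choice is a pure function of the packet's destination, so all packets addressed to the same destination follow the same upward path and flows to different destinations are kept apart, limiting HoL blocking. As an illustration only, the sketch below uses a base-k-digit rule at each stage of a k-ary n-tree; the precise DESTRO port-selection rule is the one defined in the paper, and this digit rule is an assumed stand-in with the same destination-based property.

```python
def up_port(destination: int, stage: int, k: int) -> int:
    """Destination-based choice of the upward output port at a switch in
    stage `stage` of a k-ary n-tree (stage 0 = switches attached to the
    processing nodes).

    Illustrative rule (assumed, not the exact DESTRO formula): use the
    stage-th base-k digit of the destination identifier. Because the
    result depends only on the destination, every packet to a given
    destination takes the same upward path, and the downward path it
    implies is therefore also fixed.
    """
    return (destination // (k ** stage)) % k
```

For example, in a 2-ary tree, packets to destination 5 (binary 101) would take up port 1 at stage 0, up port 0 at stage 1, and up port 1 at stage 2, while packets to destination 4 diverge from them at the first stage, so the two flows never compete for the same up link.
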