Preventing Out-of-Sequence for Multicast Input-Queued Space-Memory-Memory Clos-Network

Yu, Hao; Ruepp, Sarah Renée; Berger, Michael Stübert; Dittmann, Lars

Published in:
Proceedings of OPNETWORK 2011

Publication date:
2011

Citation (APA):

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

- Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
- You may not further distribute the material or use it for any profit-making activity or commercial gain
- You may freely distribute the URL identifying the publication in the public portal

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
Preventing Out-of-Sequence for Multicast Input-Queued Space-Memory-Memory Clos-Network

Hao Yu, Sarah Ruepp, Michael S. Berger, and Lars Dittmann
Technical University of Denmark
2800 Kgs. Lyngby, Denmark
E-mail: {haoyu, srru, msbe, ladit}@fotonik.dtu.dk

Abstract

This paper proposes an out-of-sequence (OOS) preventative cell dispatching algorithm, the multicast flow-based round robin (MFRR), for multicast input-queued space-memory-memory (IQ-SMM) Clos-network architecture. Independently treating each incoming cell, such as the desynchronized static round robin (DSRR), can lead to cells of the same packet being disordered after traversing the switch fabric. Extra reassembly buffer and delay are thus required at the outputs of the switch fabric to reorder the cells. Simulation results obtained by OPNET Modeler show that the MFRR can eliminate such OOS problem by monitoring the changes of received fan-out vectors.

1. Introduction

Multicast enables networks to handle traffic in a resource-efficient manner by introducing some complexity to the network nodes. By the capability of copying and sending the same packet to different destination ports, network nodes can reduce the traffic load. This benefit is especially significant regarding some high-bandwidth applications, such as Internet Protocol Television (IPTV), video broadcasting, and videoconferencing.

A number of publications attempt to build high-speed and large-scale switches [1]-[11]. In terms of the types of switch fabric, they are usually categorized into single-stage switches and multi-stage switches. A crossbar switch, as shown in Fig. 1, is a single-stage switch fabric that has two attractive properties: simple architecture and internally non-blocking. However, such a fabric architecture has a limited scalability because the number of cross points increases in a square proportion \(N^2\) to the switch size, i.e. the number of input/output ports.

As the port speed increases to today’s 100 Gbit/s, the scalability of switches becomes an important issue and factor to the switch architecture design. The low scalability of crossbar switch fabrics makes them unattractive to large-scale switches. On the other hand, using multi-stage switch fabrics to build high-capacity switches has become very common [12]. Clos-network, as shown in Fig. 2, is a multi-stage switch fabric that is considered more scalable and cost-efficient compared to the crossbar switch fabric. A Clos-network consists of three stages, i.e. the input stage, the central stage, and the output stage. Crossbar switching elements (X-SEs) are placed at each stage to construct the Clos-network, and are referred to as the input module (IM), the central module (CM), and the output module (OM) based on their location.

Various scheduling algorithms have been developed for different Clos-network architectures regarding buffer allocations. The architecture where memories are only allocated in the IMs/OMs is referred to as the memory-space-memory (MSM) architecture. CRRD/CMSD [7] and MWMD [8] have been proposed for such an architecture using round-robin arbitration. One disadvantage of these schemes is that the IMs and OMs use shared memories, resulting in a requirement of a speedup to the buffer, which hinders the scalability of the switch. To solve this problem, a bufferless space-space-space \((S^3)\) Clos-network architecture is proposed in [6]. Based on the static round-robin dispatching (SRRD), Distro [6] is proposed for the \(S^3\) architecture and is demonstrated to achieve 100% throughput under uniform traffic. Due to that no buffers are allocated at the CMs, MSM and \(S^3\) architectures require complex algorithms to solve the cell contention problems to avoid blocking and are thus poorly scalable. The fully buffered memory-memory-memory (MMM) architecture, despite having no cell contention problem, is not cost-effective in hardware implementation. Then in [9], a space-memory-memory (SMM) Clos-network architecture with desynchronized static round-robin (DSRR) dispatching scheme is proposed. The study demonstrates that SMM can achieve 100% throughput with DSRR under admissible traffic. However, the SMM with DSRR in [9] uses output-queued SMM (OQ-SMM) scheme for the buffered stages, where speedup is again required. In addition, since buffers are placed in the central stages, cells can experience different delays and thus resequencing is required to solve the out-of-sequence (OOS) problem.
problem at the output. When concerning multicast, DSRR, initially designed for unicast and not evaluated for multicast, can result in a serious OOS problem.

In order to reduce the requirement of the speedup of the OQ-SMM and solve the OOS problem caused by DSRR, this paper proposes an input-queued SMM (IQ-SMM) architecture and two cell dispatching algorithms: multicast flow-based DSRR (MF-DSRR) and multicast flow-based round-robin (MFRR). Input-queued crossbar switching elements (IQ-X-SEs) are leveraged as the CMs and OMIs in this architecture. The MF-DSRR or MFRR is independently run in the IMs, distributing incoming cells to the CMs and OMIs in this architecture. The MF-DSRR or MFRR are presented. In architecture of the system and some notations used throughout this paper are listed as follows:

\[ IM_i \quad \text{ith input module, where } 0 \leq i \leq r - 1 \]
\[ CM_k \quad \text{kth central module, where } 0 \leq k \leq m - 1 \]
\[ OM_j \quad \text{jth output module, where } 0 \leq j \leq r - 1 \]
\[ I_{lp} \quad \text{pth input of } IM_i, \text{ where } 0 \leq p \leq n - 1 \]
\[ O_{jd} \quad \text{qth output of } OM_j, \text{ where } 0 \leq q \leq n - 1 \]
\[ Q_{c_1,k} \quad \text{ith input queue of } CM_k \text{ connected to } IM_i \]
\[ Q_{c_2,k,y} \quad \text{kth input queue of } OM_j \text{ connected to } CM_k \]
\[ IC_{k,y} \quad \text{the interstage link connecting } IM_i \text{ and } CM_k \]
\[ IL_{k,j} \quad \text{the interstage link connecting } CM_k \text{ and } OM_j \]

It is assumed that each packet carries a fan-out vector \( b = (b_j), b_j \in \{0,1\}, 0 \leq j \leq N - 1 \), where \( b_j = 1 \) indicates the packet is destined to the \( j \)-th output port, otherwise \( b_j = 0 \). Cells generated from the same packet have the same fan-out vectors. Cells are replicated as far downstream as possible in the switch model. A bit-cluster is defined as a set of \( n \) consecutive bits in the fan-out vector, and is denoted as \( c_{d} = (b_j), nd \leq j \leq nd + n - 1 \). The intersection of different bit-clusters is empty, \( c_{d_1} \cap c_{d_2} = \emptyset, \text{ if } d_1 \neq d_2 \). Therefore the fan-out vector can be expressed by bit-clusters, \( b = (c_{d}), 0 \leq d \leq r - 1 \). We define \( |c_{d}| = \min (1, \sum b_j) \). \( CM_k \) multicasts the incoming cell by examining \( r \) bit-clusters of the fan-out vector. If \( c_d \neq 0 \), \( CM_k \) sends a copy of the cell to \( OM_d \). \( OM_d \) examines all the bits in \( c_{d-1} \), and if \( b_j = 1 \), \( OM_d \) sends a copy of the cell to \( O_{d,j-nd} \).

2. Input-Queued Space-Memory-Memory Clos-Network

We assume that a first-in-first-out (FIFO) queue is installed at each input port to temporarily store multicast packets. Packets are assumed to be segmented into fixed-size cells in the input port processors (IPPs) before entering into IMs and reassembled at the output port processors (OPPs) after traversing the switch fabric. The IQ-SMM Clos-network switch model consists of three stages and is denoted as \( C(n,m,r) \). As shown in Fig. 3, the switch has \( r \) IMs of size \( n \times m \), \( r \) OMIs of size \( m \times n \), \( m \) CMs of size \( r \times r \). Each IM/OM has \( n \) connections to IPPs/OPPs, and \( m \) interstage connections to CMs. Only one interstage connection exists between an IM/OM and a CM. The number of input/output ports of the switch is \( N = nr \).

IMs forward incoming cells to the CMs following certain cell dispatching (CD). Each CM/OM has \( r/m \) FIFO input queues, each of which is connected to one interstage link. Several notations used throughout this paper are listed as follows:

\[ IM_i \quad \text{ith input module, where } 0 \leq i \leq r - 1 \]
\[ CM_k \quad \text{kth central module, where } 0 \leq k \leq m - 1 \]
\[ OM_j \quad \text{jth output module, where } 0 \leq j \leq r - 1 \]
\[ I_{lp} \quad \text{pth input of } IM_i, \text{ where } 0 \leq p \leq n - 1 \]
\[ O_{jd} \quad \text{qth output of } OM_j, \text{ where } 0 \leq q \leq n - 1 \]
\[ Q_{c_1,k} \quad \text{ith input queue of } CM_k \text{ connected to } IM_i \]
\[ Q_{c_2,k,y} \quad \text{kth input queue of } OM_j \text{ connected to } CM_k \]
\[ IC_{k,y} \quad \text{the interstage link connecting } IM_i \text{ and } CM_k \]
\[ IL_{k,j} \quad \text{the interstage link connecting } CM_k \text{ and } OM_j \]

Figure 3: Input-queued space-memory-memory Clos-network
Upon detecting a change of fan-out vector on case will be at least $m > n$. Usually for a strictly non-blocking Clos-network, $m \geq 2n - 1$. Therefore, the length of the AvailableList in this case will be at least $n - 1$.

Upon detecting a change of fan-out vector on IM, MFRR also requires that each input monitors the change of received fan-out vectors. Even though MF-DSRR has a low connection setup complexity, it can cause cells carrying the same fan-out vector to be distributed to different CMs because when a fan-out change occurs on one input, the whole connection of the IM is changed. During a transmission of a flow of cells of the same fan-out vector, it is possible that the connection pattern is changed due to a fan-out change detected on other input. This will cause transmitting these cells to different CMs. In MFRR, each input maintains its own connection and changes to an idle interstage link when the input detects a fan-out change on it.

Since each input sets up the connection independently, the problem of which idle interstage link should be chosen to set up the new connection needs to be solved. A list called AvailableList is created for each IM to record the idle interstage links as its elements. Elements can be only popped from the top of the list and inserted to the bottom. When an element is popped, all the elements in the AvailableList are moved one position up to the top. Since there are $n$ inputs and $m$ interstage links in IM, the number of elements of the AvailableList is $(m - n)$. Therefore in order to use MFRR, the switch must have $m > n$. Usually for a strictly non-blocking Clos-network, $m \geq 2n - 1$. Therefore, the length of the AvailableList in this case will be at least $n - 1$.

Upon detecting a change of fan-out vector on IM, MFRR also requires that each input monitors the change of received fan-out vectors. Even though MF-DSRR has a low connection setup complexity, it can cause cells carrying the same fan-out vector to be distributed to different CMs because when a fan-out change occurs on one input, the whole connection of the IM is changed. During a transmission of a flow of cells of the same fan-out vector, it is possible that the connection pattern is changed due to a fan-out change detected on other input. This will cause transmitting these cells to different CMs. In MFRR, each input maintains its own connection and changes to an idle interstage link when the input detects a fan-out change on it.

Since each input sets up the connection independently, the problem of which idle interstage link should be chosen to set up the new connection needs to be solved. A list called AvailableList is created for each IM to record the idle interstage links as its elements. Elements can be only popped from the top of the list and inserted to the bottom. When an element is popped, all the elements in the AvailableList are moved one position up to the top. Since there are $n$ inputs and $m$ interstage links in IM, the number of elements of the AvailableList is $(m - n)$. Therefore in order to use MFRR, the switch must have $m > n$. Usually for a strictly non-blocking Clos-network, $m \geq 2n - 1$. Therefore, the length of the AvailableList in this case will be at least $n - 1$.

Upon detecting a change of fan-out vector on IM, MFRR also requires that each input monitors the change of received fan-out vectors. Even though MF-DSRR has a low connection setup complexity, it can cause cells carrying the same fan-out vector to be distributed to different CMs because when a fan-out change occurs on one input, the whole connection of the IM is changed. During a transmission of a flow of cells of the same fan-out vector, it is possible that the connection pattern is changed due to a fan-out change detected on other input. This will cause transmitting these cells to different CMs. In MFRR, each input maintains its own connection and changes to an idle interstage link when the input detects a fan-out change on it.
than 0, indicating that there are packets waiting to be segmented, the process executes the function PacketToSegQ. Else, it transits “Idle”.

4.2 Output Packet Processor (OPP) Process Model

The process model of the Output Packet Processor (OPP) is shown in Fig. 9. Parameters used in the process are initialized in the “Init” state. After the initialization phase, the module goes to the “Idle” state and waits for further events.

In “Idle” state, CellArrival triggers the process to execute InsertCelltoRsm function, which inserts the incoming cell to the reassembly buffer. When a packet is reassembled, event RsmReady transits the state from “Idle” to “Sending” and executes the function RsmSendPkt.

In “Sending” state, when the packet transmission timer expires, the event TXCompleted triggers the state transition to either “Idle” or “Sending” based on whether there is any packet ready in the reassembly buffer. If there is a packet ready for transmission, RsmSendPkt is called and the process stays in “Sending”. Else, the process transits from “Sending” to “Idle”. During the stay in “Sending”, CellArrival triggers the function InsertCelltoRsm and transition back to “Sending” because the process is busy with transmitting a packet and any incoming cells should be stored in the reassembly buffer.

4.3 Input Module (IM) Process Model

The process model of the Input Module (IM) is shown in Fig. 10. Parameters used in the process are initialized in the “Init” state. After the initialization phase, the module goes to the “Idle” state and waits for further events.

In “Idle”, CellArrival executes the function ForwardCell and transits back to “Idle” state. The function ForwardCell reads parameters from the node and distributes the incoming cells based on the selection of the cell dispatching schemes.

4.4 Central Module/Output Module (CM/OM) Process Model

The process model of the Central Module/Output Module (CM/OM) is shown in Fig. 11. Parameters used in the process are initialized in the “Init” state. Afterwards, the module goes to the “Idle” state and waits for further events. A periodic timer is maintained by the process to model the cell time.

In “Idle” state, CellArrival event executes the function InsertCell that inserts the incoming cells to one of the FIFO queues installed at the inputs of the CM. As the timer that emulates the cell time expires, event NextTimeSlot happens and makes a transition from “Idle” to “Request”, beginning the multicast scheduling process described in [14].

Basically, the CM and the OM process run the same scheduling algorithm. The only difference is that they observe different parts of the fan-out vector. As described previously, the CM only observes the bit-clusters and the fan-out observed will be different from the actual one. For instance, if a cell has a fan-out vector <11000001> and \(n = 2\) (two outputs on the OM), then the fan-out observed by the CM will become <1001>. The OM, on the contrary, only observes the bit-cluster that has the same index as it. For instance, \(OM_0\) only checks the <11> part because it is these two bits that matter to its outputs.

5. Simulation Results and Analysis

The simulation is carried out in OPNET Modeler [14]. DSRR is used as a reference to the MF-DSRR and MFRR.

OOS problem is categorized into two types: inter-packet OOS and in-packet OOS. Inter-packet OOS means that cells generated by different multicast packets are disordered, and in-packet OOS implies that cells generated by the same multicast packet are disordered. The reason of such a subdivision is to differentiate the characteristics of OOS and further analyze and compare different cell dispatching schemes so that appropriate reassembly methods can be applied if necessary.

Admissible traffic with average fan-out of 8 and average number of cells per packet of 13 is generated independently at each input of the simulated \(C(4,7,4)\) IQ-SMM Clos-network switch. The multicast scheduling algorithms with sync to reduce multiple transmissions of cells used in CMs and OMs are described in [13]. A multicast cell may be served several times before it is removed from the queue. Unnecessary multiple transmissions can cause increased cell delays in a multicast switch. The sync mechanism aims at reducing the number of transmissions per
multicast cells while maintaining the output port utilization. In order to reduce the HOL blocking problem, both CMs and OMs can look ahead into their FIFO queues for cells that can be served to idle outputs. Since it is impractical to search an infinite depth into the queues, a maximum look-ahead depth, \( A \), is defined. If the \( LA \) is reached and the IQ-SE still has idle outputs, it stops the searching and completes the scheduling process.

Fig. 12 compares the inter-packet OOS for different cell dispatching schemes. With the sync mechanism, the inter-packet OOS is reduced for both MFRR and MF-DSRR. Under high offered load, MFRR and MF-DSRR can result in more than 50% of the received cells being inter-packet OOS cells.

Fig. 13 compares the in-packet OOS for different cell dispatching schemes. MFRR schemes outperform the others with zero in-packet OOS cells. Both MF-DSRR schemes have less than 10% of total received cells being OOS cells under high offered load. DSRR schemes result in serious in-packet OOS problems. With the sync mechanism, DSRR causes more in-packet OOS cells because same-packet cells are treated independently and distributed to different CMs, resulting in a well distributed fan-out vectors in the input queues of CMs and OMs. Thus the same-packet cells have a higher probability to be disordered due to the round-robin working mechanism of the sync. A decrease of the in-packet OOS for DSRR and MF-DSRR schemes can be observed under the loads higher than 0.7. This is due to the fact that the CM queue length begins to increase non-linearly, resulting in more inter-packet OOS cells. Since the inter-packet and the in-packet OOS are considered separately, the percentage of in-packet OOS therefore decreases.

Fig. 14 shows the total number of OOS cells, the sum of inter-packet and in-packet OOS cells, for different schemes. MFRR with sync significantly reduces the OOS problem under high offered loads, while DSRR schemes cause nearly linear growths of the OOS cells with the increase of the offered load.

Fig. 15, Fig. 16 and Fig. 17 compares the inter-packet OOS in-packet OOS and total OOS cells, respectively, under different \( LA \) values for the sync mechanism. With larger \( LA \) values, a decrease on inter-packet OOS for each cell dispatching scheme is observed in Fig. 15. DSRR with \( LA = 2 \) outperforms the other two schemes. In Fig. 16, the look-ahead mechanism greatly reduces the in-packet OOS for the DSRR scheme but DSRR with \( LA = 2 \) still suffers from approximately 50% of all the received cells being in-packet OOS under high load. MFRR always maintains zero in-packet OOS under different \( LA \) values. In terms of the total OOS cells, MFRR with \( LA = 2 \) outperforms the others, as shown in Fig. 17. This is because with the capability of look-ahead, the IQ-X-SEs are able to send some of the delayed cells that cause the OOS problem.

**Conclusion**

This paper presented two out-of-sequence preventative cell dispatching schemes for the input-queued space-memory-memory (IQ-SMM) Clos-network architecture. MF-DSRR utilizes the connection pattern of DSRR and obtains a low implementation complexity. MF-DSRR can reduce the OOS problem but still suffers from the in-packet OOS due to its simplicity. On the other hand, MFRR is able to eliminate the in-packet OOS problem. With the use of a list to record the idle outputs, the MFRR can achieve a low complexity for connection setup.
Simulation results show that the MFRR cell dispatching outperforms DSRR and MF-DSRR in terms of reducing the OOS problem. The sync mechanism is able to improve the performance of MFRR and MF-DSRR but worsens the performance of DSRR. With the look-ahead mechanisms, the IQ-SMM architecture can further reduce the OOS problem.

Acknowledgement
This work was supported in part by the Danish National Advanced Technology Foundation in the project The Road to 100 Gigabit Ethernet.

Reference