Cluster Computing https://doi.org/10.1007/s10586-017-1683-9
Nxt-Max: for supporting VDC-based maximum redundant bandwidth allocation in cloud datacenter Wang Shuo1,2 · Li Jing1 · Wang Qiqi1 · Zhang Hongjie1 Received: 21 October 2017 / Revised: 27 December 2017 / Accepted: 29 December 2017 © Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract In cloud datacenters, work-conserving bandwidth offering benefits the network sharing of multiple virtual datacenters (VDCs). The lack of concrete network resource guarantees prevents tenants from predicting lower bounds on the performance of their applications. Prior works concentrate on efficient and scalable bandwidth allocation algorithms for VDCs; however, they ignore per-VDC redundant bandwidth, which is crucial to work-conserving bandwidth offering. In this paper, Nxt-Max, a VDC-based redundant bandwidth allocation framework, is designed. It ensures per-VDC maximum redundant bandwidth allocation by integrating online allocation with offline allocation. To make online bandwidth allocation efficient, a heuristic online parallel bandwidth allocation algorithm is proposed that divides the whole allocation space into independent pods. Coupled with a smart offline adjustment, a released network graph is constructed and the VDCs with the least redundant bandwidth are greedily migrated onto it, advancing the overall underlying network utilization. Simulations demonstrate that the VDC allocation achieves high network utilization, low time complexity, and per-VDC maximum bandwidth allocation. Keywords Cloud datacenter · Cloud networks · Per-VDC redundant bandwidth · Bandwidth allocation
1 Introduction The healthy maturation of cloud computing has led an increasing number of enterprises to move their applications into cloud datacenters. These enterprises, each running its own virtual datacenter (VDC), form the multi-tenant environment. Central to cloud computing is its ability to share and multiplex resources across these VDCs. Cloud networks, however, are shared in a best-effort manner, making it hard for both tenants and cloud providers to reason about how network resources are allocated. Trends in network virtualization and economies of scale are pointing to cloud datacenters with millions of virtual end points; improper virtual machine (VM) scheduling may make diverse traffic from VDCs converge on the same physical link, affecting the application performance of VDCs with fixed bandwidth requirements and further resulting in low utilization of the entire underlying network. Recent
Wang Shuo (corresponding author)
[email protected]
1 Department of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China
2 Department of Computer Science and Technology, Bengbu University, Bengbu 233000, Anhui, China
cloud datacenter network designs rely on path multiplicity to achieve horizontal scaling of hosts [1–5]. While providing performance isolation among different VDCs, some prior works [6–10] use hypervisor-based congestion control mechanisms to avoid the scalability problems of centralized control; the hypervisors operate independently, so there is no need for complex coordination between them or with a central controller. However, these works concentrate on proposing work-conserving bandwidth guarantee solutions while ignoring redundant bandwidth allocation, which in turn benefits the increase of the potential underlying network utilization. For example, as Fig. 1a describes, Path-1 and Path-2 are the multiple paths from source to destination. If the flows from VDC-A and VDC-B pass along the same path, then under a fair work-conserving bandwidth guarantee VDC-A and VDC-B each get 10 MB/s. This is not the ideal circumstance: in Fig. 1b the flows from VDC-A and VDC-B pass along different paths, so VDC-A and VDC-B can each get 20 MB/s in the work-conserving condition, twice that in Fig. 1a. Many cloud services request higher bandwidth due to operations such as all-to-all communications in MapReduce and Spark. Therefore, network bandwidth is often a
Fig. 1 a Scheduling to the same path b Scheduling to different paths. The bandwidth allocation under different scheduling circumstances
scarce resource. Maximum redundant bandwidth allocation may accelerate the execution of VDC tasks, increasing the utilization of the underlying physical network resources and consequently the provider's revenue. However, providing VDC-based maximum redundant bandwidth allocation brings some challenges. The main contributions are threefold: firstly, to be practical in cloud datacenters, a packet encapsulation method for enabling single-virtual-link to multipath mapping is proposed. Secondly, an online VN-to-PN mapping algorithm for providing maximum redundant bandwidth allocation is proposed. Thirdly, for maximizing per-VDC redundant bandwidth allocation, an offline VN migration algorithm to re-optimize the mapping of existing VNs to the PN is proposed. The rest of the paper is organized as follows. The VDC network model is described in Sect. 2, and the method for supporting flexible splitting of virtual links over multiple physical paths is overviewed in Sect. 3. The VDC allocation algorithm is presented in Sect. 4. Simulations are used to study VDC allocation and the experiment results are shown in Sect. 5. Section 6 presents related work and discussion. Section 7 concludes.
2 Cost model Enterprises move their applications from a traditional running environment consisting of a fixed quantity of servers and network equipment to the cloud datacenter to reduce the expenditure on hardware purchase and software upgrades. Tenants request VMs with varying amounts of CPU, memory and storage resources, as well as internal network bandwidth. The fact that tenants do not expose their network requirements hurts both tenants and providers. Without relinquishing performance isolation, this motivates the need to explicitly account for the network through bandwidth guarantees for VDCs. In addition, tenants expect to be able to customize their network in VDCs. Hence, a new VN model for tenants is designed to describe their network requirements. The VN structure is introduced in detail in Sects. 2.1 and 2.2. The VNs for tenant applications are given
to the cloud, which in turn deploys VMs onto servers and logical switches onto physical ones. This allows tenants to provide their applications with a cloud networking environment that is very similar to their enterprise environment, thereby increasing the possibility of preserving similar performance. In this paper, the process of online VN-to-PN mapping towards per-VDC maximum redundant bandwidth allocation based on this model is highlighted.
2.1 The VN model Generally, cloud providers desire a standardized template for tenants to describe their diverse VN requirements. Many recent efforts propose different VN models for bandwidth allocation, including the traffic matrix [8], the hose model with invariable bandwidth [9], the hose model with time-varying bandwidth [11], and the tenant application graph (TAG) model [12] based on the application communication structure. Relying on these models, they design specific VN-to-PN mapping algorithms. In this paper, the switch-centered graph (SCG) is proposed, a new model that tenants can use to describe fine-grained network requirements for VDCs, including the network structure, uplink bandwidth, and downlink bandwidth. Unlike the hose abstraction, which presents models resembling typical physical networks, and unlike the TAG abstraction, which models the actual communication patterns of applications, the SCG abstraction aims to help tenants mimic the prior network environments where their applications run. The SCG model leverages the tenant's knowledge of network structure and bandwidth demand to yield a concise yet flexible representation of actual networks, covering all of the common network topologies in practice. Given the above, the SCG model is generated as a graph where each vertex represents an entity indicating a switch or a VM, to mimic practical application network structures in production. Fig. 2 shows a simple SCG-modeled VN with an oversubscribed multi-tier network architecture. In this VN, two bidirectional edges between the core layer and the aggregation layer are labeled < B1, B2 > and < B3, B4 >,
respectively. So are the edges between the aggregation layer and the access layer. By controlling the uplink traffic and downlink traffic, tenants are able to stipulate the appropriate send rate and receive rate for each VM in the VDC, reducing the cost of excessive bandwidth.

Fig. 2 An SCG modeled VN

2.2 Model formulation The datacenter physical network is modeled as a graph PG = (S, M, E), where S is the set of all switches, M is the set of physical servers, and E is the set of all links in the network. A path from node u to node v is an ordered sequence e_1, e_2, ..., e_l of distinct links where e_1 and e_l have their respective end nodes at u and v, and each pair of adjacent links {e_{j-1}, e_j} has a common distinct end node w with w \notin {u, v}. A path between the nodes u and v is denoted as p_{u,v}, and the set of all paths between them as P_{u,v}. Each link e(u, v) ∈ E has a capacity C(e), which represents the maximum bandwidth it can support. Now the VN model is established: let VN = (VS, VM, VE) be a set of elements consisting of logical switches, logical servers, and logical links. The VN should be mapped to PG for sharing the datacenter physical network; thus VS, VM and VE are mapped to S, M and E, respectively. The notations used are introduced in Table 1. Virtually, a switch in the SCG model should be in keeping with its counterpart in reality. Since cloud datacenters adopt commodity switches as the underlying interconnecting elements of the network infrastructure, the native resources of a switch in the SCG model should not exceed those of a commodity switch in reality, such as the port number and port bandwidth.

Table 1 Notations
VN    All the VNs in cloud datacenter
PES   The physical edge switch in PN
PAS   The physical aggregation switch in PN
PCS   The physical core switch in PN
VES   The logical edge switch in VN
VAS   The logical aggregation switch in VN
VCS   The logical core switch in VN

2.3 Maximum redundant bandwidth allocation methodology For any VN, it should first be delivered and mapped to the underlying physical network. It is argued that the redundant link bandwidth a VDC can acquire depends on the number of tenants using the link and the total excess bandwidth. For ease of expression, the influence of the flow direction is eliminated. Define

map(ve, e) = 1 if ve is mapped onto e, and 0 otherwise    (1)
which indicates whether the logical link ve is mapped onto the physical link e or not. Virtually, one logical link might be mapped to a proper physical path containing a few links. For ease of expression, assume that one logical link can only be mapped to one physical path:

map(ve) = p_{u,v}    (2)

The number of tenants sharing the same link is then

n(e) = \sum_{VN} \sum_{ve \in VE} map(ve, e) \ge 1    (3)

The total redundant bandwidth that can be allocated on the link is

R(e) = C(e) - \sum_{VN} \sum_{ve \in VE} C(ve) \, map(ve, e)    (4)

Thus, the per-VDC fair redundant bandwidth allocation on link e is

F_{average}(e) = R(e) / n(e)    (5)

In fact, a logical link is able to be mapped to many physical paths, which would lead to an unreasonable redundant bandwidth allocation among VDCs. Thus, a balance factor \alpha is introduced, and the actual redundant bandwidth allocated to a logical link on a physical link is

F_{ve}(e) = \alpha_{ve} \cdot F_{average}(e)    (6)

To provide fair redundant bandwidth allocation, obviously, the more physical links a logical link is split over, the less redundant bandwidth the logical link can get on each physical link. Let k_{ve} denote the number of disjoint paths the logical link is split into; \alpha is determined by all the values of k on the physical link:

\alpha_{ve} = \frac{1/k_{ve}}{\sum_{VN} \sum_{ve \in VE} (1/k_{ve}) \, map(ve, e)}    (7)

Let P_{split} denote the multiple paths the logical link is split into; the redundant bandwidth of the logical link is the sum of that of each path:

F_{ve} = \sum_{e \in P_{split}} F_{ve}(e)    (8)

For a path p_{u,v} = {e_1, e_2, ..., e_l}, the most blocking link determines the bandwidth that should be allocated. Hence,

F(e) = F_{ve}(e)    (9)

allocatedBandwidth(p_{u,v}) = \min_{e \in p_{u,v}} F(e)    (10)

Finally, the cost model is devoted to maximizing the total redundant bandwidth allocated to the VN:

maximize \sum_{VN} \sum_{ve \in VE} allocatedBandwidth(map(ve))    (11)
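To make the cost model concrete, the following Python sketch (an illustration written for this presentation, not code from Nxt-Max; the capacities, demands and mappings are hypothetical) evaluates Eqs. (3)-(10) for a small example: it computes n(e), R(e), F_average(e), the balance factor, and the redundant bandwidth each logical link obtains over its mapped paths.

    from collections import defaultdict

    C = {"e1": 1000, "e2": 1000, "e3": 1000}   # C(e) in Mbps (hypothetical)

    # map(ve): each logical link is mapped to one or more physical paths
    # (lists of links); k is the number of disjoint paths it is split across.
    mapping = {
        ("VDC-A", "ve1"): {"paths": [["e1", "e2"]],  "demand": 200, "k": 1},
        ("VDC-B", "ve2"): {"paths": [["e1"], ["e3"]], "demand": 300, "k": 2},
    }

    def redundant_allocation(C, mapping):
        n = defaultdict(int)       # n(e), Eq. (3)
        used = defaultdict(float)  # sum of C(ve) map(ve, e), Eq. (4)
        inv_k = defaultdict(float) # denominator of the balance factor, Eq. (7)
        for info in mapping.values():
            for e in {e for path in info["paths"] for e in path}:
                n[e] += 1
                used[e] += info["demand"]
                inv_k[e] += 1.0 / info["k"]
        F = {}
        for ve, info in mapping.items():
            total = 0.0
            for path in info["paths"]:
                per_link = []
                for e in path:
                    f_avg = (C[e] - used[e]) / n[e]        # Eqs. (4)-(5)
                    alpha = (1.0 / info["k"]) / inv_k[e]   # Eq. (7)
                    per_link.append(alpha * f_avg)         # F_ve(e), Eq. (6)
                total += min(per_link)   # bottleneck link of the path, Eqs. (9)-(10)
            F[ve] = total                # Eq. (8)
        return F

    print(redundant_allocation(C, mapping))

The objective (11) is then the sum of the returned values over all logical links of all VNs.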
3 System architecture Economies of scale are pointing to future cloud datacenters with millions of VMs. These VMs can be booted, shut down, and migrated arbitrarily. Because of these dynamics, many recent works concentrate on the design of large-scale layer 2 networks, in which the key challenge is how to keep commodity switches stateless. In this work, after a VN is mapped to the PN, the packet forwarding paths among VMs are fixed. The traditional way of packet forwarding mainly relies on the routing protocol used in the cloud network, such as OSPF (OSPF Version 2, RFC 2328); however, it may conflict with the mapped results. Installing static routing rules in the switches can guarantee correct forwarding paths, but common switches have limited rule capacity. In this paper, MultiTCP technology (the Multipath TCP Linux kernel implementation) is adopted to increase the total underlying network utilization, which allows one flow to be split into several subflows across the mapped multiple paths. Meanwhile, to keep intermediate switches stateless for scalability, source routing based forwarding is introduced to instruct subflows to travel along the mapped paths.
Fig. 3 System architecture
3.1 The Nxt-Max cloud system The goal of Nxt-Max is to provide maximum redundant bandwidth guarantees for VDCs. It mainly comprises two parts: one dynamically allocates network resources to each VDC online; the other rearranges the allocated resources among VDCs offline through migration, since the arrival and departure of VNs are unpredictable and can drive the system into a non-optimal situation. Combining online allocation with offline rearrangement can achieve per-VDC maximum redundant bandwidth guarantees. Therefore, as shown in Fig. 3, the two modules designed to enable this goal are the central allocation module (CA) and the distributed encapsulation and forwarding module (DEAF), respectively.
3.2 Central allocation The main work of the CA module is to allocate underlying network resources according to each online tenant request. Four components cooperate to maximize the redundant bandwidth for each VN request: the online allocation component, the network utilization component, the allocation strategy component and the offline adjustment component. When a tenant submits a VN request, it is first handled by the online allocation component to analyze the concrete demand of the request. Subsequently, the network utilization component is queried to deliver the global redundant network resource utilization. Meanwhile, according to the active strategy that the allocation strategy component employs, the online allocation component generates the allocation results. Multiple strategies can be supported and exerted in the system; in this paper, the per-VDC maximum redundant bandwidth allocation strategy is employed. Besides, the offline adjustment
Fig. 4 Offline adjustment
component periodically rearranges the allocated resources to maintain per-VDC maximum redundant bandwidth allocation, as dynamic arrivals or departures might cause inefficient potential network utilization from the global network usage view. In Fig. 4, three VNs are mapped to the PN. Assume each link of the PN has a capacity of 1000 Mbps and each VN has a link bandwidth requirement of 500 Mbps; it can be seen that VN1 can get a maximum bandwidth of 1000 Mbps while VN2 and VN3 each can get only 500 Mbps due to limited PN resources. If VN1 is evacuated and no other VNs are coming, it is argued that VN2 and VN3 can get more bandwidth than their original allocations, so the offline adjustment component detects the optimal schedule plan and starts migration. Finally, VN2 and VN3 get 1000 Mbps each.
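A minimal sketch of this offline decision for the scenario of Fig. 4 (illustrative only; the estimate_after_move helper and all numbers are hypothetical, and the real component works on the full released network graph rather than a single capacity value):

    def plan_migrations(vdcs, released_capacity, estimate_after_move):
        # vdcs: {name: current redundant bandwidth (Mbps)}
        # released_capacity: spare Mbps freed on the released network graph
        # estimate_after_move: callable giving a VDC's redundant bandwidth
        #                      if it were remapped onto the released links
        plan = []
        # consider the least-provisioned VDCs first, as in Sect. 4.7
        for name, current in sorted(vdcs.items(), key=lambda kv: kv[1]):
            gained = estimate_after_move(name, released_capacity)
            if gained > current:                 # migrate only if it helps
                plan.append((name, current, gained))
                released_capacity -= gained      # crude accounting of the move
        return plan

    # Fig. 4: VN1 (1000 Mbps) departs; VN2 and VN3 each currently hold 500 Mbps
    print(plan_migrations({"VN2": 500, "VN3": 500}, 2000,
                          lambda name, cap: min(1000, cap)))

For the Fig. 4 example, both VN2 and VN3 are migrated and reach 1000 Mbps, matching the outcome described above.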
3.3 Layer-2-supported encapsulation To guarantee per-VDC maximum redundant bandwidth allocation, it is inevitable to pin the routing path onto the allocated way. Restricted scalability Traditionally, bandwidth reservation is used to set up the bandwidth reservation state in the switches along the routing path. However, this approach
incurs a severe scalability problem in switch state maintenance. VL2 (Greenberg et al. [4]), a large-scale layer 2 network implementation, is used as an example to illustrate the problem. In VL2, a top-of-rack (ToR) switch connects 20 servers, and an Aggregation switch connects 72 ToR switches. Suppose each server hosts 32 VMs and each VM talks to 1000 other VMs. Then the bandwidth reservation state in an Aggregation switch will be 46 million (32 × 1000 × 20 × 72) entries, and the entries in a ToR switch number 640 k (32 × 1000 × 20). However, state-of-the-art, high-end switches, e.g., the Cisco Nexus 7000 (Cisco Nexus 7000 Series 32-Port 10 Gb Ethernet Module, 80 Gb Fabric), can hold at most 128 k forwarding entries. Source routing Resorting to SDN technology [13], the global network topology can be controlled by the central allocator, so the switches can be removed from making routing decisions. This leads us to use source routing for a scalable data plane. To implement source routing, we leverage the idea proposed in [23]: a model for source-routing-enabled OpenFlow switches (OpenFlow, http://www.openflowswitch.org) is developed and a new action type is extended to encapsulate a new header onto the matched packet. The new header is called the path-header and contains information regarding the exact path to be taken, obtained from the controller. A compact mechanism similar to the Port Switching suggested by SecondNet [8] is chosen as opposed to an MPLS label stack [26]. Path-headers hold the sequence of interface numbers that the flow will be forwarded through at every intermediate switch. As Fig. 5 shows, for a flow from NodeA to NodeB along the highlighted path, a header can be embedded in the packets with the path expressed as the sequence {2, 3, 2, 3}. The overhead of source routing is the routing path carried in the header of every packet. We are willing to pay this overhead for a scalable data plane, since the maximum path length of a typical data center network is small (in a fat-tree, typically 5 hops). MultiTCP-based encapsulation MultiTCP enables better resource utilization by harnessing small pieces of available bandwidth, allowing more VN requests to be
Fig. 5 OpenFlow-based source routing
Fig. 6 The workflow of MultiTCP-based encapsulation
accepted. Since the cloud network is claimed to be layer-2 supported, MultiTCP is combined with source routing to make the system practical. The structure of DEAF is shown in Fig. 6. The workflow is illustrated by an example: when a VM with IP address 10.0.0.3 sends packets out, they are encapsulated with the IP address of the physical switch by means of IP-in-IP encapsulation (RFC 2003, https://tools.ietf.org/html/rfc2003). The VM flow is then split into three subflows. Finally, these subflows are given their transmission routes in the IP header by source routing. Each hypervisor stores its autonomous VN-to-PN mappings to assist in pinning the routing path.
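To illustrate the path-header idea, the sketch below (a hypothetical byte layout in the spirit of Port Switching; the actual header format used by DEAF is not specified here) prepends the interface sequence of Fig. 5 to a packet and shows how a switch would consume one entry per hop.

    import struct

    def push_path_header(payload, interfaces):
        # Prepend a path-header: a one-byte hop count followed by one byte
        # per output interface (hypothetical compact encoding).
        return struct.pack("B", len(interfaces)) + bytes(interfaces) + payload

    def pop_next_interface(packet):
        # At a switch: read the next output port and strip it from the header.
        hops = packet[0]
        port = packet[1]
        rest = struct.pack("B", hops - 1) + packet[2:1 + hops] + packet[1 + hops:]
        return port, rest

    # the highlighted NodeA -> NodeB path of Fig. 5, expressed as {2, 3, 2, 3}
    pkt = push_path_header(b"IP-in-IP encapsulated subflow", [2, 3, 2, 3])
    port, pkt = pop_next_interface(pkt)
    print(port)   # 2: forward on interface 2; the remaining path is {3, 2, 3}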
4 VN allocation 4.1 Cloud datacenter physical network Given the observation that the baseline multi-rooted network topology is known and relatively fixed in cloud networking, PortLand [15] is taken as the starting point to deliver scalable L2 routing, forwarding, and addressing for cloud networks. Based on PortLand, its routing and forwarding mechanisms are modified by introducing source routing to provide bandwidth guarantees. PortLand builds its scalable fault-tolerant L2 domain over the fat-tree, which is simply an instance of the traditional data center multi-rooted tree topology with the features of a Clos topology. Fig. 7 depicts a 16-port switch built as a multi-stage topology from constituent 4-port switches. In general, a three-stage fat-tree built from k-port switches
can support non-blocking communication among k^3/4 end hosts using 5k^2/4 individual k-port switches. The fat-tree is split into three layers, labeled edge, aggregation and core. The fat-tree as a whole is split into k individual pods, with each pod supporting non-blocking operation among k^2/4 hosts.
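For reference, a small helper (written for illustration; it is not part of Nxt-Max) that reproduces these fat-tree figures for a given port count k, including the 48-port configuration used by the simulator in Sect. 5:

    def fat_tree_size(k):
        # Sizing of a three-layer fat-tree built from k-port switches.
        assert k % 2 == 0, "a fat-tree requires an even port count"
        return {
            "pods": k,
            "hosts_per_pod": (k // 2) ** 2,     # k^2/4
            "hosts": k ** 3 // 4,               # k^3/4 end hosts
            "edge_switches": k * (k // 2),
            "aggregation_switches": k * (k // 2),
            "core_switches": (k // 2) ** 2,
            "total_switches": 5 * k ** 2 // 4,
        }

    print(fat_tree_size(4))    # k = 4: the 4-port-switch example of Fig. 7 (16 hosts)
    print(fat_tree_size(48))   # 48 pods, 576 hosts per pod, 27,648 hosts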
4.2 VN constraints Virtual datacenter allocation aims to map the VN to the underlying PN. Generally, most VDC applications rely on a multi-layer network architecture consisting of core, aggregation, and access layers. We stipulate that the self-defined VN topology cannot exceed three layers. In Sect. 2, the SCG model is used to describe the tenant VN, the depth of which is 3. In addition, we also stipulate that the number of VMs connected to one VS cannot be more than the number of ports that one physical switch can afford. These constraints conform to the actual layers and ports that the underlying fat-tree modeled PN can afford.
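These two constraints can be checked before any mapping work is attempted. The sketch below is a hypothetical pre-admission check (the SCG is reduced to its depth and a per-virtual-switch VM count; names are illustrative):

    def check_vn_constraints(scg_depth, vms_per_vswitch, physical_switch_ports):
        # Reject a VN whose SCG is deeper than three layers, or that attaches
        # more VMs to a virtual switch than a physical switch has ports.
        if scg_depth > 3:
            return False
        return all(n <= physical_switch_ports for n in vms_per_vswitch.values())

    # e.g. a two-layer SCG whose edge switches attach at most 40 VMs,
    # checked against 48-port physical switches
    print(check_vn_constraints(2, {"ves1": 40, "ves2": 12}, 48))   # True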
4.3 VN-to-PN mapping problem The VN-to-PN mapping can be clearly separated into two basic phases: assigning virtual nodes to physical nodes, and embedding virtual links onto physical paths using multi-commodity flow algorithms [16] in the case of splittable flows. The VN allocation problem has two parts: whether an allocation exists (the decision problem) and whether the allocation achieves maximum redundant bandwidth (the optimization problem). The more redundant network bandwidth an allocation uses, the
Fig. 7 OpenFlow-based fat-tree topology
Fig. 8 The index info of PAM
more VDCs can be accepted. Both problems are NP-hard in the case that the flow is at most k-splittable, which has already been proven in earlier works [17].
4.4 Heuristic VN-to-PN mapping method We focus on the heuristic design of VN-to-PN mapping. There are two challenges. First, the algorithm has to be fast even when a VDC has thousands of VMs and the infrastructure has tens to hundreds of thousands of servers and switches. Second, the algorithm should maximize per-VDC network bandwidth. Evidently, this is NP-hard. To obtain an approximately optimal solution, the principal idea is to restrict the problem space to enable efficient heuristics: a self-organized hierarchical architecture is utilized and parallel VDC allocation is performed to accelerate the overall VN-to-PN mapping. The hierarchical architecture allows VDC allocation to be made efficiently through task division, and is composed of two components: Pod Allocation Managers (PAMs) and a Global Leader (GL). Each PAM manages its corresponding local net-
work resources, and the GL keeps the summary information of the PAMs to do task decomposition if needed. According to tenant demands, GL generates VDC allocation strategies, which are delegated to the assigned PAMs for specialized VM scheduling with maximum redundant bandwidth allocation.
4.5 The VN-to-PN mapping procedure Each PAM sends local resource information to the GL periodically. The GL stores index information which helps it delegate a VDC allocation task to one or more PAMs. For example, a tenant may request a VDC with 30,000 VMs; a pod has 576 hosts and each host can accept 12 VMs, so at least 5 pods are required, and the GL determines which pods should be used and are able to complete the task. The index information includes two parts, as Fig. 8 describes. One is the edge layer index, in the form of monotonically increasing bandwidth intervals [Bm, Bn], each of which consists of paired tuples < Switch, VM-Number >, representing the number of VMs that one edge switch in the interval of Bm and Bn can accom-
modate, which further sums up to the number of VMs that a pod in the interval of Bm and Bn can accommodate. The other is the aggregation layer index, which uses the same form as the edge layer; the only difference is the paired tuple < Switch, Link-Number >, representing the number of links that one aggregation switch in the interval of Bm and Bn can accommodate. A pod can accommodate the designated allocation only when both parts of the index information are met. All pods in the interval are sorted in ascending order by the number of VMs that can be provisioned, so the GL can rapidly select the pods meeting the demand in a greedy manner, which makes VDC allocation among all the pods more even and ensures that at least one VN-to-PN mapping result exists according to the index information validation. If some pods are selected, each pod does its sub-VDC allocation, including both VM mapping and link mapping, in parallel by means of switch-centered mapping, eventually facilitating fast VDC allocation. In the course of sub-VDC mapping, the per-VDC maximum redundant bandwidth is supposed to be obtained from all the possible mappings. Now, the procedure of VN-to-PN mapping (Algorithm 1) is described. The GL receives the request with the SCG model of the VDC and determines how many levels the VN has. The VN structure is maintained while mapping the VN to the PN. Meanwhile, the GL computes the quantity of the required resources, the format of which resembles the index information. If the VN has no more than two layers, one pod should be found to complete the VN-to-PN mapping task. Otherwise, the number of pods should be equal to the degree of the VN core switch. Once pods are found, it is assumed that the links between the PCS and PAS in the VN have already been mapped. Then, the GL forwards the sub-VN descriptions to the selected pods; we only send each pod the corresponding part. Each pod employs two-stage mapping and serves as a participant that responds to the GL with the result of its sub-allocation. Once all participants respond with success, the overall allocation is regarded as complete. If any participant does not send feedback within a certain time, the allocation is considered a failure, and the GL notifies all the other participants to roll back. During two-stage mapping, a VM flow might be split into k subflows and more than one PAS might be fixed. Hence, multiple PCSes should be employed to join the sub-VN mapping results, which is called the third layer (L3) mapping. A bipartite graph is built with the PASes on the left side and the PCSes in the pod on the right side. A dst node is added to the right of the PCSes and edges are added from the PCSes to dst. This can be deemed a multi-commodity flow problem in which the subflows across the corresponding PASes, representing multiple commodities, target dst. The goal is to find an L3 mapping with maximum total redundant bandwidth, which is NP-hard. However, as the L3 links are not utilized as often as the L2 links, they are unlikely to become bottlenecks in the path. Thus, the search
scope can be reduced to a limited number k of attempts, and the final solution is the maximum total redundant bandwidth calculated over these k attempts.
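A simplified sketch of how the GL could pick pods from the index information (the per-pod summary below collapses the bandwidth intervals of Fig. 8 into a single entry per pod; the data layout is a guess made for illustration):

    def select_pods(pod_index, vm_demand, link_demand):
        # pod_index: {pod: {"vms": VMs the pod can still accommodate in the
        #                    requested bandwidth interval,
        #                   "links": aggregation links available}}
        # Pods are scanned in ascending order of spare VM capacity, so the
        # greedy choice keeps the allocation even across pods (Sect. 4.5).
        chosen, remaining = [], vm_demand
        for pod, info in sorted(pod_index.items(), key=lambda kv: kv[1]["vms"]):
            if info["vms"] == 0 or info["links"] < link_demand:
                continue                  # both parts of the index must be met
            chosen.append(pod)
            remaining -= info["vms"]
            if remaining <= 0:
                return chosen             # enough pods have been found
        return None                       # the allocation is rejected

    # the example above: 30,000 VMs, 576 hosts per pod, 12 VMs per host
    index = {"pod%d" % i: {"vms": 576 * 12, "links": 24} for i in range(8)}
    print(select_pods(index, 30000, link_demand=4))   # five pods are chosen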
4.6 Online two-stage mapping In a PAM, the baseline VN-to-PN mapping algorithm heuristically tries to achieve the goal in two stages: node mapping and link mapping. Node mapping A greedy node mapping algorithm is employed, whose motivation is to map the virtual nodes to the physical nodes with the maximum resources so as to minimize the use of the resources at the bottleneck nodes/links. This is beneficial to future requests which require specific physical nodes with scarce resources. Meanwhile, physical nodes with the maximum resources are likely to have more redundant bandwidth. Link mapping Once node mapping is finished, we start to map the virtual links in the VN. This link mapping problem can be transformed into a multi-commodity flow problem with path splitting. However, routing protocols deployed in practice do not permit spreading a flow over a large number of different paths. Thus, to make the system practical, we refer to the approximation solution for the minimum-cost k-splittable flow problem [18], which is suitable for this scenario: a multi-commodity flow problem in which each commodity may be shipped only on a restricted number of different paths.
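A minimal sketch of the node-mapping stage (resources are abstracted to a single scalar per node, and virtual nodes are placed in descending order of demand, which is this sketch's own choice; the link-mapping stage and the k-splittable flow computation are omitted):

    def greedy_node_mapping(virtual_nodes, physical_nodes):
        # virtual_nodes: {vnode: demand}; physical_nodes: {pnode: residual capacity}.
        # Each virtual node is mapped to the physical node that currently has
        # the most residual resources, as described for node mapping above.
        placement = {}
        residual = dict(physical_nodes)
        for vnode, demand in sorted(virtual_nodes.items(),
                                    key=lambda kv: kv[1], reverse=True):
            pnode = max(residual, key=residual.get)   # richest physical node
            if residual[pnode] < demand:
                return None                           # no feasible node mapping
            placement[vnode] = pnode
            residual[pnode] -= demand
        return placement

    print(greedy_node_mapping({"vm1": 4, "vm2": 2, "vm3": 2},
                              {"h1": 6, "h2": 5}))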
4.7 Offline adjustment Since VN requests arrive and depart over time, the physical network can easily drift into a non-optimal configuration where resources are increasingly skewed. Theoretically, one could try to address these challenges with dynamic VM migration coupled with mathematical techniques like dynamic programming. However, the cost of massive VM migration is huge due to the migration traffic, and the underlying search space is too large for dynamic programming to be practical. Instead, the departed VNs are recorded, and the physical resources they release form a released network graph. The offline adjustment component periodically finds the VDCs with the least redundant bandwidth and determines whether to migrate them or not. In this way, the time complexity and the performance degradation caused by massive migration can be dramatically reduced.

5 Evaluation
The simulation is used to study the effect and performance of the VDC allocation algorithm. The algorithm is evaluated on five metrics. Firstly, maximum redundant bandwidth allocation This metric reflects the goal; the average redundant bandwidth allocation under different underlying network loads is calculated. By contrasting it with a common algorithm, it can show whether the algorithm is valuable or not. Secondly, high network utilization The maximum redundant bandwidth allocation algorithm is essentially a resource scheduling method, which pursues high utilization of the underlying physical resources. Thirdly, proper VDC split proportion Too much VDC splitting will incur performance degradation of packet forwarding; however, VDC splitting can increase the overall network utilization. The algorithm should maintain a reasonable proportion between splittable VDCs and unsplittable VDCs. Fourthly, efficient and scalable VDC allocation To keep the system user-friendly, it is necessary to guarantee quick allocation for online VDC requests. Meanwhile, the per-VDC allocation time should not depend on the load on the underlying network, which is called scalability. Fifthly, offline adjustment effect This metric is measured by the average redundant bandwidth increment after the offline adjustment works. It is used to prove the effectiveness of the offline adjustment algorithm. All the experiments are performed on a Dawning S3997 server with 32 GB memory and 8 quad-core 800 MHz AMD CPUs. A simulator that coarsely models a multi-tenant datacenter is developed. The simulator creates a three-level fat-tree topology built from 48-port 10 GigE commodity switches. This leads to the simulation involving a datacenter with 48 pods, each of which has 576 physical servers, summing up to 27,648 physical servers. Provided that each server can accommodate 24 VMs, the scale of the datacenter can achieve a total of 663,552 VMs. In the past three years, a private campus cloud based on OpenStack has been developed and maintained, providing resizable compute capacity in the form of VMs. The aim is to provide free resources for teachers and students to support their research. Currently, the cloud is equipped with 1000 CPU cores, 2 TB of memory and 100 TB of disk. The active number of VMs is approximately 1500. Statistical analysis is exercised on all bandwidth-intensive application types in the cloud, and the following three typical and representative application scenarios are concluded. Further, the corresponding VN topology suitable for each application type is given empirically, serving as the templates to mimic and create general VN topologies.
Fig. 9 a Enterprise network b Web application c MapReduce. Typical application types in cloud datacenter

Table 2 Percentage of application type in cloud datacenter
Application type      Core (Size / Bandwidth)   Aggregation (Size / Bandwidth)   Edge (Size / Bandwidth)   VM (Size)   Percentage
Enterprise network    1 / 1000                  [2,6] / 400                      [2,6] / 100               [2,6]       5
Web application       - / -                     1 / [1000,2000]                  3 / 500                   [2,6]       60
Map reduce            - / -                     [2,5] / [500,800]                [2,5] / [500,800]         [2,5]       35
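To mimic the evaluation workload, a generator along the following lines could sample VDC requests in proportion to Table 2 (only the application-type percentages and the per-edge-switch VM ranges are taken from the table; everything else about the simulator is an assumption of this sketch):

    import random

    TEMPLATES = {
        "enterprise network": {"percent": 5,  "vm_range": (2, 6)},
        "web application":    {"percent": 60, "vm_range": (2, 6)},
        "mapreduce":          {"percent": 35, "vm_range": (2, 5)},
    }

    def sample_vdc(rng=random):
        # Draw one synthetic VDC request according to the Table 2 mix.
        kind = rng.choices(list(TEMPLATES),
                           weights=[t["percent"] for t in TEMPLATES.values()])[0]
        lo, hi = TEMPLATES[kind]["vm_range"]
        return {"type": kind, "vms_per_edge_switch": rng.randint(lo, hi)}

    print([sample_vdc()["type"] for _ in range(5)])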
5.1 Enterprise network (Fig. 9a) Enterprises purchase cloud resources to meet ordinary office automation needs. Most medium-sized enterprises consist of several departments, and each department contains hundreds of employees; thus, thousands of hosts are required. Generally, a three-layer model with oversubscription is taken for building the enterprise network.
5.2 Web application (Fig. 9b) Today, most web applications are built as multi-tier applications (Cisco Data Center Infrastructure 2.5 Design Guide). Typically, three tiers are used: a frontend web server, a middle application server, and a backend database. It is desirable to have bandwidth guarantees for the frontend-to-middle and middle-to-backend communications so that such web services can serve their customers with predictable performance.
5.3 MapReduce (Fig. 9c) Many data-intensive infrastructure services [19] request a non-blocking network, such as MapReduce [20]. In the Reduce phase, a Reduce worker needs to fetch intermediate files from many servers. The traffic generated by the Reduce workers forms an all-to-all communication pattern, thus requesting a particular network topology such as the fat-tree. In the experiment, the proportion that each application type accounts for is shown in Table 2. CPU-intensive applications are not considered since they are irrelevant to the net-
work. The synthetic VDC application type distribution is generated in proportion to the percentage that each application type accounts for. Besides, the VDC size is randomized and weights are assigned to the links of the VN topology; according to the different templates, different ranges of empirical estimates are made, respectively. The network utilization n_util is defined as the total bandwidth allocated to VDCs divided by the total link capacity. Similarly, the server bandwidth utilization s_util is the total server bandwidth allocated to VDCs divided by the total server link capacity, and the pod bandwidth utilization p_util is the same quantity within the scope of a pod. We add a sequence of randomly generated VDCs into the network; the algorithm does not stop until the accumulated rejected VDC allocations of a certain type reach 20. The reported results are mean values over 1000 measurements; eventually, the experiment executes 70,859 VDC allocations, generating 70,757 successful allocations on average. For the sake of comparison, we also employ a randomized link scheduling method in the two-stage pod allocation; it allocates 66,745 VDCs, generating 66,636 successful allocations. For ease of redundant bandwidth representation, we ignore the initial bandwidth demand of the VN allocation. Maximum redundant bandwidth allocation We statistically analyze the redundant bandwidth allocation for each VDC in different phases, as the redundant bandwidth varies with the increasing number of allocated VDCs. As shown in Fig. 10, the red line and the black line indicate the allocated redundant bandwidth when employing the maximum redundant bandwidth allocation algorithm and randomized link scheduling in two-stage mapping, respectively. When 10% of the VDCs are allocated, the underlying physical network
Fig. 10 Redundant bandwidth allocation
Fig. 11 Network utilization
has sufficient resources, so each VDC can almost get the upper bound of the physical link bandwidth (1000 Mbps). When 20% of the VDCs are allocated, each link may have one VDC mapped onto it, and it is likely that a minority of VDCs are overlapped onto the same link, so the average redundant bandwidth is nearly 910 Mbps. When 50% of the VDCs are allocated, each physical link has about two VDCs mapped onto it, so each VDC can get almost 570 Mbps. When the proportion of allocated VDCs approaches 100%, the underlying physical network resources are approaching high load, and the allocated redundant bandwidth is 30 Mbps. Compared to randomized link scheduling, the algorithm can get much more redundant bandwidth. In particular, when 50% of the VDCs are allocated, it gets the maximum margin of 352.4 Mbps, nearly two fifths of the actual link bandwidth. However, the more VDCs the underlying network affords, the smaller the advantage the algorithm embodies, because little spare network resource can be flexibly scheduled. High bandwidth utilization The network utilization is evaluated from s_util, p_util and n_util. From the perspective of the servers, of which there are 27,648 in total, the corresponding points are in Fig. 11. It can be seen that the s_util of most of these points is between 88 and 95%. From the perspective of the pods, of which there are 48, indicated by the red line, the p_util of all the pods is between 78 and 92%, below the s_util of the servers. Therefore, the algorithm can achieve high network utilization when aiming at maximum bandwidth allocation. It can also be seen that the n_util is about 48.6%, in that the links between PAS and PCS are seldom utilized. Within the input network templates, only the enterprise network contains links between PAS and PCS. However, its proportion is low, and the proportion of the number of links among the three layers of the underlying physical network is 1:1:1, so the network utilization cannot exceed 66.66% in principle. Proper VDC split proportion To achieve high network utilization, it is supposed that a VM flow can be split into several
Fig. 12 VDC split proportion
subflows. It is reasonable that the unsplittable mapping pattern takes priority over the splittable one. The VDC split proportion is calculated: the unsplittable VDCs account for 67.8%, the 2-splittable VDCs take 21.5%, and the 3-splittable VDCs occupy 10.7%, as shown in Fig. 12. We consider this result proper and consistent with the empirical expectation. Efficient and scalable allocation The allocation time of all the VDCs is recorded sequentially in Fig. 13. The intervals are split based on the percentage of the number of VDC allocations in order to facilitate the analysis of VDC allocation efficiency. There are two obvious black lines; we consider that the higher one corresponds to the VDC applications with three-layer networks, and the other corresponds to those with two-layer networks. Linear fitting is performed on these points, and we can see the allocation time remains almost sta-
Fig. 13 The allocation time of VDCs
ble with the increasing VDC number, which shows good scalability. We can also see this linear fitting line is close to the lower black line, in that most of the applications are of two-layer network types. There is a slight fall as the number of VDC allocations increases. We consider the main reason for this fall to be that the network resources become more and more deficient as the VDC allocations progress; because the GL determines which PAM is suitable for a VDC allocation according to the global index information, the overall number of comparisons gradually decreases. Most VDC allocation times are in the range of 100 to 500 ms. When the interval is between 80 and 100%, there are a few points approaching 1.9 s, mainly because some applications of three-layer network types, due to their own VDC magnitude, might search more index information, and even traverse all the index information to detect a suitable PAM in case of network deficiency. Through this analysis, we consider that the algorithm ensures efficient VDC allocation. Offline adjustment effect The continuous departure of one hundred VDCs is evaluated. The departed VDCs are selected randomly, and the departures are mimicked when the underlying network affords different loads. According to the offline adjustment algorithm, the released network resources form a graph; meanwhile, the VDCs with the least redundant bandwidth start to migrate, if possible. To show the effect of offline adjustment, we count the number of VDCs that migrate to the released network graph and the increased rate of redundant bandwidth allocated before and after migration. The more VDCs that migrate, or the more evident the increased rate, the more significant the offline adjustment is. In Fig. 14, we can see that the number of migrations gradually increases when the proportion of allocated VDCs is below 40%, because the underlying physical network resources are sufficient. When the proportion of allocated VDCs is 50%, the underlying physical network resources are undersupplied, so there is a
Fig. 14 Offline adjustment effect
surge between 40 and 50%. Afterwards, the number of migrations reaches a plateau, in that the number of departed VDCs is fixed. It can also be seen that the increased rate of redundant bandwidth allocated after migration is not over 50%, because 1000 Mbps is used as the baseline. Hence, the higher the load the underlying network affords, the lower the increased rate of redundant bandwidth the VDCs get. By migration, the lowest increased rate is about 10% (100 Mbps or more); however, the number of migrations is then larger (80 or more).
6 Related work Some recent efforts have concentrated on bandwidth allocation in cloud datacenters. According to the characteristics of bandwidth allocation, these works are classified into dynamic bandwidth allocation and static bandwidth allocation. (1) In dynamic bandwidth allocation, bandwidth is allocated to VDCs on demand and the forwarding paths are determined at runtime. When flows of a VDC arise, idle links are dynamically chosen to carry the traffic based on the instant loads on the underlying network topology. Path multiplicity technology such as ECMP [21] is widely utilized in current datacenters to dynamically schedule flows; however, link congestion cannot be avoided due to improper VM flow scheduling. Therefore, [22] employs a central controller to schedule flows, but it is difficult to scale out and to enforce per-VDC bandwidth guarantees due to the uncertainty of actual bandwidth demands. (2) In static bandwidth allocation, bandwidth is allocated to VDCs and the forwarding paths are determined in advance, as in SecondNet [8], Oktopus [9], FairCloud [6], Seawall [7], ElasticSwitch [27] and NetShare [14]. [8] and [9] mainly focus on the design of heuristic bandwidth allocation algorithms to map the virtual
network to the physical network; they devote only a small portion to the method of enforcing the bandwidth allocation. Moreover, the bandwidth allocation methods in [8] and [9] are not work-conserving, so the underlying network utilization is relatively low. [6] proposes a set of desirable properties and allocation policies to express the tradeoffs when sharing cloud networks; it can achieve bandwidth guarantees and is work-conserving. [7] and [25] are devoted to enforcing work-conserving bandwidth guarantees to make full use of the underlying network, according to the number of VMs or flows on the congested link. Different from the above, this work is devoted to per-VDC bandwidth allocation. By means of a special VN-to-PN mapping method, it can achieve per-VDC maximum redundant bandwidth allocation. The mapping method determines the per-VDC redundant bandwidth size when work-conserving bandwidth guarantee enforcement is applied. It is argued that proper VDC scheduling on the underlying network can yield higher network utilization.
7 Conclusion In this paper, Nxt-Max is presented, a novel bandwidth allocation framework for multi-tenant cloud datacenters that ensures per-VDC maximum redundant bandwidth allocation. Such a mapping in the cloud can make full use of the underlying network resources to increase per-VDC redundant bandwidth by splitting flows using MultiTCP. Through layer-2-based encapsulation, MultiTCP is made practical in the cloud. Finally, coupled with offline adjustment, the allocated VDCs are rearranged to get more redundant bandwidth through migration when possible. It is our hope that through Nxt-Max, not only does VDC allocation in cloud datacenters become more efficient, but each VDC can also get much more bandwidth to complete its tasks in a short time, thereby increasing the revenues of cloud providers. Acknowledgements The study was supported by the key science research project of Anhui province, "the design and research of large-scale data stream processing and storage system in cloud environment" (Grant No. KJ2014A150).
References 1. Al-Fares, M., Loukissas, A., Vahdat, A.: A scalable, commodity data center network architecture. In: Proceedings of ACM SIGCOMM, vol. 3 no. 8, pp. 89–92 (2008) 2. Guo, C., et al.: BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput. Commun. Rev. 39(4), 63–74 (2009) 3. Guo, C., et al.: Dcell: a scalable and fault-tolerant network structure for data centers. ACM SIGCOMM Comput. Commun. Rev. 38(4), 75–86 (2008)
4. Greenberg, A., et al.: VL2: a scalable and flexible data center network. ACM SIGCOMM Comput. Commun. Rev. 39(4), 9–14 (2009) 5. Greenberg, A., et al.: Towards a next generation data center architecture: scalability and commoditization. In: Proceedings of the ACM workshop on Programmable routers for extensible services of tomorrow. ACM vol. 1, no. 8, pp. 78–82 (2008) 6. Popa, L., et al.: FairCloud: sharing the network in cloud computing. In: Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication. vol. 4, no. 7, pp. 11–15 (2012) 7. Shieh, A., et al.: Seawall: performance isolation for cloud datacenter networks. In: Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. USENIX Association. vol. 7, no. 4, pp. 51-55 (2010) 8. Guo, C., et al. Secondnet: a data center network virtualization architecture with bandwidth guarantees. In: Proceedings of the 6th International COnference. ACM; vol. 9, no. 4, pp. 491–496, (2010) 9. Ballani, H., et al.: Towards predictable datacenter networks. ACM SIGCOMM Comput. Commun. Rev. 41(4), 45–50 (2011) 10. Clos, C.: A study of non-blocking switching networks. Bell Syst. Tech. J. 32(2), 406–424 (1953) 11. Xie, D., et al.: The only constant is change: incorporating timevarying network reservations in data centers. ACM SIGCOMM Comput. Commun. Rev. 42(4), 199–210 (2012) 12. Lee, J., et al.: Application-driven bandwidth guarantees in datacenters. In: Proceedings of the 2014 ACM conference on SIGCOMM. ACM 3(2), 41–45 (2014) 13. McKeown, N., et al.: OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Comput. Commun. Rev. 38(2), 69– 74 (2008) 14. Lam, T., et al.: NetShare: virtualizing data center networks across services. Department of Computer Science and Engineering, vol. 2, no. 8, pp. 451–455. University of California, San Diego (2010) 15. Niranjan Mysore, Radhika, et al.: Portland: a scalable fault-tolerant layer 2 data center network fabric. ACM SIGCOMM Comput. Commun. Rev. 39(4), 20–25 (2009) 16. Hu, T.C.: Multi-commodity network flows. Oper. Res. 11(3), 344– 360 (1963) 17. Baier, G., Kohler, E., Skutella, M.: The k-splittable flow problem. Algorithmica 42(4), 231–248 (2005) 18. Gamst, M., et al.: Two-and three-index formulations of the minimum cost multicommodity k-splittable flow problem. Eur. J. Oper. Res. 202(1), 82–89 (2010) 19. Shamsi, J., Khojaye, M.A., Qasmi, M.A.: Data-intensive cloud computing: requirements, expectations, challenges, and solutions. J Grid Comput. 11(2), 281–310 (2013) 20. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008) 21. Hopps, C.E.: Analysis of an equal-cost multi-path algorithm. ACM 56(8), 23–30 (2000) 22. Al-Fares, M., et al.: Hedera: dynamic flow scheduling for data center networks. NSDI 10(8), 89–92 (2010) 23. Soliman, M., et al.: Exploring source routed forwarding in SDNbased WANs. Communications (ICC), 2014 IEEE International Conference on. IEEE, 2014 24. Whittaker, R.H.: Evolution and measurement of species diversity. Taxon 21(8), 213–251 (1972) 25. Rodrigues, H., et al.: Gatekeeper: supporting bandwidth guarantees for multi-tenant datacenter networks. WIOV 1(3), 784–789 (2011) 26. Semeria, C., Marketing Engineer: Multiprotocol label switching. Enhanc. Routing New Public Netw. White Paper Juniper Netw. 79(3), 14–19 (2000) 27. Popa, L., et al. ElasticSwitch: practical work-conserving bandwidth guarantees for cloud computing. 
In: Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM. ACM, 2(8):78–82 (2013)
Wang Shuo received the B.S. degree in electronic engineering from the Electronic Engineering Institute of PLA. He received his M.S. degree from the University of Science and Technology of China (USTC), where he is currently working toward the doctoral degree. He is now a teacher at Bengbu University. His research interests include cloud computing and cloud networks.
Li Jing received his Bachelor's degree in 1987, Master's degree in 1990, and Ph.D. in 1993 in Computer Science from the University of Science and Technology of China (USTC). He is now a Professor in the School of Computer Science and Technology at USTC. His recent research focuses on distributed systems, cloud computing and mobile computing. He is a senior member of the CCF. In the last few years, he has been mainly engaged in the promotion and application of software technology, and in research and development on large-scale network systems and distributed algorithms. He has published more than 80 papers in high-profile domestic and international academic journals.
Wang Qiqi received the B.S. degree in computer science from University of Electronic Science and Technology of China (UESTC), where she is currently working toward the doctor’s degree. Her research interests include, Artificial Intelligence, data mining.
Zhang Hongjie is currently working toward the Master degree in computer science from University of Science and Technology of China (USTC). His research interests include cloud computing, cloud networks.