Data Center Challenges Building Networks for Agility

Sreenivas Addagatla, Albert Greenberg, James Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta

1

Capacity Issues in Real Data Centers
•  Bing has many applications that turn network BW into useful work
   –  Data mining – more jobs, more data, more analysis
   –  Index – more documents, more frequent updates
•  These apps can consume lots of BW
   –  They press the DC's bottlenecks to their breaking point
   –  Core links in intra-data center fabric at 85% utilization and growing
•  It got to the point that loss of even one aggregation router would result in massive congestion and incidents
•  Demand is always growing (a good thing…)
   –  One team wanted to ramp up traffic by 10 Gbps over 1 month

2

The Capacity Well Runs Dry
•  We had already exhausted all ability to add capacity to the current network architecture

[Figure: utilization on a core intra-DC link climbing toward 100% despite capacity upgrades: June 25 – 80G to 120G; July 20 – 120G to 240G; July 27 – 240G to 320G]

We had to do something radically different

3

Target Architecture

[Diagram: target architecture, Internet at top, with callouts:]
•  Simplify mgmt: broad layer of devices for resilience & ROC – "RAID for the network"
•  More capacity: Clos network mesh, VLB traffic engineering
•  Fault domains for resilience and scalability: Layer 3 routing
•  Reduce COGS: commodity devices

4

Deployment Successful!

[Figure: traffic being drained from congested locations after deployment]

5

Want to design some of the biggest data centers in the world? Want to experience what “scalable” and “reliable” really mean? Think measuring compute capacity in millions of MIPs is small potatoes? Bing’s AutoPilot team is hiring!



6

Agenda
•  Brief characterization of "mega" cloud data centers
   –  Costs
   –  Pain-points with today's network
   –  Traffic pattern characteristics in data centers
•  VL2: a technology for building data center networks
   –  Provides what data center tenants & owners want:
        Network virtualization
        Uniform high capacity and performance isolation
        Low cost and high reliability with simple mgmt
   –  Principles and insights behind VL2
   –  VL2 prototype and evaluation
   –  (VL2 is also known as project Monsoon)

7

What’s a Cloud Service Data Center?

Figure by Advanced Data Centers

•  Electrical power and economies of scale determine total data center size: 50,000 – 200,000 servers today •  Servers divided up among hundreds of different services •  Scale-out is paramount: some services have 10s of servers, some have 10s of 1000s 8

Data Center Costs

  Amortized Cost*   Component              Sub-Components
  ~45%              Servers                CPU, memory, disk
  ~25%              Power infrastructure   UPS, cooling, power distribution
  ~15%              Power draw             Electrical utility costs
  ~15%              Network                Switches, links, transit

•  Total cost varies
   –  Upwards of $1/4 B for a mega data center
   –  Server costs dominate
   –  Network costs significant

The Cost of a Cloud: Research Problems in Data Center Networks. Sigcomm CCR 2009. Greenberg, Hamilton, Maltz, Patel.
*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money

9
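To make the amortization footnote concrete, here is a minimal sketch of how such a split can be computed; the capital-cost figures below are hypothetical placeholders for illustration, not the actual numbers behind the table above.

```python
# Sketch: amortized monthly cost per component, using the footnote's assumptions
# (3 yr amortization for servers, 15 yr for infrastructure, 5% cost of money).
# The dollar figures below are hypothetical, chosen only to illustrate the math.

def monthly_payment(principal, annual_rate, years):
    """Standard annuity payment amortizing `principal` over `years`."""
    r = annual_rate / 12.0              # monthly cost of money
    n = years * 12                      # number of monthly payments
    return principal * r / (1.0 - (1.0 + r) ** -n)

components = {
    # name: (hypothetical capital cost in $, amortization period in years)
    "Servers": (200e6, 3),
    "Power infrastructure": (100e6, 15),
    "Network": (30e6, 15),
}

monthly = {name: monthly_payment(cost, 0.05, yrs)
           for name, (cost, yrs) in components.items()}
total = sum(monthly.values())
for name, m in monthly.items():
    print(f"{name:>22}: ${m/1e6:5.2f}M/month  ({m/total:5.1%} of amortized capital)")
```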

Data Centers are Like Factories
•  Number 1 goal: maximize useful work per dollar spent
•  Ugly secrets:
   –  10% to 30% CPU utilization is considered "good" in DCs
   –  There are servers that aren't doing anything at all
•  Cause:
   –  Servers are purchased rarely (roughly quarterly)
   –  Reassigning servers among tenants is hard
   –  Every tenant hoards servers

Solution: More agility: Any server, any service

10

Improving Server ROI: Need Agility
•  Turn the servers into a single large fungible pool
   –  Let services "breathe": dynamically expand and contract their footprint as needed
•  Requirements for implementing agility
   –  Means for rapidly installing a service's code on a server
        Virtual machines, disk images
   –  Means for a server to access persistent data
        Data too large to copy during provisioning process
        Distributed filesystems (e.g., blob stores)
   –  Means for communicating with other servers, regardless of where they are in the data center
        Network

11

The Network of a Modern Data Center

[Diagram: conventional data center network. Internet at the top; L3 core routers (CR) and access routers (AR); below them, an L2 domain of switches (S) and load balancers (LB) connecting racks of servers (A); ~2,000 servers per podset]

Key:
•  CR = L3 Core Router
•  AR = L3 Access Router
•  S = L2 Switch
•  LB = Load Balancer
•  A = Rack of 20 servers with Top of Rack switch

Ref: Data Center: Load Balancing Data Center Services, Cisco 2004

•  Hierarchical network; 1+1 redundancy
•  Equipment higher in the hierarchy handles more traffic, is more expensive, and gets more effort at availability → a scale-up design
•  Servers connect via 1 Gbps UTP to Top of Rack switches
•  Other links are a mix of 1G and 10G; fiber and copper

12

Internal Fragmentation Prevents Applications from Dynamically Growing/Shrinking

[Diagram: the same hierarchical network, with services confined to separate VLANs/subtrees under different access routers]

•  VLANs used to isolate properties from each other
•  IP addresses topologically determined by ARs
•  Reconfiguration of IPs and VLAN trunks is painful, error-prone, slow, and often manual

13

No Performance Isolation

[Diagram: the same hierarchy; one service flooding its subtree causes collateral damage to the other services sharing it]

•  VLANs typically provide only reachability isolation
•  One service sending/receiving too much traffic hurts all services sharing its subtree

14

Network has Limited Server-to-Server Capacity, and Requires Traffic Engineering to Use What It Has

[Diagram: the same hierarchy, with 10:1 over-subscription or worse (80:1, 240:1) above the rack layer]

•  Data centers run two kinds of applications:
   –  Outward facing (serving web pages to users)
   –  Internal computation (computing search index – think HPC)

15

Network Needs Greater Bisection BW, and Requires Traffic Engineering to Use What It Has

[Diagram annotations: dynamic reassignment of servers and Map/Reduce-style computations mean the traffic matrix is constantly changing; explicit traffic engineering is a nightmare]

•  Data centers run two kinds of applications:
   –  Outward facing (serving web pages to users)
   –  Internal computation (computing search index – think HPC)

16

Measuring Traffic in Today's Data Centers
•  80% of the packets stay inside the data center
   –  Data mining, index computations, back end to front end
   –  Trend is towards even more internal communication
•  Detailed measurement study of a data mining cluster
   –  1,500 servers, 79 ToRs
   –  Logged: 5-tuple and size of all socket-level R/W ops
   –  Aggregated into flow and traffic matrices every 100 s (Src, Dst, Bytes of data exchanged)

More info:
DCTCP: Efficient Packet Transport for the Commoditized Data Center
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf
The Nature of Datacenter Traffic: Measurements and Analysis
http://research.microsoft.com/en-us/UM/people/srikanth/data/imc09_dcTraffic.pdf

17
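As an illustration of the aggregation step described above, a minimal sketch that folds socket-level records into per-window traffic matrices; the record format and field names are assumptions for illustration, not the study's actual logging pipeline.

```python
# Sketch: aggregate socket-level R/W logs (timestamp, src, dst, bytes) into
# src/dst traffic matrices over 100 s windows, as described on this slide.
from collections import defaultdict

WINDOW_SECONDS = 100

def traffic_matrices(records):
    """records: iterable of (timestamp, src_ip, dst_ip, nbytes) tuples."""
    matrices = defaultdict(lambda: defaultdict(int))   # window -> (src, dst) -> bytes
    for ts, src, dst, nbytes in records:
        matrices[int(ts // WINDOW_SECONDS)][(src, dst)] += nbytes
    return matrices

# Example: three socket-level operations collapse into two 100 s windows.
log = [(12.0, "10.0.0.1", "10.0.1.9", 4096),
       (47.5, "10.0.0.1", "10.0.1.9", 8192),
       (160.2, "10.0.2.3", "10.0.0.1", 1500)]
for window, tm in sorted(traffic_matrices(log).items()):
    print(f"window {window}: {dict(tm)}")
```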

Flow Characteristics

[Figure: flow size and concurrent-flow distributions]
-  DC traffic != Internet traffic
-  Most of the flows: various mice
-  Most of the bytes: within 100 MB flows
-  Median of 10 concurrent flows per server

18

Traffic Matrix Volatility
-  Collapse similar traffic matrices into "clusters"
-  Need 50–60 clusters to cover a day's traffic
-  Traffic pattern changes nearly constantly
-  Run length is 100 s at the 80th percentile; 800 s at the 99th

19

Today, Computation Constrained by Network*

[Figure: heat map of ln(Bytes/10 sec) between server pairs in an operational cluster; intensity scale spans roughly 0.2 Kbps to 1 Gbps, axes are "server from" and "server to"]

•  Great efforts required to place communicating servers under the same ToR → most traffic lies on the diagonal
•  Stripes show there is need for inter-ToR communication

*Kandula, Sengupta, Greenberg, Patel

20

What Do Data Center Faults Look Like?

[Diagram: the conventional hierarchy, highlighting failures near the top of the tree]
Ref: Data Center: Load Balancing Data Center Services, Cisco 2004

•  Need very high reliability near top of the tree
   –  Very hard to achieve
        Example: failure of a temporarily unpaired core switch affected ten million users for four hours
   –  0.3% of failure events knocked out all members of a network redundancy group
        Typically at lower layers in the tree, but not always

21

Objectives for the Network of a Single Data Center

Developers want network virtualization: a mental model where all their servers, and only their servers, are plugged into an Ethernet switch

•  Uniform high capacity
   –  Capacity between two servers limited only by their NICs
   –  No need to consider topology when adding servers
•  Performance isolation
   –  Traffic of one service should be unaffected by others
•  Layer-2 semantics
   –  Flat addressing, so any server can have any IP address
   –  Server configuration is the same as in a LAN
   –  Legacy applications depending on broadcast must work

22

VL2: Distinguishing Design Principles
•  Randomizing to Cope with Volatility
   –  Tremendous variability in traffic matrices
•  Separating Names from Locations
   –  Any server, any service
•  Leverage Strengths of End Systems
   –  Programmable; big memories
•  Building on Proven Networking Technology
   –  We can build with parts shipping today
        Leverage low-cost, powerful merchant silicon ASICs, though do not rely on any one vendor
        Innovate in software

23

What Enables a New Solution Now?
•  Programmable switches with high port density
   –  Fast: ASIC switches-on-a-chip (Broadcom, Fulcrum, …)
   –  Cheap: small buffers, small forwarding tables
   –  Flexible: programmable control planes
•  Centralized coordination
   –  Scale-out data centers are not like enterprise networks
   –  Centralized services already control/monitor health and role of each server (Autopilot)
   –  Centralized directory and control plane acceptable (4D)

[Photo: 24-port 10GE switch. List price: $10K]

24

An Example VL2 Topology: Clos Network

[Diagram: two-layer Clos network. D/2 intermediate node switches (used in VLB), each with D ports; D aggregation switches, each with D/2 ports up and D/2 ports down; 10G links; Top of Rack switches with 20 server ports. The node degree (D) of the available switches determines the number of servers supported: [D²/4] × 20]

•  A scale-out design with broad layers
•  Same bisection capacity at each layer → no oversubscription
•  Extensive path diversity → graceful degradation under failure
•  ROC philosophy can be applied to the network switches

25
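As a quick sanity check on the [D²/4] × 20 scaling above, a small sketch (the 20 servers per ToR comes from the slide; the helper itself is illustrative, not part of VL2):

```python
# Sketch: servers supported by the two-layer Clos above as a function of the
# switch node degree D. D/2 intermediate switches and D aggregation switches
# yield D^2/4 ToRs, each hosting 20 servers.
def vl2_clos_capacity(D, servers_per_tor=20):
    tors = (D * D) // 4
    return tors * servers_per_tor

for D in (48, 96, 144):
    print(f"D = {D:3d}: {vl2_clos_capacity(D):,} servers")
# D = 144 gives (144^2 / 4) * 20 = 103,680 servers
```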

Use Randomization to Cope with Volatility

[Diagram: the same Clos topology as the previous slide: D/2 intermediate switches, D aggregation switches, 20-port ToRs, [D²/4] × 20 servers]

•  Valiant Load Balancing
   –  Every flow "bounced" off a random intermediate switch
   –  Provably hotspot-free for any admissible traffic matrix
   –  Servers could randomize flow-lets if needed

26
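A minimal sketch of the per-flow randomization idea (the switch names and per-server table are illustrative; as a later slide explains, VL2 actually lets switch ECMP pick the intermediate via an anycast address):

```python
# Sketch: Valiant Load Balancing at flow granularity. Each new flow is bounced
# off a randomly chosen intermediate switch; all packets of that flow take the
# same bounce path, so TCP still sees in-order delivery.
import random

INTERMEDIATES = ["I1", "I2", "I3", "I4"]      # the D/2 intermediate switches
_flow_to_intermediate = {}

def intermediate_for(flow):
    """flow: (src_ip, src_port, dst_ip, dst_port, proto) 5-tuple."""
    if flow not in _flow_to_intermediate:
        _flow_to_intermediate[flow] = random.choice(INTERMEDIATES)
    return _flow_to_intermediate[flow]

flow = ("10.0.0.1", 51200, "10.0.3.7", 80, "tcp")
assert intermediate_for(flow) == intermediate_for(flow)   # a flow sticks to its bounce switch
```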

Separating Names from Locations: How Smart Servers Use Dumb Switches

[Diagram: a packet from source S to destination D carries three stacked headers: Dest: N (a random intermediate node), Dest: TD (the destination's ToR), and Dest: D, followed by the payload. Path: (1) source S and its ToR (TS) forward toward N; (2) intermediate node N strips the outer header; (3) destination ToR TD strips the next header; (4) destination D receives the original packet]

•  Encapsulation used to transfer complexity to servers
   –  Commodity switches have simple forwarding primitives
   –  Complexity moved to computing the headers
•  Many types of encapsulation available
   –  IEEE 802.1ah defines MAC-in-MAC encapsulation; VLANs; etc.

27
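A sketch of the header stack the sending server computes; the field layout is schematic, not VL2's exact wire format.

```python
# Sketch: server-side triple encapsulation as shown above. The sender wraps the
# application packet (addressed to D) with a header to D's ToR (TD) and an outer
# header to a randomly chosen intermediate node (N). Each hop strips one layer.
import random
from dataclasses import dataclass

@dataclass
class Packet:
    dst: str
    src: str
    payload: object            # application bytes, or an inner Packet

def encapsulate(src, dst, dst_tor, intermediates, payload):
    inner = Packet(dst=dst, src=src, payload=payload)                       # to destination D
    mid = Packet(dst=dst_tor, src=src, payload=inner)                       # to D's ToR (TD)
    return Packet(dst=random.choice(intermediates), src=src, payload=mid)   # to intermediate N

pkt = encapsulate(src="S", dst="D", dst_tor="TD",
                  intermediates=["N1", "N2"], payload=b"hello")
print(pkt.dst, pkt.payload.dst, pkt.payload.payload.dst)   # e.g. N2 TD D
```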

Leverage Strengths of End Systems

[Diagram: server stack. Applications sit on TCP/IP as usual; a kernel-level VL2 agent intercepts ARP and encapsulates outgoing packets, resolving remote addresses through a resolution cache backed by the Directory System (Lookup(AA) → EncapInfo(AA)). A Provisioning System populates the directory: Provision(AA, …), CreateVL2VLAN(…), AddToVL2VLAN(…)]

•  Data center OSes already heavily modified for VMs, storage, etc.
   –  A thin shim for network support is no big deal
•  Applications work with Application Addresses
   –  AAs are flat names; infrastructure addresses invisible to apps
•  No change to applications or clients outside the DC

28
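A rough sketch of the agent's resolution path (cache hit, else directory lookup); the directory interface and return values are assumptions for illustration, not the shipping agent's API.

```python
# Sketch: the VL2 agent's address resolution. Applications use flat application
# addresses (AAs); the agent maps an AA to the destination ToR's locator address
# (LA) through a local cache, falling back to the Directory System on a miss.
class DirectoryStub:
    """Stand-in for the Directory System; in reality this is an RPC."""
    def __init__(self, mapping):
        self.mapping = mapping                  # AA -> ToR LA
    def lookup(self, aa):
        return self.mapping[aa]

class VL2Agent:
    def __init__(self, directory):
        self.directory = directory
        self.cache = {}                         # AA -> LA resolution cache

    def resolve(self, aa):
        la = self.cache.get(aa)
        if la is None:                          # miss -> Lookup(AA)
            la = self.directory.lookup(aa)      # returns the encapsulation info
            self.cache[aa] = la
        return la

agent = VL2Agent(DirectoryStub({"10.1.1.5": "20.0.0.7"}))
print(agent.resolve("10.1.1.5"))   # first call hits the directory; later calls hit the cache
```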

Separating Network Changes from Tenant Changes

How to implement VLB while avoiding the need to update state on every host at every topology change?

[Diagram: an L3 network running OSPF. Intermediate switches I1, I2, I3 all advertise the same anycast address IANY; ToRs T1–T6 send encapsulated traffic to IANY, and the switches' flow-based ECMP picks which intermediate actually carries each flow. Separate links are used for up paths and down paths]

IP anycast + flow-based ECMP:
•  Harness huge bisection bandwidth
•  Obviate esoteric traffic engineering or optimization
•  Ensure robustness to failures
•  Work with switch mechanisms available today

29
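A minimal sketch of why anycast plus flow-based ECMP needs no per-host path state (the hash and next-hop names are illustrative, not any particular switch's implementation):

```python
# Sketch: hosts always send to one anycast address (IANY); each switch hashes a
# flow's 5-tuple to choose among the equal-cost next hops toward IANY. Adding or
# removing an intermediate changes only the switches' ECMP groups, not host state.
import hashlib

NEXT_HOPS_TO_IANY = ["I1", "I2", "I3"]   # equal-cost paths, one per intermediate switch

def ecmp_next_hop(flow):
    digest = hashlib.sha1(repr(flow).encode()).digest()
    return NEXT_HOPS_TO_IANY[int.from_bytes(digest[:4], "big") % len(NEXT_HOPS_TO_IANY)]

# Different flows spread across intermediates; packets of one flow stay together.
flows = [("10.0.0.1", port, "10.0.3.7", 80, "tcp") for port in (5001, 5002, 5003, 5004)]
print([ecmp_next_hop(f) for f in flows])
```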

VL2 Analysis and Prototyping
•  Will it work? VLB traffic engineering depends on there being few long flows

•  Will it work? Control plane has to be stable at large scale

  DHCP Discover/s   CPU load   DHCP Discover Delivered   DHCP Offer Delivered
  100               7%         100%                      100%
  200               9%         100%                      73.3%
  300               10%        100%                      50.0%
  400               11%        100%                      37.4%
  500               12%        100%                      31.2%
  1000              17%        99.8%                     16.8%
  1500              22%        99.7%                     12.0%
  2000              27%        99.4%                     11.2%
  2500              30%        99.4%                     9.0%

Prototype Results: Huge amounts of traffic with excellent efficiency
•  154 Gbps goodput sustained among 212 servers
•  10.2 TB of data moved in 530 s
•  Fairness of 0.95/1.0 → great performance isolation
•  91% of maximum capacity
•  TCP RTT 100–300 microseconds on quiet network → low latency

30

VL2 Prototype

•  Experiments conducted with 40, 80, 300 servers
   –  Results have near-perfect scaling
   –  Gives us some confidence that the design will scale out as predicted

31

VL2 Achieves Uniform High Throughput

•  Experiment: all-to-all shuffle of 500 MB among 75 servers – 2.7 TB
•  Excellent metric of overall efficiency and performance
•  All-to-all shuffle is a superset of other traffic patterns
•  Results:
   –  Ave goodput: 58.6 Gbps; fairness index: .995; ave link util: 86%
   –  Perfect system-wide efficiency would yield aggregate goodput of 75 Gbps
        Monsoon efficiency is 78% of perfect
        10% inefficiency due to duplexing issues; 6% header overhead
        VL2 efficiency is 94% of optimal

32
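A small sketch of the arithmetic behind these efficiency figures; the duplexing and header-overhead percentages are the slide's own estimates, and the resulting number lands close to the quoted 94%.

```python
# Sketch: the shuffle-efficiency accounting on this slide.
servers, nic_gbps = 75, 1.0
perfect = servers * nic_gbps            # 75 Gbps if every 1 Gbps NIC were saturated
measured = 58.6                         # Gbps aggregate goodput

print(f"fraction of perfect: {measured / perfect:.0%}")     # ~78%
optimal = perfect * (1 - 0.10 - 0.06)   # discount duplexing issues and header overhead
print(f"fraction of optimal: {measured / optimal:.0%}")     # ~93%, near the slide's 94%
```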

VL2 Provides Performance Isolation

-  Service 1 unaffected by service 2’s activity

33

VLB vs. Adaptive vs. Best Oblivious Routing

-  VLB does as well as adaptive routing (traffic engineering using an oracle) on data center traffic
-  Worst link is 20% busier with VLB; the median is the same

34

Related Work
•  OpenFlow
   –  Shares idea of simple switches controlled by external SW
   –  VL2 is a philosophy for how to use the switches
•  Fat-trees of commodity switches [Al-Fares, et al., SIGCOMM'08]
   –  Shares a preference for a Clos topology
   –  Monsoon provides a virtual layer 2 using different techniques: changes to servers, an existing forwarding primitive, directory service
•  DCell [Guo, et al., SIGCOMM'08]
   –  Uses servers themselves to forward packets
•  SEATTLE [Kim, et al., SIGCOMM'08]
   –  Shared goal of a large L2, different approach to directory service
•  Formal network theory and HPC
   –  Valiant Load Balancing, Clos networks
•  Logically centralized routing
   –  4D, Tesseract, Ethane

35

Summary
•  Key to economic data centers is agility
   –  Any server, any service
   –  Today, the network is the largest blocker
•  The right network model to create is a virtual layer 2 per service
   –  Uniform High Bandwidth ✓
   –  Performance Isolation ✓
   –  Layer 2 Semantics ✓
•  VL2 implements this model via several techniques
   –  Randomizing to cope with volatility (VLB) → uniform BW / performance isolation
   –  Name/location separation & end-system changes → L2 semantics
   –  End-system changes & proven technology → deployable now
   –  Performance is scalable

VL2: Any server/any service agility via scalable virtual L2 networks that eliminate fragmentation of the server pool

36

Want to design some of the biggest data centers in the world? Want to experience what “scalable” and “reliable” really mean? Think measuring compute capacity in millions of MIPs is small potatoes? Bing’s AutoPilot team is hiring!



37

More Information

•  The Cost of a Cloud: Research Problems in Data Center Networks
   –  http://research.microsoft.com/~dmaltz/papers/DC-Costs-CCR-editorial.pdf
•  VL2: A Scalable and Flexible Data Center Network
   –  http://research.microsoft.com/apps/pubs/default.aspx?id=80693
•  Towards a Next Generation Data Center Architecture: Scalability and Commoditization
   –  http://research.microsoft.com/~dmaltz/papers/monsoon-presto08.pdf
•  DCTCP: Efficient Packet Transport for the Commoditized Data Center
   –  http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf
•  The Nature of Datacenter Traffic: Measurements and Analysis
   –  http://research.microsoft.com/en-us/UM/people/srikanth/data/imc09_dcTraffic.pdf
•  What Goes into a Data Center?
   –  http://research.microsoft.com/apps/pubs/default.aspx?id=81782
•  James Hamilton's Perspectives Blog
   –  http://perspectives.mvdirona.com
•  Designing & Deploying Internet-Scale Services
   –  http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
•  Cost of Power in Large Scale Data Centers
   –  http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx

38

BACK UP SLIDES

39

Other Issues
•  Dollar costs of a VL2 network
•  Cabling costs and complexity
•  Directory System performance
•  TCP in-cast
•  Buffer allocation policies on the switches

40

Cabling Costs and Issues

[Diagram: intermediate and aggregation switches sit in a central network cage; ToRs connect in from the open floor plan]

•  Cabling complexity is not a big deal
   –  Monsoon network cabling fits nicely into a conventional open floor plan data center
   –  Containerized designs available
•  Cost is not a big deal
   –  Computation shows it as 12% of total network cost
   –  Estimate: SFP+ cable = $190, two 10G ports = $1K, so cabling should be ~19% of switch cost

41
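The ~19% figure follows directly from the per-link estimate above (a one-line check; the prices are the slide's estimates):

```python
# Sketch: per-link cabling cost relative to the switch ports it connects.
sfp_plus_cable = 190        # $ per 10G SFP+ cable (slide's estimate)
two_10g_ports = 1000        # $ for the two switch ports at the cable's ends
print(f"cabling / switch-port cost = {sfp_plus_cable / two_10g_ports:.0%}")   # ~19%
```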

Directory System Performance
•  Key issues:
   –  Lookup latency (SLA set at 10 ms)
   –  How many servers needed to handle a DC's lookup traffic?
   –  Update latency
   –  Convergence latency

42

Directory System

[Diagram: Directory Servers (DS) front a small set of RSM servers.
Lookup path: 1. agent sends Lookup to a DS; 2. DS replies.
Update path: 1. agent sends Update to a DS; 2. DS sets the new mapping on an RSM server; 3. RSM replicates; 4. RSM acks; 5. DS acks the agent; (6. DS disseminates the change to the other directory servers)]

43

Directory System Performance
•  Lookup latency
   –  Each server assigned to the directory system can handle 17K lookups/sec with 99th percentile latency < 10 ms
   –  Scaling is linear as expected (verified with 3, 5, 7 directory servers)
•  Directory System sizing
   –  How many lookups per second?
        Median node has 10 connections, 100K servers = 1M entries
        Assume (worst case?) that all need to be refreshed at once
   –  64 servers handle the load within the 10 ms SLA
   –  Directory system consumes 0.06% of total servers

44
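The sizing can be sanity-checked with back-of-the-envelope arithmetic (a sketch using the slide's numbers; the one-second refresh window is an assumption):

```python
# Sketch: directory-system sizing from the figures above.
servers_in_dc = 100_000
median_connections = 10
entries = servers_in_dc * median_connections      # ~1M live entries

per_server_rate = 17_000            # lookups/sec per directory server within the 10 ms SLA
burst_rate = entries                # worst case: every entry refreshed at once (in ~1 s)

needed = -(-burst_rate // per_server_rate)        # ceiling division -> 59
print(f"{needed} directory servers needed; the slide provisions 64")
print(f"fraction of the fleet: {64 / servers_in_dc:.2%}")   # ~0.06%
```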

Directory System Performance

45

The Topology Isn’t the Most Important Thing •  Two-layer Clos network seems optimal for our current environment, but … •  Other topologies can be used with Monsoon –  Ring/Chord topology makes organic growth easier –  Multi-level fat tree, parallel Clos networks d1= 40 ports

i

...

n/(d1-2) positions

layer 1 or 2 links

Type (2) switches

A

layer 1 links

TOR

B

n1 = 144 switches

n/(d1-2) positions

...

144 ports

...

...

d2 = 100 ports

Type (1) switches

n2 = 72 switches

n2 = 72 switches

A

B TOR n1 = 144 switches

Number of servers = 2 x 144 x 36 x 20 = 207,360

46

VL2 is resilient to link failures

-  Performance degrades and recovers gracefully as links are failed and restored

47

Abstract (this won’t be part of the presented slide deck – I’m just keeping the information together) Here’s an abstract and slide deck for a 30 to 45 min presentation on VL2, our data center network. I can add more details on the Monsoon design or more background on the enabling HW, the traffic patterns, etc. as desired. See http://research.microsoft.com/apps/pubs/default.aspx?id=81782 for possibilities. (we could reprise the tutorial if you’d like – it ran in 3 hours originally) We can do a demo if that would be appealing (takes about 5 min) To be agile and cost effective, data centers must allow dynamic resource allocation across large server pools. Today, the highest barriers to achieving this agility are limitations imposed by the network, such as bandwidth bottlenecks, subnet layout, and VLAN restrictions. To overcome this challenge, we present VL2, a practical network architecture that scales to support huge data centers with 100,000 servers while providing uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. VL2’s design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL2’s implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL2 networks can be deployed today, and we have built a working prototype with 300 servers. We evaluate the merits of the VL2 design using measurement, analysis, and experiments. Our VL2 prototype shuffles 2.7 TB of data among 75 servers in 395 seconds – sustaining a rate that is 94% of the maximum possible.

48