Communication Transport Optimized For Data Center Environment

Abstract: Methods and apparatus for congestion control in computer networks achieve high burst tolerance, low latency and high throughput with shallow-buffered switches. A method for controlling congestion includes transmitting a set of data packets on a network connection from a first computing device to a second computing device; identifying each data packet in the set of data packets that experienced congestion on the network connection; sending, by the second computing device to the first computing device, a sequence of bits that represents the number of data packets in the set of data packets that were identified as having experienced congestion; and adjusting a rate of transmitting data packets on the network connection based on the sequence of bits sent to the first computing device.

Patent Information

Application #
Filing Date
08 August 2012
Publication Number
50/2015
Publication Type
INA
Invention Field
COMMUNICATION
Status
Parent Application
Patent Number
Legal Status
Grant Date
2021-10-27
Renewal Date

Applicants

MICROSOFT CORPORATION
One Microsoft Way, Redmond, Washington 98052-6399

Inventors

1. ATTAR Mohammedreza Alizadeh
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399
2. SRIDHARAN Murari
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399
3. PRABHAKAR Balaji
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399
4. MALTZ David A.
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399
5. PADHYE Jitendra D.
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399
6. GREENBERG Albert G.
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399
7. PATEL Parveen K.
c/o Microsoft Corporation, LCA International Patents, One Microsoft Way, Redmond, Washington 98052-6399

Specification

COMMUNICATION TRANSPORT OPTIMIZED FOR DATA CENTER ENVIRONMENT
Field of the Invention
This invention relates to congestion control in computer networks and, more
particularly, to methods and apparatus for controlling congestion in a data center
environment. However, the invention is not limited to use in a data center environment.
Background
Data centers may include several hundred or several thousand servers
interconnected by high-speed switches. Cloud data centers host diverse applications,
mixing in the same network many workflows that require small, predictable latency with
others requiring large, sustained throughput. In recent years, data centers have
transformed computing, with large scale consolidation of enterprise IT into data center
hubs, and with the emergence of cloud computing service providers. A consistent theme
in data center design has been to build highly available, high performance computing and
storage infrastructure using low cost, commodity components. In particular, low-cost
switches are common, providing up to 48 ports at 1 Gbps, at a price under $2,000. Several
recent research proposals envision creating economical, easy-to-manage data centers using
novel architectures built on such commodity switches.
Whether these proposals are realistic depends in large part on how well the
commodity switches handle the traffic of real data center applications. It has been
discovered that soft real-time applications, such as web search, retail, advertising, and
recommendation systems that have driven much of the data center construction, generate a
diverse mix of short flows and long flows. These applications require the following from
the data center network: low latency for short flows, high burst tolerance, and high
utilization for long flows.
The first two requirements stem from the Partition/Aggregate workflow pattern
that many of these applications use. The soft real-time deadlines for end results translate
into latency targets for the individual tasks in the workflow. These latency targets vary
from about 10 ms to about 100 ms, and tasks not completed before their deadlines are
cancelled, thereby adversely affecting the final result. Thus, application requirements for
low latency directly impact the quality of the result returned and thus revenue. Reducing
network latency allows application developers to shift more cycles to the algorithms that
improve relevance and end user experience.
The third requirement, high utilization for large flows, stems from the need to
continuously update internal data structures of these applications, as the freshness of this
data also affects the quality of results. High throughput for long flows that update data is
thus as essential as low latency and burst tolerance.
In this environment, today's state of the art TCP protocol falls short. Accordingly,
there is a need for improved methods and apparatus for efficient packet transport in
computer networks, such as data centers.
Summary
The present invention provides methods and apparatus for congestion control
which achieve high burst tolerance, low latency and high throughput with shallow-buffered
switches. To meet the requirements of a diverse mix of short flows and long
flows, switch buffers are maintained with small queue occupancies, while high throughput
is maintained for long flows. These goals are achieved primarily by reacting to congestion
based on the extent of congestion. A congestion control algorithm uses a marking scheme
at switches that sets a marking bit in transmitted data packets as soon as the buffer
occupancy exceeds a small, fixed threshold. The sender reacts by reducing the rate of
transmitting data packets by a factor that depends on the fraction of marked packets. The
larger the fraction, the larger the decrease in transmission rate. The transmission rate can
be controlled by adjusting the length of a transmission window. The sender derives multi-bit
feedback from the single-bit marking information in each packet of a set of transmitted
packets.
According to a first aspect of the invention, a method is provided for controlling
congestion on a network connection between a first computing device and a second
computing device. The method comprises: transmitting a set of data packets on the
network connection from the first computing device to the second computing device;
identifying each data packet in the set of data packets that experienced congestion on the
network connection; sending, by the second computing device to the first computing
device, a sequence of bits that represents the number of data packets in the set of data
packets that were identified as having experienced congestion; and adjusting a rate of
transmitting data packets on the network connection based on the sequence of bits sent to
the first computing device.
According to a second aspect of the invention, a method is provided for controlling
congestion on a network connection between a first computing device and a second
computing device. The method comprises: transmitting, by the first computing device, a
set of data packets on the network connection to the second computing device; marking
data packets in the set of transmitted data packets if a queue size in a device on the
network connection exceeds a predetermined, single value threshold K; receiving, at the
first computing device, information identifying data packets in the set of transmitted data
packets that were marked; estimating, at the first computing device, a measure of
congestion on the network connection based on the data packets in the set of data packets
that were identified as marked; and adjusting, by the first computing device, a rate of
transmitting data packets on the network connection based on the estimated measure of
congestion.
According to a third aspect of the invention, a method is provided for controlling
congestion on a network connection between a first computing device and a second
computing device. The method comprises: transmitting a set of data packets on the
network connection from the first computing device to the second computing device;
marking data packets in the set of transmitted data packets if a queue size in a device on
the network connection exceeds a predetermined, single value threshold K; sending, by the
second computing device to the first computing device, a sequence of bits that represents
the number of data packets in the set of data packets that were marked; estimating a
measure of congestion on the network connection by determining, based on the sequence
of bits, a fraction of data packets in the set of transmitted data packets that were marked;
adjusting a rate of transmitting data packets on the network connection based on the
fraction of marked data packets in the set of transmitted data packets; and updating the
estimated measure of congestion on the network connection for each set of transmitted
data packets.
Brief Description of the Drawings
In the drawings:
Fig. 1 is a schematic diagram that illustrates a Partition/Aggregate workflow
pattern;
Fig. 2 is a block diagram that illustrates incast congestion at a switch connected to
the aggregator;
Fig. 3 is a block diagram of a computer network including a sender that transmits
data packets to a receiver, in accordance with embodiments of the invention;
Fig. 4 illustrates a congestion control algorithm in accordance with embodiments
of the invention;
Fig. 5 illustrates marking of data packets by a switch in accordance with
embodiments of the invention;
Fig. 6 is a flow chart that illustrates operation of a congestion control algorithm in
accordance with embodiments of the invention;
Fig. 7 is a state diagram that controls setting of congestion bits in ACK packets in
the case of delayed acknowledgements;
Fig. 8 is a plot of instantaneous queue length at a switch as a function of time,
using a congestion control algorithm in accordance with embodiments of the invention and
using conventional TCP;
Fig. 9 is a table that illustrates examples of operation of a congestion control
algorithm in accordance with embodiments of the invention and operation of conventional
TCP; and
Fig. 10 is a block diagram generally illustrating an example of a computer system
in which the present invention may be implemented.
Detailed Description
The Partition/Aggregate workflow pattern shown in Fig. 1 is the foundation of
many large scale web applications executed in data centers. The Partition/Aggregate
workflow pattern includes a top level aggregator 100, lower level aggregators 110
connected to top level aggregator 100, and workers 120 connected to respective lower
level aggregators 110. Aggregators 100 and 110, and workers 120 may each be
implemented as a server. The workflow pattern may employ any number of levels.
A request is received by top level aggregator 100. Requests from higher layers of
the application are broken into pieces and assigned to workers in lower levels. The
responses of the workers are aggregated to produce a result. Web search, social network
content composition and advertisement selection may be based on this workflow pattern.
For interactive, soft real-time applications such as these, latency is a key metric, with total
permissible latency being determined, for example, by customer impact studies. After
subtracting typical Internet and rendering delays, the backend part of the application is
typically allocated between 230 and 300 ms.
Many applications have a multi-layer Partition/Aggregate workflow pattern, with
lags at one layer delaying the initiation of others. Further, responding to a request may
require iteratively invoking the workflow pattern, with an aggregator making serial
requests to the workers below to prepare a response. For example, in a web search, a query
may be sent to many aggregators and workers, each responsible for a different part of the
index. Based on the replies, an aggregator may refine the query and send the refined query
to improve the relevance of the result. Lagging instances of the Partition/Aggregate
workflow can thus add up to threaten the total latency for queries.
To prevent the total latency from being violated, worker nodes are typically
assigned tight deadlines, usually on the order of 10-100 ms. Examples of deadlines for
completing work are shown in Fig. 1. When a node misses its deadline, the computation
continues without that response, lowering the quality of the result.
The present invention is based on an understanding of performance impairments
observed in a data center. A data center may include multiple racks of servers. Each rack
may include multiple servers, for example 44 servers, connected to a switch. The switches
may be shallow-buffered, shared memory switches, each with 4 MB of buffer shared
among 48 ports operating at 1 Gbps and two ports operating at 10 Gbps. The switches are
shared-memory switches that exploit statistical multiplexing gain through the use of
logically common packet buffers available to all switch ports. Packets arriving on an
interface are stored in a high-speed, multi-ported memory shared by all the interfaces.
Memory from the shared pool is dynamically allocated to a packet by a memory
management unit (MMU). The MMU attempts to give each interface as much memory as
it needs while preventing unfairness by dynamically adjusting the maximum amount of
memory any one interface can take. If a packet must be queued for an outgoing interface,
but the interface has reached its maximum memory allocation or the shared memory pool
is depleted, then the packet is dropped. Large multi-ported memories are expensive, so
most low-cost switches are shallow buffered, with packet buffer being the scarcest
resource.
If many data flows converge on the same interface of a switch over a short period
of time, the packets may exhaust either the switch memory or the maximum permitted
buffer for that interface, resulting in packet losses for some of the flows. This can occur
even if the flows are small. A traffic pattern which results in packet losses arises naturally
from the use of the Partition/Aggregate workflow pattern, as the request for data
synchronizes the workers' responses and creates incast at the queue of the switch port
connected to the aggregator. A diagram of a network 200 shown in Fig. 2 illustrates incast
congestion. A client 210 sends a request to N servers 220 via a switch 230. The network
200 may operate on the Partition/Aggregate workflow model illustrated in Fig. 1, with
client 210 corresponding to aggregator 100 and servers 220 corresponding to lower level
aggregators 110. The servers 220 may send responses at the same time, causing incast
congestion at switch 230. Incast-like problems degrade performance and, more
importantly, user experience. A response that experiences incast congestion is likely to
miss the aggregator deadline and be left out of the final results.
Long-lived TCP flows cause the length of the bottleneck queue to grow until
packets are dropped. When long and short flows traverse the same queue, two impairments
occur. First, packet loss on the short flows can cause incast problems as described above.
Second, there is a queue buildup impairment. Even when no packets are lost, the short
flows experience increased latency, as they are in the queue behind packets of the large
flows. Every worker is handling both query traffic and background traffic, so this traffic
pattern occurs frequently. Thus, an issue is the occupancy of the queue caused by other
flows - the background traffic - with losses occurring when the long flows and short flows
coincide. Since latency is caused by queuing, a solution is to reduce the size of the queues.
Given the mix of long and short flows in data centers, it is common for short flows
on one port to be impacted by activity on any of the many other ports. Surprisingly, the
loss rate of short flows in this traffic pattern depends on the number of long flows
traversing other ports. The explanation is that the activity on the different ports is coupled
by the shared memory pool. The long TCP flows build up queues on their respective
interfaces. Since buffer space is a shared resource, the queue buildup reduces the amount
of buffer space available to absorb bursts of traffic from Partition/Aggregate traffic. This
impairment is termed "buffer pressure." The result is packet loss and timeouts, as in incast,
but without requiring synchronized flows.
A congestion control algorithm, known as the DCTCP algorithm, addresses the
performance impairments described above. A goal of the DCTCP algorithm is to achieve
high burst tolerance, low latency, and high throughput with commodity shallow-buffered
switches. To this end, the DCTCP algorithm is designed to operate with small queue occupancies
and without loss of throughput.
A simplified block diagram of network components involved in operation of the
DCTCP algorithm is shown in Fig. 3. A sender 300 transmits data packets 302 to a
receiver 310 on a network connection 312 that includes switches 320 and 322. Switch 320
includes a buffer 330, and switch 322 includes a buffer 332. Each of buffers 330 and 332
may have a queue of data packets awaiting transmission. Switches 320 and 322 may
receive additional data flows on one or more interfaces 334 and 336, respectively. A data
packet 306 may be marked by switch 322 when a queue size in buffer 332 exceeds a
threshold, as described below. An example implementation of marking a data packet 306
is to set the Explicit Congestion Notification code point CE as defined in IETF RFC 3168
"The Addition of Explicit Congestion Notification (ECN) to IP". Receiver 310 may
acknowledge data packets 302 by sending ACK packets 304 to sender 300. Each ACK
packet may have an ECN-Echo flag set to indicate that congestion was experienced by the
corresponding received packet.
As shown in Fig. 3, sender 300 may include an application 340 that wishes to
transmit data packets to receiver 310 and a transmission rate controller 342 that controls
the transmission rate of data packets as described below. To control congestion, receiver
310 includes an ECN-Echoer 350 to control setting of ECN-Echo flags in ACK packets as
described below. In addition to or in place of ECN-Echoer 350, receiver 310 may include
a congestion inferer 352 and a congestion marker 354 as described below. In the context
of a data center, sender 300 and receiver 310 may be servers, such that sender 300 is a first
computing device and receiver 310 is a second computing device. It will be understood
that Fig. 3 illustrates only one connection of multiple connections and two servers of
multiple servers in a data center environment.
The DCTCP algorithm achieves the goals of high burst tolerance, low latency and
high throughput by reacting to congestion based on the extent of congestion. The
algorithm uses a marking scheme at switches, which sets the Congestion Experienced
(CE) codepoint of data packets as soon as the buffer occupancy exceeds a fixed small
threshold. The sender reacts by reducing the data transmission rate by a factor that
depends on the fraction of marked packets. The larger the fraction of marked packets, the
larger the decrease in transmission rate. In some embodiments, the decrease in
transmission rate may be in proportion to the fraction of marked packets.
The algorithm derives multi-bit feedback from single bits contained in the marked
or unmarked state of each data packet in a set of data packets. The set of data packets may
be the data packets transmitted during a transmission window, also known as a congestion
window, cwnd. Since the DCTCP algorithm requires the network to provide only single-bit
feedback, much of the functionality that is already available in modern TCP stacks and
switches can be utilized.
The need for reacting based on the extent of congestion is particularly acute in the
absence of large-scale statistical multiplexing. Standard TCP reduces its window size by a
factor of two when it receives an ECN notification, that is, TCP-ECN reacts to a single
marked packet per congestion window. In effect, TCP reacts to the presence of congestion,
not to its extent. Reducing the window in half causes a large mismatch between the input
rate to the link and the available capacity. In the high speed data center environment,
where only a small number of flows share the buffer, this leads to buffer underflows and
loss of throughput.
The DCTCP algorithm has three main components as summarized in Fig. 4. A first
component 400 is the marking of data packets at the switch 320, 322. The algorithm
employs an active queue management scheme, as shown in Fig. 5. A marking threshold K
is the only parameter. As indicated by marking characteristic 500, an arriving packet is
marked, for example with the CE codepoint, if the queue occupancy is greater than
threshold K upon its arrival. Otherwise, the arriving packet is not marked. The DCTCP
marking scheme is motivated by the need to minimize queue buildup. The DCTCP
algorithm aggressively marks packets when a queue overshoot is sensed, thus allowing
senders to be notified of the queue overshoot as fast as possible.
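By way of illustration only, this marking rule can be sketched in a few lines of Python; the Packet class and the queue structure below are hypothetical placeholders, not part of any particular switch implementation:

    # Illustrative sketch of the marking rule of Fig. 5. K is the single
    # marking threshold; Packet and the queue are placeholder structures.
    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Packet:
        ce: bool = False  # Congestion Experienced (CE) codepoint

    def enqueue(queue: deque, packet: Packet, K: int) -> None:
        # Mark the arriving packet if the instantaneous queue occupancy
        # exceeds threshold K upon its arrival; otherwise leave it unmarked.
        if len(queue) > K:
            packet.ce = True
        queue.append(packet)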
A second component 410 of the DCTCP algorithm is the ECN-Echo at the receiver
310. The DCTCP receiver differs from a conventional TCP receiver in the way
information in the CE codepoints is conveyed back to the sender. A TCP receiver sets the
ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender
that the congestion notification has been received. As described in RFC 3168, an explicit
goal of the TCP receiver is to notify the TCP sender of at most one congestion signal per
round trip time (RTT). A DCTCP receiver, however, accurately conveys the exact
sequence of marked packets back to the sender. One way to achieve this is to acknowledge
every packet, setting the ECN-Echo flag if and only if the packet has a marked CE
codepoint.
However, delayed acknowledgements are important for a variety of reasons,
including reducing the load on the sender. Delayed acknowledgements use one
accumulative ACK packet for every m consecutively received packets. To use delayed
acknowledgements, the DCTCP receiver uses the two-state state machine shown in Fig. 7 to
determine whether to send an ACK packet with the appropriate ECN-Echo bit. The states
correspond to whether the last received packet was marked with the CE codepoint or not.
Thus, the conventional delayed ACK is modified by sending an ACK packet each time the
marker bit of a received packet changes state. Since the sender knows how many
transmitted packets each ACK packet covers, it can exactly reconstruct the marked packets
received by the receiver.
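One way to realize this state machine is sketched below, under assumed details: the ACK callback and the handling of the packet that triggers the state change are illustrative choices, not behavior prescribed by the text.

    # Sketch of the delayed-ACK state machine of Fig. 7. An immediate ACK
    # is sent whenever the CE marking of received packets changes state, so
    # the sender can reconstruct the exact sequence of marked packets.
    class DctcpReceiver:
        def __init__(self, m=2):
            self.m = m            # packets covered by one delayed ACK
            self.last_ce = False  # state: was the last packet CE-marked?
            self.pending = 0      # packets received since the last ACK

        def on_packet(self, ce_marked, send_ack):
            if ce_marked != self.last_ce:
                # CE state changed: ACK the packets received so far with
                # the ECN-Echo flag of the *previous* state, then switch.
                if self.pending:
                    send_ack(ecn_echo=self.last_ce, count=self.pending)
                self.pending = 0
                self.last_ce = ce_marked
            self.pending += 1
            if self.pending == self.m:
                send_ack(ecn_echo=self.last_ce, count=self.pending)
                self.pending = 0

    # usage: r = DctcpReceiver(); r.on_packet(True, lambda **kw: print(kw))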
A third component 420 of the DCTCP algorithm is the controller at the sender 300.
The sender maintains a running estimate of the fraction of packets that are marked, called
α, which is updated once for every window of data (roughly one RTT) as follows:

$\alpha \leftarrow (1 - g) \times \alpha + g \times F$     (1)

where F is the fraction of packets that were marked in the last window of data, and g,
having a value in the range of 0-1, is the weight given to new samples with respect to the
previous estimate of α.
It may be noted that α is a real number having a value between 0 and 1. Given that
the sender receives marks for every packet when the queue length is greater than K and
does not receive any marks when the queue length is less than K, equation (1) implies that
α is an estimate of the probability that the queue is greater than K. Thus, a value of α close
to 0 indicates a low level of congestion, and a value of α close to 1 indicates a high level
of congestion.
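Equation (1) is an exponentially weighted moving average over per-window samples; a minimal sketch follows (the function and parameter names are illustrative, and g = 1/16 is the example weight discussed later in the text):

    # Sketch of equation (1): update the running estimate alpha once per
    # window of data (roughly one RTT).
    def update_alpha(alpha, marked, total, g=1.0 / 16):
        F = marked / total  # fraction of marked packets in the last window
        return (1 - g) * alpha + g * F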
A DCTCP sender differs from a TCP sender with respect to its reaction to
receiving an ACK packet with the ECN-Echo flag set, as described above. Other features
of TCP, such as slow start, additive increase in congestion avoidance, and recovery from
packet loss are left unchanged. While TCP cuts its window size by a factor of two in
response to a marked ACK packet, the DCTCP algorithm uses α to reduce its window,
cwnd, as follows:

$cwnd \leftarrow cwnd \times (1 - \alpha/2)$     (2)

Thus, when the value of α is near 0 (low congestion), the window is slightly
reduced. The DCTCP senders start reducing their window size as soon as the queue size
exceeds K. The DCTCP algorithm thus maintains low queue size, while ensuring high
throughput. When the value of α is near 1 (high congestion), the DCTCP algorithm
reduces its window by half, as in TCP.
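A corresponding sketch of the window adjustment of equation (2); this is illustrative only, and a real implementation would also clamp cwnd and integrate with the rest of TCP's window logic:

    # Sketch of equation (2): cut cwnd in proportion to the estimated
    # extent of congestion instead of always halving it.
    def reduce_cwnd(cwnd, alpha):
        return cwnd * (1 - alpha / 2)  # alpha ~ 0: mild cut; ~ 1: halve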
The sender has been described as adjusting its transmission window, or congestion
window, based on the fraction of marked data packets in a set of data packets. However,
the invention is not limited in this respect, and other methods of adjusting transmission
rate may be utilized within the scope of the invention.
The DCTCP algorithm involves selection of threshold K, the queue size threshold
that triggers marking in switches, and weight g, the weight given to new samples of α with
respect to a previous estimate of α. The values of threshold K and weight g may be
chosen based on the following guidelines:

$K > \frac{C \times RTT}{7}$     (3)

$g < \frac{1.386}{\sqrt{2(C \times RTT + K)}}$     (4)
where C is the capacity of the network connection in packets per second, RTT is round trip
time in seconds and threshold K is in packets. Allowances may be made for packet bursts
when selecting the value of threshold K. For example, while Equation (3) may suggest a
marking threshold K as low as 20 packets for 10 Gbps, a more conservative marking
threshold larger than 60 packets may be used to avoid loss of throughput. This excess is in
line with burst sizes of 30 to 40 packets observed at 10 Gbps.
Based on packet bursts observed at 1 Gbps and 10 Gbps, and the total amount of
available buffering in switches, a marking threshold K of 20 packets for 1 Gbps ports and
a marking threshold K of 65 packets for 10 Gbps ports may be utilized, and weight g may
be set to 1/16. It will be understood that these values are given by way of example only
and are not limiting as to the scope of the present invention. Similarly, it will be
understood that threshold K can be denoted in units of bytes or cells of buffer space as
well as packets.
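As a worked example, assume 1500-byte packets and a 100 microsecond round trip time on a 10 Gbps link (both figures are assumptions for illustration, not values given in the text); guidelines (3) and (4) then evaluate as follows, and the example weight g = 1/16 = 0.0625 satisfies the resulting bound:

    # Worked example of guidelines (3) and (4) under assumed numbers.
    import math

    C = 10e9 / (1500 * 8)  # capacity in packets/s, 1500-byte packets (assumed)
    RTT = 100e-6           # assumed round trip time in seconds

    K_min = C * RTT / 7                               # equation (3): ~11.9 packets
    g_max = 1.386 / math.sqrt(2 * (C * RTT + K_min))  # equation (4): ~0.10

    print(f"K > {K_min:.1f} packets, g < {g_max:.3f}")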
DCTCP senders start reacting as soon as the queue length on an interface exceeds
the threshold K. This reduces queuing delays on congested switch ports, which minimizes
the impact of long flows on the completion time of small flows. Also, more buffer space is
available as headroom to absorb transient microbursts, greatly mitigating the costly packet
losses that can lead to timeouts.
The DCTCP algorithm also solves the buffer pressure problem because a
congested port's queue length does not grow exceedingly large. Therefore, in shared
memory switches a few congested ports will not exhaust the buffer resources, thereby
harming flows passing through other ports.
The incast scenario, where a large number of synchronized small flows reach the
same queue, is the most difficult to handle. If the number of small flows is so high that
even one packet from each flow is sufficient to overwhelm the buffer on a synchronized
burst, any congestion control scheme that does not attempt to schedule traffic can do little
to avoid packet loss.
However, in practice, each flow has several packets to transmit and their windows
build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to packet loss.
Because the DCTCP algorithm starts marking early and aggressively based on
instantaneous queue length, DCTCP senders receive enough marks during the first one or
two RTTs to reduce the size of follow-up bursts, thereby preventing buffer overflows.
A flow chart that summarizes the operation of the DCTCP algorithm is shown in
Fig. 6. The operations of Fig. 6 are described with reference to the network diagram of
Fig. 3. In act 600, sender 300 transmits a set of packets 302 to receiver 310 on connection
312. The set of packets may be the data packets transmitted during a transmission
window.
In act 602, transmitted data packets that experience congestion are marked, for
example by setting a CE codepoint. As described above, the transmitted packets are
marked if the queue size in a switch, such as switches 320 and 322, exceeds a threshold K.
Otherwise, the data packets are not marked. A marked data packet 306 is shown in Fig. 3.
In act 604, receiver 310 notifies sender 300 of each marked packet in the set of
data packets received by receiver 310. In the case where each data packet is
acknowledged individually, the receiver 310 may set an ECN-Echo bit in the ACK packets
304 to indicate that marked packets were received. Unmarked data packets are
acknowledged without setting the ECN-Echo bit. Thus, sender 300 receives ACK packets,
and the number of packets having ECN-Echo bits set is based on the extent of congestion
experienced by the set of packets.
In act 606, sender 300 estimates the fraction of marked packets in the set of
transmitted data packets. This information is derived from the ECN-Echo bit in each of the
ACK packets returned by receiver 310. Thus, sender 300 derives multi-bit information
from single-bit information (ECN-Echo bit) contained in each of the ACK packets.
In act 608, a running estimate of the fraction of marked packets is updated by
sender 300 using Equation (1) above. In act 610, the transmission rate of subsequent sets
of data packets is adjusted based on the updated running estimate determined in act 608.
As discussed above, the transmission rate may be decreased based on the fraction of
marked packets, which represents the extent of congestion on network connection 312.
In the foregoing description of the DCTCP algorithm, transmitted data packets are
marked at a switch in the network connection when the queue size at the switch exceeds a
threshold K. The set of marked and unmarked packets is used to estimate the extent of
congestion on the network connection. In other embodiments, congestion inferer 352 and
congestion marker 354 shown in Fig. 3 are used as an alternate to marking of packets at
the switch.
The congestion inferer 352 observes the packets received by receiver 310. It
estimates the utilization of the link connecting the receiver 310 to the last hop switch 322
by recording the time at which each packet or group of packets is received and the number
of received packets. The link capacity in bits per second is known, so congestion inferer
352 obtains an estimate of link utilization by dividing the bits received during a duration
of time by the capacity multiplied by the duration. Typical durations may be 10
milliseconds, 100 milliseconds, or 500 milliseconds. If the link utilization is above
threshold, typically 90 to 95 percent, then the congestion inferer 352 determines that there
must be a sufficiently long queue at the last hop switch 322 that congestion exists at the
switch. The system then proceeds as if all packets received during that duration and the
next duration were received with the CE congestion bit set. The congestion marker 354
returns ACK packets to the sender 300 with ECN-Echo bits set. When the estimated
utilization is above the threshold value, the sender 300 estimates the extent of congestion
based on the fraction of marked packets and adjusts the transmission rate as described
above.
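A minimal sketch of the utilization test performed by congestion inferer 352 follows; the function shape and the 95 percent default are assumptions consistent with the ranges given above:

    # Sketch of the congestion inferer 352: estimate last-hop link
    # utilization over a measurement duration and, above a threshold,
    # treat packets in that duration (and the next) as CE-marked.
    def infer_congestion(bits_received, capacity_bps, duration_s,
                         threshold=0.95):
        utilization = bits_received / (capacity_bps * duration_s)
        return utilization > threshold  # True: treat duration as congested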
Fig. 8 illustrates the effectiveness of the DCTCP algorithm in achieving full
throughput, while taking up a small part of the switch packet buffer, as compared to TCP.
In Fig. 8, curve 800 represents the instantaneous queue length as a function of time for the
DCTCP algorithm, and curve 810 represents the instantaneous queue length for
conventional TCP. The queue length was measured on a Broadcom Triumph switch. Two
long flows were launched from distinct 1 Gbps ports to a common 1 Gbps port. The switch
had dynamic memory management enabled, allowing flows to a common receiver to
dynamically occupy up to 700 KB of buffer.
A comparison of the performance of the DCTCP algorithm and conventional TCP
is shown in Fig. 9. In a first example, eight packets in a set of ten packets had the ECN-Echo
bit set. The DCTCP algorithm cuts the transmission window by 40 percent, whereas
TCP cuts the transmission window by 50 percent. In a second example, one packet in a set
of ten packets had the ECN-Echo bit set. The DCTCP algorithm cuts the transmission
window by five percent, whereas TCP cuts the transmission window by 50 percent. These
examples illustrate that the DCTCP algorithm adapts to the extent of congestion, as
indicated by the fraction of marked packets, whereas TCP cuts the transmission window
by 50 percent in the presence of any congestion.
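These two examples follow directly from equation (2) if α is taken to equal the marked fraction in the window, which is an assumption made here for illustration (α is in general a smoothed estimate):

    # The two examples of Fig. 9 checked against equation (2).
    for marked in (8, 1):
        alpha = marked / 10  # assumes alpha has converged to F
        print(f"{marked}/10 marked: window cut by {alpha / 2:.0%}")  # 40%, 5%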
In one embodiment, transmission rate controller 342, congestion inferer 352,
congestion marker 354 and ECN Echoer 350 are incorporated into the Transmission
Control Protocol and implemented in the network stack of sender 300 and receiver 310.
This embodiment has an advantage that any application 340 using a TCP connection will
receive the benefits of the invention. In another embodiment, these elements are
incorporated into libraries used by the application 340. In this embodiment, the application
on the receiver 310 or the library is responsible for sending congestion information 304 to
the sender 300. The feedback can be exactly the amount of congestion experienced and
can be sent as frequently as desired - once per RTT would be optimal. This embodiment
has an advantage that an application 340 that communicates using something other than a
TCP connection, for example, a UDP stream of data, can receive the benefits of the
invention. Other embodiments are also possible, such as incorporating the transmission
rate controller 342 into the application 340 and the congestion inferer 352, congestion
marker 354 and ECN Echoer 350 into the network stack of the receiver 310.
In another embodiment, the transmission rate controller 342 is incorporated into
the network stack or TCP code of the receiver 310 and the receiver controls the rate at
which the sender 300 sends data by setting the TCP receiver advertised window in the
acknowledgement packets 304 it sends to the sender 300. The sending rate is increased by
increasing the advertised receiver window and is decreased by decreasing the advertised
receiver window.
Prior work known as ECN-hat reacts to the average number of ECN notifications
received over a period of time. In ECN-hat, the TCP sender keeps a running average of the
number of ECN notifications it has received. When the TCP sender receives the first ECN
Notification, it reduces the congestion window based on the current average. It will not
adjust the congestion window again until at least a congestion window of data has been
sent. Any additional ECN Notifications received until a congestion window of data has
been sent are accumulated into the running average of ECN Notifications, but do not cause
further adjustment to the congestion window. The invention differs from ECN-hat by
introducing a new ECN-Echoer 350, an optional congestion inferer 352 and congestion
marker 354, and using different rules in the transmission rate controller 342 as shown in
equations (1) and (2). The DCTCP algorithm also specifies that packets may be marked in
a network switch 320, 322 whenever the instantaneous queue length is greater than
threshold K and how threshold value K should be determined (for example, equations (3)
and (4)). The invention also explains how the transmission rate controller 342 can be
incorporated into the application 340 or a communication library used by the application.
Prior work known as Random Early Drop (RED) or Random Early Marking
(REM) operates by the network switches computing a smoothed estimate of the length of
the packet queue. When the smoothed queue length is greater than a value minimum-threshold
and less than a value maximum-threshold and a packet arrives at the queue, the
packet is dropped (in RED) or marked (in REM) with a probability computed as
max-drop-rate times (current smoothed queue length - minimum-threshold) divided by
(maximum-threshold - minimum-threshold), where max-drop-rate, minimum-threshold
and maximum-threshold are parameters that must be provided to the
switch. The key difference between RED, REM, and their variants (for example, PI and
other forms of Active Queue Management) is that in those systems the essential part of
the rate controller that determines when a sender's congestion window should be cut is
located on the switch where it does not have any per-flow information and so must use
probabilistic formulas like those in this paragraph (when the sender cuts its congestion
window, it is always by a factor of 2). As a result, the controller is largely ineffective in
practice. In the invention, the transmission rate controller 342 is located on the sender 300
or on the receiver 310 where it can associate the congestion notifications with a particular
flow and track the congestion information for each flow over time.
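For contrast with the per-flow controller described above, the RED/REM rule of the preceding paragraph can be sketched as follows; parameter names follow the text, and the clamping at the two thresholds is an assumption about behavior outside the stated range:

    # Sketch of the prior-art RED/REM probabilistic drop/mark rule.
    def red_drop_probability(avg_queue, min_threshold, max_threshold,
                             max_drop_rate):
        if avg_queue < min_threshold:
            return 0.0
        if avg_queue >= max_threshold:
            return 1.0
        return max_drop_rate * (avg_queue - min_threshold) / (
            max_threshold - min_threshold)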
The invention has been shown and described in connection with data center
applications. However, the invention is not limited to data center applications and may be
utilized in other computer networks, such as Wide Area Networks (WAN). Although the
sender 300 has been described as estimating and updating a measure of congestion on the
network connection, it will be understood that these operations can be performed by the
receiver 310, with the result sent to the sender 300 for adjusting the transmission rate.
With reference to Fig. 10, an exemplary system for implementing the invention
includes a computing device, such as computing device 1000. In its most basic
configuration, computing device 1000 typically includes at least one processing unit 1002
and memory 1004. Depending on the exact configuration and type of computing device,
memory 1004 may be volatile (such as RAM), non-volatile (such as ROM, flash memory,
etc.) or some combination of the two. This most basic configuration is illustrated in
Fig. 10 by dashed line 1006. Additionally, device 1000 may also have additional
features/functionality. For example, device 1000 may also include additional storage
(removable and/or non-removable) including, but not limited to, magnetic or optical disks
or tape. Such additional storage is illustrated in Fig. 10 by removable storage 1008 and
non-removable storage 1010. Computer storage media includes volatile and nonvolatile,
removable and non-removable media implemented in any method or technology for
storage of information such as computer readable instructions, data structures, program
modules or other data. Memory 1004, removable storage 1008 and non-removable storage
1010 are all examples of computer storage media. Computer storage media includes, but is
not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM,
digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic
tape, magnetic disk storage or other magnetic storage devices, or any other medium which
can be used to store the desired information and which can be accessed by device 1000. Any
such computer storage media may be part of device 1000.
Device 1000 may also contain communications connection(s) 1012 that allow the
device to communicate with other devices. Device 1000 may also include input device(s)
1014 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output
device(s) 1016 such as a display, speakers, printer, etc. may also be included. All these
devices are well known in the art and need not be discussed at length.
Having thus described several aspects of at least one embodiment of this invention,
it is to be appreciated that various alterations, modifications, and improvements will readily
occur to those skilled in the art. Such alterations, modifications, and improvements are
intended to be part of this disclosure, and are intended to be within the spirit and scope of
the invention. Accordingly, the foregoing description and drawings are by way of
example only.
CLAIMS
1. A method for controlling congestion on a network connection between a
first computing device and a second computing device, comprising:
transmitting a set of data packets on the network connection from the first
computing device to the second computing device;
identifying each data packet in the set of data packets that experienced congestion
on the network connection;
sending, by the second computing device to the first computing device, a sequence
of bits that represents the number of data packets in the set of data packets that were
identified as having experienced congestion; and
adjusting a rate of transmitting data packets on the network connection based on
the sequence of bits sent to the first computing device.
2. A method as defined in claim 1, wherein sending includes sending a packet
whenever the second computing device receives a packet identified as having experienced
congestion after receiving a packet identified as not having experienced congestion.
3. A method as defined in claim 1, wherein identifying each data packet in the
set of data packets that experienced congestion comprises the second computing device
estimating a queue length at a last hop network device before the second computing device
and identifying data packets as having experienced congestion when the queue length is
greater than a threshold value over a period of time.
4. A method as defined in claim 3, wherein estimating the queue length
comprises determining utilization of the link to the second computing device over a period
of time and identifying data packets as having experienced congestion when the
determined utilization is above a threshold for a period of time.
5. A method as defined in claim 1, wherein identifying includes marking the
transmitted data packets if a queue size in a device on the network connection exceeds a
predetermined, single value threshold K and notifying the first computing device of each
marked data packet.
6. A method as defined in claim 5, further comprising estimating a measure of
congestion on the network connection by determining, based on the sequence of bits, a
fraction of data packets in the set of transmitted data packets that were identified as having
experienced congestion and wherein adjusting includes adjusting the rate at which packets
are transmitted based on the fraction of marked data packets in the set of transmitted data
packets.
7. A method as defined in claim 6, wherein adjusting includes adjusting the
rate at which packets are transmitted proportional to the fraction of marked data packets in
the set of transmitted data packets.
8. A method as defined in claim 6, wherein estimating includes updating the
measure of congestion once per round-trip-time.
9. A method as defined in claim 6, wherein adjusting includes reducing a
length of a data packet transmission window by a factor of (1 - α/2), wherein α is a
smoothed estimate of the fraction of marked data packets.
10. A method as defined in claim 6, wherein adjusting comprises the second
computing device controlling the amount of data it receives by adjusting the size of a
receiver advertised window.
11. A method as defined in claim 6, wherein estimating includes updating an
estimated fraction of marked packets in a data packet transmission window for each data
packet transmission window.
12. A method as defined in claim 6, wherein estimating includes updating an
estimated measure of congestion in accordance with:

$\alpha_{i+1} = (1 - g)\alpha_i + g F$

where α is a smoothed estimate of the fraction of marked data packets, F is the fraction of
marked data packets in the last set of data packets, and g is the weight given to new
samples with respect to past samples in estimating α.
13. A method as defined in claim 5, further comprising determining threshold
K to be an amount of buffer space required to support full utilization of the network
connection plus an amount of buffer space required to hold the largest burst size sent by
the first computing device.
14. A method as defined in claim 13, wherein the amount of buffer space
required to support full utilization of the network connection is determined as C × RTT/7,
where C is the capacity of the network connection in packets per second, RTT is round trip
time in seconds and threshold K is in packets.
15. A non-transitory computer-readable storage medium coded with computer-readable
instructions that, when executed by a computing device, perform a method as
defined in claim 1.

Documents

Orders

Section Controller Decision Date

Application Documents

# Name Date
1 6954-CHENP-2012 POWER OF ATTORNEY 08-08-2012.pdf 2012-08-08
2 6954-CHENP-2012 FORM-5 08-08-2012.pdf 2012-08-08
3 6954-CHENP-2012 FORM-3 08-08-2012.pdf 2012-08-08
4 6954-CHENP-2012 FORM-2 FIRST PAGE 08-08-2012.pdf 2012-08-08
5 6954-CHENP-2012 FORM-1 08-08-2012.pdf 2012-08-08
6 6954-CHENP-2012 DRAWINGS 08-08-2012.pdf 2012-08-08
7 6954-CHENP-2012 DESCRIPTION (COMPLETE) 08-08-2012.pdf 2012-08-08
8 6954-CHENP-2012 CORREPONDENCE OTHERS 08-08-2012.pdf 2012-08-08
9 6954-CHENP-2012 CLAIMS SIGNATURE LAST PAGE 08-08-2012.pdf 2012-08-08
10 6954-CHENP-2012 CLAIMS 08-08-2012.pdf 2012-08-08
11 6954-CHENP-2012 PCT PUBLICATION 08-08-2012.pdf 2012-08-08
12 6954-CHENP-2012.pdf 2012-08-10
13 6954-CHENP-2012 CORRESPONDENCE OTHERS 30-01-2013.pdf 2013-01-30
14 6954-CHENP-2012 FORM-3 30-01-2013.pdf 2013-01-30
15 abstract6954-CHENP-2012.jpg 2013-10-24
16 Form-18(Online).pdf 2014-02-17
17 6954-CHENP-2012 FORM-6 28-02-2015.pdf 2015-02-28
18 FORM-6-1701-1800(KONPAL).92.pdf 2015-03-13
19 MS to MTL Assignment.pdf 2015-03-13
20 MTL-GPOA - KONPAL.pdf 2015-03-13
21 6954-CHENP-2012-FER.pdf 2018-06-28
22 6954-CHENP-2012-ABSTRACT [28-12-2018(online)].pdf 2018-12-28
23 6954-CHENP-2012-CLAIMS [28-12-2018(online)].pdf 2018-12-28
24 6954-CHENP-2012-COMPLETE SPECIFICATION [28-12-2018(online)].pdf 2018-12-28
25 6954-CHENP-2012-CORRESPONDENCE [28-12-2018(online)].pdf 2018-12-28
26 6954-CHENP-2012-FER_SER_REPLY [28-12-2018(online)].pdf 2018-12-28
27 6954-CHENP-2012-OTHERS [28-12-2018(online)].pdf 2018-12-28
28 6954-CHENP-2012-Correspondence to notify the Controller [14-07-2021(online)].pdf 2021-07-14
29 6954-CHENP-2012-Written submissions and relevant documents [02-09-2021(online)].pdf 2021-09-02
30 6954-CHENP-2012-US(14)-HearingNotice-(HearingDate-20-08-2021).pdf 2021-10-17
31 6954-CHENP-2012-IntimationOfGrant27-10-2021.pdf 2021-10-27
32 6954-CHENP-2012-PatentCertificate27-10-2021.pdf 2021-10-27
33 6954-CHENP-2012-RELEVANT DOCUMENTS [15-09-2023(online)].pdf 2023-09-15

Search Strategy

1 searchstrategy_30-05-2018.pdf

ERegister / Renewals

3rd: 28 Dec 2021

From 21/02/2013 - To 21/02/2014

4th: 28 Dec 2021

From 21/02/2014 - To 21/02/2015

5th: 28 Dec 2021

From 21/02/2015 - To 21/02/2016

6th: 28 Dec 2021

From 21/02/2016 - To 21/02/2017

7th: 28 Dec 2021

From 21/02/2017 - To 21/02/2018

8th: 28 Dec 2021

From 21/02/2018 - To 21/02/2019

9th: 28 Dec 2021

From 21/02/2019 - To 21/02/2020

10th: 28 Dec 2021

From 21/02/2020 - To 21/02/2021

11th: 28 Dec 2021

From 21/02/2021 - To 21/02/2022

12th: 28 Dec 2021

From 21/02/2022 - To 21/02/2023

13th: 06 Jan 2023

From 21/02/2023 - To 21/02/2024

14th: 19 Feb 2024

From 21/02/2024 - To 21/02/2025

15th: 17 Feb 2025

From 21/02/2025 - To 21/02/2026