Reporting Packet Loss

Packet loss can be caused by many things, and occasional packet loss is part of how the internet works. TCP (the protocol used by the majority of data transfer on the internet) treats packet loss as its signal to reduce its sending rate.

Severe or even moderate packet loss in real-time applications such as voice is bad for quality and should be minimised. RISE has built its core network in such a way that packet loss is extremely rare. While packet loss occurring on the core network is considered a fault condition, diagnosis is not as simple as sending a "ping."

Why isn't ping a reliable method of measuring packet loss?

Firstly, a ping application sends ping packets to its destination, and a lost ping can have been dropped anywhere along the whole path between the source computer and the destination host. If your computer is in the Philippines and the destination host is in the USA, there are thousands of kilometres of fibre and hundreds of network devices in this path. The ping application does not indicate where along this path the loss occurs.

The second reason ping isn't a reliable measure of loss is that on most platforms, responding to ICMP Echo Request packets (ping) is considered a low-priority task, and on core routers in particular it is ratelimited as a matter of best practice. This ensures it cannot be used as a Denial of Service vector.

How should I measure packet loss?

The best and easiest way is to use a tool called MTR. It is available through the standard package managers on Linux and Mac OS X, and for Windows from the WinMTR project.

MTR shows loss rates to each hop along the path to the destination. It is still subject to the fact that ICMP is treated as low priority and ratelimited by core routers; however, because it shows each hop along the path, the impact of this can be limited.

When you perform an MTR, send 30-100 packets and make sure you are on a wired connection. Wireless networks can be subject to many factors which generate packet loss, and as such are not suitable for measuring packet loss outside of the wireless network.
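
As a rough illustration of the advice above, the sketch below builds and runs a report-mode MTR from Python on Linux or Mac OS X. It assumes the command-line mtr tool is installed and on your PATH; the function names and the default of 50 packets are our own choices for the example, and example.com is only a placeholder destination. WinMTR is a graphical tool and is not covered by this sketch.

```python
import subprocess


def build_mtr_command(host, cycles=50):
    """Return the argument list for a report-mode MTR run (hypothetical helper)."""
    if not 30 <= cycles <= 100:
        raise ValueError("send between 30 and 100 packets per hop")
    # --report        print a summary after the run instead of the live display
    # --report-wide   do not truncate long hostnames in the report
    # --report-cycles number of packets sent to each hop
    return ["mtr", "--report", "--report-wide",
            "--report-cycles", str(cycles), host]


def run_mtr(host, cycles=50):
    """Run MTR and return its text report. Requires mtr to be installed."""
    result = subprocess.run(build_mtr_command(host, cycles),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling run_mtr("example.com") from a machine on a wired connection returns the report as text, which you can paste into a support ticket.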

A clean MTR trace will look something like this:

HOST: rise-test                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.129.2.1                 0.0%    20    0.2   2.0   0.2  35.0   7.8
  2.|-- 10.129.4.12                0.0%    20    0.3   0.3   0.3   0.4   0.0
  3.|-- 43-226-7-25.static.rise.a  0.0%    20    0.6   1.1   0.4   6.3   1.6
  4.|-- v10.2-0-0.rcbc-cor2.rise.  0.0%    20   16.8  16.8  16.7  16.9   0.0
  5.|-- unknown.telstraglobal.net  0.0%    20   19.7  17.8  17.5  19.7   0.5
  6.|-- i-0-3-0-0.phpn-core02.bi.  0.0%    20   31.9  20.4  18.3  31.9   2.7
  7.|-- i-0-0-4-0.siko-core03.bi.  0.0%    20   81.1  82.7  80.9  84.6   1.0
  8.|-- 4637.tyo.equinix.com       0.0%    20   82.2  83.4  81.6  87.8   1.2
  9.|-- 15169.tyo.equinix.com      0.0%    20   80.7  81.5  80.6  95.5   3.3
 10.|-- 216.239.54.11              0.0%    20   81.4  81.4  81.3  81.5   0.0
 11.|-- 209.85.255.33              0.0%    20   81.8  81.8  81.7  82.3   0.0
 12.|-- google-public-dns-a.googl  0.0%    20   81.1  81.2  81.0  81.5   0.0

Note that in this trace, the first two hops are internal network hops, and this is typical. If the first few hops of your MTR have private IP addresses (those starting with 10., with 172.16. through 172.31., or with 192.168.), these network hops are internal to your organisation. If you see packet loss at these hops, please contact your organisation's network administrator, as this is an internal fault.
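
If you want to check hop addresses programmatically, the following sketch uses Python's standard ipaddress module to test whether a hop falls inside one of the three private IPv4 ranges defined in RFC 1918. The function name is_internal_hop is our own for illustration, and the sketch assumes that private addressing is how your organisation numbers its internal network.

```python
import ipaddress

# The three private IPv4 ranges defined in RFC 1918.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.x.x through 172.31.x.x
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_internal_hop(address):
    """Return True if the hop address is in a private (RFC 1918) range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_NETWORKS)
```

For the clean trace above, is_internal_hop("10.129.2.1") is True while is_internal_hop("216.239.54.11") is False. Note that an address such as 172.32.0.1 or 192.0.2.1 is not private, even though it starts with 172 or 192.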

Interpreting an MTR trace

Once you have an MTR trace, you need to interpret the information. There are a couple of "gotchas" when interpreting an MTR. An example for discussion is shown below (taken from the Wikipedia article on MTR):

                             My traceroute  [v0.71]
            example.lan                           Sun Mar 25 00:07:50 2007

                                       Packets                Pings
Hostname                            %Loss  Rcv  Snt  Last Best  Avg  Worst
 1. example.lan                        0%   11   11     1    1    1      2
 2. ae-31-51.ebr1.Chicago1.Level3.n   19%    9   11     3    1    7     14
 3. ae-1.ebr2.Chicago1.Level3.net      0%   11   11     7    1    7     14
 4. ae-2.ebr2.Washington1.Level3.ne   19%    9   11    19   18   23     31
 5. ae-1.ebr1.Washington1.Level3.ne   28%    8   11    22   18   24     30
 6. ge-3-0-0-53.gar1.Washington1.Le    0%   11   11    18   18   20     36
 7. 63.210.29.230                      0%   10   10    19   19   19     19
 8. t-3-1.bas1.re2.yahoo.com           0%   10   10    19   18   32    106
 9. p25.www.re2.yahoo.com              0%   10   10    19   18   19     19

This MTR illustrates an earlier point made about ICMP being ratelimited on core routers: hops 2, 4 and 5 appear to be experiencing significant loss. It is important to note that this would not impact customer traffic, only the ICMP replies generated by the router itself. This is in fact considered a "clean" MTR. We determine that the loss is related to an ICMP ratelimit because it does not continue through to the last hop. If there were actually 19% loss occurring at hop 2, we would see it at hop 3 and at every hop through to hop 9. Instead, the MTR shows no loss for hops 7 through 9.

To reiterate, what we are looking for is loss that starts at a specific hop, then is present at every hop thereafter.
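
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name first_persistent_loss_hop is ours, the input is assumed to be the Loss% column of a trace listed in hop order, and the sketch treats any loss above zero as loss.

```python
def first_persistent_loss_hop(loss_by_hop):
    """Return the 1-based hop number where loss begins and then continues
    at every hop through to the final one, or None if there is no such hop.

    loss_by_hop: loss percentages in hop order, e.g. [0, 19, 0, 28].
    """
    start = None
    # Walk backwards from the final hop while every hop still shows loss.
    for hop in range(len(loss_by_hop), 0, -1):
        if loss_by_hop[hop - 1] > 0:
            start = hop
        else:
            break
    return start


# Loss column of the example trace above: loss at hops 2, 4 and 5 only.
wikipedia_example = [0, 19, 0, 19, 28, 0, 0, 0, 0]
# A made-up trace where loss starts at hop 4 and persists to the end.
hypothetical_fault = [0, 0, 0, 12, 10, 15, 11]
```

Here first_persistent_loss_hop(wikipedia_example) returns None, matching the "clean" reading of that trace, while first_persistent_loss_hop(hypothetical_fault) returns 4.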

Loss due to service congestion

The most common cause of packet loss we see is a service that is congested due to overuse. For example, if you have purchased a 20Mb service but your office demands more than 20Mb, the excess packets will be dropped.
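
To get a feel for the numbers, here is a deliberately simplified sketch. It assumes a constant offered load and that "20Mb" means 20 megabits per second. Real traffic, TCP in particular, backs off when packets are dropped, so observed loss will usually be lower than this estimate. The function name is our own.

```python
def expected_drop_fraction(demand_mbps, service_mbps):
    """Estimate the fraction of traffic dropped when steady demand exceeds
    the purchased service rate (simplified model, not a measurement)."""
    if demand_mbps <= service_mbps:
        return 0.0
    return (demand_mbps - service_mbps) / demand_mbps
```

Under these assumptions, a steady 25 megabits per second of demand on a 20Mb service gives expected_drop_fraction(25, 20), which is 5 / 25, or roughly 20% of packets dropped.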

In an MTR this will appear as loss starting at a hop with a hostname of the format 43-226-7-25.rise.as, where "43-226-7-25" is a representation of your RISE service gateway IP.

If you are seeing loss at this hop, please conduct an isolation test to rule out congestion as a cause. If you are still seeing loss at this hop while conducting the isolation test, please raise a ticket with RISE Support as you may have a line issue.

For more information on how to monitor for congestion, check the page Bandwidth Graphing. For more information on why packets are dropped during congestion, check the page How Your RISE Service is Ratelimited.

When should I raise a ticket with RISE?

If, having read the above, you have ruled out loss caused by your internal network and by service congestion, and you see loss that starts at a specific hop in the MTR and is then present at every hop thereafter, please raise a ticket with RISE Support, including the MTR traces that demonstrate the loss.

Note that RISE controls a relatively small part of the global Internet, and cannot control how other providers operate their networks. If loss is seen outside of the RISE network we will do our utmost to resolve the issue by engaging with the other provider, but will be subject to their processes.