Tuesday, October 5, 2021

Facebook, Instagram and WhatsApp were all down | 4th Oct 2021

Facebook, Instagram and WhatsApp could never go down, could they?



Well, on 4th Oct 2021, at around 15:50 UTC, Facebook and its affiliated services (WhatsApp, Instagram, Messenger, Mapillary and Oculus), along with Facebook sign-in on smart TVs, thermostats and other devices, became unavailable for more than 5 hours. It was the worst outage since the 2019 incident, when a technical error affected its sites for 24 hours. The downtime hit hardest on the small businesses and creators who rely on these services for their income. Technology outages are common, but having the world's largest social media apps go down at the same time was highly unusual.




What did Cloudflare notice?


Cloudflare raised an internal incident entitled “Facebook DNS lookup returning SERVFAIL”, because at first they thought something was wrong with their DNS resolver 1.1.1.1. They soon realized that something more serious was going on: Facebook, WhatsApp and Instagram were all down. The resolver could not get answers to queries asking for the IP addresses of facebook.com or instagram.com. Other Facebook IP addresses remained routed, but that wasn't useful, since without DNS Facebook and its related services were effectively unavailable.


How’s that even possible?


Many people suspected that the problem was on the BGP side, so here is what BGP is - 


BGP (Border Gateway Protocol) is one of the systems that the Internet uses to get your traffic to where it needs to go as quickly as possible. Since there are tons of different service providers, routers and servers responsible for the data, BGP is responsible for showing them the way and choosing the best route. BGP uses the Autonomous System Number (ASN) to uniquely identify each network. The Internet is a network of networks, and it's bound together by BGP. BGP allows one network to advertise its presence to the other networks that form the Internet. If a network stops advertising its routes, the rest of the Internet no longer knows how to reach it.
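To make the idea of advertising and withdrawing routes concrete, here is a minimal Python sketch. It is only a toy model, not how real BGP routers work: the `RoutingTable` class is a hypothetical name, the prefixes and ASNs are documentation-only values, and route selection is reduced to longest-prefix matching, whereas real BGP also weighs path length and policy.

```python
import ipaddress


class RoutingTable:
    """Toy view of the routes one network has learned from its neighbours."""

    def __init__(self):
        # Maps an advertised prefix to the ASN that originated it.
        self.routes = {}

    def advertise(self, prefix, origin_asn):
        """Record that origin_asn announced it can reach this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_asn

    def withdraw(self, prefix):
        """Forget a prefix after its origin stops announcing it."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin ASN for an address, or None if no route exists."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # Prefer the most specific (longest) matching prefix.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


if __name__ == "__main__":
    table = RoutingTable()
    table.advertise("192.0.2.0/24", 64500)
    print(table.lookup("192.0.2.10"))   # 64500
    table.withdraw("192.0.2.0/24")
    print(table.lookup("192.0.2.10"))   # None
```

Once the prefix is withdrawn, the lookup has nothing to match, which is roughly the position other networks are in when a network's routes disappear.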


ASN

An Autonomous System (AS) is a group of one or more IP prefixes run by one or more network operators that maintain a single, clearly-defined routing policy. Each AS is identified by a unique autonomous system number (ASN). ASNs are allocated by the Internet Assigned Numbers Authority (IANA) to the regional Internet registries, which in turn assign them to network operators. In the original 16-bit range, ASNs 1 to 64511 are available for global use (apart from a small block, 64496 to 64511, which is reserved for documentation). The 64512 to 65535 series is set aside: 64512 to 65534 for private use, and 65535 as reserved.
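The ranges above can be turned into a small lookup. This is a hedged sketch covering only the original 16-bit ASN space; `classify_asn` is a made-up helper name, and the extra special cases in the comments (0, 23456 and the 64496 to 64511 documentation block) come from the IANA special-purpose registry rather than from this post.

```python
def classify_asn(asn):
    """Classify a 16-bit ASN as 'public', 'private', 'documentation' or 'reserved'."""
    if not 0 <= asn <= 65535:
        raise ValueError("this sketch only covers 16-bit ASNs (0 to 65535)")
    # 0 and 65535 are reserved; 23456 is AS_TRANS, used for 32-bit ASN transition.
    if asn in (0, 23456, 65535):
        return "reserved"
    # Private-use range, not meant to be visible on the global Internet.
    if 64512 <= asn <= 65534:
        return "private"
    # Block set aside for examples and documentation.
    if 64496 <= asn <= 64511:
        return "documentation"
    return "public"


if __name__ == "__main__":
    for asn in (1000, 64500, 64512, 65535):
        print(asn, classify_asn(asn))
```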


Cloudflare keeps track of all the BGP updates it sees, and they saw a peak of routing changes at around 15:40 UTC. That's where they thought the trouble began.

BGP Update on Facebook

Facebook later announced that the problem began with a configuration change that affected its entire internal backbone.


 

Impact

More than 3.5 billion people around the world use Facebook, Instagram, Messenger and WhatsApp to communicate with friends and family, distribute political messaging, and expand their businesses through advertising and outreach.


As a direct impact of the outage, DNS resolvers all over the world stopped resolving the Facebook and Instagram domain names. When someone types a URL into the browser, the DNS resolver, which is responsible for translating domain names into the actual IP addresses to connect to, first checks whether it has the answer in its cache and, if so, uses it. If not, it tries to get the answer from the domain's nameservers, which are typically hosted by the entity that owns the domain.


If the nameservers are unreachable or fail to respond for some reason, a SERVFAIL is returned and the browser shows an error to the user. Consequently, 1.1.1.1, 8.8.8.8 and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.
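The lookup flow described above (cache first, then the domain's nameservers, SERVFAIL if they cannot be reached) can be sketched in a few lines of Python. This is a simplified simulation with no real network traffic: `ToyResolver`, its `reachable` flag and the example name and address are all hypothetical, and TTLs and the caching of SERVFAIL responses are left out.

```python
from dataclasses import dataclass, field

NOERROR = "NOERROR"
NXDOMAIN = "NXDOMAIN"
SERVFAIL = "SERVFAIL"


@dataclass
class ToyResolver:
    # Stands in for the records held by the domain's own nameservers.
    authoritative: dict
    # Whether those nameservers can currently be reached.
    reachable: bool = True
    cache: dict = field(default_factory=dict)

    def resolve(self, name):
        """Return a (status, ip) pair for the given domain name."""
        # 1. Answer from the cache if we already know the address.
        if name in self.cache:
            return NOERROR, self.cache[name]
        # 2. Otherwise ask the nameservers; if they are unreachable, fail.
        if not self.reachable:
            return SERVFAIL, None
        if name not in self.authoritative:
            return NXDOMAIN, None
        ip = self.authoritative[name]
        self.cache[name] = ip
        return NOERROR, ip


if __name__ == "__main__":
    resolver = ToyResolver({"example.com": "192.0.2.10"})
    resolver.reachable = False               # nameservers drop off the Internet
    print(resolver.resolve("example.com"))   # ('SERVFAIL', None)
```

A name that is still in the cache keeps resolving for a while even when the nameservers are gone, which is why an outage like this one does not hit every user at exactly the same moment.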



That's all??


No. Once the sites were inaccessible, people, who don't take no for an answer, started reloading the pages or killing and relaunching their apps. Apps don't accept an error either, and start retrying aggressively. This caused a huge increase in traffic (in number of requests) on 1.1.1.1.

Increase in queries for Facebook, WhatsApp, Messenger and Instagram

Fortunately, 1.1.1.1 was built to be free, private, fast and scalable, and the vast majority of DNS requests kept resolving in under 10ms.
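To get a feel for why aggressive retries add up, the sketch below counts how many lookups a single failing client sends in a fixed window. The numbers are purely illustrative and are not measurements from the outage; `queries_sent` and its parameters are hypothetical, and the backoff variant is shown only for comparison.

```python
def queries_sent(window_seconds, first_delay, backoff_factor=1.0, max_delay=60.0):
    """Count the lookups one failing client sends within a time window."""
    elapsed = 0.0
    delay = first_delay
    count = 1  # the initial attempt at time zero
    while elapsed + delay <= window_seconds:
        elapsed += delay
        count += 1
        # A factor of 1.0 means a fixed retry interval; above 1.0 it backs off.
        delay = min(delay * backoff_factor, max_delay)
    return count


if __name__ == "__main__":
    # Retrying every second for a minute versus doubling the wait each time.
    print(queries_sent(60, first_delay=1))                    # 61
    print(queries_sent(60, first_delay=1, backoff_factor=2))  # 6
```

Multiply the first figure by millions of phones and browsers and the jump in resolver traffic becomes easier to picture.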




Impact on other services


When Facebook becomes unreachable, people look for alternatives where they can talk and find out more about what's going on. An increase in DNS queries was seen for Signal, Twitter and other messaging and social media platforms.

Increase in traffic for Twitter, Signal, Telegram and TikTok

All over the world, Cloudflare's WARP traffic to and from Facebook's network simply disappeared.

Worldwide Impact

At around 21:17 UTC, BGP activity from Facebook peaked again, and the DNS name “facebook.com” became available once more.


In all, it was unavailable from 15:50 to 21:30 UTC. After that, Facebook was reconnected to the global Internet and DNS was working again.



 

What does Facebook have to say?


“To the huge community of people and businesses around the world who depend on us: we’re sorry,” Facebook said. “We’ve been working hard to restore access to our apps and services and are happy to report they are coming back online now.”


Facebook published a blog post on 4th Oct giving some details on what happened internally: the problem actually began with a configuration change.

Update -

Facebook published a second blog post on 5th Oct giving in-depth information.

Main cause for the outage

This event is a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together. 




References


https://blog.cloudflare.com/october-2021-facebook-outage/
https://engineering.fb.com/2021/10/04/networking-traffic/outage/
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/
