Introduction
Info
The Optus 2023 outage was a major network failure that affected about 10 million
Optus customers on November 8, 2023, disrupting phone calls, Internet access, and
access to emergency services.
Root Cause
Note
According to Optus, at around 4.05am, the Optus network underwent a routine
software upgrade. This upgrade led to changes in routing information from an
international peering network. These changes propagated through multiple layers
of the Optus network and exceeded the preset safety levels on key routers.
Unable to handle the overload, these routers disconnected from the Optus IP Core
network to protect themselves.
In other words, a flood of routing updates that originated with an external party pushed individual routers past their configured limits and took them offline. The trigger does not have to be external: a simple typo in a “route map” that redistributes routes between internal networks can overload routers in the same way.
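To make the self-protection behaviour concrete, here is a minimal Python sketch of a router session with a maximum-prefix limit. It is only an illustration: the class name `PeerSession`, the limit of 3, and the example prefixes are hypothetical, and treating the "preset safety levels" as a prefix-count limit is an assumption rather than something the Optus statement spells out.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Toy model of a routing session guarded by a maximum-prefix limit."""

    name: str
    max_prefixes: int  # hypothetical preset safety level
    prefixes: set = field(default_factory=set)
    established: bool = True

    def receive(self, prefix: str) -> bool:
        """Accept one announced prefix; shut the session if the limit is exceeded."""
        if not self.established:
            return False  # session is already down, nothing is accepted
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Self-protection: drop the session and discard what was learned from it.
            self.established = False
            self.prefixes.clear()
            return False
        return True


session = PeerSession(name="upstream-peer", max_prefixes=3)
results = [session.receive(f"10.0.{i}.0/24") for i in range(5)]
print(results)              # which announcements were accepted
print(session.established)  # whether the session survived the burst
```

In this sketch the fourth announcement exceeds the limit, so the session is torn down and the fifth is ignored. A real router applies the same idea per neighbor, which is why a sudden surge of routes can disconnect many routers at once.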
Similar Incidents
- In 2012, a routing update from a small ISP in Indonesia caused a global Internet outage for several hours. The ISP accidentally announced that it was the best path to reach millions of IP addresses, and many other ISPs accepted the update without verification. This drew a massive amount of traffic to the ISP’s routers, which crashed and disrupted Internet connectivity for many users around the world.
- In 2014, a routing update from a Chinese ISP caused a partial Internet outage in North America and Europe. The ISP mistakenly advertised more than 50,000 prefixes that belonged to other networks, and some of its peers propagated the update to their customers. This created a routing loop that caused congestion and packet loss on some links, affecting the performance of many websites and services.
- In 2018, a routing update from a Nigerian ISP caused a widespread Internet outage in Africa. The ISP erroneously announced that it was the origin of over 200,000 prefixes, and some of its upstream providers accepted the update and forwarded it to their peers. This resulted in a large amount of traffic being redirected to the ISP’s routers, which were unable to handle the load and crashed. The outage lasted for more than an hour and affected many African countries.
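The incidents above share one pattern: announcements were accepted without verification. The following Python sketch shows what a basic check on a received route could look like. The `ALLOWED` policy table, the `accept_route` function, the prefix `172.16.0.0/16`, and the AS numbers are all hypothetical values chosen for illustration; a production network would express this in router policy rather than in Python.

```python
import ipaddress
from typing import Sequence

# Hypothetical policy for one peer: (covering prefix, longest mask accepted, expected origin AS).
ALLOWED = [
    (ipaddress.ip_network("172.16.0.0/16"), 24, 64500),
]


def accept_route(prefix: str, as_path: Sequence[int]) -> bool:
    """Return True only if the route matches an entry in the peer's policy."""
    if not as_path:
        return False  # no AS path means no origin to verify
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]  # the origin AS is the last entry in the path
    for allowed_net, max_len, allowed_origin in ALLOWED:
        if (
            net.version == allowed_net.version
            and net.subnet_of(allowed_net)   # inside the expected address block
            and net.prefixlen <= max_len     # not more specific than allowed
            and origin == allowed_origin     # announced by the expected AS
        ):
            return True
    return False


print(accept_route("172.16.10.0/24", [64510, 64500]))  # expected block and origin
print(accept_route("172.16.10.0/25", [64500]))         # too specific
print(accept_route("8.8.8.0/24", [64500]))             # outside the allowed block
```

Only the first route passes. The other two are rejected because they fail the length check and the address-block check respectively, which is the kind of filtering that stops a mistaken announcement from spreading.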
Best Practices
Tip
- Implement proper routing policies and security mechanisms to filter and
verify the routing updates received from external parties.
  - Use prefix lists, access lists, or route maps to allow or deny routes based on attributes such as origin, prefix length, or AS path.
  - Use session authentication, such as BGP MD5 or IPsec, to verify the identity of the peer and the integrity of the routing updates.
- Apply route dampening and penalty mechanisms to suppress unstable or flapping
routes that can cause network instability or congestion.
  - Use the route dampening feature in BGP, which assigns a penalty to a route each time it changes state and suppresses the route when the penalty exceeds a certain threshold. The penalty decays over time, and the route is unsuppressed when the penalty falls below a second, lower threshold.
- Monitor network performance and stability, and detect any anomalies or
changes in the routing table or the BGP updates.
  - Use network management and monitoring tools, such as SNMP, NetFlow, or Syslog, and set up alerts or notifications for any unusual events or trends.
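The dampening mechanism in the list above can be sketched in a few lines of Python. This is a simplified model, not a vendor implementation: the class `DampenedRoute` is hypothetical, and the penalty of 1000 per flap, suppress threshold of 2000, reuse threshold of 750, and 900-second half-life are illustrative values assumed for the example.

```python
class DampenedRoute:
    """Simplified route flap dampening with an exponentially decaying penalty."""

    def __init__(self, penalty_per_flap=1000.0, suppress_limit=2000.0,
                 reuse_limit=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit  # suppress when penalty rises above this
        self.reuse_limit = reuse_limit        # unsuppress when penalty falls below this
        self.half_life = half_life            # seconds for the penalty to halve
        self.penalty = 0.0
        self.suppressed = False
        self.last_update = 0.0

    def _decay(self, now: float) -> None:
        """Decay the penalty for the time elapsed since the last update."""
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False

    def flap(self, now: float) -> None:
        """Record one state change at time `now` (in seconds)."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty > self.suppress_limit:
            self.suppressed = True

    def is_suppressed(self, now: float) -> bool:
        self._decay(now)
        return self.suppressed


route = DampenedRoute()
route.flap(0)
route.flap(60)
after_two_flaps = route.is_suppressed(60)      # penalty still under the suppress limit
route.flap(120)
after_three_flaps = route.is_suppressed(130)   # third flap pushes it over the limit
after_two_half_lives = route.is_suppressed(1920)  # penalty has decayed below the reuse limit
print(after_two_flaps, after_three_flaps, after_two_half_lives)
```

With these assumed values the route is suppressed after its third flap within two minutes and becomes usable again roughly half an hour later, once the penalty has decayed below the reuse threshold. Using two thresholds instead of one keeps a route from oscillating between suppressed and usable.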
Other Types of Network Outages
- Maintenance and Upgrades: Regardless of the operating system running on a server, some downtime for maintenance and upgrades is unavoidable in order to keep it secure and up to date.
- Server-specific problems: All servers run on hardware, and all hardware is susceptible to component failure. Server-specific problems are bound to happen occasionally.
- Resource Failures: Anything related to your network’s infrastructure can be considered a network resource. This includes hardware like routers and switches, services like DHCP and DNS servers, or anything else that keeps the data flowing on your network.
- Hardware failure: A network device, such as a router or switch, stops working.
- Physical damage to cables: For example, a mouse chews through a cable.
- Network congestion: Too many users generate more traffic than the network can carry.
- Power outages: Equipment loses power.
- Security attacks: For example, a denial-of-service (DoS) attack or an attempt to break into the network.
These incidents highlight the importance of rigorous software testing, careful configuration management, and robust error handling in network systems. They also underscore the need for effective communication with customers during outages.