Recent AWS S3 Outage Revealed Amazon's Weak Spot And The Danger Of Being The Biggest In The Industry
By Emily Schultz
Amazon has finally released its explanation of the cause of the big AWS S3 outage that crippled a significant portion of the internet for several hours. The cloud computing giant made its statement this week.
Amazon Blames Human Error
According to GeekWire, which got the full details of the story, Amazon Web Services (AWS) says human error was to blame for the massive AWS S3 outage that brought down a large portion of the internet for several hours and affected thousands of AWS customers on Tuesday, February 28. The internet giant released an official statement this week, currently posted on its Amazon Web Services website, addressing the cause of the outage.
AWS S3 (Simple Storage Service), which provides backend services and support for thousands of websites, web and mobile applications and other cloud services, ran into technical problems on the morning of February 28, returning error messages to the AWS customers who use it.
As GeekWire reported, the AWS S3 team was working on an issue that was slowing down its billing system. At 9:37 a.m. Pacific Time, according to the Amazon blog post, an authorized S3 team member using an established playbook executed a command intended to remove a small number of servers for one of the S3 subsystems used by the S3 billing process.
Unfortunately, one of the inputs to the command was entered incorrectly. As a result of the human error, a larger set of servers was removed than intended.
The removal of those servers quickly affected other S3 subsystems in the company's US-EAST-1 data center region in Virginia, causing a widespread outage that temporarily knocked a large number of high-profile websites and applications offline, including the popular cloud collaboration service Box and the messaging platform Slack.
The outage also affected a number of other AWS services that rely on S3 for backend support, including Amazon Elastic Block Store, Amazon Elastic Compute Cloud (EC2), and AWS Lambda. It also caused the AWS service status page to stop working, creating problems for users who were still waiting for important details from the firm about the outage.
Amazon's remedy was to restart the affected systems and complete safety checks, a process that took several hours. While the systems were being restarted, S3 was unable to service requests, a total blackout for AWS customers.
AWS Makes Some Big Changes To Prevent Another Outage
As a result of the recent outage, Amazon is making several changes to its AWS systems to prevent a repeat. Chief among them is a change to the tool used in the debugging operation, which is said to prevent an incorrect input from triggering a similar widespread failure in the future.
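To make the idea of guarding a tool against a bad input concrete, here is a minimal sketch in Python. It is not Amazon's actual tool, whose internals have not been published. The names `Subsystem`, `plan_removal`, `min_required` and `max_per_run` are hypothetical, chosen only for illustration, and the two checks (a cap on servers removed per run and a floor on remaining capacity) are assumptions about what such a safeguard could look like.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class Subsystem:
    # Hypothetical model of a server pool; not an AWS data structure.
    name: str
    active_servers: int
    min_required: int  # assumed capacity floor the pool must stay above


def plan_removal(subsystem: Subsystem, requested: int, max_per_run: int = 5) -> int:
    """Validate a removal request and return the remaining server count.

    Nothing is removed here; the function only decides whether the
    request is safe enough to proceed.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive integer")

    # Guard 1: a mistyped, oversized number is rejected outright.
    if requested > max_per_run:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} servers from {subsystem.name}: "
            f"limit per run is {max_per_run}"
        )

    # Guard 2: never let the pool drop below its minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    return remaining
```

With checks like these, a request to remove 3 servers from a pool of 100 with a floor of 90 would pass, while a mistyped request for 30 would be refused before any server was touched.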
In addition to the changes in the debugging operations, Amazon is also making significant changes to improve the recovery time of key S3 subsystems. The internet giant has also made thorough checks of its other operational tools to ensure they have similar safety checks.
The recent AWS S3 incident exposed the vulnerabilities of even the mightiest cloud infrastructure and the danger of being the biggest cloud provider in the market today. The news revealed that a simple typo could bring down a giant like AWS for several hours, taking everything in its path with it. The lesson here is simple: there is no such thing as a safe place, and everyone has a weak spot. And Amazon just revealed its weak spot to the world.