Summary: AWS said on Thursday that a mistyped command caused the hours-long Amazon Web Services outage. At 9:37 AM Pacific Standard Time, an authorized team member, following an established playbook, executed a command intended to remove a small number of servers from one of the subsystems used by the billing process.
AWS explained how the S3 storage service in its sprawling US-EAST-1 region came to be disrupted, and what it is doing to keep it from happening again.
AWS said on Thursday that a mistyped command caused the hours-long Amazon Web Services (AWS) outage that knocked prominent websites offline on Tuesday and caused problems for several others.
The cloud infrastructure provider gave the following explanation:
The Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to progress more slowly than expected. At 9:37 AM Pacific Standard Time (PST), an authorized S3 team member, following an established playbook, executed a command intended to remove a small number of servers from one of the S3 subsystems used by the S3 billing process. Unfortunately, one character of the command was entered incorrectly, and a much larger set of servers was removed than intended.
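AWS has not published the actual command, playbook, or host-naming scheme, so the Python sketch below is purely hypothetical. It only illustrates the failure mode: when a removal tool selects servers by pattern, dropping a single character can widen the selection from a few billing hosts to most of the fleet.

    # Hypothetical illustration only -- not AWS's real tooling or host names.
    from fnmatch import fnmatch

    # Toy fleet: a handful of billing hosts plus much larger index/placement fleets.
    FLEET = ([f"s3-billing-{i:02d}" for i in range(8)]
             + [f"s3-index-{i:03d}" for i in range(500)]
             + [f"s3-placement-{i:03d}" for i in range(200)])

    def select_for_removal(pattern: str) -> list[str]:
        """Servers a capacity-removal command with this pattern would take out."""
        return [host for host in FLEET if fnmatch(host, pattern)]

    print(len(select_for_removal("s3-b*")))  # intended input: the 8 billing servers
    print(len(select_for_removal("s3-*")))   # one dropped letter: all 708 servers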
The error inadvertently took out servers underpinning two subsystems that are critical to every S3 object in the US-EAST-1 region, a sprawling data-center region that also happens to be Amazon's oldest. Both systems required a full restart, a process that, together with running the necessary safety checks, Amazon noted "took longer than expected."
While the restart was under way, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were also affected, including the S3 console, new instance launches for Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.
Amazon noted that the index subsystem was fully recovered by 1:18 PM, and the placement subsystem returned to normal at 1:54 PM. At that point, S3 was operating normally.
AWS said that as a result of this event it is "making several changes," including measures to keep an incorrect input from triggering this kind of problem in the future.
The official blog explains: "While removal of capacity is a key operational practice, in this instance the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
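The post describes the fix only in general terms: removal is paced more slowly, and capacity can no longer be taken below a minimum floor. A minimal sketch of that kind of guardrail might look like the hypothetical helper below; the subsystem names, floors, and pacing interval are assumptions, not AWS's implementation.

    # Hypothetical guardrail sketch -- floors and pacing are made up.
    import time

    MIN_REQUIRED = {"index": 400, "placement": 150, "billing": 4}   # assumed floors
    CAPACITY     = {"index": 500, "placement": 200, "billing": 8}   # assumed fleet sizes

    def remove_capacity(subsystem: str, count: int, pause_s: float = 30.0) -> None:
        """Remove servers one at a time, refusing to go below the minimum floor."""
        floor = MIN_REQUIRED[subsystem]
        for _ in range(count):
            if CAPACITY[subsystem] - 1 < floor:
                raise RuntimeError(
                    f"refusing to take {subsystem} below its minimum of {floor} servers")
            CAPACITY[subsystem] -= 1      # decommission exactly one server
            time.sleep(pause_s)           # pace removals so a mistake surfaces early

    remove_capacity("billing", 2, pause_s=0)      # fine: 8 servers -> 6
    try:
        remove_capacity("index", 200, pause_s=0)  # stops at the floor of 400
    except RuntimeError as err:
        print(err)

Slower removal buys time to notice and abort a bad command; the floor check turns the worst case into a refused operation rather than a region-wide restart.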
Other notable steps AWS has already taken: it has begun work on partitioning the index subsystem into smaller cells, and it has changed the administration console of the AWS Service Health Dashboard so that the dashboard can run across multiple AWS regions. Ironically, Tuesday's typo took down the dashboard itself, leaving AWS to fall back on Twitter to keep customers informed about the problem.
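Splitting a big subsystem into cells mainly limits how much of the service one failure, or one bad command, can reach, and it makes each piece small enough to restart and test on its own. The toy Python below sketches the idea; the cell count and the hashing scheme are illustrative assumptions, not S3's design.

    # Toy illustration of cell-based partitioning; cell count and routing are assumed.
    import hashlib

    NUM_CELLS = 16

    def cell_for(object_key: str) -> int:
        """Deterministically map an object key to one of NUM_CELLS index cells."""
        digest = hashlib.sha256(object_key.encode()).hexdigest()
        return int(digest, 16) % NUM_CELLS

    def affected_keys(keys: list[str], failed_cell: int) -> list[str]:
        """Only keys routed to the failed cell lose service; the rest are untouched."""
        return [key for key in keys if cell_for(key) == failed_cell]

    keys = [f"bucket/object-{i}" for i in range(10_000)]
    down = affected_keys(keys, failed_cell=3)
    print(f"{len(down)} of {len(keys)} keys affected")   # roughly 1/16 of the keys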
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
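The dependency chain in the paragraph above can be sketched in a few lines. The subsystem names follow the post; the flags and the check itself are purely illustrative.

    # Illustrative sketch of which request types need which subsystem.
    INDEX_UP = False        # index subsystem: metadata + location of every object
    PLACEMENT_UP = False    # placement subsystem: allocates storage for new objects

    def can_serve(request: str) -> bool:
        if request in ("GET", "LIST", "DELETE"):
            return INDEX_UP                     # these need only the index subsystem
        if request == "PUT":
            return INDEX_UP and PLACEMENT_UP    # PUT also needs the placement subsystem
        raise ValueError(request)

    # During the restart both subsystems are down, so every API is unavailable.
    print({r: can_serve(r) for r in ("GET", "LIST", "PUT", "DELETE")})

    # Once the index subsystem recovers, GET/LIST/DELETE return before PUT,
    # which still has to wait for the placement subsystem.
    INDEX_UP = True
    print({r: can_serve(r) for r in ("GET", "LIST", "PUT", "DELETE")})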
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
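AWS does not describe how the reworked SHD console is deployed, so the sketch below only shows the general pattern: publish status through whichever region is still healthy, so the dashboard no longer depends on the region having the incident. The region list and writer function are hypothetical.

    # Hypothetical sketch of a status publisher with a multi-region fallback.
    REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]   # assumed deployment regions

    def publish_status(message: str, write_to_region) -> str:
        """Try each region in turn; return the one that accepted the update."""
        last_error = None
        for region in REGIONS:
            try:
                write_to_region(region, message)
                return region
            except ConnectionError as err:
                last_error = err            # this region (or its S3 dependency) is down
        raise RuntimeError("no healthy region left to publish status") from last_error

    def fake_writer(region: str, message: str) -> None:
        if region == "us-east-1":           # simulate the impaired region
            raise ConnectionError("S3 dependency unavailable")
        print(f"[{region}] {message}")

    print(publish_status("Increased error rates for S3 in US-EAST-1", fake_writer))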
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.