Summary: AWS said on Thursday that a mistyped command caused the hours-long Amazon Web Services outage. At 9:37 AM Pacific Standard Time, an authorized team member, following an established playbook, executed a command intended to remove a small number of servers from one of the subsystems used by the billing process.
AWS explained how the S3 storage service in its sprawling US-EAST-1 region came to be disrupted, and what it is doing to keep it from happening again.
AWS said on Thursday that a mistyped command caused the hours-long Amazon Web Services (AWS) outage that knocked prominent websites offline on Tuesday and caused problems for several others.
The cloud infrastructure provider gave the following explanation:
The Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to progress more slowly than expected. At 9:37 AM Pacific Standard Time (PST), an authorized S3 team member, following an established playbook, executed a command intended to remove a small number of servers from one of the S3 subsystems used by the S3 billing process. Unfortunately, one character of the command was entered incorrectly, and a much larger set of servers was removed than intended.
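AWS has not published the actual command, playbook, or host-naming scheme, so the Python sketch below is purely hypothetical. It only illustrates the failure mode: when a removal tool selects servers by pattern, dropping a single character can widen the selection from a few billing hosts to most of the fleet.

    # Hypothetical illustration only -- not AWS's real tooling or host names.
    from fnmatch import fnmatch

    # Toy fleet: a handful of billing hosts plus much larger index/placement fleets.
    FLEET = ([f"s3-billing-{i:02d}" for i in range(8)]
             + [f"s3-index-{i:03d}" for i in range(500)]
             + [f"s3-placement-{i:03d}" for i in range(200)])

    def select_for_removal(pattern: str) -> list[str]:
        """Servers a capacity-removal command with this pattern would take out."""
        return [host for host in FLEET if fnmatch(host, pattern)]

    print(len(select_for_removal("s3-b*")))  # intended input: the 8 billing servers
    print(len(select_for_removal("s3-*")))   # one dropped letter: all 708 servers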
The error inadvertently took out servers underpinning two subsystems that are critical to every S3 object in the US-EAST-1 region, a sprawling data-center region that also happens to be Amazon's oldest. Both systems required a full restart, a process that, together with running the necessary safety checks, Amazon noted "took longer than expected."
While the restart was under way, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were also affected, including the S3 console, new instance launches for Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.
Amazon noted that the index subsystem was fully recovered by 1:18 PM, and the placement subsystem returned to normal at 1:54 PM. At that point, S3 was operating normally.
AWS said that as a result of this event it is "making several changes," including measures to keep an incorrect input from triggering this kind of problem in the future.
The official blog explains: "While removal of capacity is a key operational practice, in this instance the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
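The post describes the fix only in general terms: removal is paced more slowly, and capacity can no longer be taken below a minimum floor. A minimal sketch of that kind of guardrail might look like the hypothetical helper below; the subsystem names, floors, and pacing interval are assumptions, not AWS's implementation.

    # Hypothetical guardrail sketch -- floors and pacing are made up.
    import time

    MIN_REQUIRED = {"index": 400, "placement": 150, "billing": 4}   # assumed floors
    CAPACITY     = {"index": 500, "placement": 200, "billing": 8}   # assumed fleet sizes

    def remove_capacity(subsystem: str, count: int, pause_s: float = 30.0) -> None:
        """Remove servers one at a time, refusing to go below the minimum floor."""
        floor = MIN_REQUIRED[subsystem]
        for _ in range(count):
            if CAPACITY[subsystem] - 1 < floor:
                raise RuntimeError(
                    f"refusing to take {subsystem} below its minimum of {floor} servers")
            CAPACITY[subsystem] -= 1      # decommission exactly one server
            time.sleep(pause_s)           # pace removals so a mistake surfaces early

    remove_capacity("billing", 2, pause_s=0)      # fine: 8 servers -> 6
    try:
        remove_capacity("index", 200, pause_s=0)  # stops at the floor of 400
    except RuntimeError as err:
        print(err)

Slower removal buys time to notice and abort a bad command; the floor check turns the worst case into a refused operation rather than a region-wide restart.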
Other notable steps AWS has already taken: it has begun work on partitioning the index subsystem into smaller cells, and it has changed the administration console of the AWS Service Health Dashboard so that the dashboard can run across multiple AWS regions. Ironically, Tuesday's typo took down the dashboard itself, leaving AWS to fall back on Twitter to keep customers informed about the problem.
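Splitting a big subsystem into cells mainly limits how much of the service one failure, or one bad command, can reach, and it makes each piece small enough to restart and test on its own. The toy Python below sketches the idea; the cell count and the hashing scheme are illustrative assumptions, not S3's design.

    # Toy illustration of cell-based partitioning; cell count and routing are assumed.
    import hashlib

    NUM_CELLS = 16

    def cell_for(object_key: str) -> int:
        """Deterministically map an object key to one of NUM_CELLS index cells."""
        digest = hashlib.sha256(object_key.encode()).hexdigest()
        return int(digest, 16) % NUM_CELLS

    def affected_keys(keys: list[str], failed_cell: int) -> list[str]:
        """Only keys routed to the failed cell lose service; the rest are untouched."""
        return [key for key in keys if cell_for(key) == failed_cell]

    keys = [f"bucket/object-{i}" for i in range(10_000)]
    down = affected_keys(keys, failed_cell=3)
    print(f"{len(down)} of {len(keys)} keys affected")   # roughly 1/16 of the keys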
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
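The dependency chain in the paragraph above can be sketched in a few lines. The subsystem names follow the post; the flags and the check itself are purely illustrative.

    # Illustrative sketch of which request types need which subsystem.
    INDEX_UP = False        # index subsystem: metadata + location of every object
    PLACEMENT_UP = False    # placement subsystem: allocates storage for new objects

    def can_serve(request: str) -> bool:
        if request in ("GET", "LIST", "DELETE"):
            return INDEX_UP                     # these need only the index subsystem
        if request == "PUT":
            return INDEX_UP and PLACEMENT_UP    # PUT also needs the placement subsystem
        raise ValueError(request)

    # During the restart both subsystems are down, so every API is unavailable.
    print({r: can_serve(r) for r in ("GET", "LIST", "PUT", "DELETE")})

    # Once the index subsystem recovers, GET/LIST/DELETE return before PUT,
    # which still has to wait for the placement subsystem.
    INDEX_UP = True
    print({r: can_serve(r) for r in ("GET", "LIST", "PUT", "DELETE")})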
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
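AWS does not describe how the reworked SHD console is deployed, so the sketch below only shows the general pattern: publish status through whichever region is still healthy, so the dashboard no longer depends on the region having the incident. The region list and writer function are hypothetical.

    # Hypothetical sketch of a status publisher with a multi-region fallback.
    REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]   # assumed deployment regions

    def publish_status(message: str, write_to_region) -> str:
        """Try each region in turn; return the one that accepted the update."""
        last_error = None
        for region in REGIONS:
            try:
                write_to_region(region, message)
                return region
            except ConnectionError as err:
                last_error = err            # this region (or its S3 dependency) is down
        raise RuntimeError("no healthy region left to publish status") from last_error

    def fake_writer(region: str, message: str) -> None:
        if region == "us-east-1":           # simulate the impaired region
            raise ConnectionError("S3 dependency unavailable")
        print(f"[{region}] {message}")

    print(publish_status("Increased error rates for S3 in US-EAST-1", fake_writer))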
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.