隨著國(guó)產(chǎn)數(shù)據(jù)庫(kù)Polardb的推廣及應(yīng)用,數(shù)據(jù)庫(kù)故障開始增多,今天給大家分享一篇Polardb故障應(yīng)急處理思路。10月下旬,應(yīng)用反饋某模塊登陸異常,查看數(shù)據(jù)庫(kù)有大量進(jìn)程active狀態(tài),持續(xù)時(shí)間超過(guò)30分鐘,通過(guò)kill進(jìn)程并重啟應(yīng)用無(wú)法恢復(fù),通過(guò)與業(yè)務(wù)協(xié)商、嘗試通過(guò)重啟數(shù)據(jù)庫(kù)實(shí)例恢復(fù)。
查看日志文件postgresql-10-26_024511.log,日志中存在以下日志:
ERROR: dsa_area could not attach to segment
涉及進(jìn)程pid= 51789 53315 53316,其中53316和53315進(jìn)程都退出了,日志如下:
53315 2020-10-26 03:07:11 UTC XX000 ERROR: dsa_area could not attach to segment
53316 2020-10-26 03:07:11 UTC XX000 ERROR: dsa_area could not attach to segment
......
53314 2020-10-26 03:07:11 UTC XX000 FATAL: cannot unpin a segment that is not pinned
54 2020-10-26 03:07:11 UTC 00000 LOG: background worker "parallel worker" (PID53316) exited withexit code 1
54 2020-10-26 03:07:11 UTC 00000 LOG: background worker "parallel worker" (PID53315) exitedwith exit code 1
而pid=51789的進(jìn)程沒(méi)有正常退出,很有可能是hang,并且影響到其他的進(jìn)程。
……
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:07 UTC 00000 13 0 0 0 LOG: execute S_1: COMMIT
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:09 UTC 00000 43 0 0 0 LOG: statement: BEGIN
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:09 UTC 00000 59 1 0 40 LOG: execute
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:09 UTC 00000 22 0 0 0 LOG: execute S_1: COMMIT
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:11 UTC 00000 35 0 0 0 LOG: statement: BEGIN
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:11 UTC 00000 69701 1 0 170881 LOG: execute
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:11 UTC XX000 ERROR: dsa_area could not attach to segment
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:11 UTC XX000 STATEMENT: SELECT * FROM (SELECT RET.*, ROWNUM AS FSDPRN FROM(………………)WHERE FSDPRN BETWEEN 1 AND 10
51789 192.168.7.36(41423) ywxt ywxt 2020-10-26 03:07:11 UTC XX000 LOG: statement: SELECT * FROM (SELECT RET.*, ROWNUM AS FSDPRNFROM (………………) WHERE FSDPRN BETWEEN 1 AND 10
params:$1 = d6d9c702-8aa9-4e9b-8cc2-4d756b1adaa2, $2 = 202009, $3 =HLW, $4 = 15255864202, $5 =1a251789-9cc9-4e77-84cd-6b651da3be5a, $6 = 2020-09-14 11:06:18
params:$1 = 40f12349-97cc-4945-94b1-e7fda75cc984, $2 = 202009, $3 =HLW, $4 = 18326858272, $5 =eaceb00f-9f78-4c6f-8cee-517898f0b200, $6 = 2020-09-15 15:16:40
查過(guò)文檔發(fā)現(xiàn)是屬于11.3之前已知的問(wèn)題:
Fix race conditions in management of dynamic shared memory (Thomas Munro)
These could lead to “dsa_area couldnot attach to segment” or “cannot unpin a segment thatis not pinned” errors.
if you just happen to hit PostgreSQL bugsuch as this one that was fixed in v11.3:
Once this problem occurs, your databasewill appear to be running, and can possibly be used to service somequeries... but most queries will hang, and eventually yourapplication servers will timeout causing application downtime.
通過(guò)重啟實(shí)例恢復(fù),由于重啟過(guò)程中,實(shí)例無(wú)法正常停止,導(dǎo)致管控重啟流程超時(shí)失敗,通過(guò)手工修復(fù)拉起實(shí)例。手動(dòng)流程如下:
將長(zhǎng)時(shí)間運(yùn)行的流程直接設(shè)置為中斷
cat/root/.kube/config |grep client-certificate-data|awk {print$2}|base64 -d >/root/.kube/admin.crt
cat/root/.kube/config |grep client-key-data|awk {print $2}|base64 -d>/root/.kube/admin.key
KUBEAPISERVER=127.0.0.1
PPASCLUSTERNAME=polar-xxxxx
curl-k --cert /root/.kube/admin.crt --key /root/.kube/admin.keyhttps://$KUBEAPISERVER:6443/apis/ppas.polardb.aliyun.com/v1beta1/namespaces/default/ppasclusters/$PPASCLUSTERNAME| python -m json.tool > polar-single.yaml
vimpolar-single.yaml
找到clusterStatus:xxxxxx,改為clusterStatus:Interrupt,保存
curl-X PUT -H "Content-Type: application/json" -k --cert/root/.kube/admin.crt --key /root/.kube/admin.keyhttps://$KUBEAPISERVER:6443/apis/ppas.polardb.aliyun.com/v1beta1/namespaces/default/ppasclusters/$PPASCLUSTERNAME/status--data @polar-single.yaml
控制臺(tái)查看集群列表,看是否進(jìn)入到中斷狀態(tài)。
清理ins_lock文件
kubectlget ppascluster polar-xxxx -o yaml
找到status- masterCluster - rwHostIns - insId,這個(gè)值是實(shí)例ID。
在三個(gè)計(jì)算節(jié)點(diǎn)找以下路徑:/data/polardb_ppas/{insId}/data,和直接進(jìn)入容器查找文件效果一樣。
找到ins_lock文件,刪除。
將中斷的集群進(jìn)行重建
kubectledit ppascluster polar-xxxx
在annotations下面增加下面兩項(xiàng):
polarbox.interrupt.recover:"T"
polarbox.interrupt.previous.status:"RebuildingRW"
刷新控制臺(tái),查看集群狀態(tài)進(jìn)入重建中,等待重建完成
1、此問(wèn)題是并行查詢過(guò)程中觸發(fā),短期內(nèi)通過(guò)關(guān)閉并行查詢來(lái)規(guī)避,目前生產(chǎn)已經(jīng)關(guān)閉并行。
2、Polardb應(yīng)急手工拉起流程的熟悉。
3、相關(guān)工具配套的場(chǎng)景沉淀。
文章版權(quán)歸作者所有,未經(jīng)允許請(qǐng)勿轉(zhuǎn)載,若此文章存在違規(guī)行為,您可以聯(lián)系管理員刪除。
轉(zhuǎn)載請(qǐng)注明本文地址:http://systransis.cn/yun/130089.html
摘要:演講嘉賓簡(jiǎn)介蔡松露子嘉,阿里云云數(shù)據(jù)庫(kù)總架構(gòu)師,主要負(fù)責(zé)阿里云技術(shù)以及阿里云數(shù)據(jù)庫(kù)整體架構(gòu)等工作。也就是說(shuō)以上的四點(diǎn)只是達(dá)到了云原生數(shù)據(jù)庫(kù)的門檻值,還并不代表是這一個(gè)云原生的數(shù)據(jù)庫(kù)。 摘要:POLARDB是阿里云ApsaraDB數(shù)據(jù)庫(kù)團(tuán)隊(duì)研發(fā)的基于云計(jì)算架構(gòu)的下一代關(guān)系型數(shù)據(jù)庫(kù),其最大的特色是計(jì)算節(jié)點(diǎn)與存儲(chǔ)節(jié)點(diǎn)分離,借助優(yōu)秀的RDMA網(wǎng)絡(luò)以及最新的塊存儲(chǔ)技術(shù)。POLARDB不但滿足了公...
摘要:演講嘉賓簡(jiǎn)介蔡松露子嘉,阿里云云數(shù)據(jù)庫(kù)總架構(gòu)師,主要負(fù)責(zé)阿里云技術(shù)以及阿里云數(shù)據(jù)庫(kù)整體架構(gòu)等工作。也就是說(shuō)以上的四點(diǎn)只是達(dá)到了云原生數(shù)據(jù)庫(kù)的門檻值,還并不代表是這一個(gè)云原生的數(shù)據(jù)庫(kù)。 摘要:POLARDB是阿里云ApsaraDB數(shù)據(jù)庫(kù)團(tuán)隊(duì)研發(fā)的基于云計(jì)算架構(gòu)的下一代關(guān)系型數(shù)據(jù)庫(kù),其最大的特色是計(jì)算節(jié)點(diǎn)與存儲(chǔ)節(jié)點(diǎn)分離,借助優(yōu)秀的RDMA網(wǎng)絡(luò)以及最新的塊存儲(chǔ)技術(shù)。POLARDB不但滿足了公...
摘要:近日,阿里云正式對(duì)外發(fā)布了全新一代自研關(guān)系型數(shù)據(jù)庫(kù)。而他認(rèn)為,在未來(lái)年內(nèi),沒(méi)有自研數(shù)據(jù)庫(kù)的云計(jì)算廠商將會(huì)被逐漸淘汰出局。 近日,阿里云正式對(duì)外發(fā)布了全新一代自研關(guān)系型數(shù)據(jù)庫(kù)POLARDB。值得注意的是,POLARDB并不是基于開源數(shù)據(jù)庫(kù)MySQL之上研發(fā)的分支,而且基于第三代分布式共享存儲(chǔ)架構(gòu),創(chuàng)新實(shí)現(xiàn)企業(yè)級(jí)...
閱讀 1356·2023-01-11 13:20
閱讀 1707·2023-01-11 13:20
閱讀 1215·2023-01-11 13:20
閱讀 1906·2023-01-11 13:20
閱讀 4165·2023-01-11 13:20
閱讀 2757·2023-01-11 13:20
閱讀 1402·2023-01-11 13:20
閱讀 3671·2023-01-11 13:20