Alibaba Cloud Database Console Access Disrupted in Some Regions; Company Restores Service and Apologizes

Thanks to Gamingdeputy netizen South China Wu Yanzu for the tip!

Gamingdeputy reported on November 28 that at 9:16 yesterday (November 27), access to Alibaba Cloud's cloud database console became abnormal in some regions. The Beijing, Shanghai, Hangzhou, Shenzhen, Qingdao, and Hong Kong regions, as well as the US East and US West regions, were affected.

Yesterday evening, Alibaba Cloud issued an apology for the abnormal access to the cloud database console, saying that after emergency handling by engineers, access was restored at 10:58 that day.


Hello! Starting from 09:16 on November 27, 2023, Beijing time, Alibaba Cloud monitoring found that console and OpenAPI access for database products (RDS, PolarDB, Redis, etc.) was abnormal in the Beijing, Shanghai, Hangzhou, Shenzhen, Qingdao, Hong Kong, US East, and US West regions, but running instances were not affected. After emergency handling by engineers, access was restored at 10:58 that day. We are very sorry for the inconvenience caused to you. If you have any questions, please feel free to contact us.

Gamingdeputy noticed that this is the second console service failure at Alibaba Cloud this month. The first occurred on November 12, one day after Double 11, affected products including Alibaba Cloud Disk, Taobao, Xianyu, DingTalk, and Yuque, and lasted approximately 3.5 hours.

Alibaba Cloud's incident report for the November 12 failure:

Problem scope

  • Some services of products such as OSS, OTS, SLS, and MNS are affected, but the operation of most products such as ECS, RDS, and networks is not affected.

  • Cloud product console, management API and other functions are affected

Problem impact time

17:39-19:20 on November 12, 2023, Beijing time

Problem overview

Starting from 17:39 on November 12, 2023, Beijing time, Alibaba Cloud product console access and management API calls were abnormal, and some cloud product services were also affected. Engineers found that the failure was caused by an exception in the Access Key Service (AK). After revising the whitelist version, the engineers restarted the AK service in batches. Recovery began at 18:35, and product consoles and management APIs in most regions were restored by 19:20.

Handling process

November 12, 2023

  • 17:39 Alibaba Cloud product console access and management API calls became abnormal.

  • 17:50 Engineers confirmed that the fault was caused by an abnormality in the AK service, which disrupted cloud product consoles and management API calls, as well as cloud product services that rely on the AK service.

  • 18:01 Engineers located the root cause.

  • 18:07 Engineers began implementing recovery measures, including revising the whitelist version and restarting the AK service.

  • 18:35 Hangzhou and other regions returned to normal.

  • 19:20 Cloud product consoles and management API calls in most regions returned to normal.

Problem causes

The Access Key Service (AK) encountered a read exception while reading whitelist data. Because of a logic flaw in the code that handles the read exception, an incomplete whitelist was generated, causing valid requests that were missing from this whitelist to fail. As a result, the cloud product consoles and management API services became abnormal. At the same time, some products that rely on the AK service saw some of their services run abnormally because of the incomplete whitelist.
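To make this failure mode concrete, here is a minimal Python sketch. It is not Alibaba Cloud's code: the function names, the chunked reader, and the "keep the last known good whitelist" fallback are all assumptions made for illustration. It only shows how swallowing a read exception can publish a truncated whitelist, and one possible way to fail safe instead.

```python
from typing import Callable, Iterable, Set


def build_whitelist_flawed(read_chunks: Callable[[], Iterable[Set[str]]]) -> Set[str]:
    """Hypothetical flawed builder: a read error silently truncates the result."""
    whitelist: Set[str] = set()
    try:
        for chunk in read_chunks():
            whitelist |= chunk
    except OSError:
        pass  # Flaw: the partial whitelist is returned as if it were complete.
    return whitelist


def build_whitelist_safe(
    read_chunks: Callable[[], Iterable[Set[str]]],
    last_known_good: Set[str],
) -> Set[str]:
    """Hypothetical safer builder: on a read error, keep the previous whitelist."""
    whitelist: Set[str] = set()
    try:
        for chunk in read_chunks():
            whitelist |= chunk
    except OSError:
        return set(last_known_good)  # Never publish a partial result.
    return whitelist


def failing_reader() -> Iterable[Set[str]]:
    """Simulated data source that fails after delivering the first chunk."""
    yield {"svc-a", "svc-b"}
    raise OSError("simulated read exception")


previous = {"svc-a", "svc-b", "svc-c"}
flawed = build_whitelist_flawed(failing_reader)
safe = build_whitelist_safe(failing_reader, previous)
```

In this sketch, a request from the made-up caller `svc-c` would be rejected under `flawed` even though it is valid, while `safe` still contains it.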

Improvement measures

  • Add verification and alarm interception capabilities for AK service whitelist generation results.

  • Add grayscale verification logic for AK service whitelist updates to detect abnormalities in advance.

  • Improve the quick recovery capability of the AK service whitelist.

  • Strengthen coordinated recovery capabilities on the cloud product side.
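The first two measures could look roughly like the following Python sketch. The report does not describe how they are implemented, so everything here is an assumption: the function names, the 5% drop threshold, and the idea of replaying a sample of known-valid callers as the grayscale check.

```python
from typing import Iterable, Set


def validate_candidate(candidate: Set[str], current: Set[str],
                       max_drop_ratio: float = 0.05) -> bool:
    """Reject a new whitelist that is empty or drops too many existing entries."""
    if not candidate:
        return False
    if not current:
        return True
    dropped = current - candidate
    return len(dropped) / len(current) <= max_drop_ratio


def grayscale_check(candidate: Set[str], known_valid_callers: Iterable[str]) -> bool:
    """Replay a sample of callers known to be valid against the candidate."""
    return all(caller in candidate for caller in known_valid_callers)


def publish(candidate: Set[str], current: Set[str],
            known_valid_callers: Iterable[str]) -> Set[str]:
    """Return the whitelist to serve: the candidate only if both checks pass."""
    if validate_candidate(candidate, current) and grayscale_check(
            candidate, known_valid_callers):
        return candidate
    # A real system would also raise an alarm here; this sketch just keeps
    # serving the current whitelist.
    return current


current = {"svc-a", "svc-b", "svc-c"}
truncated = {"svc-a", "svc-b"}
extended = {"svc-a", "svc-b", "svc-c", "svc-d"}

served_after_bad_update = publish(truncated, current, ["svc-c"])
served_after_good_update = publish(extended, current, ["svc-c"])
```

Here the truncated candidate is intercepted and the current whitelist stays in service, while the complete candidate is published.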
