The first symptom was that the "dashboarding and alerting service became unavailable." This itself was a problem for troubleshooting since it has "dashboards with their pre-built queries. Finally, the incident commander will assign someone to write the postmortem report. Empathize With Customers By Focusing on the Impact. Here’s an example of how Slack approached this issue during their most recent outage. Outage: Customers are experiencing degraded funcionality. Have a blameless, feasible, post-mortem culture. The problem with this approach is that your customers want to know what’s happening, even if that update amounts to a We’re still working on it. Gmail Suffers Two-Hour Global Outage: Reports. PST, a small percentage of customers experienced trouble sending and editing messages, loading threads, uploading files, and more. Users re-opening Slack for the first time in a while had out-of-date caches and therefore requested more data than usual. This, said Slack in its detailed report, was a large part of the problem. RCA - Network Connectivity - DNS Resolution (Scroll down to 5/2) Gmail. Trouble with connectivity and other services. The comms platforms outage came at a bad time, just when people were getting back to work after the holiday period. The root cause analysis is an important part of our post-mortem step. Any Slack entries that are moved from the Available Data listing into the postmortems Timeline will be visible to anyone looking at the postmortem. As so often happens, a large part of the challenge was understanding what the core problem was. Technical Details on the Recent Firefox Add-on Outage Add-Ons Outage Post-Mortem Result Bug 1549886 Azure, Microsoft 365. When every second of downtime comes with a cost, a good incident response process. The problem with the TGW scaling was not immediately apparent. "We go from our quietest time of the whole year to one of our biggest days quite literally overnight," said the Slack team. Post-mortem meetings: Capture key learnings. The name may sound morbid, but the overall aims are very positive. The Ripple Effect Of Outages And Downtime Cannot Be Underestimated. It involves project staff, stakeholders and clients. On-Call Post-Mortem Capacity Planning Service Level Agreement Performance. The comms platform's outage came at a bad time, just when people were getting back to work after the holiday period. A post-mortem meeting is a forum to reflect on the successes and failures of a project after its completion. As a result, Amazon's cloud computing arm said it is "reviewing the TGW scaling algorithms". Slack says it has identified a scaling failure in its AWS Transit Gateways (TGWs) as the reason for the chat service's monumental outage on 4 January.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |