Issue impacting US clusters
Incident Report for Braze, Inc.
Postmortem

On Monday, April 29, 2024, the Braze platform's US clusters experienced an outage that persisted in whole or in part for several hours, impacting customer access to our dashboard as well as data processing and message sends. In Braze's 13-year history, this is the first incident of this magnitude. Today our CTO and Co-founder, Jon Hyman, issued a blog post detailing the cause of this incident and the steps we have since taken to prevent a recurrence. Please read Jon's blog post for further details.

Posted May 03, 2024 - 13:40 EDT

Resolved
The overwhelming majority of customers across US01 and US03 have had their backlogs processed and are back to real-time data processing and message sending. All services are functioning as expected. We are considering this incident resolved.

We apologize for this incident and will provide a detailed Root Cause Analysis (RCA) report soon.
Posted Apr 29, 2024 - 23:56 EDT
Update
US01 Data Processing, Outbound Messages, and SDK Data Collection are fully operational.

US03 Data Processing and SDK Data Collection are fully operational.
We are still actively processing a backlog of Outbound Messages for a small subset of customers in US03.
Posted Apr 29, 2024 - 23:29 EDT
Update
US01 Data Processing and SDK Data Collection are fully operational.
We are still actively processing a backlog of Outbound Messages for a small subset of customers in US01.

US03 SDK Data Collection is fully operational.
We are still actively processing a backlog of Outbound Messages for a small subset of customers in US03.
We are still actively processing a backlog of Data Processing jobs in US03.
Posted Apr 29, 2024 - 21:50 EDT
Update
US08 has been marked as operational. The messaging and data processing backlogs on that cluster have been fully processed, and all other services are operational. We now consider that cluster to be in a "monitoring" status.
Posted Apr 29, 2024 - 19:57 EDT
Update
Providing a number of meaningful updates for US01 and US03:

Dashboards and REST API processing are fully operational in both US01 and US03.
SDK Data Collection is fully operational in US03, and we are scaling up in US01.

Data Processing and Message Sending are still experiencing sporadic latency as we work through the backlogs, but all health measures are improving rapidly.
Posted Apr 29, 2024 - 19:25 EDT
Update
US06 has been marked as operational. The messaging and data processing backlogs on that cluster have been fully processed, and all other services are operational. We now consider that cluster to be in a "monitoring" status.
Posted Apr 29, 2024 - 18:32 EDT
Update
We are continuing to work on a fix for this issue.
Posted Apr 29, 2024 - 18:29 EDT
Update
US04 and US05 have been marked as operational. The messaging and data processing backlogs on those clusters have been fully processed, and all other services are operational. We now consider those clusters to be in a "monitoring" status.
Posted Apr 29, 2024 - 18:13 EDT
Update
We are actively processing backlogs of both messaging and data across all clusters. Our Database, SRE, and Networking teams are continuing to increase overall throughput as the recovery continues and individual clusters catch back up to real-time.

Currents is operational across all clusters, and has been processing all events as they are cleared from the backlogs.

At this point we have completed both backlogs in US02 and US07. We have also completed the full message sending backlog in US04, and are more than 75% through backlogs in US05 and US06. US01 and US03 are continuing to ramp their pace of recovery. The next update will provide continued status updates on backlog processing and recovery.
Posted Apr 29, 2024 - 17:16 EDT
Update
At this point, Dashboard access is available for all clusters.

We are processing through the backlog of messages to send and data to process across all clusters.

We'll continue to provide hourly updates.
Posted Apr 29, 2024 - 16:00 EDT
Update
US02 and US07 have been marked as operational. The messaging and data processing backlogs on those clusters have been fully processed.

On our larger clusters, backlog processing will take longer, and we don't yet have a cluster-by-cluster ETA, but we are tracking toward resolution.
Posted Apr 29, 2024 - 14:20 EDT
Update
We continue to see service restoration across several clusters:

Data Processing and Messaging have resumed in US05 and US07.
Posted Apr 29, 2024 - 13:59 EDT
Update
We continue to see service restoration across several clusters:

Dashboard services have resumed on US04, US05, US06, and US07.
Data Processing and Messaging have resumed in US04.
Posted Apr 29, 2024 - 13:37 EDT
Update
We are seeing Dashboard access, Data Processing, and Messaging resuming in US02. There is a backlog of work to process, and once it is fully caught up, we will update the status to operational.

We are working through the rest of the US clusters and will provide updates in real-time as we have them.
Posted Apr 29, 2024 - 13:16 EDT
Update
We continue working to resolve a network issue in our US data centers.

We continue to work through service checkouts, and our remediation steps are showing success across various services.

Our next update will be in 30 minutes or once we have more detailed information about the resolution.
Posted Apr 29, 2024 - 13:01 EDT
Update
We continue working to resolve a network issue in our US data centers.

Senior leaders in our Engineering organization have implemented code designed to ensure that Quiet Hours are respected where required, to the extent customers had properly configured this feature in their Campaigns and Canvases before this incident.
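
For illustration only, here is a minimal sketch of this kind of quiet-hours gate, assuming a hypothetical per-recipient timezone and a configurable window; it is not Braze's actual implementation:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Hypothetical quiet-hours window; in practice this would come from the
# customer's Campaign or Canvas configuration.
QUIET_START = time(22, 0)  # 10:00 PM local time
QUIET_END = time(8, 0)     # 8:00 AM local time

def within_quiet_hours(now_local: datetime,
                       start: time = QUIET_START,
                       end: time = QUIET_END) -> bool:
    """True if the local time falls inside the quiet-hours window,
    including windows that cross midnight (e.g. 22:00-08:00)."""
    t = now_local.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end

def should_send_now(user_timezone: str) -> bool:
    """Gate a backlogged send on the recipient's local quiet hours."""
    now_local = datetime.now(ZoneInfo(user_timezone))
    return not within_quiet_hours(now_local)

# Example: hold a queued message for a recipient in New York until
# their quiet hours have ended.
if should_send_now("America/New_York"):
    print("send message")
else:
    print("defer message")
```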

We have completed the restoration of services to a pilot customer successfully, and are now working through restoration across all US Clusters.

Our next update will be in 30 minutes or less.
Posted Apr 29, 2024 - 12:28 EDT
Update
We continue working to resolve a network issue in our US data centers.

We have no material update since our last post. We continue to work through restoring connectivity to those databases.

Our next update will be in 30 minutes or less.
Posted Apr 29, 2024 - 11:55 EDT
Update
We are continuing to work to resolve a network issue in our US data centers. As mentioned, the rolling restart of our database containers with Rackspace, our database hosting provider, was completed. We are now working through restoring connectivity to those databases. Senior leaders in our Engineering organization are working to ensure that Quiet Hours will be respected in the countries where they are required and as configured in Campaigns.

We will provide a full RCA and postmortem once this is resolved.

Our next update will be in 30 minutes or less.
Posted Apr 29, 2024 - 11:27 EDT
Update
We are continuing to work to resolve a network issue in our US data centers. The rolling restart of our database containers with Rackspace, our database hosting provider, is complete. Services are gradually returning online, and we are currently processing the backlog of data and messages accumulated during the incident.

We will provide a full RCA and postmortem once this is resolved.

Our next update will be in 30 minutes or less.
Posted Apr 29, 2024 - 10:55 EDT
Update
We are continuing to resolve a network issue in our US data centers. The rolling restart of database containers with Rackspace, our database hosting provider, is progressing and is approximately 75% complete. Once these restarts are complete, we will begin bringing services back online and processing data and messaging backlogs. Our next update will be in 30 minutes or less.
Posted Apr 29, 2024 - 10:25 EDT
Update
We have identified the root cause and are working to resolve a network issue in our US data centers. We are actively performing a rolling restart of database containers with Rackspace, our database hosting provider. We do not expect data loss, and further expect that all messages will be sent once the services are up and running. Our next update will be in 30 minutes or less.
Posted Apr 29, 2024 - 09:53 EDT
Update
We are continuing to work on a fix for this issue.
Posted Apr 29, 2024 - 08:59 EDT
Update
Engineers and our database provider are continuing work to restore service.
Posted Apr 29, 2024 - 08:05 EDT
Update
Engineers are continuing to work alongside our database provider to restore service.
Posted Apr 29, 2024 - 06:53 EDT
Update
Engineers are actively working with our database provider to restore service.
Posted Apr 29, 2024 - 06:18 EDT
Identified
We have identified a third-party networking issue.
Posted Apr 29, 2024 - 05:48 EDT
Investigating
Engineers are investigating an issue impacting multiple services on all US clusters.
Posted Apr 29, 2024 - 05:41 EDT
This incident affected: US 01 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), US 02 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), US 03 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), US 04 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), US 06 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), US 08 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), US 05 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents), and US 07 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging, Currents).