Elevated error rates and reduced capacity in Timeseries and sequences API in EUR-W3
Resolved
Lasted for 4d

This incident is resolved. The time series and sequences services are now operating at normal performance levels and with the resiliency they were designed for.

Mon, Mar 20, 2023, 08:30 AM
Affected components

No components marked as affected

Updates

Resolved

This incident is resolved. The time series and sequences services are now operating at normal performance levels and with the resiliency they were designed for.

Mon, Mar 20, 2023, 08:30 AM
2d earlier...

Identified

The engineering team is still not able to lift the rate limits that have been configured to protect the backend storage while the recovery and optimization work is ongoing. Users will see 429 response codes from the API when the rate limits kick in, and will see higher rates of 429s between 4:00 am and 10:00 am UTC due to ongoing work to improve redundancy.
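
For clients calling the Timeseries and Sequences APIs while these limits are in effect, retrying 429 responses with backoff lets requests succeed automatically once capacity frees up. Below is a minimal, generic sketch (not the Cognite SDK or its actual retry behavior); the URL, headers, and retry parameters are illustrative assumptions.

import random
import time

import requests

# Illustrative endpoint and headers; substitute your own project URL and credentials.
URL = "https://api.cognitedata.com/api/v1/projects/<project>/timeseries"
HEADERS = {"api-key": "<key>"}  # placeholder auth; adjust to your setup

def get_with_backoff(url, headers, max_retries=5):
    """Retry rate-limited (429) requests, backing off between attempts."""
    delay = 1.0
    resp = None
    for _ in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            return resp
        # Honor a Retry-After header if the service sends one; otherwise
        # back off exponentially with a little jitter.
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay + random.uniform(0, 0.5)
        time.sleep(wait)
        delay *= 2
    return resp  # still rate limited after exhausting retries; caller decides what to do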

Sat, Mar 18, 2023, 07:37 AM
18h earlier...

Identified

The engineering team is still working on stabilizing the performance of the timeseries and sequences service. There have been adjustments to the rate limiting in place to reduce the load on the system. End users will get a 429 response code if their request rate exceeds the rate limits. We are considering relaxing the rate limits further, and a new update will be posted here if and when this happens.

Fri, Mar 17, 2023, 12:51 PM
4h earlier...

Identified

The engineering team is still working on stabilizing the performance of the timeseries and sequences service. There is rate limiting in place to reduce the load on the system. End users will get a 429 response code if their request rate exceeds the rate limits. We are adding more resources to the backend systems, but we are not able to lift the rate limits before the processing of the backlog is complete. This will take a few more hours.

Fri, Mar 17, 2023, 08:14 AM
23h earlier...

Identified

The storage backend has now recovered completely and is running with the desired number of replicas. Risk of data loss is no longer a concern in this incident. There is a processing backlog that now needs to be addressed. Cognite engineering is working on an issue with query performance degradation related to high load.

Thu, Mar 16, 2023, 08:51 AM
2h earlier...

Identified

The engineering team has fixed the problems related to the replication in the backend database. We are currently running with a normal level of resiliency, but we have still not lifted the rate limiting, as we want to observe the system for a while longer before opening it up for full load.

Thu, Mar 16, 2023, 06:03 AM
12h earlier...

Identified

The engineering team is still working on resolving this incident. We have had two low-level storage failures in the storage backend. There is redundancy in the system, but not all replicas are fully operational. Cognite is now bringing up a restore cluster to mitigate the chances of data loss. Incoming traffic is still being rate limited to protect the service and the storage backend from excessive load while the incident is contained and eradicated.

Wed, Mar 15, 2023, 05:16 PM
3h earlier...

Identified

The engineering team is continuing to investigate how to improve the performance of the backend for time series and sequences. To prevent data loss, the team has configured rate limiting for the services. Users will see 429 HTTP responses if these new rate limits are exceeded.

Wed, Mar 15, 2023, 02:10 PM
1h earlier...

Identified

Cognite Engineering is working on an incident where the backend datastores for timeseries and sequences have performance problems that result in a need to throttle incoming load and, from time to time, return 5xx responses due to system overload. The engineering team is working on improving the storage system's performance. A new update will be posted when the end-user experience is expected to change.

Wed, Mar 15, 2023, 12:56 PM