Resolved
Cognite Engineering has closed this incident after monitoring the improvements over the last 30 hours. The data modeling service is healthy and there should be no impact on transformations.
Monitoring
Cognite Engineering has concluded that this incident can be set to resolved. The code changes deployed yesterday have significantly improved both latency and error rate. The service is now operating well within the boundaries set by the objectives that we have defined. We will monitor the service for a few more hours before we close the incident.
Identified
The fixes and optimizations deployed yesterday have proven to improve the performance and availability of the service significantly. The duration of the error spikes is now less than 10 minutes, and we have not seen error rates above the SLA threshold of 0.5% since the fixes were deployed.
Cognite Engineering continues to work on the service to improve its robustness and reliability. We will monitor service performance closely and keep this incident open.
Identified
Cognite Engineering has made progress in understanding what causes the error spikes that have affected the data modeling service during this incident. Code changes have been deployed, and they have reduced the latency of the service. We will continue the analysis to ensure that the capacity and the error rate of the service improve. Further improvements are planned, and in the meantime we are monitoring the service closely so that we can tune it manually if needed.
Investigating
The steps taken before the weekend led to a reduced scope and impact of the incident. The team is investigating the root cause.
Customers should be aware that transformations may fail and that data modeling may be inaccessible for short periods. Please submit a ticket to Cognite Support if you have questions.
Investigating
Cognite Engineering is investigating the root cause of an incident where we see spikes of unavailability for the data modeling service. The incident causes transformations to fail and the data modeling service to be unavailable. End users will get 5xx errors, and the issue persists until we take manual steps to resolve it.
The team has taken steps to reduce the duration of the periods with elevated error rates. The strategy is to keep the service at the best possible quality through the weekend and use the on-call resources to address issues that these steps are unable to handle.
Customers should be aware that transformations may fail and that data modeling may be inaccessible for short periods. Please submit a ticket to Cognite Support if you have questions.