Resolved -
The engineering team has continued its monitoring during the day and has concluded that the situation has stabilized and that service performance and quality are at the desired level. Cognite has closed this incident.
Jun 7, 13:32 CEST
Update -
Cognite Engineering has been monitoring the services since yesterday. Latency and error rates appear to have been stable since 17:30 UTC yesterday. The load on the system is at a level where we think there is little risk of yesterday's problems recurring. We are continuing to monitor and to analyze the root cause.
Jun 7, 09:18 CEST
Monitoring -
The engineering team has spent additional hours working on a resolution for the incident. The incident is likely due to a change in traffic or traffic patterns. Some of the traffic has now been turned off, and the service looks healthier. We will continue to monitor, but if the situation remains stable, no further action will be taken until tomorrow. The incident is not considered resolved, but with the current traffic pattern and load, we believe that service quality and performance are at an acceptable level.
Jun 6, 21:35 CEST
Identified -
Cognite Engineering has deployed configuration changes that seem to reduce the problem significantly. The engineering team is still working on additional improvements. End users should see reduced error rates and fewer requests with slow response times.
Jun 6, 14:35 CEST
Update -
End users are experiencing both elevated response times and increased error rates. The incident has been re-classified as a "partial outage". Cognite is still investigating.
Jun 6, 13:50 CEST
Investigating -
Cognite Engineering is investigating an issue causing reduced performance for sequences and time series. We are seeing P90 latencies of up to 1 minute for parts of these services.
Jun 6, 13:25 CEST
Resolved -
The incident was fully resolved at 11:20 UTC. The total downtime was 1 hour 35 minutes.
Jun 2, 15:56 CEST
Update -
We are continuing to monitor the cluster that was impacted by this incident. Our systems indicate that all services are now working and that the cluster is fully operational. We will continue monitoring for a few more hours.
Jun 2, 13:37 CEST
Monitoring -
Cognite Engineering has resolved the problem causing a major outage in the ASIA-J1 cluster. End users should now see login and APIs working again. We are monitoring the situation closely for a few hours before deciding whether the incident can be closed.
Jun 2, 13:27 CEST
Update -
The engineering team is still working on getting services back online in our ASIA-J1 cluster. The impact on end users is that API calls will fail, and logins to apps will also time out or fail.
Jun 2, 13:17 CEST
Identified -
Cognite Engineering is currently working on a resolution for an issue causing downtime for several services in our ASIA-J1 cluster. The incident also impacts our monitoring systems, so its full impact is not yet completely understood. Several APIs are down, including time series and sequences.
Jun 2, 12:47 CEST
Resolved -
Fusion access management is now working normally.
Jun 2, 14:33 CEST
Monitoring -
We believe we have fixed the issue after updating a problematic component. Please contact support if you are still affected.
Jun 2, 11:50 CEST
Update -
While we continue to investigate the root cause, we are testing a workaround for Fusion that avoids the issue, and will provide an update when we are able to release it.
Jun 1, 22:27 CEST
Investigating -
Visiting the access management page in Fusion can cause the website to crash for some users.
This is known to affect users in multiple contexts (different clusters and projects); the investigation is ongoing.
Workarounds include using alternative clients to manage access to CDF, such as Postman, API clients like curl, or the Python SDK, as illustrated in the sketch below.
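For reference, here is a minimal sketch of the API-based workaround using Python's requests library. It assumes the standard CDF v1 groups endpoint for listing a project's access groups; the cluster name, project name, and bearer token are illustrative placeholders, not values from this incident.

import requests

# Placeholders; replace with the values for your own CDF project (assumptions).
CLUSTER = "api"                 # your CDF cluster, e.g. "api" (assumption)
PROJECT = "my-project"          # your CDF project name (assumption)
TOKEN = "<oauth-access-token>"  # a valid bearer token for the project

base_url = f"https://{CLUSTER}.cognitedata.com/api/v1/projects/{PROJECT}"
headers = {"Authorization": f"Bearer {TOKEN}"}

# List the project's access groups via the API instead of the Fusion UI.
resp = requests.get(f"{base_url}/groups", headers=headers, timeout=30)
resp.raise_for_status()
for group in resp.json().get("items", []):
    print(group["id"], group["name"])

The equivalent curl workaround is a GET request against the same /groups URL with the Authorization: Bearer header set.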