It started in the morning with customer reports of timeouts and general slowness.
As a managed services provider leveraging Linode for hosting our critical applications, we recently encountered an enlightening challenge that tested our resilience and problem-solving skills. Our tale echoes the depth of analysis seen in post-mortems like the famous Cloudbleed incident, but from the unique perspective of an MSP navigating third-party infrastructure. This journey began with a perplexing series of failed Kubernetes leader elections within our deployments in LKE Canada, leading us to uncover and address a significant issue in the Linode infrastructure rollout related to NodeBalancers.
The Initial Discovery
While our monitoring did not show systems being down, it became clear that certain systems were not behaving correctly. In particular, components running in the cluster that themselves use the Kubernetes API (e.g. cert-manager) were restarting and logging timeouts or failed leader elections. These elections are the heartbeat of Kubernetes' high-availability features, ensuring seamless management of workloads. Observing such failures was both unusual and concerning, signaling potential disruptions to our services.
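For context, controllers like cert-manager coordinate leadership through Lease objects in the coordination.k8s.io API: the current leader must keep renewing its lease through the API server, so API timeouts surface as failed or expiring elections. The sketch below, assuming the Python kubernetes client, a working kubeconfig, and example namespaces (kube-system and cert-manager, which is not necessarily where every component keeps its lease), shows one way to spot leases whose renewals have gone stale:

```python
# Sketch: list leader-election Leases and flag any whose renewal looks stale.
# Assumes the Python "kubernetes" client and a working kubeconfig; the
# namespaces below are examples only.
from datetime import datetime, timezone

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
coordination = client.CoordinationV1Api()

for namespace in ("kube-system", "cert-manager"):
    try:
        leases = coordination.list_namespaced_lease(namespace).items
    except ApiException:
        continue  # namespace may not exist in every cluster
    for lease in leases:
        renew = lease.spec.renew_time              # the leader's last heartbeat
        duration = lease.spec.lease_duration_seconds or 0
        if renew is None:
            continue
        age = (datetime.now(timezone.utc) - renew).total_seconds()
        flag = " STALE" if age > duration else ""
        print(f"{namespace}/{lease.metadata.name}: holder={lease.spec.holder_identity} "
              f"renewed {age:.0f}s ago (lease {duration}s){flag}")
```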
Initially, it seemed we were grappling with a Kubernetes-specific hiccup—perhaps a bug or misconfiguration. Yet, the problem's persistence and regional specificity hinted at a deeper, underlying issue.
The Investigation Unravels
Assembling a task force of network engineers, systems administrators, and Kubernetes aficionados (just kidding, it was just me), we embarked on a meticulous investigation. Our approach was to observe logs and time the restarts so we could hand Linode's support engineers as much concrete information as possible.
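What helps most in a ticket like this is a timestamped record of when calls to the cluster's API endpoint slow down or fail, which can then be lined up against component restarts and the provider's maintenance windows. Here is a minimal sketch of such a probe, assuming the Python kubernetes client, a kubeconfig pointing at the affected cluster, and an arbitrary 10-second interval:

```python
# Sketch: time a trivial API-server call every 10 seconds and log failures with
# timestamps, to correlate against component restarts and maintenance windows.
# The interval, timeout, and choice of call are assumptions, not requirements.
import time
from datetime import datetime, timezone

from kubernetes import client, config

config.load_kube_config()
version_api = client.VersionApi()

while True:
    stamp = datetime.now(timezone.utc).isoformat()
    started = time.monotonic()
    try:
        version_api.get_code(_request_timeout=5)  # cheap control-plane round trip
        print(f"{stamp} ok   {time.monotonic() - started:.3f}s")
    except Exception as exc:  # timeouts, TLS resets, 5xx from the front end, etc.
        print(f"{stamp} FAIL {time.monotonic() - started:.3f}s: {exc}")
    time.sleep(10)
```

Even a few hours of this output attached to a ticket makes the problem much harder to wave away when everything happens to look healthy at the moment someone checks.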
In particular, the timing lined up with planned Linode infrastructure improvements in a couple of regions, most notably work on NodeBalancers in Canada.
Given the critical role of NodeBalancers in managing incoming traffic, any disruption to their operation could lead to significant service degradation. This change also seemed to be the only explanation for why something that had been working flawlessly a day earlier was now behaving like a partial, ongoing outage.
Identifying the Root Cause
Ultimately, Linode diagnosed the issue and reverted the changes they had made on March 12th, so thoroughly that the work no longer shows up as a completed maintenance on status.linode.com.
Linode has not shared any specifics, but it seems reasonable to conclude that every LKE control plane sits behind a NodeBalancer, and that changes to the NodeBalancer infrastructure therefore had unintended side effects on the Kubernetes control plane itself.
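Since Linode has not confirmed the exact topology, the following is only a rough sanity check rather than proof: resolving the cluster's API endpoint and comparing it with the worker nodes' addresses shows that the control plane is reached through an address belonging to none of our own nodes, which is at least consistent with a managed load balancer sitting in front of it. The sketch assumes the Python kubernetes client and the active kubeconfig:

```python
# Sketch: resolve the API-server endpoint from the active kubeconfig and compare
# it with the cluster's node addresses. An endpoint IP matching no node is
# consistent with (though not proof of) a load balancer fronting the control plane.
import socket
from urllib.parse import urlparse

from kubernetes import client, config

config.load_kube_config()
api_host = urlparse(client.Configuration.get_default_copy().host).hostname
endpoint_ips = {info[4][0] for info in
                socket.getaddrinfo(api_host, 443, proto=socket.IPPROTO_TCP)}

node_ips = set()
for node in client.CoreV1Api().list_node().items:
    for addr in node.status.addresses or []:
        if addr.type in ("InternalIP", "ExternalIP"):
            node_ips.add(addr.address)

print(f"API endpoint {api_host} resolves to {sorted(endpoint_ips)}")
print(f"Node addresses: {sorted(node_ips)}")
print("Endpoint shared with a node:", bool(endpoint_ips & node_ips))
```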
Extracted Lessons
Here at Hibou, we will be duplicating critical infrastructure across regions to lessen the impact that region-specific changes or disruptions can have on our overall services and uptime.
I cannot overstate the importance of a cohesive model of the interactions in your infrastructure. Knowing that an upcoming change may impact your site is always better than having a problem and then hunting for potential causes.
As always, providing clear data to support staff helps tickets get taken seriously instead of being dismissed because things appear to be working at the moment.