The awfulness of AWS EKS
A little bit of background: I’ve been a Kubernetes user since early 2015 and an AWS user since 2013 (on and off, with stints across all the major cloud players: GCP, Azure, DO, Alibaba; no Oracle yet, though). I’ve done “Kubernetes the hard way” and loved it.
The first Kubernetes-as-a-service platform I ever used was GKE, though, and I think that spoiled me through and through, setting high expectations for everyone else.
GKE is a Kubernetes-first service, where everything that makes up the underlying GCP machinery comes second. You don’t need to sweat the minutiae of instance groups, how autoscaling happens behind the scenes, rights assignments, networking and so on. Everything is set up with sane defaults that let the cluster work as intended (whether private or public, with or without cluster autoscaling, etc.). All settings are one click away, and everything you can do imperatively you can also do declaratively. You need an autoscaler? GKE deploys everything the cluster needs to autoscale, from the cluster autoscaler itself to all the permissions it requires. You need a service mesh? It’s one click away.
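To make that concrete, here is a rough sketch of what “one click away” translates to on the command line. The cluster name, zone and node counts are made up, but the flags are standard gcloud ones:

```bash
# GKE: one command creates the control plane, the node pool and the
# autoscaler wiring; no separate IAM or networking setup is needed for
# the default case (name, zone and sizes are illustrative).
gcloud container clusters create demo \
  --zone europe-west1-b \
  --num-nodes 3 \
  --enable-autoscaling --min-nodes 1 --max-nodes 5
```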
Azure AKS isn’t too far removed. While obviously behind GKE (no surprise, given that Kubernetes came out of Google), it has come a long way and provides almost everything GKE can. In many ways I like it better; it’s a sort of GKE without the clutter. I mean, I can do without the one-click Istio option. Honestly, I’d rather have good documentation on how to set something up than an automagic option.
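For the same kind of comparison, an AKS rough equivalent of the GKE one-liner above might look like this (resource group, name and sizes are, again, made up):

```bash
# AKS: a single command also gets you a working cluster with the
# cluster autoscaler enabled (resource group and names are illustrative).
az aks create \
  --resource-group demo-rg \
  --name demo \
  --node-count 3 \
  --enable-cluster-autoscaler --min-count 1 --max-count 5
```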
And where was AWS all this time? Honestly, I have no idea. While I was doing my first project on GKE at the end of 2016, Azure was preparing the preview of AKS and AWS was nowhere to be heard from. Only in late 2017 did AWS show a preview of EKS, which launched in 2018 … and boy, was it an awful experience.
You see, where Azure and Google built the Kubernetes-as-a-service experience around Kubernetes itself, offering the full setup of a cluster a few clicks away, AWS did it the other way around.
EKS was (and in many ways still is) a patchwork around AWS concepts. What do I mean by that?
- what EKS calls “setting up a cluster” merely sets up the control plane. Only then are you directed to separately set up node groups and the like (see the first sketch after this list).
- Of course, EKS doesn’t create all the networking for you … oh no. Where the other providers will set up the network structure for you if you don’t have it, AWS doesn’t. You’d better set everything up exactly right or stuff just won’t work and you’ll have no idea why (the subnet-tagging sketch after this list is one example).
- At launch, EKS was clearly the unwanted stepchild: it wasn’t given a first-class place in the aws CLI (the way GKE has in gcloud). It took almost four years to be able to do some of this with AWS’s own tooling, until AWS took over eksctl … which is still worlds apart from the declarative tool it’s advertised as (see the eksctl sketch below).
- when you try to discuss EKS with AWS support, they don’t really engage with the Kubernetes platform, just with the surrounding AWS concepts; EKS itself seems foreign to them. (“Why did you set up the load balancer target group like that? You should only have two instances there.” “I didn’t set it up; that’s how the ingress interacts with the platform, it registers all instances regardless of whether ingress pods run there.” “Why is the autoscaling group changing by itself?” “Because it’s managed by the autoscaler.”)
- you need to make sure all the proper IAM policies are in place and attached (including, for example, the CNI policy; weirdly, AWS only provides an AWS-managed policy for IPv4, whereas for an IPv6 cluster you have to create your own, as sketched below).
- the cluster autoscaler isn’t even a plugin, so you install and wire it up yourself (weirdly, things like the EBS CSI controller and the VPC CNI are considered plugins, and the VPC CNI even ships enabled by default); see the autoscaler sketch below.
- you get instructions on how to remove the pod count limit by editing launch templates yourself, instead of AWS providing launch templates with the limit already removed, even in cases where the limit shouldn’t exist at all (like IPv6 clusters); the max-pods sketch below shows what that looks like.
- the whole ridiculous pods-per-node limit only exists because of AWS’s insistence on tying the cluster’s internal networking to the VPC. Sure, it’s great if you plan to route traffic directly to services and pods, and it’s good to have the option, but it shouldn’t be the default; it should be an opt-in whose limitations users understand and consciously accept. In other words, an overlay CNI should be the default and the VPC CNI should be an option.
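To illustrate the first point: with plain AWS tooling, the “cluster” and the nodes are entirely separate creations. The names, ARNs and IDs below are placeholders, but the commands are the standard aws CLI ones:

```bash
# "Creating a cluster" in EKS only creates the control plane.
aws eks create-cluster \
  --name demo \
  --role-arn arn:aws:iam::<account-id>:role/eksClusterRole \
  --resources-vpc-config subnetIds=subnet-0aaa1111,subnet-0bbb2222,securityGroupIds=sg-0ccc3333

aws eks wait cluster-active --name demo

# Worker nodes are a second, separate resource you have to ask for explicitly.
aws eks create-nodegroup \
  --cluster-name demo \
  --nodegroup-name ng-1 \
  --subnets subnet-0aaa1111 subnet-0bbb2222 \
  --node-role arn:aws:iam::<account-id>:role/eksNodeRole \
  --scaling-config minSize=1,maxSize=3,desiredSize=2 \
  --instance-types m5.large
```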
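On the networking point, a concrete example of “you’d better set everything up right” is subnet tagging for load balancer discovery. Nothing in EKS creates or checks these tags for you; if they’re missing, ingresses and LoadBalancer services just quietly fail to find subnets. The subnet IDs and cluster name are made up:

```bash
# Public subnets: tag them so internet-facing load balancers can be placed there.
aws ec2 create-tags \
  --resources subnet-0aaa1111 subnet-0bbb2222 \
  --tags Key=kubernetes.io/role/elb,Value=1

# Private subnets: tag them for internal load balancers.
aws ec2 create-tags \
  --resources subnet-0ccc3333 subnet-0ddd4444 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1

# Older controller versions also expect the cluster ownership tag on every subnet.
aws ec2 create-tags \
  --resources subnet-0aaa1111 subnet-0bbb2222 subnet-0ccc3333 subnet-0ddd4444 \
  --tags Key=kubernetes.io/cluster/demo,Value=shared
```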
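As for eksctl being “declarative”: it does accept a ClusterConfig file, but in practice it’s a one-shot create plus a pile of imperative subcommands for changes, not a reconcile-to-desired-state tool. A minimal sketch, with illustrative names and region:

```bash
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo
  region: eu-west-1
managedNodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 2
    minSize: 1
    maxSize: 3
EOF

# Creates the control plane, node group, VPC and IAM roles in one go...
eksctl create cluster -f cluster.yaml
# ...but editing cluster.yaml afterwards doesn't reconcile the running cluster;
# most changes go through separate imperative subcommands instead.
```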
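On the CNI policy: for IPv4 clusters there is an AWS-managed policy you can just attach to the node role, but for IPv6 there is no managed equivalent and you’re expected to author the policy yourself from the docs. A sketch, with a hypothetical node role name and a policy file you’d have to write by hand:

```bash
# IPv4: attach the AWS-managed CNI policy and move on.
aws iam attach-role-policy \
  --role-name eksNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

# IPv6: no managed policy exists; create your own from the documented statement
# (ipv6-cni-policy.json is a file you write yourself, following the EKS docs).
aws iam create-policy \
  --policy-name AmazonEKS_CNI_IPv6_Policy \
  --policy-document file://ipv6-cni-policy.json

aws iam attach-role-policy \
  --role-name eksNodeRole \
  --policy-arn arn:aws:iam::<account-id>:policy/AmazonEKS_CNI_IPv6_Policy
```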
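And on the autoscaler: compare what counts as a plugin with what doesn’t. The EBS CSI driver is one create-addon call away, while the cluster autoscaler is entirely on you, for example via the community Helm chart (cluster name and region are placeholders, and the IAM permissions the autoscaler needs are not shown):

```bash
# This is an official EKS add-on...
aws eks create-addon --cluster-name demo --addon-name aws-ebs-csi-driver

# ...the cluster autoscaler is not. You install it yourself and sort out
# the IAM permissions it needs on top.
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=demo \
  --set awsRegion=eu-west-1
```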
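Finally, the pod count limit itself. With the VPC CNI, the per-node maximum is a function of how many ENIs and IPs the instance type supports, and lifting it means editing the node group’s launch template user data yourself. A sketch for an Amazon Linux 2 node; the cluster name and the 110-pod figure are illustrative:

```bash
# VPC CNI pod limit: maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
# e.g. an m5.large has 3 ENIs with 10 IPs each: 3 * (10 - 1) + 2 = 29 pods per node.
#
# To lift the limit, you put something like this in the launch template's
# user data, telling the bootstrap script not to derive max pods from ENI math:
/etc/eks/bootstrap.sh demo \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
```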
I rest my case.