The ups and downs of cert-manager

When I visit a website and see that it isn’t served over https I always get a little smug.  After all, my website is served over https and it’s all thanks to cert-manager.  Cert-manager is a great tool if you are cheap.  For now, I think you can safely ignore cert-manager if money is no object.  I say “for now” because I predict cert-manager will become the de facto standard for managing certificates when and if Jetstack releases a v1.

The Ups

I’m a glass-half-full kind of person so let’s talk about the positive aspects of cert-manager first.  First and foremost, cert-manager automates provisioning ACME certificates from providers like letsencrypt.  Free certificates, automatically.  Better yet, cert-manager integrates seamlessly with Kubernetes ingress resources.  To get a certificate all you need to do is add a little snippet like this to your ingress resource:

spec:
  tls:
  - hosts:
    - joecreager.com
    secretName: joecreager-com-tls

Easy, right?  Yes it is.  And it is so easy that it lulls you into believing it will always be this easy.  Until…

The Downs

In October 2019 I received a friendly email from letsencrypt stating the version of cert-manager I was using was spamming their API and they were going to start blocking requests from older versions.  Dang.  That weekend I tackled the upgrade.  It was stressful and not terribly straightforward.  Oftentimes cert-manager upgrades require taking additional steps to complete the upgrade.  If the steps are overlooked or done incorrectly you are better off starting from scratch.  That’s exactly what I did.

I knew this wasn’t the last hiccup in my future.  I was right.  A few days ago I got another email from letsencrypt.  This time it said the certificate for one of my domains was going to expire in 20 days.  I looked at the logs from cert-manager and discovered my certificate was failing to validate.  Basically it was trying to make a request to an endpoint under that domain to verify the certificate and the connection was timing out.  I copied the URL myself and hit it with curl and it worked fine.  Odd.
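For anyone retracing this step, here is a minimal sketch of the checks involved.  It assumes the standard ACME HTTP-01 path (/.well-known/acme-challenge/<token>) and that your cert-manager version exposes a challenges resource to kubectl.  The function names, domain, token, and namespace are placeholders of mine, not something from the logs.

```shell
#!/bin/sh
# Sketch: helpers for poking at a stuck HTTP-01 challenge.
# All arguments (domain, token, namespace) are placeholders you supply.

# Build the URL that an ACME HTTP-01 check requests for a given token.
challenge_url() {
  domain="$1"
  token="$2"
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$domain" "$token"
}

# Fetch the challenge URL with a short timeout so a hang shows up as a failure.
check_challenge() {
  curl --silent --show-error --max-time 10 "$(challenge_url "$1" "$2")"
}

# Show what cert-manager reports for the challenges in a namespace.
show_challenges() {
  namespace="$1"
  kubectl get challenges --namespace "$namespace"
  kubectl describe challenges --namespace "$namespace"
}

# Example (not run here):
#   check_challenge joecreager.com SOME_TOKEN
#   show_challenges default
```

Keep in mind that the failing request in the logs is made by cert-manager from inside the cluster, so a curl that succeeds from a laptop does not necessarily prove the same URL is reachable from where cert-manager runs.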

Life happened and I forgot about it until I got yet another email this weekend stating my certificate was 10 days from expiring.  Well now I had to do something about it.  So I copied the errors from the logs and plugged them into Google.  It was a fruitless endeavor.  No one appeared to be having this exact issue.  In the issue that was closest to the one I was having a maintainer chimed in and suggested upgrading to the latest version.  Then closed the issue.  I checked my version and found that I was a couple of releases behind.

Before taking the maintainer’s advice I decided to grasp at straws.  I don’t always grasp at straws but when I do I make things worse.  I decided to try deleting the certificate secret.  It didn’t work.  When cert-manager created a new certificate it couldn’t verify that one either.  Instead of a certificate that was going to expire in 10 days I now had an invalid certificate.  Doh!

With increasing blood pressure I reviewed the many manual actions required to upgrade from the version I was currently using (v0.11.0) to the latest (v0.15.1).  For example:

(No, really, you MUST read this before you upgrade)
Update Deployment selector to follow Helm chart best practices. This will require deleting the three cert-manager Deployment resources before upgrading.

I decided it would probably be easier to go with the nuclear option and start over.  The good news about going this route is that your certificates still stay in place.  As long as your certificates aren’t expiring soon you have plenty of time to figure things out.  I had already borked things up so I was suppressing the urge to close my laptop and weep at the dining room table.

I installed cert-manager with helm so I used helm to delete the current deployment.  This appeared to work until I tried to use helm to install the latest version.  This choked because there were still custom resource definitions created by cert-manager that were lingering about.  I hunted them down and deleted them too.  Some stubbornly refused to delete.  After furiously googling for an answer I found a GitHub issue with a solution that allowed me to delete the stuck CRD.

kubectl patch crd challenges.acme.cert-manager.io -p '{"metadata":{"finalizers": []}}' --type=merge
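The hunt for leftover CRDs can be sketched roughly like this.  It assumes the leftovers live under the cert-manager.io API group (or certmanager.k8s.io, the group used by releases before v0.11).  The function names are mine.  Treat it as a sketch rather than a recipe: deleting a CRD also deletes every resource of that type, and clearing finalizers skips whatever cleanup they were waiting on.

```shell
#!/bin/sh
# Sketch: find and remove CRDs left behind by a cert-manager uninstall.

# Read `kubectl get crd -o name` output on stdin and print the bare names
# of CRDs that belong to cert-manager's API groups.
cert_manager_crd_names() {
  grep -E '\.(cert-manager\.io|certmanager\.k8s\.io)$' | sed 's|^.*/||'
}

# List the leftover cert-manager CRDs in the current cluster.
list_cert_manager_crds() {
  kubectl get crd -o name | cert_manager_crd_names
}

# Ask for each leftover CRD to be deleted without blocking on finalizers.
delete_cert_manager_crds() {
  list_cert_manager_crds | while read -r crd; do
    kubectl delete crd "$crd" --wait=false
  done
}

# Clear the finalizers on one stuck CRD (same patch as the command above).
clear_crd_finalizers() {
  kubectl patch crd "$1" -p '{"metadata":{"finalizers": []}}' --type=merge
}

# Example (not run here):
#   delete_cert_manager_crds
#   clear_crd_finalizers challenges.acme.cert-manager.io
```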

After clearing the finalizers on the stuck CRD I was able to install the new version of cert-manager.  Things were looking up.  Then I noticed that the cert-manager webhook pod was taking an abnormally long time to start.  Upon further investigation I found it could not mount a secret as a volume.  The reasons for this escaped me at the time.  So I tore it down and tried older versions.  None of those worked either due to similar volume mounting issues.  By now my blood pressure was through the roof and I was getting that tight-chested feeling you get when you are pretty much at your breaking point.

I decided to give the latest version another shot.  Lucky for me I came across a handy troubleshooting guide from big blue.  It suggested adding an extra label to the cert-manager namespace.

kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true

Bam!  Problem solved.  The secret was able to mount as a volume.  My certificate was renewed.  I let out a sigh, stood up from my computer, and fixed a martini.  It’s a drink and a snack!

Thoughts

While I find cert-manager very useful for my personal stuff, it’s really not ready for prime time and I would not recommend it for production if you have more than two dimes to rub together.  If you are using AWS, you can provision a LoadBalancer service with a certificate for your https traffic with a simple annotation in the service.  AWS certificates can also be renewed automatically.
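To give a feel for the AWS route, here is a minimal sketch that writes out a Service manifest using the AWS load balancer annotations.  The service name, selector, ports, and certificate ARN are all made-up placeholders, and it assumes a cluster that uses the AWS cloud provider integration and a certificate that already exists in ACM.

```shell
#!/bin/sh
# Sketch: write a Service manifest that terminates TLS at an AWS load balancer.
# The name, selector, ports, and certificate ARN below are placeholders.
cat > aws-lb-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # ARN of an existing ACM certificate (placeholder value)
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:123456789012:certificate/REPLACE-ME
    # Traffic from the load balancer to the pods stays plain http
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    # Only the listener on port 443 uses the certificate
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - name: https
    port: 443
    targetPort: 80
EOF

# Apply it yourself once the placeholders are filled in:
#   kubectl apply -f aws-lb-service.yaml
```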

For some reason, cert-manager is very complicated.  Far more complicated than the architecture diagram in the readme would have you believe.  There are three pods and multiple custom resource definitions involved.  Additionally, a special ACME solver pod is spun up to verify each certificate.  Lots of moving parts.  I’m sure there are good reasons for the complexity.  However, I can’t help but feel there is a place for a much simpler (and more reliable) version for people who just want a letsencrypt certificate for a domain.  Something that is easy to troubleshoot and easy to tear down and redeploy.

And that’s the story of how blind faith in cert-manager and panicked decision making stole my Saturday and my sanity.  Here’s to hoping for the best until the next fire needs extinguishing.