Kubernetesの高可用性構築に挑む：フェイルオーバー、MetalLBの活用

株式会社LOWWS CTOのスラヴィ・パンタレーブ(Slavi Pantaleev)と、株式会社LOWWS プログラマーのルボミル・ポポヴ(Lyubomir Popov)にインタビューし、現在携わっているプロジェクトの中で興味深い技術や最新の知見について聞いていくコーナーです。

今回はルボに同席してもらい、スラヴィにKubernetesクラスターの可用性向上と運用課題について聞きました。

日本語の記事に続いて英語の記事があります。

This is an interview with Slavi Pantaleev, CTO of LOWWS Inc., and Lyubomir Popov, programmer at LOWWS Inc., where we explore interesting technologies and the latest insights from their current projects.

This time, Lyubo joined us for a discussion with Slavi about the enhancement of Kubernetes cluster availability and related operational issues.

The English article follows the Japanese article.

フェイルオーバー、MetalLBによる高可用性Kubernetesクラスター構築

スラヴィ：最近はKubernetes関連の作業を中心にやっています。ある会社のために取り組んでいて、顧客向けにまた新しいKubernetesクラスターを構築しているところです。今回は可用性の高い構成にしたいという要望なので、ノードの数を増やす必要がありますし、ロードバランシングや、ノードがダウンしたときにどうやってダウンタイムを回避するかも考えないといけません。以前にもいくつかクラスターを構築してきましたが、そのほとんどはコントロールプレーンが1つしかありませんでした。そのため、それが落ちると新しいコンテナのスケジューリングができなくなったり、クラスターとの通信ができなくなったりします。なので、今回は複数のコントロールプレーンを用意する必要があります。負荷分散まではしなくてもよいですが、フェイルオーバーは必要です。1つが落ちても、別のコントロールプレーンにAPIリクエストを送り続けられるようにしたいのです。

もう1つの単一障害点がIngressです。Kubernetesでは、さまざまなコンテナが別々のノード上で動作しています。通常、Ingress controllerというコンポーネントを使って、全てのトラフィックをそこに通す形になります。以前は、このIngress controllerを1つのノードにしか置いておらず、そのノードだけがパブリックIPを持っていました。なので、そのノードが落ちると、クラスターへのすべての外部通信が遮断されてしまっていました。この問題を解決するために、ロードバランサーのソリューションを調べています。有名なものの1つにMetalLBがあり、これは自前のベアメタルKubernetesクラスター向けのものです。クラウド上でKubernetesを使っていれば、プロバイダーから冗長化されたコントロールプレーンやロードバランサーが提供されますが、自分たちでベアメタル環境を運用しているので、それを自力で解決しなければなりません。

目標はダウンタイムを最小限に抑えることです。ノードがダウンしたときには、できれば1分以内に別のノードに切り替わるようにしたいと思っています。どのソリューションを選ぶかによってトレードオフもあります。MetalLBを使って試したり、Cloudflare DNSを使ってヘルスチェックやトラフィックの振り分けを行う案も検討しています。もっと詳しく話すこともできますが、これがだいたいの概要です。

ナオト：ルボはKubernetesを自分のプロジェクトで使ったことはありますか？

ルボ：いえ、使ったことはありません。フェイルオーバーの構成が必要なとき、例えば複数のサーバーやデータセンターがあるような場合には便利だと聞いています。ただ、自分でKubernetesクラスターを構成したことはないですね。

スラヴィ：その場合はKubernetesと格闘する手間が省けて良いですね。Kubernetesには良い面もありますが、難しさもあります。アップグレードのたびに、別の部分で不具合が出たりします。最近、新しいKubernetesのバージョンにアップグレードしましたが、KubesprayというAnsible Playbookを使いました。通常はスムーズに行くのですが、今回はリグレッションが多くて、数日かかってしまいました。このKubesprayプロジェクトは、一般的にはかなり人気のあるものらしいです。CNCF（Cloud Native Computing Foundation）のプロジェクトで、プルリクエストに対していろいろな承認があって、すべてが正しく動作するかチェックする自動化もたくさんありますし、複数のチームメンバーによる承認もあります。それでもやっぱり、品質は正直あまり高くないと感じますし、リグレッションがすごく多いです。なので、個人的にはあまり優れたAnsible Playbookだとは思っていません。

他の人たちとも話してみたんですが、Kubernetesをセルフホストしている場合、正直あまり良い選択肢は多くありません。なので、このKubesprayが最善の手段かもしれませんが、決して素晴らしいとは言えないですね。実際、Kubesprayでは多くのリグレッションを追いかけたり、バグを修正したり、パッチを適用したりといった面倒ごとがたくさんありました。リリースは2週間前だったのに、いまだにバグが修正されていなかったり、修正用のマイナーリリースも出ていなかったりします。これが本当に困りものなんです。待っていれば改善されるかというと、結局壊れたままのプロジェクトが提供されて、スムーズにクラスターをアップグレードすることができません。でも最終的には、手作業と試行錯誤を重ねて、なんとか動くようにはしました。

要するに、これを扱うのはほぼフルタイムの仕事みたいなものです。何かをアップグレードすると、クラスター内の他の部分の更新が必要になってきます。たとえば、Ingress Controller をアップグレードすると、別のバグに当たってしまって、今度は Argo CD をアップグレードしなければいけなくなったりします。というのも、新しい Ingress エンジンの Helm チャートが古い Argo CD に対応していないからです。それでまたアップグレードして…と、最初にやろうとしていたことが何だったのか、もうわからなくなってしまいます。数日間アップグレード作業にかかりきりになるんです。とはいえ、うまくクラスターが稼働するようになれば、たくさんの開発者がそのクラスターを使えるようになりますし、個別に手間をかける必要もなくなります。でもそのためには、誰かがこのクラスターを維持して、最新の状態に保つ役割を果たさないといけません。

ルボ：Cloudflare DNSをKubernetesでどう使ってるんですか？Cluster DNSとか、Cloudflare DNSの使い方について教えてください。

スラヴィ：アイデアとしては、複数のノードをIngressのエントリーポイントとして指定して、それぞれにパブリックIPを持たせます。そして、フェイルオーバーやロードバランシングの手段として、ドメインに対して複数のAレコードを設定して、それら3〜4つのIPに対応させるというやり方です。ただ、問題なのは、複数のAレコードを見たときの挙動が、OSやクライアントによって違うってことなんです。特にWindowsでは、最初の1つしか使わないことが多くて、実際にはロードバランシングがうまく働いていません。それに、どのOSでもフェイルオーバーの反応がすごく遅いです。たとえば、最初のIPアドレスに通信しようと20秒くらい粘って、それでもだめなら次に切り替えるという感じです。私が見た限りでは、フェイルオーバーに20秒くらいかかっています。なので、複数のAレコードを使ってロードバランシングするのは、正直あまりいい方法とは言えないですね。遅すぎます。

これはOSだけじゃなくて、プロトコルにも左右されます。たとえば、HTTPをTCPで使っている場合は、フェイルオーバーにやっぱり20秒くらいかかります。でも、HTTP/3 (QUIC)であれば、1秒もかからずに済みます。そっちの方がずっといいですね。ただ、QUICはまだあまり普及していないし、対応していないコンポーネントもあると思います。

ということで、今のところ私たちはCloudflareのSpectrumっていうサービスを使おうと考えています。これはTCPベースのロードバランシングをしてくれるもので、内部的にヘルスチェックも行って、正しいIPアドレスにトラフィックを送ってくれます。ただ、まだそこまで設定は進んでいなくて、今も作業中です。もし似たようなことをやった経験があれば、どうやったか教えてもらえるとありがたいです。

ルボ：まだやったことはないですが、面白そうですね。

スラヴィ：ええ、そうなんです。通常は、誰かがホスティングしてくれているKubernetesサービスを使っていれば、こういう問題はあまり気にしなくていいんですよね。私たちの場合は、ベアメタルじゃないとはいえ、クラウド上の仮想マシンを使っていて、それもあまり理想的とは言えません。ただ、これはお客様の要望なんですよ。今はこのプロセスを通じて学びながら、できればクラウドに依存しない形を目指して進めています。

Setting up a High-Availability Kubernetes Cluster with Failover and MetalLB

Slavi: Lately, I’ve mostly been working on Kubernetes for one company. We’re building yet another Kubernetes cluster for a customer. This time, they want it to be highly available, so we need more nodes in the cluster. We also need to figure out how to do load balancing and how to avoid downtime if a node fails. We’ve built a number of clusters before, but most of them had a single control plane node. If it dies, new containers can’t be scheduled, and you can’t communicate with the cluster. So one thing we need is multiple control plane nodes. We don’t necessarily need to load balance between them, but we do need failover: if one dies, we can keep sending API requests to another.
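Editor’s sketch: the failover Slavi describes for the control plane (no load balancing, just "use another one if the first is down") can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the setup used in the project: the addresses are hypothetical, and the health probe is injected as a function so the example runs without a real cluster.

```python
from typing import Callable, Iterable, Optional


def pick_api_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except OSError:
            # Treat connection errors the same as a failed probe.
            continue
    return None


# Hypothetical addresses for three control plane nodes.
CONTROL_PLANES = [
    "https://10.0.0.11:6443",
    "https://10.0.0.12:6443",
    "https://10.0.0.13:6443",
]

# Simulate the first node being down. A real probe might instead send an
# HTTPS request to the API server's /readyz endpoint.
down = {"https://10.0.0.11:6443"}
chosen = pick_api_endpoint(CONTROL_PLANES, lambda url: url not in down)
print(chosen)  # https://10.0.0.12:6443
```

In practice this logic usually lives in a proxy or virtual IP in front of the API servers rather than in every client, so that clients only need one address.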

Another single point of failure is Ingress. In Kubernetes, various containers run on different nodes. Usually, you have an Ingress controller component that all traffic flows through. Previously, we had it on a single node, which was the only one with a public IP. If that node failed, we lost all incoming communication to the cluster. To solve this, I’m looking into load balancer solutions. One popular option is MetalLB, which targets self-hosted, bare-metal Kubernetes clusters. Normally, in the cloud, you get redundant control planes and load balancer components from the provider. But since we’re self-hosting, we need to solve it ourselves.
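Editor’s sketch: one way tools such as MetalLB (in its layer 2 mode) remove this single point of failure is by letting exactly one healthy node answer for the service IP at a time, with another node taking over when it fails. The toy model below illustrates only that ownership idea. It is not MetalLB’s code or API; the node names, the IP, and the hash-based ranking are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, Optional


def pick_announcer(service_ip: str, healthy_nodes: Iterable[str]) -> Optional[str]:
    """Deterministically choose one healthy node to own the service IP.

    Every node can run this on the same inputs and reach the same answer,
    so no extra coordination is needed to agree on the owner.
    """
    def rank(node: str) -> str:
        return hashlib.sha256(f"{node}#{service_ip}".encode()).hexdigest()

    candidates = sorted(healthy_nodes, key=rank)
    return candidates[0] if candidates else None


# Hypothetical node names and service IP (documentation address range).
NODES = ["node-a", "node-b", "node-c"]
SERVICE_IP = "203.0.113.50"

owner = pick_announcer(SERVICE_IP, NODES)

# Simulate the current owner failing: the survivors pick a new owner.
survivors = [n for n in NODES if n != owner]
new_owner = pick_announcer(SERVICE_IP, survivors)
print(owner, "->", new_owner)
```

The important property is that the result does not depend on the order in which nodes are listed, only on which nodes are currently healthy.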

The goal is to minimize downtime. If a node fails, we want a quick switchover to another node, ideally in less than a minute. There are tradeoffs depending on the solution we choose. I’ve been experimenting with MetalLB and considering using Cloudflare DNS for health checks and for distributing traffic between different public IPs. We can go into more detail if you’d like, but that’s the general idea.
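Editor’s sketch: for a DNS-based approach with health checks, the "less than a minute" target can be sanity-checked with simple arithmetic. The numbers below are assumptions for illustration, not measurements or Cloudflare defaults, and the model ignores probe timeouts and resolvers that cache records beyond their TTL.

```python
def worst_case_failover_seconds(
    check_interval_s: float,
    failures_to_mark_down: int,
    dns_ttl_s: float,
) -> float:
    """Rough upper bound: time to detect the failure plus time for cached
    DNS answers pointing at the dead node to expire."""
    detection = check_interval_s * failures_to_mark_down
    return detection + dns_ttl_s


# Assumed values: probe every 10 s, mark down after 2 failures, 30 s TTL.
total = worst_case_failover_seconds(10, 2, 30)
print(total, total < 60)  # 50 True
```

Shortening the check interval or the TTL tightens the bound, at the cost of more probe traffic and more DNS queries.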

Naoto: Lyubo, have you ever used Kubernetes in your own project?

Lyubo: No, I haven’t. I know it’s useful for failover setups, like when you have multiple servers or data centers. But I haven’t configured a Kubernetes cluster myself.

Slavi: That saves you a lot of effort fighting with Kubernetes. It has its nice points, but also its difficulties. Every upgrade seems to break something else. I recently upgraded to a new Kubernetes version using Kubespray, which is an Ansible playbook. Normally upgrades go smoothly, but this time it took multiple days because of many regressions. Kubespray is supposedly a very popular project. It’s a CNCF (Cloud Native Computing Foundation) project. They have various approvals for pull requests, lots of automation for checking that everything works, and approval by multiple team members. But still, I find the quality fairly low, with so many regressions, and I don’t think it’s a great Ansible playbook.

But based on my discussions with other people who are also self-hosting Kubernetes, you don’t have a lot of good options for doing it. So maybe this playbook is your best bet, but it’s not a great one. I had lots of trouble chasing regressions, fixing bugs in the playbook, and applying patches. Even though the release was two weeks ago, they still haven’t fixed these bugs or created a minor release to fix them. This is quite annoying. Even if you wait, you still get a broken project, and you can’t really upgrade your cluster smoothly. But I got it working eventually, after lots of manual work and digging around.

Basically, dealing with this is a full-time job. You upgrade one thing, and then you see there are other updates for the cluster, like a new Ingress controller version. You apply that, and then you hit another bug. Then you see you need to upgrade Argo CD, for example, because the new Ingress controller Helm chart is not compatible with the older Argo CD. So you need to upgrade that too. You start with one thing, and then you end up upgrading things for multiple days and lose track of what you initially wanted to do. It’s a lot of work. Hopefully, you have a large cluster, and once you get it working, many developers can use it without having to spend time on it themselves. But someone needs to spend the time to keep it all running and updated.

Lyubo: How do you use Cloudflare DNS with Kubernetes? Could you explain how cluster DNS and Cloudflare DNS are used?

Slavi: The idea is that we have multiple nodes which we designate as Ingress entry points, each with a public IP. One solution for failover and load balancing is to have multiple A records for the domain pointing to these three or four IPs. But the problem is that different operating systems and clients behave differently when they see multiple A records. Windows in particular often uses only the first one, so it’s not really doing any load balancing. Also, all of them have very slow failover. Clients usually spend about 20 seconds trying to communicate with the first IP address. Only if that fails do they fall back to the second or third one, so the failover takes about 20 seconds.

So it’s not a great idea to rely on multiple A records for load balancing and failover, because it’s very slow. This also depends on the protocol, not just the operating system. If you’re doing HTTP over TCP, it takes about 20 seconds. If you’re doing HTTP/3 (QUIC), I think it’s less than a second. So it’s much better, but QUIC is still not widely used and may not be supported by some components.
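Editor’s sketch: the slow failover comes from clients trying the resolved addresses one after another and waiting for a connection timeout on each dead one. The Python sketch below models that behaviour. The IPs are documentation addresses, the 20-second timeout is taken from Slavi’s observation rather than from any OS default, and the `connect` parameter is injectable so the example runs without a network (it defaults to the real `socket.create_connection`).

```python
import socket
from typing import Callable, Sequence, Tuple

Address = Tuple[str, int]


def connect_with_fallback(
    addresses: Sequence[Address],
    timeout_s: float,
    connect: Callable[..., object] = socket.create_connection,
):
    """Try each address in order; return (connection, failed_attempts).

    Each failed attempt can cost up to timeout_s seconds before the next
    address is tried, which is where the slow failover comes from.
    """
    failed = 0
    for address in addresses:
        try:
            return connect(address, timeout=timeout_s), failed
        except OSError:
            failed += 1
    raise ConnectionError(f"all {failed} addresses failed")


def fake_connect(address: Address, timeout: float) -> str:
    """Stand-in for a real TCP connect: the first node is simulated as dead."""
    if address[0] == "203.0.113.10":
        raise TimeoutError("simulated dead node")
    return f"connected to {address[0]}"


A_RECORDS = [("203.0.113.10", 443), ("203.0.113.11", 443), ("203.0.113.12", 443)]
conn, failed = connect_with_fallback(A_RECORDS, timeout_s=20.0, connect=fake_connect)
print(conn, "after up to", failed * 20.0, "seconds of waiting")
```

With one dead address at the front of the list, the client can lose a full timeout before it reaches a working node, regardless of how many healthy A records follow.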

So DNS-based failover like this is slow. We’re thinking of using Cloudflare Spectrum, which does TCP-based load balancing for us. They would do health checks internally and send traffic to the correct IP address. But I still haven’t gotten to this point. It’s a work in progress. If you have experience with something similar, any ideas on how to do this would be helpful.
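Editor’s sketch: a health-checked load balancer avoids the client-side timeout by removing dead origins from rotation before clients ever reach them. The following is a generic toy model of that idea in Python. It is not Cloudflare Spectrum’s implementation or API; the class name, the threshold, and the IPs are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OriginPool:
    """Round-robin over origins that have not failed too many checks in a row."""
    origins: List[str]
    failures_to_mark_down: int = 2
    _failures: Dict[str, int] = field(default_factory=dict)
    _next: int = 0

    def record_check(self, origin: str, ok: bool) -> None:
        # A successful check resets the counter; a failed one increments it.
        self._failures[origin] = 0 if ok else self._failures.get(origin, 0) + 1

    def healthy(self) -> List[str]:
        return [
            o for o in self.origins
            if self._failures.get(o, 0) < self.failures_to_mark_down
        ]

    def route(self) -> Optional[str]:
        candidates = self.healthy()
        if not candidates:
            return None
        origin = candidates[self._next % len(candidates)]
        self._next += 1
        return origin


pool = OriginPool(["203.0.113.10", "203.0.113.11", "203.0.113.12"])

# Two consecutive failed checks take the first origin out of rotation.
pool.record_check("203.0.113.10", ok=False)
pool.record_check("203.0.113.10", ok=False)

routes = [pool.route() for _ in range(4)]
print(routes)
```

Because the decision is made on the load balancer side, clients keep connecting to a single address and never wait on a dead node themselves.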

Lyubo: No, I haven’t done that yet, but it would be interesting to try.

Slavi: Yeah, it is. Usually, you don’t need to solve this if you’re using a hosted Kubernetes service. In our case, it’s not really bare metal because we’re using virtual machines in the cloud. I think it’s not ideal, but maybe in the future, it will be something else. We’re trying to learn as we go and trying to stay independent of cloud providers if possible. That’s what the customer wants.

この記事はインタビューをもとにAIを使用して作成されています。 This article was created using AI based on interviews.