K8sgpt-Operator: An Automated Diagnostic Tool for Kubernetes

Date: 2023-05-04 14:07:29  Source: 云原生指北

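Background

On Kubernetes, many things can go wrong between creating a Deployment and having it serve traffic; for a tour of the common failure modes, see A Visual Guide to Troubleshooting Kubernetes Deployments (2021 Chinese edition)[1]. As that guide shows, these problems leave traces, and the error message is usually enough to find a fix. With the rise of ChatGPT, LLM-based text generation projects keep appearing, and k8sgpt[2] is one of them.

k8sgpt is a tool for scanning Kubernetes clusters and for diagnosing and triaging problems. It codifies SRE experience into its analyzers and uses AI to extract and enrich the relevant information.

It ships with many built-in analyzers: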

podAnalyzer
pvcAnalyzer
rsAnalyzer
serviceAnalyzer
eventAnalyzer
ingressAnalyzer
statefulSetAnalyzer
deploymentAnalyzer
cronJobAnalyzer
nodeAnalyzer
hpaAnalyzer (optional)
pdbAnalyzer (optional)
networkPolicyAnalyzer (optional)

k8sgpt exposes its capabilities through a CLI, which can quickly diagnose errors in a cluster.

k8sgpt analyze --explain --filter=Pod --namespace=default --output=json
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/test",
      "error": [
        {
          "Text": "Back-off pulling image \"flomesh/pipy2\"",
          "Sensitive": []
        }
      ],
      "details": "The Kubernetes system is experiencing difficulty pulling the requested image named \"flomesh/pipy2\". \n\nThe solution may be to check that the image is correctly spelled or to verify that it exists in the specified container registry. Additionally, ensure that the networking infrastructure that connects the container registry and Kubernetes system is working properly. Finally, check if there are any access restrictions or credentials required to pull the image and ensure they are provided correctly.",
      "parentObject": "test"
    }
  ]
}
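Because the report is plain JSON, it is easy to post-process. The following is a minimal sketch, not part of k8sgpt itself: it assumes the field names shown in the sample output above (`results`, `kind`, `name`, `error`, `Text`), which may differ in other k8sgpt versions. The function name `summarize_report` is hypothetical, and the embedded sample shortens the `details` text.

```python
import json

# Sample report copied from the `k8sgpt analyze --output=json` run above
# (the "details" text is shortened here for brevity).
SAMPLE_REPORT = r'''
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/test",
      "error": [
        {"Text": "Back-off pulling image \"flomesh/pipy2\"", "Sensitive": []}
      ],
      "details": "The Kubernetes system is experiencing difficulty pulling the requested image named \"flomesh/pipy2\".",
      "parentObject": "test"
    }
  ]
}
'''


def summarize_report(report_json):
    """Return one 'Kind namespace/name: error text' line per reported error."""
    report = json.loads(report_json)
    lines = []
    # Assumption: "results" may be absent or null when nothing is wrong.
    for result in report.get("results") or []:
        for err in result.get("error") or []:
            lines.append(f'{result["kind"]} {result["name"]}: {err["Text"]}')
    return lines


if __name__ == "__main__":
    for line in summarize_report(SAMPLE_REPORT):
        print(line)
```

On a live cluster, the string passed to `summarize_report` would be the standard output of the `k8sgpt analyze` command rather than the embedded sample.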

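However, running a command for every diagnosis is tedious and limiting. What most of us really want is for problems to be detected and diagnosed automatically. That is where k8sgpt-operator[3], the subject of this article, comes in.

Introduction

In short, k8sgpt-operator runs k8sgpt automatically inside the cluster. It provides two CRDs: K8sGPT and Result. The former configures k8sgpt and its behavior; the latter holds the diagnosis for each problematic resource.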

apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: kube-system
spec:
  model: gpt-3.5-turbo
  backend: openai
  noCache: false
  version: v0.2.7
  enableAI: true
  secret:
    name: k8sgpt-sample-secret
    key: openai-api-key

Demo

The test environment is a k3s cluster.

export INSTALL_K3S_VERSION=v1.23.8+k3s2
curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable local-storage --disable servicelb --write-kubeconfig-mode 644 --write-kubeconfig ~/.kube/config

Install k8sgpt-operator

helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install release k8sgpt/k8sgpt-operator -n openai --create-namespace

Once the installation finishes, the two CRDs installed with the operator are visible: k8sgpts and results.

kubectl api-resources | grep -i gpt
k8sgpts                                        core.k8sgpt.ai/v1alpha1                true         K8sGPT
results                                        core.k8sgpt.ai/v1alpha1                true         Result

Before starting, generate an OpenAI API key[4] and store it in a secret.

OPENAI_TOKEN=xxxx
kubectl create secret generic k8sgpt-sample-secret --from-literal=openai-api-key=$OPENAI_TOKEN -n openai
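For reference, `kubectl create secret generic` produces a Secret of type Opaque whose values are stored base64-encoded (encoded, not encrypted). The sketch below builds an equivalent manifest by hand to make that visible; the helper name `build_secret_manifest` is hypothetical, and `xxxx` is the same placeholder token used above.

```python
import base64
import json


def build_secret_manifest(name, namespace, key, token):
    """Build an Opaque Secret whose single data entry holds the API token."""
    # Secret values under "data" must be base64-encoded strings.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


if __name__ == "__main__":
    secret = build_secret_manifest(
        name="k8sgpt-sample-secret",
        namespace="openai",
        key="openai-api-key",
        token="xxxx",  # placeholder, as in the shell example above
    )
    print(json.dumps(secret, indent=2))
```

The secret name and key must match the `secret.name` and `secret.key` fields of the K8sGPT resource, and the Secret lives in the same namespace (`openai`) in this walkthrough.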

Next, create the K8sGPT resource.

kubectl apply -n openai -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
spec:
  model: gpt-3.5-turbo
  backend: openai
  noCache: false
  version: v0.2.7
  enableAI: true
  secret:
    name: k8sgpt-sample-secret
    key: openai-api-key
EOF

After the command above runs, a Deployment named k8sgpt-deployment is created automatically in the openai namespace.

Test

Create a pod with an image that does not exist.

kubectl run test --image flomesh/pipy2 -n default

A resource named defaulttest then appears in the openai namespace.

kubectl get result -n openai
NAME          AGE
defaulttest   5m7s

Its details show the diagnosis and the resource that has the problem.

kubectl get result -n openai defaulttest -o yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2023-05-02T09:00:32Z"
  generation: 1
  name: defaulttest
  namespace: openai
  resourceVersion: "1466"
  uid: 2ee27c26-61c1-4ef5-ae27-e1301a40cd56
spec:
  details: "The error message is indicating that Kubernetes is having trouble pulling
    the image \"flomesh/pipy2\" and is therefore backing off from trying to do so.
    \n\nThe solution to this issue would be to check that the image exists and that
    the spelling and syntax of the image name is correct. Additionally, check that
    the image is accessible from the Kubernetes cluster and that any required authentication
    or authorization is in place. If the issue persists, it may be necessary to troubleshoot
    the network connectivity between the Kubernetes cluster and the image repository."
  error:
  - text: Back-off pulling image "flomesh/pipy2"
  kind: Pod
  name: default/test
  parentObject: test
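Since Result objects are ordinary Kubernetes resources, they can be listed as JSON with the standard `-o json` flag and processed by a script. The sketch below is an illustration under assumptions: the function name `list_diagnoses` is hypothetical, the embedded sample is a trimmed stand-in for the output of `kubectl get results -n openai -o json`, and the field names follow the Result shown above (note the lowercase `text`, unlike `Text` in the CLI report).

```python
import json

# Trimmed stand-in for `kubectl get results -n openai -o json`, using the
# values from the Result shown above ("details" shortened).
SAMPLE_RESULTS = r'''
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "core.k8sgpt.ai/v1alpha1",
      "kind": "Result",
      "metadata": {"name": "defaulttest", "namespace": "openai"},
      "spec": {
        "details": "The error message is indicating that Kubernetes is having trouble pulling the image \"flomesh/pipy2\" and is therefore backing off from trying to do so.",
        "error": [{"text": "Back-off pulling image \"flomesh/pipy2\""}],
        "kind": "Pod",
        "name": "default/test",
        "parentObject": "test"
      }
    }
  ]
}
'''


def list_diagnoses(results_json):
    """Return (result name, kind, resource, error texts) for each Result item."""
    diagnoses = []
    for item in json.loads(results_json).get("items", []):
        spec = item["spec"]
        errors = [e["text"] for e in spec.get("error") or []]
        diagnoses.append(
            (item["metadata"]["name"], spec["kind"], spec["name"], errors)
        )
    return diagnoses


if __name__ == "__main__":
    for result_name, kind, resource, errors in list_diagnoses(SAMPLE_RESULTS):
        print(f"{result_name}: {kind} {resource}")
        for text in errors:
            print(f"  - {text}")
```

A script like this could feed the diagnoses into an alerting or ticketing system, which is the kind of automation the operator makes possible compared with running the CLI by hand.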

References

[1] A Visual Guide to Troubleshooting Kubernetes Deployments (2021 Chinese edition): https://atbug.com/troubleshooting-kubernetes-deployment-zh-v2/

[2] k8sgpt: https://github.com/k8sgpt-ai/k8sgpt

[3] k8sgpt-operator: https://github.com/k8sgpt-ai/k8sgpt-operator

[4] OpenAI API key: https://platform.openai.com/account/api-keys
