
Open source observability for AWS Inferentia nodes within Amazon EKS clusters | Amazon Web Services


Recent developments in machine learning (ML) have led to increasingly large models, some with hundreds of billions of parameters. Although these models are more powerful, training them and running inference on them requires significant computational resources. Despite the availability of advanced distributed training libraries, training and inference jobs commonly need hundreds of accelerators (GPUs or purpose-built ML chips such as AWS Trainium and AWS Inferentia), and therefore tens or hundreds of instances.

In such distributed environments, observability of both instances and ML chips becomes key to fine-tuning model performance and optimizing cost. Metrics allow teams to understand workload behavior, optimize resource allocation and utilization, diagnose anomalies, and increase overall infrastructure efficiency. For data scientists, ML chip utilization and saturation are also relevant for capacity planning.

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

The pattern is part of the AWS CDK Observability Accelerator, a set of opinionated modules to help you set up observability for Amazon EKS clusters. The AWS CDK Observability Accelerator is organized around patterns, which are reusable units for deploying multiple resources. The open source observability set of patterns instruments observability with Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector to collect metrics, and Amazon Managed Service for Prometheus to store them.

Solution overview

The following diagram illustrates the solution architecture.

This solution deploys an Amazon EKS cluster with a node group that includes Inf1 instances.

The AMI type of the node group is AL2_x86_64_GPU, which uses the Amazon EKS optimized accelerated Amazon Linux AMI. In addition to the standard Amazon EKS-optimized AMI configuration, the accelerated AMI includes the NeuronX runtime.

To access the ML chips from Kubernetes, the pattern deploys the AWS Neuron device plugin.

Metrics are exposed to Amazon Managed Service for Prometheus by the neuron-monitor DaemonSet, which deploys a minimal container, with the Neuron tools installed. Specifically, the neuron-monitor DaemonSet runs the neuron-monitor command piped into the neuron-monitor-prometheus.py companion script (both commands are part of the container):

neuron-monitor | neuron-monitor-prometheus.py --port <port>

The command uses the following components:

  • neuron-monitor collects metrics and stats from the Neuron applications running on the system and streams the collected data to stdout in JSON format
  • neuron-monitor-prometheus.py maps and exposes the telemetry data from JSON format into Prometheus-compatible format
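
To get a feel for what the second stage of the pipeline exposes, it can help to list the metric names present in a Prometheus text-format payload. The following is a minimal shell sketch, not part of the pattern: the function name list_metric_names is hypothetical, and the /metrics path in the usage comment is an assumption about how the endpoint is served.

```shell
# Print the unique metric names found in Prometheus text exposition format
# read from stdin. Comment lines ("# HELP", "# TYPE") and blank lines are
# skipped; labels and sample values are stripped.
list_metric_names() {
  awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); print }' | sort -u
}

# Usage (assumption: run where the exporter port is reachable, and the
# exporter answers on /metrics):
#   curl -s "http://localhost:<port>/metrics" | list_metric_names
```

Because the function only reads stdin, it can also be pointed at a saved scrape for offline inspection.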

Data is visualized in Amazon Managed Grafana by the corresponding dashboard.

The rest of the setup to collect and visualize metrics with Amazon Managed Service for Prometheus and Amazon Managed Grafana is similar to that used in other open source based patterns, which are included in the AWS Observability Accelerator for CDK GitHub repo.

Prerequisites

You need the following to complete the steps in this post:

Set up the environment

Complete the following steps to set up your environment:

  1. Open a terminal window and run the following commands:
export AWS_REGION=<YOUR AWS REGION>
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)

  1. Retrieve the workspace IDs of any existing Amazon Managed Grafana workspace:
aws grafana list-workspaces

The following is our sample output:

{
  "workspaces": [
    {
      "authentication": {
        "providers": [
          "AWS_SSO"
        ]
      },
      "created": "2023-06-07T12:23:56.625000-04:00",
      "description": "accelerator-workspace",
      "endpoint": "g-XYZ.grafana-workspace.us-east-2.amazonaws.com",
      "grafanaVersion": "9.4",
      "id": "g-XYZ",
      "modified": "2023-06-07T12:30:09.892000-04:00",
      "name": "accelerator-workspace",
      "notificationDestinations": [
        "SNS"
      ],
      "status": "ACTIVE",
      "tags": {}
    }
  ]
}

  1. Assign the values of id and endpoint to the following environment variables:
export COA_AMG_WORKSPACE_ID="<<YOUR-WORKSPACE-ID, similar to the above g-XYZ, without quotation marks>>"
export COA_AMG_ENDPOINT_URL="<<https://YOUR-WORKSPACE-URL, including protocol (i.e. https://), without quotation marks, similar to the above https://g-XYZ.grafana-workspace.us-east-2.amazonaws.com>>"

COA_AMG_ENDPOINT_URL needs to include https://.

  1. Create a Grafana API key from the Amazon Managed Grafana workspace:
export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $COA_AMG_WORKSPACE_ID \
  --query key \
  --output text)

  1. Set up a secret in AWS Systems Manager:
aws ssm put-parameter --name "/cdk-accelerator/grafana-api-key" \
  --type "SecureString" \
  --value $AMG_API_KEY \
  --region $AWS_REGION

The secret will be accessed by the External Secrets add-on and made available as a native Kubernetes secret in the EKS cluster.
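
Two details in the preceding steps are easy to get wrong: the endpoint URL must carry the https:// scheme, and --seconds-to-live is expressed in seconds (432000 is five days). The following is a minimal shell sketch of two checks you could run before creating the key. The function names has_https_scheme and days_to_seconds are hypothetical and are not part of the accelerator.

```shell
# Succeeds only if the given URL starts with the https:// scheme
# and has at least one character after it.
has_https_scheme() {
  case "$1" in
    https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Convert a whole number of days to seconds, for use with --seconds-to-live.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# Usage:
#   has_https_scheme "$COA_AMG_ENDPOINT_URL" || echo "endpoint is missing https://"
#   days_to_seconds 5    # the 5-day lifetime used in the command above
```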

Bootstrap the AWS CDK environment

The first step to any AWS CDK deployment is bootstrapping the environment. You use the cdk bootstrap command in the AWS CDK CLI to prepare the environment (a combination of AWS account and AWS Region) with resources required by AWS CDK to perform deployments into that environment. AWS CDK bootstrapping is needed for each account and Region combination, so if you already bootstrapped AWS CDK in a Region, you don’t need to repeat the bootstrapping process.

cdk bootstrap aws://$ACCOUNT_ID/$AWS_REGION
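
If you are not sure whether a Region has already been bootstrapped, one way to check is to look for the bootstrap stack in AWS CloudFormation. The following sketch assumes the default bootstrap stack name, CDKToolkit; an environment bootstrapped with a custom toolkit stack name would not be detected. The function name cdk_is_bootstrapped is hypothetical.

```shell
# Returns 0 if a stack named CDKToolkit (the default CDK bootstrap stack)
# exists in the given Region for the current credentials, non-zero otherwise.
cdk_is_bootstrapped() {
  aws cloudformation describe-stacks \
    --stack-name CDKToolkit \
    --region "$1" >/dev/null 2>&1
}

# Usage: bootstrap only when the stack is missing.
#   cdk_is_bootstrapped "$AWS_REGION" || cdk bootstrap "aws://$ACCOUNT_ID/$AWS_REGION"
```

Running cdk bootstrap again on an already bootstrapped environment is also safe, so this check is a convenience rather than a requirement.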

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the cdk-aws-observability-accelerator repository and install the dependency packages. This repository contains AWS CDK v2 code written in TypeScript.
git clone https://github.com/aws-observability/cdk-aws-observability-accelerator.git
cd cdk-aws-observability-accelerator

The actual settings for Grafana dashboard JSON files are expected to be specified in the AWS CDK context. You need to update context in the cdk.json file, located in the current directory. The location of the dashboard is specified by the fluxRepository.values.GRAFANA_NEURON_DASH_URL parameter, and neuronNodeGroup is used to set the instance type, number, and Amazon Elastic Block Store (Amazon EBS) size used for the nodes.

  1. Enter the following snippet into cdk.json, replacing context:
"context": {
    "fluxRepository": {
      "name": "grafana-dashboards",
      "namespace": "grafana-operator",
      "repository": {
        "repoUrl": "https://github.com/aws-observability/aws-observability-accelerator",
        "name": "grafana-dashboards",
        "targetRevision": "main",
        "path": "./artifacts/grafana-operator-manifests/eks/infrastructure"
      },
      "values": {
        "GRAFANA_CLUSTER_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/cluster.json",
        "GRAFANA_KUBELET_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/kubelet.json",
        "GRAFANA_NSWRKLDS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/namespace-workloads.json",
        "GRAFANA_NODEEXP_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodeexporter-nodes.json",
        "GRAFANA_NODES_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/nodes.json",
        "GRAFANA_WORKLOADS_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/infrastructure/workloads.json",
        "GRAFANA_NEURON_DASH_URL" : "https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/grafana-dashboards/eks/neuron/neuron-monitor.json"
      },
      "kustomizations": [
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/infrastructure"
        },
        {
          "kustomizationPath": "./artifacts/grafana-operator-manifests/eks/neuron"
        }
      ]
    },
     "neuronNodeGroup": {
      "instanceClass": "inf1",
      "instanceSize": "2xlarge",
      "desiredSize": 1, 
      "minSize": 1, 
      "maxSize": 3,
      "ebsSize": 512
    }
  }

You can replace the Inf1 instance type with Inf2 and change the size as needed. To check availability in your selected Region, run the following command (amend Values as you see fit):

aws ec2 describe-instance-type-offerings \
  --filters Name=instance-type,Values="inf1*" \
  --query "InstanceTypeOfferings[].InstanceType" \
  --region $AWS_REGION

  1. Install the project dependencies:
npm install

  1. Run the following commands to deploy the open source observability pattern:
make build
make pattern single-new-eks-inferentia-opensource-observability deploy
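
Before deploying, you may want to confirm that the exact instance type configured in neuronNodeGroup (instanceClass plus instanceSize) is offered in your Region. The following is a minimal sketch that checks the output of the availability command shown earlier; the function name instance_type_offered is hypothetical, and the usage comment assumes you add --output text so that the instance types arrive as plain whitespace-separated text.

```shell
# Succeeds if the instance type "<class>.<size>" appears in the
# whitespace-separated list of instance types read from stdin.
#   $1: instance class, for example inf1
#   $2: instance size, for example 2xlarge
instance_type_offered() {
  tr -s '[:space:]' '\n' | grep -qxF -- "$1.$2"
}

# Usage:
#   aws ec2 describe-instance-type-offerings \
#     --filters Name=instance-type,Values="inf1*" \
#     --query "InstanceTypeOfferings[].InstanceType" \
#     --output text --region "$AWS_REGION" \
#     | instance_type_offered inf1 2xlarge || echo "inf1.2xlarge is not offered here"
```

The match is on the whole name, so inf1.xlarge does not satisfy a check for inf1.2xlarge.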

Validate the solution

Complete the following steps to validate the solution:

  1. Run the update-kubeconfig command. You should be able to get the command from the output message of the previous command:
aws eks update-kubeconfig --name single-new-eks-inferentia-opensource... --region <your region> --role-arn arn:aws:iam::xxxxxxxxx:role/single-new-eks-....

  1. Verify the resources you created:
kubectl get pods -A

The following screenshot shows our sample output.

  1. Make sure that the neuron-device-plugin-daemonset DaemonSet is running:
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system

The following is our expected output:

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   1         1         1       1            1           <none>          2h

  1. Confirm that the neuron-monitor DaemonSet is running:
kubectl get ds neuron-monitor --namespace kube-system

The following is our expected output:

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          2h

  1. To verify that the Neuron devices and cores are visible, run the neuron-ls and neuron-top commands from, for example, your neuron-monitor pod (you can get the pod’s name from the output of kubectl get pods -A):
kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-ls"

The following screenshot shows our expected output.

kubectl exec -it {your neuron-monitor pod} -n kube-system -- /bin/bash -c "neuron-top"

The following screenshot shows our expected output.
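
The validation steps above can be partly scripted. The following is a minimal sketch with two helpers that work on kubectl output piped through stdin: one checks that a DaemonSet listing reports READY equal to DESIRED, and the other picks the first pod whose name starts with a given prefix so you don't have to copy it by hand. The function names ds_all_ready and first_pod_named are hypothetical, and matching pods by name prefix is an assumption made here because the pod labels are not stated in this post.

```shell
# Reads `kubectl get ds <name>` table output on stdin. Succeeds when there is
# at least one data row and every row has DESIRED > 0 and READY == DESIRED.
# Columns: NAME DESIRED CURRENT READY ...
ds_all_ready() {
  awk 'NR > 1 { rows++; if ($2 == 0 || $2 != $4) bad++ }
       END { exit (rows > 0 && bad == 0) ? 0 : 1 }'
}

# Reads `kubectl get pods -o name` output on stdin (lines like pod/<name>)
# and prints the first entry whose name starts with the prefix in $1.
first_pod_named() {
  awk -v p="pod/$1" 'index($0, p) == 1 { print; exit }'
}

# Usage:
#   kubectl get ds neuron-monitor --namespace kube-system | ds_all_ready \
#     && echo "neuron-monitor is ready"
#   POD=$(kubectl get pods -n kube-system -o name | first_pod_named neuron-monitor)
#   kubectl exec -it "$POD" -n kube-system -- /bin/bash -c "neuron-ls"
```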

Visualize data using the Grafana Neuron dashboard

Log in to your Amazon Managed Grafana workspace and navigate to the Dashboards panel. You should see a dashboard named Neuron / Monitor.

To see some interesting metrics on the Grafana dashboard, we apply the following manifest:

curl https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/k8s-deployment-manifest-templates/neuron/pytorch-inference-resnet50.yml | kubectl apply -f -

This is a sample workload that compiles the torchvision ResNet50 model and runs repetitive inference in a loop to generate telemetry data.

To verify the pod was successfully deployed, run the following code:

kubectl get pods

You should see a pod named pytorch-inference-resnet50.

After a few minutes, looking into the Neuron / Monitor dashboard, you should see the gathered metrics similar to the following screenshots.

Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically.

Clean up

You can delete the whole AWS CDK stack with the following command:

make pattern single-new-eks-inferentia-opensource-observability destroy

Conclusion

In this post, we showed you how to introduce observability, with open source tooling, into an EKS cluster featuring a data plane running EC2 Inf1 instances. We started by selecting the Amazon EKS-optimized accelerated AMI for the data plane nodes, which includes the Neuron container runtime, providing access to AWS Inferentia and Trainium Neuron devices. Then, to expose the Neuron cores and devices to Kubernetes, we deployed the Neuron device plugin. The actual collection and mapping of telemetry data into Prometheus-compatible format was achieved via neuron-monitor and neuron-monitor-prometheus.py. Metrics were sourced from Amazon Managed Service for Prometheus and displayed on the Neuron dashboard of Amazon Managed Grafana.

We recommend that you explore additional observability patterns in the AWS Observability Accelerator for CDK GitHub repo. To learn more about Neuron, refer to the AWS Neuron documentation.


About the author

Riccardo Freschi is a Sr. Solutions Architect at AWS, focusing on application modernization. He works closely with partners and customers to help them transform their IT landscapes in their journey to the AWS Cloud by refactoring existing applications and building new ones.
