
Use Kubernetes Operators for new inference capabilities in Amazon SageMaker that reduce LLM deployment costs by 50% on average


We are excited to announce a new version of the Amazon SageMaker Operators for Kubernetes using AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like buckets, databases, or message queues simply by using the Kubernetes API.

Release v1.2.9 of the SageMaker ACK Operators adds support for inference components, which until now were only available through the SageMaker API and the AWS Software Development Kits (SDKs). Inference components can help you optimize deployment costs and reduce latency. With the new inference component capabilities, you can deploy one or more foundation models (FMs) on the same Amazon SageMaker endpoint and control how many accelerators and how much memory is reserved for each FM. This helps improve resource utilization, reduces model deployment costs on average by 50%, and lets you scale endpoints together with your use cases. For more details, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.

The availability of inference components through the SageMaker controller enables customers who use Kubernetes as their control plane to take advantage of inference components while deploying their models on SageMaker.

In this post, we show how to use SageMaker ACK Operators to deploy SageMaker inference components.

How ACK works

To demonstrate how ACK works, let’s look at an example using Amazon Simple Storage Service (Amazon S3). In the following diagram, Alice is our Kubernetes user. Her application depends on the existence of an S3 bucket named my-bucket.

How ACK Works

The workflow includes the following steps:

  1. Alice issues a call to kubectl apply, passing in a file that describes a Kubernetes custom resource for her S3 bucket (see the example manifest after this list). kubectl apply passes this file, called a manifest, to the Kubernetes API server running in the Kubernetes controller node.
  2. The Kubernetes API server receives the manifest describing the S3 bucket and determines whether Alice has permission to create a custom resource of kind s3.services.k8s.aws/Bucket, and whether the custom resource is properly formatted.
  3. If Alice is authorized and the custom resource is valid, the Kubernetes API server writes the custom resource to its etcd data store.
  4. It then responds to Alice that the custom resource has been created.
  5. At this point, the ACK service controller for Amazon S3, which is running on a Kubernetes worker node within the context of a normal Kubernetes pod, is notified that a new custom resource of kind s3.services.k8s.aws/Bucket has been created.
  6. The ACK service controller for Amazon S3 then communicates with the Amazon S3 API, calling the S3 CreateBucket API to create the bucket in AWS.
  7. After communicating with the Amazon S3 API, the ACK service controller calls the Kubernetes API server to update the custom resource’s status with information it received from Amazon S3.
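The following is a minimal sketch of the manifest Alice applies in step 1. It assumes the ACK S3 controller is installed in the cluster; check the controller documentation for your release in case the apiVersion or field names differ.

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-bucket
spec:
  # Name of the S3 bucket to create in the AWS account
  name: my-bucket

Saving this as bucket.yaml and running kubectl apply -f bucket.yaml starts the workflow described above.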

Key components

The new inference capabilities build upon SageMaker’s real-time inference endpoints. As before, you create the SageMaker endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model is configured in a new construct, an inference component. Here, you specify the number of accelerators and amount of memory you want to allocate to each copy of a model, together with the model artifacts, container image, and number of model copies to deploy.

You can use the new inference capabilities from Amazon SageMaker Studio, the SageMaker Python SDK, the AWS SDKs, and the AWS Command Line Interface (AWS CLI). They are also supported by AWS CloudFormation. Now you can also use them with the SageMaker Operators for Kubernetes.

Solution overview

For this demo, we use the SageMaker controller to deploy a copy of the Dolly v2 7B model and a copy of the FLAN-T5 XXL model from the Hugging Face Model Hub on a SageMaker real-time endpoint using the new inference capabilities.

Prerequisites

To follow along, you should have a Kubernetes cluster with the SageMaker ACK controller v1.2.9 or above installed. For instructions on how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon Elastic Compute Cloud (Amazon EC2) Linux managed nodes using eksctl, see Getting started with Amazon EKS – eksctl. For instructions on installing the SageMaker controller, refer to Machine Learning with the ACK SageMaker Controller.
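If you are starting from scratch, the following is a minimal sketch of an eksctl cluster configuration. The cluster name, Region, and node group sizing are illustrative; the nodes only need to run the ACK controller pods, because the models themselves are hosted on SageMaker-managed instances.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: sagemaker-ack-demo   # illustrative cluster name
  region: us-east-1
managedNodeGroups:
  - name: linux-nodes        # runs the ACK controller, not the models
    instanceType: t3.large
    desiredCapacity: 2

You would then create the cluster with eksctl create cluster -f cluster.yaml.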

You need access to accelerated instances (GPUs) for hosting the LLMs. This solution uses one ml.g5.12xlarge instance; you can check the availability of these instances in your AWS account and request them as needed via a Service Quotas increase request, as shown in the following screenshot.

Service Quotas Increase Request

Create an inference component

To create your inference component, define the EndpointConfig, Endpoint, Model, and InferenceComponent YAML files, similar to the ones shown in this section. Use kubectl apply -f <yaml file> to create the Kubernetes resources.

You can list the status of the resource via kubectl describe <resource-type>; for example, kubectl describe inferencecomponent.

You can also create the inference component without a model resource. Refer to the guidance provided in the API documentation for more information.
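The following is a sketch of what that might look like for the Dolly model deployed later in this post, with the container specified inline instead of referencing a Model resource. The specification.container field names here are assumptions based on the shape of the SageMaker CreateInferenceComponent API; confirm the exact names in the API documentation.

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    # Inline container specification (assumed field names) in place of modelName
    container:
      image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
      environment:
        HF_MODEL_ID: databricks/dolly-v2-7b
        HF_TASK: text-generation
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1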

EndpointConfig YAML

The following is the code for the EndpointConfig file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: inference-component-endpoint-config
spec:
  endpointConfigName: inference-component-endpoint-config
  executionRoleARN: <EXECUTION_ROLE_ARN>
  productionVariants:
  - variantName: AllTraffic
    instanceType: ml.g5.12xlarge
    initialInstanceCount: 1
    routingConfig:
      routingStrategy: LEAST_OUTSTANDING_REQUESTS

Endpoint YAML

The following is the code for the Endpoint file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: inference-component-endpoint
spec:
  endpointName: inference-component-endpoint
  endpointConfigName: inference-component-endpoint-config

Model YAML

The following is the code for the Model file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: dolly-v2-7b
spec:
  modelName: dolly-v2-7b
  executionRoleARN: <EXECUTION_ROLE_ARN>
  containers:
  - image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    environment:
      HF_MODEL_ID: databricks/dolly-v2-7b
      HF_TASK: text-generation
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: flan-t5-xxl
spec:
  modelName: flan-t5-xxl
  executionRoleARN: <EXECUTION_ROLE_ARN>
  containers:
  - image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi0.9.3-gpu-py39-cu118-ubuntu20.04
    environment:
      HF_MODEL_ID: google/flan-t5-xxl
      HF_TASK: text-generation

InferenceComponent YAMLs

In the following YAML files, given that the ml.g5.12xlarge instance comes with 4 GPUs, we allocate 2 GPUs, 2 CPU cores, and 1,024 MB of memory to each model:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-flan
spec:
  inferenceComponentName: inference-component-flan
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: flan-t5-xxl
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 2
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1

Invoke models

You can now invoke the models using the following code:

import boto3
import json

# Create a SageMaker runtime client for invoking hosted models
sm_runtime_client = boto3.client(service_name="sagemaker-runtime")
payload = {"inputs": "Why is California a great place to live?"}

# InferenceComponentName routes the request to a specific model
# hosted on the shared endpoint
response_dolly = sm_runtime_client.invoke_endpoint(
    EndpointName="inference-component-endpoint",
    InferenceComponentName="inference-component-dolly",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
result_dolly = json.loads(response_dolly['Body'].read().decode())
print(result_dolly)

response_flan = sm_runtime_client.invoke_endpoint(
    EndpointName="inference-component-endpoint",
    InferenceComponentName="inference-component-flan",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)
result_flan = json.loads(response_flan['Body'].read().decode())
print(result_flan)

Update an inference component

To update an existing inference component, you can update the YAML files and then use kubectl apply -f <yaml file>. The following is an example of an updated file:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: InferenceComponent
metadata:
  name: inference-component-dolly
spec:
  inferenceComponentName: inference-component-dolly
  endpointName: inference-component-endpoint
  variantName: AllTraffic
  specification:
    modelName: dolly-v2-7b
    computeResourceRequirements:
      numberOfAcceleratorDevicesRequired: 2
      numberOfCPUCoresRequired: 4 # Update the numberOfCPUCoresRequired.
      minMemoryRequiredInMb: 1024
  runtimeConfig:
    copyCount: 1

Delete an inference component

To delete an existing inference component, use the command kubectl delete -f <yaml file>.

Availability and pricing

The new SageMaker inference capabilities are available today in AWS Regions US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo). For pricing details, visit Amazon SageMaker Pricing.

Conclusion

In this post, we showed how to use SageMaker ACK Operators to deploy SageMaker inference components. Fire up your Kubernetes cluster and deploy your FMs using the new SageMaker inference capabilities today!


About the authors

Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Suryansh Singh is a Software Development Engineer at AWS SageMaker and works on developing distributed ML infrastructure solutions for AWS customers at scale.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Johna Liu is a Software Development Engineer on the Amazon SageMaker team. Her current work focuses on helping developers efficiently host machine learning models and improve inference performance. She is passionate about spatial data analysis and using AI to solve societal problems.
