Commit fe6555d

Merge pull request #42 from anyscale/re_invent-2025
Re invent 2025
2 parents a5afdb8 + 523d83a commit fe6555d


50 files changed: +2403 -0 lines changed
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
## Solution overview

The following architecture diagram illustrates SageMaker HyperPod with Amazon EKS orchestration and Anyscale.

<img src="assets/anyscale-aws-hyperpod-arch-diagram.png" width="1024" height="570">

See [here](details.md) for more details on this architecture.

## Getting Started

### Prerequisites

1. **AWS Account Setup**
    1. An **AWS account** with billing enabled
    1. [**AWS Identity and Access Management**](https://aws.amazon.com/iam/) (IAM) role permissions for SageMaker HyperPod
    1. [**AWS Credentials**](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) set up in your local environment, either as environment variables or through credentials and profile files
    1. [**AWS CLI**](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) installed on your local laptop
1. **Tools** installed on your local laptop:
    1. **Git CLI**, on a Mac with `brew` via `brew install git`. Other [install options](https://git-scm.com/install/mac) are available.
    1. **Terraform** (version 1.0.0 or later), on a Mac with `brew` via `brew tap hashicorp/tap` then `brew install terraform`. Other [install options](https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/install-cli) are available.
    1. A basic understanding of Terraform and Infrastructure as Code
    1. **helm CLI**, on a Mac with `brew` via `brew install helm`. Other [install options](https://helm.sh/docs/intro/install/) are available.
    1. **kubectl CLI**, on a Mac with `brew` via `brew install kubectl`. Other [install options](https://kubernetes.io/docs/tasks/tools/) are available.
1. **Anyscale Account Setup**
    1. **Anyscale CLI** installed on your local laptop via `pip install anyscale --upgrade`
    1. An **Anyscale organization** (account)
    1. Authenticate your local environment with Anyscale: run `anyscale login`, open the link it prints in your browser, and click Approve. A quick sanity check for both CLIs is sketched after this list.

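Before provisioning anything, you may want to confirm that both CLIs can authenticate. This is an optional sketch; the profile name and region below are placeholders, not values defined by this repository.

```shell
# Optional sanity check: confirm the AWS CLI can reach your account
# (AWS_PROFILE / AWS_DEFAULT_REGION are placeholders -- use your own values)
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=us-west-2
aws sts get-caller-identity

# Optional sanity check: confirm the Anyscale CLI is authenticated
anyscale login
```
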
## Set up SageMaker HyperPod

### Customize HyperPod Deployment Configuration

Review the default configuration in the existing `terraform.tfvars` file and modify it to customize your deployment as needed.

* Variables you will likely want to update:

```tf
anyscale_new_cloud_name = "my-new-cloud-name"
kubernetes_version      = "1.31"
eks_cluster_name        = "my-eks-cluster"
hyperpod_cluster_name   = "my-hyperpod-cluster"
resource_name_prefix    = "hyperpod-prefix-name"
aws_region              = "us-west-2"
availability_zone_id    = "usw2-az2"
```

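Note that `availability_zone_id` expects a zone ID (such as `usw2-az2`) rather than a zone name (such as `us-west-2b`), and the name-to-ID mapping differs per account. If you are unsure which zone ID to use, the following sketch lists them (assuming the AWS CLI is configured for your account):

```shell
# List availability zone names and their account-specific zone IDs
aws ec2 describe-availability-zones --region us-west-2 \
  --query 'AvailabilityZones[].{Name:ZoneName,Id:ZoneId}' --output table
```
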
### Deployment

> Note: You may need to increase some service quotas; for example, the defaults create two NAT Gateways, which require Elastic IP addresses.

First, clone the HyperPod Helm charts GitHub repository to locally stage the dependencies Helm chart:

```shell
git clone https://github.com/aws/sagemaker-hyperpod-cli.git /tmp/helm-repo
```

Apply the Terraform configuration:

```shell
terraform init
terraform plan
terraform apply
```

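If you want the apply step to run exactly what you reviewed, you can optionally save the plan to a file first; this is a standard Terraform workflow rather than anything specific to this repo:

```shell
# Save the reviewed plan, then apply that exact plan
terraform plan -out=tfplan
terraform apply tfplan
```
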
### Verify your connection to the HyperPod cluster

Using the output from the Terraform modules, verify a connection to the HyperPod cluster. It should look something like:

```shell
aws eks update-kubeconfig --region <region> --name <my-eks-cluster>
kubectl get nodes -L node.kubernetes.io/instance-type -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status
```

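If you prefer to pull those values directly from Terraform rather than copying them from the console output, `terraform output` lists everything the modules export. The output name below is a placeholder; use whatever names your modules actually expose:

```shell
# List all outputs from the applied configuration
terraform output

# Read a single output value (name is illustrative only)
terraform output -raw eks_cluster_name
```
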
### Installing K8s Components

#### Install the Nginx ingress controller

A sample values file, `sample-values_nginx.yaml`, is provided in this repo. Review it against your requirements before using it.

Run:

```shell
helm repo add nginx https://kubernetes.github.io/ingress-nginx
helm upgrade ingress-nginx nginx/ingress-nginx \
  --version 4.12.1 \
  --namespace ingress-nginx \
  --values sample-values_nginx.yaml \
  --create-namespace \
  --install
```

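Before moving on, you may want to confirm the controller pods are running and that the controller service has been assigned an address; the service name below is the chart default for this release name and may differ if you customized the values file:

```shell
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx ingress-nginx-controller
```
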
### Register the Anyscale Cloud

Ensure that you are logged in to Anyscale with valid CLI credentials (`anyscale login`).

1. Using the output from the Terraform modules, register the Anyscale Cloud. It should look something like:

```shell
anyscale cloud register --provider aws \
  --name <anyscale-cloud-name> \
  --compute-stack k8s \
  --region us-west-2 \
  --s3-bucket-id <anyscale_example_bucket> \
  --kubernetes-zones us-west-2a,us-west-2b,us-west-2c \
  --anyscale-operator-iam-identity arn:aws:iam::123456789012:role/my-kubernetes-cloud-node-group-role
```

2. Note the cloud deployment ID, which will be used in the next step. The Anyscale CLI returns it as one of its outputs. Example:

```shell
Output
(anyscale +22.5s) For registering this cloud's Kubernetes Manager, use cloud deployment ID 'cldrsrc_12345abcdefgh67890ijklmnop'.
```

### Install the Anyscale Operator

1. Using the example below, replace `<aws_region>` with the AWS region where EKS is running, and replace `<cloud-deployment-id>` with the appropriate value from the `anyscale cloud register` output. You can also change the namespace to the one you want to associate with Anyscale pods.
2. Using your updated `helm upgrade` command, install the Anyscale Operator.

```shell
helm repo add anyscale https://anyscale.github.io/helm-charts
helm upgrade anyscale-operator anyscale/anyscale-operator \
  --set-string global.cloudDeploymentId=<cloud-deployment-id> \
  --set-string global.cloudProvider=aws \
  --set-string global.aws.region=<aws_region> \
  --set-string workloads.serviceAccount.name=anyscale-operator \
  --namespace anyscale-operator \
  --create-namespace \
  --install
```

3. Verify that the operator is installed:

```shell
helm list -n anyscale-operator
```

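You can also check that the operator workload itself is running in the namespace used above (a quick sketch; pod names are generated from the release name):

```shell
kubectl get pods -n anyscale-operator
```
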
### Add label to HyperPod node group(s)

```shell
kubectl label nodes --all eks.amazonaws.com/capacityType=ON_DEMAND
```

You need to wait until the HyperPod node group is available in your EKS cluster, and you should re-run the label command if you add new instance groups to the HyperPod cluster. You can check whether the HyperPod node group is available by re-running this command:

```shell
kubectl get nodes -L node.kubernetes.io/instance-type -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status
```

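To confirm the label was applied, you can display it as an extra column on the node listing:

```shell
kubectl get nodes -L eks.amazonaws.com/capacityType
```
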
### Verify your Anyscale Cloud

```shell
anyscale job submit --cloud <anyscale-cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py
```

### Clean up

```shell
kubectl delete deployment anyscale-operator -n anyscale-operator
kubectl delete deployment ingress-nginx-controller -n ingress-nginx
terraform destroy
```
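
Since both components were installed with Helm, an equivalent teardown is to uninstall the Helm releases (using the release names and namespaces from the steps above) before destroying the Terraform-managed infrastructure:

```shell
helm uninstall anyscale-operator -n anyscale-operator
helm uninstall ingress-nginx -n ingress-nginx
terraform destroy
```
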
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
[![Build Status][badge-build]][build-status]
[![Terraform Version][badge-terraform]](https://github.com/hashicorp/terraform/releases)
[![AWS Provider Version][badge-tf-aws]](https://github.com/terraform-providers/terraform-provider-aws/releases)
[![Anyscale CLI][badge-anyscale-cli]](https://docs.anyscale.com/reference/quickstart-cli)

# Anyscale AWS SageMaker HyperPod EKS Example - New Cluster

This example creates the resources to set up a new SageMaker HyperPod cluster orchestrated with Amazon EKS and to run Anyscale on that cluster.

The content of this module should be used as a starting point and modified to meet your own security and infrastructure requirements.

## Use Amazon SageMaker HyperPod and Anyscale for next-generation distributed computing

_by Sindhura Palakodety, Anoop Saha, Dominic Catalano, Florian Gauter, Alex Iankoulski, and Mark Vinciguerra on 09 OCT 2025 in [Advanced (300)](https://aws.amazon.com/blogs/machine-learning/category/learning-levels/advanced-300/), [Amazon Machine Learning](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/amazon-machine-learning/), [Amazon SageMaker Autopilot](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/sagemaker/amazon-sagemaker-autopilot/), [Amazon SageMaker HyperPod](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/sagemaker/amazon-sagemaker-hyperpod/), [Artificial Intelligence](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/), Expert (400), [Generative AI](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/generative-ai/), [PyTorch on AWS](https://aws.amazon.com/blogs/machine-learning/category/artificial-intelligence/pytorch-on-aws/), [Technical How-to](https://aws.amazon.com/blogs/machine-learning/category/post-types/technical-how-to/)_

_This post was written with Dominic Catalano from Anyscale._

Organizations building and deploying large-scale AI models often face critical infrastructure challenges that can directly impact their bottom line: unstable training clusters that fail mid-job, inefficient resource utilization driving up costs, and complex distributed computing frameworks requiring specialized expertise. These factors can lead to unused GPU hours, delayed projects, and frustrated data science teams. This post demonstrates how to address these challenges with a resilient, efficient infrastructure for distributed AI workloads.

[Amazon SageMaker HyperPod](https://aws.amazon.com/sagemaker/ai/hyperpod/) is a purpose-built, persistent generative AI infrastructure optimized for machine learning (ML) workloads. It provides robust infrastructure for large-scale ML workloads with high-performance hardware, so organizations can build heterogeneous clusters using tens to thousands of GPU accelerators. With nodes optimally co-located on a single spine, SageMaker HyperPod reduces networking overhead for distributed training. It maintains operational stability through continuous monitoring of node health, automatically swapping faulty nodes with healthy ones and resuming training from the most recently saved checkpoint, all of which can help save up to 40% of training time. For advanced ML users, SageMaker HyperPod allows SSH access to the nodes in the cluster, enabling deep infrastructure control, and allows access to SageMaker tooling, including Amazon SageMaker Studio, MLflow, and SageMaker distributed training libraries, along with support for various open-source training libraries and frameworks. SageMaker Flexible Training Plans complement this by enabling GPU capacity reservation up to 8 weeks in advance for durations up to 6 months.

The [Anyscale platform](https://www.anyscale.com/product/platform) integrates seamlessly with SageMaker HyperPod when using [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks/) (Amazon EKS) as the cluster orchestrator. [Ray](https://www.ray.io/) is the leading AI compute engine, offering Python-based distributed computing capabilities for AI workloads ranging from multimodal AI and data processing to model training and model serving. Anyscale unlocks the power of Ray with comprehensive tooling for developer agility, critical fault tolerance, and an optimized version called [Anyscale Runtime](https://www.anyscale.com/blog/announcing-anyscale-runtime-powered-by-ray), designed to deliver leading cost-efficiency. Through a unified control plane, organizations benefit from simplified management of complex distributed AI use cases with fine-grained control across hardware.

The combined solution provides extensive monitoring through [SageMaker HyperPod real-time dashboards](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability.html) tracking node health, GPU utilization, and network traffic. Integration with Amazon CloudWatch Container Insights, [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/), and [Amazon Managed Grafana](https://aws.amazon.com/grafana/) delivers deep visibility into cluster performance, complemented by [Anyscale’s monitoring framework](https://docs.anyscale.com/monitoring/metrics), which provides built-in metrics for monitoring Ray clusters and the workloads that run on them.

This post demonstrates how to integrate the Anyscale platform with SageMaker HyperPod. This combination can deliver tangible business outcomes: reduced time-to-market for AI initiatives, lower total cost of ownership through optimized resource utilization, and increased data science productivity by minimizing infrastructure management overhead. It is ideal for Amazon EKS and Kubernetes-focused organizations, teams with large-scale distributed training needs, and those invested in the [Ray ecosystem](https://www.anyscale.com/blog/understanding-the-ray-ecosystem-and-community) or SageMaker.

## Solution overview

The following architecture diagram illustrates SageMaker HyperPod with Amazon EKS orchestration and Anyscale.

<img src="assets/anyscale-aws-hyperpod-arch-diagram.png" width="1024" height="570">

The sequence of events in this architecture is as follows:

1. A user submits a job to the Anyscale Control Plane, which is the main user-facing endpoint.
2. The Anyscale Control Plane communicates this job to the Anyscale Operator within the SageMaker HyperPod cluster in the SageMaker HyperPod virtual private cloud (VPC).
3. The Anyscale Operator, upon receiving the job, initiates the process of creating the necessary pods by reaching out to the EKS control plane.
4. The EKS control plane orchestrates creation of a Ray head pod and worker pods. These pods represent a Ray cluster, running on SageMaker HyperPod with Amazon EKS.
5. The Anyscale Operator submits the job through the head pod, which serves as the primary coordinator for the distributed workload.
6. The head pod distributes the workload across multiple worker pods, as shown in the hierarchical structure in the SageMaker HyperPod EKS cluster.
7. Worker pods execute their assigned tasks, potentially accessing required data from the storage services – such as [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3), [Amazon Elastic File System](https://aws.amazon.com/efs/) (Amazon EFS), or [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) – in the user VPC.
8. Throughout the job execution, metrics and logs are published to [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) and Amazon Managed Service for Prometheus or Amazon Managed Grafana for observability.
9. When the Ray job is complete, the job artifacts (final model weights, inference results, and so on) are saved to the designated storage service.
10. Job results (status, metrics, logs) are sent through the Anyscale Operator back to the Anyscale Control Plane.

This flow shows the distribution and execution of user-submitted jobs across the available computing resources, while maintaining monitoring and data accessibility throughout the process.

See the [README](README.md) for instructions on setting this up.
