|
| 1 | +## Solution overview |
| 2 | +The following architecture diagram illustrates SageMaker HyperPod with Amazon EKS orchestration and Anyscale. |
| 3 | + |
| 4 | +<img src="assets/anyscale-aws-hyperpod-arch-diagram.png" width="1024" height="570"> |
| 5 | + |
| 6 | + |
| 7 | +See [details.md](details.md) for more details on this architecture. |
| 8 | + |
| 9 | +## Getting Started |
| 10 | +### Prerequisites |
| 11 | +1. **AWS Account Setup** |
| 12 | + 1. An **AWS account** with billing enabled |
| 13 | + 1. [**AWS Identity and Access Management**](https://aws.amazon.com/iam/) (IAM) role permissions for SageMaker HyperPod |
| 14 | + 1. [**AWS Credentials**](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) set up in your local environment, either as environment variables or through credentials and profile files. |
| 15 | + 1. [**AWS CLI**](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) installed on your local laptop. |
| 16 | +1. **Tools** installed on your local laptop: |
| 17 | + 1. **Git CLI** on a mac with `brew` via `brew install git`. Other [install options](https://git-scm.com/install/mac) are available. |
| 18 | + 1. **Terraform** (version 1.0.0 or later) on a mac with `brew` via `brew tap hashicorp/tap` then `brew install terraform`. Other [install options](https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/install-cli) are available. |
| 19 | + 1. Basic understanding of Terraform and Infrastructure as Code |
| 20 | + 1. **helm CLI** on a mac with `brew` via `brew install helm`. Other [install options](https://helm.sh/docs/intro/install/) are available. |
| 21 | + 1. **kubectl CLI** on a mac with `brew` via `brew install kubectl`. Other [install options](https://kubernetes.io/docs/tasks/tools/) are available. |
| 22 | +1. **Anyscale Account Setup** |
| 23 | + 1. **Anyscale CLI** installed on your local laptop via `pip install anyscale --upgrade`. |
| 24 | + 1. An **Anyscale organization** (account). |
| 25 | + 1. Authenticate your local environment with Anyscale: run `anyscale login`, open the printed link in your browser, and click approve. |
| 26 | + |
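Before continuing, it can help to confirm that the CLIs listed above are on your `PATH`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of this repo.

```shell
# Print the name of each command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools named in the prerequisites above.
missing=$(missing_tools aws git terraform helm kubectl anyscale)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All prerequisite tools found."
fi
```

The check only confirms that each command exists; it does not verify versions (for example, Terraform 1.0.0 or later).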
| 27 | +## Set up SageMaker HyperPod |
| 28 | +### Customize HyperPod Deployment Configuration |
| 29 | + |
| 30 | +Review the default configurations in the existing `terraform.tfvars` file and make modifications to customize your deployment as needed. |
| 31 | + |
| 32 | +* Variables you will likely want to update |
| 33 | + |
| 34 | + ```tf |
| 35 | + anyscale_new_cloud_name = "my-new-cloud-name" |
| 36 | + kubernetes_version = "1.31" |
| 37 | + eks_cluster_name = "my-eks-cluster" |
| 38 | + hyperpod_cluster_name = "my-hyperpod-cluster" |
| 39 | + resource_name_prefix = "hyperpod-prefix-name" |
| 40 | + aws_region = "us-west-2" |
| 41 | + availability_zone_id = "usw2-az2" |
| 42 | + ``` |
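Note that `availability_zone_id` takes a zone ID such as `usw2-az2` rather than a zone name such as `us-west-2b`, and the mapping between the two varies by AWS account. One way to look up the pairs for your account is sketched below; the function name `list_zone_ids` is hypothetical, and the call needs working AWS credentials.

```shell
# List zone name / zone ID pairs for a region (requires AWS credentials).
list_zone_ids() {
  aws ec2 describe-availability-zones \
    --region "$1" \
    --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
    --output text
}

# Usage: list_zone_ids us-west-2
```

Each output line holds a zone name followed by its zone ID, so you can pick the ID that corresponds to the zone you intend to use.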
| 43 | +
|
| 44 | +### Deployment |
| 45 | +
|
| 46 | +> Note: You may need to increase some service quotas before deploying. For example, the default configuration creates 2 NAT Gateways, which require Elastic IP addresses. |
| 47 | +
|
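As a rough pre-flight check for the Elastic IP quota, you could compare the quota value with the number of addresses already allocated in the region. This is a sketch under assumptions: the function name `eip_headroom` is hypothetical, and the quota code `L-0263D0A3` is assumed to be the "EC2-VPC Elastic IPs" quota, which you should confirm in the Service Quotas console.

```shell
# Remaining Elastic IP headroom = quota value minus addresses already allocated.
# L-0263D0A3 is assumed to be the "EC2-VPC Elastic IPs" quota code; confirm it
# in the Service Quotas console before relying on it.
eip_headroom() {
  region=$1
  quota=$(aws service-quotas get-service-quota \
    --region "$region" \
    --service-code ec2 \
    --quota-code L-0263D0A3 \
    --query 'Quota.Value' --output text) || return 1
  in_use=$(aws ec2 describe-addresses \
    --region "$region" \
    --query 'length(Addresses)' --output text) || return 1
  # The quota value is returned as a float such as 5.0; drop the fraction.
  echo $(( ${quota%.*} - in_use ))
}

# Usage: eip_headroom us-west-2
```

If the printed number is smaller than the number of NAT Gateways the deployment creates, request a quota increase first.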
| 48 | +First, clone the HyperPod Helm charts GitHub repository to locally stage the dependencies Helm chart: |
| 49 | +
|
| 50 | +```shell |
| 51 | +git clone https://github.com/aws/sagemaker-hyperpod-cli.git /tmp/helm-repo |
| 52 | +``` |
| 53 | + |
| 54 | +Apply the Terraform configuration: |
| 55 | + |
| 56 | +```shell |
| 57 | +terraform init |
| 58 | +terraform plan |
| 59 | +terraform apply |
| 60 | +``` |
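If you want the apply step to use exactly the plan you reviewed, one option is to save the plan to a file and apply that file. The sketch below assumes `terraform init` has already run; the function name `review_and_apply` and the file name `tfplan` are arbitrary choices for illustration.

```shell
# Save the plan to a file, ask for confirmation, then apply exactly that plan.
# "tfplan" is an arbitrary file name chosen for this sketch.
review_and_apply() {
  terraform plan -out=tfplan || return 1
  printf 'Apply the saved plan? [y/N] '
  read -r answer
  [ "$answer" = "y" ] || { echo "Aborted."; return 1; }
  # Applying a saved plan file does not prompt again.
  terraform apply tfplan
}
```

Run `review_and_apply` from the directory that contains `terraform.tfvars`, and read the plan output before answering `y`.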
| 61 | +### Verify your connection to the HyperPod cluster |
| 62 | + |
| 63 | +Using the output from the Terraform modules, verify your connection to the HyperPod cluster. The commands should look something like: |
| 64 | + |
| 65 | +```shell |
| 66 | +aws eks update-kubeconfig --region <region> --name <my-eks-cluster> |
| 67 | +kubectl get nodes -L node.kubernetes.io/instance-type -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status |
| 68 | +``` |
| 69 | + |
| 70 | +### Installing K8s Components |
| 71 | + |
| 72 | +#### Install the Nginx ingress controller |
| 73 | + |
| 74 | +A sample file, `sample-values_nginx.yaml`, is provided in this repo. Review it against your requirements before using it. |
| 75 | + |
| 76 | +Run: |
| 77 | + |
| 78 | +```shell |
| 79 | +helm repo add nginx https://kubernetes.github.io/ingress-nginx |
| 80 | +helm upgrade ingress-nginx nginx/ingress-nginx \ |
| 81 | + --version 4.12.1 \ |
| 82 | + --namespace ingress-nginx \ |
| 83 | + --values sample-values_nginx.yaml \ |
| 84 | + --create-namespace \ |
| 85 | + --install |
| 86 | +``` |
| 87 | + |
| 88 | +### Register the Anyscale Cloud |
| 89 | + |
| 90 | +Ensure that you are logged in to Anyscale with valid CLI credentials (`anyscale login`). |
| 91 | + |
| 92 | +1. Using the output from the Terraform modules, register the Anyscale Cloud. The command should look something like: |
| 93 | + |
| 94 | +```shell |
| 95 | +anyscale cloud register --provider aws \ |
| 96 | + --name <anyscale-cloud-name> \ |
| 97 | + --compute-stack k8s \ |
| 98 | + --region us-west-2 \ |
| 99 | + --s3-bucket-id <anyscale_example_bucket> \ |
| 100 | + --kubernetes-zones us-west-2a,us-west-2b,us-west-2c \ |
| 101 | + --anyscale-operator-iam-identity arn:aws:iam::123456789012:role/my-kubernetes-cloud-node-group-role |
| 102 | +``` |
| 103 | + |
| 104 | +2. Note the Cloud Deployment ID which will be used in the next step. The Anyscale CLI will return it as one of the outputs. Example: |
| 105 | +```shell |
| 106 | +Output |
| 107 | +(anyscale +22.5s) For registering this cloud's Kubernetes Manager, use cloud deployment ID 'cldrsrc_12345abcdefgh67890ijklmnop'. |
| 108 | +``` |
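If you save the registration output, the ID can be pulled out with `grep` rather than copied by hand. This sketch assumes the ID is the prefix `cldrsrc_` followed by letters and digits, which is inferred only from the example above; the function name is hypothetical.

```shell
# Pull the first cldrsrc_... token out of text read from stdin.
# Assumes the ID is "cldrsrc_" followed by letters and digits only.
extract_cloud_deployment_id() {
  grep -o 'cldrsrc_[A-Za-z0-9]*' | head -n 1
}

# Example using the sample output line shown above.
echo "(anyscale +22.5s) For registering this cloud's Kubernetes Manager, use cloud deployment ID 'cldrsrc_12345abcdefgh67890ijklmnop'." \
  | extract_cloud_deployment_id
```

The extracted value is what you pass as `global.cloudDeploymentId` when installing the operator.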
| 109 | +
|
| 110 | +### Install the Anyscale Operator |
| 111 | +
|
| 112 | +1. In the example below, replace `<aws_region>` with the AWS region where EKS is running, and replace `<cloud-deployment-id>` with the value from the `anyscale cloud register` output. You can also change the namespace to the one that you wish to associate with Anyscale pods. |
| 113 | +2. Using your updated `helm upgrade` command, install the Anyscale Operator. |
| 114 | +
|
| 115 | +```shell |
| 116 | +helm repo add anyscale https://anyscale.github.io/helm-charts |
| 117 | +helm upgrade anyscale-operator anyscale/anyscale-operator \ |
| 118 | + --set-string global.cloudDeploymentId=<cloud-deployment-id> \ |
| 119 | + --set-string global.cloudProvider=aws \ |
| 120 | + --set-string global.aws.region=<aws_region> \ |
| 121 | + --set-string workloads.serviceAccount.name=anyscale-operator \ |
| 122 | + --namespace anyscale-operator \ |
| 123 | + --create-namespace \ |
| 124 | + --install |
| 125 | +``` |
| 126 | +3. Verify that the operator is installed: |
| 127 | +```shell |
| 128 | +helm list -n anyscale-operator |
| 129 | +``` |
| 130 | +### Add label to HyperPod node group(s) |
| 131 | +```shell |
| 132 | +kubectl label nodes --all eks.amazonaws.com/capacityType=ON_DEMAND |
| 133 | +``` |
| 134 | +Wait until the HyperPod node group is available in your EKS cluster before applying the label, and re-run the label command whenever you add new instance groups to the HyperPod cluster. To check whether the HyperPod node group is available, re-run this command: |
| 135 | +```shell |
| 136 | +kubectl get nodes -L node.kubernetes.io/instance-type -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status |
| 137 | +``` |
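Instead of re-running the check by hand, you could poll until enough nodes report `Ready` and only then apply the label. The following is a sketch; both function names are hypothetical, and the defaults of 30 seconds and 40 attempts are arbitrary.

```shell
# Count nodes whose STATUS column is exactly "Ready".
ready_node_count() {
  kubectl get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l | tr -d ' '
}

# Poll until at least $1 nodes are Ready, checking every $2 seconds (default 30)
# and giving up after $3 attempts (default 40).
wait_for_ready_nodes() {
  want=$1; interval=${2:-30}; attempts=${3:-40}
  while [ "$attempts" -gt 0 ]; do
    [ "$(ready_node_count)" -ge "$want" ] && return 0
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}

# Usage:
#   wait_for_ready_nodes 2 && kubectl label nodes --all eks.amazonaws.com/capacityType=ON_DEMAND
```

Nodes with a combined status such as `Ready,SchedulingDisabled` are not counted, so pick the expected node count accordingly.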
| 138 | +
|
| 139 | +### Verify your Anyscale Cloud |
| 140 | +```shell |
| 141 | +anyscale job submit --cloud <anyscale-cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py |
| 142 | +``` |
| 143 | +
|
| 144 | +### Clean up |
| 145 | +
|
| 146 | +```shell |
| 147 | +kubectl delete deployment anyscale-operator -n anyscale-operator |
| 148 | +kubectl delete deployment ingress-nginx-controller -n ingress-nginx |
| 149 | +terraform destroy |
| 150 | +``` |
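Because both components were installed as Helm releases, an alternative is to remove the releases themselves before destroying the infrastructure, which also removes the other resources each chart created. This sketch assumes the release names and namespaces are unchanged from the install commands above; the function name is hypothetical.

```shell
# Remove the two Helm releases installed earlier in this guide.
# Release names and namespaces are taken from the install commands above;
# adjust them if you changed either during installation.
uninstall_releases() {
  helm uninstall anyscale-operator --namespace anyscale-operator
  helm uninstall ingress-nginx --namespace ingress-nginx
}

# Usage: uninstall_releases && terraform destroy
```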