Using Spot Instances Part 2 - Run Sample Workload

This workshop has been deprecated and archived. The new Amazon EKS Workshop is now available at www.eksworkshop.com.

Spark Pod Template

With Amazon EMR versions 5.33.0 and later, Amazon EMR on EKS supports the pod template feature in Spark. Pod templates are specifications that determine how to run each pod. You can use pod template files to define configurations for the driver or executor pods that Spark configurations do not support.

For more information about the pod templates support in EMR on EKS, see Pod Templates.

To reduce costs, you can schedule Spark driver tasks to run on On-Demand instances while scheduling Spark executor tasks to run on Spot instances.

With pod templates, you can use the label eks.amazonaws.com/capacityType as a node selector, so that Spark driver pods are scheduled on On-Demand Instances and Spark executor pods on Spot Instances.
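To see which capacity type each node in your cluster carries, you can list the label as a column (a quick check, assuming kubectl is already configured against the workshop's EKS cluster):

```shell
# Show every node with its capacityType label (ON_DEMAND or SPOT).
# -L adds the label value as an extra column in the output.
kubectl get nodes -L eks.amazonaws.com/capacityType
```

Nodes without the label (for example, self-managed node groups) will show an empty value in that column.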

Now, you will create a sample pod template for the Spark driver. The nodeSelector eks.amazonaws.com/capacityType: ON_DEMAND ensures the driver pod runs on On-Demand Instances.

cat > spark_driver_pod_template.yml <<EOF 
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: source-data-volume
      emptyDir: {}
    - name: metrics-files-volume
      emptyDir: {}
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container
EOF

Next, you will create a sample pod template for the Spark executors. The nodeSelector eks.amazonaws.com/capacityType: SPOT ensures the executor pods run on Spot Instances.

cat > spark_executor_pod_template.yml <<EOF 
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: source-data-volume
      emptyDir: {}
    - name: metrics-files-volume
      emptyDir: {}
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executor # This will be interpreted as Spark executor container
EOF

Let’s upload the sample pod templates and the Python script to the S3 bucket.

aws s3 cp threadsleep.py ${s3DemoBucket}
aws s3 cp spark_driver_pod_template.yml ${s3DemoBucket}/pod_templates/
aws s3 cp spark_executor_pod_template.yml ${s3DemoBucket}/pod_templates/
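Before submitting the job, you can optionally confirm that all three objects landed in the bucket (a sanity check, assuming s3DemoBucket is set to an s3:// URI as in the copy commands above):

```shell
# Optional: verify the uploads before submitting the job.
aws s3 ls ${s3DemoBucket}/pod_templates/
aws s3 ls ${s3DemoBucket}/ | grep threadsleep.py
```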

Next, submit the job.

#Get required virtual cluster-id and role arn
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --output text)

export EMR_ROLE_ARN=$(aws iam get-role --role-name EMRContainers-JobExecutionRole --query Role.Arn --output text)

#start spark job with start-job-run
aws emr-containers start-job-run \
  --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --name pi-spot \
  --execution-role-arn $EMR_ROLE_ARN \
  --release-label emr-5.33.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "'${s3DemoBucket}'/threadsleep.py",
      "sparkSubmitParameters": "--conf spark.kubernetes.driver.podTemplateFile=\"'${s3DemoBucket}'/pod_templates/spark_driver_pod_template.yml\" --conf spark.kubernetes.executor.podTemplateFile=\"'${s3DemoBucket}'/pod_templates/spark_executor_pod_template.yml\" --conf spark.executor.instances=15 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"}}' \
  --configuration-overrides '{
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                  "spark.dynamicAllocation.enabled": "false",
                  "spark.kubernetes.executor.deleteOnTermination": "true"
                }
            }
        ],
        "monitoringConfiguration": {
            "cloudWatchMonitoringConfiguration": {
                "logGroupName": "/emr-on-eks/eksworkshop-eksctl",
                "logStreamNamePrefix": "pi"
            },
            "s3MonitoringConfiguration": {
                "logUri": "'${s3DemoBucket}'/"
            }
        }
    }'

You will be able to see the completed job in the EMR console.
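If you prefer the CLI, you can also check the job state with list-job-runs (a sketch; it reuses the VIRTUAL_CLUSTER_ID variable exported earlier):

```shell
# List job runs named pi-spot in the virtual cluster, with their states.
aws emr-containers list-job-runs \
  --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --query "jobRuns[?name=='pi-spot'].{id:id,state:state}" \
  --output table
```

The job passes through states such as SUBMITTED and RUNNING before reaching COMPLETED.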

Let’s check the pods deployed on On-Demand Instances. You should see the Spark driver pod running on an On-Demand Instance.

for n in $(kubectl get nodes -l eks.amazonaws.com/capacityType=ON_DEMAND --no-headers | cut -d " " -f1); do
  echo "Pods on instance ${n}:"
  kubectl get pods -n spark --no-headers --field-selector spec.nodeName=${n}
  echo
done

Let’s check the pods deployed on Spot Instances. You should see the Spark executor pods running on Spot Instances.

for n in $(kubectl get nodes -l eks.amazonaws.com/capacityType=SPOT --no-headers | cut -d " " -f1); do
  echo "Pods on instance ${n}:"
  kubectl get pods -n spark --no-headers --field-selector spec.nodeName=${n}
  echo
done