A guide to scheduling SageMaker Notebooks

Dileep Patchigolla · Published in Analytics Vidhya · Feb 4, 2022

Amazon SageMaker has become one of the most widely used platforms for Machine Learning practitioners. Data Scientists often launch a notebook, either from a SageMaker notebook instance or from SageMaker Studio (a low-code/no-code way to build ML applications), and build their analyses there. Sometimes, we would like to schedule these notebooks to run periodically.

In this article, I will show how you can schedule a SageMaker notebook to run whenever a file gets created in a specific S3 location. This can easily be extended to schedule the notebook to run at a given time daily.

Before we begin, let's clarify a few concepts:

  1. SageMaker notebook instance: A notebook instance is a compute instance within SageMaker on which you can run notebooks. Basically, this is the remote machine on which the actual notebooks run. Within SageMaker, we can create multiple notebook instances; each instance is a standalone remote machine, and within each machine we can create multiple notebooks. More details about notebook instances are available in the AWS documentation.
  2. Lifecycle configurations: These are shell scripts that run whenever a notebook instance is started (or created, but we won't be using that here). Both options can be found on the SageMaker console home page.

  3. AWS Lambda: Imagine you just have some code you want to run but don't want the headache of handling all the underlying infrastructure. Lambda lets you run that code without provisioning or managing any servers.

  4. IAM roles: IAM roles are everywhere once you are in AWS; they are how permissions are managed. A role is a collection of policies, and a policy defines what you can (and can't) do within AWS. For example, reading data from S3 requires a policy; creating an S3 bucket requires another; using SageMaker requires yet another. You get the gist. When we create a notebook instance or a Lambda function, both require IAM roles.
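To make the role-and-policy idea concrete, here is a minimal boto3 sketch (not from the original article; the role name is hypothetical) that attaches an AWS-managed policy to an existing role. This is essentially what happens when you tick a managed policy in the console.

import boto3

iam = boto3.client('iam')

# Hypothetical role name; in practice this is the execution role of your
# notebook instance or Lambda function.
iam.attach_role_policy(
    RoleName='my-sagemaker-execution-role',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)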

So, here’s how we schedule the notebook jobs:

  1. Create a notebook instance; let's call it schedule-notebook. Make sure its IAM role has a SageMaker execution policy attached. Launch JupyterLab and upload your Jupyter notebook to this instance; let's call it schedule_script.ipynb. Assuming the notebook is created in the home directory, its complete path will be /home/ec2-user/SageMaker/schedule_script.ipynb; this becomes important shortly. Once the notebook is uploaded and you have verified that it runs fine, go back and stop the notebook instance.
  2. Create a Lifecycle Configuration from the SageMaker console home page (the second option mentioned above). Give the configuration a name (e.g., sample-lifecycle).

The actual script is given below. It runs whenever the notebook instance starts (we haven't created this hook yet). It activates the conda environment, runs the Jupyter notebook, and then stops the notebook instance once it finds that the instance has been idle for longer than the configured idle time (10 minutes in the script below). It assumes that the notebook named schedule_script.ipynb lives in the home directory and is the one that gets executed. Change the variable NOTEBOOK_FILE to point to your notebook; the /home/ec2-user/SageMaker prefix should stay the same.

set -e
ENVIRONMENT=python3
NOTEBOOK_FILE="/home/ec2-user/SageMaker/schedule_script.ipynb"
AUTO_STOP_FILE="/home/ec2-user/SageMaker/auto-stop.py"

echo "Activating conda env"
source /home/ec2-user/anaconda3/bin/activate "$ENVIRONMENT"
echo "Starting notebook"
nohup jupyter nbconvert --to notebook --inplace --ExecutePreprocessor.timeout=600 --ExecutePreprocessor.kernel_name=python3 --execute "$NOTEBOOK_FILE" &
echo "Decativating conda env"
source /home/ec2-user/anaconda3/bin/deactivate
# PARAMETERS
IDLE_TIME=600 # 10 minutes
echo "Fetching the autostop script"
wget -O "$AUTO_STOP_FILE" https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/auto-stop-idle/autostop.py
echo "Starting the SageMaker autostop script in cron"
(crontab -l 2>/dev/null; echo "*/1 * * * * /usr/bin/python $AUTO_STOP_FILE --time $IDLE_TIME --ignore-connections") | crontab -
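As a side note, the same Lifecycle Configuration can also be created programmatically instead of through the console. Here is a minimal boto3 sketch, assuming the shell script above has been saved locally as on_start.sh (a hypothetical filename) and reusing the sample-lifecycle name from step 2; note that the OnStart content must be base64-encoded.

import base64
import boto3

sm = boto3.client('sagemaker')

# Read the start-up shell script shown above (hypothetical local filename).
with open('on_start.sh', 'rb') as f:
    on_start_content = base64.b64encode(f.read()).decode('utf-8')

sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='sample-lifecycle',
    OnStart=[{'Content': on_start_content}],
)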

3. Now we go back to the notebook instance (which is in the stopped state, to allow editing), edit its settings, and attach the Lifecycle Configuration created above. With this, whenever the notebook instance is started, the lifecycle configuration runs the notebook in the background and sets up a cron job that keeps checking whether the notebook instance has been idle for more than IDLE_TIME (10 minutes here). If it finds that the instance is idle, the instance is stopped. One point to note: a lifecycle configuration script cannot run for more than 5 minutes (a limitation imposed by AWS). This is why we run the notebook in the background using nohup; otherwise, a notebook that runs longer than 5 minutes would cause the lifecycle script to time out, which in turn causes the notebook instance to fail to start.
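For completeness, this attach step can also be done with boto3 instead of the console. A minimal sketch, assuming the instance and configuration names used above (the instance must be stopped for the update to succeed):

import boto3

sm = boto3.client('sagemaker')

# Attach the lifecycle configuration to the stopped notebook instance.
sm.update_notebook_instance(
    NotebookInstanceName='schedule-notebook',
    LifecycleConfigName='sample-lifecycle',
)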

4. Finally, we need to schedule the starting of the notebook instance itself. We use AWS Lambda for this. The script below does exactly that; it takes just a few lines of code.

import boto3

def lambda_handler(event, context):
    # Starting the instance kicks off its lifecycle configuration,
    # which runs the notebook and stops the instance once it is idle.
    client = boto3.client('sagemaker')
    client.start_notebook_instance(NotebookInstanceName='schedule-notebook')
    return 0

The Lambda uses boto3, the AWS SDK for Python, to communicate with various AWS services. We just need to change NotebookInstanceName to the name of the instance we created in the first step. Once the script is in place, we also need to add a trigger that invokes the function. We can create one by clicking the Add trigger button, which supports various options such as reacting to events or running on a schedule, using AWS services such as EventBridge, Kinesis, etc. The one I used is S3, where we can define events such as creating or deleting an object in a given bucket. We can also specify prefixes and suffixes so we don't need a dedicated S3 bucket for the triggers.
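If the notebook run needs to know which file triggered it, the S3 event passed to the Lambda handler contains the bucket and key. Here is a hedged sketch of how the handler above could be extended to log them (this extension is not part of the original article):

import boto3

def lambda_handler(event, context):
    # S3 notifications arrive as a list of records, each describing one object event.
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        print(f'Triggered by s3://{bucket}/{key}')
    client = boto3.client('sagemaker')
    client.start_notebook_instance(NotebookInstanceName='schedule-notebook')
    return 0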

One important step is to ensure the Lambda has the right IAM role configured. Since we want to start a SageMaker notebook instance and also access the S3 bucket used for the trigger, the role needs policies for both. The AmazonS3FullAccess and AmazonSageMakerFullAccess policies, managed by Amazon by default, should satisfy all these needs.

A few points to note:

  1. The notebook instance can no longer be used normally, because the lifecycle configuration kicks in on startup and stops the instance once the job is complete. So we need to detach the configuration if we want to make any changes to the notebook, and attach it again once the changes are done.
  2. This flow works only if the notebook instance is in the stopped state. If it is already running when the Lambda gets triggered, nothing happens; a way to check the state first is sketched below.
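If you want the Lambda to handle the already-running case explicitly, one option (again, an assumption rather than part of the original setup) is to check the instance status before calling start:

import boto3

def lambda_handler(event, context):
    client = boto3.client('sagemaker')
    # Look up the current status; possible values include 'InService' and 'Stopped'.
    status = client.describe_notebook_instance(
        NotebookInstanceName='schedule-notebook'
    )['NotebookInstanceStatus']
    if status == 'Stopped':
        client.start_notebook_instance(NotebookInstanceName='schedule-notebook')
    else:
        print(f'Instance not started; current status: {status}')
    return 0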

And that's it. We just scheduled a SageMaker notebook to run whenever a file gets created in an S3 location. Here's a summary of how the event flow happens:

  1. A file gets written to the given S3 location by some service or job
  2. This triggers the Lambda function, which starts the SageMaker notebook instance
  3. As the notebook instance starts, its lifecycle configuration kicks in and runs the given Jupyter notebook
  4. Once the notebook run is complete, the lifecycle configuration stops the instance, preventing unnecessary compute costs
