As computer systems have evolved, so has the demand for more advanced solutions to everyday problems. We want not only to digitize more daily activities and tasks, but also to process more data in the same amount of time or less. Ever-growing volumes of content demand that computation times keep shrinking. At the same time, topics like machine learning and data mining are attracting more and more attention. This is the point where sequential processing is no longer sufficient, because in most cases it increases waiting times linearly. Parallel computing is the next step in data processing, and it is especially cost-effective in cloud environments. It lets us work on problems simultaneously and therefore finish them faster and, given enough resources, tackle more of them. The purpose of this article is to get you familiar with AWS Batch, a batch processing platform providing parallel computing capabilities for a variety of needs. Its wide range of configuration options will let you easily adapt it to your use case.
Running your first job
Let’s start with some simple tasks that will show you how to navigate AWS Batch and prove that the tool works before we attempt a more complex problem.
Setting up the AWS Batch environment
First, we need to prepare the configuration that determines how your compute resources will be arranged and how tasks will be distributed across them. The following section assumes some familiarity with Amazon Web Services.
Environment for running computations
We will start by setting up a “compute environment” – a collection of AWS Batch computational resources that can be associated with a job queue. It is good to know that under the hood, Batch uses AWS EC2 and automatically allocates the necessary resources. This eliminates the need to set up individual machines and also handles auto-scaling, which helps reduce overall costs – only as many EC2 instances run as are actually needed.
- Start by creating an environment: go to the Compute environments tab and click the Create environment button.
- Leave the environment type as Managed and type your environment name.
- Now move on to the compute resources configuration. For now, the only options we are interested in are the instance types and the vCPU limits. For allowed instance types choose c4.large – more on this later. Lower the Maximum vCPUs value to 4.
Before you finally press the Create button, let me explain what we are doing here. Setting the compute environment type to Managed means it will be fully operated by AWS. Following the chosen provisioning model, AWS will launch on-demand EC2 instances running Docker, along with any other services requested, and will shut them down when they are no longer needed (once all the tasks have completed or demand has simply decreased).
In Compute resources we allowed only one instance type, which provides 2 vCPUs. The vCPU limits relate to scaling: the maximum vCPUs limit determines how many instances may run concurrently. Since we set it to 4 vCPUs, a quick calculation shows that at most two instances will run at a time. The minimum is set to zero, meaning no instances will be running when there is nothing to do.
Now you can press the Create button and then you will see your newly created environment in the environments list.
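The console steps above can also be scripted with boto3, the AWS SDK for Python. Below is a rough sketch that mirrors the settings we just chose; the environment name, subnet and security group IDs, and role ARNs are placeholders you would replace with values from your own account:

```python
# Request payload mirroring the console configuration above:
# a managed, EC2-backed environment limited to c4.large and 4 vCPUs.
compute_env_params = {
    "computeEnvironmentName": "my-first-compute-env",  # hypothetical name
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,   # scale down to zero instances when idle
        "maxvCpus": 4,   # at most two c4.large instances (2 vCPUs each)
        "instanceTypes": ["c4.large"],
        "subnets": ["subnet-REPLACE_ME"],        # placeholder: your VPC subnets
        "securityGroupIds": ["sg-REPLACE_ME"],   # placeholder
        "instanceRole": "ecsInstanceRole",       # instance profile name or ARN
    },
    "serviceRole": "arn:aws:iam::123456789012:role/AWSBatchServiceRole",  # placeholder
}

def create_compute_environment(params):
    """Send the request; requires boto3 and valid AWS credentials."""
    import boto3
    return boto3.client("batch").create_compute_environment(**params)
```

Calling `create_compute_environment(compute_env_params)` is equivalent to pressing Create in the console.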
As I said previously, sequential task solving is no longer sufficient and parallel computing is the solution. Still, there will always come a moment when you hit a limit and can’t provision more instances, so tasks start queueing once again. AWS Batch handles this with a job queue, which buffers submitted jobs until compute resources become available – after all, the last thing we want is work piling up unmanaged.
- Navigate to the Job Queues tab and click on the Create Queue button.
- Assign a Queue Name, set the Priority to 1 and choose your previously created Compute Environment to service tasks for this queue.
- Press the Create job queue button and we can proceed to creating a job template.
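The same queue can be created through the API. A minimal boto3 sketch, assuming the hypothetical names from earlier:

```python
# Queue with priority 1, served by the compute environment created before.
# Lower "order" values are tried first when several environments are attached.
job_queue_params = {
    "jobQueueName": "my-first-queue",  # hypothetical name
    "state": "ENABLED",
    "priority": 1,
    "computeEnvironmentOrder": [
        {"order": 1, "computeEnvironment": "my-first-compute-env"},
    ],
}

def create_job_queue(params):
    """Send the request; requires boto3 and valid AWS credentials."""
    import boto3
    return boto3.client("batch").create_job_queue(**params)
```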
Creating the first job template
With your Compute Environment and Job Queue ready, we can prepare the first job template. The job definition is the last element of configuration before we can actually execute a task. It is where AWS Batch really pays off, because batch processing often relies on repetition: if we extract the configuration common to every task, we simplify each subsequent submission.
- Open the Job Definitions tab and click on the Create button.
- Set the Job Definition Name and move to the Environment section.
- Set the container image to hello-world, vCPUs to 1 and Memory to 1024. Leave the remaining fields at their default values.
- Click Create Job Definition and you will see it in the definitions list.
Moving on, we set the name of our template first and then comes the real thing – the Environment section. Under the hood, AWS Batch uses Amazon Elastic Container Service (ECS), which runs single Docker containers. The whole tool is built on containers, so each job is simply a running container executing your logic. All we have to do is prepare the Docker image that will later be instantiated as a container. For this example we have taken the hello-world image from the official Docker registry, but in a real-life situation you would use your own. The vCPUs setting tells the scaling algorithm that this particular job definition needs at least 1 virtual CPU; if that much is not yet available, the task waits in the queue. The Memory value is a hard limit for the Docker container – when it is exceeded, the job is terminated. This is mostly a Docker concern, but the job definition form requires it to be specified explicitly.
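The same definition can be registered programmatically. A boto3 sketch with the values used above (the definition name is a hypothetical choice):

```python
# Container-type job definition: the hello-world image,
# 1 vCPU reserved, and a 1024 MiB hard memory limit.
job_definition_params = {
    "jobDefinitionName": "hello-world-def",  # hypothetical name
    "type": "container",
    "containerProperties": {
        "image": "hello-world",
        "vcpus": 1,
        "memory": 1024,  # hard limit in MiB; exceeding it terminates the job
    },
}

def register_job_definition(params):
    """Send the request; requires boto3 and valid AWS credentials."""
    import boto3
    return boto3.client("batch").register_job_definition(**params)
```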
Finally, we have all three elements required to run our first job:
- Compute Environment, the place where the job will be executed. AWS takes care of provisioning the required instances, so you don’t have to worry about it.
- Job Queue which will buffer incoming tasks until the necessary compute resources are available.
- Job Definition which is simply the template of our job.
You are just a few steps from running your AWS Batch job:
- Navigate to the Jobs tab and press the Submit Job button.
- Type your Job Name and select the previously created Job Definition and Job Queue.
- Submit the job!
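Submission, too, maps directly onto a single API call. A sketch using the hypothetical names from the previous steps:

```python
# A submission only needs a job name plus references to the
# queue and the job definition created earlier.
submit_params = {
    "jobName": "my-first-job",          # hypothetical name
    "jobQueue": "my-first-queue",       # created earlier
    "jobDefinition": "hello-world-def", # created earlier
}

def submit_job(params):
    """Send the request; requires boto3 and valid AWS credentials.

    The response contains the jobId you can later pass to
    describe_jobs() to track the job's state.
    """
    import boto3
    return boto3.client("batch").submit_job(**params)
```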
The job is submitted and will go through consecutive stages until it succeeds or fails. The RUNNABLE state means the task is waiting for resources to become available. Allow a couple of minutes for AWS to instantiate the required EC2 instance.
After approximately 10 minutes the job will have finally started, and after a few seconds of execution it will succeed.
The output of the Batch job is automatically stored in AWS CloudWatch. To reach it, simply click the Job ID link and find the Attempts list:
Click View logs and you will be taken to the CloudWatch log:
Prepare for real-world usage
It is no secret that the example above is far from real production usage. We can’t assume the whole logic will be baked into the Docker image every time: the job needs to be parametrized – in other words, we must be able to pass data to it – otherwise it is useless. It is also obvious that you will want to use the AWS API to integrate Batch with your application, because submitting jobs manually makes no sense at scale. Fortunately, the AWS SDKs support Batch.
Passing data to and from the task
Getting data inside the container can be achieved in many ways. Apparent ones include using AWS S3, downloading data over HTTP, or even mounting an NFS share. Covering these topics in depth goes beyond the scope of this article. Communicating over HTTP will suit many needs, but the right choice really depends on your architecture.
When we submitted the job, we skipped two places where additional data can be passed to the task. The first is the Command passed to the container: additional arguments following the executable can be read inside your script, allowing you to customize the flow of each processed job.
Another option is to pass Environment Variables which can be read from inside of the container.
That way you can, for example, download input data from your API and upload the results back to it after a successful computation.
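Both mechanisms are exposed through `containerOverrides` on job submission. A sketch – the command, the S3 path, and the `API_URL` variable are purely illustrative:

```python
# Per-submission overrides: extra command arguments and environment
# variables reach the container without rebuilding the Docker image.
override_params = {
    "jobName": "parametrized-job",       # hypothetical name
    "jobQueue": "my-first-queue",        # from the earlier steps
    "jobDefinition": "hello-world-def",  # from the earlier steps
    "containerOverrides": {
        # Hypothetical entrypoint and input location:
        "command": ["python", "process.py", "--input", "s3://my-bucket/input.csv"],
        "environment": [
            # Hypothetical endpoint the job reports results to:
            {"name": "API_URL", "value": "https://api.example.com/results"},
        ],
    },
}

def submit_parametrized_job(params):
    """Send the request; requires boto3 and valid AWS credentials."""
    import boto3
    return boto3.client("batch").submit_job(**params)
```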
Handle failures and timeouts
Not everything goes as smoothly as it should. Sometimes the execution of a task takes unacceptably long and should be abandoned. For this case, the Timeout property can be set to a specific value.
It is also possible that a job won’t hit the timeout but will still fail with a non-zero exit code. In both cases you should notify your API about the failure, and you have two options:
- Use AWS to catch and further process such errors using, for example, AWS Lambda.
- Handle the error inside the container. However, there is always a risk that the job gets killed from the outside, in which case internal error handling simply won’t catch it.
It is also possible that a job failure is temporary, e.g. a short third-party API outage. In such a case you can benefit from the job retry feature: simply set the Job attempts parameter to tell AWS Batch how many times it should reattempt execution before finally marking the job as failed.
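Both the timeout and the retry strategy map onto submission parameters. A sketch with illustrative values (10-minute timeout, up to 3 attempts):

```python
# Timeout and retry settings attached at submission time.
resilient_params = {
    "jobName": "resilient-job",          # hypothetical name
    "jobQueue": "my-first-queue",        # from the earlier steps
    "jobDefinition": "hello-world-def",  # from the earlier steps
    # Attempts running longer than 600 s are terminated:
    "timeout": {"attemptDurationSeconds": 600},
    # Re-run a failed job up to 3 times before marking it FAILED:
    "retryStrategy": {"attempts": 3},
}

def submit_resilient_job(params):
    """Send the request; requires boto3 and valid AWS credentials."""
    import boto3
    return boto3.client("batch").submit_job(**params)
```

The same settings can alternatively be baked into the job definition, so every submission inherits them.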
Assign computational resources
When it comes to assigning resources, three factors matter: the instance type, the maximum number of instances running, and the computational demand of a single job.
Choosing the allowed instance types really depends on your needs and, of course, you can select multiple types if required. If your task is complex, go with a powerful instance; if not, something less beasty may be the more prudent choice. If you really don’t know how much you need, take an empirical approach and simply test it. It’s not worth paying for more than you really need.
One more aspect worth considering is the minimum vCPUs. For our compute environment we set it to 0, but have you noticed it took around 10 minutes (in my case; yours may differ) to provision an instance? Depending on your needs, you can set the minimum vCPUs to a non-zero value and keep some instances running at all times. AWS Batch will still scale up when more resources are needed and scale down when they are released, but it will never go below the minimum.
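An existing environment can be adjusted without recreating it. A sketch that raises the minimum to 2 vCPUs (one warm c4.large), using the hypothetical environment name from before:

```python
# Raise minvCpus on the existing environment so at least one
# c4.large (2 vCPUs) stays warm, avoiding the cold-start delay.
warm_pool_params = {
    "computeEnvironment": "my-first-compute-env",  # hypothetical name
    "computeResources": {"minvCpus": 2},
}

def keep_instances_warm(params):
    """Send the request; requires boto3 and valid AWS credentials."""
    import boto3
    return boto3.client("batch").update_compute_environment(**params)
```

The trade-off is explicit: a warm instance costs money while idle, but jobs start in seconds instead of minutes.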
After this article you should have some basic knowledge of how to use AWS Batch. You have configured a basic environment and run your first job.
Adapting AWS Batch to the computing tasks in your project can be very beneficial. Dividing complex problems into smaller pieces lets parallel execution cut overall compute times thanks to finer granularity. Batch maintains only the resources that are currently needed, so you no longer have to keep your own server running around the clock – a significant cost reduction. You can also create many queues and distribute tasks with different priorities, or even different resources, such as GPU-optimized compute environments. What’s more, you get logs and monitoring out of the box, so you can inspect how the computations are going.
There are also some challenges to face. The first relates to exchanging data between a job and your application; the second is handling errors. Neither is unsolvable – they simply have to be adapted to your specific needs.
The areas in which you can apply this tool are numerous – from simple data processing and video rendering to complex computations in machine learning and data mining projects. And even more demanding use cases will certainly come in the future.