Deep Learning has been the cool kid on the block in the Data Science world for the better part of a decade now. Despite this, a lot of companies still do not know how to create Deep Learning models that are commercially viable. In this three-part blog series, I will share our GlobalLogic secrets about how we approach data-parallel modelling at an awesome scale and how we orchestrate Deep Learning pipelines using Amazon SageMaker.
The examples in this blog series are real ongoing projects productionised using our in-house bespoke Deep Learning Accelerator service offering. If your company requires Deep Learning models for life (not just for Christmas), then get in touch and we'll be happy to help!
This first blog introduces Deep Learning and focuses primarily on scaling data processing using Amazon Web Services (AWS). The second blog will focus on distributed training using SageMaker's TensorFlow container and Uber's Horovod. The third and final blog will focus on model deployment and custom inference, and share some extra tips.
Figure 1 SageMaker Training Pipeline
Note: This blog only contains snippets of SageMaker Pipelines code, not the processing scripts. The purpose of this blog series is to provide guidance and links to helpful resources.
Our high-level processing design
By the end of this blog, we will have created a SageMaker Pipeline that handles distributed data pre-processing in two pipeline steps:
- Create manifest files
- Shard these manifest files across multi-instance processing jobs, as shown in the following architecture
Figure 2 High-Level Design for Processing steps
What is Deep Learning?
Before we get into the crux of this blog, it's important to take a minute to explain what Deep Learning is, where it fits within Machine Learning (ML), and why it is so popular.
Broadly speaking, Deep Learning is a subset of ML and Artificial Intelligence (AI), as shown in the graphic below. On a practical level, AI approaches a problem through inductive learning: continually generalising rules from observed data. This differs from traditional software programming, which uses deductive reasoning to translate pre-defined business logic into conditional statements (with certainty).
ML is simply iterative algorithms that repeatedly update a set of model parameters to minimise or maximise a business objective (e.g. maximise the model's accuracy at correctly identifying a dog).
Deep Learning algorithms update a lot of model parameters; conversely, shallow learning algorithms update a small number of parameters. However, there is no formal definition of 'a lot'. That's it!
Deep Learning is typically associated with neural networks, which scale well for complex problems requiring a huge number of parameters. I will be demonstrating a TensorFlow Deep Neural Network model in the second part of this blog series.
Data Processing at scale
Okay, so we know Deep Learning uses a lot of model parameters, but what does this have to do with the amount of data? Well, usually you will want the number of model parameters and the quantity of data to grow with the same scaling factor. I won't get into specifics here, but essentially each model parameter is akin to our model learning a feature (a common characteristic) of the input data. As we learn more model parameters, these features become extremely specific to the dataset we are learning from, and our model starts to memorise patterns in (overfit to) the input data. Increasing the amount of data our model learns from mitigates this memorisation issue, helping the model generalise the observed patterns.
Figure 3 Effect of increasing amount of data compared to number of model parameters
Importance of data lineage in MLOps
After you have collected, curated, and beautifully organised your data in Amazon S3, what now? Well, we could just go right ahead and train our model on all of it, but when we're talking about gigabytes, possibly terabytes, of data, it's crucial we know exactly what data is being used, where, and when. This is particularly important when you need to track data bias, data quality, and data quantity over time.
So, the first step is tracking data lineage. There are many tools available to help us do this (such as Amazon SageMaker Feature Store), but for simplicity we will be using S3 manifest files. And you'll understand why this simple approach works extremely well in the next few steps.
Step 1: Create manifest files for all data channels
In any ML pipeline, we need to define our training, validation, and testing datasets. It’s extremely important we do not mix up our datasets! Otherwise, we could unknowingly report false model performance.
I create these three manifest files in a SageMaker Processing step. At a high level, the process is as follows (a minimal sketch appears after the list):
- Query S3 to extract class labels from S3 prefixes (using boto3)
- Create a dataframe of all S3 prefixes with two columns: 'prefix' and 'class_label'
- Apply any encoding to the 'class_label' column (e.g. label encoding)
- Create a new column, 'channel', and assign each row to one of our three datasets: train, validation, or test
- For all rows in each unique 'channel', create a manifest file for that channel (making sure there are enough samples per class label in each channel)
- Save the dataframe and each manifest file
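As noted above, this blog doesn't reproduce our processing scripts, but here is a minimal sketch of what such a script could look like. It assumes a hypothetical bucket laid out as s3://&lt;bucket&gt;/&lt;class_label&gt;/&lt;file&gt;; the bucket name, split ratios, and encoding choice are all illustrative:

```python
import json

import boto3
import numpy as np
import pandas as pd

BUCKET = "my-dataset-bucket"  # hypothetical bucket, laid out as <class_label>/<file>

# 1. Query S3 and extract class labels from the object prefixes
s3 = boto3.client("s3")
rows = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        rows.append({"prefix": obj["Key"], "class_label": obj["Key"].split("/")[0]})

# 2. Build the dataframe and label-encode the class labels
df = pd.DataFrame(rows)
df["class_label"] = df["class_label"].astype("category").cat.codes

# 3. Assign each row to a channel (in practice, a stratified split per class
#    is advisable to guarantee enough samples per class in every channel)
df["channel"] = np.random.choice(
    ["train", "validation", "test"], size=len(df), p=[0.8, 0.1, 0.1]
)

# 4. Write one manifest per channel in the SageMaker ManifestFile format,
#    then save the dataframe itself for lineage tracking
for channel, group in df.groupby("channel"):
    manifest = [{"prefix": f"s3://{BUCKET}/"}] + group["prefix"].tolist()
    with open(f"{channel}.manifest", "w") as f:
        json.dump(manifest, f)

df.to_csv("dataset_lineage.csv", index=False)
```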
Each manifest file should follow this schema template:
Figure 4 An example of what a manifest file looks like
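For reference, a manifest in the standard SageMaker ManifestFile format is a JSON array whose first element is a prefix object and whose remaining elements are object keys relative to that prefix. The bucket and file names below are illustrative:

```json
[
    {"prefix": "s3://my-dataset-bucket/"},
    "dog/image_001.jpg",
    "dog/image_002.jpg",
    "cat/image_001.jpg"
]
```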
Here is an example of a processing step in our pipeline to create these manifest files:
Figure 5 SageMaker Pipeline Create Manifest files step
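A minimal sketch of such a step is shown below. The processor choice, framework version, instance type, and the script name create_manifests.py are illustrative assumptions, not our exact configuration:

```python
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep

# Single-instance processor: manifest creation is lightweight
manifest_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,  # your SageMaker execution role
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step_create_manifests = ProcessingStep(
    name="CreateManifestFiles",
    processor=manifest_processor,
    outputs=[
        ProcessingOutput(
            output_name="manifests",
            source="/opt/ml/processing/output/manifests",
        )
    ],
    code="create_manifests.py",  # hypothetical script implementing Step 1
)
```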
Step 2: ShardedByS3Key automatic data distribution
Now that we have created our manifest files, we can exploit the power of AWS processing! Besides having created a flat-file database that helps track data lineage, we can also use SageMaker's built-in data distribution strategy, 'ShardedByS3Key', to automatically subsample and distribute data to each of our processing instances.
This means we can scale our processing job to an (almost) unlimited number of processing instances. Each processing instance is given a copy of our pre-processing code, along with its own unique subset of each data channel.
Beware the urge to spread your processing workload across hundreds of instances, as you will reach a point of diminishing returns. Each instance outputs its own processed dataset channels (train, validation, and test). A useful tip is to scale your processing job to the total number of GPUs used in your training job(s), as this makes the data flow much easier to orchestrate (and avoids having to track multiple processed data locations per GPU).
Our SageMaker Pipeline code for this step will look like the following:
Figure 6 SageMaker Pipeline processing step
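The following sketch illustrates the key idea, assuming the step and script names from earlier; the instance type and count are again illustrative. The important details are s3_data_type="ManifestFile" and s3_data_distribution_type="ShardedByS3Key" on each input:

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import ProcessingStep

# Multi-instance processor: SageMaker gives each instance its own data shard
data_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,  # your SageMaker execution role
    instance_type="ml.m5.2xlarge",
    instance_count=4,  # e.g. match the total GPU count of your training job(s)
)

# Reference the manifests produced by the previous step
manifests_uri = step_create_manifests.properties.ProcessingOutputConfig.Outputs[
    "manifests"
].S3Output.S3Uri

step_process_data = ProcessingStep(
    name="ProcessData",
    processor=data_processor,
    inputs=[
        ProcessingInput(
            input_name=channel,
            source=Join(on="/", values=[manifests_uri, f"{channel}.manifest"]),
            destination=f"/opt/ml/processing/input/{channel}",
            s3_data_type="ManifestFile",
            # Each instance receives a unique subset of the objects listed
            # in the manifest, rather than a full copy
            s3_data_distribution_type="ShardedByS3Key",
        )
        for channel in ["train", "validation", "test"]
    ],
    outputs=[
        ProcessingOutput(output_name="processed", source="/opt/ml/processing/output")
    ],
    code="preprocess.py",  # hypothetical pre-processing script
)
```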
And that’s it! We can test our pipeline by running the following code:
Figure 7 Execute SageMaker Pipeline
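A minimal sketch of wiring the two steps together and starting an execution (the pipeline name is illustrative):

```python
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="ProcessingPipeline",  # illustrative name
    steps=[step_create_manifests, step_process_data],
)

# Create (or update) the pipeline definition, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()
```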
You can check this is working by looking at the logs in Amazon CloudWatch. Each processing instance has its own log stream ('algo-{instance_number}').
Figure 8 Checking multi-instance processing is working
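If you prefer to check programmatically, a small boto3 sketch like the following lists the per-instance streams; the job name placeholder is yours to fill in:

```python
import boto3

logs = boto3.client("logs")

# Processing job logs live in this log group, one 'algo-N' stream per instance
response = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/ProcessingJobs",
    logStreamNamePrefix="<your-processing-job-name>",  # hypothetical job name
)
for stream in response["logStreams"]:
    print(stream["logStreamName"])  # e.g. <job-name>/algo-1, <job-name>/algo-2
```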
Summary
In two very simple steps, we have a scalable SageMaker Pipeline for data processing. This processing technique can be applied to most ML projects and, best of all, because Amazon S3 and SageMaker Processing are managed services, we do not need to spend resources maintaining the data distribution codebase.
Data scientists can now spend more of their time focusing on translating business requirements to data science methodology and optimising ML solutions.
We’ve started using this framework for developing Computer Vision models for the Tech4Pets charity, which has allowed us to produce production-grade models with record-low development time and cost! Check out my colleague Roger’s article about how ML is helping save lockdown pets.
Closing tip
In this blog series I talk almost exclusively about horizontal scaling (increasing the number of compute instances). However, you should always try scaling vertically first (increasing the compute resources per machine), as increased processing power alone may solve your scaling problems. Horizontal scaling is much more challenging to get right and brings with it a host of potential problems to address, such as handling rank-specific code blocks, multi-instance networking issues, and process deadlocks.
Keep your eyes peeled for part two of my three-part blog series!
About the author
Jonathan Hill, Senior Data Scientist at GlobalLogic. I’m passionate about data science and have more recently focused on helping productionise Machine Learning code and artefacts into MLOps pipelines using AWS SageMaker best practices.