Paws for thought – how ML is saving lockdown pets

tech4pets is a not-for-profit company that works with charities to improve animal welfare using technology solutions and big data. The Data Science team here at GlobalLogic saw an immense opportunity to get hands-on with an interesting dataset, whilst helping a heart-warming cause.

Categories: AI and ML, Big Data & Analytics

By applying machine learning (ML), GlobalLogic enriched tech4pets data and helped them gain a better understanding of the movement of animals across the UK.

Continue reading to discover how this was achieved…

The challenge

During the global pandemic, the demand for pets boomed significantly – with the number of adverts doubling, and the average price more than doubling at peak. This unprecedented increase in demand led to a rise in problem sellers, fraud, theft, tax evasion, smuggling, organised crime and, most upsettingly, a rise in animal abuse as pets have become an increasingly lucrative cash ‘commodity’.

tech4pets

With the majority of these advertisements posted online, tech4pets has been using graph databases and algorithms to identify entities of interest and work with various organisations to tackle problem selling [1]. One of the first use cases tech4pets brought to GlobalLogic was a Breed Classifier that would label millions of unclassified adverts and so build a much clearer picture of the flow of animals online.

Figure: Visualisation of a graph database and the connections [1]

How AWS sped up workflows

tech4pets provided GlobalLogic with a sample of their data – over a million rows of labelled and unlabelled classified adverts, including the text descriptions and much more. We chose a machine learning (ML) approach called Natural Language Processing (NLP) to extract insights from the text description of each advert. We then built a classification model to predict the dog, cat, rabbit, or horse breed from these insights.

The data was supplied in AWS S3, organised cleanly into hundreds of folders split by dates and animals – this is a great use case for AWS Glue & Athena and was an opportunity for the team to test out this powerful combination. A Glue crawler was created and pointed at our S3 bucket where the data was held, which then returned a single database table combining all fragmented datasets.

This table was then accessible in Athena, where we could run SQL queries on it and create further refined tables if needed. I was pleasantly surprised by how seamless the integration between Glue, Athena & S3 was, and by how quick and easy it was to access the data in an AWS SageMaker notebook instance using PyAthena.
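As a rough illustration of that last step, the sketch below builds a query and reads the result into pandas through PyAthena. The table name, column names, staging bucket and region are all hypothetical placeholders rather than details of the tech4pets setup, and the AWS libraries are imported lazily so the query builder can be tried without credentials.

```python
def build_advert_query(table, animal, limit=1000):
    """Build a simple Athena SQL query for one animal type.

    The table and column names are hypothetical placeholders.
    """
    # Basic guards so that only plain identifiers end up in the SQL string.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if not animal.isalpha():
        raise ValueError(f"unexpected animal value: {animal!r}")
    return (
        f"SELECT description, breed FROM {table} "
        f"WHERE animal = '{animal}' LIMIT {int(limit)}"
    )


def load_adverts(query, staging_dir, region):
    """Run the query in Athena and return the result as a pandas DataFrame."""
    # Imported here so build_advert_query works without the AWS libraries.
    import pandas as pd
    from pyathena import connect

    conn = connect(s3_staging_dir=staging_dir, region_name=region)
    return pd.read_sql_query(query, conn)


query = build_advert_query("adverts", "dog", limit=5)
print(query)

# With AWS credentials and a real staging bucket (placeholders shown here):
# df = load_adverts(query, "s3://<your-staging-bucket>/athena/", "<your-region>")
```

Athena writes query results to the S3 staging location before PyAthena reads them back, which is why a staging directory is needed even for a simple SELECT.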

Figure: Workflow from disjointed files to S3 to SageMaker

Exploratory Data Analysis – Initial Modelling

Once the data was accessible in SageMaker, GlobalLogic began the data science process with an initial exploratory data analysis. Using PyAthena and pandas, we gained an overview of all the data, columns and sparsity of information. As the data was well organised, minimal processing was needed to extract the inherent value – both for this use case of predicting pet breeds and beyond.

After getting an understanding of the data, GlobalLogic hypothesised that a model trained on the ‘description’ field could predict the breed of an advert, making any unlabelled breed data useful for tech4pets’ further analyses.

To get the best results from ML models on text, we ‘cleaned’ the text. We did this by removing symbols and punctuation, the (over)use of emojis and the stop words (common words with no contextual meaning e.g., ‘the’, ‘to’, ‘from’, ‘and’), and by converting it all to lowercase.

Cleaning the text made it easier for the model to read the data accurately. Because we had more than 250 breed labels (I didn’t even know that many breeds existed!), we chose to use just the ten labels with the most examples for initial modelling.
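Restricting the data to the most common labels can be sketched in a few lines. The rows below are made up for illustration, and the (description, breed) tuple layout is an assumption rather than the real tech4pets schema.

```python
from collections import Counter


def keep_top_labels(rows, n=10):
    """Keep only (description, breed) rows whose breed is among the n most common."""
    counts = Counter(breed for _, breed in rows)
    top = {breed for breed, _ in counts.most_common(n)}
    return [(text, breed) for text, breed in rows if breed in top]


# Tiny made-up sample; the real data had more than 250 breed labels.
sample = [
    ("lab puppies", "labrador retriever"),
    ("chocolate lab", "labrador retriever"),
    ("pug pups", "pug"),
    ("pug for sale", "pug"),
    ("cavapoochon litter", "cavapoochon"),
]
filtered = keep_top_labels(sample, n=2)
print(len(filtered))
```

With n=2 the single cavapoochon row is dropped, because that breed has fewer examples than the other two.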

Figure: Two stage text cleaning for a model

The figure above shows an example of the removal of noise from the description. Whilst the words ‘beautiful’ and ‘puppies’ might help with a sale, they don’t add much information for us in this use case. Once cleaned, we are left with ‘cavapoochon’ – which, unbeknown to most, is a dog breed! (A Cavapoochon is a triple-cross breed of a Cavalier King Charles Spaniel, a Bichon Frise and a Poodle – something I also just learned!)
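The cleaning steps described above could look something like the sketch below. It is a minimal version under stated assumptions: the stop-word list is a tiny illustrative one, and the second stage (dropping sales words such as ‘beautiful’ and ‘puppies’) is modelled with a hypothetical `extra_noise` parameter, since the post does not say how that stage was implemented.

```python
import re

# Small illustrative stop-word list (an assumption); real projects use a fuller one.
STOP_WORDS = {"the", "to", "from", "and", "a", "for", "of", "in", "is", "are"}


def clean_description(text, extra_noise=frozenset()):
    """Lowercase the text, strip symbols and emojis, then drop unhelpful words."""
    lowered = text.lower()
    # Anything that is not an ASCII letter, digit or whitespace becomes a space,
    # which removes punctuation, symbols and emojis in one pass.
    stripped = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [
        token
        for token in stripped.split()
        if token not in STOP_WORDS and token not in extra_noise
    ]


advert = "Beautiful Cavapoochon puppies for sale!!! \U0001F436"
stage_one = clean_description(advert)
# Hypothetical second stage: drop sales words that carry no breed information.
stage_two = clean_description(advert, extra_noise={"beautiful", "puppies", "sale"})
print(stage_one)
print(stage_two)
```

One caveat of the ASCII-only pattern is that it also strips accented letters, so a fuller version would need to handle spellings such as ‘Frisé’ before this step.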

Initially, an NLP technique called Term Frequency-Inverse Document Frequency (TF-IDF) [2] was used in conjunction with an XGBoost classification model. TF-IDF combines how frequent a word is in a document (our description) with how frequent or rare that word is across a corpus (all our descriptions).

This essentially identifies the most distinctively frequent or significant words in a document to decide which words are most ‘important’. XGBoost is a popular algorithm that builds an ensemble of decision trees, each one learning from the errors of those before it, to predict a class (here, the breed). More information here [3]. GlobalLogic used these two techniques because the templates provided in SageMaker make them really quick to set up, and because they make it easy to gauge how viable the hypothesis is in the initial modelling phase.
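To make the TF-IDF idea concrete, here is a small worked example in plain Python. It uses the textbook, unsmoothed formula (term frequency times the log of inverse document frequency); library implementations such as scikit-learn’s apply smoothing and normalisation, so their numbers will differ, and this is not a reproduction of the SageMaker template used in the project. The three ‘descriptions’ are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenised document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Count each term once per document to get its document frequency.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up, already-cleaned descriptions.
docs = [
    ["cavapoochon", "puppies", "ready"],
    ["labrador", "puppies", "ready"],
    ["pug", "puppies"],
]
weights = tf_idf(docs)
for term, score in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

‘puppies’ appears in every description, so its score is zero, while ‘cavapoochon’ appears in only one and scores highest. Scores like these are the features a classifier such as XGBoost would then be trained on.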

Results – further modelling and improvements

With the above approach, GlobalLogic achieved 65% accuracy on predicting dog breeds from the description. The model learned the variations of breed names used (e.g., lab = Labrador Retriever), and which words corresponded to the post’s breed even if there were multiple breeds in the description. This was a decent first-pass model, which suggested that we could get a more accurate model using stronger algorithms and small improvements around the data.

Figure: Testing predictions of dog breed from sample text

Collaboration within the Data Science team meant that GlobalLogic had access to a powerful ML model template. This used Long Short-Term Memory (LSTM) networks – a type of recurrent neural network capable of learning patterns in word order within sentences, which helps to predict which word comes next [4].

The template also uses GloVe, an unsupervised learning algorithm for obtaining vector representations of words (turning words into numbers). The vectors are trained on a large dataset and have ‘learned’ what words mean in different contexts, e.g., king – man + woman ≈ queen [5]. This is a big step up in complexity from the TF-IDF technique used previously, and potentially far more accurate.
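The king – man + woman ≈ queen arithmetic can be shown with a toy example. The three-dimensional vectors below are hand-made purely for illustration and are NOT real GloVe embeddings, which typically have 50 to 300 dimensions and are learned from text.

```python
import math

# Hand-made toy vectors, NOT real GloVe embeddings. The three dimensions
# loosely stand for (royalty, maleness, femaleness).
VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "puppy": [0.0, 0.3, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))


print(analogy("king", "man", "woman"))
```

Because similar words end up close together in the vector space, a model built on embeddings can treat related spellings and nicknames as related, something raw word counts cannot do.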

Figure: Visualisation of words turned into vectors and their correlations

Using stronger models proved to be beneficial to our overall accuracy in this use case. We were able to implement our template quickly and run it on the other animals too, achieving increased accuracies of 78% for dogs, 81% for cats & 75% for rabbits! Wow! We now have this approach and codebase in our ‘arsenal’ for future use cases as well.

Figure: Plot of accuracy of classification model by breed upon tech4pets data
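A per-breed breakdown like the one plotted above can be computed as in the sketch below. The labels and predictions are made up, and the per-breed figure here is the share of adverts of each true breed that were predicted correctly (per-class recall), which is an assumption about what the plot shows.

```python
from collections import defaultdict


def accuracy_by_label(y_true, y_pred):
    """Return overall accuracy and a {label: share predicted correctly} dict."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    per_label = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_label


# Made-up labels and predictions for illustration only.
y_true = ["pug", "pug", "labrador", "labrador"]
y_pred = ["pug", "labrador", "labrador", "labrador"]
overall, per_label = accuracy_by_label(y_true, y_pred)
print(overall, per_label)
```

Looking at breeds individually matters because a single headline accuracy can hide breeds the model rarely gets right.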

Conclusion – where the project is today and going next

We’ve achieved our initial goal and proved that we can predict the breed of animals from a subset of historical unlabelled data. This is helping tech4pets enrich their data and get a better understanding of the movement of animals across the UK.

Co-founder of tech4pets, Keith Hinde, discusses the challenges the organisation faces and the takeaways from GlobalLogic’s involvement in this project:

Some of the most problematic sources of data we deal with at tech4pets are those which neither solicit nor provide breed information about the pets advertised. Unfortunately, those same sources are often the most problematic from both animal and consumer welfare perspectives. With the great work that the team has done, we now have access to an efficient and accurate breed classifier. This significantly improves the depth and quality of analysis we can provide to clients and therefore tangibly helps the pets they seek to protect.

As our model is now deployed, we can pass in bulk historical unlabelled data. This allows us to extract useful data or combine it with other models for a better solution. A use case we’re looking forward to is using Computer Vision to predict dog and cat breeds from images of the animals in the listings! After that, we can combine the two models as an ‘ensemble model’ that will be more powerful still.
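One simple way such an ensemble could work is ‘soft voting’: averaging the breed probabilities from the text model and the image model. This is only a sketch of one possible design, not the planned implementation; the probabilities and the `text_weight` parameter are hypothetical.

```python
def soft_vote(text_probs, image_probs, text_weight=0.5):
    """Blend two {breed: probability} dicts with a weighted average."""
    labels = set(text_probs) | set(image_probs)
    blended = {
        label: text_weight * text_probs.get(label, 0.0)
        + (1.0 - text_weight) * image_probs.get(label, 0.0)
        for label in labels
    }
    best = max(blended, key=blended.get)
    return best, blended


# Made-up outputs from a text model and an image model for one listing.
text_probs = {"pug": 0.6, "french bulldog": 0.4}
image_probs = {"pug": 0.3, "french bulldog": 0.7}
best, blended = soft_vote(text_probs, image_probs)
print(best)
```

Here the text model slightly prefers ‘pug’, but the image model is more confident about ‘french bulldog’, so the blended prediction follows the image model.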

This has been a great first step for our project with tech4pets, and the start of many more use cases and machine learning implementations to help identify illegal sellers and animal abuse. For me personally, getting my hands on real data and creating models was invaluable. The fact that this project will help animals in the real world made the work feel like it had a real impact.

If you’d like to support vulnerable pets in the UK, why not donate?

Our Data Science team within GlobalLogic

Off the back of this project, we now have additional machine learning and data science accelerators and knowledge. With our sizeable data science team, GlobalLogic can replicate this success within a customer environment if so needed.

Thanks for reading if you got this far! Thanks also to all the Data Science team for their assistance. I’m glad I had an opportunity to work on a project so close to my heart and aligned with GlobalLogic’s focus on charitable work.

Also, massive thanks to my GlobalLogic colleague (and tech4pets co-founder) Keith Hinde for getting the team involved in this project. A link to a recent blog outlining some of their work can be found here [6].

Thanks for reading!

****

About the author

Hello everyone, I’m Roger Zorlu and this is my first blog post for GlobalLogic! I’ve been part of the GlobalLogic Data Science team for just over half a year now, having spent the last month on an important project for an organisation called tech4pets.

****