tech4pets began by providing a sample of their data – over a million rows of labelled and unlabelled classified adverts, each with a free-text description and other fields. Using Natural Language Processing (NLP), we extracted insights from the text descriptions and built a classification model to predict the dog, cat, rabbit or horse breed from those insights.
The data was supplied in an AWS S3 bucket. A Glue crawler was created and pointed at the bucket, cataloguing the fragmented datasets into a single database table. This table was accessible in Athena, where we ran SQL queries and created further refined tables where needed.
Once the data was accessible in SageMaker, GlobalLogic began an initial exploratory data analysis. Using PyAthena and pandas, we gained an overview of the data: its columns, coverage and sparsity. We then hypothesised that, using the ‘description’ field, a model could be trained to predict the breed advertised, making the unlabelled data useful for tech4pets’ further analyses.
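As an illustration of the kind of overview this step produces, here is a minimal sketch using pandas on a hypothetical toy DataFrame standing in for the Athena table (the real data and PyAthena connection are not shown):

```python
import pandas as pd

# Hypothetical stand-in for the table queried from Athena via PyAthena.
adverts = pd.DataFrame({
    "description": ["Lovely lab puppy", None, "Persian kittens", "Netherland dwarf"],
    "breed": ["Labrador Retriever", None, None, None],  # mostly unlabelled
    "animal": ["dog", "dog", "cat", "rabbit"],
})

# Column-level sparsity: fraction of missing values per column.
sparsity = adverts.isna().mean()
print(sparsity)
```

On this toy sample, ‘breed’ is mostly empty while ‘description’ is well populated – the pattern that motivated predicting the breed from the description.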
To get the best results from ML models on text, the text was cleaned: symbols, punctuation, the (over)use of emojis and stop words were removed, and everything was converted to lowercase – giving the model a cleaner, more consistent input.
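A minimal sketch of this cleaning step (the exact rules and stop-word list used in the project are not shown, so these are illustrative):

```python
import re

# Illustrative stop-word list; the project's actual list is not shown.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "for", "to"}

def clean_text(text: str) -> str:
    text = text.lower()                   # convert to lowercase
    text = re.sub(r"[^\w\s]", " ", text)  # strip symbols, punctuation, emojis
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)

print(clean_text("A LOVELY Lab puppy for sale!!! 🐶"))
# -> "lovely lab puppy sale"
```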
Initially, an NLP technique called Term Frequency-Inverse Document Frequency (TF-IDF) was used in conjunction with an XGBoost classification model. These techniques proved beneficial as they were quick to set up in SageMaker and made it easy to test the viability of the hypothesis in the initial modelling phase.
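The shape of this pipeline is simple to sketch. The adverts below are hypothetical toy data, and scikit-learn's GradientBoostingClassifier stands in for XGBoost here (the same fit/predict pattern applies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.pipeline import make_pipeline

# Hypothetical toy adverts; the real training set is tech4pets' labelled data.
texts = [
    "chocolate lab puppies ready now", "labrador retriever pups kc registered",
    "beautiful french bulldog puppy", "frenchie puppies health checked",
    "stunning lab puppy last one", "french bulldog boy blue",
]
breeds = ["Labrador Retriever", "Labrador Retriever", "French Bulldog",
          "French Bulldog", "Labrador Retriever", "French Bulldog"]

# TF-IDF turns each description into a weighted bag-of-words vector,
# down-weighting terms that appear in nearly every advert.
model = make_pipeline(TfidfVectorizer(), GradientBoostingClassifier(random_state=0))
model.fit(texts, breeds)

print(model.predict(["lab puppy for sale"])[0])
```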
Using this approach, GlobalLogic achieved 65% accuracy in predicting dog breeds from the description. The model learned the variations of breed names used (e.g. ‘lab’ = Labrador Retriever) and which words corresponded to the post’s breed, even when multiple breeds were mentioned.
In an effort to improve results, our Data Science team turned to a more powerful ML model template based on Long Short-Term Memory (LSTM) networks – a type of recurrent neural network capable of learning patterns in word order within sentences, which helps it predict which word comes next.
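The LSTM recurrence is easiest to see in a minimal numpy sketch of a single cell step (illustrative only – the project used a full network template, not hand-rolled cells):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, store and emit."""
    z = W @ x + U @ h_prev + b  # all four gate pre-activations at once
    H = h_prev.size
    f = sigmoid(z[0:H])          # forget gate
    i = sigmoid(z[H:2*H])        # input gate
    g = np.tanh(z[2*H:3*H])      # candidate cell values
    o = sigmoid(z[3*H:4*H])      # output gate
    c = f * c_prev + i * g       # cell state carries long-range memory
    h = o * np.tanh(c)           # hidden state is the step's output
    return h, c

# Toy dimensions: 5-dim word vector in, 3-dim hidden state out.
rng = np.random.default_rng(0)
D, H = 5, 3
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(4, D)):  # run over a 4-"word" sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (3,)
```

Because the cell state is carried from word to word, the network can use the order of words in a description, not just their counts as TF-IDF does.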
This template also uses GloVe, an unsupervised learning algorithm for obtaining vector representations of words (turning words into numbers). Trained on a large dataset, GloVe has ‘learned’ what words mean in different contexts, e.g. king – man + woman ≈ queen. This was a big step up in complexity from the TF-IDF technique used previously, and potentially more accurate.
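The analogy arithmetic can be demonstrated with toy vectors (hand-crafted here for illustration; real GloVe vectors are 50–300-dimensional and trained on billions of tokens):

```python
import numpy as np

# Hand-crafted 2-d 'embeddings' that mimic the gender/royalty structure
# real GloVe vectors learn; purely illustrative.
vectors = {
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "apple": np.array([0.0, 3.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # -> "queen"
```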
By using these stronger models, we were able to implement our template quickly and run it on all the other animals, achieving improved accuracies of 78% for dogs, 81% for cats and 75% for rabbits.