Day 10 - Integrating the soft labels model into the pipeline
Industrializing the soft labels model
We have created a classification model that assigns each item a score between 0 and 1 for every category. The next step is to industrialize our code by integrating this model into our workflow.
The model we chose for the classification is bert-tiny from the Hugging Face Hub. Our goal is to make data retrieval fast and to avoid predicting the categories every time a user sends a query. We therefore run the inference offline and rely on the speed of Redis to serve the metadata to the user.
The steps of the workflow
We decided to split our classification pipeline into 3 steps:
- training: We train our model and store the resulting weights locally. As a next step, it would be good to save the weights to a service such as W&B or Artifact Registry.
- inference: We use the weights of our trained model to run inference on the dataset. This is the step where we obtain the soft labels between 0 and 1.
- re-training: The script scripts/retrain_model.sh retrains the model. It needs to be run whenever new papers are added to the dataset. As a next step, it would be good to create a Cloud Function that automates retraining when new data is detected.
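To make the inference step more concrete, here is a minimal sketch of how raw model outputs can be turned into soft labels between 0 and 1. It assumes the classifier emits one logit per category and that each score is obtained independently with a sigmoid. The post does not state this, so treat it as one plausible setup. The function names and the category names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a score between 0 and 1 (numerically stable form)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_labels(logits: dict[str, float]) -> dict[str, float]:
    """Assumption: one independent logit per category (multi-label setup)."""
    return {category: sigmoid(logit) for category, logit in logits.items()}


# Hypothetical logits for a single paper; category names are illustrative only.
example_logits = {"cs.LG": 2.0, "cs.CL": 0.0, "math.ST": -3.0}
scores = soft_labels(example_logits)
```

Because each category is scored on its own, the scores do not have to sum to 1, which is what lets a paper belong strongly to several categories at once.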
Including the soft labels and the text encoding in Redis
We decided to have a single entry point, generate_index.py, to generate the metadata of every paper:
- run the encoding for the text of every paper
- run the soft label inference pipeline
- save the data in a pickle file
The final pickle file is then pushed to Redis using load_data.py.
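The three steps above can be sketched as follows. This is not the actual content of generate_index.py. The two helper functions are placeholders standing in for the real text encoder and the real soft label pipeline, and the record layout (paper_id, vector, categories) is an assumption made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def encode_text(text: str) -> list[float]:
    """Placeholder encoder: stands in for the real text-embedding model."""
    return [float(len(text)), float(len(text.split()))]


def predict_soft_labels(text: str) -> dict[str, float]:
    """Placeholder for the soft label inference pipeline."""
    return {"example-category": 0.5}


def generate_index(papers: dict[str, str], output_path: Path) -> list[dict]:
    """Build one metadata record per paper and save them all in a pickle file."""
    records = []
    for paper_id, text in papers.items():
        records.append(
            {
                "paper_id": paper_id,
                "vector": encode_text(text),               # step 1: text encoding
                "categories": predict_soft_labels(text),   # step 2: soft labels
            }
        )
    with output_path.open("wb") as f:                      # step 3: pickle file
        pickle.dump(records, f)
    return records


# Round trip on a temporary file to check that the pickle is readable.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.pkl"
    written = generate_index({"paper-1": "soft labels with bert-tiny"}, path)
    with path.open("rb") as f:
        loaded = pickle.load(f)
```

Keeping a single entry point means the loading script only has to read one file, whatever models produced its content.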
One challenge we faced when pushing the categories to Redis was choosing a format to store the predicted categories and their associated scores. We decided to store them in a single string, which means we have to build the string when storing the categories and parse it when retrieving them with their values.
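As an illustration of this store-and-parse round trip, here is a minimal sketch. The post does not specify the exact string layout, so the format used here (category:score pairs joined by "|") and the function names are assumptions.

```python
def encode_categories(scores: dict[str, float]) -> str:
    """Serialize category scores as 'name:score|name:score' (assumed format)."""
    return "|".join(f"{name}:{score:.4f}" for name, score in scores.items())


def decode_categories(raw: str) -> dict[str, float]:
    """Parse the string produced by encode_categories back into a dict."""
    if not raw:
        return {}
    result = {}
    for part in raw.split("|"):
        # rpartition splits on the last ':' so a name containing ':' still parses.
        name, _, value = part.rpartition(":")
        result[name] = float(value)
    return result


# Hypothetical scores for one paper; category names are illustrative only.
encoded = encode_categories({"cs.LG": 0.9123, "cs.CL": 0.05})
decoded = decode_categories(encoded)
```

The fixed number of decimals keeps the stored strings short, at the cost of rounding the scores when they are written.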
As a next step, we could choose a Redis data type better suited to this data than strings.
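One candidate is a Redis hash with one field per category, which removes the need for a custom string format. The sketch below only builds and reads the field/value mapping, so it runs without a Redis server. With the redis-py client, the mapping could be written with hset(name, mapping=...) and read back with hgetall(name). The helper names are hypothetical, and this is a possible direction rather than what the project implements.

```python
def to_hash_mapping(scores: dict[str, float]) -> dict[str, str]:
    """One hash field per category; scores are stored in their text form."""
    return {name: repr(score) for name, score in scores.items()}


def from_hash_mapping(raw: dict) -> dict[str, float]:
    """Convert a hash read from Redis back into category scores.

    Fields and values may arrive as bytes (the redis-py default) or as str
    (when the client is created with decode_responses=True).
    """
    def as_text(value):
        return value.decode("utf-8") if isinstance(value, bytes) else value

    return {as_text(k): float(as_text(v)) for k, v in raw.items()}


# Hypothetical scores; category names are illustrative only.
mapping = to_hash_mapping({"cs.LG": 0.91, "cs.CL": 0.05})
restored = from_hash_mapping({b"cs.LG": b"0.91", b"cs.CL": b"0.05"})
```

With a hash, a single category score can also be read or updated without touching the others.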
Final result
If we look directly inside the Redis database, we can see our soft labels stored as a string: