Paper Classification Integration

Day 10 - Integrating the soft labels model into the pipeline

Industrializing the soft labels model

We have created a classification model that assigns each item a score between 0 and 1 for every category. The next step is to industrialize our code by integrating this model into our workflow.

The model we chose for the classification is bert-tiny from the Hugging Face Hub. Our goal is to make data retrieval fast, so that we do not have to predict the categories every time a user sends a query. We therefore run the inference offline and rely on the speed of Redis to serve the metadata to the user.

The steps of the workflow

We decided to split our classification pipeline into three steps:

  • training: We train our model and store the resulting weights locally. As a next step, it would be good to save the weights to a service such as W&B or Artifact Registry.
  • inference: We use the weights of our trained model to run inference on the dataset. This is the step where we obtain the soft labels between 0 and 1.
  • re-training: The script scripts/retrain_model.sh retrains the model. It needs to be run whenever new papers are added to the dataset. As a next step, it would be good to create a Cloud Function that automatically retrains the model when new data is detected.
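To make the inference step more concrete, here is a minimal sketch of how raw model outputs can be turned into soft labels between 0 and 1. It assumes a multi-label setup in which each category gets its own independent sigmoid; the source does not state which activation the real model uses, and the category names and the helper names soft_labels and top_k are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical category names; the real label set comes from the dataset.
CATEGORIES: List[str] = ["physics", "math", "cs"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a real logit into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_labels(logits: List[float], categories: List[str]) -> Dict[str, float]:
    """Turn one row of raw logits into one score per category."""
    if len(logits) != len(categories):
        raise ValueError("expected exactly one logit per category")
    return {cat: sigmoid(logit) for cat, logit in zip(categories, logits)}


def top_k(labels: Dict[str, float], k: int = 2) -> List[str]:
    """Return the k highest-scoring categories, best first."""
    return sorted(labels, key=labels.get, reverse=True)[:k]


# Example logits, as they might come out of the classification head (assumed values).
scores = soft_labels([2.0, 0.0, -3.0], CATEGORIES)
best = top_k(scores, k=2)
```

With independent sigmoids the scores do not need to sum to 1, which suits papers that belong to several categories at once. If the model were trained with a softmax head instead, the scores would form a single probability distribution.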

Including the soft labels and the text encoding in Redis

We decided to have only one entry point to generate the metadata of every paper:

  • generate_index.py:
    • run the encoding for the text of every paper
    • run the soft label inference pipeline
    • save the data in a pickle file

The final pickle file is then pushed to Redis using load_data.py.
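The single entry point described above can be sketched roughly as follows. This is not the actual content of generate_index.py: the function names build_index, save_index and load_index, the record fields, and the stand-in encoder and classifier are all assumptions made so that the example runs without a model.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Paper = Dict[str, str]


def build_index(
    papers: List[Paper],
    encode: Callable[[str], List[float]],
    predict: Callable[[str], Dict[str, float]],
) -> List[dict]:
    """Attach a text embedding and soft labels to every paper."""
    index = []
    for paper in papers:
        text = paper["abstract"]
        index.append(
            {
                "id": paper["id"],
                "vector": encode(text),       # text encoding step
                "categories": predict(text),  # soft label inference step
            }
        )
    return index


def save_index(index: List[dict], path: Path) -> None:
    """Write the metadata of every paper to a single pickle file."""
    with path.open("wb") as fh:
        pickle.dump(index, fh)


def load_index(path: Path) -> List[dict]:
    """Read the pickle file back, as a loader script would before pushing to Redis."""
    with path.open("rb") as fh:
        return pickle.load(fh)


# Stand-ins for the real encoder and classifier (hypothetical).
def fake_encode(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]


def fake_predict(text: str) -> Dict[str, float]:
    return {"physics": 0.9 if "quantum" in text else 0.1}


papers = [{"id": "p1", "abstract": "quantum error correction"}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.pkl"
    save_index(build_index(papers, fake_encode, fake_predict), path)
    restored = load_index(path)
```

Passing the encoder and the classifier in as functions keeps the two expensive steps swappable, which makes the entry point easy to test with lightweight stand-ins.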

One challenge we faced when pushing the categories to Redis was choosing a format in which to store the predicted categories and their associated scores. We decided to store them in a single string, which means we have to build the string when writing and parse it again when reading the categories and their values.
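As an illustration of the string approach, here is a minimal sketch of a serializer and parser. The exact format used in the project is not shown in this post, so the "category:score|category:score" layout, the separators, and the function names encode_labels and decode_labels are assumptions.

```python
from typing import Dict

PAIR_SEP = "|"   # assumed separator between category/score pairs
SCORE_SEP = ":"  # assumed separator between a category and its score


def encode_labels(labels: Dict[str, float], ndigits: int = 4) -> str:
    """Serialize {category: score} into a single string for a Redis string value."""
    for category in labels:
        if PAIR_SEP in category or SCORE_SEP in category:
            raise ValueError(f"category contains a reserved character: {category!r}")
    return PAIR_SEP.join(
        f"{category}{SCORE_SEP}{round(score, ndigits)}"
        for category, score in labels.items()
    )


def decode_labels(raw: str) -> Dict[str, float]:
    """Parse the string produced by encode_labels back into {category: score}."""
    if not raw:
        return {}
    labels: Dict[str, float] = {}
    for pair in raw.split(PAIR_SEP):
        category, score = pair.rsplit(SCORE_SEP, 1)
        labels[category] = float(score)
    return labels


encoded = encode_labels({"physics": 0.91234, "math": 0.05})
decoded = decode_labels(encoded)
```

The cost of this design is visible in the code: every reader has to know the separators, and a category name containing one of them would corrupt the value, hence the explicit check.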

As a next step, we could store the scores in a Redis data type better suited to this than a string, such as a hash or a sorted set.
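One possible direction, sketched here under assumptions, is a Redis hash with one field per category, so no custom parsing is needed. The helpers only rely on the hset(name, mapping=...) and hgetall(name) calls of a redis-py style client. The key name, the helper names, and the InMemoryClient class are hypothetical; the latter is a stand-in so the example runs without a Redis server.

```python
from typing import Dict, Mapping, Union

Text = Union[str, bytes]


def labels_to_hash_fields(labels: Mapping[str, float]) -> Dict[str, str]:
    """One hash field per category; scores are stored as text."""
    return {category: repr(float(score)) for category, score in labels.items()}


def hash_fields_to_labels(fields: Mapping[Text, Text]) -> Dict[str, float]:
    """Convert HGETALL output (bytes by default in redis-py) back to floats."""
    def as_text(value: Text) -> str:
        return value.decode("utf-8") if isinstance(value, bytes) else value

    return {as_text(k): float(as_text(v)) for k, v in fields.items()}


def store_labels(client, key: str, labels: Mapping[str, float]) -> None:
    fields = labels_to_hash_fields(labels)
    if fields:  # skip the call for papers without labels; an empty HSET is invalid
        client.hset(key, mapping=fields)


def fetch_labels(client, key: str) -> Dict[str, float]:
    return hash_fields_to_labels(client.hgetall(key))


class InMemoryClient:
    """Hypothetical stand-in mimicking the two client calls used above."""

    def __init__(self) -> None:
        self._hashes: Dict[str, Dict[bytes, bytes]] = {}

    def hset(self, name: str, mapping: Mapping[str, str]) -> int:
        stored = self._hashes.setdefault(name, {})
        stored.update({k.encode(): v.encode() for k, v in mapping.items()})
        return len(mapping)

    def hgetall(self, name: str) -> Dict[bytes, bytes]:
        return dict(self._hashes.get(name, {}))


client = InMemoryClient()
store_labels(client, "paper:p1:categories", {"physics": 0.9, "math": 0.05})
fetched = fetch_labels(client, "paper:p1:categories")
```

A sorted set would be an alternative if the main access pattern were "top categories for a paper", since Redis can then return the members ordered by score.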

Final result

Looking directly inside the Redis database, we can see our soft labels stored as a string:
