Indexing Contents to Redis

Reindexing contents

If you want to index your own private database of documents, you can reuse our scripts/.

Generating the index

To generate the index for papers updated in a specific month (YYYYMM), run this script. You must specify a JSON file containing your corpus, the output path where the index file will be stored (in Pickle format), and the name of the model used to encode your sentences.

This task can be run every month on the most recently updated arXiv data using a regular desktop computer. You might want to schedule it with a tool like Airflow to automate it.

% ./ --help

NAME - Generate Embeddings and Create a File Index.


    Generate Embeddings and Create a File Index.
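
The index generation step described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names (`load_papers`, `default_encoder`, `generate_index`), the JSON-lines corpus layout, and the `id`, `title`, `abstract` and `update_date` field names are assumptions. The encoder is injectable so the logic can be exercised without downloading a model.

```python
import json
import pickle
from typing import Callable, Iterable, List, Optional, Sequence


def load_papers(corpus_path: str, year_month: str) -> List[dict]:
    """Keep papers whose update date falls in `year_month` (YYYYMM)."""
    papers = []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:  # assumption: one JSON object per line
            if not line.strip():
                continue
            paper = json.loads(line)
            # Assumption: `update_date` looks like "2009-07-15".
            if paper.get("update_date", "").replace("-", "")[:6] == year_month:
                papers.append(paper)
    return papers


def default_encoder(model_name: str) -> Callable[[Sequence[str]], Iterable]:
    """Build an encoder backed by sentence-transformers (imported lazily)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda sentences: model.encode(list(sentences), show_progress_bar=True)


def generate_index(
    corpus_path: str,
    output_path: str,
    year_month: str,
    model_name: str = "sentence-transformers/all-MiniLM-L12-v2",
    encode: Optional[Callable[[Sequence[str]], Iterable]] = None,
) -> int:
    """Embed title + abstract for one month and pickle the result."""
    papers = load_papers(corpus_path, year_month)
    sentences = [f"{p['title']} {p['abstract']}" for p in papers]
    vectors = []
    if sentences:
        if encode is None:
            encode = default_encoder(model_name)
        vectors = encode(sentences)
    index = [
        {"id": p["id"], "title": p["title"], "vector": [float(x) for x in v]}
        for p, v in zip(papers, vectors)
    ]
    with open(output_path, "wb") as f:
        pickle.dump(index, f)
    return len(index)
```

Passing a custom `encode` callable makes it possible to swap the embedding model or to test the filtering and export logic in isolation.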


To cold start your database, you can also run the single-gpu-arxiv-embedding Jupyter Notebook on Saturn Cloud.

Using a T4-XLarge instance (4 cores) with the saturn-python-rapids image, the 2 million historical papers can be indexed in less than 5 minutes.

Loading the index to Redis

% ./ --help

NAME - Load the Embedding Index to Redis.

SYNOPSIS <flags>

    Load the Embedding Index to Redis.
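
As a rough sketch of what loading the pickled index into Redis could look like: each paper is written as a Redis hash whose `vector` field holds the embedding as raw float32 bytes. The function names, the `paper:` key prefix and the hash field names are assumptions made for illustration, not the actual script's schema. The client is passed in, so any object exposing redis-py's `hset(name, mapping=...)` works.

```python
import pickle
from array import array
from typing import Any, Sequence


def to_float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack a sequence of floats as raw float32 bytes."""
    return array("f", vector).tobytes()


def load_index_to_redis(index_path: str, client: Any, key_prefix: str = "paper:") -> int:
    """Write every entry of the pickled index to Redis as a hash."""
    with open(index_path, "rb") as f:
        index = pickle.load(f)
    for paper in index:
        client.hset(
            f"{key_prefix}{paper['id']}",  # hypothetical key layout
            mapping={
                "title": paper["title"],
                "vector": to_float32_bytes(paper["vector"]),
            },
        )
    return len(index)


# Usage (assumes a local Redis server and the redis-py package):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   load_index_to_redis("index.pkl", client)
```

For large indexes, batching the writes through a redis-py pipeline would likely reduce round trips; the loop above keeps one call per paper for clarity.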

Using a basic pipeline

The pipeline is a basic Bash script that takes a list of cutoffs and a model name for encoding the embeddings, and chains the two previous tasks.

Progress is tracked with tqdm/tqdm.

You might want to enhance that script depending on your workflow. Things you could do in this script: enrich data, perform checks, send email notifications...
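
The chaining logic, together with one of the suggested enhancements (a basic check on each cutoff), might look like the sketch below. It is written in Python rather than Bash for illustration only. `run_pipeline`, the `generate` and `load` callables, and the `index_<cutoff>.pkl` naming are hypothetical stand-ins for the two tasks above.

```python
import re
from typing import Callable, Iterable, List


def run_pipeline(
    cutoffs: Iterable[str],
    model_name: str,
    generate: Callable[[str, str, str], None],
    load: Callable[[str], None],
) -> List[str]:
    """Run index generation, then Redis loading, for each YYYYMM cutoff."""
    processed = []
    for cutoff in cutoffs:
        # Check: reject anything that is not a valid YYYYMM value.
        if not re.fullmatch(r"\d{4}(0[1-9]|1[0-2])", cutoff):
            raise ValueError(f"Invalid cutoff (expected YYYYMM): {cutoff!r}")
        index_path = f"index_{cutoff}.pkl"  # hypothetical file naming
        generate(cutoff, model_name, index_path)  # step 1: build the index
        load(index_path)  # step 2: push it to Redis
        processed.append(cutoff)
    return processed
```

Keeping the two steps as injected callables leaves room to add enrichment or notification hooks between them without changing the loop.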

% ./
2022-11-03 19:36:13.555 | INFO     | __main__:run:39 - Reading papers for 200907...
2022-11-03 19:36:30.045 | INFO     | __main__:run:45 - Creating embeddings from title and abstract...
2022-11-03 19:36:30.045 | INFO     | __main__:run:46 - sentence-transformers/all-MiniLM-L12-v2
100%|██████████████████████████████████████████████████████████████████████████████| 2306/2306 [01:14<00:00, 30.78it/s]
2022-11-03 19:37:44.977 | INFO     | __main__:run:55 - Exporting to pickle file...
2022-11-03 19:37:45.803 | INFO     | __main__:run:111 - TODO False
2022-11-03 19:37:45.804 | INFO     | __main__:load_all_data:64 - Loading papers...
2022-11-03 19:37:46.052 | INFO     | __main__:load_all_data:68 - Writing to Redis...
 87%|███████████████████████████████████████████████████████████████████▊          | 2003/2306 [01:28<00:13, 22.24it/s]
