Indexing Contents to Redis

Reindexing contents

To index your own private collection of documents, you can reuse the scripts in scripts/.

Generating the index

To generate the index for papers updated in a specific month (YYYYMM), run the script below. Pass the month as a positional argument, and use the flags to set the JSON file containing your corpus, the output path where the index file is stored (in Pickle format), and the name of the model used to encode your sentences.

This task can be run each month on the latest arXiv data update using a regular desktop computer. You might want to schedule it with a tool like Airflow to automate it.

% ./generate_index.py --help


NAME
    generate_index.py - Generate Embeddings and Create a File Index.

SYNOPSIS
    generate_index.py YEAR_MONTH <flags>

DESCRIPTION
    Generate Embeddings and Create a File Index.

POSITIONAL ARGUMENTS
    YEAR_MONTH

FLAGS
    --input_path=INPUT_PATH
        Default: 'arxiv-metadata-o...
    --output_path=OUTPUT_PATH
        Default: 'arxiv_embeddings_10000...
    --model_name=MODEL_NAME
        Default: 'sentence-...

To cold start your database, you can also run the single-gpu-arxiv-embedding Jupyter Notebook on Saturn Cloud.

Using a T4-XLarge instance (4 cores) with the saturn-python-rapids image, the 2 million historical papers can be indexed in less than 5 minutes.
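The indexing step described in this section can be sketched roughly as follows. This is a minimal illustration, not the actual contents of generate_index.py: the function names are hypothetical, and it assumes the corpus is a JSON Lines file whose records carry `id`, `title`, `abstract`, and an `update_date` field formatted as YYYY-MM-DD. The encoder is passed in as a callable so the sketch does not depend on a specific model library.

```python
import json
import pickle
from typing import Callable, Iterable, List, Sequence


def select_papers(lines: Iterable[str], year_month: str) -> List[dict]:
    """Keep the papers whose update date falls in the given YYYYMM month."""
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        paper = json.loads(line)
        # Assumption: "update_date" looks like "2009-07-15".
        if paper.get("update_date", "").replace("-", "")[:6] == year_month:
            selected.append(paper)
    return selected


def build_index(
    papers: Sequence[dict],
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[dict]:
    """Embed title + abstract for each paper and attach the vector."""
    texts = [f"{p['title']} {p['abstract']}" for p in papers]
    vectors = encode(texts)
    return [
        {"id": p["id"], "title": p["title"], "vector": [float(x) for x in v]}
        for p, v in zip(papers, vectors)
    ]


def write_index(index: List[dict], output_path: str) -> None:
    """Store the index as a Pickle file."""
    with open(output_path, "wb") as fh:
        pickle.dump(index, fh)


# With sentence-transformers installed, `encode` could be, for example:
#   from sentence_transformers import SentenceTransformer
#   encode = SentenceTransformer(model_name).encode
```

Keeping the encoder as a parameter makes it easy to swap models or to try the flow on a few records before running it on a full month of data.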

Loading the index to Redis

% ./load_data.py --help


NAME
    load_data.py - Load the Embedding Index to Redis.

SYNOPSIS
    load_data.py <flags>

DESCRIPTION
    Load the Embedding Index to Redis.

FLAGS
    --concurrency_level=CONCURRENCY_LEVEL
        Type: int
        Default: 2
    --separator=SEPARATOR
        Type: str
        Default: '|'
    --reset_db=RESET_DB
        Type: bool
        Default: False
    --embeddings_path=EMBEDDINGS_PATH
        Type: str
        Default: ''
    --vector_size=VECTOR_SIZE
        Type: int
        Default: 768
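To illustrate how flags such as concurrency_level, separator, and vector_size might interact, here is a small sketch of a bounded-concurrency loader. It is not the code of load_data.py: the function names, the `paper:` key prefix, the field layout, and the use of the separator to join multi-valued fields such as categories are all assumptions made for illustration. The write operation is injected as an async callable so the sketch runs without a Redis server.

```python
import asyncio
import struct
from typing import Awaitable, Callable, Dict, Sequence


def to_hash_fields(
    paper: dict, separator: str = "|", vector_size: int = 768
) -> Dict[str, object]:
    """Turn one indexed paper into a flat field mapping (hypothetical layout)."""
    vector = paper["vector"]
    if len(vector) != vector_size:
        raise ValueError(
            f"expected {vector_size} dimensions, got {len(vector)}"
        )
    return {
        "paper_id": paper["id"],
        "title": paper["title"],
        # Assumption: multi-valued fields are joined with the separator.
        "categories": separator.join(paper.get("categories", [])),
        # Vector stored as little-endian float32 bytes.
        "vector": struct.pack(f"<{vector_size}f", *vector),
    }


async def load_all(
    papers: Sequence[dict],
    write: Callable[[str, Dict[str, object]], Awaitable[None]],
    concurrency_level: int = 2,
    separator: str = "|",
    vector_size: int = 768,
) -> int:
    """Write every paper, with at most `concurrency_level` writes in flight."""
    semaphore = asyncio.Semaphore(concurrency_level)

    async def load_one(paper: dict) -> None:
        async with semaphore:
            key = f"paper:{paper['id']}"  # hypothetical key scheme
            await write(key, to_hash_fields(paper, separator, vector_size))

    await asyncio.gather(*(load_one(p) for p in papers))
    return len(papers)


# With redis-py's asyncio client, `write` could be, for example:
#   async def write(key, fields):
#       await client.hset(key, mapping=fields)
```

Checking the vector length before writing catches a mismatch between the embedding model and the configured vector size early, rather than at query time.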

Using a basic pipeline

pipeline.sh is a basic Bash script that takes a list of cutoffs and a model name for encoding the embeddings, and chains the two previous tasks.

Completion progress is tracked with tqdm/tqdm.

You might want to enhance that script depending on your workflow, for example by enriching data, performing checks, or sending email notifications.

% ./pipeline.sh
2022-11-03 19:36:13.555 | INFO     | __main__:run:39 - Reading papers for 200907...
2022-11-03 19:36:30.045 | INFO     | __main__:run:45 - Creating embeddings from title and abstract...
2022-11-03 19:36:30.045 | INFO     | __main__:run:46 - sentence-transformers/all-MiniLM-L12-v2
100%|██████████████████████████████████████████████████████████████████████████████| 2306/2306 [01:14<00:00, 30.78it/s]
2022-11-03 19:37:44.977 | INFO     | __main__:run:55 - Exporting to pickle file...
2022-11-03 19:37:45.803 | INFO     | __main__:run:111 - TODO False
2022-11-03 19:37:45.804 | INFO     | __main__:load_all_data:64 - Loading papers...
2022-11-03 19:37:46.052 | INFO     | __main__:load_all_data:68 - Writing to Redis...
 87%|███████████████████████████████████████████████████████████████████▊          | 2003/2306 [01:28<00:13, 22.24it/s]
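For readers who prefer to drive the two tasks from Python rather than Bash, the chaining could be sketched like this. It is only an illustration of the idea, not a port of pipeline.sh: the function names and the per-cutoff index file name are hypothetical, and the script names and flags are taken from the help output shown earlier.

```python
import subprocess
from typing import List, Sequence


def build_commands(
    cutoffs: Sequence[str], model_name: str, index_dir: str = "."
) -> List[List[str]]:
    """For each cutoff, generate the index and then load it into Redis."""
    commands = []
    for cutoff in cutoffs:
        # Hypothetical naming scheme for the intermediate Pickle file.
        index_path = f"{index_dir}/arxiv_embeddings_{cutoff}.pkl"
        commands.append([
            "./generate_index.py",
            cutoff,
            f"--output_path={index_path}",
            f"--model_name={model_name}",
        ])
        commands.append([
            "./load_data.py",
            f"--embeddings_path={index_path}",
        ])
    return commands


def run_pipeline(commands: Sequence[Sequence[str]]) -> None:
    """Run the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Separating command construction from execution leaves a natural place to add the extra steps mentioned above, such as checks between the indexing and loading stages.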
