RUMnet Model Usage

Introduction to modelling with RUMnet

In this notebook, we reproduce the results of the paper "Representing Random Utility Choice Models with Neural Networks" on the SwissMetro dataset.

# Install necessary requirements

# If you run this notebook on Google Colab, or in standalone mode, you need to install the required packages.
# Uncomment the following lines:

# !pip install choice-learn

# If you run the notebook from within the GitHub repository, run the following lines; they can be skipped otherwise:
import os
import sys

sys.path.append("../../")

# Select which GPU TensorFlow can see: "1" exposes only the second GPU, "" hides all GPUs (CPU only).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

from choice_learn.data import ChoiceDataset
from choice_learn.models import RUMnet
from choice_learn.datasets import load_swissmetro

Note that there are two implementations of RUMnet: one geared towards CPU and one geared towards GPU. Importing RUMnet automatically selects the appropriate one. You can also import either implementation directly with:

from choice_learn.models import CPURUMnet, GPURUMnet
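
How the automatic selection is made is not shown in this notebook. As a rough sketch, assuming the choice depends only on whether TensorFlow can see a GPU, the logic could look like the following. The helper names pick_rumnet_variant and count_visible_gpus are hypothetical and are not part of choice-learn.

```python
def pick_rumnet_variant(num_gpus: int) -> str:
    """Return the name of the implementation to use (assumed rule: any GPU -> GPU variant)."""
    return "GPURUMnet" if num_gpus > 0 else "CPURUMnet"


def count_visible_gpus() -> int:
    """Count the GPUs TensorFlow can see; returns 0 if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0
    return len(tf.config.list_physical_devices("GPU"))
```

Calling pick_rumnet_variant(count_visible_gpus()) then gives the name of the class to import. Since CUDA_VISIBLE_DEVICES affects what TensorFlow can see, it should be set before TensorFlow is first imported.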

First, we load the SwissMetro dataset. With preprocessing="rumnet", the loader follows the same data preparation as the original paper so that we can get the same results, and with as_frame=False it returns a ChoiceDataset rather than a dataframe.

dataset = load_swissmetro(as_frame=False, preprocessing="rumnet")

Let's cross-validate! We keep a scikit-learn-like structure. To avoid adding a dependency, we use our own train/test split code, but the following would work just as well:

from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=5, test_size=.2, random_state=0)

for i, (train_index, test_index) in enumerate(rs.split(dataset.choices)):
    train_dataset = dataset[train_index]
    test_dataset = dataset[test_index]

    model = RUMnet(**model_args)
    model.instantiate()
    model.fit(train_dataset)
    model.evaluate(test_dataset)

We use a NumPy-based split instead, but the core code is the same!
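
The idea behind the split used below can be sketched with the standard library alone: shuffle the indexes once, then take consecutive slices as test folds. Unlike ShuffleSplit, which draws each test set independently, this gives test folds that do not overlap. The function name kfold_indices and the toy sizes are assumptions made for illustration.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists from consecutive slices of one shuffled permutation."""
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    for i in range(n_folds):
        start = n_samples * i // n_folds
        stop = n_samples * (i + 1) // n_folds
        # Everything outside [start, stop) is used for training.
        yield indexes[:start] + indexes[stop:], indexes[start:stop]


folds = list(kfold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    print("test:", sorted(test), "train size:", len(train))
```

Each sample appears in exactly one test fold, so every observation is used once for evaluation.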

model_args = {
    "num_products_features": 6,
    "num_customer_features": 83,
    "width_eps_x": 20,
    "depth_eps_x": 5,
    "heterogeneity_x": 10,
    "width_eps_z": 20,
    "depth_eps_z": 5,
    "heterogeneity_z": 10,
    "width_u": 20,
    "depth_u": 5,
    "optimizer": "Adam",
    "lr": 0.0002,
    "logmin": 1e-10,
    "label_smoothing": 0.02,
    "callbacks": [],
    "epochs": 140,
    "batch_size": 32,
    "tol": 0,
}

indexes = np.random.permutation(len(dataset))

fit_losses = []
test_eval = []
for i in range(5):
    test_indexes = indexes[int(len(indexes) * 0.2 * i):int(len(indexes) * 0.2 * (i + 1))]
    train_indexes = np.concatenate([indexes[:int(len(indexes) * 0.2 * i)],
                                    indexes[int(len(indexes) * 0.2 * (i + 1)):]],
                                   axis=0)

    train_dataset = dataset[train_indexes]
    test_dataset = dataset[test_indexes]

    model = RUMnet(**model_args)
    model.instantiate()

    losses = model.fit(train_dataset, val_dataset=test_dataset)
    probas = model.predict_probas(test_dataset)
    # Cross-entropy between predicted probabilities and one-hot choices (3 alternatives).
    test_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)(
        y_true=tf.one_hot(test_dataset.choices, 3), y_pred=probas
    )
    test_eval.append(test_loss)
    print(test_eval)

    fit_losses.append(losses)

cmap = plt.cm.coolwarm
colors = [cmap(j / 4) for j in range(5)]
for i in range(len(fit_losses)):
    plt.plot(fit_losses[i]["train_loss"], c=colors[i], linestyle="--")
    plt.plot(fit_losses[i]["test_loss"], label=f"fold {i}", c=colors[i])
plt.legend()

model.evaluate(test_dataset)

print("Average negative log-likelihood on test:", np.mean(test_eval))
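
To make the reported number concrete: with one-hot targets, the categorical cross-entropy is the average of minus the log-probability assigned to the chosen alternative (up to Keras's numerical clipping). The sketch below computes it by hand on made-up probabilities; the values are illustrative only and are not results from the model.

```python
import math


def mean_negative_log_likelihood(probas, choices):
    """Average of -log(probability assigned to the chosen alternative)."""
    return -sum(math.log(p[c]) for p, c in zip(probas, choices)) / len(choices)


# Made-up predicted probabilities over 3 alternatives and the observed choices.
probas = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
choices = [0, 1, 2]

nll = mean_negative_log_likelihood(probas, choices)
uniform_baseline = math.log(3)  # loss of a model predicting 1/3 for every alternative
print(f"NLL: {nll:.4f}, uniform baseline: {uniform_baseline:.4f}")
```

Lower is better, and a value below log(3) means the predictions beat a uniform guess over three alternatives.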

A larger and more complex dataset: Expedia ICDM 2013

The RUMnet paper benchmarks the model on a second dataset. To use it, download the data from Kaggle and place the train.csv file in the folder choice_learn/datasets/data under the name expedia.csv.
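
The copy-and-rename step can be scripted. This is a minimal sketch: install_expedia_csv is a hypothetical helper, not a choice-learn function, and the demonstration runs in a temporary directory with placeholder content so that no real file is touched.

```python
import shutil
import tempfile
from pathlib import Path


def install_expedia_csv(downloaded_csv, data_dir):
    """Copy the downloaded train.csv into data_dir as expedia.csv and return the new path."""
    downloaded_csv = Path(downloaded_csv)
    if not downloaded_csv.is_file():
        raise FileNotFoundError(f"No file at {downloaded_csv}")
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "expedia.csv"
    shutil.copyfile(downloaded_csv, target)
    return target


# Demonstration in a throwaway directory (placeholder content, not real Expedia data).
with tempfile.TemporaryDirectory() as tmp:
    fake_download = Path(tmp) / "train.csv"
    fake_download.write_text("col_a,col_b\n1,2\n")
    target = install_expedia_csv(fake_download, Path(tmp) / "choice_learn" / "datasets" / "data")
    copied_ok = target.read_text() == fake_download.read_text()
    print(target.name, copied_ok)
```

In practice, pass the path of the train.csv you downloaded and the choice_learn/datasets/data folder of your installation.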

from choice_learn.datasets import load_expedia

# It takes some time...
expedia_dataset = load_expedia(preprocessing="rumnet")

test_dataset = expedia_dataset[int(len(expedia_dataset)*0.8):]
train_dataset = expedia_dataset[:int(len(expedia_dataset)*0.8)]

model_args = {
    "num_products_features": 46,
    "num_customer_features": 84,
    "width_eps_x": 10,
    "depth_eps_x": 3,
    "heterogeneity_x": 5,
    "width_eps_z": 10,
    "depth_eps_z": 3,
    "heterogeneity_z": 5,
    "width_u": 10,
    "depth_u": 3,
    "optimizer": "Adam",
    "lr": 0.001,
    "logmin": 1e-10,
    "label_smoothing": 0.02,
    "callbacks": [],
    "epochs": 15,
    "batch_size": 128,
    "tol": 1e-5,
}
model = RUMnet(**model_args)
model.instantiate()

losses = model.fit(train_dataset, val_dataset=test_dataset)
probas = model.predict_probas(test_dataset)
# Cross-entropy between predicted probabilities and one-hot choices (39 alternatives).
test_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)(
    y_true=tf.one_hot(test_dataset.choices, 39), y_pred=probas
)

print(test_loss)