Deep dive on FeaturesStorage

Here are detailed explanations of what's possible with FeaturesStorage and its use as features_by_ids in ChoiceDataset.

Summary

Different instantiations of FeaturesStorage
Different instantiations of OneHotStorage
Using FeaturesStorage or OneHotStorage in a ChoiceDataset
Example with the SwissMetro
Link to another example: Expedia Dataset

# Install necessary requirements

# If you run this notebook on Google Colab, or in standalone mode, you need to install the required packages.
# Uncomment the following lines:

# !pip install choice-learn

# If you run the notebook within the GitHub repository, you need to run the following lines, which can be skipped otherwise:
import os
import sys

sys.path.append("../../")

# Remove GPU use
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import numpy as np
import pandas as pd

from choice_learn.data.storage import FeaturesStorage
from choice_learn.data import ChoiceDataset

Different Instantiation Possibilities for Storage:

1 - from dict

features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}
# dict must be {id: features}
storage = FeaturesStorage(values=features,
                          values_names=["age", "income", "children_nb"],
                          name="customers_features")

DictStorage

# Subset in order to only keep some ids
sub_storage = storage[["customerA", "customerC"]]

DictStorage

# Batch to access the features values
storage.batch[["customerA", "customerC", "customerA", "customerC"]]

array([[1, 2, 3],
       [7, 8, 9],
       [1, 2, 3],
       [7, 8, 9]])
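
Conceptually, batching by ids amounts to a dictionary lookup followed by stacking the rows in the requested order. The following is a minimal sketch of that idea in plain Python. `batch_lookup` is a hypothetical helper written for illustration only; it is not how choice-learn implements `.batch`.

```python
# Toy re-implementation of id-based batching (illustration only,
# not the library's code).
features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}


def batch_lookup(storage, ids):
    """Return one feature row per requested id, in the requested order."""
    return [list(storage[id_]) for id_ in ids]


rows = batch_lookup(features, ["customerA", "customerC", "customerA", "customerC"])
print(rows)
```

Repeated ids yield repeated rows in the batch, while each feature vector is stored only once.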

2 - from list

features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ids = ["customerA", "customerB", "customerC"]

storage = FeaturesStorage(ids=ids,
                          values=features,
                          values_names=["age", "income", "children_nb"],
                          name="customers")
# We get the same result as before
storage.batch[["customerA", "customerC", "customerA", "customerC"]]

DictStorage

array([[1, 2, 3],
       [7, 8, 9],
       [1, 2, 3],
       [7, 8, 9]])

3 - from list, without ids

The ids are generated automatically as increasing integers:

features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

storage = FeaturesStorage(values=features, values_names=["age", "income", "children_nb"], name="customers")
storage.batch[[0, 2, 0, 2]]

array([[1, 2, 3],
       [7, 8, 9],
       [1, 2, 3],
       [7, 8, 9]])

4 - from pandas.DataFrame

# Here the DataFrame has a column "id" that identifies the keys from the features values
features = {"age": [1, 4, 7], "income": [2, 5, 8], "children_nb": [3, 6, 9], "id": ["customerA", "customerB", "customerC"]}
features = pd.DataFrame(features)
storage = FeaturesStorage(values=features, name="customers")
storage.batch[["customerA", "customerC", "customerA", "customerC"]]

DictStorage

array([[1, 2, 3],
       [7, 8, 9],
       [1, 2, 3],
       [7, 8, 9]])

# Here the DataFrame does not have a column "id" that identifies the keys from the features values
# We thus specify the 'index'
features = {"age": [1, 4, 7], "income": [2, 5, 8], "children_nb": [3, 6, 9]}
features = pd.DataFrame(features, index=["customerA", "customerB", "customerC"])
storage = FeaturesStorage(values=features, name="customers")
storage.batch[["customerA", "customerC", "customerA", "customerC"]]

DictStorage

array([[1, 2, 3],
       [7, 8, 9],
       [1, 2, 3],
       [7, 8, 9]])

Different instantiations of OneHotStorage

5 - OneHotStorage from lists

ids = [0, 1, 2, 3, 4]
values = [4, 3, 2, 1, 0]

# Here the Storage will map the ids to the values
# value = 4 means that the fifth value is a one, the rest are zeros
oh_storage = FeaturesStorage(ids=ids, values=values, as_one_hot=True, name="OneHotTest")

# Get OneHot vectors:
oh_storage.batch[[0, 2, 4]]

array([[0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0]], dtype=uint8)

# Get the Storage value
oh_storage.get_element_from_index(0), oh_storage.storage

(4, {0: 4, 1: 3, 2: 2, 3: 1, 4: 0})
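
The outputs above suggest that only one integer per id is kept, and that the one-hot vector is built when a batch is requested. A minimal sketch of that expansion, assuming the vector length equals the largest stored index plus one (`one_hot` is a hypothetical helper, not a choice-learn function):

```python
def one_hot(index, depth):
    """Return a list of `depth` zeros with a single one at position `index`."""
    vector = [0] * depth
    vector[index] = 1
    return vector


# id -> position of the one, as shown by oh_storage.storage above
stored = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
# Assumption for this sketch: depth is the largest stored index + 1
depth = max(stored.values()) + 1

vectors = [one_hot(stored[id_], depth) for id_ in [0, 2, 4]]
print(vectors)
```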

6 - OneHotStorage from single list

If only the values are given, the ids are created as increasing integers.

oh_storage = FeaturesStorage(values=values, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]

array([[0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0]], dtype=uint8)

If the values are not given, they are also created from the ids as increasing integers.

oh_storage = FeaturesStorage(ids=ids, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]
# Note that here it changes the order !

array([[1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]], dtype=uint8)
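
The change of order comes from the default values: when only ids are given, id 0 maps to index 0, whereas earlier id 0 was mapped to index 4. The sketch below mirrors the defaults described in this section. `build_mapping` is a hypothetical helper, not part of choice-learn.

```python
def build_mapping(ids=None, values=None):
    """Build an id -> one-hot index mapping, filling the missing side
    with increasing integers (the behaviour described in the text)."""
    if ids is None and values is None:
        raise ValueError("Provide at least ids or values.")
    if ids is None:
        ids = list(range(len(values)))
    if values is None:
        values = list(range(len(ids)))
    return dict(zip(ids, values))


ids = [0, 1, 2, 3, 4]
values = [4, 3, 2, 1, 0]

explicit = build_mapping(ids=ids, values=values)  # id 0 -> index 4
values_only = build_mapping(values=values)        # same mapping as explicit
ids_only = build_mapping(ids=ids)                 # id 0 -> index 0
print(explicit, values_only, ids_only)
```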

7 - OneHotStorage from dict

values_dict = {k:v for k, v in zip(ids, values)}
oh_storage = FeaturesStorage(values=values_dict, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]

array([[0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0]], dtype=uint8)

Use of FeaturesByID and Storage in the ChoiceDataset

Here is a small example of how a ChoiceDataset is instantiated with a FeaturesStorage. For it to fully work you need to:

- Give the different FeaturesStorage in a list in the features_by_ids argument
- Give each FeaturesStorage the same name as the column containing its ids in shared_features_by_choice or items_features_by_choice
- Make sure that all ids in shared_features_by_choice or items_features_by_choice have a corresponding id in the FeaturesStorage
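
The last requirement above can be checked before building the dataset. A minimal sketch in plain Python, where `missing_ids` is a hypothetical helper and the id lists are the ones used in the example that follows:

```python
def missing_ids(id_column, storage_ids):
    """Return, sorted, the ids used in the dataset that the storage does not know."""
    return sorted(set(id_column) - set(storage_ids))


storage_ids = ["customerA", "customerB", "customerC"]
id_column = ["customerA", "customerB", "customerA", "customerC"]

all_covered = missing_ids(id_column, storage_ids)
with_unknown = missing_ids(id_column + ["customerD"], storage_ids)
print(all_covered, with_unknown)
```

An empty list means every id in the column has a matching entry in the storage.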

features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}
customer_storage = FeaturesStorage(values=features,
                          values_names=["age", "income", "children_nb"],
                          name="customers_features")
shared_features_by_choice = pd.DataFrame({"is_weekend": [0, 1, 1, 0],
                                          # This column is the one matching with the FeaturesStorage customer_storage
                                          # It follows the conditions 2/ and 3/ about naming and ids
                                          "customers_features": ["customerA", "customerB", "customerA", "customerC"]})

DictStorage

features = {"item1": [1, 2, 3], "item2": [4, 5, 6], "item3": [7, 8, 9], "item4": [10, 11, 12]}
storage = FeaturesStorage(values=features, values_names=["f1", "f2", "f3"], name="items_features")

price_storage = {"price1": [1], "price2": [2], "price3": [3], "price4": [4]}
price_storage = FeaturesStorage(values=price_storage, values_names=["price"], name="items_prices")

prices = [[[4, 1], [4, 1], [5, 1]], [[5, 2], [4, 2], [6, 2]],
          [[6, 3], [7, 3], [8, 3]], [[4, 4], [5, 4], [4, 4]]]
items_features_by_choice = [[["item1", "price1"], ["item2", "price2"], ["item3", "price3"]],
                           [["item1", "price1"], ["item4", "price2"], ["item3", "price4"]],
                           [["item1", "price1"], ["item2", "price3"], ["item3", "price4"]],
                           [["item1", "price1"], ["item2", "price3"], ["item3", "price4"]]]
choices = [0, 1, 2, 2]

dataset = ChoiceDataset(
    choices=choices,
    shared_features_by_choice=shared_features_by_choice,
    items_features_by_choice=items_features_by_choice,
    features_by_ids=[storage, price_storage, customer_storage],
    items_features_by_choice_names=["items_features", "items_prices"],
    )

Now we can use the ChoiceDataset like any other to estimate a choice model. In particular, the .batch indexer reconstructs all the features:

batch = dataset.batch[[0, 2]]
print("Shared features by choice:", batch[0])
print("Items features by choice:", batch[1])
print("Available items by choice:", batch[2])
print("Choices:", batch[3])

Shared features by choice: [[0 1 2 3]
 [1 1 2 3]]
Items features by choice: [[[1 2 3 1]
  [4 5 6 2]
  [7 8 9 3]]

 [[1 2 3 1]
  [4 5 6 3]
  [7 8 9 4]]]
Available items by choice: [[1. 1. 1.]
 [1. 1. 1.]]
Choices: [0 2]
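
To see where these numbers come from, the reconstruction can be reproduced by hand: each id is replaced by its stored features, which are then concatenated with the remaining columns. This is a sketch of the idea under that assumption, not the library's implementation. `reconstruct_shared` and `reconstruct_items` are hypothetical helpers.

```python
customer_features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}
item_features = {"item1": [1, 2, 3], "item2": [4, 5, 6], "item3": [7, 8, 9], "item4": [10, 11, 12]}
item_prices = {"price1": [1], "price2": [2], "price3": [3], "price4": [4]}

is_weekend = [0, 1, 1, 0]
customer_ids = ["customerA", "customerB", "customerA", "customerC"]
items_ids = [[["item1", "price1"], ["item2", "price2"], ["item3", "price3"]],
             [["item1", "price1"], ["item4", "price2"], ["item3", "price4"]],
             [["item1", "price1"], ["item2", "price3"], ["item3", "price4"]],
             [["item1", "price1"], ["item2", "price3"], ["item3", "price4"]]]


def reconstruct_shared(indexes):
    """Concatenate is_weekend with the looked-up customer features."""
    return [[is_weekend[i]] + customer_features[customer_ids[i]] for i in indexes]


def reconstruct_items(indexes):
    """For each item of each choice, concatenate item features and price."""
    return [[item_features[item_id] + item_prices[price_id]
             for item_id, price_id in items_ids[i]]
            for i in indexes]


shared = reconstruct_shared([0, 2])
items = reconstruct_items([0, 2])
print(shared)
print(items)
```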

Example with the SwissMetro dataset

from choice_learn.datasets import load_swissmetro

swiss_df = load_swissmetro(as_frame=True)
swiss_df.head()

	GROUP	SP	ID	PURPOSE	TICKET	WHO	AGE	...	TRAIN_CO	TRAIN_HE	SM_TT	SM_CO	SM_HE	CAR_TT	CAR_CO	CHOICE
0	2	1	1	1	1	1	3	...	48	120	63	52	20	117	65	1
1	2	1	1	1	1	1	3	...	48	30	60	49	10	117	84	1
2	2	1	1	1	1	1	3	...	48	60	67	58	30	117	52	1
3	2	1	1	1	1	1	3	...	40	30	63	52	20	72	52	1
4	2	1	1	1	1	1	3	...	36	60	63	42	20	90	84	1

5 rows × 29 columns

The ID column identifies a unique survey participant. Each participant answered several cases, so the features describing a participant are repeated across rows. This makes the dataset a perfect example for FeaturesStorage.

customer_columns = ['ID', 'GROUP', 'SURVEY', 'SP', 'PURPOSE', 'FIRST', 'TICKET', 'WHO',
                    'LUGGAGE', 'AGE', 'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST']
customer_features = swiss_df[customer_columns].drop_duplicates()
customer_features = customer_features.rename(columns={"ID": "id"})
customer_storage = FeaturesStorage(values=customer_features, name="customer_features")

shared_features_by_choice = swiss_df[["ID"]]
shared_features_by_choice = shared_features_by_choice.rename(columns={"ID": "customer_features"})

available_items_by_choice = swiss_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()
items_features_by_choice = np.stack([swiss_df[["TRAIN_TT", "TRAIN_CO", "TRAIN_HE"]].to_numpy(),
                                    swiss_df[["SM_TT", "SM_CO", "SM_HE"]].to_numpy(),
                                    swiss_df[["CAR_TT", "CAR_CO", "CAR_HE"]].to_numpy()], axis=1)
choices = swiss_df.CHOICE.to_numpy()

choice_dataset = ChoiceDataset(shared_features_by_choice=shared_features_by_choice,
                               items_features_by_choice=items_features_by_choice,
                               available_items_by_choice=available_items_by_choice,
                               choices=choices,
                               features_by_ids=[customer_storage],)

Et voilà !

batch = choice_dataset.batch[[0, 10, 200]]
print("Shared features by choice:", batch[0])
print("Items features by choice:", batch[1])
print("Available items by choice:", batch[2])
print("Choices:", batch[3])

Shared features by choice: [[ 2  0  1  1  0  1  1  0  3  0  2  0  2  1]
 [ 2  0  1  1  0  1  1  1  2  0  1  0 22  1]
 [ 2  0  1  1  0  3  2  1  2  1  2  0 15  1]]
Items features by choice: [[[112.  48. 120.]
  [ 63.  52.  20.]
  [117.  65.   0.]]

 [[170.  62.  30.]
  [ 70.  66.  10.]
  [  0.   0.   0.]]

 [[116.  54.  60.]
  [ 53.  83.  30.]
  [ 78.  40.   0.]]]
Available items by choice: [[1. 1. 1.]
 [1. 1. 0.]
 [1. 1. 1.]]
Choices: [1 1 0]

Link to another example

Finally, the Expedia dataset example shows how memory-efficient FeaturesStorage can be. The Expedia dataset includes several one-hot features that are encoded as OneHotStorage, which saves a lot of memory.
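
A rough back-of-the-envelope count illustrates the saving. The sizes below are arbitrary numbers chosen for the illustration, not figures from the Expedia dataset, and `stored_entries` is a hypothetical helper.

```python
def stored_entries(n_ids, depth):
    """Compare the number of stored values for dense one-hot vectors
    versus one integer index per id."""
    dense = n_ids * depth  # one full vector of length `depth` per id
    compact = n_ids        # a single integer per id
    return dense, compact


# Arbitrary sizes, for illustration only
dense, compact = stored_entries(n_ids=1000, depth=50)
print(dense, compact, dense // compact)
```

With these assumed sizes, the dense encoding holds 50 times more values than the integer encoding, and the ratio grows with the number of categories.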
