Deep dive on FeaturesStorage
Here are detailed explanations of what is possible with FeaturesStorage and of its use as features_by_ids in ChoiceDataset.
# Install necessary requirements
# If you run this notebook on Google Colab, or in standalone mode, you need to install the required packages.
# Uncomment the following lines:
# !pip install choice-learn
# If you run the notebook within the GitHub repository, you need to run the following lines, which can be skipped otherwise:
import os
import sys
sys.path.append("../../")
# Remove GPU use
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import numpy as np
import pandas as pd
from choice_learn.data import ChoiceDataset
from choice_learn.data.storage import FeaturesStorage
Different Instantiation Possibilities for Storage:
1 - from dict
features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}
# dict must be {id: features}
storage = FeaturesStorage(values=features,
values_names=["age", "income", "children_nb"],
name="customers_features")
DictStorage
# Batch to access the features values
storage.batch[["customerA", "customerC", "customerA", "customerC"]]
array([[1, 2, 3],
[7, 8, 9],
[1, 2, 3],
[7, 8, 9]])
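Conceptually, a batch lookup gathers one row of features per requested id, so a repeated id simply returns the same row again. The sketch below mimics this behaviour with a plain dictionary. `gather` is a hypothetical helper written for illustration, not part of choice-learn, and it returns Python lists rather than a NumPy array.

```python
# Hypothetical stand-in for the id -> features lookup; not the choice-learn implementation.
features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}


def gather(storage, ids):
    """Return one row of features per requested id, repeating rows when ids repeat."""
    return [list(storage[i]) for i in ids]


rows = gather(features, ["customerA", "customerC", "customerA", "customerC"])
print(rows)  # [[1, 2, 3], [7, 8, 9], [1, 2, 3], [7, 8, 9]]
```

Each feature vector is stored once, however many times its id is requested.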
2 - from list
features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ids = ["customerA", "customerB", "customerC"]
storage = FeaturesStorage(ids=ids,
values=features,
values_names=["age", "income", "children_nb"],
name="customers")
# We get the same result as before
storage.batch[["customerA", "customerC", "customerA", "customerC"]]
DictStorage
array([[1, 2, 3],
[7, 8, 9],
[1, 2, 3],
[7, 8, 9]])
3 - from list, without ids
The ids are generated automatically as increasing integers:
features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
storage = FeaturesStorage(values=features, values_names=["age", "income", "children_nb"], name="customers")
storage.batch[[0, 2, 0, 2]]
array([[1, 2, 3],
[7, 8, 9],
[1, 2, 3],
[7, 8, 9]])
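The two list-based instantiations can be read as building the same id-to-features mapping as the dict-based one. Here is a minimal sketch of that normalisation, assuming default ids are the positions 0, 1, 2, ... as described above. `to_id_mapping` is a hypothetical name and the real class may proceed differently.

```python
def to_id_mapping(values, ids=None):
    """Pair each row of values with an id; default ids are 0, 1, 2, ..."""
    if ids is None:
        ids = list(range(len(values)))
    if len(ids) != len(values):
        raise ValueError("ids and values must have the same length")
    return dict(zip(ids, values))


features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
with_ids = to_id_mapping(features, ids=["customerA", "customerB", "customerC"])
without_ids = to_id_mapping(features)

print(with_ids["customerC"])  # [7, 8, 9]
print(without_ids[2])         # [7, 8, 9]
```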
4 - from pandas.DataFrame
# Here the DataFrame has an "id" column that holds the key of each row of features values
features = {"age": [1, 4, 7], "income": [2, 5, 8], "children_nb": [3, 6, 9], "id": ["customerA", "customerB", "customerC"]}
features = pd.DataFrame(features)
storage = FeaturesStorage(values=features, name="customers")
storage.batch[["customerA", "customerC", "customerA", "customerC"]]
DictStorage
array([[1, 2, 3],
[7, 8, 9],
[1, 2, 3],
[7, 8, 9]])
# Here the DataFrame has no "id" column to identify the features values
# We thus specify the ids as the DataFrame index
features = {"age": [1, 4, 7], "income": [2, 5, 8], "children_nb": [3, 6, 9]}
features = pd.DataFrame(features, index=["customerA", "customerB", "customerC"])
storage = FeaturesStorage(values=features, name="customers")
storage.batch[["customerA", "customerC", "customerA", "customerC"]]
DictStorage
array([[1, 2, 3],
[7, 8, 9],
[1, 2, 3],
[7, 8, 9]])
Different instantiations of OneHotStorage
5 - OneHotStorage from lists
ids = [0, 1, 2, 3, 4]
values = [4, 3, 2, 1, 0]
# Here the Storage will map the ids to the values
# value = 4 means that the fifth value is a one, the rest are zeros
oh_storage = FeaturesStorage(ids=ids, values=values, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]
array([[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0]], dtype=uint8)
(4, {0: 4, 1: 3, 2: 2, 3: 1, 4: 0})
6 - OneHotStorage from single list
If only the values are given, the ids are created as increasing integers.
oh_storage = FeaturesStorage(values=values, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]
array([[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0]], dtype=uint8)
If the values are not given, they are also created from the ids as increasing integers.
oh_storage = FeaturesStorage(ids=ids, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]
# Note that the order changes here: id 0 now maps to value 0 instead of 4
array([[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 1]], dtype=uint8)
7 - OneHotStorage from dict
values_dict = {k:v for k, v in zip(ids, values)}
oh_storage = FeaturesStorage(values=values_dict, as_one_hot=True, name="OneHotTest")
oh_storage.batch[[0, 2, 4]]
array([[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0]], dtype=uint8)
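The one-hot examples above all store a single integer per id and expand it into a vector only when a batch is requested. The sketch below reproduces that idea in plain Python. `one_hot_batch` is a hypothetical helper, and taking the depth as the largest value plus one is an assumption made for this illustration.

```python
def one_hot_batch(mapping, ids, depth=None):
    """Expand the integer stored for each id into a one-hot row."""
    if depth is None:
        depth = max(mapping.values()) + 1  # assumption: depth = largest value + 1
    rows = []
    for i in ids:
        row = [0] * depth
        row[mapping[i]] = 1  # the stored value gives the position of the one
        rows.append(row)
    return rows


ids = [0, 1, 2, 3, 4]
values = [4, 3, 2, 1, 0]
mapping = dict(zip(ids, values))

batch = one_hot_batch(mapping, [0, 2, 4])
print(batch)  # [[0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
```

Only the five integers of `mapping` are kept in memory; the 5 x 5 one-hot matrix is never stored.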
Use of FeaturesByID and Storage in the ChoiceDataset
Here is a small example of how a ChoiceDataset is instantiated with a FeaturesStorage. For it to fully work you need to:
- Give the different FeaturesStorage objects as a list in the features_by_ids argument
- Give each FeaturesStorage the same name as the column containing its ids in shared_features_by_choice or items_features_by_choice
- Make sure that all ids in shared_features_by_choice or items_features_by_choice have a corresponding id in the FeaturesStorage
features = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}
customer_storage = FeaturesStorage(values=features,
values_names=["age", "income", "children_nb"],
name="customers_features")
shared_features_by_choice = pd.DataFrame({"is_weekend": [0, 1, 1, 0],
# This column is the one matching with the FeaturesStorage customer_storage
# It follows the conditions 2/ and 3/ about naming and ids
"customers_features": ["customerA", "customerB", "customerA", "customerC"]})
DictStorage
features = {"item1": [1, 2, 3], "item2": [4, 5, 6], "item3": [7, 8, 9], "item4": [10, 11, 12]}
storage = FeaturesStorage(values=features, values_names=["f1", "f2", "f3"], name="items_features")
price_storage = {"price1": [1], "price2": [2], "price3": [3], "price4": [4]}
price_storage = FeaturesStorage(values=price_storage, values_names=["price"], name="items_prices")
prices = [[[4, 1], [4, 1], [5, 1]], [[5, 2], [4, 2], [6, 2]],
[[6, 3], [7, 3], [8, 3]], [[4, 4], [5, 4], [4, 4]]]
items_features_by_choice = [[["item1", "price1"], ["item2", "price2"], ["item3", "price3"]],
[["item1", "price1"], ["item4", "price2"], ["item3", "price4"]],
[["item1", "price1"], ["item2", "price3"], ["item3", "price4"]],
[["item1", "price1"], ["item2", "price3"], ["item3", "price4"]]]
choices = [0, 1, 2, 2]
dataset = ChoiceDataset(
choices=choices,
shared_features_by_choice=shared_features_by_choice,
items_features_by_choice=items_features_by_choice,
features_by_ids=[storage, price_storage, customer_storage],
items_features_by_choice_names=["items_features", "items_prices"],
)
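The requirement that every id used in a features column has a match in the corresponding FeaturesStorage can be checked with a few lines of plain Python before building the dataset. `missing_ids` is a hypothetical helper introduced here for illustration, not a choice-learn function.

```python
def missing_ids(column_ids, storage_ids):
    """Return the ids used in a features column that the storage does not know."""
    return sorted(set(column_ids) - set(storage_ids))


storage_ids = ["customerA", "customerB", "customerC"]
column_ids = ["customerA", "customerB", "customerA", "customerC"]

print(missing_ids(column_ids, storage_ids))                  # []
print(missing_ids(column_ids + ["customerD"], storage_ids))  # ['customerD']
```

An empty list means the column and the storage are consistent.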
Now we can use the ChoiceDataset like any other to estimate a choice model. In particular, indexing .batch reconstructs all the features:
batch = dataset.batch[[0, 2]]
print("Shared features by choice:", batch[0])
print("Items features by choice:", batch[1])
print("Available items by choice:", batch[2])
print("Choices:", batch[3])
Shared features by choice: [[0 1 2 3]
[1 1 2 3]]
Items features by choice: [[[1 2 3 1]
[4 5 6 2]
[7 8 9 3]]
[[1 2 3 1]
[4 5 6 3]
[7 8 9 4]]]
Available items by choice: [[1. 1. 1.]
[1. 1. 1.]]
Choices: [0 2]
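To see where the shared features in this output come from, the reconstruction can be mimicked by hand: the id in the customers_features column is replaced by the three values stored for that customer. This is a simplified sketch with a hypothetical `expand_row` helper; the actual ChoiceDataset logic is more general.

```python
customers = {"customerA": [1, 2, 3], "customerB": [4, 5, 6], "customerC": [7, 8, 9]}

# Rows of shared_features_by_choice: [is_weekend, customers_features id]
shared = [[0, "customerA"], [1, "customerB"], [1, "customerA"], [0, "customerC"]]


def expand_row(row, storage, id_position):
    """Replace the id found at id_position by the features stored for it."""
    out = []
    for position, value in enumerate(row):
        if position == id_position:
            out.extend(storage[value])
        else:
            out.append(value)
    return out


batch = [expand_row(shared[i], customers, id_position=1) for i in [0, 2]]
print(batch)  # [[0, 1, 2, 3], [1, 1, 2, 3]]
```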
Example with the SwissMetro dataset
from choice_learn.datasets import load_swissmetro
swiss_df = load_swissmetro(as_frame=True)
swiss_df.head()
|  | GROUP | SURVEY | SP | ID | PURPOSE | FIRST | TICKET | WHO | LUGGAGE | AGE | ... | TRAIN_CO | TRAIN_HE | SM_TT | SM_CO | SM_HE | SM_SEATS | CAR_TT | CAR_CO | CHOICE | CAR_HE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 48 | 120 | 63 | 52 | 20 | 0 | 117 | 65 | 1 | 0.0 |
| 1 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 48 | 30 | 60 | 49 | 10 | 0 | 117 | 84 | 1 | 0.0 |
| 2 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 48 | 60 | 67 | 58 | 30 | 0 | 117 | 52 | 1 | 0.0 |
| 3 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 40 | 30 | 63 | 52 | 20 | 0 | 72 | 52 | 1 | 0.0 |
| 4 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 36 | 60 | 63 | 42 | 20 | 0 | 90 | 84 | 1 | 0.0 |
5 rows × 29 columns
The ID column identifies a unique survey participant. Each participant answered several cases, so the features describing a participant are repeated on several rows. This is a perfect use case for FeaturesStorage.
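Storing participant features once per ID only works if a participant has the same features on every row. The sketch below shows that check on toy records; the values are hypothetical and not taken from SwissMetro, and plain tuples stand in for DataFrame rows.

```python
# Toy survey rows (hypothetical values): (participant id, age, income)
rows = [(1, 3, 2), (1, 3, 2), (2, 2, 1), (2, 2, 1), (2, 2, 1)]

# Keep each distinct row once, as dropping duplicate rows would do
unique_rows = sorted(set(rows))
ids = [row[0] for row in unique_rows]

# If a participant had different features on different rows, its id would appear twice here
ids_are_unique = len(ids) == len(set(ids))

print(unique_rows)     # [(1, 3, 2), (2, 2, 1)]
print(ids_are_unique)  # True
```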
customer_columns = ['ID', 'GROUP', 'SURVEY', 'SP', 'PURPOSE', 'FIRST', 'TICKET', 'WHO',
'LUGGAGE', 'AGE', 'MALE', 'INCOME', 'GA', 'ORIGIN', 'DEST']
customer_features = swiss_df[customer_columns].drop_duplicates()
customer_features = customer_features.rename(columns={"ID": "id"})
customer_storage = FeaturesStorage(values=customer_features, name="customer_features")
shared_features_by_choice = swiss_df[["ID"]]
shared_features_by_choice = shared_features_by_choice.rename(columns={"ID": "customer_features"})
available_items_by_choice = swiss_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()
items_features_by_choice = np.stack([swiss_df[["TRAIN_TT", "TRAIN_CO", "TRAIN_HE"]].to_numpy(),
swiss_df[["SM_TT", "SM_CO", "SM_HE"]].to_numpy(),
swiss_df[["CAR_TT", "CAR_CO", "CAR_HE"]].to_numpy()], axis=1)
choices = swiss_df.CHOICE.to_numpy()
choice_dataset = ChoiceDataset(shared_features_by_choice=shared_features_by_choice,
items_features_by_choice=items_features_by_choice,
available_items_by_choice=available_items_by_choice,
choices=choices,
features_by_ids=[customer_storage],)
Et voilà !
batch = choice_dataset.batch[[0, 10, 200]]
print("Shared features by choice:", batch[0])
print("Items features by choice:", batch[1])
print("Available items by choice:", batch[2])
print("Choices:", batch[3])
Shared features by choice: [[ 2 0 1 1 0 1 1 0 3 0 2 0 2 1]
[ 2 0 1 1 0 1 1 1 2 0 1 0 22 1]
[ 2 0 1 1 0 3 2 1 2 1 2 0 15 1]]
Items features by choice: [[[112. 48. 120.]
[ 63. 52. 20.]
[117. 65. 0.]]
[[170. 62. 30.]
[ 70. 66. 10.]
[ 0. 0. 0.]]
[[116. 54. 60.]
[ 53. 83. 30.]
[ 78. 40. 0.]]]
Available items by choice: [[1. 1. 1.]
[1. 1. 0.]
[1. 1. 1.]]
Choices: [1 1 0]
Link to another example
Finally, the Expedia dataset example is a good illustration of how memory-efficient FeaturesStorage can be. The Expedia dataset includes several one-hot features that are encoded as OneHotStorage, which saves a lot of memory.
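A back-of-the-envelope count shows why storing ids instead of dense one-hot rows saves memory. The sizes below are hypothetical and are not taken from the Expedia dataset; the sketch only counts stored cells and ignores data types and container overhead.

```python
def dense_one_hot_cells(n_rows, depth):
    """Cells needed when every row stores its full one-hot vector."""
    return n_rows * depth


def id_based_cells(n_rows, n_ids):
    """Cells needed when every row stores one id, plus one integer per known id."""
    return n_rows + n_ids


n_rows, depth = 1_000_000, 500  # hypothetical sizes, for illustration only
dense = dense_one_hot_cells(n_rows, depth)
compact = id_based_cells(n_rows, n_ids=depth)

print(dense)    # 500000000
print(compact)  # 1000500
```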