Exhaustive example of ChoiceDataset creation
The different ways to create a ChoiceDataset are listed below:
- From a single long format DataFrame
- From a single wide format DataFrame
- From several DataFrames
- From several np.ndarrays
# Install necessary requirements
# If you run this notebook on Google Colab, or in standalone mode, you need to install the required packages.
# Uncomment the following lines:
# !pip install choice-learn
# If you run the notebook within the GitHub repository, you need to run the following lines, which can be skipped otherwise:
import os
import sys
sys.path.append("../../")
import numpy as np
import pandas as pd
from choice_learn.data import ChoiceDataset
from choice_learn.data.storage import FeaturesStorage
We will use the ModeCanada dataset for this example. We can load it directly:
from choice_learn.datasets import load_modecanada
canada_df = load_modecanada(as_frame=True)
canada_df.head()
| | case | alt | choice | dist | cost | ivt | ovt | freq | income | urban | noalt |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 45.0 | 0 | 2 |
1 | 1 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 45.0 | 0 | 2 |
2 | 2 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 25.0 | 0 | 2 |
3 | 2 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 25.0 | 0 | 2 |
4 | 3 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 70.0 | 0 | 2 |
Let's create a column indicating whether the considered transport alternative is individual transport or not.
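One way to build such a column with pandas is sketched below, on a small hand-made frame rather than the real dataset. The column name `is_public` mirrors the one that appears in the items features table later in this notebook; the set of modes treated as individual transport (only "car" here) is an assumption made for illustration.

```python
import pandas as pd

# Toy long-format frame: one row per (case, alternative) pair.
toy_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 0, 1],
    }
)

# Assumption: "car" is the only individual transport mode in this toy example.
individual_modes = {"car"}

# 1.0 for collective (public) alternatives, 0.0 for individual ones.
toy_df["is_public"] = (~toy_df["alt"].isin(individual_modes)).astype(float)
print(toy_df)
```

The new column varies with the alternative, so it would naturally be listed among the items features rather than the shared features.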
From a single long format dataframe
dataset = ChoiceDataset.from_single_long_df(
    df=canada_df,
    shared_features_columns=["dist", "income", "urban"],
    items_features_columns=["freq", "cost", "ivt", "ovt"],
    items_id_column="alt",
    choices_id_column="case",
    choices_column="choice",
    # the choice column indicates whether the item is chosen (1) or not (0)
    choice_format="one_zero",
)
print(dataset.summary())
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
Shared Features by Choice:
3 shared features
with names: (['dist', 'income', 'urban'],)
Items Features by Choice:
4 items features
with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%
Another format is possible: the DataFrame can indicate the name of the chosen item instead of ones and zeros:
| | case | alt | choice | dist | cost | ivt | ovt | freq | income | urban | noalt |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | train | car | 83 | 28.25 | 50 | 66 | 4 | 45.0 | 0 | 2 |
1 | 1 | car | car | 83 | 15.77 | 61 | 0 | 0 | 45.0 | 0 | 2 |
2 | 2 | train | car | 83 | 28.25 | 50 | 66 | 4 | 25.0 | 0 | 2 |
3 | 2 | car | car | 83 | 15.77 | 61 | 0 | 0 | 25.0 | 0 | 2 |
4 | 3 | train | car | 83 | 28.25 | 50 | 66 | 4 | 70.0 | 0 | 2 |
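A table like the one above can be derived from the ones-and-zeros version with a few lines of pandas. The sketch below works on a toy frame and assumes exactly one chosen alternative per case; it is only an illustration and not necessarily how the frame above was produced.

```python
import pandas as pd

# Toy long-format frame with a one/zero choice column.
toy_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 1, 0],
    }
)

# For each case, keep the id of the alternative flagged with choice == 1.
chosen_alt = toy_df.loc[toy_df["choice"] == 1].set_index("case")["alt"]

# Broadcast the chosen id to every row belonging to the same case.
toy_df["choice"] = toy_df["case"].map(chosen_alt)
print(toy_df)
```

After this step every row of a case carries the id of the chosen alternative, which is the layout shown in the table.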
This time, the choice column does not contain ones and zeros; instead, it names, for each context, the alternative (item) that has been chosen. ChoiceDataset handles this case easily if you specify choice_format="items_id".
dataset = ChoiceDataset.from_single_long_df(
    df=canada_df,
    shared_features_columns=["dist", "income", "urban"],
    items_features_columns=["freq", "cost", "ivt", "ovt"],
    items_id_column="alt",
    choices_id_column="case",
    choices_column="choice",
    # the choice column indicates the id of the chosen item
    choice_format="items_id",
)
print(dataset.summary())
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
Shared Features by Choice:
3 shared features
with names: (['dist', 'income', 'urban'],)
Items Features by Choice:
4 items features
with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%
From a single wide format DataFrame
If your DataFrame is in the wide format you can use the 'from_single_wide_df' method. Here is an example with the SwissMetro dataset.
from choice_learn.datasets import load_swissmetro
swiss_df = load_swissmetro(as_frame=True)
swiss_df.head()
| | GROUP | SURVEY | SP | ID | PURPOSE | FIRST | TICKET | WHO | LUGGAGE | AGE | ... | TRAIN_CO | TRAIN_HE | SM_TT | SM_CO | SM_HE | SM_SEATS | CAR_TT | CAR_CO | CHOICE | CAR_HE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 48 | 120 | 63 | 52 | 20 | 0 | 117 | 65 | 1 | 0.0 |
1 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 48 | 30 | 60 | 49 | 10 | 0 | 117 | 84 | 1 | 0.0 |
2 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 48 | 60 | 67 | 58 | 30 | 0 | 117 | 52 | 1 | 0.0 |
3 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 40 | 30 | 63 | 52 | 20 | 0 | 72 | 52 | 1 | 0.0 |
4 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | ... | 36 | 60 | 63 | 42 | 20 | 0 | 90 | 84 | 1 | 0.0 |
5 rows × 29 columns
dataset = ChoiceDataset.from_single_wide_df(
df=swiss_df,
items_id=["TRAIN", "SM", "CAR"],
shared_features_columns=["GROUP", "SURVEY", "SP", "PURPOSE", "FIRST", "TICKET", "WHO", "LUGGAGE", "AGE",
"MALE", "INCOME", "GA", "ORIGIN", "DEST"],
items_features_suffixes=["CO", "TT", "HE", "SEATS"],
available_items_suffix="AV", # ["TRAIN_AV", "SM_AV", "CAR_AV"] also works
choices_column="CHOICE",
choice_format="item_index",
)
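The call above relies on a naming convention. Judging from the column names in the preview (TRAIN_CO, SM_TT, CAR_HE) and from the comment on available_items_suffix, each column name is the item id joined to a suffix by an underscore. The helper below, `expected_columns`, is a hypothetical illustration of that convention and is not part of choice-learn; it only lists the names one would expect to look for in the DataFrame.

```python
items_id = ["TRAIN", "SM", "CAR"]
items_features_suffixes = ["CO", "TT", "HE", "SEATS"]
available_items_suffix = "AV"


def expected_columns(items, suffixes, sep="_"):
    """Map each item id to the column names implied by the suffixes."""
    return {item: [f"{item}{sep}{suffix}" for suffix in suffixes] for item in items}


# Feature columns implied by the item ids and the feature suffixes.
feature_columns = expected_columns(items_id, items_features_suffixes)
for item, names in feature_columns.items():
    print(item, names)

# Availability columns implied by the single availability suffix.
availability_columns = [f"{item}_{available_items_suffix}" for item in items_id]
print(availability_columns)
```

Comparing these names with `swiss_df.columns` before building the dataset is a quick way to spot a feature that exists only for some items.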
From several DataFrames
Now, let's say that your data is split into several files. This can happen, for example, if you store the different types of features in different SQL tables. You only need to follow a few restrictions:
shared_features, items_features, choices =\
load_modecanada(as_frame=True, split_features=True, add_is_public=True)
fixed_items_features needs a column named "item_id" that references the item. The other columns can hold any feature.
shared_features (the contexts features) needs a "choice_id" column (otherwise the index is used). The other columns can hold any feature.
| | choice_id | income | dist | urban |
---|---|---|---|---|
0 | 1 | 45.0 | 83 | 0 |
2 | 2 | 25.0 | 83 | 0 |
4 | 3 | 70.0 | 83 | 0 |
6 | 4 | 70.0 | 83 | 0 |
8 | 5 | 55.0 | 83 | 0 |
items_features (the contexts items features) needs an "item_id" column, and a "choice_id" column is recommended (otherwise the index is used). Of course, the "item_id" and "choice_id" values should match those used in the other DataFrames.
| | choice_id | item_id | cost | freq | ovt | ivt | is_public |
---|---|---|---|---|---|---|---|
0 | 1 | train | 28.25 | 4 | 66 | 50 | 1.0 |
1 | 1 | car | 15.77 | 0 | 0 | 61 | 0.0 |
2 | 2 | train | 28.25 | 4 | 66 | 50 | 1.0 |
3 | 2 | car | 15.77 | 0 | 0 | 61 | 0.0 |
4 | 3 | train | 28.25 | 4 | 66 | 50 | 1.0 |
choices should have a "choice_id" column and a "choice" column. The values in "choice" should match the values of the "item_id" column in items_features.
| | choice_id | choice |
---|---|---|
1 | 1 | car |
3 | 2 | car |
5 | 3 | car |
7 | 4 | car |
9 | 5 | car |
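To make the layout of the three tables above concrete, here is a sketch that derives frames of the same form from a toy long-format frame using plain pandas. The toy data and the choice of "income" and "cost" as the only features are assumptions for illustration; with real data the frames may just as well come from separate files or SQL tables.

```python
import pandas as pd

# Toy long-format frame: one row per (choice_id, item_id) pair.
long_df = pd.DataFrame(
    {
        "choice_id": [1, 1, 2, 2],
        "item_id": ["train", "car", "train", "car"],
        "choice": [0, 1, 1, 0],
        "income": [45.0, 45.0, 25.0, 25.0],
        "cost": [28.25, 15.77, 28.25, 15.77],
    }
)

# One row per choice: features shared by all the alternatives.
shared_features = long_df[["choice_id", "income"]].drop_duplicates("choice_id")

# One row per (choice, item) pair: features that vary with the alternative.
items_features = long_df[["choice_id", "item_id", "cost"]]

# One row per choice: the id of the chosen item.
choices = long_df.loc[long_df["choice"] == 1, ["choice_id", "item_id"]].rename(
    columns={"item_id": "choice"}
)

print(shared_features, items_features, choices, sep="\n\n")
```

The identifiers are the only link between the three frames, which is why they have to agree across all of them.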
# And now you can create the dataset with:
dataset = ChoiceDataset(shared_features_by_choice=shared_features,
items_features_by_choice=items_features,
choices=choices)
print(dataset.summary())
WARNING:root:Shared Features Names were not provided, will not be able to
fit models needing them such as Conditional Logit.
WARNING:root:Items Features Names were not provided, will not be able to
fit models needing them such as Conditional Logit.
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
Shared Features by Choice:
3 shared features
with names: (Index(['income', 'dist', 'urban'], dtype='object'),)
Items Features by Choice:
5 items features
with names: (Index(['cost', 'freq', 'is_public', 'ivt', 'ovt'], dtype='object'),)
%=====================================================================%
From several np.ndarrays
Finally, another alternative is to specify each type of feature as an np.ndarray. You can optionally provide the feature names. They are not necessary unless you plan to use a model whose specification refers to those feature names.
shared_features, items_features, available_items_by_choice, choices =\
load_modecanada(as_frame=False, split_features=True)
If you are using this method, it is your job to make sure that the arrays are well organized. First, shared_features_by_choice, items_features_by_choice, available_items_by_choice and choices must follow the same order of choices, and their first dimensions must match. Second, available_items_by_choice and items_features_by_choice must have the same number of items, ordered the same way, along their second dimension. Third, choices must indicate the index of the chosen item, following the item order of items_features_by_choice and available_items_by_choice. Finally, you have to specify available_items_by_choice, which indicates which items were available (1) or not (0) for each context/choice.
To summarize, the shape of the arrays must be:
- (n_choices, n_shared_features) for shared_features_by_choice
- (n_choices, n_items, n_items_features) for items_features_by_choice
- (n_choices, n_items) for available_items_by_choice
- (n_choices, ) for choices
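These shape rules are easy to verify before building the dataset. The function below, `check_shapes`, is a hypothetical helper written for this example and not part of choice-learn; it is run here on small random arrays rather than on the ModeCanada data.

```python
import numpy as np


def check_shapes(shared, items, available, choices):
    """Hypothetical helper: raise ValueError if the arrays are inconsistent."""
    if choices.ndim != 1:
        raise ValueError("choices must have shape (n_choices,)")
    n_choices = choices.shape[0]
    if shared.ndim != 2 or shared.shape[0] != n_choices:
        raise ValueError("shared features must have shape (n_choices, n_shared_features)")
    if items.ndim != 3 or items.shape[0] != n_choices:
        raise ValueError("items features must have shape (n_choices, n_items, n_items_features)")
    if available.shape != items.shape[:2]:
        raise ValueError("availabilities must have shape (n_choices, n_items)")
    if choices.min() < 0 or choices.max() >= items.shape[1]:
        raise ValueError("choices must hold item indexes between 0 and n_items - 1")
    return True


# Toy arrays: 5 choices, 4 items, 3 shared features, 4 items features.
rng = np.random.default_rng(0)
n_choices, n_items = 5, 4
toy_shared = rng.normal(size=(n_choices, 3))
toy_items = rng.normal(size=(n_choices, n_items, 4))
toy_available = np.ones((n_choices, n_items))
toy_choices = rng.integers(0, n_items, size=n_choices)

print(check_shapes(toy_shared, toy_items, toy_available, toy_choices))
```

A check like this only covers shapes and index ranges; whether the rows and items are in the same order across arrays still has to be ensured when the arrays are built.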
print("For our example here are the arrays shapes:")
print(f"Contexts Features shape: {shared_features.shape}, 4324 choices, 3 features (income, dist, urban)")
print(f"Contexts Items Features shape: {items_features.shape}, 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)")
print(f"Contexts Items Availabilities shape: {available_items_by_choice.shape}, 4324 choices, 4 items")
print(f"Choices shape: {choices.shape}, 4324 choices")
For our example here are the arrays shapes:
Contexts Features shape: (4324, 3), 4324 choices, 3 features (income, dist, urban)
Contexts Items Features shape: (4324, 4, 4), 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)
Contexts Items Availabilities shape: (4324, 4), 4324 choices, 4 items
Choices shape: (4324,), 4324 choices
dataset = ChoiceDataset(shared_features_by_choice=shared_features,
items_features_by_choice=items_features,
choices=choices,
available_items_by_choice=available_items_by_choice,
# We can give the name of the features as follows, with the right order:
shared_features_by_choice_names=["income", "dist", "urban"],
items_features_by_choice_names=["freq", "cost", "ivt", "ovt"],
)
print(dataset.summary())
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
Shared Features by Choice:
3 shared features
with names: (['income', 'dist', 'urban'],)
Items Features by Choice:
4 items features
with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%