
Exhaustive example of ChoiceDataset creation

The different possible ways to create a ChoiceDataset



# Install necessary requirements

# If you run this notebook on Google Colab, or in standalone mode, you need to install the required packages.
# Uncomment the following lines:

# !pip install choice-learn

# If you run the notebook within the GitHub repository, you need to run the following lines, which can be skipped otherwise:
import os
import sys

sys.path.append("../../")
import numpy as np
import pandas as pd

from choice_learn.data import ChoiceDataset
from choice_learn.data.storage import FeaturesStorage

We will use the ModeCanada dataset for this example. It can be loaded directly:

from choice_learn.datasets import load_modecanada

canada_df = load_modecanada(as_frame=True)
canada_df.head()
   case    alt  choice  dist   cost  ivt  ovt  freq  income  urban  noalt
0     1  train       0    83  28.25   50   66     4    45.0      0      2
1     1    car       1    83  15.77   61    0     0    45.0      0      2
2     2  train       0    83  28.25   50   66     4    25.0      0      2
3     2    car       1    83  15.77   61    0     0    25.0      0      2
4     3  train       0    83  28.25   50   66     4    70.0      0      2

Let's create a column indicating whether the considered transport alternative is individual transport or not.
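
A minimal pandas sketch of such a column, built on a few toy rows copied from the head of the frame above. The column name `is_public` mirrors the one returned later by `load_modecanada(..., add_is_public=True)`; the set of alternatives treated as public transport is an assumption made for illustration, not something stated by the dataset.

```python
import pandas as pd

# Toy rows copied from the head of the ModeCanada frame shown above.
toy_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 0, 1],
    }
)

# Assumption: these alternatives count as public (non-individual) transport.
public_modes = {"train", "bus", "air"}

# 1.0 for public transport, 0.0 for individual transport.
toy_df["is_public"] = toy_df["alt"].isin(public_modes).astype(float)
print(toy_df)
```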

From a single long format dataframe

dataset = ChoiceDataset.from_single_long_df(df=canada_df,
                                       shared_features_columns=["dist", "income", "urban"],
                                       items_features_columns=["freq", "cost", "ivt", "ovt"],
                                       items_id_column="alt",
                                       choices_id_column="case",
                                       choices_column="choice",
                                       # the choice column indicates whether the item is chosen (1) or not (0)
                                       choice_format="one_zero",
                                       )
print(dataset.summary())
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (['dist', 'income', 'urban'],)


 Items Features by Choice:
4 items features 
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%

Another mode is possible, if the dataframe indicates the name of the chosen item instead of ones and zeros:

canada_df = load_modecanada(as_frame=True, choice_format="items_id")
canada_df.head()
   case    alt  choice  dist   cost  ivt  ovt  freq  income  urban  noalt
0     1  train     car    83  28.25   50   66     4    45.0      0      2
1     1    car     car    83  15.77   61    0     0    45.0      0      2
2     2  train     car    83  28.25   50   66     4    25.0      0      2
3     2    car     car    83  15.77   61    0     0    25.0      0      2
4     3  train     car    83  28.25   50   66     4    70.0      0      2

This time, the choice is not encoded with ones and zeros: for each context, the column names the alternative (item) that was chosen. ChoiceDataset handles this case when you specify 'choice_format="items_id"'.

dataset = ChoiceDataset.from_single_long_df(df=canada_df,
                                       shared_features_columns=["dist", "income", "urban"],
                                       items_features_columns=["freq", "cost", "ivt", "ovt"],
                                       items_id_column="alt",
                                       choices_id_column="case",
                                       choices_column="choice",
                                       # the choice column indicates the id of the chosen item
                                       choice_format="items_id",
                                       )
print(dataset.summary())
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (['dist', 'income', 'urban'],)


 Items Features by Choice:
4 items features 
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%
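
The two choice encodings carry the same information. As an illustration only, here is a small pandas sketch that converts a toy long-format frame from the "one_zero" encoding to the "items_id" encoding and back. The frame and the variable names are made up for the example, and this is not necessarily how choice-learn performs the conversion internally.

```python
import pandas as pd

# Toy long-format frame: one row per (case, alternative).
long_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 1, 0],  # "one_zero" encoding
    }
)

# For each case, look up the alternative flagged with a 1...
chosen = long_df.loc[long_df["choice"] == 1].set_index("case")["alt"]

# ...and repeat that name on every row of the same case ("items_id" encoding).
items_id_df = long_df.assign(choice=long_df["case"].map(chosen))

# Going back: a row is chosen when its alternative equals the chosen name.
one_zero = (items_id_df["alt"] == items_id_df["choice"]).astype(int)

print(items_id_df)
print(one_zero.tolist())
```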

From a single wide format DataFrame

If your DataFrame is in the wide format you can use the 'from_single_wide_df' method. Here is an example with the SwissMetro dataset.

from choice_learn.datasets import load_swissmetro

swiss_df = load_swissmetro(as_frame=True)
swiss_df.head()
   GROUP  SURVEY  SP  ID  PURPOSE  FIRST  TICKET  WHO  LUGGAGE  AGE  ...  TRAIN_CO  TRAIN_HE  SM_TT  SM_CO  SM_HE  SM_SEATS  CAR_TT  CAR_CO  CHOICE  CAR_HE
0      2       0   1   1        1      0       1    1        0    3  ...        48       120     63     52     20         0     117      65       1     0.0
1      2       0   1   1        1      0       1    1        0    3  ...        48        30     60     49     10         0     117      84       1     0.0
2      2       0   1   1        1      0       1    1        0    3  ...        48        60     67     58     30         0     117      52       1     0.0
3      2       0   1   1        1      0       1    1        0    3  ...        40        30     63     52     20         0      72      52       1     0.0
4      2       0   1   1        1      0       1    1        0    3  ...        36        60     63     42     20         0      90      84       1     0.0

5 rows × 29 columns

dataset = ChoiceDataset.from_single_wide_df(
    df=swiss_df,
    items_id=["TRAIN", "SM", "CAR"],
    shared_features_columns=["GROUP", "SURVEY", "SP", "PURPOSE", "FIRST", "TICKET", "WHO", "LUGGAGE", "AGE",
                               "MALE", "INCOME", "GA", "ORIGIN", "DEST"],
    items_features_suffixes=["CO", "TT", "HE", "SEATS"],
    available_items_suffix="AV", # ["TRAIN_AV", "SM_AV", "CAR_AV"] also works
    choices_column="CHOICE",
    choice_format="item_index",
)
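
In the wide format, each item feature lives in its own column, named by joining an item id and a suffix (for example "TRAIN_CO" or "SM_TT"). The sketch below shows, on a toy frame, how such columns can be gathered into an array of shape (n_choices, n_items, n_features). The values are invented for the example and do not come from SwissMetro, and this is only an illustration of the naming convention, not the library's internal code.

```python
import numpy as np
import pandas as pd

items_id = ["TRAIN", "SM", "CAR"]
suffixes = ["CO", "TT"]

# Toy wide-format frame: one row per choice, one column per (item, feature).
wide_df = pd.DataFrame(
    {
        "TRAIN_CO": [48, 40],
        "TRAIN_TT": [100, 110],
        "SM_CO": [52, 49],
        "SM_TT": [63, 60],
        "CAR_CO": [65, 84],
        "CAR_TT": [117, 72],
        "CHOICE": [1, 1],
    }
)

# Column names expected for each item, following the "<item>_<suffix>" pattern.
columns_by_item = [[f"{item}_{suffix}" for suffix in suffixes] for item in items_id]

# Each block is (n_choices, n_features); stacking on axis 1 adds the item axis.
items_features = np.stack(
    [wide_df[cols].to_numpy() for cols in columns_by_item], axis=1
)
print(items_features.shape)
```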

From several DataFrames

Now, let's say that your data is split across several files. This can happen if you store the different types of features in different SQL tables, for example. You only need to follow a few restrictions:

shared_features, items_features, choices =\
load_modecanada(as_frame=True, split_features=True, add_is_public=True)

fixed_items_features needs a column named "item_id" referencing the item. Other columns can hold any feature.

shared_features needs a "choice_id" column (otherwise the index is used). Other columns can hold any feature.

shared_features.head()
   choice_id  income  dist  urban
0          1    45.0    83      0
2          2    25.0    83      0
4          3    70.0    83      0
6          4    70.0    83      0
8          5    55.0    83      0

items_features needs an "item_id" column, and a "choice_id" column is recommended (otherwise the index is used). Of course, "item_id" and "choice_id" should match across the DataFrames.

items_features.head()
   choice_id  item_id   cost  freq  ovt  ivt  is_public
0          1    train  28.25     4   66   50        1.0
1          1      car  15.77     0    0   61        0.0
2          2    train  28.25     4   66   50        1.0
3          2      car  15.77     0    0   61        0.0
4          3    train  28.25     4   66   50        1.0

choices should have a "choice_id" column and a "choice" column. The values in "choice" should match the values in the "item_id" column of items_features.

choices.head()
   choice_id  choice
1          1     car
3          2     car
5          3     car
7          4     car
9          5     car
# And now you can create the dataset with:
dataset = ChoiceDataset(shared_features_by_choice=shared_features,
                        items_features_by_choice=items_features,
                        choices=choices)
print(dataset.summary())
WARNING:root:Shared Features Names were not provided, will not be able to
                                    fit models needing them such as Conditional Logit.
WARNING:root:Items Features Names were not provided, will not be able to
                                fit models needing them such as Conditional Logit.


%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (Index(['income', 'dist', 'urban'], dtype='object'),)


 Items Features by Choice:
5 items features 
 with names: (Index(['cost', 'freq', 'is_public', 'ivt', 'ovt'], dtype='object'),)
%=====================================================================%
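
Before building a dataset from several DataFrames, it can help to verify that the identifiers line up. The helper below is a hypothetical sketch (`check_split_frames` is not part of choice-learn) that runs on toy frames using the "choice_id", "item_id" and "choice" column names displayed above.

```python
import pandas as pd

# Toy frames mimicking the three DataFrames displayed above.
shared_features = pd.DataFrame({"choice_id": [1, 2], "income": [45.0, 25.0]})
items_features = pd.DataFrame(
    {
        "choice_id": [1, 1, 2, 2],
        "item_id": ["train", "car", "train", "car"],
        "cost": [28.25, 15.77, 28.25, 15.77],
    }
)
choices = pd.DataFrame({"choice_id": [1, 2], "choice": ["car", "car"]})


def check_split_frames(shared, items, chosen):
    """Hypothetical helper: list inconsistencies (empty list when all is fine)."""
    problems = []
    known_ids = set(shared["choice_id"])
    if set(items["choice_id"]) != known_ids:
        problems.append("items_features and shared_features disagree on choice_id")
    if set(chosen["choice_id"]) != known_ids:
        problems.append("choices and shared_features disagree on choice_id")
    unknown_items = set(chosen["choice"]) - set(items["item_id"])
    if unknown_items:
        problems.append(f"chosen items missing from item_id: {sorted(unknown_items)}")
    return problems


problems = check_split_frames(shared_features, items_features, choices)
print(problems)
```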

From several np.ndarrays

Finally, another alternative is to specify each type of feature as an np.ndarray. You may also provide feature names; this is only necessary if you plan to use a model whose specification refers to those feature names.

shared_features, items_features, available_items_by_choice, choices =\
load_modecanada(as_frame=False, split_features=True)

If you use this method, it is your job to make sure that the arrays are well organized. First, shared_features_by_choice, items_features_by_choice, available_items_by_choice and choices must follow the same order, and their first dimensions must match. Second, available_items_by_choice and items_features_by_choice must have the same number of items, in the same order, along their second dimension. Third, choices must indicate the index of the chosen item, following the item order of items_features_by_choice and available_items_by_choice. Finally, you have to specify available_items_by_choice, which indicates which items were available (1) or not (0) for each context/choice.

To summarize, the shapes of the arrays must be:
- (n_choices, n_shared_features) for shared_features_by_choice
- (n_choices, n_items, n_items_features) for items_features_by_choice
- (n_choices, n_items) for available_items_by_choice
- (n_choices, ) for choices
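
The shape rules above can be checked with a few lines of NumPy before building the dataset. The function below is a hypothetical helper (`check_choice_arrays` is not part of choice-learn), run here on random toy arrays. The last check, that the chosen item is marked as available, is an extra assumption rather than a rule stated above.

```python
import numpy as np


def check_choice_arrays(shared, items, available, choices):
    """Hypothetical helper: raise ValueError if the arrays break the layout."""
    if choices.ndim != 1:
        raise ValueError("choices must have shape (n_choices,)")
    n_choices = choices.shape[0]
    if shared.ndim != 2 or shared.shape[0] != n_choices:
        raise ValueError("shared features must be (n_choices, n_shared_features)")
    if items.ndim != 3 or items.shape[0] != n_choices:
        raise ValueError("items features must be (n_choices, n_items, n_items_features)")
    if available.shape != items.shape[:2]:
        raise ValueError("availabilities must be (n_choices, n_items)")
    n_items = items.shape[1]
    if choices.min() < 0 or choices.max() >= n_items:
        raise ValueError("choices must be item indexes in [0, n_items)")
    # Assumption (not stated above): the chosen item should be marked available.
    if not np.all(available[np.arange(n_choices), choices] == 1):
        raise ValueError("a chosen item is marked as unavailable")


# Toy arrays: 5 choices, 3 shared features, 4 items, 2 item features.
rng = np.random.default_rng(0)
shared = rng.normal(size=(5, 3))
items = rng.normal(size=(5, 4, 2))
available = np.ones((5, 4))
choices = np.array([0, 3, 1, 2, 0])

check_choice_arrays(shared, items, available, choices)
print("shapes are consistent")
```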

print("For our example here are the arrays shapes:")
print(f"Contexts Features shape: {shared_features.shape}, 4324 choices, 3 features (income, dist, urban)")
print(f"Contexts Items Features shape: {items_features.shape}, 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)")
print(f"Contexts Items Availabilities shape: {available_items_by_choice.shape}, 4324 choices, 4 items")
print(f"Choices shape: {choices.shape}, 4324 choices")
For our example here are the arrays shapes:
Contexts Features shape: (4324, 3), 4324 choices, 3 features (income, dist, urban)
Contexts Items Features shape: (4324, 4, 4), 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)
Contexts Items Availabilities shape: (4324, 4), 4324 choices, 4 items
Choices shape: (4324,), 4324 choices
dataset = ChoiceDataset(shared_features_by_choice=shared_features,
                        items_features_by_choice=items_features,
                        choices=choices,
                        available_items_by_choice=available_items_by_choice,
                        # We can give the name of the features as follows, with the right order:
                        shared_features_by_choice_names=["income", "dist", "urban"],
                        items_features_by_choice_names=["freq", "cost", "ivt", "ovt"],
                        )
print(dataset.summary())
%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (['income', 'dist', 'urban'],)


 Items Features by Choice:
4 items features 
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%