Exhaustive example of ChoiceDataset creation

The different possible ways to create a ChoiceDataset

Listed below:

From a single long format DataFrame
From a single wide format DataFrame
From several DataFrames
From several np.ndarrays

# Install necessary requirements

# If you run this notebook on Google Colab, or in standalone mode, you need to install the required packages.
# Uncomment the following lines:

# !pip install choice-learn

# If you run the notebook within the GitHub repository, you need to run the following lines, which can be skipped otherwise:
import os
import sys

sys.path.append("../../")

import numpy as np
import pandas as pd

from choice_learn.data import ChoiceDataset
from choice_learn.data.storage import FeaturesStorage

We will use the ModeCanada dataset for this example. We can load it directly:

from choice_learn.datasets import load_modecanada

canada_df = load_modecanada(as_frame=True)
canada_df.head()

	case	alt	choice	dist	cost	ivt	ovt	freq	income	noalt
0	1	train	0	83	28.25	50	66	4	45.0	2
1	1	car	1	83	15.77	61	0	0	45.0	2
2	2	train	0	83	28.25	50	66	4	25.0	2
3	2	car	1	83	15.77	61	0	0	25.0	2
4	3	train	0	83	28.25	50	66	4	70.0	2

Let's create a column indicating whether the considered transport alternative is an individual mode of transport or not.

From a single long format DataFrame

dataset = ChoiceDataset.from_single_long_df(df=canada_df,
                                       shared_features_columns=["dist", "income", "urban"],
                                       items_features_columns=["freq", "cost", "ivt", "ovt"],
                                       items_id_column="alt",
                                       choices_id_column="case",
                                       choices_column="choice",
                                       # the choice column indicates whether the item is chosen (1) or not (0)
                                       choice_format="one_zero",
                                       )
print(dataset.summary())

%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (['dist', 'income', 'urban'],)


 Items Features by Choice:
4 items features 
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%

Another mode is available when the DataFrame gives the name of the chosen item instead of ones and zeros:

canada_df = load_modecanada(as_frame=True, choice_format="items_id")
canada_df.head()

	case	alt	choice	dist	cost	ivt	ovt	freq	income	noalt
0	1	train	car	83	28.25	50	66	4	45.0	2
1	1	car	car	83	15.77	61	0	0	45.0	2
2	2	train	car	83	28.25	50	66	4	25.0	2
3	2	car	car	83	15.77	61	0	0	25.0	2
4	3	train	car	83	28.25	50	66	4	70.0	2

This time, the choice is not given by ones and zeros: for each context, the "choice" column names the alternative (item) that was chosen. The ChoiceDataset handles this case easily when you specify 'choice_format="items_id"'.

dataset = ChoiceDataset.from_single_long_df(df=canada_df,
                                       shared_features_columns=["dist", "income", "urban"],
                                       items_features_columns=["freq", "cost", "ivt", "ovt"],
                                       items_id_column="alt",
                                       choices_id_column="case",
                                       choices_column="choice",
                                       # the choice column indicates the id of the chosen item
                                       choice_format="items_id",
                                       )
print(dataset.summary())

%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (['dist', 'income', 'urban'],)


 Items Features by Choice:
4 items features 
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%
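
To make the relationship between the two choice formats concrete, here is a minimal sketch on a toy long-format DataFrame. It only uses pandas, the values are made up, and the variable names (long_df, chosen, as_items_id, as_one_zero) are illustrative assumptions. It is not how choice-learn performs the conversion internally.

```python
import pandas as pd

# Toy long-format data: one row per (case, alternative) pair. Values are made up.
long_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 1, 0],  # "one_zero" style: 1 marks the chosen row
    }
)

# Look up the chosen alternative of each case...
chosen = long_df.loc[long_df["choice"] == 1].set_index("case")["alt"]

# ...and repeat it on every row of that case ("items_id" style).
as_items_id = long_df.assign(choice=long_df["case"].map(chosen))

# Going back: a row is chosen when its alternative equals the stored id.
as_one_zero = as_items_id.assign(
    choice=(as_items_id["alt"] == as_items_id["choice"]).astype(int)
)

print(as_items_id["choice"].tolist())
print(as_one_zero["choice"].tolist())
```

Both representations carry the same information, so either one can be passed as long as 'choice_format' matches the content of the choice column.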

From a single wide format DataFrame

If your DataFrame is in the wide format you can use the 'from_single_wide_df' method. Here is an example with the SwissMetro dataset.

from choice_learn.datasets import load_swissmetro

swiss_df = load_swissmetro(as_frame=True)
swiss_df.head()

	GROUP	SP	ID	PURPOSE	TICKET	WHO	AGE	...	TRAIN_CO	TRAIN_HE	SM_TT	SM_CO	SM_HE	CAR_TT	CAR_CO	CHOICE
0	2	1	1	1	1	1	3	...	48	120	63	52	20	117	65	1
1	2	1	1	1	1	1	3	...	48	30	60	49	10	117	84	1
2	2	1	1	1	1	1	3	...	48	60	67	58	30	117	52	1
3	2	1	1	1	1	1	3	...	40	30	63	52	20	72	52	1
4	2	1	1	1	1	1	3	...	36	60	63	42	20	90	84	1

5 rows × 29 columns

dataset = ChoiceDataset.from_single_wide_df(
    df=swiss_df,
    items_id=["TRAIN", "SM", "CAR"],
    shared_features_columns=["GROUP", "SURVEY", "SP", "PURPOSE", "FIRST", "TICKET", "WHO", "LUGGAGE", "AGE",
                               "MALE", "INCOME", "GA", "ORIGIN", "DEST"],
    items_features_suffixes=["CO", "TT", "HE", "SEATS"],
    available_items_suffix="AV", # ["TRAIN_AV", "SM_AV", "CAR_AV"] also works
    choices_column="CHOICE",
    choice_format="item_index",
)
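
To illustrate what the item names and suffixes describe, the sketch below gathers the "ITEM_SUFFIX" columns of a toy wide DataFrame into a single array with one slice per item. It is an illustration under assumptions: the values are made up, the "_" delimiter is taken from the column names shown above, and the stacking logic is ours, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Toy wide-format data: one row per choice, one column per (item, feature) pair.
wide_df = pd.DataFrame(
    {
        "TRAIN_TT": [112, 103], "TRAIN_CO": [48, 40],
        "SM_TT": [63, 60], "SM_CO": [52, 49],
        "CAR_TT": [117, 72], "CAR_CO": [65, 52],
    }
)

items = ["TRAIN", "SM", "CAR"]
suffixes = ["CO", "TT"]

# For each item, select its columns in suffix order, then stack along a new
# "items" axis to obtain an array of shape (n_choices, n_items, n_features).
items_features = np.stack(
    [wide_df[[f"{item}_{suffix}" for suffix in suffixes]].to_numpy() for item in items],
    axis=1,
)

print(items_features.shape)
print(items_features[0, 1])  # features (CO, TT) of item "SM" for the first choice
```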

From several DataFrames

Now, let's say that your data is split across several files. This can happen, for example, if you store the different types of features in different SQL tables. You only need to follow a few restrictions, described below:

shared_features, items_features, choices =\
load_modecanada(as_frame=True, split_features=True, add_is_public=True)

fixed_items_features needs a column named "item_id" referencing the item. Other columns are free to hold any feature.

shared_features needs a "choice_id" column (otherwise the index is used). Other columns are free to hold any feature.

shared_features.head()

	choice_id	income	dist
0	1	45.0	83
2	2	25.0	83
4	3	70.0	83
6	4	70.0	83
8	5	55.0	83

items_features needs an "item_id" column, and a "choice_id" column is recommended (otherwise the index is used). Of course, the "item_id" and "choice_id" values should match those used in the other DataFrames.

items_features.head()

	choice_id	item_id	cost	freq	ovt	ivt	is_public
0	1	train	28.25	4	66	50	1.0
1	1	car	15.77	0	0	61	0.0
2	2	train	28.25	4	66	50	1.0
3	2	car	15.77	0	0	61	0.0
4	3	train	28.25	4	66	50	1.0

choices should have a "choice_id" column and a "choice" column. The values in "choice" should match the values in the "item_id" column of items_features.

choices.head()

	choice_id	choice
1	1	car
3	2	car
5	3	car
7	4	car
9	5	car
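
If your data starts as a single long-format table, the three DataFrames can be derived with plain pandas. The sketch below mirrors the column names of the tables displayed above ("choice_id", "item_id", "choice"). The toy values and the source column names ("case", "alt") are assumptions made for the example.

```python
import pandas as pd

# Toy long-format data: one row per (case, alternative) pair. Values are made up.
long_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 1, 0],
        "income": [45.0, 45.0, 25.0, 25.0],
        "cost": [28.25, 15.77, 28.25, 15.77],
    }
)

# Shared features: one row per choice, identified by "choice_id".
shared_features = (
    long_df[["case", "income"]]
    .drop_duplicates()
    .rename(columns={"case": "choice_id"})
)

# Items features: one row per (choice, item) pair.
items_features = long_df[["case", "alt", "cost"]].rename(
    columns={"case": "choice_id", "alt": "item_id"}
)

# Choices: keep the chosen row of each case and store the item identifier.
choices = long_df.loc[long_df["choice"] == 1, ["case", "alt"]].rename(
    columns={"case": "choice_id", "alt": "choice"}
)

print(shared_features)
print(choices)
```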

# And now you can create the dataset with:
dataset = ChoiceDataset(shared_features_by_choice=shared_features,
                        items_features_by_choice=items_features,
                        choices=choices)
print(dataset.summary())

WARNING:root:Shared Features Names were not provided, will not be able to
                                    fit models needing them such as Conditional Logit.
WARNING:root:Items Features Names were not provided, will not be able to
                                fit models needing them such as Conditional Logit.


%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (Index(['income', 'dist', 'urban'], dtype='object'),)


 Items Features by Choice:
5 items features 
 with names: (Index(['cost', 'freq', 'is_public', 'ivt', 'ovt'], dtype='object'),)
%=====================================================================%

From several np.ndarrays

Finally, another alternative is to specify each type of feature as an np.ndarray. You may also provide feature names, but this is optional unless you plan to use a model whose specification relies on those names.

shared_features, items_features, available_items_by_choice, choices =\
load_modecanada(as_frame=False, split_features=True)

If you are using this method, it is your job to make sure that the arrays are well organized. First, shared_features_by_choice, items_features_by_choice, available_items_by_choice and choices must follow the same order of choices, and their first dimensions must match. Second, available_items_by_choice and items_features_by_choice must have the same number of items, ordered the same way, in their second dimension. Third, choices must indicate the index of the chosen item, following the item order of items_features_by_choice and available_items_by_choice. Finally, you have to specify available_items_by_choice, which indicates which items were available (1) or not (0) for each choice.

To summarize, the shapes of the arrays must be:
- (n_choices, n_shared_features) for shared_features_by_choice
- (n_choices, n_items, n_items_features) for items_features_by_choice
- (n_choices, n_items) for available_items_by_choice
- (n_choices, ) for choices

print("For our example here are the arrays shapes:")
print(f"Contexts Features shape: {shared_features.shape}, 4324 choices, 3 features (income, dist, urban)")
print(f"Contexts Items Features shape: {items_features.shape}, 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)")
print(f"Contexts Items Availabilities shape: {available_items_by_choice.shape}, 4324 choices, 4 items")
print(f"Choices shape: {choices.shape}, 4324 choices")

For our example here are the arrays shapes:
Contexts Features shape: (4324, 3), 4324 choices, 3 features (income, dist, urban)
Contexts Items Features shape: (4324, 4, 4), 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)
Contexts Items Availabilities shape: (4324, 4), 4324 choices, 4 items
Choices shape: (4324,), 4324 choices
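
Since consistency between the arrays is left to you, a small sanity check before building the dataset can catch ordering or shape mistakes early. The helper below, check_choice_arrays, is hypothetical (it is not part of choice-learn) and only encodes the shape rules listed above, using toy arrays with made-up values.

```python
import numpy as np


def check_choice_arrays(shared, items, available, chosen):
    """Hypothetical helper: raise ValueError if the arrays are inconsistent."""
    n_choices = chosen.shape[0]
    if chosen.ndim != 1:
        raise ValueError("choices must have shape (n_choices,)")
    if shared.shape[0] != n_choices:
        raise ValueError("shared features and choices disagree on n_choices")
    if items.ndim != 3 or items.shape[0] != n_choices:
        raise ValueError("items features must be (n_choices, n_items, n_features)")
    if available.shape != items.shape[:2]:
        raise ValueError("availabilities must be (n_choices, n_items)")
    if not ((chosen >= 0) & (chosen < items.shape[1])).all():
        raise ValueError("choices must be valid item indexes")
    # The chosen item of each choice must be marked as available (1).
    if not (available[np.arange(n_choices), chosen] == 1).all():
        raise ValueError("a chosen item is marked as unavailable")


# Toy arrays: 3 choices, 2 items, 2 shared features, 4 items features.
shared_features = np.array([[45.0, 83.0], [25.0, 83.0], [70.0, 83.0]])
items_features = np.zeros((3, 2, 4))
available_items_by_choice = np.array([[1, 1], [1, 0], [1, 1]])
choices = np.array([1, 0, 1])

check_choice_arrays(shared_features, items_features, available_items_by_choice, choices)
print("arrays are consistent")
```

The last check is stricter than the shape rules: it also rejects a dataset in which the recorded choice points to an item flagged as unavailable.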

dataset = ChoiceDataset(shared_features_by_choice=shared_features,
                        items_features_by_choice=items_features,
                        choices=choices,
                        available_items_by_choice=available_items_by_choice,
                        # We can give the name of the features as follows, with the right order:
                        shared_features_by_choice_names=["income", "dist", "urban"],
                        items_features_by_choice_names=["freq", "cost", "ivt", "ovt"],
                        )
print(dataset.summary())

%=====================================================================%
%%% Summary of the dataset:
%=====================================================================%
Number of items: 4
Number of choices: 4324
%=====================================================================%
 Shared Features by Choice:
 3 shared features
 with names: (['income', 'dist', 'urban'],)


 Items Features by Choice:
4 items features 
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
%=====================================================================%
