Introduction to choice-learn's data management

import os
import sys

sys.path.append("../../")

import numpy as np
import pandas as pd

An introduction to ChoiceDataset

Choice-Learn's ChoiceDataset is designed to handle large datasets, mainly by avoiding storing the same feature several times in memory. Its structure is built to fit the choice modelling setup needed to estimate the weights of choice models. This notebook introduces how the package handles data. Here is a summary of the points that will be addressed:

Summary

from choice_learn.data import ChoiceDataset

Our example dataset: SwissMetro

The SwissMetro[2] is a well-known dataset used to illustrate choice modelling. The dataset is provided with the Choice-Learn package and can be downloaded as follows:

from choice_learn.datasets import load_swissmetro

swissmetro_df = load_swissmetro(as_frame=True)

The SwissMetro dataset is a collection of answers to a survey about transportation mode choice in Switzerland. Before building a costly new public transport line, the government decided to better understand the needs of its future customers. A complete description of the dataset and the columns can be found here. We will use a subset of the information during the tutorial.

Available Modes:

  • The current, existing train, 'TRAIN'
  • The potential future SwissMetro, 'SM'
  • The customer's car, 'CAR'

Columns:

  • PURPOSE: The customer's travel purpose
  • AGE: The customer's age category
  • mode_AV: Whether the mode is available (1) or not (0)
  • mode_TT: The mode travel time
  • mode_CO: The mode cost
  • CHOICE: The transport mode chosen by the customer

kept_columns = ["PURPOSE", "AGE", "ORIGIN", "CAR_AV", "TRAIN_AV", "SM_AV", "CAR_TT",
                "TRAIN_TT", "SM_TT", "CAR_CO", "TRAIN_CO", "SM_CO", "CHOICE"]
swissmetro_df = swissmetro_df[kept_columns]
swissmetro_df.head()

The different types of data

We can split the columns into three distinct categories that are common to most choice modelling use-cases:

  • Choices - or outputs of our model: it's what we want to predict
  • Features - or inputs of our model
  • Availabilities - or the description of the set among which the customer chooses

Going further, we have two types of features: the features describing the customer and the features describing the means of transportation. Together with the choices and the availabilities, those are the four types of data that can be specified in a ChoiceDataset.

Vocabulary:

Items represent a product, an alternative that can be chosen by the customer at some point.

Throughout Choice-Learn examples and code here is the naming of our four types of data:

  • choices: which item has been chosen among all the available ones

  • shared_features_by_choice: It represents all the features that might change from one choice to another and that are common to all items (e.g. day of week, customer features, etc...).

  • items_features_by_choice: The features of each available item for a choice (e.g. prices might change from one choice to another and are specific to each sold item).

  • available_items_by_choice: For each choice it represents whether each item is proposed to the customer (1.) or not (0.).

Summary:

| index | feature | typical shape | Example | Taken Values |
|---|---|---|---|---|
| 1 | shared_features_by_choice | (n_choices, n_features) | customer age, day of week | float, int |
| 2 | items_features_by_choice | (n_choices, n_items, n_items_features) | price | float, int |
| 3 | available_items_by_choice | (n_choices, n_items) | | 1. (av.) or 0. (not av.) |
| 4 | choices | (n_choices,) | | int: index of chosen item |
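
To make these shapes concrete, here is a minimal NumPy sketch with made-up values for 3 choices and 2 items. The numbers are purely illustrative and are not taken from any dataset.

```python
import numpy as np

n_choices, n_items = 3, 2

# Toy values, invented for illustration only
# (n_choices, n_features): customer age, day of week
shared_features_by_choice = np.array([[35, 1], [52, 3], [41, 6]])
# (n_choices, n_items, n_items_features): price of each item
items_features_by_choice = np.array([[[2.0], [6.0]],
                                     [[1.5], [5.5]],
                                     [[2.0], [5.0]]])
# (n_choices, n_items): 1. = available, 0. = not available
available_items_by_choice = np.ones((n_choices, n_items))
# (n_choices,): index of the chosen item
choices = np.array([0, 1, 0])

for name, array in [
    ("shared_features_by_choice", shared_features_by_choice),
    ("items_features_by_choice", items_features_by_choice),
    ("available_items_by_choice", available_items_by_choice),
    ("choices", choices),
]:
    print(name, array.shape)
```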

DatasetDiagram

Hands-on: example from a pandas' DataFrame

The easiest way to create a ChoiceDataset is to use a pandas DataFrame.

First, here is a small explanation about wide vs long format, in case you have never heard about it, from Wikipedia.

  • Long (or narrow) format: one column contains all the values and another column lists the context of each value.
  • Wide format: each data variable is in a separate column.

Example Long Format:

| choice id | item | price | availability | choice |
|---|---|---|---|---|
| 1 | A | 2.0 | 1 | 1 |
| 1 | B | 6.0 | 1 | 0 |
| 2 | A | 1.5 | 1 | 0 |
| 2 | B | 5.5 | 1 | 1 |

Example Wide Format:

| choice id | price_A | price_B | availability_A | availability_B | choice |
|---|---|---|---|---|---|
| 1 | 2.0 | 6.0 | 1 | 1 | A |
| 2 | 1.5 | 5.5 | 1 | 1 | B |
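
The two layouts carry the same information. As a rough illustration, the pandas sketch below rebuilds the wide example from the long one. It only relies on standard pandas calls and is independent of Choice-Learn; the variable names are chosen for the example.

```python
import pandas as pd

# The long-format example: one row per (choice, item) pair
long_df = pd.DataFrame({
    "choice_id": [1, 1, 2, 2],
    "item": ["A", "B", "A", "B"],
    "price": [2.0, 6.0, 1.5, 5.5],
    "availability": [1, 1, 1, 1],
    "choice": [1, 0, 0, 1],
})

# One row per choice, one column per (variable, item) pair
wide_df = long_df.pivot(index="choice_id", columns="item", values=["price", "availability"])
wide_df.columns = [f"{variable}_{item}" for variable, item in wide_df.columns]

# The chosen item is the one flagged with 1 in the long format
chosen = long_df.loc[long_df["choice"] == 1].set_index("choice_id")["item"]
wide_df["choice"] = chosen
wide_df = wide_df.reset_index()
print(wide_df)
```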

Choice-Learn handles both formats, but slightly differently:

  • example for wide format
  • example for long format

Creating a ChoiceDataset from a wide DataFrame

Our example SwissMetro dataframe is in the wide format. Each row indicates a choice and each item has its own feature columns.

dataset = ChoiceDataset.from_single_wide_df(
    # The main DataFrame
    df=swissmetro_df,
    # The names of the items, will be used to find columns and organize them
    items_id=["TRAIN", "SM", "CAR"],

    # The column containing the choices
    choices_column="CHOICE",
    # How the choices are encoded: "items_index" means that the choice is the index of the item in the items_id list
    choice_format="items_index",

    # Columns for shared_features_by_choice
    shared_features_columns=["PURPOSE", "AGE"],

    # Columns for items_features_by_choice
    # They will be reconstructed as item_id + delimiter + feature_suffix
    items_features_suffixes=["CO", "TT"],
    # Same with availabilities
    available_items_suffix="AV",
    delimiter="_",
)

Options

choice_format: "items_index" or "items_id"

"items_index":

| choice_column |
|---|
| 0 |
| 1 |
| 0 |
| 2 |

"items_id":

| choice_column |
|---|
| "TRAIN" |
| "SM" |
| "TRAIN" |
| "CAR" |
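
The two encodings are interchangeable once the list of items is known. Here is a minimal pure-Python sketch of the conversion in both directions; the variable names are illustrative and this is not the library's internal code.

```python
items_id = ["TRAIN", "SM", "CAR"]

# Choices given as item names
choices_as_id = ["TRAIN", "SM", "TRAIN", "CAR"]

# Name -> position in the items_id list
choices_as_index = [items_id.index(item) for item in choices_as_id]
print(choices_as_index)

# Position -> name, to go back
back_to_id = [items_id[index] for index in choices_as_index]
print(back_to_id)
```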

items_features_by_choice and available_items_by_choice:

It is possible to specify:

  • Suffixes: in this case the columns used will be "item_id" + "delimiter" + "suffix"
  • Prefixes: in this case the columns used will be "prefix" + "delimiter" + "item_id"
  • Columns: each item's features in a list. In this case it is your duty to ensure coherence in terms of items and features orders. For our example it would be:

```python
items_features_by_choice_columns=[["TRAIN_CO", "TRAIN_TT"], ["SM_CO", "SM_TT"], ["CAR_CO", "CAR_TT"]],
available_items_by_choice_columns=["TRAIN_AV", "SM_AV", "CAR_AV"],
```
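
The naming rule for suffixes and prefixes can be sketched in a few lines of plain Python. The two helper functions below are hypothetical, written only to show how the column names are assembled; they are not part of Choice-Learn.

```python
items_id = ["TRAIN", "SM", "CAR"]


def columns_from_suffixes(items_id, suffixes, delimiter="_"):
    """Hypothetical helper: column names built as item_id + delimiter + suffix."""
    return [[f"{item}{delimiter}{suffix}" for suffix in suffixes] for item in items_id]


def columns_from_prefixes(items_id, prefixes, delimiter="_"):
    """Hypothetical helper: column names built as prefix + delimiter + item_id."""
    return [[f"{prefix}{delimiter}{item}" for prefix in prefixes] for item in items_id]


suffix_columns = columns_from_suffixes(items_id, ["CO", "TT"])
prefix_columns = columns_from_prefixes(items_id, ["CO", "TT"])
print(suffix_columns)
print(prefix_columns)
```

Each inner list groups the feature columns of one item, in the same order as items_id.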

Creating a ChoiceDataset from a long DataFrame

The long format, in which each row represents an alternative, is also commonly used. One of its benefits is that unavailability can be represented through missing rows, taking literally zero memory space. On the other hand, the 'shared_features' such as customer features must be duplicated on each row.

The ChoiceDataset object can be instantiated from a long DataFrame. It will infer the availabilities from existing/missing rows if they are not specified.

You need to specify:

  • the columns representing the features ('shared_features_columns' and 'items_features_columns')
  • the column in which the choice is given and how it is formatted ('choices_column' and 'choice_format')
  • which column identifies the items ('items_id_column')
  • which column identifies all the rows corresponding to the same choice ('choices_id_column')

# Transformation of our dataset on the long format
long_df = load_swissmetro(preprocessing="long_format")
long_df.head()
# Example of the long format instantiation
dataset = ChoiceDataset.from_single_long_df(
    df=long_df,
    items_id_column="item_id",
    choices_id_column="choice_id",

    shared_features_columns=["PURPOSE", "AGE"],
    items_features_columns=["TT", "CO"],

    choices_column="CHOICE",
    choice_format="one_zero")

Options

choice_format: "one_zero" or "items_id"

"one_zero":

| choice_id_column | item_id_column | choice_column |
|---|---|---|
| 1 | "CAR" | 0 |
| 1 | "SM" | 1 |
| 2 | "CAR" | 1 |
| 2 | "SM" | 0 |

"items_id":

| choice_id_column | item_id_column | choice_column |
|---|---|---|
| 1 | "CAR" | "SM" |
| 1 | "SM" | "SM" |
| 2 | "CAR" | "CAR" |
| 2 | "SM" | "CAR" |
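
To illustrate what a long DataFrame encodes, the pandas sketch below derives the chosen item and the availabilities from a toy table in which one row is missing. It only mimics the idea with standard pandas calls, on invented data, and makes no claim about how Choice-Learn implements it.

```python
import pandas as pd

# Toy long DataFrame: "TRAIN" is missing for choice 1, hence unavailable
toy_long_df = pd.DataFrame({
    "choice_id": [1, 1, 2, 2, 2],
    "item_id": ["CAR", "SM", "CAR", "SM", "TRAIN"],
    "CHOICE": [0, 1, 1, 0, 0],
})

# Ones and zeros -> name of the chosen item, one value per choice
chosen_item = toy_long_df.loc[toy_long_df["CHOICE"] == 1].set_index("choice_id")["item_id"]

# Availability: 1 when a (choice_id, item_id) row exists, 0 when it is missing
availability = pd.crosstab(toy_long_df["choice_id"], toy_long_df["item_id"])

print(chosen_item)
print(availability)
```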

Instantiation from different objects

For RAM optimization purposes, or simply because of the format of the data source, a dataset may be split into separate files. You can instantiate a ChoiceDataset while keeping this structure, which saves the time needed to concatenate everything.

You can work with either pandas.DataFrames or numpy.ndarrays.

Separating data types

The four distinct data types: choices, shared_features_by_choice, items_features_by_choice, available_items_by_choice can be manually given to the ChoiceDataset:

# Using pandas.DataFrames
dataset = ChoiceDataset(
    choices=swissmetro_df["CHOICE"],
    shared_features_by_choice=swissmetro_df[["PURPOSE", "AGE"]],
    items_features_by_choice=long_df[["choice_id", "item_id", "CO", "TT"]]
)

Note that if you pass items_features_by_choice as a pandas.DataFrame, it needs to be in the long format and with the columns 'choice_id' and 'item_id'. They will be used to get the features in the right order.
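
If your item features only exist in wide form, a long DataFrame with 'choice_id' and 'item_id' columns can be assembled with plain pandas. The sketch below uses a tiny made-up table with two items; it assumes the row index can serve as the choice id.

```python
import pandas as pd

# Toy wide DataFrame with made-up values: one row per choice
wide_df = pd.DataFrame({
    "TRAIN_CO": [48, 40], "TRAIN_TT": [112, 103],
    "SM_CO": [52, 49], "SM_TT": [63, 60],
})
items_id = ["TRAIN", "SM"]

per_item_frames = []
for item in items_id:
    # Select this item's columns and strip the item prefix from their names
    item_df = wide_df[[f"{item}_CO", f"{item}_TT"]].rename(
        columns={f"{item}_CO": "CO", f"{item}_TT": "TT"}
    )
    item_df.insert(0, "item_id", item)
    # Assumption: the row index identifies the choice
    item_df.insert(0, "choice_id", wide_df.index)
    per_item_frames.append(item_df)

# Stack the items, then group the rows of the same choice together
toy_long_df = (
    pd.concat(per_item_frames, ignore_index=True)
    .sort_values("choice_id", kind="stable")
    .reset_index(drop=True)
)
print(toy_long_df)
```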

# Using numpy.ndarrays
# Be aware of items_features_by_choices shape that is (n_choices, n_items, n_features)

items_features_by_choice = np.stack([swissmetro_df[["TRAIN_CO", "TRAIN_TT"]].to_numpy(),
                                     swissmetro_df[["SM_CO", "SM_TT"]].to_numpy(),
                                     swissmetro_df[["CAR_CO", "CAR_TT"]].to_numpy()],
                                     axis=1)
shared_features_by_choice = swissmetro_df[["PURPOSE", "AGE"]].to_numpy()
available_items_by_choice = swissmetro_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()

print("The data shapes are:")
print(f"choices: {swissmetro_df['CHOICE'].shape}")
print(f"shared_features_by_choice: {shared_features_by_choice.shape}")
print(f"items_features_by_choice: {items_features_by_choice.shape}")
print(f"available_items_by_choice: {available_items_by_choice.shape}")

dataset = ChoiceDataset(
    choices=swissmetro_df["CHOICE"].to_numpy(),
    shared_features_by_choice=shared_features_by_choice,
    items_features_by_choice=items_features_by_choice,
    available_items_by_choice=available_items_by_choice,

    # Features names can optionally be provided
    # the structure of data and names must match
    shared_features_by_choice_names=["PURPOSE", "AGE"],
    items_features_by_choice_names=["CO", "TT"],
)

Stacking Features

Several feature objects can be provided at once by wrapping them in a tuple. This structure is preserved inside the ChoiceDataset object and when slicing it into batches.

# Using pandas.DataFrames - Similar with np.ndarrays
dataset = ChoiceDataset(
    choices=swissmetro_df["CHOICE"],
    shared_features_by_choice=(swissmetro_df[["PURPOSE"]], swissmetro_df[["AGE"]]),
    items_features_by_choice=(long_df[["choice_id", "item_id", "CO"]],
                              long_df[["choice_id", "item_id", "TT"]]),
    available_items_by_choice=swissmetro_df[["TRAIN_AV", "SM_AV", "CAR_AV"]],
)

Other examples are provided here.

Using the ChoiceDataset object

Estimating choice models

With your ChoiceDataset instantiated, it can be used as is to fit choice models. An illustration can be found in the conditional MNL introduction notebook.

Slicing and batching

ChoiceDatasets are indexed by choice, meaning that accessing the i-th index corresponds to the i-th choice. In other words, it is the i-th value of the object given as 'choices' in the ChoiceDataset instantiation.

A ChoiceDataset can be sliced with the usual Python indexing syntax [.]:

sub_dataset = dataset[[0, 2, 4]]

sub_dataset will be a ChoiceDataset containing only the 0th, 2nd and 4th choices of dataset. The other data (shared_features_by_choice, items_features_by_choice and available_items_by_choice) are also kept and sliced accordingly.

In order to only get a chunk of data, it is possible to use .batch[.]. It will return the different data types sliced along choices in a raw np.ndarray format. Use .iter_batch() to iterate over all data in the ChoiceDataset by setting the batch_size argument to control the length of each chunk.

Also note that batch_size=-1 returns the whole dataset.

dataset.choices
batch = dataset.batch[[0, 2, 4]]
print("")
for batch in dataset.iter_batch(batch_size=1024):
    print("Num choices:", len(batch[-1]))
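
To make the batching behaviour described above more tangible, here is a simplified NumPy sketch of chunked iteration over several arrays that share the same first (choice) axis. The iter_batches function is hypothetical and much simpler than the library's .iter_batch(); in particular it does no shuffling.

```python
import numpy as np


def iter_batches(arrays, batch_size):
    """Hypothetical helper: yield tuples of arrays sliced along the choice axis."""
    n_choices = len(arrays[0])
    if batch_size == -1:
        # A single batch containing everything
        batch_size = n_choices
    for start in range(0, n_choices, batch_size):
        yield tuple(array[start:start + batch_size] for array in arrays)


toy_features = np.arange(10).reshape(5, 2)
toy_choices = np.array([0, 1, 0, 2, 1])

sizes = [len(batch[-1]) for batch in iter_batches((toy_features, toy_choices), batch_size=2)]
full_sizes = [len(batch[-1]) for batch in iter_batches((toy_features, toy_choices), batch_size=-1)]
print(sizes, full_sizes)
```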

More advanced use: the FeaturesStorage & RAM optimization

In many use-cases, features or groups of feature values are repeated over the dataset. For example, if one customer comes several times, their features will be repeated. With one-hot representations, this can create memory-heavy repetitions.

Choice-Learn introduces FeaturesStorage and FeaturesByIds in order to limit memory usage until a batch of data is accessed.

FeaturesStorage, why should I use it?

If you are not using a large dataset with many features, you can skip this part. FeaturesStorage is there if you want to further optimize your memory consumption and are willing to take some time to understand it.

It is mainly built to work well with ChoiceDataset, but here is a small introduction on how it works.

StorageDiagram

Example on the SwissMetro dataset

/!\ Disclaimer For the sake of the example, some features will be introduced and created. They are totally made up and do not exist in the original - and true - version of the SwissMetro Dataset.

Let's consider the survey that happened in the three cantons: Geneva, Berne and Zürich. Now we want to integrate localization features.

| Canton | Inhabitants (M) | Surface (km^2) | Origin Code |
|---|---|---|---|
| Geneva | 0.5 | 282 | 25 |
| Zürich | 1.5 | 1729 | 1 |
| Berne | 1.0 | 5959 | 2 |

A naive way to integrate those features is to add them as 'shared_features_by_choice'.

# Filtering cantons (explicit copy, so that adding columns does not trigger pandas' SettingWithCopyWarning)
swiss_df = swissmetro_df.loc[swissmetro_df.ORIGIN.isin([1, 2, 25])].copy()

# Adding features by mapping each canton code to its value
swiss_df["CANTON_SURFACE"] = swiss_df["ORIGIN"].map({1: 1729, 2: 5959, 25: 282})
swiss_df["CANTON_INHAB"] = swiss_df["ORIGIN"].map({1: 1.5, 2: 1.0, 25: 0.5})

dataset = ChoiceDataset.from_single_wide_df(
    df=swiss_df,
    items_id=["TRAIN", "SM", "CAR"],

    choices_column="CHOICE",
    choice_format="items_index",

    # The new features are added here compared to example above
    shared_features_columns=["PURPOSE", "AGE", "CANTON_SURFACE", "CANTON_INHAB"],
    items_features_suffixes=["CO", "TT"],
    available_items_suffix="AV",
    delimiter="_",
)

The main drawback is that the same features are repeated over the rows of the dataset. If we consider hundreds of stores over several millions, or billions, of choices, it would become... unreasonable!

One idea is to group the features behind an ID (the canton id for example) and to reconstruct the features only in batches.
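
As a rough order of magnitude, the NumPy sketch below compares the memory taken by repeated features with the memory taken by one id per row plus a small lookup table. The number of rows, the random ids and the int8 id type are assumptions made for the illustration.

```python
import numpy as np

n_choices = 100_000
rng = np.random.default_rng(0)
# Randomly drawn canton codes, one per choice (illustrative data)
origin_ids = rng.choice(np.array([1, 2, 25], dtype=np.int64), size=n_choices)

canton_features = {1: [1.5, 1729.0], 2: [1.0, 5959.0], 25: [0.5, 282.0]}

# Naive storage: the two float64 features are materialized on every row
repeated = np.array([canton_features[int(origin)] for origin in origin_ids])

# ID-based storage: one small integer per row, plus one vector per canton
stored = np.array(list(canton_features.values()))
compact_bytes = origin_ids.astype(np.int8).nbytes + stored.nbytes

print("Repeated features:", repeated.nbytes, "bytes")
print("Ids + storage:", compact_bytes, "bytes")
```

The gap widens as the number of features attached to each id grows, since the per-row cost of the id stays constant.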

from choice_learn.data import FeaturesStorage

origin_canton_features = {1: [1.5, 1729], 2: [1.0, 5959], 25: [0.5, 282]}
canton_storage = FeaturesStorage(values=origin_canton_features, name="ORIGIN") # Remark that the name matches the ID column name in the DF

The FeaturesStorage is basically a Python dictionary with a wrapper to easily get batches of data.

You can ask for a sequence of features with .batch. It works with the keys of our dictionary, which can be int, float, str, etc...

print("Retrieving features of canton id 1:")
print(canton_storage.batch[1])
print("Retrieving a batch of features:")
print(canton_storage.batch[[1, 25, 1]])
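
The underlying idea can be pictured with a few lines of Python. The DictStorage class below is a hypothetical, heavily simplified stand-in written for this illustration; it is not the FeaturesStorage implementation and exposes plain indexing instead of .batch.

```python
import numpy as np


class DictStorage:
    """Hypothetical, simplified dictionary-backed feature store."""

    def __init__(self, values):
        self._values = {key: np.asarray(value) for key, value in values.items()}

    def __getitem__(self, ids):
        # A sequence of ids returns the stacked features, a single id returns one vector
        if isinstance(ids, (list, tuple, np.ndarray)):
            return np.stack([self._values[id_] for id_ in ids])
        return self._values[ids]


toy_storage = DictStorage({1: [1.5, 1729], 2: [1.0, 5959], 25: [0.5, 282]})
single = toy_storage[1]
stacked = toy_storage[[1, 25, 1]]
print(single)
print(stacked)
```

Only three vectors live in memory, whatever the length of the id sequence that is requested.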

The FeaturesStorage is handy for its transparent use with ChoiceDataset. For it to work well, you need to:

  • specify a FeaturesStorage name that matches the feature names given to the ChoiceDataset
  • match FeaturesStorage ids with the sequence (types and values)
  • specify the FeaturesStorage objects listed with the features_by_ids argument

In our case the FeaturesStorage is named "ORIGIN", like the ORIGIN column whose integer values are the storage ids. We keep this column in shared_features_by_choice so that the sequence matches:

storage_dataset = ChoiceDataset(choices=swiss_df["CHOICE"],
                                items_features_by_choice=np.stack([swiss_df[["TRAIN_CO", "TRAIN_TT"]].to_numpy(),
                                                                   swiss_df[["SM_CO", "SM_TT"]].to_numpy(),
                                                                   swiss_df[["CAR_CO", "CAR_TT"]].to_numpy()],
                                                                   axis=1),
                                shared_features_by_choice=swiss_df[["AGE", "PURPOSE", "ORIGIN"]].to_numpy(),
                                features_by_ids=[canton_storage],
                                items_features_by_choice_names=["CO", "TT"],
                                shared_features_by_choice_names=["AGE", "PURPOSE", "ORIGIN"],
)

Looking at a batch of data, here is what it looks like:

# batching the dataset as before
batch = storage_dataset.batch[[1, 2, 3]]
print("Batch of shared_features_by_choice:", batch[0])
print("Batch of choices:", batch[3])

The features stored in the FeaturesStorage have been stacked with the usual 'shared_features_by_choice'!

Specific case of the OneHot Storage

Manually looking for canton features is quite time consuming. Another idea is to represent each canton by a unique one-hot vector, which is a recurring use-case. The OneHotStorage is built specifically for one-hot encoded features and further reduces memory consumption. It is used the same way as FeaturesStorage but, behind the scenes, it only keeps the index of the one for each element and constitutes the one-hot vector only when needed.

In other words, it stores a sparse version of the vectors and returns a dense representation when batched.

from choice_learn.data import OneHotStorage
storage = OneHotStorage(ids=swissmetro_df.ORIGIN.unique())

print("RAM storage of the OneHotStore:", storage.storage)
# When indexing with .batch, we can access the one-hot encoding of the element using its id
print("One-hot vector batch: storage.batch[2]", storage.batch[2])
print("One-hot vector batch: storage.batch[[5, 20, 18, 23, 25, 15,  5, 20]]")
print(storage.batch[[5, 20, 18, 23, 25, 15,  5, 20]])
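
The sparse-then-dense principle can be sketched as follows. SparseOneHot is a hypothetical class written for this illustration only: it keeps one integer position per id and builds the dense vectors on request. It is not the OneHotStorage code.

```python
import numpy as np


class SparseOneHot:
    """Hypothetical sketch: store the position of the one, densify on demand."""

    def __init__(self, ids):
        # Only one integer is kept per id
        self.index_of = {id_: position for position, id_ in enumerate(ids)}

    def batch(self, ids):
        # Dense one-hot rows are created only for the requested ids
        rows = np.zeros((len(ids), len(self.index_of)))
        rows[np.arange(len(ids)), [self.index_of[id_] for id_ in ids]] = 1
        return rows


toy_one_hot = SparseOneHot(ids=[5, 20, 18])
dense = toy_one_hot.batch([20, 5, 20])
print(toy_one_hot.index_of)
print(dense)
```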

Other examples of features_by_ids usage can be found here.

Additional Examples

The ModeCanada dataset

We will use the ModeCanada [1] dataset for this example. The dataset is originally in the long format. It is provided with the choice-learn package and can be loaded as follows:

from choice_learn.datasets import load_modecanada

canada_transport_df = load_modecanada(as_frame=True)
canada_transport_df.head()

An extensive description of the dataset can be found here. An extract indicates:

"The dataset was assembled in 1989 by VIA Rail (the Canadian national rail carrier) to estimate the demand for high-speed rail in the Toronto-Montreal corridor. The main information source was a Passenger Review administered to business travelers augmented by information about each trip. The observations consist of a choice between four modes of transportation (train, air, bus, car) with information about the travel mode and about the passenger. The posted dataset has been balanced to only include cases where all four travel modes are recorded. The file contains 11,116 observations on 2779 individuals. "

Alright ! If we go back to our dataframe, we can see the following columns:

  • case: an ID of the traveler
  • alt: the alternative concerned by the row
  • choice: 1 if the alternative was chosen, 0 otherwise
  • dist: trip distance
  • cost: trip cost
  • ivt: in-vehicle travel time (minutes)
  • ovt: out-of-vehicle travel time (minutes)
  • income: household income of the traveler ($)
  • urban: 1 if origin or destination is a large city
  • noalt: the number of alternatives among which the traveler had to choose
  • freq: the frequency of the alternative (0 for car) (e.g. how many trains per hour)

Following our specification, we can see that one case corresponds to one customer, thus one choice. In our choice-learn language it corresponds to "one context": a set of available alternatives and their features/specificities resulting in one choice. Let's regroup our features:

choices: Easy! It is the alternative for which the choice value is one.

shared_features_by_choice: The income, urban and dist features (also noalt, which is not really a feature) are the same for all the alternatives within a single choice. They are all constant with respect to the case (the traveler ID).

items_features_by_choice: ivt, ovt, cost and freq depend on and describe each alternative.

available_items_by_choice: It is not directly indicated, but it can easily be deduced. Whenever an alternative is not available, it is simply not listed for its case. For example for case=1, our first choice, only train and car are given as alternatives, meaning that air and bus could not be chosen/were not available.

dataset = ChoiceDataset.from_single_long_df(
    df=canada_transport_df,
    choices_column="choice",
    items_id_column="alt",
    choices_id_column="case",
    shared_features_columns=["income", "urban", "dist"],
    items_features_columns=["cost", "freq", "ovt", "ivt"],
    choice_format="one_zero")

In this example the 'choice_format' is "one_zero". As a short reminder, it specifies how the chosen alternative is encoded: with ones (chosen) and zeros (not chosen), or directly with the id of the chosen item ("items_id").

"one_zero":

| | case | alt | choice | dist | cost | ivt | ovt | freq | income |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 45 |
| 2 | 1 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 45 |
| 3 | 2 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 25 |
| 4 | 2 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 25 |
| 5 | 3 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 70 |

"items_id":

| | case | alt | choice | dist | cost | ivt | ovt | freq | income |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | train | car | 83 | 28.25 | 50 | 66 | 4 | 45 |
| 2 | 1 | car | car | 83 | 15.77 | 61 | 0 | 0 | 45 |
| 3 | 2 | train | car | 83 | 28.25 | 50 | 66 | 4 | 25 |
| 4 | 2 | car | car | 83 | 15.77 | 61 | 0 | 0 | 25 |
| 5 | 3 | train | car | 83 | 28.25 | 50 | 66 | 4 | 70 |

In the first 5 examples, the chosen transportation is always the car.

That's it !

A manual example of FeaturesStorage

Let's consider a case with three supermarkets:

  • supermarket_1 with a surface of 100 and 250 customers on average
  • supermarket_2 with a surface of 150 and 500 customers on average
  • supermarket_3 with a surface of 80 and 100 customers on average

In each store, we have 3 available products for which we have little information. For the example's sake, let's consider the following utility:

$$U(i, s) = u_i + \beta_1 \cdot S_s + \beta_2 \cdot C_s$$

with $S_s$ the surface of the store and $C_s$ its average number of customers.

We want to estimate the base utilities $u_i$ and the two coefficients: $\beta_1$ and $\beta_2$.
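
As a small worked example, the NumPy sketch below evaluates such a utility for one item in each of the three supermarkets. It assumes the utility is linear, $U(i, s) = u_i + \beta_1 S_s + \beta_2 C_s$, and the parameter values are invented for the illustration; they are not estimates.

```python
import numpy as np

# (surface S_s, average number of customers C_s) for the three supermarkets
supermarket_features = np.array([[100, 250], [150, 500], [80, 100]])
surfaces, customers = supermarket_features[:, 0], supermarket_features[:, 1]

# Hypothetical parameter values, chosen only for the illustration
u_i = 0.5
beta_1, beta_2 = 0.01, 0.002

# Store-level part of the utility, shared by all the items of a store
store_term = beta_1 * surfaces + beta_2 * customers
# Utility of item i in each of the three stores
utility_item_i = u_i + store_term

print("Store term:", store_term)
print("Utility of item i per store:", utility_item_i)
```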

Let's start with creating a ChoiceDataset without the FeaturesStorage:

# Here are our choices:
choices = [0, 1, 2, 0, 2, 1, 1, 0, 2, 1, 2, 0, 2, 0, 1, 2, 1, 0]
supermarket_features = [[100, 250], [150, 500], [80, 100]]
# Now our store sequence of supermarkets is:
supermarkets_sequence = [1, 1, 2, 3, 2, 1, 2, 1, 1, 2, 3, 2, 1, 2, 2, 3, 1, 2]

# The usual way to store the features would be to create the shared_features_by_choice array that contains
# the right features:
usual_supermarket_features = np.array([supermarket_features[supermarket_id - 1] for supermarket_id in supermarkets_sequence])
print("Usual Supermarket Features Shape:", usual_supermarket_features.shape)
Usual Supermarket Features Shape: (18, 2)

Since supermarket features are repeated several times, it's a great opportunity to use a FeaturesStorage!

Let's see how to use strings as IDs.

features_dict = {f"supermarket_{i+1}": supermarket_features[i] for i in range(3)}
storage = FeaturesStorage(values=features_dict, name="supermarket_features")
print("Retrieving features of first supermarket:")
print(storage.batch["supermarket_1"])
print("Retrieving a batch of features:")
print(storage.batch[["supermarket_1", "supermarket_2", "supermarket_1"]])
Retrieving features of first supermarket:
[100 250]
Retrieving a batch of features:
[[100 250]
 [150 500]
 [100 250]]

Reminder:

It is needed to:

  • specify a FeaturesStorage name
  • match FeaturesStorage ids with the sequence

In our case we call our FeaturesStorage "supermarket_features". The ids are now strings, so let's make the sequence match:

str_supermarkets_sequence = [[f"supermarket_{i}"] for i in supermarkets_sequence]

And now we can create our ChoiceDataset:

storage_dataset = ChoiceDataset(choices=choices,
                                shared_features_by_choice=str_supermarkets_sequence,
                                shared_features_by_choice_names=["supermarket_features"],
                                available_items_by_choice=np.ones((len(choices), 3)),
                                features_by_ids=[storage],
)

And now let's see how batches work:

batch = storage_dataset.batch[0]
print("Batch Shared Items Features:", batch[0])
print("Batch Items Features:", batch[1])
print("Batch Choice:", batch[3])
print("%-------------------------%")
batch = storage_dataset.batch[[1, 2, 3]]
print("Batch Shared Items Features:", batch[0])
print("Batch Items Features:", batch[1])
print("Batch Choice:", batch[3])
print("%-------------------------%")
batch = storage_dataset.batch[[0, 1, 5]]
print("Batch Shared Items Features:", batch[0])
print("Batch Items Features:", batch[1])
print("Batch Choice:", batch[3])
Batch Shared Items Features: [100 250]
Batch Items Features: None
Batch Choice: 0
%-------------------------%
Batch Shared Items Features: [[100 250]
 [150 500]
 [ 80 100]]
Batch Items Features: None
Batch Choice: [1 2 0]
%-------------------------%
Batch Shared Items Features: [[100 250]
 [100 250]
 [100 250]]
Batch Items Features: None
Batch Choice: [0 1 1]

Everything is mapped as needed. And the great thing is that you can easily mix "classical" features with FeaturesStorages.

Let's add an 'is_week_end' feature to our problem, which will also be stored in shared_features_by_choice.

shared_features = pd.DataFrame({"supermarket_features": np.array(str_supermarkets_sequence).squeeze(),
"is_week_end": [0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]})
shared_features.head()
supermarket_features is_week_end
0 supermarket_1 0
1 supermarket_1 0
2 supermarket_2 0
3 supermarket_3 1
4 supermarket_2 1
# Creation of the ChoiceDataset
storage_dataset = ChoiceDataset(choices=choices,
                                shared_features_by_choice=shared_features,
                                available_items_by_choice=np.ones((len(choices), 3)),
                                features_by_ids=[storage],
)
# And now it's ready
batch = storage_dataset.batch[[1, 2, 3]]
print("Batch Shared Items Features:", batch[0])
print("Batch Items Features:", batch[1])
print("Batch Choice:", batch[3])
Batch Shared Items Features: [[100 250]
 [150 500]
 [ 80 100]]
Batch Items Features: None
Batch Choice: [1 2 0]

Note that:

  • We use strings as ids for the example; however, we recommend using integers.
  • FeaturesStorage can be instantiated from dict, np.ndarray, list, pandas.DataFrame, etc...
  • More in-depth examples and explanations can be found here

Ready-to-use datasets

A few well-known open source datasets are directly integrated in the package and can be downloaded in one line:

  • SwissMetro from Bierlaire et al. (2001) [2]
  • ModeCanada from Koppelman et al. (1993) [1]
  • The Train dataset from Ben-Akiva et al. (1993) [4]
  • The Heating & Electricity datasets from Kenneth Train [3]
  • The TaFeng dataset from Kaggle [5]

If you feel like another open-source dataset could be included, reach out!

from choice_learn.datasets import (load_swissmetro,
                                   load_modecanada,
                                   load_train,
                                   load_heating,
                                   load_electricity,
                                   load_tafeng
                                   )

canada_choice_dataset = load_modecanada()
swissmetro_choice_dataset = load_swissmetro()

The datasets can also be downloaded as dataframes:

swissmetro_df = load_swissmetro(as_frame=True)
swissmetro_df.head()
GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE ... TRAIN_CO TRAIN_HE SM_TT SM_CO SM_HE SM_SEATS CAR_TT CAR_CO CHOICE CAR_HE
0 2 0 1 1 1 0 1 1 0 3 ... 48 120 63 52 20 0 117 65 1 0.0
1 2 0 1 1 1 0 1 1 0 3 ... 48 30 60 49 10 0 117 84 1 0.0
2 2 0 1 1 1 0 1 1 0 3 ... 48 60 67 58 30 0 117 52 1 0.0
3 2 0 1 1 1 0 1 1 0 3 ... 40 30 63 52 20 0 72 52 1 0.0
4 2 0 1 1 1 0 1 1 0 3 ... 36 60 63 42 20 0 90 84 1 0.0

5 rows × 29 columns

References

[1] Koppelman et al. (1993), Application and Interpretation of Nested Logit Models of Intercity Mode Choice

[2] Bierlaire, M., Axhausen, K. and Abay, G. (2001), The Acceptance of Modal Innovation: The Case of SwissMetro

[3] Train, K.E. (2003), Discrete Choice Methods with Simulation. Cambridge University Press.

[4] Ben-Akiva, M., Bolduc, D. and Bradley, M. (1993), Estimation of Travel Choice Models with Randomly Distributed Values of Time

[5] The Ta Feng Grocery dataset on Kaggle