Handling time

When performing feature engineering to learn a model to predict the future, the value to predict will be associated with a time. In this case, it is paramount to only incorporate data prior to this “cutoff time” when calculating the feature values.

Featuretools is designed to take time into consideration when required. By specifying a cutoff time, we can control what portions of the data are used when calculating features.

In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

Motivating Example

Consider the problem to predict if a customer is likely to buy an upgrade to their membership plan. To do this, you first identify historical examples of customers who upgraded and others who did not. For each customer, you can only use the interactions s/he had prior to upgrading or not upgrading their membership. This is a requirement – by definition.

The example above illustrates the importance of time in calculating features. Other situations are more subtle, and hence when building predictive models it is important identify if time is a consideration. If feature calculation does not account for time, it may include data in calculations that is past the outcome we want to predict and may cause the well known problem of Label Leakage.

Cutoff times

We can specify the time for each instance of the target_entity to calculate features. The timestamp represents the last time data can be used for calculating features. This is specified using a dataframe of cutoff times. Below we show an example of this dataframe for our customers example.

In [3]: import pandas as pd

In [4]: cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
   ...:                              "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
   ...: 

In [5]: cutoff_times
Out[5]: 
   customer_id                time
0            1 2014-01-01 01:41:50
1            2 2014-01-01 02:06:50
2            3 2014-01-01 02:31:50
3            4 2014-01-01 02:56:50
4            5 2014-01-01 03:21:50

Time index for an entity

Given the cutoff time for each instance of the target entity, Featuretools needs to automatically identify the data points that are prior to this time point across all entities. In most temporal datasets, entities have a column that specifies the point in time when data in that row became available.

Users specify this point in time when a particular row became known by defining a time index for each entity. Read about setting the time index in Representing Data with EntitySets.

Running DFS with cutoff times

We provide the cutoff times as a parameter to DFS.

In [6]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                   target_entity="customers",
   ...:                                   cutoff_time=cutoff_times)
   ...: 

In [7]: feature_matrix
Out[7]: 
            zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device)                  ...                   MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))
customer_id                                                                         ...                                                                                                                                 
1              60091                3                          1.0                  ...                                               2014.0                                  1.0                                    2.0
2              02139                3                          3.0                  ...                                               2014.0                                  1.0                                    2.0
3              02139                1                          1.0                  ...                                               2014.0                                  1.0                                    2.0
4              60091                0                          NaN                  ...                                                  NaN                                  NaN                                    NaN
5              02139                2                          2.0                  ...                                               2014.0                                  1.0                                    2.0

[5 rows x 69 columns]

There is one row in the feature matrix corresponding to a row in cutoff_times. The feature values in this row use only data prior to the cutoff time. Additionally, the returned feature matrix will be ordered by the time the rows was calculated, not by the order of cutoff_times. We can add the cutoff time to the returned feature matrix by using cutoff_time_in_index as shown below

In [8]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                   target_entity="customers",
   ...:                                   cutoff_time=cutoff_times,
   ...:                                   cutoff_time_in_index=True)
   ...: 

In [9]: feature_matrix
Out[9]: 
                                zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device)                  ...                   MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))
customer_id time                                                                                        ...                                                                                                                                 
1           2014-01-01 01:41:50    60091                3                          1.0                  ...                                               2014.0                                  1.0                                    2.0
2           2014-01-01 02:06:50    02139                3                          3.0                  ...                                               2014.0                                  1.0                                    2.0
3           2014-01-01 02:31:50    02139                1                          1.0                  ...                                               2014.0                                  1.0                                    2.0
4           2014-01-01 02:56:50    60091                0                          NaN                  ...                                                  NaN                                  NaN                                    NaN
5           2014-01-01 03:21:50    02139                2                          2.0                  ...                                               2014.0                                  1.0                                    2.0

[5 rows x 69 columns]

It is often the case that we want our labels in our calculated feature matrix so that the ordering is consistent between the labels and the rows of the feature matrix. However, adding labels to the initial dataframe means that you would have to explicitly prohibit dfs from building features with that column. To bypass this, we can provide additional columns to cutoff times which will be added directly the feature matrix. While the first column will be treated as the index and the second as a cutoff time, any additional columns will appear as features in the resulting feature matrix.

In [10]: cutoff_times['label'] = pd.Series([0, 0, 1, 0, 1])

In [11]: feature_matrix, features = ft.dfs(entityset=es,
   ....:                                   target_entity="customers",
   ....:                                   cutoff_time=cutoff_times,
   ....:                                   cutoff_time_in_index=True)
   ....: 

In [12]: feature_matrix['label']
Out[12]: 
customer_id  time               
1            2014-01-01 01:41:50    0
2            2014-01-01 02:06:50    0
3            2014-01-01 02:31:50    1
4            2014-01-01 02:56:50    0
5            2014-01-01 03:21:50    1
Name: label, dtype: int64

Running DFS with training windows

Training windows are an extension of cutoff times: starting from the cutoff time and moving backwards through time, only data within that window of time will be used to calculate features. This example creates a window that only includes transactions that occurred less than 1 hour before the cutoff

In [13]: window_fm, window_features = ft.dfs(entityset=es,
   ....:                                     target_entity="customers",
   ....:                                     cutoff_time=cutoff_times,
   ....:                                     cutoff_time_in_index=True,
   ....:                                     training_window="1 hour")
   ....: 

In [14]: window_fm
Out[14]: 
                                zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device) MODE(sessions.device)  ...    MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))  label
customer_id time                                                                                              ...                                                                                                                         
1           2014-01-01 01:41:50    60091                1                          1.0               desktop  ...                                2014.0                                  1.0                                    2.0      0
2           2014-01-01 02:06:50    02139                3                          3.0               desktop  ...                                2014.0                                  1.0                                    2.0      0
3           2014-01-01 02:31:50    02139                0                          NaN                   NaN  ...                                   NaN                                  NaN                                    NaN      1
4           2014-01-01 02:56:50    60091                0                          NaN                   NaN  ...                                   NaN                                  NaN                                    NaN      0
5           2014-01-01 03:21:50    02139                1                          1.0                tablet  ...                                2014.0                                  1.0                                    2.0      1

[5 rows x 70 columns]

We can see that that the counts for the same feature are lower when we shorten the training window

In [15]: feature_matrix[["COUNT(transactions)"]]
Out[15]: 
                                 COUNT(transactions)
customer_id time                                    
1           2014-01-01 01:41:50                   43
2           2014-01-01 02:06:50                   36
3           2014-01-01 02:31:50                   25
4           2014-01-01 02:56:50                    0
5           2014-01-01 03:21:50                   29

In [16]: window_fm[["COUNT(transactions)"]]
Out[16]: 
                                 COUNT(transactions)
customer_id time                                    
1           2014-01-01 01:41:50                   17
2           2014-01-01 02:06:50                   36
3           2014-01-01 02:31:50                    0
4           2014-01-01 02:56:50                    0
5           2014-01-01 03:21:50                   15

Creating a 3-Dimensional Feature Tensor Using Multiple Cutoff Times from make_temporal_cutoffs()

The make_temporal_cutoffs function generates a series of equally spaced cutoff times from a given set of cutoff times and instance ids. This function can be paired with dfs to create a feature tensor rather than feature matrix (but flattened as a 2D DataFrame with multiple rows per instance) and list of Featuretools feature definitions. This function takes in the the following parameters:

  • instance_ids (list, pd.Series, or np.ndarray): list of instances
  • cutoffs (list, pd.Series, or np.ndarray): associated list of cutoff times
  • window_size (str or pandas.DateOffset): amount of time between each cutoff time in the created time series
  • start (datetime.datetime or pd.Timestamp): first cutoff time in the created time series
  • num_windows (int): number of cutoff times in the created time series

Only 2 of window_size, start, and num_windows need to be specified to uniquely determine an equally-spaced set of cutoff times at which to compute each instance.

Let’s say the final cutoff times (which could be directly passed into dfs()) look like this:

In [17]: cutoffs = pd.DataFrame({
   ....:   'customer_id': [13458, 13602, 15222],
   ....:   'cutoff_time': [pd.Timestamp('2011/12/15'), pd.Timestamp('2012/10/05'), pd.Timestamp('2012/01/25')]
   ....: })
   ....: 

Then passing in window_size='3d' and num_windows=2 produces the following cutoff times to be passed into DFS.

In [18]: temporal_cutoffs = ft.make_temporal_cutoffs(cutoffs['customer_id'],
   ....:                                             cutoffs['cutoff_time'],
   ....:                                             window_size='3d',
   ....:                                             num_windows=2)
   ....: 

In [19]: temporal_cutoffs
Out[19]: 
        time  instance_id
0 2011-12-12        13458
1 2011-12-15        13458
2 2012-10-02        13602
3 2012-10-05        13602
4 2012-01-22        15222
5 2012-01-25        15222

In [20]: entityset = ft.demo.load_retail()

In [21]: feature_tensor, feature_defs = ft.dfs(entityset=entityset,
   ....:                                       target_entity='customers',
   ....:                                       cutoff_time=temporal_cutoffs,
   ....:                                       cutoff_time_in_index=True,
   ....:                                       max_features=4)
   ....: 

In [22]: feature_tensor
Out[22]: 
                               country  COUNT(orders)  SUM(order_products.quantity)  SUM(order_products.price)
customer_id time                                                                                              
13458       2011-12-12  United Kingdom             22                          5460                     967.51
            2011-12-15  United Kingdom             22                          5460                     967.51
15222       2012-01-22  United Kingdom              1                           440                       6.72
            2012-01-25  United Kingdom              1                           440                       6.72
13602       2012-10-02  United Kingdom              1                           267                      47.40
            2012-10-05  United Kingdom              1                           267                      47.40