Handling Time¶
When performing feature engineering with temporal data, carefully selecting the data that is used for any calculation is paramount. By annotating entities with a time index column and providing a cutoff time during feature calculation, Featuretools will automatically filter out any data after the cutoff time before running any calculations.
What is the Time Index?¶
The time index is the column in the data that specifies when the data in each row became known. For example, let’s examine a table of customer transactions:
In [1]: import featuretools as ft
In [2]: es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)
In [3]: es['transactions'].df.head()
Out[3]:
transaction_id session_id transaction_time amount product_id
298 298 1 2014-01-01 00:00:00 127.64 5
2 2 1 2014-01-01 00:01:05 109.48 2
308 308 1 2014-01-01 00:02:10 95.06 3
116 116 1 2014-01-01 00:03:15 78.92 4
371 371 1 2014-01-01 00:04:20 31.54 3
In this table, there is one row for every transaction and a transaction_time
column that specifies when the transaction took place. This means that transaction_time
is the time index because it indicates when the information in each row became known and available for feature calculations.
However, not every datetime column is a time index. Consider the customers
entity:
In [4]: es['customers'].df
Out[4]:
customer_id join_date date_of_birth zip_code
5 5 2010-07-17 05:27:50 1984-07-28 60091
4 4 2011-04-08 20:08:14 2006-08-15 60091
1 1 2011-04-17 10:48:33 1994-07-18 60091
3 3 2011-08-13 15:42:34 2003-11-21 13244
2 2 2012-04-15 23:31:04 1986-08-18 13244
Here, we have two time columns, join_date
and date_of_birth
. While either column might be useful for making features, the join_date
should be used as the time index because it indicates when that customer first became available in the dataset.
Important
The time index is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored.
What is the Cutoff Time?¶
The cutoff_time specifies the last point in time that a row’s data can be used for a feature calculation. Any data after this point in time will be filtered out before calculating features.
For example, let’s consider a dataset of timestamped customer transactions, where we want to predict whether customers 1
, 2
and 3
will spend $500 between 04:00
on January 1 and the end of the day. When building features for this prediction problem, we need to ensure that no data after 04:00
is used in our calculations.

We pass the cutoff time to featuretools.dfs()
or featuretools.calculate_feature_matrix()
using the cutoff_time
argument like this:
In [5]: fm, features = ft.dfs(entityset=es,
...: target_entity='customers',
...: cutoff_time=pd.Timestamp("2014-1-1 04:00"),
...: instance_ids=[1,2,3],
...: cutoff_time_in_index=True)
...:
In [6]: fm
Out[6]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) SUM(sessions.SKEW(transactions.amount)) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.STD(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.COUNT(transactions)) STD(sessions.SKEW(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.COUNT(transactions)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.STD(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.COUNT(transactions)) MIN(sessions.SKEW(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.SKEW(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MONTH(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.DAY(session_start)) MODE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(transactions.sessions.device) NUM_UNIQUE(transactions.sessions.customer_id) MODE(transactions.sessions.device) MODE(transactions.sessions.customer_id)
customer_id time
1 2014-01-01 04:00:00 60091 4 3 tablet 4958.19 42.309717 139.23 -0.006928 5.81 74.002836 67 5 4 18 17 1994 2011 7 4 0 6 -0.505043 27.62 20 540.04 169.572874 304.601700 271.917637 5.678908 0.500353 1.285833 0.0 5.027226 10.426572 1613.93 25 0.234349 8.74 5 46.905665 85.469167 1.197406 1.614843 1.452325 0.0 -0.451371 1.235445 -0.233453 1025.63 12 -0.830975 5 129.00 39.825249 64.557200 1239.5475 16.75 -0.126261 6.905 5 135.0100 42.393218 76.150425 1 1 3 1 1 2014 1 4 1 2 3 1 tablet 1
2 2014-01-01 04:00:00 13244 4 2 desktop 4150.30 39.289512 146.81 -0.134786 12.07 84.700000 49 5 4 18 15 1986 2012 8 4 0 6 0.045171 105.24 20 569.29 157.262738 340.791792 307.743859 3.862210 0.324809 20.424007 0.0 3.470527 8.983533 1320.64 16 0.295458 56.46 5 47.935920 96.581000 -0.823347 -0.169238 1.815491 0.0 0.459305 -0.966834 0.651941 634.84 8 -0.455197 5 138.38 27.839228 76.813125 1037.5750 12.25 0.011293 26.310 5 142.3225 39.315685 85.197948 1 1 3 1 1 2014 1 2 1 2 2 1 desktop 2
3 2014-01-01 04:00:00 13244 1 1 tablet 941.87 47.264797 146.31 0.618455 8.19 62.791333 15 5 1 21 13 2003 2011 11 8 4 5 0.618455 8.19 5 146.31 47.264797 62.791333 NaN NaN NaN NaN NaN NaN NaN 941.87 15 0.618455 8.19 5 47.264797 62.791333 NaN NaN NaN NaN NaN NaN NaN 941.87 15 0.618455 5 146.31 47.264797 62.791333 941.8700 15.00 0.618455 8.190 5 146.3100 47.264797 62.791333 1 1 1 1 1 2014 1 1 1 2 1 1 tablet 3
Even though the entityset contains the complete transaction history for each customer, only data with a time index up to and including the cutoff time was used to calculate the features above.
Using a Cutoff Time DataFrame¶
Oftentimes, the training examples for machine learning will come from different points in time. To specify a unique cutoff time for each row of the resulting feature matrix, we can pass a dataframe where the first column is the instance id and the second column is the corresponding cutoff time.
Note
Only the first two columns are used to calculate features. Any additional columns passed through are appended to the resulting feature matrix. This is typically used to pass through machine learning labels to ensure that they stay aligned with the feature matrix.
In [7]: cutoff_times = pd.DataFrame()
In [8]: cutoff_times['customer_id'] = [1, 2, 3, 1]
In [9]: cutoff_times['time'] = pd.to_datetime(['2014-1-1 04:00',
...: '2014-1-1 05:00',
...: '2014-1-1 06:00',
...: '2014-1-1 08:00'])
...:
In [10]: cutoff_times['label'] = [True, True, False, True]
In [11]: cutoff_times
Out[11]:
customer_id time label
0 1 2014-01-01 04:00:00 True
1 2 2014-01-01 05:00:00 True
2 3 2014-01-01 06:00:00 False
3 1 2014-01-01 08:00:00 True
In [12]: fm, features = ft.dfs(entityset=es,
....: target_entity='customers',
....: cutoff_time=cutoff_times,
....: cutoff_time_in_index=True)
....:
In [13]: fm
Out[13]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) SUM(sessions.SKEW(transactions.amount)) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.STD(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.COUNT(transactions)) STD(sessions.SKEW(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.COUNT(transactions)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.STD(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.COUNT(transactions)) MIN(sessions.SKEW(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.SKEW(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MONTH(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.DAY(session_start)) MODE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(transactions.sessions.device) NUM_UNIQUE(transactions.sessions.customer_id) MODE(transactions.sessions.device) MODE(transactions.sessions.customer_id) label
customer_id time
1 2014-01-01 04:00:00 60091 4 3 tablet 4958.19 42.309717 139.23 -0.006928 5.81 74.002836 67 5 4 18 17 1994 2011 7 4 0 6 -0.505043 27.62 20 540.04 169.572874 304.601700 271.917637 5.678908 0.500353 1.285833 0.0 5.027226 10.426572 1613.93 25 0.234349 8.74 5 46.905665 85.469167 1.197406 1.614843 1.452325 0.0 -0.451371 1.235445 -0.233453 1025.63 12 -0.830975 5 129.00 39.825249 64.557200 1239.5475 16.75 -0.126261 6.90500 5 135.01000 42.393218 76.150425 1 1 3 1 1 2014 1 4 1 2 3 1 tablet 1 True
2 2014-01-01 05:00:00 13244 5 2 desktop 5155.26 38.047944 146.81 -0.121811 12.07 83.149355 62 5 4 18 15 1986 2012 8 4 0 6 -0.269747 127.06 25 688.14 190.987775 418.096407 266.912832 3.361547 0.316873 17.801322 0.0 10.919023 8.543351 1320.64 16 0.295458 56.46 5 47.935920 96.581000 -0.667256 -0.379092 1.959531 0.0 -1.814717 -0.213518 1.082192 634.84 8 -0.455197 5 118.85 27.839228 76.813125 1031.0520 12.40 -0.053949 25.41200 5 137.62800 38.197555 83.619281 1 1 4 1 1 2014 1 2 1 2 2 1 desktop 2 True
3 2014-01-01 06:00:00 13244 4 2 desktop 2867.69 40.349758 146.31 0.318315 6.65 65.174773 44 5 1 21 13 2003 2011 11 8 4 5 0.860577 126.66 16 493.07 119.136697 290.968018 417.557763 7.118052 0.500999 40.508892 2.0 22.808351 16.540737 944.85 17 0.618455 91.76 5 47.264797 91.760000 -1.977878 -1.330938 1.874170 -2.0 -1.060639 1.722323 0.201588 91.76 1 -0.289466 1 91.76 35.704680 55.579412 716.9225 11.00 0.286859 31.66500 4 123.26750 39.712232 72.742004 1 1 2 1 1 2014 1 1 1 2 2 1 desktop 3 False
1 2014-01-01 08:00:00 60091 8 3 mobile 9025.62 40.442059 139.43 0.019698 5.81 71.631905 126 5 4 18 17 1994 2011 7 4 0 6 -0.476122 78.59 40 1057.97 312.745952 582.193117 279.510713 4.062019 0.589386 6.954507 0.0 7.322191 13.759314 1613.93 25 0.640252 26.36 5 46.905665 88.755625 0.778170 1.946018 2.440005 0.0 -0.780493 -0.312355 -0.424949 809.97 12 -1.038434 5 118.90 30.450261 50.623125 1128.2025 15.75 -0.059515 9.82375 5 132.24625 39.093244 72.774140 1 1 4 1 1 2014 1 4 1 2 3 1 mobile 1 True
We can now see that every row of the feature matrix is calculated at the corresponding time in the cutoff time dataframe. Because we calculate each row at a different time, it is possible to have a repeat customer. In this case, we calculated the feature vector for customer 1 at both 04:00
and 08:00
.
Training Window¶
By default, all data up to and including the cutoff time is used. We can restrict the amount of historical data that is selected for calculations using a “training window.”
Here’s an example of using a two hour training window:
In [14]: window_fm, window_features = ft.dfs(entityset=es,
....: target_entity="customers",
....: cutoff_time=cutoff_times,
....: cutoff_time_in_index=True,
....: training_window="2 hour")
....:
In [15]: window_fm
Out[15]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) SUM(sessions.SKEW(transactions.amount)) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.STD(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.COUNT(transactions)) STD(sessions.SKEW(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.COUNT(transactions)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.STD(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.COUNT(transactions)) MIN(sessions.SKEW(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.SKEW(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MONTH(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.DAY(session_start)) MODE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(transactions.sessions.device) NUM_UNIQUE(transactions.sessions.customer_id) MODE(transactions.sessions.device) MODE(transactions.sessions.customer_id) label
customer_id time
1 2014-01-01 04:00:00 60091 2 2 desktop 2077.66 43.772157 139.09 -0.187686 5.81 76.950370 27 5 4 18 17 1994 2011 7 4 0 6 -0.604638 12.59 10 271.81 86.730914 155.604500 18.667619 2.121320 0.747633 0.685894 0.000000 4.504270 10.842658 1052.03 15 0.226337 6.78 5 46.905665 85.469167 NaN NaN NaN NaN NaN NaN NaN 1025.63 12 -0.830975 5 132.72 39.825249 70.135333 1038.830000 13.500000 -0.302319 6.295000 5.000000 135.905000 43.365457 77.802250 1 1 2 1 1 2014 1 1 1 2 2 1 desktop 1 True
2 2014-01-01 05:00:00 13244 3 2 desktop 2605.61 36.077146 146.81 -0.198611 12.07 84.051935 31 5 4 18 15 1986 2012 8 4 0 6 -0.110009 90.35 15 404.04 109.500185 253.240615 203.331699 2.516611 0.242542 23.329038 0.000000 14.342521 10.587085 1004.96 13 0.130019 56.46 5 47.935920 96.581000 -1.660092 0.585583 1.397956 0.000000 -1.083626 1.121470 1.659252 634.84 8 -0.314918 5 118.85 27.839228 77.304615 868.536667 10.333333 -0.036670 30.116667 5.000000 134.680000 36.500062 84.413538 1 1 3 1 1 2014 1 1 1 2 2 1 desktop 2 True
3 2014-01-01 06:00:00 13244 3 1 desktop 1925.82 37.130891 128.26 0.110145 6.65 66.407586 29 5 1 21 13 2003 2011 11 8 4 5 0.242122 118.47 11 346.76 71.871900 228.176684 477.281339 8.082904 0.580573 45.761028 2.309401 20.648490 18.557570 944.85 17 0.531588 91.76 5 36.167220 91.760000 -1.705607 -0.722109 1.566223 -1.732051 -1.721498 NaN -1.081879 91.76 1 -0.289466 1 91.76 35.704680 55.579412 641.940000 9.666667 0.121061 39.490000 3.666667 115.586667 35.935950 76.058895 1 1 2 1 1 2014 1 1 1 2 1 1 desktop 3 False
1 2014-01-01 08:00:00 60091 3 2 mobile 3124.15 38.952172 139.43 0.047120 5.91 66.471277 47 5 4 18 17 1994 2011 7 4 0 6 -0.003438 24.61 15 384.44 107.128899 198.984750 330.655558 0.577350 0.906666 3.016195 0.000000 10.415432 19.935229 1420.09 16 0.640252 11.62 5 44.354104 88.755625 1.606791 -1.732051 1.443486 0.000000 0.846298 1.612576 1.344879 809.97 15 -1.038434 5 118.90 30.450261 50.623125 1041.383333 15.666667 -0.001146 8.203333 5.000000 128.146667 35.709633 66.328250 1 1 3 1 1 2014 1 1 1 2 2 1 mobile 1 True
We can see that that the counts for the same feature are lower after we shorten the training window:
In [16]: fm[["COUNT(transactions)"]]
Out[16]:
COUNT(transactions)
customer_id time
1 2014-01-01 04:00:00 67
2 2014-01-01 05:00:00 62
3 2014-01-01 06:00:00 44
1 2014-01-01 08:00:00 126
In [17]: window_fm[["COUNT(transactions)"]]