Handling Time¶
When performing feature engineering with temporal data, carefully selecting the data that is used for any calculation is paramount. By annotating entities with a time index column and providing a cutoff time during feature calculation, Featuretools will automatically filter out any data after the cutoff time before running any calculations.
What is the Time Index?¶
The time index is the column in the data that specifies when the data in each row became known. For example, let’s examine a table of customer transactions:
In [1]: import featuretools as ft
In [2]: es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)
In [3]: es['transactions'].df.head()
Out[3]:
transaction_id session_id transaction_time amount product_id
298 298 1 2014-01-01 00:00:00 127.64 5
2 2 1 2014-01-01 00:01:05 109.48 2
308 308 1 2014-01-01 00:02:10 95.06 3
116 116 1 2014-01-01 00:03:15 78.92 4
371 371 1 2014-01-01 00:04:20 31.54 3
In this table, there is one row for every transaction and a transaction_time
column that specifies when the transaction took place. This means that transaction_time
is the time index because it indicates when the information in each row became known and available for feature calculations.
However, not every datetime column is a time index. Consider the customers
entity:
In [4]: es['customers'].df
Out[4]:
customer_id join_date date_of_birth zip_code
5 5 2010-07-17 05:27:50 1984-07-28 60091
4 4 2011-04-08 20:08:14 2006-08-15 60091
1 1 2011-04-17 10:48:33 1994-07-18 60091
3 3 2011-08-13 15:42:34 2003-11-21 13244
2 2 2012-04-15 23:31:04 1986-08-18 13244
Here, we have two time columns, join_date
and date_of_birth
. While either column might be useful for making features, the join_date
should be used as the time index because it indicates when that customer first became available in the dataset.
Important
The time index is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored.
What is the Cutoff Time?¶
The cutoff_time specifies the last point in time that a row’s data can be used for a feature calculation. Any data after this point in time will be filtered out before calculating features.
For example, let’s consider a dataset of timestamped customer transactions, where we want to predict whether customers 1
, 2
and 3
will spend $500 between 04:00
on January 1 and the end of the day. When building features for this prediction problem, we need to ensure that no data after 04:00
is used in our calculations.
We pass the cutoff time to featuretools.dfs()
or featuretools.calculate_feature_matrix()
using the cutoff_time
argument like this:
In [5]: fm, features = ft.dfs(entityset=es,
...: target_entity='customers',
...: cutoff_time=pd.Timestamp("2014-1-1 04:00"),
...: instance_ids=[1,2,3],
...: cutoff_time_in_index=True)
...:
In [6]: fm
Out[6]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.STD(transactions.amount)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) SUM(sessions.SKEW(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.COUNT(transactions)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) STD(sessions.SKEW(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.COUNT(transactions)) MAX(sessions.STD(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.COUNT(transactions)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MIN(sessions.SKEW(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) MEAN(sessions.SKEW(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.DAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.MONTH(session_start)) MODE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(transactions.sessions.customer_id) NUM_UNIQUE(transactions.sessions.device) MODE(transactions.sessions.customer_id) MODE(transactions.sessions.device)
customer_id time
1 2014-01-01 04:00:00 60091 4 3 tablet 4958.19 42.309717 139.23 -0.006928 5.81 74.002836 67 5 4 18 17 1994 2011 7 4 0 6 27.62 20 169.572874 540.04 304.601700 -0.505043 271.917637 1.285833 0.0 5.678908 5.027226 10.426572 0.500353 1613.93 8.74 5 25 46.905665 85.469167 0.234349 1.197406 1.452325 0.0 1.614843 1.235445 -0.451371 -0.233453 1025.63 5 12 39.825249 129.00 64.557200 -0.830975 1239.5475 6.905 5 16.75 42.393218 135.0100 76.150425 -0.126261 1 1 3 1 1 1 2014 4 1 2 1 3 1 tablet
2 2014-01-01 04:00:00 13244 4 2 desktop 4150.30 39.289512 146.81 -0.134786 12.07 84.700000 49 5 4 18 15 1986 2012 8 4 0 6 105.24 20 157.262738 569.29 340.791792 0.045171 307.743859 20.424007 0.0 3.862210 3.470527 8.983533 0.324809 1320.64 56.46 5 16 47.935920 96.581000 0.295458 -0.823347 1.815491 0.0 -0.169238 -0.966834 0.459305 0.651941 634.84 5 8 27.839228 138.38 76.813125 -0.455197 1037.5750 26.310 5 12.25 39.315685 142.3225 85.197948 0.011293 1 1 3 1 1 1 2014 2 1 2 1 2 2 desktop
3 2014-01-01 04:00:00 13244 1 1 tablet 941.87 47.264797 146.31 0.618455 8.19 62.791333 15 5 1 21 13 2003 2011 11 8 4 5 8.19 5 47.264797 146.31 62.791333 0.618455 NaN NaN NaN NaN NaN NaN NaN 941.87 8.19 5 15 47.264797 62.791333 0.618455 NaN NaN NaN NaN NaN NaN NaN 941.87 5 15 47.264797 146.31 62.791333 0.618455 941.8700 8.190 5 15.00 47.264797 146.3100 62.791333 0.618455 1 1 1 1 1 1 2014 1 1 2 1 1 3 tablet
Even though the entityset contains the complete transaction history for each customer, only data with a time index up to and including the cutoff time was used to calculate the features above.
Using a Cutoff Time DataFrame¶
Oftentimes, the training examples for machine learning will come from different points in time. To specify a unique cutoff time for each row of the resulting feature matrix, we can pass a dataframe where the first column is the instance id and the second column is the corresponding cutoff time.
Note
Only the first two columns are used to calculate features. Any additional columns passed through are appended to the resulting feature matrix. This is typically used to pass through machine learning labels to ensure that they stay aligned with the feature matrix.
In [7]: cutoff_times = pd.DataFrame()
In [8]: cutoff_times['customer_id'] = [1, 2, 3, 1]
In [9]: cutoff_times['time'] = pd.to_datetime(['2014-1-1 04:00',
...: '2014-1-1 05:00',
...: '2014-1-1 06:00',
...: '2014-1-1 08:00'])
...:
In [10]: cutoff_times['label'] = [True, True, False, True]
In [11]: cutoff_times
Out[11]:
customer_id time label
0 1 2014-01-01 04:00:00 True
1 2 2014-01-01 05:00:00 True
2 3 2014-01-01 06:00:00 False
3 1 2014-01-01 08:00:00 True
In [12]: fm, features = ft.dfs(entityset=es,
....: target_entity='customers',
....: cutoff_time=cutoff_times,
....: cutoff_time_in_index=True)
....:
In [13]: fm
Out[13]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.STD(transactions.amount)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) SUM(sessions.SKEW(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.COUNT(transactions)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) STD(sessions.SKEW(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.COUNT(transactions)) MAX(sessions.STD(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.COUNT(transactions)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MIN(sessions.SKEW(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) MEAN(sessions.SKEW(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.DAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.MONTH(session_start)) MODE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(transactions.sessions.customer_id) NUM_UNIQUE(transactions.sessions.device) MODE(transactions.sessions.customer_id) MODE(transactions.sessions.device) label
customer_id time
1 2014-01-01 04:00:00 60091 4 3 tablet 4958.19 42.309717 139.23 -0.006928 5.81 74.002836 67 5 4 18 17 1994 2011 7 4 0 6 27.62 20 169.572874 540.04 304.601700 -0.505043 271.917637 1.285833 0.0 5.678908 5.027226 10.426572 0.500353 1613.93 8.74 5 25 46.905665 85.469167 0.234349 1.197406 1.452325 0.0 1.614843 1.235445 -0.451371 -0.233453 1025.63 5 12 39.825249 129.00 64.557200 -0.830975 1239.5475 6.90500 5 16.75 42.393218 135.01000 76.150425 -0.126261 1 1 3 1 1 1 2014 4 1 2 1 3 1 tablet True
2 2014-01-01 05:00:00 13244 5 2 desktop 5155.26 38.047944 146.81 -0.121811 12.07 83.149355 62 5 4 18 15 1986 2012 8 4 0 6 127.06 25 190.987775 688.14 418.096407 -0.269747 266.912832 17.801322 0.0 3.361547 10.919023 8.543351 0.316873 1320.64 56.46 5 16 47.935920 96.581000 0.295458 -0.667256 1.959531 0.0 -0.379092 -0.213518 -1.814717 1.082192 634.84 5 8 27.839228 118.85 76.813125 -0.455197 1031.0520 25.41200 5 12.40 38.197555 137.62800 83.619281 -0.053949 1 1 4 1 1 1 2014 2 1 2 1 2 2 desktop True
3 2014-01-01 06:00:00 13244 4 2 desktop 2867.69 40.349758 146.31 0.318315 6.65 65.174773 44 5 1 21 13 2003 2011 11 8 4 5 126.66 16 119.136697 493.07 290.968018 0.860577 417.557763 40.508892 2.0 7.118052 22.808351 16.540737 0.500999 944.85 91.76 5 17 47.264797 91.760000 0.618455 -1.977878 1.874170 -2.0 -1.330938 1.722323 -1.060639 0.201588 91.76 1 1 35.704680 91.76 55.579412 -0.289466 716.9225 31.66500 4 11.00 39.712232 123.26750 72.742004 0.286859 1 1 2 1 1 1 2014 1 1 2 1 2 3 desktop False
1 2014-01-01 08:00:00 60091 8 3 mobile 9025.62 40.442059 139.43 0.019698 5.81 71.631905 126 5 4 18 17 1994 2011 7 4 0 6 78.59 40 312.745952 1057.97 582.193117 -0.476122 279.510713 6.954507 0.0 4.062019 7.322191 13.759314 0.589386 1613.93 26.36 5 25 46.905665 88.755625 0.640252 0.778170 2.440005 0.0 1.946018 -0.312355 -0.780493 -0.424949 809.97 5 12 30.450261 118.90 50.623125 -1.038434 1128.2025 9.82375 5 15.75 39.093244 132.24625 72.774140 -0.059515 1 1 4 1 1 1 2014 4 1 2 1 3 1 mobile True
We can now see that every row of the feature matrix is calculated at the corresponding time in the cutoff time dataframe. Because we calculate each row at a different time, it is possible to have a repeat customer. In this case, we calculated the feature vector for customer 1 at both 04:00
and 08:00
.
Training Window¶
By default, all data up to and including the cutoff time is used. We can restrict the amount of historical data that is selected for calculations using a “training window.”
Here’s an example of using a two hour training window:
In [14]: window_fm, window_features = ft.dfs(entityset=es,
....: target_entity="customers",
....: cutoff_time=cutoff_times,
....: cutoff_time_in_index=True,
....: training_window="2 hour")
....:
In [15]: window_fm
Out[15]:
zip_code COUNT(sessions) NUM_UNIQUE(sessions.device) MODE(sessions.device) SUM(transactions.amount) STD(transactions.amount) MAX(transactions.amount) SKEW(transactions.amount) MIN(transactions.amount) MEAN(transactions.amount) COUNT(transactions) NUM_UNIQUE(transactions.product_id) MODE(transactions.product_id) DAY(date_of_birth) DAY(join_date) YEAR(date_of_birth) YEAR(join_date) MONTH(date_of_birth) MONTH(join_date) WEEKDAY(date_of_birth) WEEKDAY(join_date) SUM(sessions.MIN(transactions.amount)) SUM(sessions.NUM_UNIQUE(transactions.product_id)) SUM(sessions.STD(transactions.amount)) SUM(sessions.MAX(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) SUM(sessions.SKEW(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MIN(transactions.amount)) STD(sessions.NUM_UNIQUE(transactions.product_id)) STD(sessions.COUNT(transactions)) STD(sessions.MAX(transactions.amount)) STD(sessions.MEAN(transactions.amount)) STD(sessions.SKEW(transactions.amount)) MAX(sessions.SUM(transactions.amount)) MAX(sessions.MIN(transactions.amount)) MAX(sessions.NUM_UNIQUE(transactions.product_id)) MAX(sessions.COUNT(transactions)) MAX(sessions.STD(transactions.amount)) MAX(sessions.MEAN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) SKEW(sessions.SUM(transactions.amount)) SKEW(sessions.MIN(transactions.amount)) SKEW(sessions.NUM_UNIQUE(transactions.product_id)) SKEW(sessions.COUNT(transactions)) SKEW(sessions.STD(transactions.amount)) SKEW(sessions.MAX(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) MIN(sessions.SUM(transactions.amount)) MIN(sessions.NUM_UNIQUE(transactions.product_id)) MIN(sessions.COUNT(transactions)) MIN(sessions.STD(transactions.amount)) MIN(sessions.MAX(transactions.amount)) MIN(sessions.MEAN(transactions.amount)) MIN(sessions.SKEW(transactions.amount)) MEAN(sessions.SUM(transactions.amount)) MEAN(sessions.MIN(transactions.amount)) MEAN(sessions.NUM_UNIQUE(transactions.product_id)) MEAN(sessions.COUNT(transactions)) MEAN(sessions.STD(transactions.amount)) MEAN(sessions.MAX(transactions.amount)) MEAN(sessions.MEAN(transactions.amount)) MEAN(sessions.SKEW(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) NUM_UNIQUE(sessions.YEAR(session_start)) NUM_UNIQUE(sessions.MODE(transactions.product_id)) NUM_UNIQUE(sessions.MONTH(session_start)) NUM_UNIQUE(sessions.WEEKDAY(session_start)) MODE(sessions.DAY(session_start)) MODE(sessions.YEAR(session_start)) MODE(sessions.MODE(transactions.product_id)) MODE(sessions.MONTH(session_start)) MODE(sessions.WEEKDAY(session_start)) NUM_UNIQUE(transactions.sessions.customer_id) NUM_UNIQUE(transactions.sessions.device) MODE(transactions.sessions.customer_id) MODE(transactions.sessions.device) label
customer_id time
1 2014-01-01 04:00:00 60091 2 2 desktop 2077.66 43.772157 139.09 -0.187686 5.81 76.950370 27 5 4 18 17 1994 2011 7 4 0 6 12.59 10 86.730914 271.81 155.604500 -0.604638 18.667619 0.685894 0.000000 2.121320 4.504270 10.842658 0.747633 1052.03 6.78 5 15 46.905665 85.469167 0.226337 NaN NaN NaN NaN NaN NaN NaN 1025.63 5 12 39.825249 132.72 70.135333 -0.830975 1038.830000 6.295000 5.000000 13.500000 43.365457 135.905000 77.802250 -0.302319 1 1 2 1 1 1 2014 1 1 2 1 2 1 desktop True
2 2014-01-01 05:00:00 13244 3 2 desktop 2605.61 36.077146 146.81 -0.198611 12.07 84.051935 31 5 4 18 15 1986 2012 8 4 0 6 90.35 15 109.500185 404.04 253.240615 -0.110009 203.331699 23.329038 0.000000 2.516611 14.342521 10.587085 0.242542 1004.96 56.46 5 13 47.935920 96.581000 0.130019 -1.660092 1.397956 0.000000 0.585583 1.121470 -1.083626 1.659252 634.84 5 8 27.839228 118.85 77.304615 -0.314918 868.536667 30.116667 5.000000 10.333333 36.500062 134.680000 84.413538 -0.036670 1 1 3 1 1 1 2014 1 1 2 1 2 2 desktop True
3 2014-01-01 06:00:00 13244 3 1 desktop 1925.82 37.130891 128.26 0.110145 6.65 66.407586 29 5 1 21 13 2003 2011 11 8 4 5 118.47 11 71.871900 346.76 228.176684 0.242122 477.281339 45.761028 2.309401 8.082904 20.648490 18.557570 0.580573 944.85 91.76 5 17 36.167220 91.760000 0.531588 -1.705607 1.566223 -1.732051 -0.722109 NaN -1.721498 -1.081879 91.76 1 1 35.704680 91.76 55.579412 -0.289466 641.940000 39.490000 3.666667 9.666667 35.935950 115.586667 76.058895 0.121061 1 1 2 1 1 1 2014 1 1 2 1 1 3 desktop False
1 2014-01-01 08:00:00 60091 3 2 mobile 3124.15 38.952172 139.43 0.047120 5.91 66.471277 47 5 4 18 17 1994 2011 7 4 0 6 24.61 15 107.128899 384.44 198.984750 -0.003438 330.655558 3.016195 0.000000 0.577350 10.415432 19.935229 0.906666 1420.09 11.62 5 16 44.354104 88.755625 0.640252 1.606791 1.443486 0.000000 -1.732051 1.612576 0.846298 1.344879 809.97 5 15 30.450261 118.90 50.623125 -1.038434 1041.383333 8.203333 5.000000 15.666667 35.709633 128.146667 66.328250 -0.001146 1 1 3 1 1 1 2014 1 1 2 1 2 1 mobile True
We can see that that the counts for the same feature are lower after we shorten the training window:
In [16]: fm[["COUNT(transactions)"]]
Out[16]:
COUNT(transactions)
customer_id time
1 2014-01-01 04:00:00 67
2 2014-01-01 05:00:00 62
3 2014-01-01 06:00:00 44
1 2014-01-01 08:00:00 126
In [17]: window_fm[["COUNT(transactions)"]]