Handling Time

Time is a naturally relevant factor in many predictive modeling problems. Consider the following questions:

  1. How much revenue will I bring in next month?
  2. What is the expected delay on my next flight?
  3. Will my user purchase an upgrade to their membership?

A good way to estimate how effective you are at predicting revenue would be to see how you would have done predicting it last month or the month before. You would similarly be interested in checking if you were able to predict the delay on your previous flight, or how good you are historically at detecting customers who would upgrade.

However, it is immensely tricky to make a feature matrix by hand for those predictions. To create historical predictions you need to set a time to make a prediction for every row and then cut off any data in your dataset that happens after that time. Then, you can use the remaining valid data to make any features you like.
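As a rough illustration of that cut-off step, here is a minimal sketch in plain Python rather than Featuretools. The `transactions` records and the `revenue_before` helper are hypothetical names invented for this example, and treating a record stamped exactly at the cutoff as usable is a choice made here, not a statement about the library.

```python
from datetime import datetime

# Hypothetical transaction log: (timestamp, amount).
transactions = [
    (datetime(2014, 1, 1, 0, 0), 100.0),
    (datetime(2014, 1, 1, 3, 30), 50.0),
    (datetime(2014, 1, 1, 5, 0), 25.0),  # happens after the cutoff used below
]


def revenue_before(records, cutoff):
    """Sum amounts using only records stamped at or before `cutoff`."""
    return sum(amount for ts, amount in records if ts <= cutoff)


cutoff = datetime(2014, 1, 1, 4, 0)
total_at_cutoff = revenue_before(transactions, cutoff)
print(total_at_cutoff)  # 150.0: the 05:00 transaction is ignored
```

Doing this by hand for one feature is easy; doing it consistently for every feature and every row is where mistakes tend to creep in.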

Some of the most powerful functionality in Featuretools is the ability to accurately and precisely handle time. To make the most of that functionality, it is necessary to understand how provided times will be used.

Outline

This page answers the question: why should I pay attention to datetimes in my data? There are two interconnected parts to that answer:

  1. What are the implications of setting a time index?
  2. How does Featuretools take in predictions?

The first section explains how to handle the complexities that can come up when assigning times to your data, and the second shows how to use those times to make rows of a feature matrix. While time can be a sticking point for our users, we have found that it’s often a useful construct in utilizing data from the real world.

Introduction to the Time Index

We’ll start with the Mock Customer entityset.

In [1]: import featuretools as ft

In [2]: es_mc = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)

In [3]: es_mc['transactions'].df.head()
Out[3]: 
     transaction_id  session_id    transaction_time  amount product_id
298             298           1 2014-01-01 00:00:00  127.64          5
2                 2           1 2014-01-01 00:01:05  109.48          2
308             308           1 2014-01-01 00:02:10   95.06          3
116             116           1 2014-01-01 00:03:15   78.92          4
371             371           1 2014-01-01 00:04:20   31.54          3

The transactions entity has one row for every transaction and a transaction_time for every row. The user has an option to set a time index for any entity they create, representing the first time information from the row can be used. In this example, most people would make the reasonable choice to set the transaction_time as the time index for the transactions entity. Not every datetime column is a time index, so the choice is not always straightforward. Consider the customers entity:

In [4]: es_mc['customers'].df
Out[4]: 
   customer_id           join_date date_of_birth zip_code
5            5 2010-07-17 05:27:50    1984-07-28    60091
4            4 2011-04-08 20:08:14    2006-08-15    60091
1            1 2011-04-17 10:48:33    1994-07-18    60091
3            3 2011-08-13 15:42:34    2003-11-21    13244
2            2 2012-04-15 23:31:04    1986-08-18    13244

Here we have two time columns, join_date and date_of_birth. While either column might be useful for making features, the join_date should be used as the time index. It represents when the data owner learns about the existence of a given customer. Generically: the time index is the first time anything from a row can be known to the dataset owner. Rows are treated as non-existent prior to the time index.

Important

The time index is defined as the first time information from a row can be used. It represents the first time anything from a row can be known to the dataset owner.
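To make the "rows are non-existent before their time index" rule concrete, here is a small sketch in plain Python. The join dates are copied from the customers table above; the `known_customers` helper is a hypothetical name used only for illustration and is not part of Featuretools.

```python
from datetime import datetime

# join_date values copied from the customers table above.
join_dates = {
    5: datetime(2010, 7, 17, 5, 27, 50),
    4: datetime(2011, 4, 8, 20, 8, 14),
    1: datetime(2011, 4, 17, 10, 48, 33),
    3: datetime(2011, 8, 13, 15, 42, 34),
    2: datetime(2012, 4, 15, 23, 31, 4),
}


def known_customers(time_index, at):
    """Return the sorted ids of rows whose time index is at or before `at`."""
    return sorted(cid for cid, first_known in time_index.items() if first_known <= at)


print(known_customers(join_dates, datetime(2011, 5, 1)))  # [1, 4, 5]
print(known_customers(join_dates, datetime(2014, 1, 1)))  # [1, 2, 3, 4, 5]
```

Using date_of_birth as the time index instead would make every customer appear to exist years before they joined.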

In databases, information tends to be written after an event has passed. This can be problematic on the machine learning side – it’s often necessary to ignore entire columns to avoid leaking labels. If you’re interested in how to safely use those columns, the advanced time index section below explores how time can be used with a dataset from the US Department of Transportation. Before we get there, we’re going to show how to make predictions using these time indices.

Introduction to Cutoff Times

For a given EntitySet, there are many possible prediction problems that you might want to solve. Trying to predict customer purchases an hour in advance uses different data than trying to predict purchases a day in advance. Often, it’s desirable to test multiple questions and explore which one you want to use. Featuretools makes that process easier through cutoff times.

A cutoff_time dataframe is a concise way of passing complicated instructions to Deep Feature Synthesis (DFS). Each row contains an instance id, a time and, optionally, a label. For every unique id-time pair, we will create a row of the feature matrix.

Let’s do a short example. We want to predict whether customers 1, 2 and 3 will spend $500 between 04:00 on January 1 and the end of the day. The time column emulates the way a human would make a historical prediction. It is an instruction to not use any future information when constructing that row, even if we have it in our entityset.

Important

A cutoff_time dataframe is a concise way of passing complicated instructions to Deep Feature Synthesis. For every id-time pair passed in, DFS creates a row of the feature matrix for that id at that time.

In this case, we’re making predictions for all three customers at the same time, 2014-1-1 04:00, so we set that as the second column. We have also checked that customers 1 and 2 will spend $500 while customer 3 will not, so we include those labels as a third column.

retail cutoff time diagram

We will use all of the information between the time_index of rows 1, 2 and 3 and the prediction time 04:00 2014-1-1 to make predictions about what will happen for the rest of the day. The code here assumes that pandas has already been imported as pd.

In [5]: ct = pd.DataFrame()

In [6]: ct['customer_id'] = [1, 2, 3]

In [7]: ct['time'] = pd.to_datetime(['2014-1-1 04:00',
   ...:                              '2014-1-1 04:00',
   ...:                              '2014-1-1 04:00'])
   ...: 

In [8]: ct['label'] = [True, True, False]

In [9]: ct
Out[9]: 
   customer_id                time  label
0            1 2014-01-01 04:00:00   True
1            2 2014-01-01 04:00:00   True
2            3 2014-01-01 04:00:00  False

In [10]: fm, features = ft.dfs(entityset=es_mc,
   ....:                       target_entity='customers',
   ....:                       cutoff_time=ct,
   ....:                       cutoff_time_in_index=True)
   ....: 

In [11]: fm
Out[11]: 
                                zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device) MODE(sessions.device)  SUM(transactions.amount)  STD(transactions.amount)  MAX(transactions.amount)  SKEW(transactions.amount)  MIN(transactions.amount)  MEAN(transactions.amount)  COUNT(transactions)  NUM_UNIQUE(transactions.product_id)  MODE(transactions.product_id)  DAY(join_date)  DAY(date_of_birth)  YEAR(join_date)  YEAR(date_of_birth)  MONTH(join_date)  MONTH(date_of_birth)  WEEKDAY(join_date)  WEEKDAY(date_of_birth)  SUM(sessions.STD(transactions.amount))  SUM(sessions.MAX(transactions.amount))  SUM(sessions.SKEW(transactions.amount))  SUM(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  SUM(sessions.NUM_UNIQUE(transactions.product_id))  STD(sessions.SUM(transactions.amount))  STD(sessions.MAX(transactions.amount))  STD(sessions.SKEW(transactions.amount))  STD(sessions.MIN(transactions.amount))  STD(sessions.MEAN(transactions.amount))  STD(sessions.COUNT(transactions))  STD(sessions.NUM_UNIQUE(transactions.product_id))  MAX(sessions.SUM(transactions.amount))  MAX(sessions.STD(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  MAX(sessions.MEAN(transactions.amount))  MAX(sessions.COUNT(transactions))  MAX(sessions.NUM_UNIQUE(transactions.product_id))  SKEW(sessions.SUM(transactions.amount))  SKEW(sessions.STD(transactions.amount))  SKEW(sessions.MAX(transactions.amount))  SKEW(sessions.MIN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  SKEW(sessions.COUNT(transactions))  SKEW(sessions.NUM_UNIQUE(transactions.product_id))  MIN(sessions.SUM(transactions.amount))  MIN(sessions.STD(transactions.amount))  MIN(sessions.MAX(transactions.amount))  MIN(sessions.SKEW(transactions.amount))  MIN(sessions.MEAN(transactions.amount))  MIN(sessions.COUNT(transactions))  MIN(sessions.NUM_UNIQUE(transactions.product_id))  MEAN(sessions.SUM(transactions.amount))  
MEAN(sessions.STD(transactions.amount))  MEAN(sessions.MAX(transactions.amount))  MEAN(sessions.SKEW(transactions.amount))  MEAN(sessions.MIN(transactions.amount))  MEAN(sessions.MEAN(transactions.amount))  MEAN(sessions.COUNT(transactions))  MEAN(sessions.NUM_UNIQUE(transactions.product_id))  NUM_UNIQUE(sessions.MODE(transactions.product_id))  NUM_UNIQUE(sessions.DAY(session_start))  NUM_UNIQUE(sessions.YEAR(session_start))  NUM_UNIQUE(sessions.MONTH(session_start))  NUM_UNIQUE(sessions.WEEKDAY(session_start))  MODE(sessions.MODE(transactions.product_id))  MODE(sessions.DAY(session_start))  MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))  label
customer_id time
1           2014-01-01 04:00:00    60091                4                            3                tablet                   4958.19                 42.309717                    139.23                  -0.006928                      5.81                  74.002836                   67                                    5                              4              17                  18             2011                 1994                 4                     7                   6                       0                              169.572874                                  540.04                                -0.505043                                   27.62                               304.601700                                                 20                              271.917637                                5.027226                                 0.500353                                1.285833                                10.426572                           5.678908                                                0.0                                 1613.93                               46.905665                                 0.234349                                    8.74                                85.469167                                 25                                                  5                                 1.197406                                 1.235445                                -0.451371                                 1.452325                                 -0.233453                            1.614843                                                0.0                                  1025.63                               39.825249                                  129.00                                -0.830975                                64.557200                                 12                                                  5                                1239.5475                                
42.393218                                 135.0100                                 -0.126261                                    6.905                                 76.150425                               16.75                                                  5                                                   3                                         1                                         1                                          1                                            1                                             4                                  1                                2014                                    1                                      2   True
2           2014-01-01 04:00:00    13244                4                            2               desktop                   4150.30                 39.289512                    146.81                  -0.134786                     12.07                  84.700000                   49                                    5                              4              15                  18             2012                 1986                 4                     8                   6                       0                              157.262738                                  569.29                                 0.045171                                  105.24                               340.791792                                                 20                              307.743859                                3.470527                                 0.324809                               20.424007                                 8.983533                           3.862210                                                0.0                                 1320.64                               47.935920                                 0.295458                                   56.46                                96.581000                                 16                                                  5                                -0.823347                                -0.966834                                 0.459305                                 1.815491                                  0.651941                           -0.169238                                                0.0                                   634.84                               27.839228                                  138.38                                -0.455197                                76.813125                                  8                                                  5                                1037.5750                                
39.315685                                 142.3225                                  0.011293                                   26.310                                 85.197948                               12.25                                                  5                                                   3                                         1                                         1                                          1                                            1                                             2                                  1                                2014                                    1                                      2   True
3           2014-01-01 04:00:00    13244                1                            1                tablet                    941.87                 47.264797                    146.31                   0.618455                      8.19                  62.791333                   15                                    5                              1              13                  21             2011                 2003                 8                    11                   5                       4                               47.264797                                  146.31                                 0.618455                                    8.19                                62.791333                                                  5                                     NaN                                     NaN                                      NaN                                     NaN                                      NaN                                NaN                                                NaN                                  941.87                               47.264797                                 0.618455                                    8.19                                62.791333                                 15                                                  5                                      NaN                                      NaN                                      NaN                                      NaN                                       NaN                                 NaN                                                NaN                                   941.87                               47.264797                                  146.31                                 0.618455                                62.791333                                 15                                                  5                                 941.8700                                
47.264797                                 146.3100                                  0.618455                                    8.190                                 62.791333                               15.00                                                  5                                                   1                                         1                                         1                                          1                                            1                                             1                                  1                                2014                                    1                                      2  False

We made 74 features for the three customers using only data whose time index was before the cutoff time. Since you can specify the prediction time for every row, you have a lot of control over which data will be used for a given row of your feature matrix. An advanced use of cutoff times can be found in the second part of the next section.
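The per-row control described above can be sketched without Featuretools. In this hedged example, the transaction data, the cutoff triples and the `build_rows` helper are all hypothetical; it only illustrates the idea that each id-time pair yields its own row, computed from data up to that pair's time.

```python
from datetime import datetime

# Hypothetical transactions: (customer_id, transaction_time, amount).
transactions = [
    (1, datetime(2014, 1, 1, 1, 0), 40.0),
    (1, datetime(2014, 1, 1, 5, 0), 60.0),
    (2, datetime(2014, 1, 1, 2, 0), 10.0),
    (2, datetime(2014, 1, 1, 3, 0), 30.0),
]

# Cutoff times: one (id, time, label) triple per desired row.
cutoff_times = [
    (1, datetime(2014, 1, 1, 4, 0), True),
    (1, datetime(2014, 1, 1, 6, 0), True),
    (2, datetime(2014, 1, 1, 2, 30), False),
]


def build_rows(records, cutoffs):
    """Build one feature row per id-time pair, using only data up to that time."""
    rows = {}
    for cid, cutoff, label in cutoffs:
        amounts = [amt for rid, ts, amt in records if rid == cid and ts <= cutoff]
        rows[(cid, cutoff)] = {
            "COUNT(transactions)": len(amounts),
            "SUM(transactions.amount)": sum(amounts),
            "label": label,
        }
    return rows


feature_rows = build_rows(transactions, cutoff_times)
for key, row in feature_rows.items():
    print(key, row)
```

Customer 1 appears twice with different cutoffs, so the two rows see one and two transactions respectively.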

Advanced Scenarios

The Flights entityset is a prototypical example of a dataset where an individual row can happen over time. Each trip is recorded in a trip_logs entity, and has many times associated with it.

In [12]: es_flight = ft.demo.load_flight(nrows=100)
Downloading data from s3...

In [13]: es_flight
Out[13]: 
Entityset: Flight Data
  Entities:
    trip_logs [Rows: 100, Columns: 22]
    flights [Rows: 13, Columns: 9]
    airlines [Rows: 1, Columns: 1]
    airports [Rows: 6, Columns: 3]
  Relationships:
    trip_logs.flight_id -> flights.flight_id
    flights.carrier -> airlines.carrier
    flights.dest -> airports.dest

In [14]: es_flight['trip_logs'].df.head(3)
Out[14]: 
    trip_log_id flight_date  scheduled_dep_time  dep_delay  taxi_out  taxi_in  arr_delay  scheduled_elapsed_time  air_time  distance  carrier_delay  weather_delay  national_airspace_delay  security_delay  late_aircraft_delay            dep_time            arr_time  scheduled_arr_time          time_index        flight_id  cancelled  diverted
82           82  2017-01-01 2017-01-01 06:38:00       -5.0      12.0      6.0       -6.0           6060000000000      82.0     507.0            0.0            0.0                      0.0             0.0                  0.0 2017-01-01 06:33:00 2017-01-01 08:13:00 2017-01-01 08:19:00 2016-09-03 06:38:00  AA-495:TPA->CLT        0.0       0.0
92           92  2017-01-01 2017-01-01 07:00:00       -6.0      28.0     15.0        5.0          12180000000000     171.0    1067.0            0.0            0.0                      0.0             0.0                  0.0 2017-01-01 06:54:00 2017-01-01 10:28:00 2017-01-01 10:23:00 2016-09-03 07:00:00  AA-496:PIT->DFW        0.0       0.0
46           46  2017-01-01 2017-01-01 09:25:00       -2.0      18.0      8.0       -3.0           4620000000000      50.0     226.0            0.0            0.0                      0.0             0.0                  0.0 2017-01-01 09:23:00 2017-01-01 10:39:00 2017-01-01 10:42:00 2016-09-03 09:25:00  AA-495:CLT->ATL        0.0       0.0

For every trip we have real arrival and departure times and scheduled arrival and departure times.

With the columns we have, it would be problematic for the scheduled_dep_time to be the time index: flights are scheduled far in advance! If the time index were set to the scheduled departure time, we wouldn’t be able to know anything about the flight at all until it was boarded.

However, it’s possible to know many things about a trip six months or more before it takes off: the trip distance, carrier, flight number and even when a flight is supposed to leave and land are always known before we buy a ticket. Our time_index exists to reflect the reality that those can be known well before the scheduled departure time.

That being said, not all columns can be known at our time index six months in advance. If we were able to know the real arrival time of the plane before we booked, we would have great success in predicting delays!

flight time index diagram

In this diagram of a row, we have set the time_index to the time the flight was scheduled. However, any information about what happens to the flight after it departs is invalid for use at that time. If we were to use any of that information prior to when the flight lands, we would be leaking labels.

While one option would be to remove that data from the entityset, a better option would be to use that data somehow. To that end, it’s possible to set a secondary_time_index which can mark specific columns as available at a later date. The secondary_time_index of this row is set to the arrival time.

flight secondary time index diagram

By setting a secondary_time_index, we can still use the delay information from a row, but only once it would have become known. It’s possible to know everything about how a trip went after it has arrived, so we can happily use that information at any time after the flight lands.
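The two-stage visibility of a row can be sketched in plain Python. The values come from trip 82 in the table above; the `SECONDARY_COLUMNS` set and the `visible_at` helper are hypothetical, and masking with None is a simplification chosen for this example rather than a description of how Featuretools stores hidden values.

```python
from datetime import datetime

# One trip, with values taken from trip 82 in the table above.
trip = {
    "trip_log_id": 82,
    "distance": 507.0,
    "scheduled_dep_time": datetime(2017, 1, 1, 6, 38),
    "arr_delay": -6.0,
    "arr_time": datetime(2017, 1, 1, 8, 13),
    "time_index": datetime(2016, 9, 3, 6, 38),
}

# Columns assumed to become known only once the flight has landed.
SECONDARY_COLUMNS = {"arr_delay", "arr_time"}


def visible_at(row, cutoff):
    """Return the part of `row` that may be used at `cutoff`, or None."""
    if row["time_index"] > cutoff:
        return None  # the row does not exist yet
    landed = row["arr_time"] <= cutoff
    return {
        col: (value if landed or col not in SECONDARY_COLUMNS else None)
        for col, value in row.items()
    }


before_landing = visible_at(trip, datetime(2016, 12, 28))
after_landing = visible_at(trip, datetime(2017, 1, 2))
print(before_landing["distance"], before_landing["arr_delay"])  # 507.0 None
print(after_landing["arr_delay"])  # -6.0
```

Before the time index the whole row is hidden; between the two indices only the scheduled information is usable; after landing everything is available.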

Hint

It’s often a good idea to use a secondary time index if your entityset has inline labels. If you know when the label would be valid for use, it’s possible to automatically create very predictive features using historical labels.

As an exercise, take a minute to think about which of the twenty-two columns here can be known at each time index. Which can be known 6 months in advance and which would be better to only learn after the flight lands?

In [15]: es_flight['trip_logs']
Out[15]: 
Entity: trip_logs
  Variables:
    trip_log_id (dtype: index)
    flight_date (dtype: datetime)
    scheduled_dep_time (dtype: datetime)
    dep_delay (dtype: numeric)
    taxi_out (dtype: numeric)
    taxi_in (dtype: numeric)
    arr_delay (dtype: numeric)
    scheduled_elapsed_time (dtype: numeric)
    air_time (dtype: numeric)
    distance (dtype: numeric)
    carrier_delay (dtype: numeric)
    weather_delay (dtype: numeric)
    national_airspace_delay (dtype: numeric)
    security_delay (dtype: numeric)
    late_aircraft_delay (dtype: numeric)
    dep_time (dtype: datetime)
    arr_time (dtype: datetime)
    scheduled_arr_time (dtype: datetime)
    time_index (dtype: datetime_time_index)
    flight_id (dtype: id)
    cancelled (dtype: boolean)
    diverted (dtype: boolean)
  Shape:
    (Rows: 100, Columns: 22)
  • These columns can be known at the time_index months before the flight: trip_log_id, flight_date, scheduled_dep_time, scheduled_elapsed_time, distance, scheduled_arr_time, time_index, flight_id
  • These can only be known at the secondary_time_index, after the flight has completed: dep_delay, taxi_out, taxi_in, arr_delay, air_time, carrier_delay, weather_delay, national_airspace_delay, security_delay, late_aircraft_delay, dep_time, arr_time, cancelled, diverted

An entity can have a third, hidden, time index called the last_time_index. More details for that can be found in the other temporal workflows section.

Flight Predictions

Let’s make features at some varying times in the flight example. Trip 14 is a flight from CLT to PHX on January 31, 2017 and trip 92 is a flight from PIT to DFW on January 1, 2017. We can set any cutoff time before the flight is scheduled to depart, emulating how we would make the prediction at that point in time.

We set two cutoff times for trip 14: one more than a month before the flight and another only six days before. For trip 92, we’ll set a single cutoff time four days before it is scheduled to leave.

flight cutoff time diagram

Our cutoff time dataframe looks like this:

In [16]: ct_flight = pd.DataFrame()

In [17]: ct_flight['trip_log_id'] = [14, 14, 92]

In [18]: ct_flight['time'] = pd.to_datetime(['2016-12-28',
   ....:                                     '2017-1-25',
   ....:                                     '2016-12-28'])
   ....: 

In [19]: ct_flight['label'] = [True, True, False]

In [20]: ct_flight
Out[20]: 
   trip_log_id       time  label
0           14 2016-12-28   True
1           14 2017-01-25   True
2           92 2016-12-28  False

These instructions say to build two rows for trip 14 using data from different times and one row for trip 92. Here’s how DFS handles those instructions:

In [21]: fm, features = ft.dfs(entityset=es_flight,
   ....:                       target_entity='trip_logs',
   ....:                       cutoff_time=ct_flight,
   ....:                       cutoff_time_in_index=True)
   ....: 

In [22]: fm[['label', 'flight_id', 'flights.MAX(trip_logs.arr_delay)', 'MONTH(scheduled_dep_time)', 'DAY(scheduled_dep_time)']]
Out[22]: 
                        label        flight_id  flights.MAX(trip_logs.arr_delay)  MONTH(scheduled_dep_time)  DAY(scheduled_dep_time)
trip_log_id time                                                                                                                    
14          2016-12-28   True  AA-494:CLT->PHX                               NaN                          1                       31
92          2016-12-28  False  AA-496:PIT->DFW                               NaN                          1                        1
14          2017-01-25   True  AA-494:CLT->PHX                              33.0                          1                       31

There is a lot to unpack from this output:

  1. Even though one id showed up twice, a row was made for every id-time pair in ct_flight. The id and cutoff time were returned as the index of the feature matrix.
  2. The output and the label were sorted by the passed-in time column. Because of the sorting, it’s often helpful to pass in a label with the cutoff time dataframe so that it will remain sorted in the same way as the feature matrix. Any additional columns past the id and cutoff_time will not be used for making features.
  3. The column flights.MAX(trip_logs.arr_delay) is not always defined. It can only have real values when there are historical flights to aggregate. Since we excluded the arrival delay of this particular flight, there are no values to use!

Notice that for trip 14, there wasn’t historical data when we made the feature a month in advance, but there were flights from Charlotte to Phoenix before January 25 whose delay could be validly used. These are powerful features that are often excluded in manual processes because of how hard they are to make.
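The NaN-then-value behaviour can be reproduced with a small plain-Python sketch. The `past_trips` records are hypothetical (only the 33.0 maximum echoes the output above), and `max_arr_delay` is an illustrative helper, not the Featuretools implementation: it aggregates only trips that had already landed by the cutoff.

```python
import math
from datetime import datetime

# Hypothetical earlier trips of one flight: (flight_id, arr_time, arr_delay).
past_trips = [
    ("AA-494:CLT->PHX", datetime(2017, 1, 5, 12, 0), 33.0),
    ("AA-494:CLT->PHX", datetime(2017, 1, 12, 12, 0), -4.0),
]


def max_arr_delay(trips, flight_id, cutoff):
    """Max arrival delay over trips of `flight_id` that landed by `cutoff`."""
    delays = [d for fid, arr, d in trips if fid == flight_id and arr <= cutoff]
    return max(delays) if delays else math.nan


# Two cutoff times for the same trip, as in the cutoff time dataframe above.
cutoffs = [
    (14, "AA-494:CLT->PHX", datetime(2016, 12, 28)),
    (14, "AA-494:CLT->PHX", datetime(2017, 1, 25)),
]
results = [(tid, t, max_arr_delay(past_trips, fid, t)) for tid, fid, t in cutoffs]
for row in results:
    print(row)
```

At the December cutoff no trip has landed yet, so the aggregate is NaN; by January 25 both hypothetical trips are usable and the maximum is 33.0.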

Other Settings

Training Window and the Last Time Index

The training window in DFS limits the amount of past data that can be used while calculating a particular feature vector. In the same way that a cutoff time filters out data which appears after it, a training window filters out data that is older than the window allows. Here’s an example of a two hour training window:

In [23]: window_fm, window_features = ft.dfs(entityset=es_mc,
   ....:                                     target_entity="customers",
   ....:                                     cutoff_time=ct,
   ....:                                     cutoff_time_in_index=True,
   ....:                                     training_window="2 hours")
   ....: 

In [24]: window_fm.head()
Out[24]: 
                                zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device) MODE(sessions.device)  SUM(transactions.amount)  STD(transactions.amount)  MAX(transactions.amount)  SKEW(transactions.amount)  MIN(transactions.amount)  MEAN(transactions.amount)  COUNT(transactions)  NUM_UNIQUE(transactions.product_id)  MODE(transactions.product_id)  DAY(join_date)  DAY(date_of_birth)  YEAR(join_date)  YEAR(date_of_birth)  MONTH(join_date)  MONTH(date_of_birth)  WEEKDAY(join_date)  WEEKDAY(date_of_birth)  SUM(sessions.STD(transactions.amount))  SUM(sessions.MAX(transactions.amount))  SUM(sessions.SKEW(transactions.amount))  SUM(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  SUM(sessions.NUM_UNIQUE(transactions.product_id))  STD(sessions.SUM(transactions.amount))  STD(sessions.MAX(transactions.amount))  STD(sessions.SKEW(transactions.amount))  STD(sessions.MIN(transactions.amount))  STD(sessions.MEAN(transactions.amount))  STD(sessions.COUNT(transactions))  STD(sessions.NUM_UNIQUE(transactions.product_id))  MAX(sessions.SUM(transactions.amount))  MAX(sessions.STD(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  MAX(sessions.MEAN(transactions.amount))  MAX(sessions.COUNT(transactions))  MAX(sessions.NUM_UNIQUE(transactions.product_id))  SKEW(sessions.SUM(transactions.amount))  SKEW(sessions.STD(transactions.amount))  SKEW(sessions.MAX(transactions.amount))  SKEW(sessions.MIN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  SKEW(sessions.COUNT(transactions))  SKEW(sessions.NUM_UNIQUE(transactions.product_id))  MIN(sessions.SUM(transactions.amount))  MIN(sessions.STD(transactions.amount))  MIN(sessions.MAX(transactions.amount))  MIN(sessions.SKEW(transactions.amount))  MIN(sessions.MEAN(transactions.amount))  MIN(sessions.COUNT(transactions))  MIN(sessions.NUM_UNIQUE(transactions.product_id))  MEAN(sessions.SUM(transactions.amount))  
MEAN(sessions.STD(transactions.amount))  MEAN(sessions.MAX(transactions.amount))  MEAN(sessions.SKEW(transactions.amount))  MEAN(sessions.MIN(transactions.amount))  MEAN(sessions.MEAN(transactions.amount))  MEAN(sessions.COUNT(transactions))  MEAN(sessions.NUM_UNIQUE(transactions.product_id))  NUM_UNIQUE(sessions.MODE(transactions.product_id))  NUM_UNIQUE(sessions.DAY(session_start))  NUM_UNIQUE(sessions.YEAR(session_start))  NUM_UNIQUE(sessions.MONTH(session_start))  NUM_UNIQUE(sessions.WEEKDAY(session_start))  MODE(sessions.MODE(transactions.product_id))  MODE(sessions.DAY(session_start))  MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))  label
customer_id time
1           2014-01-01 04:00:00    60091              2.0                          2.0               desktop                   2077.66                 43.772157                    139.09                  -0.187686                      5.81                  76.950370                 27.0                                  5.0                            4.0              17                  18             2011                 1994                 4                     7                   6                       0                               86.730914                                  271.81                                -0.604638                                   12.59                               155.604500                                               10.0                               18.667619                                 4.50427                                 0.747633                                0.685894                                10.842658                           2.121320                                                0.0                                 1052.03                               46.905665                                 0.226337                                    6.78                                85.469167                               15.0                                                5.0                                      NaN                                      NaN                                      NaN                                      NaN                                       NaN                                 NaN                                                NaN                                  1025.63                               39.825249                                  132.72                                -0.830975                                70.135333                               12.0                                                5.0                              1038.830000                                
43.365457                               135.905000                                 -0.302319                                    6.295                                 77.802250                                13.5                                                5.0                                                 2.0                                       1.0                                       1.0                                        1.0                                          1.0                                           1.0                                1.0                              2014.0                                  1.0                                    2.0   True
2           2014-01-01 04:00:00    13244              3.0                          2.0               desktop                   2921.29                 38.184810                    146.81                  -0.340927                     12.07                  88.523939                 33.0                                  5.0                            4.0              15                  18             2012                 1986                 4                     8                   6                       0                              115.661762                                  427.63                                -0.250288                                   84.33                               263.978667                                               15.0                              342.969170                                 4.21595                                 0.323138                               24.622553                                 8.613108                           3.605551                                                0.0                                 1320.64                               47.935920                                 0.130019                                   56.46                                96.581000                               15.0                                                5.0                                 0.104297                                -0.582634                                 0.110229                                 1.687442                                 -0.026006                             1.15207                                                0.0                                   634.84                               27.839228                                  138.38                                -0.455197                                79.355000                                8.0                                                5.0                               973.763333                                
38.553921                               142.543333                                 -0.083429                                   28.110                                 87.992889                                11.0                                                5.0                                                 2.0                                       1.0                                       1.0                                        1.0                                          1.0                                           2.0                                1.0                              2014.0                                  1.0                                    2.0   True
3           2014-01-01 04:00:00    13244              0.0                          NaN                   NaN                      0.00                       NaN                       NaN                        NaN                       NaN                        NaN                  0.0                                  NaN                            NaN              13                  21             2011                 2003                 8                    11                   5                       4                                0.000000                                    0.00                                 0.000000                                    0.00                                 0.000000                                                0.0                                     NaN                                     NaN                                      NaN                                     NaN                                      NaN                                NaN                                                NaN                                     NaN                                     NaN                                      NaN                                     NaN                                      NaN                                NaN                                                NaN                                      NaN                                      NaN                                      NaN                                      NaN                                       NaN                                 NaN                                                NaN                                      NaN                                     NaN                                     NaN                                      NaN                                      NaN                                NaN                                                NaN                                      NaN                                   
   NaN                                      NaN                                       NaN                                      NaN                                       NaN                                 NaN                                                NaN                                                 NaN                                       NaN                                       NaN                                        NaN                                          NaN                                           NaN                                NaN                                 NaN                                  NaN                                    NaN  False

This works well for entities where an instance occurs at a single point in time. However, sometimes an instance can happen at many points in time.

For example, suppose a customer’s session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user had in a given time period, we often want to count all sessions that were active during the training window. To accomplish this, we need to not only know when a session starts, but when it ends. The last time that an instance appears in the data is stored as the last_time_index of an entity. We can compare the time index and the last time index of sessions:

In [25]: es_mc['sessions'].df['session_start'].head()
Out[25]: 
1   2014-01-01 00:00:00
2   2014-01-01 00:17:20
3   2014-01-01 00:28:10
4   2014-01-01 00:44:25
5   2014-01-01 01:11:30
Name: session_start, dtype: datetime64[ns]

In [26]: es_mc['sessions'].last_time_index.head()
Out[26]: 
1   2014-01-01 00:16:15
2   2014-01-01 00:27:05
3   2014-01-01 00:43:20
4   2014-01-01 01:10:25
5   2014-01-01 01:22:20
Name: last_time, dtype: datetime64[ns]

It is possible to automatically add last time indexes to every entity in an EntitySet by running EntitySet.add_last_time_indexes(). If a last_time_index has been set, Featuretools will check to see if the last_time_index is after the start of the training window. That, combined with the cutoff time, allows Deep Feature Synthesis to discover which data is relevant for a given training window.
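
The check described above is essentially an interval-overlap test. The following is a minimal sketch in plain Python, not the Featuretools implementation: the Session class, the active_in_window helper, the 00:30 cutoff, and the 10-minute training window are all hypothetical, and treating data exactly at the cutoff as valid is an assumption. The start and last times are the first four sessions shown above.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Session(NamedTuple):
    session_id: int
    start: datetime      # plays the role of the time index
    last_time: datetime  # plays the role of the last time index


def active_in_window(session, cutoff, training_window):
    """True if the session overlaps (cutoff - training_window, cutoff]."""
    window_start = cutoff - training_window
    # Started by the cutoff (assumed inclusive) and last seen after the window start.
    return session.start <= cutoff and session.last_time > window_start


sessions = [
    Session(1, datetime(2014, 1, 1, 0, 0, 0), datetime(2014, 1, 1, 0, 16, 15)),
    Session(2, datetime(2014, 1, 1, 0, 17, 20), datetime(2014, 1, 1, 0, 27, 5)),
    Session(3, datetime(2014, 1, 1, 0, 28, 10), datetime(2014, 1, 1, 0, 43, 20)),
    Session(4, datetime(2014, 1, 1, 0, 44, 25), datetime(2014, 1, 1, 1, 10, 25)),
]

cutoff = datetime(2014, 1, 1, 0, 30)    # hypothetical cutoff time
training_window = timedelta(minutes=10)  # hypothetical training window
window_start = cutoff - training_window

# Using both time indexes: every session that was active during the window.
active_ids = [s.session_id for s in sessions
              if active_in_window(s, cutoff, training_window)]

# Using only the start time: sessions that began inside the window.
started_ids = [s.session_id for s in sessions
               if window_start < s.start <= cutoff]

print(active_ids, started_ids)
```

Session 2 began before the window opened but was still active inside it, so it is only found when the last time is taken into account.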

Approximating features by rounding cutoff time

Every unique cutoff time carries some overhead, because Featuretools must search for and extract the data that is valid at that time before it can calculate features. If there are a large number of unique cutoff times relative to the number of instances for which we are calculating features, this overhead can outweigh the time needed to calculate the features themselves. By reducing the number of unique cutoff times, we minimize the overhead from searching for and extracting data for feature calculations.

One way to decrease the number of unique cutoff times is to round each cutoff time to a nearby earlier point in time. An earlier cutoff time is always valid for predictive modeling: it just means we are not using some of the data we could potentially use while calculating that feature. In that way, we gain computational speed by losing some information.
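
As a rough illustration of this trade-off, the sketch below floors a handful of made-up cutoff times to 1-day bins and counts how many unique cutoffs remain. The round_down helper is hypothetical, and aligning bins to the Unix epoch is an assumption made for the example; it is not necessarily how Featuretools aligns its bins internally.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def round_down(ts, window):
    """Floor ts to the most recent multiple of window, counted from EPOCH."""
    return ts - (ts - EPOCH) % window


# Made-up cutoff times, for illustration only.
cutoffs = [
    datetime(2016, 12, 28, 9, 15),
    datetime(2016, 12, 28, 17, 40),
    datetime(2016, 12, 29, 8, 5),
    datetime(2016, 12, 29, 23, 59),
]

rounded = [round_down(c, timedelta(days=1)) for c in cutoffs]

n_unique_before = len(set(cutoffs))
n_unique_after = len(set(rounded))
print(n_unique_before, n_unique_after)

# A rounded cutoff never moves later than the original, so no future data leaks in.
never_later = all(r <= c for r, c in zip(rounded, cutoffs))
```

Here four distinct cutoff times collapse into two, so the data selection step would only need to run twice.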

To understand when approximation is useful, consider calculating features for a model to predict fraudulent credit card transactions. In this case, an important feature might be, “the average transaction amount for this card in the past”. While this value can change every time there is a new transaction, updating it less frequently might not impact accuracy.

Note

The bank BBVA used approximation when building a predictive model for credit card fraud using Featuretools. For more details, see the “Real-time deployment considerations” section of the white paper describing the work.

The frequency of approximation is controlled using the approximate parameter to DFS or calculate_feature_matrix(). For example, the following code would approximate aggregation features at 1 day intervals:

fm = ft.calculate_feature_matrix(entityset=es_flight,
                                 cutoff_time=ct_flight,
                                 approximate="1 day")

In this computation, features that can be approximated will be calculated at 1 day intervals, while features that cannot be approximated (e.g., “what is the destination of this flight?”) will be calculated at the exact cutoff time.

Creating and Flattening a Feature Tensor

The make_temporal_cutoffs() function generates a series of equally spaced cutoff times from a given set of cutoff times and instance ids. This function can be paired with DFS to create and flatten a feature tensor rather than making multiple feature matrices at different delays.

The function takes in the following parameters:

  • instance_ids (list, pd.Series, or np.ndarray): A list of instances.
  • cutoffs (list, pd.Series, or np.ndarray): An associated list of cutoff times.
  • window_size (str or pandas.DateOffset): The amount of time between each cutoff time in the created time series.
  • start (datetime.datetime or pd.Timestamp): The first cutoff time in the created time series.
  • num_windows (int): The number of cutoff times to create in the created time series.

Only two of the three options window_size, start, and num_windows need to be specified to uniquely determine an equally spaced set of cutoff times at which to compute each instance.
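
To see why any two of the three options are enough, here is a minimal plain-Python sketch for a single instance. The spaced_cutoffs helper is hypothetical and is not the Featuretools implementation; in particular, the behavior of the two branches that use start is an assumption about how an evenly spaced range would usually be built, so check the make_temporal_cutoffs() documentation for the exact semantics.

```python
from datetime import datetime, timedelta


def spaced_cutoffs(end, window_size=None, start=None, num_windows=None):
    """Return equally spaced cutoff times that end at (or before) `end`.

    Hypothetical helper for illustration only; exactly two of the three
    optional arguments must be given.
    """
    given = sum(arg is not None for arg in (window_size, start, num_windows))
    if given != 2:
        raise ValueError("specify exactly two of window_size, start, num_windows")

    if start is None:
        # Step backwards from the original cutoff, then return in time order.
        return [end - window_size * i for i in reversed(range(num_windows))]
    if window_size is None:
        # Spread num_windows points evenly from start to end, inclusive.
        if num_windows == 1:
            return [start]
        step = (end - start) / (num_windows - 1)
        return [start + step * i for i in range(num_windows)]
    # Step forwards from start without passing the original cutoff.
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += window_size
    return times


cutoff = datetime(2014, 1, 1, 4)
by_count = spaced_cutoffs(cutoff, window_size=timedelta(hours=1), num_windows=2)
by_start = spaced_cutoffs(cutoff, start=datetime(2014, 1, 1, 1), num_windows=4)
by_step = spaced_cutoffs(cutoff, window_size=timedelta(hours=1),
                         start=datetime(2014, 1, 1, 2))
print(by_count)
```

With window_size and num_windows given, the first cutoff is implied; with start and num_windows, the spacing is implied; with start and window_size, the number of cutoffs is implied.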

If your cutoff times are the ones used above:

In [27]: ct_flight
Out[27]: 
   trip_log_id       time  label
0           14 2016-12-28   True
1           14 2017-01-25   True
2           92 2016-12-28  False

Then passing in window_size='1h' and num_windows=2 creates one cutoff time per hour over the last two hours before each original cutoff time, producing the following new dataframe. The result can be passed directly into DFS to make features at the different time points.

In [28]: temporal_cutoffs = ft.make_temporal_cutoffs(ct['customer_id'],
   ....:                                             ct['time'],
   ....:                                             window_size='1h',
   ....:                                             num_windows=2)
   ....: 

In [29]: temporal_cutoffs
Out[29]: 
                 time  instance_id
0 2014-01-01 03:00:00            1
1 2014-01-01 04:00:00            1
2 2014-01-01 03:00:00            2
3 2014-01-01 04:00:00            2
4 2014-01-01 03:00:00            3
5 2014-01-01 04:00:00            3

In [30]: fm, features = ft.dfs(entityset=es_mc,
   ....:                       target_entity='customers',
   ....:                       cutoff_time=temporal_cutoffs,
   ....:                       cutoff_time_in_index=True)
   ....: 

In [31]: fm
Out[31]: 
                                zip_code  COUNT(sessions)  NUM_UNIQUE(sessions.device) MODE(sessions.device)  SUM(transactions.amount)  STD(transactions.amount)  MAX(transactions.amount)  SKEW(transactions.amount)  MIN(transactions.amount)  MEAN(transactions.amount)  COUNT(transactions)  NUM_UNIQUE(transactions.product_id)  MODE(transactions.product_id)  DAY(join_date)  DAY(date_of_birth)  YEAR(join_date)  YEAR(date_of_birth)  MONTH(join_date)  MONTH(date_of_birth)  WEEKDAY(join_date)  WEEKDAY(date_of_birth)  SUM(sessions.STD(transactions.amount))  SUM(sessions.MAX(transactions.amount))  SUM(sessions.SKEW(transactions.amount))  SUM(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  SUM(sessions.NUM_UNIQUE(transactions.product_id))  STD(sessions.SUM(transactions.amount))  STD(sessions.MAX(transactions.amount))  STD(sessions.SKEW(transactions.amount))  STD(sessions.MIN(transactions.amount))  STD(sessions.MEAN(transactions.amount))  STD(sessions.COUNT(transactions))  STD(sessions.NUM_UNIQUE(transactions.product_id))  MAX(sessions.SUM(transactions.amount))  MAX(sessions.STD(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  MAX(sessions.MEAN(transactions.amount))  MAX(sessions.COUNT(transactions))  MAX(sessions.NUM_UNIQUE(transactions.product_id))  SKEW(sessions.SUM(transactions.amount))  SKEW(sessions.STD(transactions.amount))  SKEW(sessions.MAX(transactions.amount))  SKEW(sessions.MIN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  SKEW(sessions.COUNT(transactions))  SKEW(sessions.NUM_UNIQUE(transactions.product_id))  MIN(sessions.SUM(transactions.amount))  MIN(sessions.STD(transactions.amount))  MIN(sessions.MAX(transactions.amount))  MIN(sessions.SKEW(transactions.amount))  MIN(sessions.MEAN(transactions.amount))  MIN(sessions.COUNT(transactions))  MIN(sessions.NUM_UNIQUE(transactions.product_id))  MEAN(sessions.SUM(transactions.amount))  
MEAN(sessions.STD(transactions.amount))  MEAN(sessions.MAX(transactions.amount))  MEAN(sessions.SKEW(transactions.amount))  MEAN(sessions.MIN(transactions.amount))  MEAN(sessions.MEAN(transactions.amount))  MEAN(sessions.COUNT(transactions))  MEAN(sessions.NUM_UNIQUE(transactions.product_id))  NUM_UNIQUE(sessions.MODE(transactions.product_id))  NUM_UNIQUE(sessions.DAY(session_start))  NUM_UNIQUE(sessions.YEAR(session_start))  NUM_UNIQUE(sessions.MONTH(session_start))  NUM_UNIQUE(sessions.WEEKDAY(session_start))  MODE(sessions.MODE(transactions.product_id))  MODE(sessions.DAY(session_start))  MODE(sessions.YEAR(session_start))  MODE(sessions.MONTH(session_start))  MODE(sessions.WEEKDAY(session_start))
customer_id time
1           2014-01-01 03:00:00    60091                3                            3               desktop                   3932.56                 42.769602                    139.23                   0.140387                      5.81                  71.501091                   55                                    5                              1              17                  18             2011                 1994                 4                     7                   6                       0                              129.747625                                  400.95                                 0.325932                                   20.84                               219.132533                                                 15                              283.551883                                5.178021                                 0.210827                                1.571507                                10.255607                           5.773503                                                0.0                                 1613.93                               46.905665                                 0.234349                                    8.74                                84.440000                                 25                                                  5                                 0.685199                                 0.763052                                 0.782152                                 1.552040                                  1.173675                            1.732051                                                0.0                                  1052.03                               40.187205                                  129.00                                -0.134754                                64.557200                                 15                                                  5                              1310.853333                                
43.249208                                 133.6500                                  0.108644                                 6.946667                                 73.044178                           18.333333                                                  5                                                   3                                         1                                         1                                          1                                            1                                             1                                  1                                2014                                    1                                      2
2           2014-01-01 03:00:00    13244                2                            2               desktop                   2549.65                 40.500652                    142.44                  -0.057293                     15.80                  82.246774                   31                                    5                              2              15                  18             2012                 1986                 4                     8                   6                       0                               81.487590                                  284.10                                -0.159738                                   36.71                               164.855792                                                 10                               64.792194                                0.551543                                 0.530793                                3.613316                                 7.940485                           0.707107                                                0.0                                 1320.64                               41.600976                                 0.295458                                   20.91                                88.042667                                 16                                                  5                                      NaN                                      NaN                                      NaN                                      NaN                                       NaN                                 NaN                                                NaN                                  1229.01                               39.886614                                  141.66                                -0.455197                                76.813125                                 15                                                  5                              1274.825000                                
40.743795                                 142.0500                                 -0.079869                                18.355000                                 82.427896                           15.500000                                                  5                                                   2                                         1                                         1                                          1                                            1                                             2                                  1                                2014                                    1                                      2
3           2014-01-01 03:00:00    13244                1                            1                tablet                    941.87                 47.264797                    146.31                   0.618455                      8.19                  62.791333                   15                                    5                              1              13                  21             2011                 2003                 8                    11                   5                       4                               47.264797                                  146.31                                 0.618455                                    8.19                                62.791333                                                  5                                     NaN                                     NaN                                      NaN                                     NaN                                      NaN                                NaN                                                NaN                                  941.87                               47.264797                                 0.618455                                    8.19                                62.791333                                 15                                                  5                                      NaN                                      NaN                                      NaN                                      NaN                                       NaN                                 NaN                                                NaN                                   941.87                               47.264797                                  146.31                                 0.618455                                62.791333                                 15                                                  5                               941.870000                                
47.264797                                 146.3100                                  0.618455                                 8.190000                                 62.791333                           15.000000                                                  5                                                   1                                         1                                         1                                          1                                            1                                             1                                  1                                2014                                    1                                      2
1           2014-01-01 04:00:00    60091                4                            3                tablet                   4958.19                 42.309717                    139.23                  -0.006928                      5.81                  74.002836                   67                                    5                              4              17                  18             2011                 1994                 4                     7                   6                       0                              169.572874                                  540.04                                -0.505043                                   27.62                               304.601700                                                 20                              271.917637                                5.027226                                 0.500353                                1.285833                                10.426572                           5.678908                                                0.0                                 1613.93                               46.905665                                 0.234349                                    8.74                                85.469167                                 25                                                  5                                 1.197406                                 1.235445                                -0.451371                                 1.452325                                 -0.233453                            1.614843                                                0.0                                  1025.63                               39.825249                                  129.00                                -0.830975                                64.557200                                 12                                                  5                              1239.547500                                
42.393218                                 135.0100                                 -0.126261                                 6.905000                                 76.150425                           16.750000                                                  5                                                   3                                         1                                         1                                          1                                            1                                             4                                  1                                2014                                    1                                      2
2           2014-01-01 04:00:00    13244                4                            2               desktop                   4150.30                 39.289512                    146.81                  -0.134786                     12.07                  84.700000                   49                                    5                              4              15                  18             2012                 1986                 4                     8                   6                       0                              157.262738                                  569.29                                 0.045171                                  105.24                               340.791792                                                 20                              307.743859                                3.470527                                 0.324809                               20.424007                                 8.983533                           3.862210                                                0.0                                 1320.64                               47.935920                                 0.295458                                   56.46                                96.581000                                 16                                                  5                                -0.823347                                -0.966834                                 0.459305                                 1.815491                                  0.651941                           -0.169238                                                0.0                                   634.84                               27.839228                                  138.38                                -0.455197                                76.813125                                  8                                                  5                              1037.575000                                
39.315685                                 142.3225                                  0.011293                                26.310000                                 85.197948                           12.250000                                                  5                                                   3                                         1                                         1                                          1                                            1                                             2                                  1                                2014                                    1                                      2
3           2014-01-01 04:00:00    13244                1                            1                tablet                    941.87                 47.264797                    146.31                   0.618455                      8.19                  62.791333                   15                                    5                              1              13                  21             2011                 2003                 8                    11                   5                       4                               47.264797                                  146.31                                 0.618455                                    8.19                                62.791333                                                  5                                     NaN                                     NaN                                      NaN                                     NaN                                      NaN                                NaN                                                NaN                                  941.87                               47.264797                                 0.618455                                    8.19                                62.791333                                 15                                                  5                                      NaN                                      NaN                                      NaN                                      NaN                                       NaN                                 NaN                                                NaN                                   941.87                               47.264797                                  146.31                                 0.618455                                62.791333                                 15                                                  5                               941.870000                                
47.264797                                 146.3100                                  0.618455                                 8.190000                                 62.791333                           15.000000                                                  5                                                   1                                         1                                         1                                          1                                            1                                             1                                  1                                2014                                    1                                      2