What is Featuretools?

Featuretools is a framework to perform automated feature engineering. It excels at transforming transactional and relational datasets into feature matrices for machine learning.

5 Minute Quick Start

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

In [1]: import featuretools as ft

Load Mock Data

In [2]: data = ft.demo.load_mock_customer()

Prepare data

In this toy dataset, there are 3 tables. Each table is called an entity in Featuretools.

  • customers: unique customers who had sessions
  • sessions: unique sessions and associated attributes
  • transactions: list of events in each session

In [3]: customers_df = data["customers"]

In [4]: customers_df
Out[4]: 
   customer_id zip_code  join_date
0            1    60091 2008-01-01
1            2    02139 2008-02-20
2            3    02139 2008-04-10
3            4    60091 2008-05-30
4            5    02139 2008-07-19

In [5]: sessions_df = data["sessions"]

In [6]: sessions_df.sample(5)
Out[6]: 
    session_id  customer_id   device       session_start
16          17            4   mobile 2014-01-01 04:02:40
34          35            1  desktop 2014-01-01 08:45:25
4            5            2   tablet 2014-01-01 01:10:25
0            1            1  desktop 2014-01-01 00:00:00
29          30            4  desktop 2014-01-01 07:29:35

In [7]: transactions_df = data["transactions"]

In [8]: transactions_df.sample(5)
Out[8]: 
     transaction_id  session_id    transaction_time product_id  amount
481             441          34 2014-01-01 08:41:05          1   81.15
418              84          30 2014-01-01 07:32:50          4  149.02
19               85           2 2014-01-01 00:20:35          4  148.14
132             377           9 2014-01-01 02:23:00          4  112.07
148             109          10 2014-01-01 02:40:20          4   18.40

First, we specify a dictionary with all the entities in our dataset.

In [9]: entities = {
   ...:    "customers" : (customers_df, "customer_id"),
   ...:    "sessions" : (sessions_df, "session_id", "session_start"),
   ...:    "transactions" : (transactions_df, "transaction_id", "transaction_time")
   ...: }
   ...: 
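The tuple format used above can be read as (data, index) or (data, index, time index). The following is a minimal sketch of that reading, assuming the second element names a column that uniquely identifies each row and the optional third element names the timestamp column. It uses plain lists of dictionaries in place of DataFrames, and summarize_entity is a hypothetical helper, not part of Featuretools.

```python
# Toy rows standing in for the DataFrames above (hypothetical values).
toy_entities = {
    "customers": ([{"customer_id": 1}, {"customer_id": 2}], "customer_id"),
    "sessions": (
        [
            {"session_id": 10, "session_start": "2014-01-01 00:00:00"},
            {"session_id": 11, "session_start": "2014-01-01 00:17:20"},
        ],
        "session_id",
        "session_start",
    ),
}


def summarize_entity(name, spec):
    """Unpack one entity tuple and check its index (hypothetical helper)."""
    rows, index, *rest = spec
    # The time index is optional: "customers" above has none.
    time_index = rest[0] if rest else None
    index_values = [row[index] for row in rows]
    return {
        "name": name,
        "index": index,
        "time_index": time_index,
        "index_is_unique": len(index_values) == len(set(index_values)),
    }


summaries = [summarize_entity(name, spec) for name, spec in toy_entities.items()]
```

Each summary records which column was treated as the index, which (if any) as the time index, and whether the index values are unique.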

Second, we specify how the entities are related. When two entities have a one-to-many relationship, we call the “one” entity the “parent entity”. A relationship between a parent and child is defined like this:

(parent_entity, parent_variable, child_entity, child_variable)

In this dataset we have two relationships:

In [10]: relationships = [("sessions", "session_id", "transactions", "session_id"),
   ....:                  ("customers", "customer_id", "sessions", "customer_id")]
   ....: 
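To make the one-to-many meaning of these tuples concrete, here is a minimal sketch that checks a relationship by hand: the parent variable must be unique, and every child value should point at an existing parent row. The toy rows and the check_relationship helper are assumptions for illustration and are not part of Featuretools.

```python
# Toy rows standing in for the DataFrames above (hypothetical values).
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 10, "customer_id": 1},
        {"session_id": 11, "customer_id": 2},
    ],
    "transactions": [
        {"transaction_id": 100, "session_id": 10},
        {"transaction_id": 101, "session_id": 10},
        {"transaction_id": 102, "session_id": 11},
    ],
}

relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]


def check_relationship(tables, relationship):
    """Return child values with no matching parent row (hypothetical helper)."""
    parent_entity, parent_variable, child_entity, child_variable = relationship
    parent_values = [row[parent_variable] for row in tables[parent_entity]]
    # The "one" side must be unique for the relationship to be one-to-many.
    if len(parent_values) != len(set(parent_values)):
        raise ValueError(f"{parent_entity}.{parent_variable} is not unique")
    child_values = {row[child_variable] for row in tables[child_entity]}
    return sorted(child_values - set(parent_values))


# An empty list means every child row has a parent.
orphans = {rel: check_relationship(tables, rel) for rel in relationships}
```

With the toy rows above, both relationships come back with no orphaned child values.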

Note

The EntitySet class offers convenient APIs for setting up and managing entities and relationships like these. See Representing Data with EntitySets for more information.

Run Deep Feature Synthesis

A minimal input to DFS is a set of entities, a list of relationships, and the “target_entity” to calculate features for. The output of DFS is a feature matrix and the corresponding list of feature definitions.

Let’s first create a feature matrix for each customer in the data:

In [11]: feature_matrix_customers, features_defs = ft.dfs(entities=entities,
   ....:                                                  relationships=relationships,
   ....:                                                  target_entity="customers")
   ....: 

In [12]: feature_matrix_customers
Out[12]: 
            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id
1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 69 columns]

We now have dozens of new features to describe a customer’s behavior.
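Column names such as SUM(sessions.MIN(transactions.amount)) describe aggregations stacked across relationships. The sketch below reproduces two such features by hand on a few toy rows, to show how the names can be read. It is an illustration with made-up values, not how Featuretools computes them internally, and it assumes every session has at least one transaction.

```python
from collections import defaultdict

# Toy rows (hypothetical values), keyed the same way as the tables above.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 5.0},
    {"session_id": 2, "amount": 20.0},
    {"session_id": 3, "amount": 7.5},
]

# Depth 1: MIN(transactions.amount), one value per session.
amounts_by_session = defaultdict(list)
for t in transactions:
    amounts_by_session[t["session_id"]].append(t["amount"])
min_amount_by_session = {sid: min(a) for sid, a in amounts_by_session.items()}

# Depth 2: SUM(sessions.MIN(transactions.amount)), one value per customer.
sum_of_session_mins = defaultdict(float)
for s in sessions:
    sum_of_session_mins[s["customer_id"]] += min_amount_by_session[s["session_id"]]

# COUNT(transactions) per customer: follow transaction -> session -> customer.
customer_of_session = {s["session_id"]: s["customer_id"] for s in sessions}
count_transactions = defaultdict(int)
for t in transactions:
    count_transactions[customer_of_session[t["session_id"]]] += 1
```

Reading the name from the inside out gives the order of operations: first the minimum amount within each session, then the sum of those minimums across each customer's sessions.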

Change target entity

One of the reasons DFS is so powerful is that it can create a feature matrix for any entity in our data. For example, we can build features for sessions by changing the target entity:

In [13]: feature_matrix_sessions, features_defs = ft.dfs(entities=entities,
   ....:                                                 relationships=relationships,
   ....:                                                 target_entity="sessions")
   ....: 

In [14]: feature_matrix_sessions.head(5)
Out[14]: 
            customer_id   device  WEEKDAY(session_start)  MONTH(session_start)  MODE(transactions.product_id)  MEAN(transactions.amount) customers.zip_code  DAY(session_start)  MIN(transactions.amount)  NUM_UNIQUE(transactions.product_id)                       ...                        customers.MODE(sessions.device)  customers.WEEKDAY(join_date)  MODE(transactions.MONTH(transaction_time))  customers.COUNT(transactions)  customers.MAX(transactions.amount)  customers.MIN(transactions.amount)  MODE(transactions.YEAR(transaction_time))  MODE(transactions.WEEKDAY(transaction_time))  NUM_UNIQUE(transactions.MONTH(transaction_time))  NUM_UNIQUE(transactions.DAY(transaction_time))
session_id
1                     1  desktop                       2                     1                              2                  77.846250              60091                   1                      5.60                                    5                       ...                                                desktop                             1                                           1                            131                              149.95                                5.60                                       2014                                             2                                                 1                                               1
2                     1  desktop                       2                     1                              3                  89.533000              60091                   1                      8.67                                    4                       ...                                                desktop                             1                                           1                            131                              149.95                                5.60                                       2014                                             2                                                 1                                               1
3                     5   mobile                       2                     1                              5                  67.130000              02139                   1                     20.91                                    5                       ...                                                 tablet                             5                                           1                             58                              148.17                                5.91                                       2014                                             2                                                 1                                               1
4                     3   mobile                       2                     1                              1                  82.172800              02139                   1                      8.70                                    5                       ...                                                desktop                             3                                           1                             78                              147.73                                6.78                                       2014                                             2                                                 1                                               1
5                     2   tablet                       2                     1                              1                  65.031818              02139                   1                      6.29                                    5                       ...                                                 mobile                             2                                           1                            122                              149.15                                5.81                                       2014                                             2                                                 1                                               1

[5 rows x 40 columns]
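With sessions as the target, the columns mix two kinds of features: aggregations over a session's own transactions, such as MEAN(transactions.amount), and values carried down from the parent customer, such as customers.zip_code or customers.COUNT(transactions). The following is a hand-rolled sketch of that distinction on toy rows. The values are made up and the code only mirrors the column names above; it is not Featuretools' implementation.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (hypothetical values).
customers = {1: {"zip_code": "60091"}, 2: {"zip_code": "02139"}}
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 5.0},
    {"session_id": 3, "amount": 7.5},
]

# Group transaction amounts by the session they belong to.
amounts_by_session = defaultdict(list)
for t in transactions:
    amounts_by_session[t["session_id"]].append(t["amount"])

# Count transactions per customer by following session -> customer.
customer_of_session = {s["session_id"]: s["customer_id"] for s in sessions}
transactions_per_customer = defaultdict(int)
for t in transactions:
    transactions_per_customer[customer_of_session[t["session_id"]]] += 1

# One feature row per session.
feature_rows = {}
for s in sessions:
    sid, cid = s["session_id"], s["customer_id"]
    feature_rows[sid] = {
        # Aggregation over the session's children.
        "MEAN(transactions.amount)": mean(amounts_by_session[sid]),
        # Values copied from the session's parent customer.
        "customers.zip_code": customers[cid]["zip_code"],
        "customers.COUNT(transactions)": transactions_per_customer[cid],
    }
```

Sessions that share a customer receive identical values for the customers.* columns, which matches the repeated values in the first two rows of the output above.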

What’s next?

Get help

The Featuretools community is happy to provide support to users of Featuretools.

  • For installation, usage, or troubleshooting questions, please join the Gitter chat room.
  • For bugs or pull requests, please put them on GitHub.
  • For everything else, the core developers can be reached by email at help@featuretools.com.