Tuning Deep Feature Synthesis

There are several parameters that can be tuned to change the output of DFS.

In [1]: import featuretools as ft

In [2]: es = ft.demo.load_mock_customer(return_entityset=True)

In [3]: es
Out[3]: 
Entityset: transactions
  Entities:
    customers [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 4]
    products [Rows: 5, Columns: 2]
    transactions [Rows: 500, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using “Seed Features”

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of them when it can.

By using seed features, we can include domain-specific knowledge in feature engineering automation.

In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125

In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...: 

In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]: 
             PERCENT_TRUE(transactions.amount > 125)
customer_id                                         
1                                           0.213740
2                                           0.172131
3                                           0.115385
4                                           0.144144
5                                           0.224138

We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
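
To make the stacked feature concrete, here is a minimal plain-Python sketch of the quantity that PERCENT_TRUE over this seed feature represents: for each customer, the fraction of transactions whose amount exceeds 125. It is an illustration only, not Featuretools' implementation; the toy data and the function name percent_true_by_customer are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy data: (customer_id, amount) pairs, not the mock customer dataset.
transactions = [
    (1, 130.0), (1, 90.0), (1, 200.0), (1, 50.0),
    (2, 10.0), (2, 126.0), (2, 20.0), (2, 30.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Return the fraction of each customer's transactions above the threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # The seed feature: a boolean computed for every transaction.
        flags[customer_id].append(amount > threshold)
    # PERCENT_TRUE-style aggregation: the mean of the boolean flags per customer.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}


result = percent_true_by_customer(transactions)
print(result)
```

The boolean comparison plays the role of the seed feature, and the per-customer mean plays the role of the aggregation that DFS stacks on top of it.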

Add “interesting” values to variables

Sometimes we want to calculate a feature using only the rows that satisfy a condition on a second variable. We call this extra filter a “where clause”.

By default, where clauses are built using the interesting_values of a variable.

In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]

We then use where_primitives to specify which aggregation primitives to build where clauses for.

In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["count", "avg_time_between"],
   ...:                                       where_primitives=["count", "avg_time_between"],
   ...:                                       trans_primitives=[])
   ...: 

In [9]: feature_matrix
Out[9]: 
            zip_code  COUNT(transactions)  AVG_TIME_BETWEEN(sessions.session_start)  COUNT(sessions)  AVG_TIME_BETWEEN(transactions.transaction_time)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)  COUNT(sessions WHERE device = desktop)  COUNT(sessions WHERE device = mobile)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop)  COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile)
customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
1              60091                  131                               3502.777778               10                                       249.500000                                                NaN                                                    7                                      2                                        5254.166667                                                  1.0                                             6825.0             
2              02139                  122                               2655.714286                8                                       161.694215                                             9295.0                                                    2                                      3                                       14755.000000                                                  3.0                                             7410.0             
3              02139                   78                               6971.250000                5                                       374.805195                                                NaN                                                    3                                      2                                        2827.500000                                                  0.0                                            11245.0             
4              60091                  111                               2405.000000                8                                       163.090909                                             4160.0                                                    3                                      2                                        4842.500000                                                  3.0                                             2145.0             
5              02139                   58                               9316.666667                4                                       504.035088                                            15860.0                                                    1                                      1                                                NaN                                                  2.0                                                NaN             

We now have several new, potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet and the average time between those sessions.

In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]: 
             COUNT(sessions WHERE device = tablet)  AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id                                                                                                       
1                                              1.0                                                NaN             
2                                              3.0                                             9295.0             
3                                              0.0                                                NaN             
4                                              3.0                                             4160.0             
5                                              2.0                                            15860.0             

We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions, because at least two sessions are needed to measure a gap.
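
The filter-then-aggregate behaviour of a where clause can be sketched in plain Python. This is an illustration under stated assumptions rather than Featuretools' implementation: the toy sessions, the function name where_features, and the choice of seconds as the unit are all hypothetical.

```python
import math
from datetime import datetime

# Hypothetical toy sessions: (customer_id, device, session_start).
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0, 0)),
    (1, "desktop", datetime(2014, 1, 1, 0, 10, 0)),
    (2, "tablet", datetime(2014, 1, 1, 1, 0, 0)),
    (2, "tablet", datetime(2014, 1, 1, 1, 30, 0)),
    (2, "tablet", datetime(2014, 1, 1, 2, 30, 0)),
    (3, "mobile", datetime(2014, 1, 1, 3, 0, 0)),
]


def where_features(rows, customer_id, device):
    """Return (count, average seconds between sessions) for one customer and device."""
    # The "where clause": keep only the rows matching the device before aggregating.
    starts = sorted(
        start for cid, dev, start in rows if cid == customer_id and dev == device
    )
    count = len(starts)
    if count < 2:
        # Fewer than two sessions means there is no gap to average.
        return count, math.nan
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    return count, sum(gaps) / len(gaps)


for cid in (1, 2, 3):
    print(cid, where_features(sessions, cid, "tablet"))
```

In this sketch, customers 1 and 3 get NaN for the average because they have one and zero tablet sessions respectively, which mirrors the pattern in the feature matrix above.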

Encoding categorical features

Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.

In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="customers",
   ....:                                       agg_primitives=["mode"],
   ....:                                       max_depth=1)
   ....: 

In [12]: feature_matrix
Out[12]: 
            zip_code  DAY(join_date)  YEAR(join_date) MODE(sessions.device)  MONTH(join_date)  WEEKDAY(join_date)
customer_id                                                                                                      
1              60091               1             2008               desktop                 1                   1
2              02139              20             2008                mobile                 2                   2
3              02139              10             2008               desktop                 4                   3
4              60091              30             2008               desktop                 5                   4
5              02139              19             2008                tablet                 7                   5

This feature matrix contains 2 categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one hot encoding to the output of DFS.

In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [14]: feature_matrix_enc
Out[14]: 
             zip_code = 02139  zip_code = 60091  zip_code = unknown  DAY(join_date) = 30  DAY(join_date) = 20  DAY(join_date) = 19  DAY(join_date) = 10  DAY(join_date) = 1  DAY(join_date) = unknown  YEAR(join_date) = 2008              ...               MONTH(join_date) = 4  MONTH(join_date) = 2  MONTH(join_date) = 1  MONTH(join_date) = unknown  WEEKDAY(join_date) = 5  WEEKDAY(join_date) = 4  WEEKDAY(join_date) = 3  WEEKDAY(join_date) = 2  WEEKDAY(join_date) = 1  WEEKDAY(join_date) = unknown
customer_id                                                                                                                                                                                                                                ...                                                                                                                                                                                                                                                                 
1                           0                 1                   0                    0                    0                    0                    0                   1                         0                       1              ...                                  0                     0                     1                           0                       0                       0                       0                       0                       1                             0
2                           1                 0                   0                    0                    1                    0                    0                   0                         0                       1              ...                                  0                     1                     0                           0                       0                       0                       0                       1                       0                             0
3                           1                 0                   0                    0                    0                    0                    1                   0                         0                       1              ...                                  1                     0                     0                           0                       0                       0                       1                       0                       0                             0
4                           0                 1                   0                    1                    0                    0                    0                   0                         0                       1              ...                                  0                     0                     0                           0                       0                       1                       0                       0                       0                             0
5                           1                 0                   0                    0                    0                    1                    0                   0                         0                       1              ...                                  0                     0                     0                           0                       1                       0                       0                       0                       0                             0

[5 rows x 27 columns]

The returned feature matrix is now all numeric. Additionally, we get a new set of feature definitions that contain the encoded values.

In [15]: print(features_enc)
[<Feature: zip_code = 02139>, <Feature: zip_code = 60091>, <Feature: zip_code = unknown>, <Feature: DAY(join_date) = 30>, <Feature: DAY(join_date) = 20>, <Feature: DAY(join_date) = 19>, <Feature: DAY(join_date) = 10>, <Feature: DAY(join_date) = 1>, <Feature: DAY(join_date) = unknown>, <Feature: YEAR(join_date) = 2008>, <Feature: YEAR(join_date) = unknown>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) = tablet>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = unknown>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) = 5>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 2>, <Feature: MONTH(join_date) = 1>, <Feature: MONTH(join_date) = unknown>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) = 3>, <Feature: WEEKDAY(join_date) = 2>, <Feature: WEEKDAY(join_date) = 1>, <Feature: WEEKDAY(join_date) = unknown>]

These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.
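
As a rough illustration of why keeping the encoded feature definitions matters, the plain-Python sketch below learns a set of categories from one batch of values and reuses the same columns on a new batch, sending unseen values to an "unknown" column. It is not Featuretools' implementation; the function names fit_categories and one_hot, the "device = ..." column labels, and the sample values are assumptions made for this example.

```python
def fit_categories(values):
    """Collect the distinct categories seen in the training values, in first-seen order."""
    return list(dict.fromkeys(values))


def one_hot(values, categories):
    """Encode values as 0/1 columns: one per known category plus an 'unknown' column."""
    columns = ["device = {}".format(c) for c in categories] + ["device = unknown"]
    rows = []
    for value in values:
        row = [1 if value == c else 0 for c in categories]
        # Values not seen when the categories were learned fall into 'unknown'.
        row.append(0 if value in categories else 1)
        rows.append(row)
    return columns, rows


# Hypothetical training values used to learn the categories.
train = ["desktop", "mobile", "desktop", "tablet"]
categories = fit_categories(train)

# New data is encoded with the same columns, so both matrices line up.
columns, encoded_new = one_hot(["mobile", "smart_tv"], categories)
print(columns)
print(encoded_new)
```

Because the column set is fixed when the categories are learned, new data always produces a matrix with the same layout as the training data.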