Tuning Deep Feature Synthesis
There are several parameters that can be tuned to change the output of DFS.
In [1]: import featuretools as ft
In [2]: es = ft.demo.load_mock_customer(return_entityset=True)
In [3]: es
Out[3]:
Entityset: transactions
Entities:
transactions [Rows: 500, Columns: 5]
products [Rows: 5, Columns: 2]
sessions [Rows: 35, Columns: 4]
customers [Rows: 5, Columns: 4]
Relationships:
transactions.product_id -> products.product_id
transactions.session_id -> sessions.session_id
sessions.customer_id -> customers.customer_id
Using “Seed Features”
Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis then automatically stacks new features on top of them whenever it can.
By using seed features, we can include domain-specific knowledge in feature engineering automation.
In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125
In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
...: target_entity="customers",
...: agg_primitives=["percent_true"],
...: seed_features=[expensive_purchase])
...:
In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]:
PERCENT_TRUE(transactions.amount > 125)
customer_id
5 0.227848
4 0.220183
1 0.119048
3 0.182796
2 0.129032
We can now see that PERCENT_TRUE was automatically applied to this boolean variable.
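To make the stacked feature concrete, here is a minimal plain-Python sketch of what a PERCENT_TRUE aggregation over the seed feature computes for each customer. The transaction rows and the helper name percent_true_by_customer are made up for illustration and are not part of Featuretools; the real primitive runs over the entityset.

```python
from collections import defaultdict

# Hypothetical (customer_id, amount) rows; not the mock customer data.
transactions = [
    (1, 130.0), (1, 20.0), (1, 75.5), (1, 140.0),
    (2, 50.0), (2, 126.0),
    (3, 10.0),
]

def percent_true_by_customer(rows, threshold=125):
    """Fraction of each customer's transactions with amount > threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # The seed feature: a boolean computed per transaction.
        flags[customer_id].append(amount > threshold)
    # The aggregation: share of True values per customer.
    return {cid: sum(vals) / len(vals) for cid, vals in flags.items()}

result = percent_true_by_customer(transactions)
print(result)
```

The boolean comparison plays the role of the seed feature, and the per-customer mean of those booleans is what the aggregation adds on top of it.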
Add “interesting” values to variables
Sometimes we want to create features that are computed only over the rows that match a condition on another variable. We call this extra filter a “where clause”.
By default, where clauses are built using the interesting_values of a variable.
In [7]: es["sessions"]["device"].interesting_values = ["desktop", "mobile", "tablet"]
We then use where_primitives to specify which aggregation primitives should be combined with where clauses:
In [8]: feature_matrix, feature_defs = ft.dfs(entityset=es,
...: target_entity="customers",
...: agg_primitives=["count", "avg_time_between"],
...: where_primitives=["count", "avg_time_between"],
...: trans_primitives=[])
...:
In [9]: feature_matrix
Out[9]:
zip_code COUNT(sessions) AVG_TIME_BETWEEN(sessions.session_start) COUNT(transactions) AVG_TIME_BETWEEN(transactions.transaction_time) COUNT(sessions WHERE device = desktop) COUNT(sessions WHERE device = tablet) COUNT(sessions WHERE device = mobile) AVG_TIME_BETWEEN(sessions.session_start WHERE device = desktop) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = mobile) AVG_TIME_BETWEEN(transactions.sessions.session_start) COUNT(transactions WHERE sessions.device = tablet) COUNT(transactions WHERE sessions.device = desktop) COUNT(transactions WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.sessions.session_start WHERE sessions.device = mobile) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = tablet) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = desktop) AVG_TIME_BETWEEN(transactions.transaction_time WHERE sessions.device = mobile)
customer_id
5 60091 6 5577.000000 79 363.333333 2 1 3 9685.0 NaN 13942.500000 357.500000 14 29 36 0.000000 345.892857 796.714286 65.000000 376.071429 809.714286
4 60091 8 2516.428571 109 168.518519 3 1 4 4127.5 NaN 3336.666667 163.101852 18 38 53 0.000000 223.108108 192.500000 65.000000 238.918919 206.250000
1 60091 8 3305.714286 126 192.920000 2 3 3 7150.0 8807.5 11570.000000 185.120000 43 27 56 419.404762 275.000000 420.727273 442.619048 302.500000 438.454545
3 13244 6 5096.000000 93 287.554348 4 1 1 4745.0 NaN NaN 276.956522 15 62 16 0.000000 233.360656 0.000000 65.000000 251.475410 65.000000
2 13244 7 4907.500000 93 328.532609 3 2 2 6890.0 5330.0 1690.000000 320.054348 28 34 31 197.407407 417.575758 56.333333 226.296296 435.303030 82.333333
Now we have several new, potentially useful features. For example, the two features below tell us how many sessions a customer completed on a tablet and the average time between those sessions.
In [10]: feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]
Out[10]:
COUNT(sessions WHERE device = tablet) AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)
customer_id
5 1 NaN
4 1 NaN
1 3 8807.5
3 1 NaN
2 2 5330.0
We can see that customers who had only 0 or 1 sessions on a tablet have NaN values for the average time between such sessions.
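The NaN values follow from how a where clause works: the rows are filtered first, and the aggregation only sees what is left. The sketch below illustrates this in plain Python under stated assumptions: the session rows and the helper where_features are hypothetical, and the average gap is assumed to be measured in seconds. It is not the Featuretools implementation.

```python
from datetime import datetime
from math import isnan, nan

# Hypothetical (customer_id, device, session_start) rows for illustration.
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0)),
    (1, "mobile", datetime(2014, 1, 1, 4, 0)),
    (2, "tablet", datetime(2014, 1, 1, 0, 30)),
    (2, "desktop", datetime(2014, 1, 1, 2, 0)),
]

def where_features(rows, customer_id, device):
    """Count and average gap (seconds) over one customer's sessions on one device."""
    # The where clause: keep only sessions on the requested device.
    times = sorted(t for cid, dev, t in rows if cid == customer_id and dev == device)
    count = len(times)
    if count < 2:
        # Fewer than two sessions means there is no gap to average.
        return count, nan
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return count, sum(gaps) / len(gaps)

count_1, avg_1 = where_features(sessions, 1, "tablet")
count_2, avg_2 = where_features(sessions, 2, "tablet")
print(count_1, avg_1)
print(count_2, avg_2)
```

Customer 2 has a single tablet session in this made-up data, so the count is defined but the average gap is not.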
Encoding categorical features
Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.
In [11]: feature_matrix, feature_defs = ft.dfs(entityset=es,
....: target_entity="customers",
....: agg_primitives=["mode"],
....: max_depth=1)
....:
In [12]: feature_matrix
Out[12]:
zip_code MODE(sessions.device) DAY(join_date) DAY(date_of_birth) YEAR(join_date) YEAR(date_of_birth) MONTH(join_date) MONTH(date_of_birth) WEEKDAY(join_date) WEEKDAY(date_of_birth)
customer_id
5 60091 mobile 17 28 2010 1984 7 7 5 5
4 60091 mobile 8 15 2011 2006 4 8 4 1
1 60091 mobile 17 18 2011 1994 4 7 6 0
3 13244 desktop 13 21 2011 2003 8 11 5 4
2 13244 desktop 15 18 2012 1986 4 8 6 0
This feature matrix contains 2 categorical variables, zip_code and MODE(sessions.device). We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.
In [13]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
In [14]: feature_matrix_enc
Out[14]:
zip_code = 60091 zip_code = 13244 zip_code is unknown MODE(sessions.device) = mobile MODE(sessions.device) = desktop MODE(sessions.device) is unknown DAY(join_date) = 17 DAY(join_date) = 15 DAY(join_date) = 13 DAY(join_date) = 8 DAY(join_date) is unknown DAY(date_of_birth) = 18 DAY(date_of_birth) = 28 DAY(date_of_birth) = 21 DAY(date_of_birth) = 15 DAY(date_of_birth) is unknown YEAR(join_date) = 2011 YEAR(join_date) = 2012 YEAR(join_date) = 2010 YEAR(join_date) is unknown YEAR(date_of_birth) = 2006 YEAR(date_of_birth) = 2003 YEAR(date_of_birth) = 1994 YEAR(date_of_birth) = 1986 YEAR(date_of_birth) = 1984 YEAR(date_of_birth) is unknown MONTH(join_date) = 4 MONTH(join_date) = 8 MONTH(join_date) = 7 MONTH(join_date) is unknown MONTH(date_of_birth) = 8 MONTH(date_of_birth) = 7 MONTH(date_of_birth) = 11 MONTH(date_of_birth) is unknown WEEKDAY(join_date) = 6 WEEKDAY(join_date) = 5 WEEKDAY(join_date) = 4 WEEKDAY(join_date) is unknown WEEKDAY(date_of_birth) = 0 WEEKDAY(date_of_birth) = 5 WEEKDAY(date_of_birth) = 4 WEEKDAY(date_of_birth) = 1 WEEKDAY(date_of_birth) is unknown
customer_id
5 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0
4 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0
3 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0
2 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0
The returned feature matrix is now all numeric. Additionally, we get a new set of feature definitions that contain the encoded values.
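As a rough illustration of the column layout above, the following plain-Python sketch one-hot encodes the zip_code values from the feature matrix and adds an "is unknown" column. The helper one_hot is hypothetical and is not the Featuretools API; in particular, treating any value outside the known categories as unknown is an assumption of this sketch, and the zip code "02116" is invented to show that case.

```python
def one_hot(name, values, categories):
    """Return {column_name: [0 or 1, ...]} with one column per category plus an unknown column."""
    columns = {
        f"{name} = {category}": [int(value == category) for value in values]
        for category in categories
    }
    # Assumption: anything outside the known categories is flagged as unknown.
    columns[f"{name} is unknown"] = [int(value not in categories) for value in values]
    return columns

# zip_code values in the row order of the feature matrix (customers 5, 4, 1, 3, 2).
zip_codes = ["60091", "60091", "60091", "13244", "13244"]
categories = ["60091", "13244"]

encoded = one_hot("zip_code", zip_codes, categories)
for column, bits in encoded.items():
    print(column, bits)

# A value not seen before lands in the unknown column.
unseen = one_hot("zip_code", ["02116"], categories)
```

Keeping the category list fixed is what lets the same columns be produced again for rows that arrive later.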
In [15]: print(features_enc)
[<Feature: zip_code = 60091>, <Feature: zip_code = 13244>, <Feature: zip_code is unknown>, <Feature: MODE(sessions.device) = mobile>, <Feature: MODE(sessions.device) = desktop>, <Feature: MODE(sessions.device) is unknown>, <Feature: DAY(join_date) = 17>, <Feature: DAY(join_date) = 15>, <Feature: DAY(join_date) = 13>, <Feature: DAY(join_date) = 8>, <Feature: DAY(join_date) is unknown>, <Feature: DAY(date_of_birth) = 18>, <Feature: DAY(date_of_birth) = 28>, <Feature: DAY(date_of_birth) = 21>, <Feature: DAY(date_of_birth) = 15>, <Feature: DAY(date_of_birth) is unknown>, <Feature: YEAR(join_date) = 2011>, <Feature: YEAR(join_date) = 2012>, <Feature: YEAR(join_date) = 2010>, <Feature: YEAR(join_date) is unknown>, <Feature: YEAR(date_of_birth) = 2006>, <Feature: YEAR(date_of_birth) = 2003>, <Feature: YEAR(date_of_birth) = 1994>, <Feature: YEAR(date_of_birth) = 1986>, <Feature: YEAR(date_of_birth) = 1984>, <Feature: YEAR(date_of_birth) is unknown>, <Feature: MONTH(join_date) = 4>, <Feature: MONTH(join_date) = 8>, <Feature: MONTH(join_date) = 7>, <Feature: MONTH(join_date) is unknown>, <Feature: MONTH(date_of_birth) = 8>, <Feature: MONTH(date_of_birth) = 7>, <Feature: MONTH(date_of_birth) = 11>, <Feature: MONTH(date_of_birth) is unknown>, <Feature: WEEKDAY(join_date) = 6>, <Feature: WEEKDAY(join_date) = 5>, <Feature: WEEKDAY(join_date) = 4>, <Feature: WEEKDAY(join_date) is unknown>, <Feature: WEEKDAY(date_of_birth) = 0>, <Feature: WEEKDAY(date_of_birth) = 5>, <Feature: WEEKDAY(date_of_birth) = 4>, <Feature: WEEKDAY(date_of_birth) = 1>, <Feature: WEEKDAY(date_of_birth) is unknown>]
These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read Deployment.