Feature primitives

Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, it can be applied across datasets and stacked with other primitives to create new calculations.

Why primitives?

The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.

A primitive only constrains the input and output data types, so it can be used to transfer calculations known in one domain to another. Consider a feature that data scientists often calculate for transactional or event log data: the average time between events. This feature is valuable for predicting fraudulent behavior or future customer engagement.
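To make the idea concrete, here is a minimal hand-written sketch of "average time between events" using only the Python standard library. It is not how Featuretools computes the feature. The helper names and the sample timestamps are assumptions made for illustration. The point is that the calculation splits naturally into a transform step (time since the previous event) and an aggregation step (mean).

```python
from datetime import datetime
from statistics import mean


def time_since_previous(timestamps):
    """Transform step: seconds since the previous event (None for the first one)."""
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return [None] + gaps


def mean_ignoring_missing(values):
    """Aggregation step: mean of the values that are present, or None if there are none."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Hypothetical session start times for a single customer.
session_starts = [
    datetime(2014, 1, 1, 0, 0, 0),
    datetime(2014, 1, 1, 0, 10, 0),
    datetime(2014, 1, 1, 0, 30, 0),
]

gaps = time_since_previous(session_starts)   # one value per event
avg_gap = mean_ignoring_missing(gaps)        # one value per customer
print(gaps, avg_gap)
```

The transform returns one value per event, and the aggregation reduces those values to a single number per customer.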

DFS achieves the same feature by stacking two primitives, "time_since_previous" and "mean":

In [1]: feature_defs = ft.dfs(entityset=es,
   ...:                       target_entity="customers",
   ...:                       agg_primitives=["mean"],
   ...:                       trans_primitives=["time_since_previous"],
   ...:                       features_only=True)
   ...: 

In [2]: feature_defs
Out[2]: 
[<Feature: zip_code>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: TIME_SINCE_PREVIOUS(join_date)>,
 <Feature: MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>]

Note

When dfs is called with features_only=True, only the feature definitions are returned as output. By default, this parameter is set to False. It is used to quickly inspect the feature definitions before spending time calculating the feature matrix.

A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. Deep Feature Synthesis uses this to produce several different summaries of the time since the previous event.

In [3]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["mean", "max", "min", "std", "skew"],
   ...:                                       trans_primitives=["time_since_previous"])
   ...: 

In [4]: feature_matrix[["MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))",
   ...:                 "MAX(sessions.TIME_SINCE_PREVIOUS(session_start))",
   ...:                 "MIN(sessions.TIME_SINCE_PREVIOUS(session_start))",
   ...:                 "STD(sessions.TIME_SINCE_PREVIOUS(session_start))",
   ...:                 "SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))"]]
   ...: 
Out[4]: 
             MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))  MAX(sessions.TIME_SINCE_PREVIOUS(session_start))  MIN(sessions.TIME_SINCE_PREVIOUS(session_start))  STD(sessions.TIME_SINCE_PREVIOUS(session_start))  SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))
customer_id                                                                                                                                                                                                                                                            
5                                                  1007.500000                                            1170.0                                             715.0                                        157.884451                                          -1.507217
4                                                   999.375000                                            1625.0                                             650.0                                        308.688904                                           1.065177
1                                                   966.875000                                            1170.0                                             715.0                                        171.754341                                          -0.254557
3                                                   888.333333                                            1170.0                                             650.0                                        177.613813                                           0.434581
2                                                   725.833333                                             975.0                                             520.0                                        194.638554                                           0.162631
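The five column names selected above share one pattern: the upper-cased aggregation name wrapped around the same stacked feature. As a small illustration, the names can be generated with plain Python string formatting instead of being typed out. This is not a Featuretools API, and the variable names are assumptions for this sketch.

```python
# Aggregation primitives passed to dfs in the example above.
agg_primitives = ["mean", "max", "min", "std", "skew"]

# The stacked transform feature that each aggregation is applied to.
base_feature = "sessions.TIME_SINCE_PREVIOUS(session_start)"

# Build the feature names following the NAME(child.feature) pattern seen in the output.
column_names = ["{}({})".format(agg.upper(), base_feature) for agg in agg_primitives]

for name in column_names:
    print(name)
```

A list built this way could then be used to select columns from the feature matrix, assuming the features were actually generated under those names.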

Aggregation vs Transform Primitive

In the example above, we use two types of primitives.

Aggregation primitives: These primitives take related instances as input and output a single value. They are applied across a parent-child relationship in an entity set. Examples: "count", "sum", "avg_time_between".

Transform primitives: These primitives take one or more variables from an entity as input and output a new variable for that entity. They are applied to a single entity. Examples: "hour", "time_since_previous", "absolute".
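The difference in shape between the two kinds can be sketched in plain Python, without Featuretools. The toy rows below are assumptions for illustration: a transform produces one output per input row, while an aggregation groups child rows by their parent and produces one value per parent.

```python
from collections import defaultdict

# Hypothetical child rows: each transaction belongs to a parent session.
transactions = [
    {"session_id": 1, "amount": -5.0},
    {"session_id": 1, "amount": 12.5},
    {"session_id": 2, "amount": 7.0},
]

# Transform-style computation: one output value for every input row.
absolute_amounts = [abs(row["amount"]) for row in transactions]

# Aggregation-style computation: group child rows by parent, then reduce
# each group to a single value.
amounts_by_session = defaultdict(list)
for row in transactions:
    amounts_by_session[row["session_id"]].append(row["amount"])
total_per_session = {sid: sum(vals) for sid, vals in amounts_by_session.items()}

print(absolute_amounts)
print(total_per_session)
```

Here the transform result has three entries, one for each transaction, and the aggregation result has two, one for each session.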

For a DataFrame that lists and describes each built-in primitive in Featuretools, call ft.list_primitives(). In addition, a list of all available primitives can be obtained by visiting primitives.featurelabs.com.

In [5]: ft.list_primitives().head(5)
Out[5]: 
               name         type                                        description
0             trend  aggregation      Calculates the trend of a variable over time.
1             first  aggregation              Determines the first value in a list.
2             count  aggregation  Determines the total number of values, excludi...
3              skew  aggregation  Computes the extent to which a distribution di...
4  avg_time_between  aggregation  Computes the average number of seconds between...

Defining Custom Primitives

The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:

  • Specify the type of primitive: Aggregation or Transform

  • Define the input and output data types

  • Write a function in Python to do the calculation

  • Annotate with attributes to constrain how it is applied

Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to be automatically transferred to another.
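One way to picture why type constraints are enough for stacking is the toy model below. It is a simplified sketch, not Featuretools' implementation: ToyPrimitive and can_stack are hypothetical names, and the type labels are plain strings rather than real variable types. In this model, one primitive can be applied on top of another whenever the inner primitive's return type matches the outer primitive's input type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPrimitive:
    """Hypothetical stand-in for a primitive: just a name and its type constraints."""
    name: str
    input_type: str
    return_type: str


def can_stack(outer, inner):
    """True if `outer` can consume the output of `inner` (types match)."""
    return inner.return_type == outer.input_type


# Simplified type labels, chosen only for this illustration.
time_since_previous = ToyPrimitive("time_since_previous", "datetime", "numeric")
mean = ToyPrimitive("mean", "numeric", "numeric")

# mean can be applied to the numeric output of time_since_previous ...
print(can_stack(mean, time_since_previous))
# ... but time_since_previous cannot consume the numeric output of mean.
print(can_stack(time_since_previous, mean))
```

Because the check looks only at types, any new primitive that declares its input and return types can take part in the same kind of matching.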

Simple Custom Primitives

In [6]: from featuretools.primitives import make_agg_primitive, make_trans_primitive

In [7]: from featuretools.variable_types import Text, Numeric

In [8]: def absolute(column):
   ...:     return abs(column)
   ...: 

In [9]: Absolute = make_trans_primitive(function=absolute,
   ...:                                 input_types=[Numeric],
   ...:                                 return_type=Numeric)
   ...: 

Above, we used make_trans_primitive and a Python function we defined to create a new transform primitive that can be used with Deep Feature Synthesis. We also annotated the input data types the primitive can be applied to and the data type it returns.

Similarly, we can make a new aggregation primitive using make_agg_primitive.

In [10]: def maximum(column):
   ....:     return max(column)
   ....: 

In [11]: Maximum = make_agg_primitive(function=maximum,
   ....:                           input_types=[Numeric],
   ....:                           return_type=Numeric)
   ....: 

Because we defined an aggregation primitive, the function takes in a list of values but returns only one.

Now that we’ve defined two primitives, we can use them with the dfs function as if they were built-in primitives.

In [12]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="sessions",
   ....:                                       agg_primitives=[Maximum],
   ....:                                       trans_primitives=[Absolute],
   ....:                                       max_depth=2)
   ....: 

In [13]: feature_matrix[["customers.MAXIMUM(transactions.amount)", "MAXIMUM(transactions.ABSOLUTE(amount))"]].head(5)
Out[13]: 
            customers.MAXIMUM(transactions.amount)  MAXIMUM(transactions.ABSOLUTE(amount))
session_id                                                                                
1                                           146.81                                  141.66
2                                           149.02                                  135.25
3                                           149.95                                  147.73
4                                           139.43                                  129.00
5                                           149.95                                  139.20

Word Count Example

Here we define a function, word_count, which counts the number of words in each row of an input and returns a list of the counts.

In [14]: def word_count(column):
   ....:     '''
   ....:     Counts the number of words in each row of the column. Returns a list
   ....:     of the counts for each row.
   ....:     '''
   ....:     word_counts = []
   ....:     for value in column:
   ....:         words = value.split(None)
   ....:         word_counts.append(len(words))
   ....:     return word_counts
   ....: 
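Before wrapping the function as a primitive, it can help to check its behavior on a few plain strings. The sketch below repeats the word_count definition so that it runs on its own. The sample inputs are assumptions for illustration.

```python
def word_count(column):
    """Counts whitespace-separated words in each value of the column."""
    word_counts = []
    for value in column:
        words = value.split(None)  # split on any run of whitespace
        word_counts.append(len(words))
    return word_counts


# Hypothetical text values, including extra whitespace and an empty string.
sample = ["hello world", "  one   two three ", ""]
counts = word_count(sample)
print(counts)
```

Note that the function calls split on every value, so a missing value such as None would raise an AttributeError. Text columns that contain missing values may need to be cleaned or guarded first.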

Next, we need to create a custom primitive from the word_count function.

In [15]: WordCount = make_trans_primitive(function=word_count,
   ....:                                  input_types=[Text],
   ....:                                  return_type=Numeric)
   ....: 

In [16]: from featuretools.tests.testing_utils import make_ecommerce_entityset

In [17]: es = make_ecommerce_entityset()

Since WordCount is a transform primitive, we need to add it to the list of transform primitives DFS can use when generating features.

In [18]: feature_matrix, features = ft.dfs(entityset=es,
   ....:                                   target_entity="sessions",
   ....:                                   agg_primitives=["sum", "mean", "std"],
   ....:                                   trans_primitives=[WordCount])
   ....: 

In [19]: feature_matrix[["customers.WORD_COUNT(favorite_quote)", "STD(log.WORD_COUNT(comments))", "SUM(log.WORD_COUNT(comments))", "MEAN(log.WORD_COUNT(comments))"]]
Out[19]: 
    customers.WORD_COUNT(favorite_quote)  STD(log.WORD_COUNT(comments))  SUM(log.WORD_COUNT(comments))  MEAN(log.WORD_COUNT(comments))
id                                                                                                                                    
0                                      9                     540.436860                           2500                             500
1                                      9                     583.702550                           1732                             433
2                                      9                            NaN                            246                             246
3                                      6                     883.883476                           1256                             628
4                                      6                       0.000000                              9                               3
5                                     12                      19.798990                             68                              34

By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.

Multiple Input Types

If a primitive requires multiple features as input, input_types has multiple elements. For example, [Numeric, Numeric] means the primitive requires two Numeric features as input. Below is an example of a primitive that takes multiple input features.

In [20]: from featuretools.variable_types import Datetime, Timedelta, Variable

In [21]: import pandas as pd

In [22]: def mean_sunday(numeric, datetime):
   ....:     '''
   ....:     Finds the mean of non-null values of a feature that occurred on Sundays
   ....:     '''
   ....:     days = pd.DatetimeIndex(datetime).weekday.values
   ....:     df = pd.DataFrame({'numeric': numeric, 'time': days})
   ....:     return df[df['time'] == 6]['numeric'].mean()
   ....: 

In [23]: MeanSunday = make_agg_primitive(function=mean_sunday,
   ....:                                  input_types=[Numeric, Datetime],
   ....:                                  return_type=Numeric)
   ....: 

In [24]: feature_matrix, features = ft.dfs(entityset=es,
   ....:                                   target_entity="sessions",
   ....:                                   agg_primitives=[MeanSunday],
   ....:                                   trans_primitives=[],
   ....:                                   max_depth=1)
   ....: 

In [25]: feature_matrix[["MEAN_SUNDAY(log.value, datetime)", "MEAN_SUNDAY(log.value_2, datetime)"]]
Out[25]: 
    MEAN_SUNDAY(log.value, datetime)  MEAN_SUNDAY(log.value_2, datetime)
id                                                                      
0                                NaN                                 NaN
1                                NaN                                 NaN
2                                NaN                                 NaN
3                                2.5                                 1.0
4                                7.0                                 3.0
5                                NaN                                 NaN
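The comparison with 6 in mean_sunday relies on the weekday numbering in which Monday is 0 and Sunday is 6. The standard library's datetime.weekday() uses the same convention. The sketch below is a standard-library analogue of the same logic, not the primitive itself. The function name mean_sunday_plain and the sample data are assumptions for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_sunday_plain(values, timestamps):
    """Mean of the non-missing values whose timestamp falls on a Sunday, else NaN."""
    sunday_values = [v for v, ts in zip(values, timestamps)
                     if v is not None and ts.weekday() == 6]  # Monday=0 ... Sunday=6
    return mean(sunday_values) if sunday_values else float("nan")


# Hypothetical observations: 2024-01-07 is a Sunday, the other dates are not.
timestamps = [
    datetime(2024, 1, 6, 12, 0),   # Saturday
    datetime(2024, 1, 7, 9, 0),    # Sunday
    datetime(2024, 1, 7, 18, 0),   # Sunday
    datetime(2024, 1, 8, 8, 0),    # Monday
]
values = [10.0, 2.0, 3.0, 50.0]

result = mean_sunday_plain(values, timestamps)
print(result)
```

When no observation falls on a Sunday, this sketch returns NaN, which mirrors the NaN entries in the output above.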