Feature types

Featuretools groups features into four general types:

Identity Features

In Featuretools, each feature is defined as a combination of other features. At the lowest level are IdentityFeature features, which are equal to the raw value of a single variable. Most of the time, identity features will be defined transparently for you, such as in the transform feature example below. They may also be defined explicitly:

In [1]: from featuretools.primitives import Feature

In [2]: time_feature = Feature(es["transactions"]["transaction_time"])

In [3]: time_feature
Out[3]: <Feature: transaction_time>

Note: Feature is an alias for IdentityFeature if only a single argument is provided.

Direct Features

Direct features are used to “inherit” feature values from a parent to a child entity. Suppose each event (a row of the transactions entity in these examples) is associated with a single instance of the entity products. This entity has metadata about different products, such as brand, price, etc. We can pull the brand of the product into a feature of the event entity by including the event entity as an argument to Feature. In this case, Feature is an alias for primitives.DirectFeature:

In [4]: from featuretools.primitives import Feature

In [5]: brand = Feature(es["products"]["brand"], es["transactions"])

In [6]: brand
Out[6]: <Feature: products.brand>
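
As a rough illustration of what “inheriting” a value means, the following plain-Python sketch copies a parent column onto each child row through a foreign key. It does not use Featuretools; the toy data, the product_id key, and the direct_feature helper are all assumptions made for this example.

```python
# Hypothetical toy data; table and column names are illustrative only.
products = {
    "p1": {"brand": "A", "price": 10.0},
    "p2": {"brand": "B", "price": 25.0},
}
transactions = [
    {"transaction_id": 1, "product_id": "p1"},
    {"transaction_id": 2, "product_id": "p2"},
    {"transaction_id": 3, "product_id": "p2"},
]


def direct_feature(child_rows, parent_table, foreign_key, parent_column):
    """Copy one parent column onto every child row via the foreign key."""
    return [parent_table[row[foreign_key]][parent_column] for row in child_rows]


brands = direct_feature(transactions, products, "product_id", "brand")
print(brands)  # ['A', 'B', 'B']
```

Each transaction ends up with the brand of the single product it refers to, which is the behavior the direct feature above describes.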

Transform Features

Transform features take one or more features on an Entity and create a single new feature for that same entity. For example, we may want to take a fine-grained “timestamp” feature and convert it into the hour of the day in which it occurred.

In [7]: from featuretools.primitives import Hour, Weekend

In [8]: Hour(time_feature)
Out[8]: <Feature: HOUR(transaction_time)>

In [9]: Weekend(time_feature)
Out[9]: <Feature: WEEKEND(transaction_time)>

The Hour feature takes one parameter: the variable or feature we want to transform. If a variable is passed in, as in this case, an IdentityFeature will be created automatically.

Using algebraic and boolean operations, transform features can combine other features into arbitrary expressions. For example, to determine if a given event happened in the afternoon, we can write:

In [10]: hour_feature = Hour(time_feature)

In [11]: after_twelve = hour_feature > 12

In [12]: after_twelve
Out[12]: <Feature: HOUR(transaction_time) > 12>

In [13]: at_twelve = hour_feature == 12

In [14]: before_five = hour_feature <= 17

In [15]: is_afternoon = after_twelve & before_five

In [16]: is_afternoon
Out[16]: <Feature: AND(HOUR(transaction_time) > 12, HOUR(transaction_time) <= 17)>

Aggregation Features

Aggregation features are used to create features for a parent entity by summarizing data from a child entity. For example, we can create a Count feature which counts the total number of events for each customer:

In [17]: from featuretools.primitives import Count

In [18]: total_events = Count(es["transactions"]["transaction_id"], es["customers"])

In [19]: fm = ft.calculate_feature_matrix([total_events], es)

In [20]: fm.head()
1                            126
2                             93
3                             93
4                            109
5                             79


For users who have written aggregations in SQL, this concept will be familiar. One key difference in Featuretools is that GROUP BY and JOIN are implicit. Since the parent and child entities are specified, Featuretools can infer how to group the child entity and then join the resulting aggregation back to the parent entity.
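
To make the implicit GROUP BY and JOIN concrete, here is a minimal plain-Python sketch of the two steps. It is not how Featuretools computes the feature internally; the toy rows and the customer_id key are assumptions for illustration.

```python
from collections import Counter

# Hypothetical parent (customers) and child (transactions) rows.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
transactions = [
    {"transaction_id": 10, "customer_id": 1},
    {"transaction_id": 11, "customer_id": 1},
    {"transaction_id": 12, "customer_id": 2},
]

# Step 1, the implicit GROUP BY: count child rows per parent key.
counts = Counter(row["customer_id"] for row in transactions)

# Step 2, the implicit JOIN: attach each count to its parent row,
# using 0 for parents that have no children.
count_by_customer = {
    c["customer_id"]: counts.get(c["customer_id"], 0) for c in customers
}
print(count_by_customer)  # {1: 2, 2: 1, 3: 0}
```

Because the result is keyed by the parent rows, customers without any transactions still appear in the output.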

Often, we only want to aggregate using a certain amount of previous data. For example, we might only want to count events from the past 30 days. In this case, we can provide the use_previous parameter:

In [21]: total_events_last_30_days = Count(es["transactions"]["transaction_id"],
   ....:                                   parent_entity=es["customers"],
   ....:                                   use_previous="30 days")

In [22]: fm = ft.calculate_feature_matrix([total_events_last_30_days], es)

In [23]: fm.head()
             COUNT(transactions, Last 30 Days)
1                                          0.0
2                                          0.0
3                                          0.0
4                                          0.0
5                                          0.0

Unlike with cumulative transform features, the use_previous parameter here is evaluated relative to instances of the parent entity, not the child entity. The above feature translates roughly to the following: “For each customer, count the events which occurred in the 30 days preceding the customer’s timestamp.”
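
The windowing described above can be sketched in plain Python as follows. This is only an illustration: the count_in_window helper is hypothetical, and the half-open window (cutoff - use_previous, cutoff] is an assumption rather than something this section states for aggregation features.

```python
from datetime import datetime, timedelta


def count_in_window(event_times, cutoff, use_previous):
    """Count events that fall in (cutoff - use_previous, cutoff]."""
    start = cutoff - use_previous
    return sum(1 for t in event_times if start < t <= cutoff)


# Hypothetical event timestamps for a single customer.
events = [datetime(2014, 1, 1), datetime(2014, 2, 20), datetime(2014, 3, 1)]

# The timestamp of the parent instance (the customer), not of any event.
cutoff = datetime(2014, 3, 10)

n_recent = count_in_window(events, cutoff, timedelta(days=30))
print(n_recent)  # 2
```

Only the two events after 2014-02-08 are counted, because the window is anchored to the customer's timestamp.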

Find the list of the supported aggregation features here.

Where clauses

When defining aggregation or cumulative transform features, we can provide a where parameter to filter the instances we are aggregating over. Using the is_afternoon feature from earlier, we can count the total number of events which occurred in the afternoon:

In [24]: afternoon_events = Count(es["transactions"]["transaction_id"],
   ....:                      parent_entity=es["customers"],
   ....:                      where=is_afternoon).rename("afternoon_events")

In [25]: fm = ft.calculate_feature_matrix([afternoon_events], es)

In [26]: fm.head()
1                         0.0
2                         0.0
3                         0.0
4                         0.0
5                         0.0

The where argument can be any previously-defined boolean feature. Only instances for which the where feature is True are included in the final calculation.
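
As a rough plain-Python equivalent, a where clause amounts to filtering child rows with a boolean condition before aggregating. The rows, the precomputed hour column, and the is_afternoon function below are assumptions for this sketch and stand in for the feature of the same name.

```python
# Hypothetical child rows with a precomputed hour-of-day value.
transactions = [
    {"customer_id": 1, "hour": 9},
    {"customer_id": 1, "hour": 14},
    {"customer_id": 1, "hour": 16},
    {"customer_id": 2, "hour": 20},
]


def is_afternoon(row):
    """Boolean condition mirroring: hour > 12 AND hour <= 17."""
    return 12 < row["hour"] <= 17


afternoon_counts = {}
for row in transactions:
    # Every customer gets an entry, even if no row passes the filter.
    afternoon_counts.setdefault(row["customer_id"], 0)
    if is_afternoon(row):
        afternoon_counts[row["customer_id"]] += 1

print(afternoon_counts)  # {1: 2, 2: 0}
```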

Cumulative Transform Features

Like regular transform features, cumulative transform features are defined for an entity based on other features already defined on that entity. However, they differ in that they use data from many instances to compute a single value.

Each cumulative transform feature is created with a use_previous parameter that takes a Timedelta object (the example below passes a string such as "1 hour"). This parameter specifies how long before the timestamp of each instance to look for data. Think of a cumulative transform feature like a rolling function: the feature iterates over the entity and, for each instance i, aggregates data from the window defined by (i.timestamp - use_previous, i.timestamp].

Say we want to calculate the number of events per session in the past hour. We can create a cumulative count feature that tallies, for each event, the number of events which share a session in the hour preceding that event’s timestamp.

In [27]: from featuretools.primitives import CumCount

In [28]: total_events = CumCount(base_feature=es["transactions"]["transaction_id"],
   ....:                         group_feature=es["transactions"]["session_id"],
   ....:                         use_previous="1 hour")

In [29]: fm = ft.calculate_feature_matrix([total_events], es)

In [30]: fm.head()
                CUM_COUNT(transaction_id by session_id, Last 1 Hour)
1                                                             2.0   
2                                                             2.0   
3                                                             1.0   
4                                                             5.0   
5                                                            25.0   

Because they use previous data, cumulative transform features can only be defined on entities that have a time index. Find the list of available cumulative transform features here.
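
The rolling window (i.timestamp - use_previous, i.timestamp] can be sketched in plain Python as below. The cum_count helper and the toy rows are hypothetical; the sketch follows the window definition given in this section and is not the library's implementation.

```python
from datetime import datetime, timedelta


def cum_count(rows, use_previous):
    """For each row, count same-group rows in (t - use_previous, t]."""
    result = []
    for row in rows:
        start = row["time"] - use_previous
        n = sum(
            1
            for other in rows
            if other["group"] == row["group"]
            and start < other["time"] <= row["time"]
        )
        result.append(n)
    return result


# Hypothetical events, grouped by session.
rows = [
    {"group": "s1", "time": datetime(2014, 1, 1, 0, 0)},
    {"group": "s1", "time": datetime(2014, 1, 1, 0, 30)},
    {"group": "s2", "time": datetime(2014, 1, 1, 0, 45)},
    {"group": "s1", "time": datetime(2014, 1, 1, 2, 0)},
]

counts = cum_count(rows, timedelta(hours=1))
print(counts)  # [1, 2, 1, 1]
```

Note that, unlike the aggregation case, the result has one value per child row, and the window is anchored to each row's own timestamp. Under this window definition each row counts itself, since the right bound is closed.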

Aggregations of Direct Features

Composing multiple feature types is an extremely powerful abstraction that Featuretools makes simple. For instance, we can aggregate direct features on a child entity from a different parent entity. The following calculates the most common brand a customer interacted with:

In [31]: from featuretools.primitives import Mode

In [32]: brand = Feature(es["products"]["brand"], es["transactions"])

In [33]: favorite_brand = Mode(brand, parent_entity=es["customers"])

In [34]: fm = ft.calculate_feature_matrix([favorite_brand], es)

In [35]: fm.head()
1                                           B
2                                           B
3                                           B
4                                           B
5                                           B

Side note: Feature equality overrides default equality

Because comparing a feature to another feature (or to a value) with == creates a new feature, as in hour_feature == 12 above, Featuretools overrides Python’s equals (==) operator. This means that to check whether two feature objects are themselves equal (rather than comparing their computed values in the feature matrix), we need to compare their hashes:

In [36]: hour_feature.hash() == hour_feature.hash()
Out[36]: True

In [37]: hour_feature.hash() != hour_feature.hash()
Out[37]: False

Dictionaries and sets use equality underneath, so their keys need to be hashes as well:

In [38]: myset = set()

In [39]: myset.add(hour_feature.hash())

In [40]: hour_feature.hash() in myset
Out[40]: True

In [41]: mydict = dict()

In [42]: mydict[hour_feature.hash()] = hour_feature

In [43]: hour_feature.hash() in mydict
Out[43]: True
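
To see why comparing hashes is needed, consider the following toy class. It is a hypothetical stand-in, not the Featuretools feature class: SketchFeature, its naming scheme, and its hash() method are assumptions made only to illustrate the effect of overriding ==.

```python
class SketchFeature:
    """Hypothetical stand-in for a feature object; not the Featuretools class."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Build a new "comparison feature" instead of returning a bool.
        if isinstance(other, SketchFeature):
            other_name = other.name
        else:
            other_name = repr(other)
        return SketchFeature("{} = {}".format(self.name, other_name))

    def hash(self):
        # Identify the feature by its definition (here, just its name).
        return hash(self.name)


hour = SketchFeature("HOUR(transaction_time)")

comparison = hour == 12
print(comparison.name)  # HOUR(transaction_time) = 12

# == no longer answers "are these the same feature?"
print(isinstance(hour == hour, bool))  # False

# Comparing hashes does.
same = hour.hash() == SketchFeature("HOUR(transaction_time)").hash()
print(same)  # True
```

In this sketch, == always returns a new object, so it cannot be used as a test of identity; comparing the values returned by hash() restores an ordinary boolean answer.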