Advanced Custom Primitives Guide

Functions With Additional Arguments

One caveat with the make_primitive functions is that the required arguments of the function must be input features. Here we create a function for StringCount, a primitive that counts the number of occurrences of a string in a Text input. Since string is not a feature, it needs to be a keyword argument to string_count.

In [1]: def string_count(column, string=None):
   ...:     '''
   ...:     .. note:: this is a naive implementation used for clarity
   ...:     '''
   ...:     assert string is not None, "string to count needs to be defined"
   ...:     counts = [element.lower().count(string) for element in column]
   ...:     return counts
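As a quick sanity check, the function can be called directly on a plain Python list before it is wrapped in a primitive. The sketch below repeats the definition so that it runs on its own; the sample sentences are made up for illustration. It also shows one consequence of the naive implementation: each element is lowercased but the search string is not, so a search string containing uppercase letters never matches.

```python
def string_count(column, string=None):
    """Naive count of `string` in each lowercased element of `column`."""
    assert string is not None, "string to count needs to be defined"
    return [element.lower().count(string) for element in column]


# Illustrative input, not taken from the ecommerce entityset
sample = ["The cat and the hat", "No match"]

counts = string_count(sample, string="the")
print(counts)  # [2, 0]

# The elements are lowercased, the search string is not,
# so an uppercase search string finds nothing.
upper_counts = string_count(sample, string="The")
print(upper_counts)  # [0, 0]

# Omitting the keyword argument trips the assertion.
try:
    string_count(sample)
    missing_string_raises = False
except AssertionError:
    missing_string_raises = True
```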

So that the names of features defined with this primitive show which string is being counted, we define a custom generate_name function.

In [2]: def string_count_generate_name(self):
   ...:     return u"STRING_COUNT(%s, %s)" % (self.base_features[0].get_name(),
   ...:                                       '"' + str(self.kwargs['string']) + '"')
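To see what name this produces without building an entity set, the function can be exercised against stand-in objects. In the sketch below, `fake_base` and `fake_feature` are hypothetical placeholders built with `SimpleNamespace`; they only mimic the two attributes the function reads (`base_features` and `kwargs`) and are not Featuretools classes.

```python
from types import SimpleNamespace


def string_count_generate_name(self):
    return u"STRING_COUNT(%s, %s)" % (self.base_features[0].get_name(),
                                      '"' + str(self.kwargs['string']) + '"')


# Hypothetical stand-ins: a base feature named "comments" and a
# feature object that stores the keyword arguments it was created with.
fake_base = SimpleNamespace(get_name=lambda: "comments")
fake_feature = SimpleNamespace(base_features=[fake_base],
                               kwargs={"string": "the"})

name = string_count_generate_name(fake_feature)
print(name)  # STRING_COUNT(comments, "the")
```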

Now that we have the function, we create the primitive using the make_trans_primitive function.

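In [3]: from featuretools.primitives import make_trans_primitive
   ...: from featuretools.variable_types import Text, Numeric
   ...: 
   ...: StringCount = make_trans_primitive(function=string_count,
   ...:                                    input_types=[Text],
   ...:                                    return_type=Numeric,
   ...:                                    cls_attributes={"generate_name": string_count_generate_name})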

Passing string="test" as a keyword argument when creating a StringCount feature makes “test” the value of string whenever string_count is called to calculate the feature values. Now we use this primitive to create a feature and calculate its values.

In [4]: import featuretools as ft
   ...: from featuretools.tests.testing_utils import make_ecommerce_entityset

In [5]: es = make_ecommerce_entityset()

In [6]: count_the_feat = StringCount(es['log']['comments'], string="the")
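One way to picture what passing string="the" does is to think of the keyword argument as being fixed ahead of time, so that only the column is still needed when values are calculated. The sketch below uses `functools.partial` purely as an analogy; it is not a claim about how Featuretools stores or forwards the argument internally, and the input sentences are invented for illustration.

```python
from functools import partial


def string_count(column, string=None):
    assert string is not None, "string to count needs to be defined"
    return [element.lower().count(string) for element in column]


# Fix the keyword argument once; the resulting callable takes only the column.
count_the = partial(string_count, string="the")

# Illustrative comments, not the actual 'log' data
comments = ["Over the hill", "The end of the road", "Done"]
result = count_the(comments)
print(result)  # [1, 2, 0]
```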

Since string is a non-feature input, Deep Feature Synthesis (DFS) cannot automatically stack StringCount on other primitives to create more features. However, a user-defined StringCount feature can be passed to DFS as a seed feature, and DFS can then stack other primitives on top of it.

In [7]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                   target_entity="sessions",
   ...:                                   agg_primitives=["sum", "mean", "std"],
   ...:                                   seed_features=[count_the_feat])

In [8]: feature_matrix[['STD(log.STRING_COUNT(comments, "the"))', 'SUM(log.STRING_COUNT(comments, "the"))', 'MEAN(log.STRING_COUNT(comments, "the"))']]
    STD(log.STRING_COUNT(comments, "the"))  SUM(log.STRING_COUNT(comments, "the"))  MEAN(log.STRING_COUNT(comments, "the"))
0                                47.124304                                     209                                    41.80
1                                36.509131                                     109                                    27.25
2                                      NaN                                      29                                    29.00
3                                49.497475                                      70                                    35.00
4                                 0.000000                                       0                                     0.00
5                                 1.414214                                       4                                     2.00