Improving Computational Performance¶
Feature engineering is a computationally expensive task. While Featuretools comes with reasonable default settings for feature calculation, there are a number of built-in approaches to improve computational performance based on dataset and problem specific considerations.
Reduce number of unique cutoff times¶
Each row in a feature matrix created by Featuretools is calculated at a specific cutoff time that represents the last point in time that data from any entity in an entity set can be used to calculate the feature. As a result, calculations incur an overhead in finding the subset of allowed data for each distinct time in the calculation.
Featuretools is very precise in how it deals with time. For more information, see Handling Time.
If you have many unique cutoff times, it is often worthwhile to figure out how to have fewer. This can be done manually by figuring out which unique times are necessary for your prediction problem or automatically using approximate.
Adjust chunk size when calculating feature matrix¶
When Featuretools calculates a feature matrix, it first groups the rows to be calculated into chunks. Each chunk is a collection of rows that will be computed at the same time. The results of calculating each chunk are combined into the single feature matrix that is returned to you as the user.
If you wish to optimize chunk size, picking the right value will depend on the memory you have available and how often you’d like to get progress updates.
Peak memory usage¶
If rows have the same cutoff time they are placed in the same chunk until the chunk is full, so they can be calculated simultaneously. By increasing the size of a chunk, it is more likely that there is room for all rows for a given cutoff time to be grouped together. This allows us to minimize the overhead of finding allowable data. The downside is that computation now requires more memory per chunk. If the machine in use can’t handle the larger peak memory usage this can start to slow down the process more than the time saved by removing the overhead.
Frequency of progress updates and overall runtime¶
After the completion of a chunk, there is a progress update along with an estimation of remaining compute time provided to the user. Smaller chunks mean more fine-grained updates to the user. Even more, the average runtime over many smaller chunks is a better estimate of the overall remaining runtime.
However, if we make the chunk size too small, we may split up rows that share the same cutoff time into separate chunks. This means Featuretools will have to slice the data for that cutoff time multiple times, resulting in repeated work. Additionally, if chunks get really small (e.g. one row per chunk), then the overhead from other parts of the calculation process will start to contribute more significantly to overall runtime.
We can control chunk size using the
chunk_size argument to
calculate_feature_matrix. By default, the chunk size is set to 10% of all rows in the feature matrix. We can modify it like this:
# use 100 rows per chunk feature_matrix, features_list = ft.dfs(entityset=es, target_entity="customers", chunk_size=100)
We can also set chunksize to be a percentage of total rows or variable based on cutoff times.
# use 5% of rows per chunk feature_matrix, features_list = ft.dfs(entityset=es, target_entity="customers", chunk_size=.05) # use one chunk per unique cutoff time feature_matrix, features_list = ft.dfs(entityset=es, target_entity="customers", chunk_size="cutoff time")
To understand the impact of chunk size on one Entity Set with multiple entities, see the graph below
For a more in-depth look at chunk sizes see the Chunk Size Guide
Partition and Distribute Data¶
When an entire dataset is not required to calculate the features for a given set of instances, we can split the data into independent partitions and calculate on each partition. For example, imagine we are calculating features for customers and the features are “number of other customers in this zip code” or “average age of other customers in this zip code”. In this case, we can load in data partitioned by zip code. As long as we have all of the data for a zip code when calculating, we can calculate all features for a subset of customers.
An example of this approach can be seen in the Predict Next Purchase demo notebook. In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using Dask, which could also be used to scale the computation to a cluster of computers. A framework like Spark could be used similarly.
An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the Featuretools on Dask notebook. This approach is detailed in the Parallelizing Feature Engineering with Dask article on the Feature Labs engineering blog. Dask allows for simple scaling to multiple cores on a single computer or multiple machines on a cluster.
For a similar partition and distribute implementation using Apache Spark with PySpark, refer to the Feature Engineering on Spark notebook. This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework. A write-up of this approach is described in the Featuretools on Spark article on the Feature Labs engineering blog.