Removes columns in the feature matrix that are highly correlated with another column.
We assume that, for any pair of features, the feature that is farther
right in the feature matrix produced by dfs is the more complex one.
This assumption does not hold if the order of columns in the feature
matrix has been changed from what dfs produces.
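The ordering assumption above can be illustrated with a small, dependency-free sketch. This is not the library's implementation: the helper names `pearson` and `drop_rightmost_correlated` are hypothetical, the use of Pearson correlation is an assumption, and so is the rule that the right-hand (assumed more complex) column of a correlated pair is the one removed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # sketch-only choice: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_rightmost_correlated(columns, threshold=0.95):
    """Hypothetical helper. `columns` maps column name -> values, in column order.

    Assumption: when a pair is highly correlated, the column farther right
    (assumed to be the more complex feature) is the one dropped.
    Returns the kept column names in their original order.
    """
    names = list(columns)
    dropped = set()
    # Walk from the rightmost column toward the left.
    for i in range(len(names) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            corr = pearson(columns[names[i]], columns[names[j]])
            if abs(corr) >= threshold:
                dropped.add(names[i])
                break
    return [name for name in names if name not in dropped]


matrix = {
    "age": [20, 30, 40, 50],
    "age_times_two": [40, 60, 80, 100],  # perfectly correlated with "age"
    "visits": [3, 1, 4, 1],
}
kept = drop_rightmost_correlated(matrix)
print(kept)  # ['age', 'visits']

# Same data with the first two columns swapped: a different column survives.
reordered = {
    "age_times_two": matrix["age_times_two"],
    "age": matrix["age"],
    "visits": matrix["visits"],
}
kept_reordered = drop_rightmost_correlated(reordered)
print(kept_reordered)  # ['age_times_two', 'visits']
```

The second call shows why the column order matters: with identical data but a different order, the simpler "age" column is the one removed.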
feature_matrix (pd.DataFrame) – DataFrame whose columns are feature
names and rows are instances.
features (list[featuretools.FeatureBase] or list[str], optional) – List of features to select.
pct_corr_threshold (float) – The correlation threshold at which a pair of
features is considered highly correlated. Defaults to 0.95.
features_to_check (list[str], optional) – List of column names to check
for highly correlated pairs. No other columns are checked,
so only columns in this list can be removed.
If None, all columns are checked.
features_to_keep (list[str], optional) – List of column names to keep even
if correlated with another column. If None, all columns will be
candidates for removal.
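How features_to_check and features_to_keep interact can be sketched as follows. This is a minimal illustration of the behavior described in the parameter list, not the library's code: the function `select_columns` and its `correlated_pairs` argument (pairs already judged to be highly correlated) are hypothetical, and dropping the right-hand column of a pair is an assumption.

```python
def select_columns(all_columns, correlated_pairs,
                   features_to_check=None, features_to_keep=None):
    """Hypothetical helper returning the column names that survive selection.

    all_columns: column names in feature-matrix order.
    correlated_pairs: set of frozensets, each holding two column names that
        are assumed to have already been judged highly correlated.
    """
    if features_to_check is None:
        features_to_check = all_columns  # default: check every column
    if features_to_keep is None:
        features_to_keep = []  # default: every column may be removed

    # Only checked columns take part in comparisons, so only they can be dropped.
    checked = [c for c in all_columns if c in features_to_check]
    dropped = set()
    for i in range(len(checked) - 1, 0, -1):
        for j in range(i):
            if frozenset((checked[i], checked[j])) in correlated_pairs:
                dropped.add(checked[i])  # assumption: drop the right-hand column
                break

    # Columns in features_to_keep survive even if they were marked as dropped.
    return [c for c in all_columns if c in features_to_keep or c not in dropped]


columns = ["a", "b", "c", "d"]
pairs = {frozenset(("a", "b")), frozenset(("c", "d"))}

default_result = select_columns(columns, pairs)
checked_result = select_columns(columns, pairs, features_to_check=["a", "b"])
kept_result = select_columns(columns, pairs, features_to_keep=["d"])

print(default_result)  # ['a', 'c']
print(checked_result)  # ['a', 'c', 'd']
print(kept_result)     # ['a', 'c', 'd']
```

In the second call, "d" survives because it was never checked; in the third, it was checked and found correlated with "c" but is protected by features_to_keep.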
Returns the feature matrix and the list of generated feature definitions,
matching the output of dfs. If no feature list is provided as input,
the feature list will not be returned. For consistent results,
do not change the order of the features output by dfs.