Representing Data with EntitySets#

An EntitySet is a collection of dataframes and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take dataframes and relationships as separate arguments, it is recommended to create an EntitySet, so you can more easily manipulate your data as needed.

The Raw Data#

Below we have two tables of data (represented as pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers, so the result looks like something you might see in a log file:

[1]:
import featuretools as ft

data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])

transactions_df.sample(10)
[1]:
transaction_id session_id transaction_time product_id amount customer_id device session_start zip_code join_date birthday
264 296 20 2014-01-01 04:46:00 5 53.22 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
19 74 2 2014-01-01 00:20:35 1 106.99 5 mobile 2014-01-01 00:17:20 60091 2010-07-17 05:27:50 1984-07-28
314 141 23 2014-01-01 05:40:10 5 128.26 3 desktop 2014-01-01 05:32:35 13244 2011-08-13 15:42:34 2003-11-21
290 236 21 2014-01-01 05:14:10 5 57.09 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
379 292 28 2014-01-01 06:50:35 1 133.71 5 mobile 2014-01-01 06:50:35 60091 2010-07-17 05:27:50 1984-07-28
335 482 25 2014-01-01 06:02:55 1 26.30 3 desktop 2014-01-01 05:59:40 13244 2011-08-13 15:42:34 2003-11-21
293 452 21 2014-01-01 05:17:25 5 69.62 4 desktop 2014-01-01 05:02:15 60091 2011-04-08 20:08:14 2006-08-15
271 169 20 2014-01-01 04:53:35 3 78.87 5 desktop 2014-01-01 04:46:00 60091 2010-07-17 05:27:50 1984-07-28
404 476 29 2014-01-01 07:17:40 4 11.62 1 mobile 2014-01-01 07:10:05 60091 2011-04-17 10:48:33 1994-07-18
179 72 12 2014-01-01 03:13:55 2 143.96 4 desktop 2014-01-01 03:04:10 60091 2011-04-08 20:08:14 2006-08-15

And the second dataframe is a list of products involved in those transactions.

[2]:
products_df = data["products"]
products_df
[2]:
product_id brand
0 1 B
1 2 B
2 3 B
3 4 B
4 5 A

Creating an EntitySet#

First, we initialize an EntitySet. If you’d like to give it a name, you can optionally provide an id to the constructor.

[3]:
es = ft.EntitySet(id="customer_data")

Adding dataframes#

To get started, we add the transactions dataframe to the EntitySet. In the call to add_dataframe, we specify three important parameters:

  • The index parameter specifies the column that uniquely identifies rows in the dataframe.

  • The time_index parameter tells Featuretools when the data was created.

  • The logical_types parameter indicates that “product_id” should be interpreted as a Categorical column, even though it is just an integer in the underlying data.

[4]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
    logical_types={
        "product_id": Categorical,
        "zip_code": PostalCode,
    },
)

es
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
[4]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships

Note

You can also use a setter on the EntitySet object to add dataframes

es["transactions"] = transactions_df

Note that this will use the default implementation of add_dataframe, notably the following:

  • if the DataFrame does not have Woodwork initialized, the first column will be the index column

  • if the DataFrame does not have Woodwork initialized, the logical types of all columns will be inferred by Woodwork.

  • if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe.

Note

You can also display your EntitySet structure graphically by calling EntitySet.plot().

This method associates each column in the dataframe with a Woodwork logical type. Each logical type can have an associated standard semantic tag that helps define the column’s data type. If you don’t specify a logical type for a column, it gets inferred from the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the Woodwork documentation.

[5]:
es["transactions"].ww.schema
[5]:
Logical Type Semantic Tag(s)
Column
transaction_id Integer ['index']
session_id Integer ['numeric']
transaction_time Datetime ['time_index']
product_id Categorical ['category']
amount Double ['numeric']
customer_id Integer ['numeric']
device Categorical ['category']
session_start Datetime []
zip_code PostalCode ['category']
join_date Datetime []
birthday Datetime []

Now, we can do the same thing with our products dataframe.

[6]:
es = es.add_dataframe(
    dataframe_name="products", dataframe=products_df, index="product_id"
)

es
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
[6]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

With two dataframes in our EntitySet, we can add a relationship between them.

Adding a Relationship#

We want to relate these two dataframes by the columns called “product_id” in each dataframe. Each product has multiple transactions associated with it, so it is called the parent dataframe, while the transactions dataframe is known as the child dataframe. When specifying relationships, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many.

[7]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es
[7]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id

Now, we see the relationship has been added to our EntitySet.

Creating a dataframe from an existing table#

When working with raw data, it is common to have sufficient information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we “normalize” the transaction dataframe.

[8]:
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index="session_start",
    additional_columns=[
        "device",
        "customer_id",
        "zip_code",
        "session_start",
        "join_date",
    ],
)
es
[8]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 6]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id

Looking at the output above, we see this method did two operations:

  1. It created a new dataframe called “sessions” based on the “session_id” and “session_start” columns in “transactions”

  2. It added a relationship connecting “transactions” and “sessions”

If we look at the schema from the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically:

[9]:
es["transactions"].ww.schema
[9]:
Logical Type Semantic Tag(s)
Column
transaction_id Integer ['index']
session_id Integer ['numeric', 'foreign_key']
transaction_time Datetime ['time_index']
product_id Categorical ['category', 'foreign_key']
amount Double ['numeric']
birthday Datetime []
[10]:
es["sessions"].ww.schema
[10]:
Logical Type Semantic Tag(s)
Column
session_id Integer ['index']
device Categorical ['category']
customer_id Integer ['numeric']
zip_code PostalCode ['category']
session_start Datetime ['time_index']
join_date Datetime []
  1. It removed “device”, “customer_id”, “zip_code” and “join_date” from “transactions” and created new columns in the sessions dataframe. This reduces redundant information, as those properties of a session don’t change between transactions.

  2. It copied “session_start” into the new sessions dataframe and marked it as the time index column to indicate the beginning of a session. If the base dataframe has a time index and make_time_index is not set, normalize_dataframe will create a time index for the new dataframe; in this case, it would create a new time index called “first_transactions_time” using the time of the first transaction of each session. If we don’t want a time index to be created, we can set make_time_index=False.

If we look at the dataframes, we can see what normalize_dataframe did to the actual data.

[11]:
es["sessions"].head(5)
[11]:
session_id device customer_id zip_code session_start join_date
1 1 desktop 2 13244 2014-01-01 00:00:00 2012-04-15 23:31:04
2 2 mobile 5 60091 2014-01-01 00:17:20 2010-07-17 05:27:50
3 3 mobile 4 60091 2014-01-01 00:28:10 2011-04-08 20:08:14
4 4 mobile 1 60091 2014-01-01 00:44:25 2011-04-17 10:48:33
5 5 mobile 4 60091 2014-01-01 01:11:30 2011-04-08 20:08:14
[12]:
es["transactions"].head(5)
[12]:
transaction_id session_id transaction_time product_id amount birthday
10 10 1 2014-01-01 00:00:00 5 127.64 1986-08-18
2 2 1 2014-01-01 00:01:05 2 109.48 1986-08-18
438 438 1 2014-01-01 00:02:10 3 95.06 1986-08-18
192 192 1 2014-01-01 00:03:15 4 78.92 1986-08-18
271 271 1 2014-01-01 00:04:20 3 31.54 1986-08-18

To finish preparing this dataset, create a “customers” dataframe using the same method call.

[13]:
es = es.normalize_dataframe(
    base_dataframe_name="sessions",
    new_dataframe_name="customers",
    index="customer_id",
    make_time_index="join_date",
    additional_columns=["zip_code", "join_date"],
)

es
[13]:
Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 3]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using the EntitySet#

Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let’s build a feature matrix for each product in our dataset.

[14]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")

feature_matrix
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:781: FutureWarning: The provided callable <function sum at 0x7fdfa6bc0040> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  to_merge = base_frame.groupby(
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:781: FutureWarning: The provided callable <function mean at 0x7fdfa6bc0f70> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  to_merge = base_frame.groupby(
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:781: FutureWarning: The provided callable <function min at 0x7fdfa6bc0790> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
  to_merge = base_frame.groupby(
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:781: FutureWarning: The provided callable <function max at 0x7fdfa6bc0670> is currently using SeriesGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  to_merge = base_frame.groupby(
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-featuretools/envs/latest/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:781: FutureWarning: The provided callable <function std at 0x7fdfa6bc40d0> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
  to_merge = base_frame.groupby(
[14]:
COUNT(transactions) MAX(transactions.amount) MEAN(transactions.amount) MIN(transactions.amount) SKEW(transactions.amount) STD(transactions.amount) SUM(transactions.amount) MODE(transactions.DAY(birthday)) MODE(transactions.DAY(transaction_time)) MODE(transactions.MONTH(birthday)) ... MODE(transactions.sessions.device) NUM_UNIQUE(transactions.DAY(birthday)) NUM_UNIQUE(transactions.DAY(transaction_time)) NUM_UNIQUE(transactions.MONTH(birthday)) NUM_UNIQUE(transactions.MONTH(transaction_time)) NUM_UNIQUE(transactions.WEEKDAY(birthday)) NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) NUM_UNIQUE(transactions.YEAR(birthday)) NUM_UNIQUE(transactions.YEAR(transaction_time)) NUM_UNIQUE(transactions.sessions.device)
product_id
1 102 149.56 73.429314 6.84 0.125525 42.479989 7489.79 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
2 92 149.95 76.319891 5.73 0.151934 46.336308 7021.43 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
3 96 148.31 73.001250 5.89 0.223938 38.871405 7008.12 18 1 8 ... desktop 4 1 3 1 4 1 5 1 3
4 106 146.46 76.311038 5.81 -0.132077 42.492501 8088.97 18 1 7 ... desktop 4 1 3 1 4 1 5 1 3
5 104 149.02 76.264904 5.91 0.098248 42.131902 7931.55 18 1 7 ... mobile 4 1 3 1 4 1 5 1 3

5 rows × 25 columns

As we can see, the features generated by DFS use the relational structure of our EntitySet. Therefore, it is important to think carefully about the dataframes that we create.