Representing Data with EntitySets

An EntitySet is a collection of entities and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take entities and relationships as separate arguments, we recommend creating an EntitySet so you can more easily manipulate your data as needed.

The Raw Data

Below we have two tables of data (represented as pandas DataFrames) related to customer transactions. The first is a list of all transactions.

In [1]: transactions_df.sample(10)
Out[1]: 
     transaction_id  session_id    transaction_time product_id  amount  customer_id   device zip_code
194             495           4 2014-01-01 00:48:45          4   90.69            3   mobile    02139
326             157          19 2014-01-01 04:30:50          1  110.87            2   tablet    02139
443             465          21 2014-01-01 05:07:40          1   54.66            4  desktop    60091
211              91           4 2014-01-01 01:07:10          2  143.93            3   mobile    02139
280             225           7 2014-01-01 01:42:55          3   71.53            2  desktop    02139
460               7          22 2014-01-01 05:26:05          3   83.33            4   tablet    60091
19               85           2 2014-01-01 00:20:35          4  148.14            1  desktop    60091
265             438          34 2014-01-01 08:43:15          4  100.04            3  desktop    02139
146             462          11 2014-01-01 02:49:00          1   27.46            5   tablet    02139
104             379          27 2014-01-01 06:40:50          4  131.83            1  desktop    60091

The second dataframe is a list of the products involved in those transactions.

In [2]: products_df
Out[2]: 
   product_id brand
0           1     B
1           2     B
2           3     C
3           4     A
4           5     C

Creating an EntitySet

First, we initialize an EntitySet and give it an id. Here, ft is assumed to be the featuretools module imported under its conventional alias.

In [3]: es = ft.EntitySet(id="transactions")

Adding entities

To get started, we load the transactions dataframe as an entity.

In [4]: es = es.entity_from_dataframe(entity_id="transactions",
   ...:                               dataframe=transactions_df,
   ...:                               index="transaction_id",
   ...:                               time_index="transaction_time",
   ...:                               variable_types={"product_id": ft.variable_types.Categorical})
   ...: 

In [5]: es
Out[5]: 
Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 8]
  Relationships:
    No relationships

This method loads each column in the dataframe as a variable. We can see the variables in an entity using the code below.

In [6]: es["transactions"].variables
Out[6]: 
[<Variable: product_id (dtype = categorical, count = 500)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: session_id (dtype = numeric, count = 500)>,
 <Variable: amount (dtype = numeric, count = 500)>,
 <Variable: device (dtype = categorical, count = 500)>,
 <Variable: customer_id (dtype = numeric, count = 500)>,
 <Variable: transaction_id (dtype = index, count = 500)>,
 <Variable: zip_code (dtype = categorical, count = 500)>]

In the call to entity_from_dataframe, we specified three important parameters:

  • The index parameter specifies the column that uniquely identifies rows in the dataframe.
  • The time_index parameter tells Featuretools when the data was created.
  • The variable_types parameter indicates that “product_id” should be interpreted as a Categorical variable, even though it is just an integer in the underlying data.
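To see why the variable type matters, here is a minimal sketch in plain Python (it does not use Featuretools). It takes the product_id values from the ten sample rows shown earlier and summarizes them first as numbers and then as category labels. The variable names are illustrative only.

```python
from collections import Counter
from statistics import mean

# product_id values from the ten sampled transactions shown earlier.
product_ids = [4, 1, 1, 2, 3, 3, 4, 4, 1, 4]

# Treated as numeric, the ids would be averaged, which has no real meaning.
numeric_summary = mean(product_ids)

# Treated as categorical, the ids are counted as labels instead.
counts = Counter(product_ids)
most_common_product = counts.most_common(1)[0][0]
num_unique_products = len(counts)

print(numeric_summary)       # an "average product id" is not interpretable
print(most_common_product)   # the most frequently purchased product
print(num_unique_products)   # how many distinct products appear
```

Summaries such as the most common value or the number of distinct values are the kind of result that makes sense for an identifier column, which is the motivation for overriding the inferred type.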

Now, we can do the same thing with our products dataframe.

In [7]: es = es.entity_from_dataframe(entity_id="products",
   ...:                               dataframe=products_df,
   ...:                               index="product_id")
   ...: 

In [8]: es
Out[8]: 
Entityset: transactions
  Entities:
    products [Rows: 5, Columns: 2]
    transactions [Rows: 500, Columns: 8]
  Relationships:
    No relationships

With two entities in our entity set, we can add a relationship between them.

Adding a Relationship

We want to relate these two entities by the columns called “product_id” in each entity. Each product has multiple transactions associated with it, so products is called the parent entity, while the transactions entity is known as the child entity. When specifying relationships, we list the variable in the parent entity first. Note that each ft.Relationship must denote a one-to-many relationship rather than a relationship that is one-to-one or many-to-many.

In [9]: new_relationship = ft.Relationship(es["products"]["product_id"],
   ...:                                    es["transactions"]["product_id"])
   ...: 

In [10]: es = es.add_relationship(new_relationship)

In [11]: es
Out[11]: 
Entityset: transactions
  Entities:
    products [Rows: 5, Columns: 2]
    transactions [Rows: 500, Columns: 8]
  Relationships:
    transactions.product_id -> products.product_id

Now, we see the relationship has been added to our entity set.
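The one-to-many requirement can be checked on the raw rows before a relationship is added. The sketch below is plain Python on miniature, hand-written copies of the two tables. The helper is_one_to_many is hypothetical and is not part of Featuretools, and the second condition (every child value appears in the parent) is an assumption about what a clean dataset looks like, not a statement about how Featuretools validates relationships.

```python
# Miniature versions of the two tables, as plain Python rows.
products = [
    {"product_id": 1, "brand": "B"},
    {"product_id": 2, "brand": "B"},
    {"product_id": 3, "brand": "C"},
]
transactions = [
    {"transaction_id": 157, "product_id": 1},
    {"transaction_id": 465, "product_id": 1},
    {"transaction_id": 91, "product_id": 2},
]


def is_one_to_many(parent_rows, child_rows, key):
    """Return True if `key` is unique in the parent and every child value exists there."""
    parent_keys = [row[key] for row in parent_rows]
    parent_key_set = set(parent_keys)
    # The parent side must hold each key exactly once.
    unique_in_parent = len(parent_keys) == len(parent_key_set)
    # Each child row must point at a key that the parent actually has.
    children_covered = all(row[key] in parent_key_set for row in child_rows)
    return unique_in_parent and children_covered


print(is_one_to_many(products, transactions, "product_id"))

# Swapping the roles fails: product_id repeats in transactions,
# so transactions cannot act as the parent.
print(is_one_to_many(transactions, products, "product_id"))
```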

Creating an entity from an existing table

When working with raw data, a single table often contains enough information to justify creating new entities. To create a new entity and relationship for sessions, we “normalize” the transactions entity.

In [12]: es = es.normalize_entity(base_entity_id="transactions",
   ....:                          new_entity_id="sessions",
   ....:                          index="session_id",
   ....:                          additional_variables=["device", "customer_id", "zip_code"])
   ....: 

In [13]: es
Out[13]: 
Entityset: transactions
  Entities:
    sessions [Rows: 35, Columns: 5]
    products [Rows: 5, Columns: 2]
    transactions [Rows: 500, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id

Looking at the output above, we see this method performed two operations:

  1. It created a new entity called “sessions” based on the “session_id” variable in “transactions”
  2. It added a relationship connecting “transactions” and “sessions”.

If we look at the variables in transactions and the new sessions entity, we see two more operations that were performed automatically.

In [14]: es["transactions"].variables
Out[14]: 
[<Variable: product_id (dtype = categorical, count = 500)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: session_id (dtype = id, count = 500)>,
 <Variable: amount (dtype = numeric, count = 500)>,
 <Variable: transaction_id (dtype = index, count = 500)>]

In [15]: es["sessions"].variables
Out[15]: 
[<Variable: device (dtype = categorical, count = 35)>,
 <Variable: customer_id (dtype = numeric, count = 35)>,
 <Variable: first_transactions_time (dtype: datetime_time_index, format: None)>,
 <Variable: session_id (dtype = index, count = 35)>,
 <Variable: zip_code (dtype = categorical, count = 35)>]

  1. It removed “device”, “customer_id”, and “zip_code” from “transactions” and created new variables in the sessions entity. This reduces redundant information, as those properties of a session don’t change between transactions.
  2. It created the “first_transactions_time” variable in the new sessions entity to indicate the beginning of a session. If we don’t want this variable to be created, we can set make_time_index=False.
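The operations described above can be sketched in plain Python to make the idea concrete. This is a conceptual illustration only, not how Featuretools implements normalize_entity. The helper split_out_entity and its parameter names are hypothetical, and the rows are a small subset borrowed from the tables above, so the first transaction time it finds per session reflects only these rows.

```python
from datetime import datetime

# A few rows borrowed from the tables above, deliberately out of time order.
transactions = [
    {"transaction_id": 186, "session_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 1, 5),
     "amount": 147.23, "device": "desktop", "customer_id": 1, "zip_code": "60091"},
    {"transaction_id": 352, "session_id": 1, "transaction_time": datetime(2014, 1, 1, 0, 0, 0),
     "amount": 7.39, "device": "desktop", "customer_id": 1, "zip_code": "60091"},
    {"transaction_id": 91, "session_id": 4, "transaction_time": datetime(2014, 1, 1, 1, 7, 10),
     "amount": 143.93, "device": "mobile", "customer_id": 3, "zip_code": "02139"},
    {"transaction_id": 495, "session_id": 4, "transaction_time": datetime(2014, 1, 1, 0, 48, 45),
     "amount": 90.69, "device": "mobile", "customer_id": 3, "zip_code": "02139"},
]


def split_out_entity(rows, index, time_column, additional_columns, new_time_column):
    """Hypothetical helper: return (new_entity_rows, slimmed_base_rows)."""
    new_entity = {}
    # Visit rows in time order so the first row seen per key is the earliest one.
    for row in sorted(rows, key=lambda r: r[time_column]):
        key = row[index]
        if key not in new_entity:
            new_row = {index: key}
            new_row.update({col: row[col] for col in additional_columns})
            new_row[new_time_column] = row[time_column]
            new_entity[key] = new_row
    # The moved columns are dropped from the base rows; the key column stays.
    slimmed = [{k: v for k, v in row.items() if k not in additional_columns} for row in rows]
    return list(new_entity.values()), slimmed


sessions, slim_transactions = split_out_entity(
    transactions,
    index="session_id",
    time_column="transaction_time",
    additional_columns=["device", "customer_id", "zip_code"],
    new_time_column="first_transactions_time",
)

for session in sessions:
    print(session)
print(sorted(slim_transactions[0]))
```

Each session row is stored once, and the base rows keep only the session_id column that links them back to it.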

If we look at the dataframes, we can see what normalize_entity did to the actual data.

In [16]: es["sessions"].head(5)
Out[16]: 
            session_id   device  customer_id zip_code first_transactions_time
session_id                                                                   
1                    1  desktop            1    60091     2014-01-01 00:00:00
2                    2  desktop            1    60091     2014-01-01 00:17:20
3                    3   mobile            5    02139     2014-01-01 00:28:10
4                    4   mobile            3    02139     2014-01-01 00:43:20
5                    5   tablet            2    02139     2014-01-01 01:10:25

In [17]: es["transactions"].head(5)
Out[17]: 
                transaction_id  session_id    transaction_time product_id  amount
transaction_id                                                                   
352                        352           1 2014-01-01 00:00:00          4    7.39
186                        186           1 2014-01-01 00:01:05          4  147.23
319                        319           1 2014-01-01 00:02:10          2  111.34
256                        256           1 2014-01-01 00:03:15          4   78.15
449                        449           1 2014-01-01 00:04:20          3   33.93

To finish preparing this dataset, create a “customers” entity using the same method call.

In [18]: es = es.normalize_entity(base_entity_id="sessions",
   ....:                          new_entity_id="customers",
   ....:                          index="customer_id",
   ....:                          additional_variables=["zip_code"],
   ....:                          make_time_index=False)
   ....: 

In [19]: es
Out[19]: 
Entityset: transactions
  Entities:
    customers [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    products [Rows: 5, Columns: 2]
    transactions [Rows: 500, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

Using the EntitySet

Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let’s build a feature matrix for each product in our dataset.

In [20]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ....:                                       target_entity="products")
   ....: 

In [21]: feature_matrix
Out[21]: 
           brand  MAX(transactions.amount)  SUM(transactions.amount)  SKEW(transactions.amount)  NUM_UNIQUE(transactions.session_id)  MODE(transactions.session_id)  MEAN(transactions.amount)  STD(transactions.amount)  MIN(transactions.amount)  COUNT(transactions)                     ...                      NUM_UNIQUE(transactions.YEAR(transaction_time))  MODE(transactions.sessions.device)  NUM_UNIQUE(transactions.sessions.customer_id) MODE(transactions.DAY(transaction_time))  NUM_UNIQUE(transactions.DAY(transaction_time))  NUM_UNIQUE(transactions.sessions.device)  MODE(transactions.YEAR(transaction_time))  MODE(transactions.sessions.customer_id)  NUM_UNIQUE(transactions.MONTH(transaction_time))  MODE(transactions.MONTH(transaction_time))
product_id                                                                                                                                                                                                                                                                                  ...                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
1              B                    148.86                   7046.84                  -0.027598                                   31                              4                  71.906531                 40.232770                      6.29                   98                     ...                                                                    1                             desktop                                              5                                        1                                               1                                         3                                       2014                                        2                                                 1                                           1
2              B                    147.86                   7247.48                   0.180324                                   34                             19                  75.494583                 39.083334                      8.19                   96                     ...                                                                    1                             desktop                                              5                                        1                                               1                                         3                                       2014                                        1                                                 1                                           1
3              C                    149.95                   7916.96                  -0.075324                                   35                             31                  82.468333                 41.647180                      5.81                   96                     ...                                                                    1                             desktop                                              5                                        1                                               1                                         3                                       2014                                        1                                                 1                                           1
4              A                    149.02                   8181.19                   0.153199                                   34                             30                  75.056789                 44.354276                      5.73                  109                     ...                                                                    1                             desktop                                              5                                        1                                               1                                         3                                       2014                                        4                                                 1                                           1
5              C                    149.56                   7498.00                   0.087860                                   34                             28                  74.237624                 44.686334                      5.60                  101                     ...                                                                    1                             desktop                                              5                                        1                                               1                                         3                                       2014                                        1                                                 1                                           1

[5 rows x 22 columns]

As we can see, the features from DFS use the relational structure of our entity set. Therefore, it is important to think carefully about the entities that we create.
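To make the link between relationships and features concrete, the sketch below computes a few per-product aggregates by hand, in plain Python. It mirrors the idea behind columns such as COUNT(transactions), SUM(transactions.amount), and MAX(transactions.amount), but it is only an illustration under simplified assumptions, not the DFS algorithm. The helper aggregate_amounts is hypothetical, and the rows are four of the sampled transactions shown earlier.

```python
from collections import defaultdict

# Four of the sampled transactions shown earlier (child rows).
transactions = [
    {"product_id": 4, "amount": 90.69},
    {"product_id": 1, "amount": 110.87},
    {"product_id": 1, "amount": 54.66},
    {"product_id": 4, "amount": 148.14},
]


def aggregate_amounts(child_rows, parent_key):
    """Hypothetical helper: group child amounts by parent key and summarize them."""
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[parent_key]].append(row["amount"])
    return {
        key: {
            "COUNT": len(amounts),
            "SUM": round(sum(amounts), 2),  # rounded to cents to hide float noise
            "MAX": max(amounts),
        }
        for key, amounts in grouped.items()
    }


features = aggregate_amounts(transactions, "product_id")
for product_id in sorted(features):
    print(product_id, features[product_id])
```

The grouping key is exactly the column that the relationship connects, which is why a different choice of entities and relationships leads to a different set of features.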