When calculating direct features use default value if parent missing #682

Closed
CJStadler opened this issue Jul 23, 2019 · 5 comments · Fixed by #1312
Labels
good first issue Good for newcomers

Comments

@CJStadler
Contributor

For example, suppose there is a relationship transactions.session_id -> sessions.id and we are calculating the direct feature sessions.SUM(transactions.value) on transactions. Any rows for which there is no corresponding session should be given the default value of 0 instead of NaN.

Of course this should not normally occur, but when it does it seems more reasonable to use the default_value.

DirectFeature.default_value is already implemented. We should be able to use the same logic that we do for aggregation features.
https://github.com/Featuretools/featuretools/blob/6f4ffd7ef7ea42f95dbaf3892615717a521299db/featuretools/computational_backends/feature_set_calculator.py#L611-L618
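The idea can be sketched in plain pandas, independent of the featuretools internals. This is only an illustration: the helper name `fill_missing_parent_values` and the `defaults` mapping are hypothetical stand-ins for each feature's default_value, and the linked calculator code may do this differently.

```python
import pandas as pd


def fill_missing_parent_values(frame, defaults):
    """Replace NaN in each listed column with that column's default.

    `defaults` maps column name -> default value. It is a hypothetical
    stand-in for looking up each feature's default_value.
    """
    # DataFrame.fillna accepts a dict and fills each column separately.
    return frame.fillna(value=defaults)


feature_name = "sessions.SUM(transactions.value)"
child = pd.DataFrame(
    {feature_name: [2.0, 2.0, 1.0, float("nan")]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

# The row whose parent is missing receives the default instead of NaN.
filled = fill_missing_parent_values(child, {feature_name: 0})
```

Filling per column keeps the defaults of different features independent, so a feature whose default is NaN would stay untouched.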

@kmax12 added the good first issue label Jul 23, 2019
@scorpioluck20

Is there any sample code that reproduces this problem? I would like to confirm my understanding of it.

For example, if transactions.value is [1,2,3,4,float('nan')], SUM(transactions.value) should be 10.0 (ignoring nan). Am I correct?
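For reference, here is how plain pandas handles that case. This is a sketch of `Series.sum` only, on the assumption that the Sum primitive behaves similarly; it says nothing about child rows whose parent is missing, which is the case this issue is about.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, float("nan")])

# Series.sum skips NaN by default (skipna=True).
total = values.sum()

# With skipna=False the NaN propagates into the result.
total_with_nan = values.sum(skipna=False)
```

Here `total` is 10.0, while `total_with_nan` is NaN.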

@kmax12
Contributor

kmax12 commented Jul 30, 2019

Here is code that reproduces the problem:

import pandas as pd
import featuretools as ft
from featuretools.primitives import Sum

transactions = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "session_id": ["a", "a", "b", "c"],
    "value": [1, 1, 1, 1]
})

sessions = pd.DataFrame({
    "id": ["a", "b"]
})

es = ft.EntitySet()
es.entity_from_dataframe(entity_id="transactions",
                         dataframe=transactions,
                         index="id")
es.entity_from_dataframe(entity_id="sessions",
                         dataframe=sessions,
                         index="id")

es.add_relationship(ft.Relationship(es["sessions"]["id"], es["transactions"]["session_id"]))
es

sum_features = ft.Feature(es["transactions"]["value"], parent_entity=es["sessions"], primitive=Sum)
sessions_sum = ft.Feature(sum_features, entity=es["transactions"])

fm = ft.calculate_feature_matrix(features=[sessions_sum], entityset=es)
fm

The output of fm is:

    sessions.SUM(transactions.value)
id                                  
1                                2.0
2                                2.0
3                                1.0
4                                NaN

The row with id 4 should be 0 instead of NaN, because its session_id "c" has no matching row in sessions.
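The same data can be walked through in plain pandas to show where the NaN comes from and what the expected result looks like. This is a sketch of the behavior only, not of how featuretools computes the feature; the variable names are illustrative.

```python
import pandas as pd

transactions = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "session_id": ["a", "a", "b", "c"],
    "value": [1, 1, 1, 1],
})
sessions = pd.DataFrame({"id": ["a", "b"]})

# Aggregation on the parent: SUM(transactions.value) per existing session.
session_sum = (
    transactions.groupby("session_id")["value"].sum()
    .reindex(sessions["id"])  # keep only sessions that actually exist
)

# Direct feature on the child: look up the parent's value for each row.
# Session "c" has no parent row, so transaction 4 comes back as NaN.
direct = transactions.set_index("id")["session_id"].map(session_sum)

# Proposed behavior: fall back to the feature's default value (0 for a sum).
with_default = direct.fillna(0)
```

`direct` reproduces the NaN for id 4, and `with_default` gives the 0 requested in this issue.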

@seriallazer
Contributor

If no one is working on this, may I take this up?

@rwedge
Contributor

rwedge commented Nov 3, 2020

@seriallazer sure!

@seriallazer
Contributor

I've created a pull request for the change: #1217.
Could someone please review the changes?
Thanks!

5 participants