Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fields difference #167

Merged
merged 13 commits into from
Oct 7, 2019
Merged

Fields difference #167

merged 13 commits into from
Oct 7, 2019

Conversation

manycoding
Copy link
Contributor

@manycoding manycoding commented Sep 12, 2019

Implements #158

I wrote a feature which we already had with price rules, which mimics set intersection and difference. It still reads product_url_field & name_field tags, but the output is different:

How it looks now:
Screenshot 2019-09-13 at 12 56 41

To compare the difference (notice the change in the output format) and timing - https://jupyter.scrapinghub.com/user/u/lab/tree/shared/Experiments/Arche/PRs/difference.ipynb

  1. Should we lower().strip() strings before comparing? Should we normalize numbers (e.g. 5 = 5.0). In this rule, should we care about particular format (case matters) or more about the meaning?

@codecov
Copy link

codecov bot commented Sep 12, 2019

Codecov Report

Merging #167 into master will decrease coverage by 0.01%.
The diff coverage is 76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   81.79%   81.78%   -0.02%     
==========================================
  Files          23       24       +1     
  Lines        1637     1658      +21     
  Branches      291      293       +2     
==========================================
+ Hits         1339     1356      +17     
- Misses        246      252       +6     
+ Partials       52       50       -2
Impacted Files Coverage Δ
src/arche/arche.py 85.1% <100%> (+0.21%) ⬆️
src/arche/rules/price.py 85.21% <100%> (+3.07%) ⬆️
src/arche/readers/schema.py 100% <100%> (ø) ⬆️
src/arche/__init__.py 100% <100%> (ø) ⬆️
src/arche/rules/compare.py 72.09% <72.09%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67a24b8...10897ce. Read the comment docs.

@manycoding manycoding marked this pull request as ready for review September 13, 2019 20:26
Copy link

@ejulio ejulio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The processing part is quite tricky.
I think we can add it, but not by default. Maybe a flag stating the behavior.
If the data is upper case in set and lower case in the other, it is a difference and the user must know about it (we should output it).
Then, if the user says it is the true behavior, he can set the flag to ignore these values

new = source[~(source.isin(target))]
missing = target[~(target.isin(source))]
except SystemError:
source = source.apply(str)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not astype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, astype is a bit faster

if len(missing) == 0:
continue

if len(missing) < 6:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 6 here is quite magical.
maybe a constant to make its intent clear

@simoess
Copy link
Contributor

simoess commented Sep 18, 2019

  1. Should we lower().strip() strings before comparing? Should we normalize numbers (e.g. 5 = 5.0). In this rule, should we care about particular format (case matters) or more about the meaning?

I think that would be nice to be add an option to be able to apply a transformation to some column values. But it shouldn't be the default rule behavior to normalize strings/numbers.

Also, probably it would be useful to be able to access the values that are in the intersection/difference.

@manycoding
Copy link
Contributor Author

I'll leave normalization for the other pr.

Also, probably it would be useful to be able to access the values that are in the intersection/difference.

One can extract keys and then values:
a.target_items.df.loc[list(res.messages[Level.ERROR][0].errors.values())[0]]

But that's not very conveniet. In what form would you like to see values? There's an option to shorten it, I think it could be:

  • missing values - [value1, ...., value20]
  • missing value1 - [key1, ...., keyx]
    ... missing valuex - [key1, ...., keyj]
  • or like this with pandas loc

@simoess
Copy link
Contributor

simoess commented Sep 19, 2019

I'll leave normalization for the other pr.

Also, probably it would be useful to be able to access the values that are in the intersection/difference.

One can extract keys and then values:
a.target_items.df.loc[list(res.messages[Level.ERROR][0].errors.values())[0]]

But that's not very conveniet. In what form would you like to see values? There's an option to shorten it, I think it could be:

  • missing values - [value1, ...., value20]
  • missing value1 - [key1, ...., keyx]
    ... missing valuex - [key1, ...., keyj]
  • or like this with pandas loc

The first option. It's easy to manipulate the results after.
.missing_values -> [value1, ..., value20]

Also I think that having all missing items on a df would be useful(e.g checking urls)
.missing_values_items -> pandas df with the complete item information

@manycoding
Copy link
Contributor Author

Also I think that having all missing items on a df would be useful(e.g checking urls)
.missing_values_items -> pandas df with the complete it

Can you give an example? Do you mean we should find added\dropped values between job by default or that we should have a feature which does that?

@simoess
Copy link
Contributor

simoess commented Sep 19, 2019

Also I think that having all missing items on a df would be useful(e.g checking urls)
.missing_values_items -> pandas df with the complete it

Can you give an example? Do you mean we should find added\dropped values between job by default or that we should have a feature which does that?

The rule could return the items that are in the set difference/intersection between two jobs as a pandas df, or an easy way to access these values. Sometimes missing values needs to be further investigated, to check if they really should be missing.

@manycoding
Copy link
Contributor Author

The rule could return the items that are in the set difference/intersection between two jobs as a pandas df, or an easy way to access these values. Sometimes missing values needs to be further investigated, to check if they really should be missing.

Got it, I created #169. It differs from this feature as here we care about values only, and there we care if any items changed.

@codecov-io
Copy link

codecov-io commented Sep 26, 2019

Codecov Report

Merging #167 into master will decrease coverage by 0.17%.
The diff coverage is 80.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   81.76%   81.59%   -0.18%     
==========================================
  Files          23       24       +1     
  Lines        1634     1646      +12     
  Branches      290      289       -1     
==========================================
+ Hits         1336     1343       +7     
- Misses        246      252       +6     
+ Partials       52       51       -1
Impacted Files Coverage Δ
src/arche/rules/result.py 86.52% <100%> (-1.79%) ⬇️
src/arche/tools/api.py 64.64% <100%> (ø) ⬆️
src/arche/rules/others.py 100% <100%> (ø) ⬆️
src/arche/rules/category.py 100% <100%> (ø) ⬆️
src/arche/arche.py 85.1% <100%> (+0.21%) ⬆️
src/arche/__init__.py 100% <100%> (ø) ⬆️
src/arche/tools/schema.py 87.95% <100%> (ø) ⬆️
src/arche/readers/items.py 88.07% <100%> (ø) ⬆️
src/arche/rules/compare.py 74.46% <74.46%> (ø)
src/arche/rules/price.py 85.34% <75%> (+3.07%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0538719...faa6263. Read the comment docs.

@manycoding
Copy link
Contributor Author

manycoding commented Sep 26, 2019

I added errors by introducing more_stats field, also normalization flag.

res = arche.rules.compare.fields(
    a.source_items.df, a.target_items.df, ["product_name", "url"]
)
res.more_stats["product_name"]["new"]
>>>https://app.scrapinghub.com/p/156349/93/22/item/12      Down Under Naturals Coconut Splash Weightless ...
https://app.scrapinghub.com/p/156349/93/22/item/13      Down Under Naturals Coconut Splash Weightless ...
https://app.scrapinghub.com/p/156349/93/22/item/14      L'Oreal Paris Smooth Intense Shampoo  for Friz...
                                                                              ...                        
             Garnier Fructis, Damage Eraser Shampoo
https://app.scrapinghub.com/p/156349/93/22/item/9993    EverCurl Hydracharge Shampoo by LOreal for Uni...
Length: 4390, dtype: object

So for this rule more_stats is {field: {new\same\missing: pd.Series}}

I also simplified tests by writing an assert and removing unneccesary logic from the code, there're lots of changes.

Please review @simoess @ejulio

Copy link
Contributor

@simoess simoess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I only recommend to improve the documentation of the function "fields" of src/arche/rules/compare.py, it's unclear what err_thr does.

else:
msg = f"{', '.join(missing.unique()[:5].astype(str))}..."
msg = f"{msg} `{field}s` are missing"
if len(missing) / len(target_df) >= err_thr:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear what err_thr does. I think it would be a good idea to document what the variable does on the doctring of the function;

@manycoding manycoding merged commit 77075db into master Oct 7, 2019
@manycoding manycoding deleted the fields_difference branch October 7, 2019 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants