Fields difference #167

manycoding · 2019-09-12T15:22:37Z

Implements #158

I wrote a feature which we already had with price rules, which mimics set intersection and difference. It still reads product_url_field & name_field tags, but the output is different:

How it looks now:

To compare the difference (notice the change in the output format) and timing - https://jupyter.scrapinghub.com/user/u/lab/tree/shared/Experiments/Arche/PRs/difference.ipynb

Should we lower().strip() strings before comparing? Should we normalize numbers (e.g. 5 = 5.0). In this rule, should we care about particular format (case matters) or more about the meaning?

codecov · 2019-09-12T18:04:55Z

Codecov Report

Merging #167 into master will decrease coverage by 0.01%.
The diff coverage is 76%.

@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   81.79%   81.78%   -0.02%     
==========================================
  Files          23       24       +1     
  Lines        1637     1658      +21     
  Branches      291      293       +2     
==========================================
+ Hits         1339     1356      +17     
- Misses        246      252       +6     
+ Partials       52       50       -2

Impacted Files	Coverage Δ
src/arche/arche.py	`85.1% <100%> (+0.21%)`	⬆️
src/arche/rules/price.py	`85.21% <100%> (+3.07%)`	⬆️
src/arche/readers/schema.py	`100% <100%> (ø)`	⬆️
src/arche/__init__.py	`100% <100%> (ø)`	⬆️
src/arche/rules/compare.py	`72.09% <72.09%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67a24b8...10897ce. Read the comment docs.

ejulio

The processing part is quite tricky.
I think we can add it, but not by default. Maybe a flag stating the behavior.
If the data is upper case in set and lower case in the other, it is a difference and the user must know about it (we should output it).
Then, if the user says it is the true behavior, he can set the flag to ignore these values

ejulio · 2019-09-17T12:38:48Z

src/arche/rules/compare.py

+            new = source[~(source.isin(target))]
+            missing = target[~(target.isin(source))]
+        except SystemError:
+            source = source.apply(str)


why not astype?

Right, astype is a bit faster

ejulio · 2019-09-17T12:39:36Z

src/arche/rules/compare.py

+        if len(missing) == 0:
+            continue
+
+        if len(missing) < 6:


This 6 here is quite magical.
maybe a constant to make its intent clear

into fields_difference

simoess · 2019-09-18T14:55:28Z

Should we lower().strip() strings before comparing? Should we normalize numbers (e.g. 5 = 5.0). In this rule, should we care about particular format (case matters) or more about the meaning?

I think that would be nice to be add an option to be able to apply a transformation to some column values. But it shouldn't be the default rule behavior to normalize strings/numbers.

Also, probably it would be useful to be able to access the values that are in the intersection/difference.

manycoding · 2019-09-18T15:54:52Z

I'll leave normalization for the other pr.

Also, probably it would be useful to be able to access the values that are in the intersection/difference.

One can extract keys and then values:
a.target_items.df.loc[list(res.messages[Level.ERROR][0].errors.values())[0]]

But that's not very conveniet. In what form would you like to see values? There's an option to shorten it, I think it could be:

missing values - [value1, ...., value20]
missing value1 - [key1, ...., keyx]
... missing valuex - [key1, ...., keyj]
or like this with pandas loc

simoess · 2019-09-19T16:13:59Z

I'll leave normalization for the other pr.

Also, probably it would be useful to be able to access the values that are in the intersection/difference.

One can extract keys and then values:
a.target_items.df.loc[list(res.messages[Level.ERROR][0].errors.values())[0]]

But that's not very conveniet. In what form would you like to see values? There's an option to shorten it, I think it could be:

missing values - [value1, ...., value20]

missing value1 - [key1, ...., keyx]
... missing valuex - [key1, ...., keyj]

or like this with pandas loc

The first option. It's easy to manipulate the results after.
.missing_values -> [value1, ..., value20]

Also I think that having all missing items on a df would be useful(e.g checking urls)
.missing_values_items -> pandas df with the complete item information

manycoding · 2019-09-19T16:22:03Z

Also I think that having all missing items on a df would be useful(e.g checking urls)
.missing_values_items -> pandas df with the complete it

Can you give an example? Do you mean we should find added\dropped values between job by default or that we should have a feature which does that?

simoess · 2019-09-19T16:33:36Z

Also I think that having all missing items on a df would be useful(e.g checking urls)
.missing_values_items -> pandas df with the complete it

Can you give an example? Do you mean we should find added\dropped values between job by default or that we should have a feature which does that?

The rule could return the items that are in the set difference/intersection between two jobs as a pandas df, or an easy way to access these values. Sometimes missing values needs to be further investigated, to check if they really should be missing.

manycoding · 2019-09-19T16:53:49Z

The rule could return the items that are in the set difference/intersection between two jobs as a pandas df, or an easy way to access these values. Sometimes missing values needs to be further investigated, to check if they really should be missing.

Got it, I created #169. It differs from this feature as here we care about values only, and there we care if any items changed.

…h assert

codecov-io · 2019-09-26T18:48:32Z

Codecov Report

Merging #167 into master will decrease coverage by 0.17%.
The diff coverage is 80.55%.

@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   81.76%   81.59%   -0.18%     
==========================================
  Files          23       24       +1     
  Lines        1634     1646      +12     
  Branches      290      289       -1     
==========================================
+ Hits         1336     1343       +7     
- Misses        246      252       +6     
+ Partials       52       51       -1

Impacted Files	Coverage Δ
src/arche/rules/result.py	`86.52% <100%> (-1.79%)`	⬇️
src/arche/tools/api.py	`64.64% <100%> (ø)`	⬆️
src/arche/rules/others.py	`100% <100%> (ø)`	⬆️
src/arche/rules/category.py	`100% <100%> (ø)`	⬆️
src/arche/arche.py	`85.1% <100%> (+0.21%)`	⬆️
src/arche/__init__.py	`100% <100%> (ø)`	⬆️
src/arche/tools/schema.py	`87.95% <100%> (ø)`	⬆️
src/arche/readers/items.py	`88.07% <100%> (ø)`	⬆️
src/arche/rules/compare.py	`74.46% <74.46%> (ø)`
src/arche/rules/price.py	`85.34% <75%> (+3.07%)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0538719...faa6263. Read the comment docs.

manycoding · 2019-09-26T18:49:00Z

I added errors by introducing more_stats field, also normalization flag.

res = arche.rules.compare.fields(
    a.source_items.df, a.target_items.df, ["product_name", "url"]
)
res.more_stats["product_name"]["new"]
>>>https://app.scrapinghub.com/p/156349/93/22/item/12      Down Under Naturals Coconut Splash Weightless ...
https://app.scrapinghub.com/p/156349/93/22/item/13      Down Under Naturals Coconut Splash Weightless ...
https://app.scrapinghub.com/p/156349/93/22/item/14      L'Oreal Paris Smooth Intense Shampoo  for Friz...
                                                                              ...                        
             Garnier Fructis, Damage Eraser Shampoo
https://app.scrapinghub.com/p/156349/93/22/item/9993    EverCurl Hydracharge Shampoo by LOreal for Uni...
Length: 4390, dtype: object

So for this rule more_stats is {field: {new\same\missing: pd.Series}}

I also simplified tests by writing an assert and removing unneccesary logic from the code, there're lots of changes.

Please review @simoess @ejulio

tests/conftest.py

simoess

Looks good. I only recommend to improve the documentation of the function "fields" of src/arche/rules/compare.py, it's unclear what err_thr does.

simoess · 2019-09-30T04:43:43Z

src/arche/rules/compare.py

+        else:
+            msg = f"{', '.join(missing.unique()[:5].astype(str))}..."
+        msg = f"{msg} `{field}s` are missing"
+        if len(missing) / len(target_df) >= err_thr:


It's unclear what err_thr does. I think it would be a good idea to document what the variable does on the doctring of the function;

manycoding added 3 commits September 12, 2019 12:21

Add compare_fields

e336d33

Ignore * warnings

e89ac5a

Refactor price

46a6ab7

manycoding added 2 commits September 13, 2019 17:25

Support nested structures

6f2b100

Merge branch 'master' into fields_difference

6a24808

manycoding requested review from ejulio, andersonberg and simoess September 13, 2019 20:25

manycoding marked this pull request as ready for review September 13, 2019 20:26

ejulio reviewed Sep 17, 2019

View reviewed changes

manycoding added 2 commits September 18, 2019 11:46

Wee update

4df498a

Merge branch 'fields_difference' of https://github.com/scrapinghub/arche

e88618d

into fields_difference

manycoding mentioned this pull request Sep 19, 2019

Compare items between job #169

Open

manycoding added 4 commits September 19, 2019 15:03

Add normalization

10897ce

Add more_stats to easily access all data, replace Result class eq wit…

da3b312

…h assert

Update rules to new assert

22cd400

Fix tqdm warning

96e4aa6

manycoding commented Sep 26, 2019

View reviewed changes

tests/conftest.py Show resolved Hide resolved

simoess approved these changes Sep 30, 2019

View reviewed changes

manycoding added 2 commits September 30, 2019 12:42

Add line to changes, describe err_thr

933d05e

Merge with master

faa6263

manycoding merged commit 77075db into master Oct 7, 2019

manycoding deleted the fields_difference branch October 7, 2019 13:05

manycoding mentioned this pull request Oct 22, 2019

Output keys for compare_fields_counts #8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fields difference #167

Fields difference #167

manycoding commented Sep 12, 2019 •

edited

Loading

codecov bot commented Sep 12, 2019 •

edited

Loading

ejulio left a comment

ejulio Sep 17, 2019

manycoding Sep 18, 2019

ejulio Sep 17, 2019

simoess commented Sep 18, 2019

manycoding commented Sep 18, 2019

simoess commented Sep 19, 2019

manycoding commented Sep 19, 2019

simoess commented Sep 19, 2019

manycoding commented Sep 19, 2019

codecov-io commented Sep 26, 2019 •

edited by codecov bot

Loading

manycoding commented Sep 26, 2019 •

edited

Loading

simoess left a comment

simoess Sep 30, 2019

Fields difference #167

Fields difference #167

Conversation

manycoding commented Sep 12, 2019 • edited Loading

codecov bot commented Sep 12, 2019 • edited Loading

Codecov Report

ejulio left a comment

Choose a reason for hiding this comment

ejulio Sep 17, 2019

Choose a reason for hiding this comment

manycoding Sep 18, 2019

Choose a reason for hiding this comment

ejulio Sep 17, 2019

Choose a reason for hiding this comment

simoess commented Sep 18, 2019

manycoding commented Sep 18, 2019

simoess commented Sep 19, 2019

manycoding commented Sep 19, 2019

simoess commented Sep 19, 2019

manycoding commented Sep 19, 2019

codecov-io commented Sep 26, 2019 • edited by codecov bot Loading

Codecov Report

manycoding commented Sep 26, 2019 • edited Loading

simoess left a comment

Choose a reason for hiding this comment

simoess Sep 30, 2019

Choose a reason for hiding this comment

manycoding commented Sep 12, 2019 •

edited

Loading

codecov bot commented Sep 12, 2019 •

edited

Loading

codecov-io commented Sep 26, 2019 •

edited by codecov bot

Loading

manycoding commented Sep 26, 2019 •

edited

Loading