Fields difference #167
Codecov Report
@@ Coverage Diff @@
## master #167 +/- ##
==========================================
- Coverage 81.79% 81.78% -0.02%
==========================================
Files 23 24 +1
Lines 1637 1658 +21
Branches 291 293 +2
==========================================
+ Hits 1339 1356 +17
- Misses 246 252 +6
+ Partials 52 50 -2
The processing part is quite tricky.
I think we can add it, but not by default. Maybe a flag stating the behavior.
If the data is upper case in one set and lower case in the other, it is a difference and the user must know about it (we should output it).
Then, if the user confirms that this is the expected behavior, they can set the flag to ignore these values.
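A minimal sketch of what such an opt-in flag could look like. The function name `missing_values` and the flag `ignore_case` are hypothetical and not part of this PR; the default keeps the strict comparison, so case differences are still reported.

```python
import pandas as pd


def missing_values(
    source: pd.Series, target: pd.Series, ignore_case: bool = False
) -> pd.Series:
    """Return the values of `target` that are absent from `source`.

    `ignore_case` is a hypothetical opt-in flag: by default values are
    compared exactly as they are, so "Apple" and "apple" differ.
    """
    if ignore_case:
        # Compare on normalized keys, but report the original target values.
        source_keys = source.astype(str).str.lower().str.strip()
        target_keys = target.astype(str).str.lower().str.strip()
        return target[~target_keys.isin(source_keys)]
    return target[~target.isin(source)]


# Made-up data for illustration.
source = pd.Series(["Apple", "pear"])
target = pd.Series(["apple", "pear", "kiwi"])

strict = missing_values(source, target)
relaxed = missing_values(source, target, ignore_case=True)
```

With the flag off, `apple` shows up as missing next to `kiwi`; with the flag on, only `kiwi` remains.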
src/arche/rules/compare.py
Outdated
    new = source[~(source.isin(target))]
    missing = target[~(target.isin(source))]
except SystemError:
    source = source.apply(str)
Why not `astype`?
Right, astype is a bit faster
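For reference, a small sketch showing that the two conversions produce the same strings for plain mixed-type values. The data is made up, and no timing claim is checked here.

```python
import pandas as pd

# Mixed-type values of the kind that might need the string fallback (made-up data).
source = pd.Series([1, 2.5, "a"], dtype=object)

via_apply = source.apply(str)    # calls str() on each element
via_astype = source.astype(str)  # casts the whole Series to strings

# Both approaches should yield the same string values.
same_values = via_apply.tolist() == via_astype.tolist()
```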
src/arche/rules/compare.py
Outdated
if len(missing) == 0:
    continue

if len(missing) < 6:
This `6` here is quite magical. Maybe use a constant to make its intent clear.
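One way the suggestion could look, as a sketch only. The constant name `MAX_VALUES_SHOWN` and the helper `describe_missing` are hypothetical, and the truncation logic is simplified compared to the code under review.

```python
import pandas as pd

# Hypothetical constant replacing the bare number in the message-building code.
MAX_VALUES_SHOWN = 5


def describe_missing(missing: pd.Series, field: str) -> str:
    """Build a short message listing at most MAX_VALUES_SHOWN missing values."""
    unique = missing.unique()
    shown = ", ".join(str(value) for value in unique[:MAX_VALUES_SHOWN])
    if len(unique) > MAX_VALUES_SHOWN:
        # Signal that the list was cut off.
        shown = f"{shown}..."
    return f"{shown} `{field}s` are missing"


short_msg = describe_missing(pd.Series(["a", "b"]), "name")
long_msg = describe_missing(pd.Series(list("abcdefg")), "name")
```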
I think it would be nice to add an option to apply a transformation to some column values. But normalizing strings/numbers shouldn't be the default rule behavior. Also, it would probably be useful to be able to access the values that are in the intersection/difference.
I'll leave normalization for another PR.
One can extract keys and then values, but that's not very convenient. In what form would you like to see the values? There's an option to shorten it; I think it could be:
The first option. It's easy to manipulate the results afterwards. Also, I think that having all missing items in a df would be useful (e.g. for checking URLs).
Can you give an example? Do you mean we should find added/dropped values between jobs by default, or that we should have a feature which does that?
The rule could return the items that are in the set difference/intersection between two jobs as a pandas df, or provide an easy way to access these values. Sometimes missing values need to be investigated further, to check whether they really should be missing.
Got it, I created #169. It differs from this feature: here we care about values only, and there we care whether any items changed.
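The idea discussed in this thread, returning the differing items as DataFrames so they can be inspected further, could be sketched roughly like this. The helper `difference_frames` and the sample data are hypothetical; this is not the implementation in this PR or in #169.

```python
import pandas as pd


def difference_frames(source_df: pd.DataFrame, target_df: pd.DataFrame, field: str):
    """Return (new, missing) rows based on the values of a single field.

    new: rows of `source_df` whose `field` value is not in `target_df`.
    missing: rows of `target_df` whose `field` value is not in `source_df`.
    """
    new = source_df[~source_df[field].isin(target_df[field])]
    missing = target_df[~target_df[field].isin(source_df[field])]
    return new, missing


# Made-up items standing in for two jobs.
source_df = pd.DataFrame({"url": ["a", "b"], "name": ["A", "B"]})
target_df = pd.DataFrame({"url": ["b", "c"], "name": ["B", "C"]})

new, missing = difference_frames(source_df, target_df, "url")
```

Because whole rows are returned, the other columns of a missing item (here `name`) stay available for follow-up checks.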
Codecov Report
@@ Coverage Diff @@
## master #167 +/- ##
==========================================
- Coverage 81.76% 81.59% -0.18%
==========================================
Files 23 24 +1
Lines 1634 1646 +12
Branches 290 289 -1
==========================================
+ Hits 1336 1343 +7
- Misses 246 252 +6
+ Partials 52 51 -1
I added errors by introducing
For this rule I also simplified the tests by writing an assert and removed unnecessary logic from the code, so there are lots of changes.
Looks good. I only recommend improving the documentation of the function `fields` in src/arche/rules/compare.py; it's unclear what `err_thr` does.
else:
    msg = f"{', '.join(missing.unique()[:5].astype(str))}..."
msg = f"{msg} `{field}s` are missing"
if len(missing) / len(target_df) >= err_thr:
It's unclear what `err_thr` does. I think it would be a good idea to document what the variable does in the docstring of the function.
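A sketch of how the threshold could be documented, based only on the comparison visible in the excerpt above. The helper `missing_severity`, the `0.25` default, and the "warning below the threshold" behavior are assumptions, not the actual `fields` implementation.

```python
def missing_severity(missing_count: int, target_count: int, err_thr: float = 0.25) -> str:
    """Classify how serious a set of missing values is.

    Args:
        missing_count: number of target values not found in the source.
        target_count: total number of target items; must be positive.
        err_thr: share of missing values (between 0 and 1) at or above which
            the result is an error; below it, a warning is assumed.
            The 0.25 default is made up for this example.
    """
    if target_count <= 0:
        raise ValueError("target_count must be positive")
    return "error" if missing_count / target_count >= err_thr else "warning"


high = missing_severity(3, 10)  # 30% of target values are missing
low = missing_severity(1, 10)   # 10% of target values are missing
```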
Implements #158
I wrote a feature which we already had with price rules, which mimics set intersection and difference. It still reads the `product_url_field` and `name_field` tags, but the output is different. How it looks now:
To compare the difference (notice the change in the output format) and timing - https://jupyter.scrapinghub.com/user/u/lab/tree/shared/Experiments/Arche/PRs/difference.ipynb
Should we `lower().strip()` strings before comparing? Should we normalize numbers (e.g. 5 = 5.0)? In this rule, should we care about the particular format (case matters) or more about the meaning?
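To make the number question concrete, here is a minimal sketch of what normalizing numeric-looking values before comparison might mean. The helper `normalize_value` is hypothetical and only illustrates the question; it is not part of this PR.

```python
import pandas as pd


def normalize_value(value):
    """Turn numeric-looking values into floats; leave everything else unchanged."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


# Made-up data: the same prices stored as strings in one job and as numbers in the other.
source = pd.Series(["5", "7.50"])
target = pd.Series([5.0, 7.5, 9.0])

source_keys = source.map(normalize_value)
target_keys = target.map(normalize_value)

# Only values that differ in meaning, not in format, are reported.
missing = target[~target_keys.isin(source_keys)]
```

Without the normalization step, `"5"` and `5.0` would be treated as different values.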