-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Amazon] Amazon import source for amazon.de #144
Conversation
@@ -54,7 +54,7 @@ | |||
} | |||
], | |||
"pretax_adjustments": [], | |||
"tax": [], | |||
"tax": null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this, since otherwise the code (also the original?!) throws an exception here:
if invoice.tax is not None and invoice.tax.number != ZERO: |
so I set invoice.tax=None
instead of =[]
for digital invoices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change the code instead, so all the fields in the json follow the same convention (field = []
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, I changed the code in amazon.py
to handle []
Perhaps you can keep the parsing functions as free functions, but pass in the locale object as a parameter --- that way the diff would be much smaller? |
Also, locales are commonly named like "en_US" or "de_DE". You are just using the language, but given that you are including such things as whether the tax is included, it should probably be a full locale. |
very valid points, I really appreciate the feedback! I will change the code accordingly! |
…digital invoices (de_DE)
# Conflicts: # beancount_import/source/amazon_invoice_test.py
@@ -30,12 +30,18 @@ def parse_amount(x, assumed_currency=None): | |||
if not x: | |||
return None | |||
sign, amount_str = parse_possible_negative(x) | |||
m = re.fullmatch(r'(?:[(][^)]+[)])?\s*([\$€£])?((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?', amount_str) | |||
m = re.fullmatch(r'(?:[(][^)]+[)])?\s*([\$€£]|[A-Z]{3})?\s*((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?', amount_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow to specify currency with prepended three letters, e.g. EUR 20.00
regular_estimated_tax = 'Estimated tax to be collected:' | ||
regular_order_placed=r'(?:Subscribe and Save )?Order Placed:\s+([^\s]+ \d+, \d{4})' | ||
regular_order_id=r'.*Order ([0-9\-]+)' | ||
gift_card='Gift Cards' # not confirmed yet! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how gift cards are handled on .com
, these are just guesses inferred from .de
. There (electronic) giftcards are handled differently than regular shipments. They have an own "shipment table".
Everything is handled in the new parse_gift_cards
method.
If there is no table node with text matchinggift_card
, the code doesn't do anything. (probably the case for en_US
) I decided not to add an additional boolean flag to enable/disable the parse_gift_cards
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I also think we'd better not guess the correct string to match, but fail gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is not found, it does not even fail. But I made the attribute optional now.
I also added a check if there are no items found at all for the regular order or digital order (no shipments, no gift cards), this should indicate that there is something wrong with the parsing...
currency='EUR' | ||
items_subtotal='Zwischensumme:' | ||
total_before_tax='Summe ohne MwSt.:' | ||
# most of translations still missing ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, there is no way to get all these translations (if they exist at all). I parsed all my amazon transactions from the last years and added every case that occurred. Probably there will be minor PRs in the future adding additional translations...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens if a missing translated string is needed? Presumably an exception because the base class is abstract.
Do you see a way for us to handle this gracefully?
We'd ideally notify the user of the fact that we cannot parse their invoice, not crash, and even suggest they open an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exceptions should not happen. The base class is abstract but all attributes are (and have to be!) set in the derived class. Only alternatives may be missing which will not lead to exceptions, but just missed cases.
items_subtotal = parse_amount( | ||
get_field_in_table(shipment_table, r'Item\(s\) Subtotal:')) | ||
expected_items_subtotal = reduce_amounts( | ||
beancount.core.amount.mul(x.price, D(x.quantity)) for x in items) | ||
if (items_subtotal is not None and | ||
expected_items_subtotal != items_subtotal): | ||
errors.append( | ||
'expected items subtotal is %r, but parsed value is %r' % | ||
(expected_items_subtotal, items_subtotal)) | ||
|
||
output_fields = dict() | ||
output_fields['pretax_adjustments'] = get_adjustments_in_table( | ||
shipment_table, pretax_adjustment_fields_pattern) | ||
output_fields['posttax_adjustments'] = get_adjustments_in_table( | ||
shipment_table, posttax_adjustment_fields_pattern) | ||
pretax_parts = [items_subtotal or expected_items_subtotal] + [ | ||
a.amount for a in output_fields['pretax_adjustments'] | ||
] | ||
total_before_tax = parse_amount( | ||
get_field_in_table(shipment_table, 'Total before tax:')) | ||
expected_total_before_tax = reduce_amounts(pretax_parts) | ||
if total_before_tax is None: | ||
total_before_tax = expected_total_before_tax | ||
elif expected_total_before_tax != total_before_tax: | ||
errors.append( | ||
'expected total before tax is %s, but parsed value is %s' % | ||
(expected_total_before_tax, total_before_tax)) | ||
|
||
sales_tax = get_adjustments_in_table(shipment_table, 'Sales Tax:') | ||
|
||
posttax_parts = ( | ||
[total_before_tax] + [a.amount for a in sales_tax] + | ||
[a.amount for a in output_fields['posttax_adjustments']]) | ||
total = parse_amount( | ||
get_field_in_table(shipment_table, 'Total for This Shipment:')) | ||
expected_total = reduce_amounts(posttax_parts) | ||
if total is None: | ||
total = expected_total | ||
elif expected_total != total: | ||
errors.append('expected total is %s, but parsed value is %s' % | ||
(expected_total, total)) | ||
|
||
shipments.append( | ||
Shipment( | ||
shipped_date=shipped_date, | ||
items=items, | ||
items_subtotal=items_subtotal, | ||
total_before_tax=total_before_tax, | ||
tax=sales_tax, | ||
total=total, | ||
errors=errors, | ||
**output_fields)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved all this in separate parse_shipment_payments
method, since the same code is used for parse_gift_cards
as well
Sooo, I think this code is finally ready for review now. I tried to keep the changes to a minimum to reduce the diff, but unfortunately I had to cover some additional cases (special gift cards order table) which resulted in splitting some existing methods. As said in the above code comments, translations (both ways!) may not be complete yet! So minor PRs may follow along the way as we discover additional edge cases... I ran the code against all amazon transactions I had available (DE and tests) and there were no exceptions thrown. For the last months I thoroughly checked the parsed invoices and the created transactions against my account transactions as well. I added test cases for every invoice scenario I experienced to ensure code compatibility in the future. I think this makes sense when working with multiple languages since otherwise the contributors will only have their own data at hand for testing. As always, I'm thankful for any comments and improvements. I really appreciate the effort you guys take to maintain this (and the |
beancount_import/source/amazon.py
Outdated
@@ -387,7 +395,7 @@ def make_amazon_transaction( | |||
(INVOICE_DESCRIPTION, adjustment.description), | |||
]), | |||
)) | |||
if invoice.tax is not None and invoice.tax.number != ZERO: | |||
if len(invoice.tax)>0 and invoice.tax.number != ZERO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What guarantees invoice.tax is not None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it relates to the above discussion: #144 (comment)
But I cannot recall why I removed the check on None
...
For digital orders, tax=[]
on order level anyway, so no need to check for None
.
For regular orders, tax is set via:
tax = locale.parse_amount(
get_field_in_table(payment_table, locale.regular_estimated_tax))
Which may return None
if the field is not found...
Following up on the discussion above, I would vote for checking on None
in parse_regular_order_invoice
and setting tax=[]
in this case. This ensures that in the resulting JSON "tax": []
. Maybe also logging an error or warning might make sense, since so far all regular invoices contain tax information.
What are your thoughts on this? Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I tried to tackle this and stumbled upon an inconsistency in the current codebase:
According to the typedefs tax: Amount
for Order
and Shipment
. But this is not correct. It is actually tax: Sequence[Adjustment]
for Shipment
. So to have an empty tax
field on shipment level one needs to set tax=[]
, whereas on order level tax=None
, since Amount
can be None.
I don't know why the typedef check does not catch this...
When it comes to digital orders, the current code sets tax=[]
on order level. This is not correct, since Amount
cannot be []
, it must be None
. The code works nevertheless, since tax is always given on shipment level for digital orders and therefore tax on order level is not used anyway.
From the code perspective, I would go with the above corrections. But my dilemma arises when it comes to the JSON output for the tests: Setting tax=None
translates into tax: null
in the JSON. You suggested above that we should stick to the convention field=[]
. I would drop this requirement since it would require additional modifications of the data (only for the test files...).
@Zburatorul are you fine with this? what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a proposal in 4be65b3
- fix
shipment.tax
to typeList[Adjustment]
- fix
order.tax
to typeAmount
- results in
tax: null
in JSON for orders with no tax transaction (tax not given or included in item price)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this massive contribution @moritzj29. Great work!
I don't have any objections to the code structure, but I'd like to discuss the user experience a bit.
Right now I see two kinds of undesirable failure modes:
- The gift card (purchase) matching string is, as you say, a guess. If we're wrong it won't match silently(?) and the user will wonder why the importer didn't add a new transaction.
- Some missing strings in the de_DE locale. If they are ever needed, the failure will not be gentle.
I submit that a good UX in both cases is for the importer 1) not to fail 2) not to offer/identify any transactions 3) warn the user that it cannot handle these cases and suggest submitting an issue.
Does anyone see a better way?
In terms of implementation, my impromptu suggestion is to 1) leave the gift card related string empty 2) catch whatever exception the abstract base class throws when importer tries to use one of those fields 3) print a nice message for the user ideally specifying what field name was missing.
currency='EUR' | ||
items_subtotal='Zwischensumme:' | ||
total_before_tax='Summe ohne MwSt.:' | ||
# most of translations still missing ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens if a missing translated string is needed? Presumably an exception because the base class is abstract.
Do you see a way for us to handle this gracefully?
We'd ideally notify the user of the fact that we cannot parse their invoice, not crash, and even suggest they open an issue.
regular_estimated_tax = 'Estimated tax to be collected:' | ||
regular_order_placed=r'(?:Subscribe and Save )?Order Placed:\s+([^\s]+ \d+, \d{4})' | ||
regular_order_id=r'.*Order ([0-9\-]+)' | ||
gift_card='Gift Cards' # not confirmed yet! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I also think we'd better not guess the correct string to match, but fail gracefully.
Thank you so much for your comments and working through all that code! I agree with you that the UX should be as good as possible and crashing of the importer must be avoided. I will check the impact of the missing translations. Most of them are just additional list entries, so the class attribute itself is available. It just matches less cases. I think for the The I mean, the question boils down to "how to check for elements to be present we don't know yet how to check on". I think the most robust way would be to raise and catch exceptions. Then the importer would fail gracefully on invoice level. As you probably recognized, I added a lot of debug logging statements. Since I found it really difficult to get a grasp on where the code failed when I started porting it. This way one should at least get an idea on where it fails... Also it would be great to add more testdata, also for en_US, e.g. a corresponding test for all these regexes. I will try to add a README or similar with some images indicating which part of the HTML invoice translate into which pattern. This will make it more easy to compare invoice and code and check for missing cases. |
… possible), tax on shipment level is List
I went through the code once again and added missing types, comments and error messages. I think I addressed all of your comments above (see my comments above), at least with a proposal. I think the code is quite robust now. There are lots of debugging statements and error messages whenever possible. And the whole I propose that I squash everything into a single commit, once we are done with the review. There have been quite a lot of commits in the meantime... On the long run, I propose to move the |
Thanks again @moritzj29. |
Hello,
this is a proposal to generalize the amazon import source to allow additional languages. I added support for German files from amazon.de. For other languages it will be required to read out all the language specific strings for the regexes.
The code is working already but probably needs some polishing.
I added lots of debug logging output, I hope this fine. Otherwise you are completely lost finding the translated regexes...
It runs fine on the tests available, so it should be backward compatible.
Sorry that the code diff is so bad! I moved almost all existing methods into a new class, therefore the changed indentation results in every line marked as changed...
I mark this as draft for now, but comments are very welcome!
locale
argument for direct callinglocale
tode_DE
etc.locale
as argument