Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Amazon] Amazon import source for amazon.de #144

Merged
merged 46 commits into from
Mar 28, 2022

Conversation

moritzj29
Copy link
Contributor

@moritzj29 moritzj29 commented Dec 12, 2021

Hello,

this is a proposal to generalize the amazon import source to allow additional languages. I added support for German files from amazon.de. For other languages it will be required to read out all the language specific strings for the regexes.

The code is working already but probably needs some polishing.

I added lots of debug logging output, I hope this fine. Otherwise you are completely lost finding the translated regexes...

It runs fine on the tests available, so it should be backward compatible.

Sorry that the code diff is so bad! I moved almost all existing methods into a new class, therefore the changed indentation results in every line marked as changed...

I mark this as draft for now, but comments are very welcome!

  • add tests for amazon DE
  • same extent of logging in all methods, remove duplicates
  • add locale argument for direct calling
  • rename locale to de_DE etc.
  • minimize diff volume by moving methods outside the class, just pass locale as argument

@@ -54,7 +54,7 @@
}
],
"pretax_adjustments": [],
"tax": [],
"tax": null,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this, since otherwise the code (also the original?!) throws an exception here:

if invoice.tax is not None and invoice.tax.number != ZERO:

so I set invoice.tax=None instead of =[] for digital invoices

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the code instead, so all the fields in the json follow the same convention (field = []).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, I changed the code in amazon.py to handle []

@jbms
Copy link
Owner

jbms commented Dec 15, 2021

Perhaps you can keep the parsing functions as free functions, but pass in the locale object as a parameter --- that way the diff would be much smaller?

@jbms
Copy link
Owner

jbms commented Dec 15, 2021

Also, locales are commonly named like "en_US" or "de_DE". You are just using the language, but given that you are including such things as whether the tax is included, it should probably be a full locale.

@moritzj29
Copy link
Contributor Author

very valid points, I really appreciate the feedback! I will change the code accordingly!

@@ -30,12 +30,18 @@ def parse_amount(x, assumed_currency=None):
if not x:
return None
sign, amount_str = parse_possible_negative(x)
m = re.fullmatch(r'(?:[(][^)]+[)])?\s*([\$€£])?((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?', amount_str)
m = re.fullmatch(r'(?:[(][^)]+[)])?\s*([\$€£]|[A-Z]{3})?\s*((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?', amount_str)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow to specify currency with prepended three letters, e.g. EUR 20.00

regular_estimated_tax = 'Estimated tax to be collected:'
regular_order_placed=r'(?:Subscribe and Save )?Order Placed:\s+([^\s]+ \d+, \d{4})'
regular_order_id=r'.*Order ([0-9\-]+)'
gift_card='Gift Cards' # not confirmed yet!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how gift cards are handled on .com, these are just guesses inferred from .de. There (electronic) giftcards are handled differently than regular shipments. They have an own "shipment table".

Everything is handled in the new parse_gift_cards method.
If there is no table node with text matchinggift_card, the code doesn't do anything. (probably the case for en_US) I decided not to add an additional boolean flag to enable/disable the parse_gift_cards method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I also think we'd better not guess the correct string to match, but fail gracefully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is not found, it does not even fail. But I made the attribute optional now.

I also added a check if there are no items found at all for the regular order or digital order (no shipments, no gift cards), this should indicate that there is something wrong with the parsing...

currency='EUR'
items_subtotal='Zwischensumme:'
total_before_tax='Summe ohne MwSt.:'
# most of translations still missing ...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, there is no way to get all these translations (if they exist at all). I parsed all my amazon transactions from the last years and added every case that occurred. Probably there will be minor PRs in the future adding additional translations...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens if a missing translated string is needed? Presumably an exception because the base class is abstract.

Do you see a way for us to handle this gracefully?
We'd ideally notify the user of the fact that we cannot parse their invoice, not crash, and even suggest they open an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exceptions should not happen. The base class is abstract but all attributes are (and have to be!) set in the derived class. Only alternatives may be missing which will not lead to exceptions, but just missed cases.

Comment on lines -258 to -309
items_subtotal = parse_amount(
get_field_in_table(shipment_table, r'Item\(s\) Subtotal:'))
expected_items_subtotal = reduce_amounts(
beancount.core.amount.mul(x.price, D(x.quantity)) for x in items)
if (items_subtotal is not None and
expected_items_subtotal != items_subtotal):
errors.append(
'expected items subtotal is %r, but parsed value is %r' %
(expected_items_subtotal, items_subtotal))

output_fields = dict()
output_fields['pretax_adjustments'] = get_adjustments_in_table(
shipment_table, pretax_adjustment_fields_pattern)
output_fields['posttax_adjustments'] = get_adjustments_in_table(
shipment_table, posttax_adjustment_fields_pattern)
pretax_parts = [items_subtotal or expected_items_subtotal] + [
a.amount for a in output_fields['pretax_adjustments']
]
total_before_tax = parse_amount(
get_field_in_table(shipment_table, 'Total before tax:'))
expected_total_before_tax = reduce_amounts(pretax_parts)
if total_before_tax is None:
total_before_tax = expected_total_before_tax
elif expected_total_before_tax != total_before_tax:
errors.append(
'expected total before tax is %s, but parsed value is %s' %
(expected_total_before_tax, total_before_tax))

sales_tax = get_adjustments_in_table(shipment_table, 'Sales Tax:')

posttax_parts = (
[total_before_tax] + [a.amount for a in sales_tax] +
[a.amount for a in output_fields['posttax_adjustments']])
total = parse_amount(
get_field_in_table(shipment_table, 'Total for This Shipment:'))
expected_total = reduce_amounts(posttax_parts)
if total is None:
total = expected_total
elif expected_total != total:
errors.append('expected total is %s, but parsed value is %s' %
(expected_total, total))

shipments.append(
Shipment(
shipped_date=shipped_date,
items=items,
items_subtotal=items_subtotal,
total_before_tax=total_before_tax,
tax=sales_tax,
total=total,
errors=errors,
**output_fields))
Copy link
Contributor Author

@moritzj29 moritzj29 Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved all this in separate parse_shipment_payments method, since the same code is used for parse_gift_cards as well

@moritzj29
Copy link
Contributor Author

Sooo, I think this code is finally ready for review now. I tried to keep the changes to a minimum to reduce the diff, but unfortunately I had to cover some additional cases (special gift cards order table) which resulted in splitting some existing methods.

As said in the above code comments, translations (both ways!) may not be complete yet! So minor PRs may follow along the way as we discover additional edge cases...

I ran the code against all amazon transactions I had available (DE and tests) and there were no exceptions thrown. For the last months I thoroughly checked the parsed invoices and the created transactions against my account transactions as well.

I added test cases for every invoice scenario I experienced to ensure code compatibility in the future. I think this makes sense when working with multiple languages since otherwise the contributors will only have their own data at hand for testing.

As always, I'm thankful for any comments and improvements. I really appreciate the effort you guys take to maintain this (and the finance-dl) codebase!

@moritzj29 moritzj29 marked this pull request as ready for review March 24, 2022 20:52
@@ -387,7 +395,7 @@ def make_amazon_transaction(
(INVOICE_DESCRIPTION, adjustment.description),
]),
))
if invoice.tax is not None and invoice.tax.number != ZERO:
if len(invoice.tax)>0 and invoice.tax.number != ZERO:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees invoice.tax is not None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it relates to the above discussion: #144 (comment)

But I cannot recall why I removed the check on None...

For digital orders, tax=[] on order level anyway, so no need to check for None.

For regular orders, tax is set via:

tax = locale.parse_amount(
        get_field_in_table(payment_table, locale.regular_estimated_tax))

Which may return None if the field is not found...

Following up on the discussion above, I would vote for checking on None in parse_regular_order_invoice and setting tax=[] in this case. This ensures that in the resulting JSON "tax": []. Maybe also logging an error or warning might make sense, since so far all regular invoices contain tax information.

What are your thoughts on this? Did I miss something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I tried to tackle this and stumbled upon an inconsistency in the current codebase:

According to the typedefs tax: Amount for Order and Shipment. But this is not correct. It is actually tax: Sequence[Adjustment] for Shipment. So to have an empty tax field on shipment level one needs to set tax=[], whereas on order level tax=None, since Amount can be None.

I don't know why the typedef check does not catch this...

When it comes to digital orders, the current code sets tax=[] on order level. This is not correct, since Amount cannot be [], it must be None. The code works nevertheless, since tax is always given on shipment level for digital orders and therefore tax on order level is not used anyway.

From the code perspective, I would go with the above corrections. But my dilemma arises when it comes to the JSON output for the tests: Setting tax=None translates into tax: null in the JSON. You suggested above that we should stick to the convention field=[]. I would drop this requirement since it would require additional modifications of the data (only for the test files...).

@Zburatorul are you fine with this? what are your thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a proposal in 4be65b3

  • fix shipment.tax to type List[Adjustment]
  • fix order.tax to type Amount
  • results in tax: null in JSON for orders with no tax transaction (tax not given or included in item price)

Copy link
Collaborator

@Zburatorul Zburatorul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this massive contribution @moritzj29. Great work!
I don't have any objections to the code structure, but I'd like to discuss the user experience a bit.

Right now I see two kinds of undesirable failure modes:

  1. The gift card (purchase) matching string is, as you say, a guess. If we're wrong it won't match silently(?) and the user will wonder why the importer didn't add a new transaction.
  2. Some missing strings in the de_DE locale. If they are ever needed, the failure will not be gentle.

I submit that a good UX in both cases is for the importer 1) not to fail 2) not to offer/identify any transactions 3) warn the user that it cannot handle these cases and suggest submitting an issue.
Does anyone see a better way?

In terms of implementation, my impromptu suggestion is to 1) leave the gift card related string empty 2) catch whatever exception the abstract base class throws when importer tries to use one of those fields 3) print a nice message for the user ideally specifying what field name was missing.

beancount_import/source/amazon_invoice.py Show resolved Hide resolved
currency='EUR'
items_subtotal='Zwischensumme:'
total_before_tax='Summe ohne MwSt.:'
# most of translations still missing ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens if a missing translated string is needed? Presumably an exception because the base class is abstract.

Do you see a way for us to handle this gracefully?
We'd ideally notify the user of the fact that we cannot parse their invoice, not crash, and even suggest they open an issue.

regular_estimated_tax = 'Estimated tax to be collected:'
regular_order_placed=r'(?:Subscribe and Save )?Order Placed:\s+([^\s]+ \d+, \d{4})'
regular_order_id=r'.*Order ([0-9\-]+)'
gift_card='Gift Cards' # not confirmed yet!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I also think we'd better not guess the correct string to match, but fail gracefully.

@moritzj29
Copy link
Contributor Author

Thank you so much for your comments and working through all that code!

I agree with you that the UX should be as good as possible and crashing of the importer must be avoided.

I will check the impact of the missing translations. Most of them are just additional list entries, so the class attribute itself is available. It just matches less cases.

I think for the pretax_adjustment_fields_pattern this should not be an issue. Just the totals will not match and raise an error in the UI. I also guess that most of them don't even exist for DE...

The parse_gift_cards methods looks for an appropriate table and does nothing if the table is not found.

I mean, the question boils down to "how to check for elements to be present we don't know yet how to check on". I think the most robust way would be to raise and catch exceptions. Then the importer would fail gracefully on invoice level.

As you probably recognized, I added a lot of debug logging statements. Since I found it really difficult to get a grasp on where the code failed when I started porting it. This way one should at least get an idea on where it fails...

Also it would be great to add more testdata, also for en_US, e.g. a corresponding test for all these regexes. I will try to add a README or similar with some images indicating which part of the HTML invoice translate into which pattern. This will make it more easy to compare invoice and code and check for missing cases.

@moritzj29
Copy link
Contributor Author

I went through the code once again and added missing types, comments and error messages. I think I addressed all of your comments above (see my comments above), at least with a proposal.

I think the code is quite robust now. There are lots of debugging statements and error messages whenever possible. And the whole amazon_invoice code is wrapped inside a try: except: statement anyway.

I propose that I squash everything into a single commit, once we are done with the review. There have been quite a lot of commits in the meantime...

On the long run, I propose to move the amazon_invoice code into its own class. But this should be a separate PR due to the large expected diff...

@Zburatorul Zburatorul merged commit 01876b7 into jbms:master Mar 28, 2022
@Zburatorul
Copy link
Collaborator

Thanks again @moritzj29.
I squashed the commits myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants