[Amazon] Amazon import source for amazon.de #144

moritzj29 · 2021-12-12T19:48:41Z

Hello,

this is a proposal to generalize the Amazon import source to support additional languages. I added support for German files from amazon.de. For other languages, all the language-specific strings used in the regexes will have to be collected.

The code is working already but probably needs some polishing.

I added lots of debug logging output; I hope this is fine. Otherwise you are completely lost when trying to find the translated regexes...

It runs fine on the tests available, so it should be backward compatible.

Sorry that the code diff is so bad! I moved almost all existing methods into a new class, so the changed indentation causes every line to be marked as changed...

I'm marking this as a draft for now, but comments are very welcome!

add tests for amazon DE
same extent of logging in all methods, remove duplicates
add locale argument for direct calling
rename locale to de_DE etc.
minimize diff volume by moving methods outside the class, just pass locale as argument

moritzj29 · 2021-12-12T19:57:08Z

testdata/source/amazon/D56-5204779-4181560.json

@@ -54,7 +54,7 @@
        }
    ],
    "pretax_adjustments": [],
-    "tax": [],
+    "tax": null,


I changed this, since otherwise the code (also the original?!) throws an exception here:

beancount-import/beancount_import/source/amazon.py

Line 390 in dca3406

if invoice.tax is not None and invoice.tax.number != ZERO:

So I set invoice.tax=None instead of invoice.tax=[] for digital invoices.

Can we change the code instead, so that all the fields in the JSON follow the same convention (field = [])?

Makes sense. I changed the code in amazon.py to handle [].

jbms · 2021-12-15T05:27:54Z

Perhaps you can keep the parsing functions as free functions, but pass in the locale object as a parameter --- that way the diff would be much smaller?

jbms · 2021-12-15T05:29:32Z

Also, locales are commonly named like "en_US" or "de_DE". You are just using the language, but given that you are including such things as whether the tax is included, it should probably be a full locale.
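A minimal sketch of how the two suggestions above might fit together: the parsing helpers stay free functions and receive a locale object, and locales are keyed by full names such as `en_US` and `de_DE`. The class names, the `tax_included_in_price` flag and its values, and the helper functions are assumptions made for illustration; only the label strings are taken from this PR.

```python
import re
from typing import List, Optional, Tuple


class LocaleEnUS:
    """Hypothetical locale container for amazon.com."""
    name = 'en_US'
    tax_included_in_price = False  # assumed flag name and value
    items_subtotal = r'Item\(s\) Subtotal:'
    total_before_tax = 'Total before tax:'


class LocaleDeDE:
    """Hypothetical locale container for amazon.de."""
    name = 'de_DE'
    tax_included_in_price = True  # assumed flag name and value
    items_subtotal = 'Zwischensumme:'
    total_before_tax = r'Summe ohne MwSt\.:'


# Look up a locale by its full name, e.g. LOCALES['de_DE'].
LOCALES = {cls.name: cls for cls in (LocaleEnUS, LocaleDeDE)}


def get_field(rows: List[Tuple[str, str]], pattern: str) -> Optional[str]:
    """Return the value of the first (label, value) row whose label matches."""
    for label, value in rows:
        if re.fullmatch(pattern, label):
            return value
    return None


def get_items_subtotal(rows: List[Tuple[str, str]], locale=LocaleEnUS) -> Optional[str]:
    """Free function: the locale is a parameter, so no enclosing class is needed."""
    return get_field(rows, locale.items_subtotal)
```

Because the locale defaults to `en_US`, existing callers that pass no locale keep their behaviour, which is one way to keep such a change backward compatible.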

moritzj29 · 2021-12-15T06:06:18Z

Very valid points, I really appreciate the feedback! I will change the code accordingly!

moritzj29 · 2022-03-24T18:18:24Z

beancount_import/amount_parsing.py

@@ -30,12 +30,18 @@ def parse_amount(x, assumed_currency=None):
    if not x:
        return None
    sign, amount_str = parse_possible_negative(x)
-    m = re.fullmatch(r'(?:[(][^)]+[)])?\s*([\$€£])?((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?', amount_str)
+    m = re.fullmatch(r'(?:[(][^)]+[)])?\s*([\$€£]|[A-Z]{3})?\s*((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?', amount_str)


Allow the currency to be specified as a three-letter prefix, e.g. EUR 20.00.
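The effect of the changed pattern can be checked in isolation. The sketch below copies both patterns from the diff above; `split_amount` is a hypothetical helper that only exposes the three capture groups. How `parse_amount` turns those groups into an amount is not shown in the diff, so it is left out here.

```python
import re

# Pattern from the "+" line of the diff: the prefix may be a symbol or three letters.
NEW_PATTERN = r'(?:[(][^)]+[)])?\s*([\$€£]|[A-Z]{3})?\s*((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?'
# Pattern from the "-" line: only a currency symbol is accepted as a prefix.
OLD_PATTERN = r'(?:[(][^)]+[)])?\s*([\$€£])?((?:[0-9](?:,?[0-9])*|(?=\.))(?:\.[0-9]+)?)(?:\s+([A-Z]{3}))?'


def split_amount(text, pattern=NEW_PATTERN):
    """Return (prefix, number, suffix), or None if the whole string does not match."""
    m = re.fullmatch(pattern, text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


print(split_amount('EUR 20.00'))               # three-letter prefix, new pattern
print(split_amount('EUR 20.00', OLD_PATTERN))  # same input, old pattern
print(split_amount('20.00 EUR'))               # trailing currency code still works
```

With the old pattern the prefixed form does not match at all, while the symbol-prefixed and suffixed forms are captured by both patterns.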

moritzj29 · 2022-03-24T20:16:21Z

beancount_import/source/amazon_invoice.py

+    regular_estimated_tax = 'Estimated tax to be collected:'
+    regular_order_placed=r'(?:Subscribe and Save )?Order Placed:\s+([^\s]+ \d+, \d{4})'
+    regular_order_id=r'.*Order ([0-9\-]+)'
+    gift_card='Gift Cards' # not confirmed yet!


I don't know how gift cards are handled on amazon.com; these are just guesses inferred from amazon.de. There, (electronic) gift cards are handled differently from regular shipments: they have their own "shipment table".

Everything is handled in the new parse_gift_cards method.
If there is no table node with text matching gift_card, the code doesn't do anything (probably the case for en_US). I decided not to add an additional boolean flag to enable/disable the parse_gift_cards method.

Here I also think we'd better not guess the correct string to match, but fail gracefully.

If it is not found, it does not even fail. But I have made the attribute optional now.

I also added a check for the case where no items are found at all for a regular or digital order (no shipments, no gift cards); this should indicate that something is wrong with the parsing...
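The behaviour described in this thread could look roughly like the following. This is a sketch under stated assumptions, not the code from the PR: tables are reduced to their header strings, the two demo locale classes and the `Gift card table` pattern are invented placeholders, and `check_items_found` is a hypothetical name for the "no items at all" check.

```python
import re
from typing import List, Optional


class DemoLocaleWithGiftCards:
    # Placeholder pattern, invented for this sketch.
    gift_card: Optional[str] = r'Gift card table'


class DemoLocaleWithoutGiftCards:
    # The attribute is optional: None means "no known string to match".
    gift_card: Optional[str] = None


def parse_gift_cards(table_headers: List[str], locale) -> List[str]:
    """Return the headers that look like gift card tables, or [] if none apply."""
    if locale.gift_card is None:
        return []  # nothing to match against, so do nothing
    return [h for h in table_headers if re.search(locale.gift_card, h)]


def check_items_found(shipments: list, gift_cards: list) -> List[str]:
    """Report an error when an order yielded neither shipments nor gift cards."""
    if shipments or gift_cards:
        return []
    return ['no items found in order (no shipments, no gift cards)']
```

The second function is what turns a silent non-match into a visible error: an order that produces nothing at all is more likely a parsing problem than an empty order.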

moritzj29 · 2022-03-24T20:19:10Z

beancount_import/source/amazon_invoice.py

+    currency='EUR'
+    items_subtotal='Zwischensumme:'
+    total_before_tax='Summe ohne MwSt.:'
+    # most of translations still missing ...


Unfortunately, there is no way to get all these translations (if they exist at all). I parsed all my Amazon transactions from the last few years and added every case that occurred. There will probably be minor PRs in the future adding further translations...

So what happens if a missing translated string is needed? Presumably an exception because the base class is abstract.

Do you see a way for us to handle this gracefully?
We'd ideally notify the user of the fact that we cannot parse their invoice, not crash, and even suggest they open an issue.

Exceptions should not happen. The base class is abstract, but all attributes are (and have to be!) set in the derived class. Only alternatives may be missing, which will not lead to exceptions, just to missed cases.
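The distinction drawn here, between an attribute that is missing entirely and an alternative that is missing from a list, can be illustrated as follows. Everything in this sketch is an assumption for illustration: the PR does not show how the base class enforces its attributes, and `LocaleBase`, `REQUIRED`, `pretax_adjustment_fields`, `find_adjustment`, and the `Known adjustment:` label are all hypothetical.

```python
import re
from typing import List, Optional


class LocaleBase:
    """Hypothetical base class: every subclass must define all required names."""
    REQUIRED = ('currency', 'items_subtotal', 'pretax_adjustment_fields')

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        missing = [n for n in cls.REQUIRED if not hasattr(cls, n)]
        if missing:
            # Fails when the locale class is defined, not while parsing an invoice.
            raise TypeError('%s is missing: %s' % (cls.__name__, ', '.join(missing)))


class DemoLocale(LocaleBase):
    currency = 'EUR'
    items_subtotal = 'Zwischensumme:'
    # Only the alternatives seen so far; the label is a placeholder.
    pretax_adjustment_fields: List[str] = ['Known adjustment:']


def find_adjustment(label: str, locale) -> Optional[str]:
    """Return the matching pattern, or None for a label nobody has added yet."""
    for pattern in locale.pretax_adjustment_fields:
        if re.fullmatch(pattern, label):
            return pattern
    return None
```

An unseen label yields `None` rather than an exception, so the effect is a missed adjustment that later shows up as a totals mismatch.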

moritzj29 · 2022-03-24T20:27:56Z

beancount_import/source/amazon_invoice.py

-        items_subtotal = parse_amount(
-            get_field_in_table(shipment_table, r'Item\(s\) Subtotal:'))
-        expected_items_subtotal = reduce_amounts(
-            beancount.core.amount.mul(x.price, D(x.quantity)) for x in items)
-        if (items_subtotal is not None and
-            expected_items_subtotal != items_subtotal):
-            errors.append(
-                'expected items subtotal is %r, but parsed value is %r' %
-                (expected_items_subtotal, items_subtotal))
-
-        output_fields = dict()
-        output_fields['pretax_adjustments'] = get_adjustments_in_table(
-            shipment_table, pretax_adjustment_fields_pattern)
-        output_fields['posttax_adjustments'] = get_adjustments_in_table(
-            shipment_table, posttax_adjustment_fields_pattern)
-        pretax_parts = [items_subtotal or expected_items_subtotal] + [
-            a.amount for a in output_fields['pretax_adjustments']
-        ]
-        total_before_tax = parse_amount(
-            get_field_in_table(shipment_table, 'Total before tax:'))
-        expected_total_before_tax = reduce_amounts(pretax_parts)
-        if total_before_tax is None:
-            total_before_tax = expected_total_before_tax
-        elif expected_total_before_tax != total_before_tax:
-            errors.append(
-                'expected total before tax is %s, but parsed value is %s' %
-                (expected_total_before_tax, total_before_tax))
-
-        sales_tax = get_adjustments_in_table(shipment_table, 'Sales Tax:')
-
-        posttax_parts = (
-            [total_before_tax] + [a.amount for a in sales_tax] +
-            [a.amount for a in output_fields['posttax_adjustments']])
-        total = parse_amount(
-            get_field_in_table(shipment_table, 'Total for This Shipment:'))
-        expected_total = reduce_amounts(posttax_parts)
-        if total is None:
-            total = expected_total
-        elif expected_total != total:
-            errors.append('expected total is %s, but parsed value is %s' %
-                          (expected_total, total))
-
-        shipments.append(
-            Shipment(
-                shipped_date=shipped_date,
-                items=items,
-                items_subtotal=items_subtotal,
-                total_before_tax=total_before_tax,
-                tax=sales_tax,
-                total=total,
-                errors=errors,
-                **output_fields))


Moved all of this into a separate parse_shipment_payments method, since the same code is used by parse_gift_cards as well.
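The removed block repeats one pattern several times: compute an expected total from its parts, fall back to it if nothing was parsed, and record an error on a mismatch. A minimal sketch of that pattern as a reusable helper is shown below. `check_total` is a hypothetical name, and plain `Decimal` values stand in for the beancount amounts used in the real code.

```python
from decimal import Decimal
from typing import List, Optional


def check_total(name: str,
                parts: List[Decimal],
                parsed: Optional[Decimal],
                errors: List[str]) -> Decimal:
    """Return the parsed total, or the sum of parts if nothing was parsed.

    A mismatch between the two is appended to errors instead of raising.
    """
    expected = sum(parts, Decimal(0))
    if parsed is None:
        return expected
    if expected != parsed:
        errors.append('expected %s is %s, but parsed value is %s' %
                      (name, expected, parsed))
    return parsed


errors: List[str] = []
subtotal = Decimal('10.00')
tax = Decimal('1.90')
total = check_total('total', [subtotal, tax], Decimal('12.00'), errors)
print(total, errors)
```

Collecting errors in a list, as the original code does, lets the caller attach them to the shipment and keep parsing the rest of the invoice.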

moritzj29 · 2022-03-24T20:52:50Z

Sooo, I think this code is finally ready for review now. I tried to keep the changes to a minimum to reduce the diff, but unfortunately I had to cover some additional cases (special gift cards order table) which resulted in splitting some existing methods.

As said in the above code comments, translations (both ways!) may not be complete yet! So minor PRs may follow along the way as we discover additional edge cases...

I ran the code against all Amazon transactions I had available (DE and tests) and no exceptions were thrown. Over the last few months I have also thoroughly checked the parsed invoices and the created transactions against my account transactions.

I added test cases for every invoice scenario I experienced to ensure code compatibility in the future. I think this makes sense when working with multiple languages since otherwise the contributors will only have their own data at hand for testing.

As always, I'm thankful for any comments and improvements. I really appreciate the effort you guys take to maintain this (and the finance-dl) codebase!

beancount_import/source/amazon_invoice_test.py

Zburatorul · 2022-03-25T01:05:38Z

beancount_import/source/amazon.py

@@ -387,7 +395,7 @@ def make_amazon_transaction(
                        (INVOICE_DESCRIPTION, adjustment.description),
                    ]),
                ))
-        if invoice.tax is not None and invoice.tax.number != ZERO:
+        if len(invoice.tax)>0 and invoice.tax.number != ZERO:


What guarantees invoice.tax is not None?

it relates to the above discussion: #144 (comment)

But I cannot recall why I removed the check on None...

For digital orders, tax=[] on order level anyway, so no need to check for None.

For regular orders, tax is set via:

tax = locale.parse_amount( get_field_in_table(payment_table, locale.regular_estimated_tax))

Which may return None if the field is not found...

Following up on the discussion above, I would vote for checking for None in parse_regular_order_invoice and setting tax=[] in this case. This ensures that the resulting JSON contains "tax": []. Logging an error or warning might also make sense, since so far all regular invoices have contained tax information.

What are your thoughts on this? Did I miss something?

So I tried to tackle this and stumbled upon an inconsistency in the current codebase:

According to the type definitions, tax: Amount for both Order and Shipment. But this is not correct: it is actually tax: Sequence[Adjustment] for Shipment. So to have an empty tax field on shipment level one needs to set tax=[], whereas on order level one sets tax=None, since Amount can be None.

I don't know why the type check does not catch this...

When it comes to digital orders, the current code sets tax=[] on order level. This is not correct, since Amount cannot be [], it must be None. The code works nevertheless, since tax is always given on shipment level for digital orders and therefore tax on order level is not used anyway.

From the code perspective, I would go with the above corrections. But my dilemma arises when it comes to the JSON output for the tests: Setting tax=None translates into tax: null in the JSON. You suggested above that we should stick to the convention field=[]. I would drop this requirement since it would require additional modifications of the data (only for the test files...).

@Zburatorul are you fine with this? what are your thoughts?

I added a proposal in 4be65b3

fix shipment.tax to type List[Adjustment]

fix order.tax to type Amount

results in tax: null in JSON for orders with no tax transaction (tax not given or included in item price)
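The two conventions in this proposal can be made explicit with separate guards, one per level. This is a sketch only: `Amount` and `Adjustment` below are simplified stand-ins for the real types, and `order_has_tax`, `shipment_has_tax`, and `order_tax_to_json` are hypothetical helpers that do not appear in the PR.

```python
import json
from decimal import Decimal
from typing import List, NamedTuple, Optional

ZERO = Decimal(0)


class Amount(NamedTuple):
    """Simplified stand-in for a beancount amount."""
    number: Decimal
    currency: str


class Adjustment(NamedTuple):
    """Simplified stand-in for an invoice adjustment line."""
    description: str
    amount: Amount


def order_has_tax(tax: Optional[Amount]) -> bool:
    # Order level: a single optional Amount, so "empty" means None.
    return tax is not None and tax.number != ZERO


def shipment_has_tax(tax: List[Adjustment]) -> bool:
    # Shipment level: a list of adjustments, so "empty" means [].
    return any(a.amount.number != ZERO for a in tax)


def order_tax_to_json(tax: Optional[Amount]) -> str:
    # None serializes as null, which is what the test reference files then contain.
    value = None if tax is None else {'number': str(tax.number), 'currency': tax.currency}
    return json.dumps({'tax': value})
```

Keeping the two checks apart avoids mixing `len(...)` with `.number` on the same value, since each operation is only valid for one of the two types.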

Zburatorul

Thank you for this massive contribution @moritzj29. Great work!
I don't have any objections to the code structure, but I'd like to discuss the user experience a bit.

Right now I see two kinds of undesirable failure modes:

The gift card (purchase) matching string is, as you say, a guess. If we're wrong, it will silently(?) fail to match, and the user will wonder why the importer didn't add a new transaction.
Some strings are missing in the de_DE locale. If they are ever needed, the failure will not be gentle.

I submit that a good UX in both cases is for the importer 1) not to fail 2) not to offer/identify any transactions 3) warn the user that it cannot handle these cases and suggest submitting an issue.
Does anyone see a better way?

In terms of implementation, my impromptu suggestion is to 1) leave the gift card related string empty 2) catch whatever exception the abstract base class throws when importer tries to use one of those fields 3) print a nice message for the user ideally specifying what field name was missing.

beancount_import/source/amazon_invoice.py

Zburatorul · 2022-03-25T20:04:11Z

beancount_import/source/amazon_invoice.py

+    currency='EUR'
+    items_subtotal='Zwischensumme:'
+    total_before_tax='Summe ohne MwSt.:'
+    # most of translations still missing ...


So what happens if a missing translated string is needed? Presumably an exception because the base class is abstract.

Do you see a way for us to handle this gracefully?
We'd ideally notify the user of the fact that we cannot parse their invoice, not crash, and even suggest they open an issue.

Zburatorul · 2022-03-25T20:08:05Z

beancount_import/source/amazon_invoice.py

+    regular_estimated_tax = 'Estimated tax to be collected:'
+    regular_order_placed=r'(?:Subscribe and Save )?Order Placed:\s+([^\s]+ \d+, \d{4})'
+    regular_order_id=r'.*Order ([0-9\-]+)'
+    gift_card='Gift Cards' # not confirmed yet!


Here I also think we'd better not guess the correct string to match, but fail gracefully.

moritzj29 · 2022-03-26T06:40:13Z

Thank you so much for your comments and working through all that code!

I agree with you that the UX should be as good as possible and crashing of the importer must be avoided.

I will check the impact of the missing translations. Most of them are just additional list entries, so the class attribute itself is available. It just matches fewer cases.

I think for the pretax_adjustment_fields_pattern this should not be an issue. Just the totals will not match and raise an error in the UI. I also guess that most of them don't even exist for DE...

The parse_gift_cards method looks for an appropriate table and does nothing if the table is not found.

I mean, the question boils down to "how to check for elements to be present we don't know yet how to check on". I think the most robust way would be to raise and catch exceptions. Then the importer would fail gracefully on invoice level.
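Failing gracefully at the invoice level could look like the sketch below. It is illustrative only: `parse_invoices`, the warning text, and the shape of the input (a mapping from file name to content) are assumptions, and the broad `except Exception` is a deliberate choice here so that one unreadable invoice cannot stop the whole import.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


def parse_invoices(invoices: Dict[str, str],
                   parse_one: Callable[[str], object]) -> Tuple[List[object], List[str]]:
    """Parse each invoice independently; collect failures instead of crashing."""
    parsed: List[object] = []
    failed: List[str] = []
    for name, content in invoices.items():
        try:
            parsed.append(parse_one(content))
        except Exception as e:
            # Tell the user which invoice was skipped and why, then carry on.
            logger.warning(
                'Could not parse invoice %s (%s); skipping it. '
                'Please consider opening an issue.', name, e)
            failed.append(name)
    return parsed, failed


def demo_parse(content: str) -> str:
    """Toy parser used only to exercise the error path."""
    if 'Order' not in content:
        raise ValueError('order ID not found')
    return content.upper()


parsed, failed = parse_invoices(
    {'a.html': 'Order 1', 'b.html': 'unknown layout'}, demo_parse)
print(parsed, failed)
```

Returning the list of failed names alongside the results gives the caller a place to surface the problem in the UI rather than only in the log.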

As you probably noticed, I added a lot of debug logging statements, since I found it really difficult to get a grasp of where the code failed when I started porting it. This way one should at least get an idea of where it fails...

It would also be great to add more test data, also for en_US, e.g. a corresponding test for each of these regexes. I will try to add a README or similar with some images indicating which part of the HTML invoice translates into which pattern. This will make it easier to compare invoice and code and to check for missing cases.

moritzj29 · 2022-03-27T16:16:32Z

I went through the code once again and added missing types, comments and error messages. I think I addressed all of your comments above (see my comments above), at least with a proposal.

I think the code is quite robust now. There are lots of debugging statements and error messages whenever possible. And the whole amazon_invoice code is wrapped inside a try: except: statement anyway.

I propose that I squash everything into a single commit, once we are done with the review. There have been quite a lot of commits in the meantime...

In the long run, I propose to move the amazon_invoice code into its own class. But this should be a separate PR due to the large expected diff...

Zburatorul · 2022-03-28T03:28:01Z

Thanks again @moritzj29.
I squashed the commits myself.

moritzj29 added 2 commits December 12, 2021 20:28

amazon source with locale specification for EN and DE

27f88c0

update test reference output

75097fd

moritzj29 commented Dec 12, 2021

moritzj29 mentioned this pull request Dec 15, 2021

[Paypal] Add locales option, support for Paypal DE #142

Merged

4 tasks

moritzj29 added 17 commits December 18, 2021 14:37

addressed some PR comments, polishing

7570a8b

updated example docstring with new locale names

b3760b7

added some translations (de_DE)

6501b5f

add ability to parse prepended 3 letter currency specification, e.g. USD 1.99

c1c8af8

Merge branch 'dev/amount_parse' into dev/amazonDE

93e6b11

updated invoice sanitizer to not completely remove payment table for digital invoices (de_DE)

e9a9c95

add some sanitized invoice tests for de_DE

03b9a06

change invoice.tax from None to [] to match convention of other JSON fields

02cb706

de_DE correct adjustments: posttax instead of pretax

1864838

add ability to parse gift card orders correctly

17661bf

remove debugging logs

7afeaac

Merge branch 'dev/amazonDE' into dev/amazonDEgiftCard

8c109ac

add some test data

0bbce16

add test for direct debit

a562159

Merge branch 'dev/amazonDE' into dev/amazonDEgiftCard

f4b676c

# Conflicts: # beancount_import/source/amazon_invoice_test.py

add amazon account charge up including test case

3d4a077

remove debugging print statement

d3e701b

moritzj29 mentioned this pull request Jan 12, 2022

[Amazon] Multi-Language Support jbms/finance-dl#50

Merged

moritzj29 added 5 commits January 13, 2022 14:28

make quantity parsing more consistent, add log message if parsing failed

a641c94

fix DE shipment_quantity_pattern

a460017

fix shipment quantity algorithm

d367cdb

make payee match chosen locale

eda2adb

add nonshipped header translation

98f15f0

moritzj29 commented Mar 24, 2022

clean up imports, do not create locale instance

d5263ea

moritzj29 force-pushed the dev/amazonDE branch from 3511513 to d5263ea March 24, 2022 20:35

moritzj29 marked this pull request as ready for review March 24, 2022 20:52

Zburatorul reviewed Mar 24, 2022

beancount_import/source/amazon_invoice_test.py (outdated, resolved)

Zburatorul reviewed Mar 25, 2022

moritzj29 added 4 commits March 25, 2022 10:03

reduce excessive logging for amazon fresh invoices with irrelevant quantities

ff21279

correct test name

8715e51

add missing types

7aac6ad

add docstring with hierarchy of functions

f042a9f

Zburatorul reviewed Mar 25, 2022

moritzj29 added 11 commits March 26, 2022 17:10

use static class instead of class instance as default arguments

7338a7b

factor out is_items_ordered_header

a438613

clean up locales, add comments

69d146e

add comments and docstrings

577e196

add logging and error messages

e4b1ee9

reduce conditionals in parse_credit_card_transactions_from_payments_table

8b6de9c

move order ID and date extraction to beginning of parsing method, more logical

38fe597

add check and error message if no items were found for an order

2287972

fix handling of cases with no tax; tax on Order level is Amount (None possible), tax on shipment level is List

4be65b3

update types, fix Shipment tax type

41a14fa

make parse_gift_cards optional

66298c1

Zburatorul merged commit 01876b7 into jbms:master Mar 28, 2022
