WoS Tagged: Reimplement the core algorithms. #3062

zoe-translates · 2023-06-20T04:43:31Z

The previous iteration of the import translator for Web of Science Tagged Format is becoming a bit difficult to maintain because of the tight coupling of logic, data, and actions.

To improve the separation of concerns, in the updated version several measures have been taken:

- Use separate methods for distinct functions such as line parsing/validation (purely formal text-processing), intermediate transformations (data normalization based on semantics, etc.), and data-to-item mapping.
- Use a lookup-table approach to help with mapping tagged data to object properties, which is easier to debug/maintain. It can also better support "polymorphism", the unfortunate fact that the same tag can mean different things depending on the item type.
- Implement better text-processing (e.g. collapsing spaces when necessary, cleaning up line noise, more robust handling of author names, etc.).
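
As a rough illustration of the lookup-table idea and the "polymorphism" it has to handle, here is a minimal sketch. It is not the code from this PR: the table contents, the item-type keys, and the helper name `fieldFor` are all hypothetical.

```javascript
// Hypothetical mapping table (not the translator's real one): a tag maps
// either to a single field name, or to an object keyed by item type when
// the same tag means different things for different item types.
const FIELD_MAP = {
	TI: "title",
	PY: "date",
	SO: {
		journalArticle: "publicationTitle",
		bookSection: "bookTitle",
		default: "publicationTitle"
	}
};

// Resolve the item field for a tag, given the item type.
// Returns null for tags that are not in the table.
function fieldFor(tag, itemType) {
	if (!Object.prototype.hasOwnProperty.call(FIELD_MAP, tag)) return null;
	const entry = FIELD_MAP[tag];
	if (typeof entry === "string") return entry;
	return entry[itemType] || entry.default;
}
```

With this table, `fieldFor("SO", "bookSection")` returns `"bookTitle"`, while `fieldFor("SO", "patent")` falls back to `"publicationTitle"`.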

Overall the goal is to make the translator more robust and easier to reason about.

TODO: To ensure the best compatibility of behaviour, none of the test cases has been updated for now. After this commit, I'll introduce temporary devices to ensure that the new code, with its new underlying structure, produces the same output as the old one (to the point of being bug-compatible, except for the most egregious bugs). When that is achieved, the temporary compatibility devices will be removed, and the test cases will be updated and manually verified. After that, the new test cases will serve as the basis for incremental improvements of the new code.

zoe-translates · 2023-06-20T04:50:27Z

This supersedes #3053.

- Non-title fields, especially those that tend to be in ALL CAPS, are converted to "Title Case", taking into account special forms such as IEEE, ACM, etc.
- Minor readability fixes.

- Spurious spaces in original tests corrected.
- Titles no longer converted to TitleCase; they respect the pref "capitalizeTitles".
- Tags no longer unconditionally turned into lower case.
- Some likely-misinterpreted fields removed.

Since there are no definite criteria for what constitutes a WoS tagged file, we simply do a mock doImport() for detection. If the mock import runs and would have saved a non-empty item, detectImport() returns true. detectImport() now uses the same parsing facilities as doImport(), but with early exit and without saving any items.
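
The mock-import idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `readRecords` and `mockDetect` are hypothetical names, lines are passed in as an array so the example runs standalone, and a record is assumed to end at a bare `ER` line.

```javascript
// Hypothetical shared record reader: collects "TAG content" lines into a
// record object and hands each finished record to onRecord. Reading stops
// early as soon as the callback returns false.
function readRecords(lines, onRecord) {
	let record = {};
	for (const rawLine of lines) {
		const line = rawLine.trimEnd();
		if (line === "ER") {
			const keepGoing = onRecord(record);
			record = {};
			if (keepGoing === false) return;
			continue;
		}
		const match = /^([A-Z][A-Z0-9]) (.+)$/.exec(line);
		if (match) record[match[1]] = match[2];
	}
}

// Detection as a mock import: nothing is saved, and reading stops at the
// first record that would have produced a non-empty item.
function mockDetect(lines) {
	let found = false;
	readRecords(lines, (record) => {
		found = Object.keys(record).length > 0;
		return !found;
	});
	return found;
}
```

For example, `mockDetect(["PT J", "TI Foo", "ER", "EF"])` returns `true`, while `mockDetect(["hello world"])` returns `false`.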

- Move some getArrayJoiner() handlers to the simple handler category.
- No need to cache the functions returned by getArrayJoiner(), which after simplification can no longer be reused.
- Adjust the list of "special forms" in the wrapper to ZU.capitalizeTitle(), focusing on what may appear in publisher names (WoS puts those names in all caps).
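
To make the "special forms" idea concrete, here is a simplified stand-in. It does not call ZU.capitalizeTitle(); the function name `toTitleCaseProtected` and the contents of the `PROTECTED` list are assumptions for illustration, and the capitalization is deliberately naive (every word is capitalized, and a protected form followed by punctuation would not be recognized).

```javascript
// Hypothetical list of forms that must keep their all-caps spelling.
const PROTECTED = ["IEEE", "ACM"];

// Naive Title Case conversion for all-caps names that leaves the
// protected forms untouched.
function toTitleCaseProtected(text) {
	return text
		.toLowerCase()
		.split(" ")
		.map((word) => {
			const upper = word.toUpperCase();
			if (PROTECTED.includes(upper)) return upper;
			return word.charAt(0).toUpperCase() + word.slice(1);
		})
		.join(" ");
}
```

With this sketch, `toTitleCaseProtected("IEEE COMPUTER SOC")` gives `"IEEE Computer Soc"` rather than `"Ieee Computer Soc"`.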

zoe-translates · 2023-07-05T10:32:26Z

Behavioural differences from the current implementation:

- Title fields are converted to Title Case only if the user enables the capitalizeTitles pref.
- Tags are kept as-is, without letter-case conversion.
- Detection is no longer based on checking the "PT" tag. It is now done by a mock import that exits early and doesn't save items.

Bugfixes:

- Much less prone to crashing due to line noise (e.g. trailing spaces).
- Continued lines no longer generate spurious space characters.
- Publisher, conference, and organization (assignee) names are more robust against "forced" Title Case conversion (so "IEEE" in publisher / conference name won't become "Ieee").
- Some fields removed (e.g. "C1", which is for creator affiliation, no longer interpreted as the place field for patents).
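
The first two bugfixes can be pictured with a small line scanner. This is only a sketch under assumptions: it assumes a line is either a two-character tag optionally followed by a space and content, or a continuation line that starts with whitespace. The names `parseLines` and `joinContinued` are hypothetical.

```javascript
// Scan raw lines into { tag, values } entries, tolerating line noise.
function parseLines(lines) {
	const fields = [];
	for (const rawLine of lines) {
		const line = rawLine.replace(/\s+$/, ""); // drop trailing spaces
		if (!line) continue; // blank or whitespace-only line
		const tagMatch = /^([A-Z][A-Z0-9])(?: (.*))?$/.exec(line);
		if (tagMatch) {
			// A bare tag without content is kept, with no values
			const content = (tagMatch[2] || "").trim();
			fields.push({ tag: tagMatch[1], values: content ? [content] : [] });
		}
		else if (/^\s/.test(line) && fields.length) {
			// Continuation line: belongs to the most recent tag
			fields[fields.length - 1].values.push(line.trim());
		}
		// Anything else is treated as noise and skipped
	}
	return fields;
}

// Join continued lines of a text field with exactly one space.
function joinContinued(values) {
	return values.join(" ").replace(/\s+/g, " ").trim();
}
```

With this sketch, an `"EF "` line with a trailing space yields an entry with an empty `values` array instead of an error, and `joinContinued(["A long", "title"])` gives `"A long title"`.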

Underlying enhancements:

Enforced separation of concerns so that parsing (lines -> internal data structure), normalization (operations on internal data), and item generation (mapping internal data to item) are now better separated.
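
A compact way to picture the three stages is the sketch below. The function names and the plain-object "item" are hypothetical stand-ins, not the translator's real structures; the only rule taken from the discussion is that full author names (AF) are preferred over short ones (AU).

```javascript
// Stage 1: lines -> internal data (tag -> array of values).
function parse(lines) {
	const data = {};
	let current = null;
	for (const rawLine of lines) {
		const line = rawLine.trimEnd();
		const match = /^([A-Z][A-Z0-9]) (.+)$/.exec(line);
		if (match) {
			current = match[1];
			(data[current] = data[current] || []).push(match[2].trim());
		}
		else if (current && /^\s+\S/.test(line)) {
			data[current].push(line.trim()); // continuation line
		}
	}
	return data;
}

// Stage 2: operate on the internal data only.
function normalize(data) {
	const out = Object.assign({}, data);
	// If full names are present, drop the short ones
	if (out.AF && out.AF.length) delete out.AU;
	return out;
}

// Stage 3: map the normalized data to a plain item object.
function toItem(data) {
	const item = { creators: [] };
	for (const name of data.AF || data.AU || []) item.creators.push(name);
	if (data.TI) item.title = data.TI.join(" ");
	return item;
}
```

Each stage can then be tested or debugged on its own, e.g. `toItem(normalize(parse(lines)))`.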

These protected all-caps words are for journal name components that, although not converted by default, may be turned into TitleCase by the config pref `capitalizeTitles`.

dstillman · 2023-07-05T11:12:39Z

I'm afraid this is just way too complicated. This translator went from being trivial to understand to being something that someone would have to study deeply to figure out what it's doing.

Translators really aren't supposed to be complicated programs in their own right where someone stepping into the code needs to figure out some custom software architecture each time — they're supposed to be simple, easy-to-understand, fairly procedural bits of code that someone uninitiated and/or inexperienced can make tweaks to in a few minutes. If the question is ever "should I do something clever/elegant/object-oriented/formal/functional-programming-based or should I not do that?", the answer should almost always be "not do that".

That certainly applies to creating an ItemMap object with tons of methods, and it possibly even applies to the getArrayJoiner()/etc. stuff. Maybe the latter somehow makes things much easier — I haven't studied this (which is sort of the point) — but if I'm someone fixing this translator a year from now, I feel like I'd much rather see this:

else if (field == "PU") {
	item.publisher = content;
}

than this:

PU: getArrayJoiner("publisher", true, true), // mitigate all-caps

and then have to figure what in the world getArrayJoiner() is.

Bug fixes are great, but the previous translator was written exactly how almost all our other translators are written, and unless there's an extremely compelling reason to change it, it should stay that way.

zoe-translates · 2023-07-05T14:39:39Z

Ah, am I correct in saying that we want fewer closures, jump tables, and return function (...) { ... }; constructs -- fewer indirect approaches, and doing things in a more direct way (else if, or case "PU": { ... })?

If that's right, I think I can combine the best of the two worlds.

I can still keep the three major steps fairly separate -

1. line scanning ("lexical" analysis),
2. internal "normalization" (removing redundant tags, consolidating related tags), and
3. transforming internal data to items.

The very reason I picked them apart was that I suffered from exactly this problem of "fixing this translator a year from now" -- I was that person. The earlier version crashed on innocuous-looking input (one extra space after the file-end "EF" tag, a bare unknown tag without content, etc.) and the error message wasn't helpful. It was really difficult for me to even locate where it went wrong, because those three major jobs were tightly intertwined rather than separated into steps.

The main source of complexity in my code was caused by step 3, and now I can see how to simplify it -

- Flatten the higher-order functions - it turned out they mostly just moved the complex and repetitive stuff from the control flow to the jump tables, with more indirection. So I can switch to a more procedural style and it will actually be easier to follow.
- Instead of grouping different kinds of operations into three (!!) lookup table objects, just group them by the order in the switch-case or else-if control flow: first do the array-type output fields (creators, tags), then string fields requiring some non-trivial processing, and then the trivial fields.
- Replace intimidating names (e.g. "this.cursor") with friendlier ones (e.g. "this.currentTag").
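
The proposed ordering could look roughly like this. It is a sketch only: the tag meanings used here (DE as semicolon-separated keywords, TI as title, VL/IS/DI as volume, issue, and DOI), the table `SIMPLE_FIELDS`, and the function name `populate` are assumptions for illustration.

```javascript
// Hypothetical static table for trivial tag -> field copies.
const SIMPLE_FIELDS = { VL: "volume", IS: "issue", DI: "DOI" };

// Populate a plain item object from normalized data (tag -> values).
function populate(item, data) {
	for (const [tag, values] of Object.entries(data)) {
		switch (tag) {
			// 1. Array-type output fields
			case "DE":
				for (const keyword of values.join(" ").split(";")) {
					if (keyword.trim()) item.tags.push(keyword.trim());
				}
				break;
			// 2. String fields that need some processing
			case "TI":
				item.title = values.join(" ").replace(/\s+/g, " ");
				break;
			// 3. Trivial fields, copied via the static table
			default:
				if (Object.prototype.hasOwnProperty.call(SIMPLE_FIELDS, tag)) {
					item[SIMPLE_FIELDS[tag]] = values.join(" ");
				}
		}
	}
	return item;
}
```

Someone adding a new trivial field only touches the table; someone fixing a non-trivial field finds its `case` directly in the switch.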

If that looks good to you, I can do it (but not today :)

- There is only one lookup table now, and it is used for static mapping of tags to Zotero item fields.
- The tags are processed in a loop with switch-case statements, rather than using dispatch tables.
- detectImport() now uses a simplified logic (check the first 10 lines for PT or DT, similar to RIS.js), and "checkOnly" is no longer referenced anywhere in the code.
- Some minor fixes in data normalization.
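
The simplified detection can be sketched as below. In a real translator the lines would presumably be read one at a time with Zotero.read(); here they are passed in as an array so the example runs on its own, and the function name `looksLikeWoSTagged` is hypothetical.

```javascript
// Return true if one of the first 10 lines starts with a PT or DT tag
// followed by some content.
function looksLikeWoSTagged(lines) {
	for (const line of lines.slice(0, 10)) {
		if (/^(PT|DT) \S/.test(line)) return true;
	}
	return false;
}
```

A file whose first `PT` line appears only on line 11 or later is not detected by this check, which is the trade-off for keeping detection cheap.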

zoe-translates · 2023-07-06T07:31:28Z

So by now, I've removed the indirect tables and replaced them with static mapping of input -> output fields. Each tag finds the code to process it in a switch-case group of statements.

The main difference from the earlier code is that there, the item was first populated, then normalized in completeItem(). Here, we first normalize, then (sort of mindlessly) populate the item.

For the older approach to work, the Zotero item was populated with some kind of hack: the creator field didn't really conform to the expected data structure, and it was difficult to explain how it was stitched up in completeItem().

       // If we have full names, drop the short ones
       if (item.creators[0]["AF"].length) {
               creators = item.creators[0]["AF"];
       } else {
               creators = item.creators[0]["AU"];
       }
       // Add other creators
       if (item.creators[1])
               item.creators = creators.concat(item.creators[1]);
       else
               item.creators = creators;

What was so special about item.creators[0] and item.creators[1] there? It was perplexing. I doubt "[the] uninitiated and/or inexperienced" could "make tweaks to in a few minutes" if that part of the code needed fixing.

Now, I can say for sure that each time a Zotero item field is populated in save(), it is populated with the expected type of data.

	case "AF":
	case "AU":
		addCreator(item, tagValueArray, type === "patent" ? "inventor" : "author");
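
For readers wondering what such a helper might do, here is a guess at its shape, not the actual implementation from the PR. It assumes names arrive as "Last, First" strings and produces creator objects with `lastName`, `firstName`, and `creatorType`; names without a comma are stored in single-field mode (`fieldMode: 1`).

```javascript
// Hypothetical sketch of addCreator(): append one creator object per name.
function addCreator(item, names, creatorType) {
	for (const name of names) {
		const [last, ...rest] = name.split(",");
		const lastName = last.trim();
		const firstName = rest.join(",").trim();
		if (!lastName) continue; // skip empty entries
		if (firstName) {
			item.creators.push({ lastName, firstName, creatorType });
		}
		else {
			// No comma: keep the whole name in one field
			item.creators.push({ lastName, creatorType, fieldMode: 1 });
		}
	}
}
```

Because the names are already normalized before this point, every call pushes creators of the expected shape, and nothing needs to be stitched up afterwards.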

Just like yours, my intention was also to make working on the code easier for some future person (myself included). But if the current version is still too complex for that purpose, I welcome your advice and ideas about possible solutions.

dstillman · 2023-07-06T21:12:50Z

Much better!

zoe-translates · 2023-07-07T03:24:12Z

Hi @dstillman, thank you, but I should've let you know that there were more revisions in that direction. Because this is already closed, I rebased the further commits as #3073. Can you take a look?

zoe-translates mentioned this pull request Jun 20, 2023

Web of Science Tagged: More robust against trailing space characters. #3053

Closed

zoe-translates added 6 commits July 5, 2023 11:02

[WIP] WoS Tagged: Move type determination into ItemMap#normalize()

bc6e07b

[WIP] WoS Tagged: Further cleanups.

b45a3ec

[WIP] WoS Tagged: Update test cases.

ae9de29

zoe-translates force-pushed the WoS-Tagged-Format-Reimplementation branch from e76770a to cb27a4f Compare July 5, 2023 10:19

zoe-translates marked this pull request as ready for review July 5, 2023 10:20

zoe-translates changed the title ~~[WIP] WoS Tagged: Reimplement the core algorithms.~~ WoS Tagged: Reimplement the core algorithms. Jul 5, 2023

WoS Tagged: Add back some protected all-caps words.

f1c6681

dstillman mentioned this pull request Jul 6, 2023

Google Scholar: Add delays between consecutive network requests. #3043

Merged

zoe-translates added 2 commits July 6, 2023 14:24

WoS Tagged: [Minor] Remove leftover artefact in code.

bcfd55d

dstillman merged commit 80c211d into zotero:master Jul 6, 2023

zoe-translates mentioned this pull request Jul 7, 2023

WoS tagged format further simplifications #3073

Open
