Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WoS Tagged: Reimplement the core algorithms. #3062

Merged

Conversation

zoe-translates
Copy link
Collaborator

The previous iteration of the import translator for Web of Science Tagged Format is becoming a bit difficult to maintain because of the tight coupling of logic, data, and actions.

To improve the separation of concerns, in the updated version several measures have been taken:

  • Use separate methods for distinct functions such as line parsing/validation (purely formal text-processing), intermediate transformations (data normalization based on semantics, etc.), and data-to-item mapping.
  • Use a lookup-table approach to help with mapping tagged data to object properties, which is easier to debug/maintain. It can also better support "polymorphism", the unfortunate fact that the same tag can mean different things depending on the item type.
  • Implement better text-processing (e.g. collapsing spaces when necessary, cleaning up line noise, more robust handling of author names, etc.)

Overall the goal is to make the translator more robust and easier to reason with.

TODO: To ensure best compatibility of behaviour, none of the test cases has been updated for now. After this commit, I'll introduce temporary devices to ensure the new code, with its new underlying structure, produces the same output as the old one (to the point of bug-compatible except for the most egregious). When that is achieved, the temporary compatibility devices will be removed, and the test cases will be updated and manually verified. After that, the new test cases will serve as the basis for incremental improvements of the new code.

@zoe-translates
Copy link
Collaborator Author

This supersedes #3053.

The previous iteration of the import translator for Web of Science
Tagged Format is becoming a bit difficult to maintain because of the
tight coupling of logic, data, and actions.

To improve the separation of concerns, in the updated version several
measures have been taken:

- Use separate methods for distinct functions such as line
  parsing/validation (purely formal text-processing), intermediate
  transformations (data normalization based on semantics, etc.), and
  data-to-item mapping.
- Use a lookup-table approach to help with mapping tagged data to object
  properties, which is easier to debug/maintain. It can also better
  support "polymorphism", the unfortunate fact that the same tag can
  mean different things depending on the item type.
- Implement better text-processing (e.g. collapsing spaces when
  necessary, cleaning up line noise, more robust handling of author
  names, etc.)

Overall the goal is to make the translator more robust and easier to
reason with.

TODO: To ensure best compatibility of behaviour, none of the test cases
has been updated for now. After this commit, I'll introduce temporary
devices to ensure the new code, with its new underlying structure,
produces the same output as the old one (to the point of bug-compatible
except for the most egregious). When that is achieved, the temporary
compatibility devices will be removed, and the test cases will be
updated and manually verified. After that, the new test cases will
serve as the basis for incremental improvements of the new code.
- For non-title fields, especially those that tend to be in ALL CAPS,
  they are converted to "Title Case" while taking into account of some
  special forms such as IEEE, ACM, etc.
- Minor readability fixes.
- Spurious spaces in original tests corrected.
- Titles no longer converted to TitleCase; they respect the pref
  "capitalizeTitles".
- Tags no longer unconditionally turned into lower case.
- Some likely-misinterpreted fields removed.
Since there's no definite criteria of what constitutes a WoS tagged
file, we simply do a mock doImport() for detection. If the mock import
runs and would have saved a non-empty item, detectImport() returns true.

detectImport() now uses the same parsing facilities as doImport() but
with early exit and without saving any items.
- Move some getArrayJoiner() handlers to the simple handler category.
- No need to cache the functions returned by getArrayJoiner(), which
  after simplification can no longer be reused.
- Adjust the list of "special forms" in the wrapper to
  ZU.capitalizeTitle(), focus on what may appear in publisher names (WoS
  put those names in all-caps)
@zoe-translates zoe-translates force-pushed the WoS-Tagged-Format-Reimplementation branch from e76770a to cb27a4f Compare July 5, 2023 10:19
@zoe-translates zoe-translates marked this pull request as ready for review July 5, 2023 10:20
@zoe-translates zoe-translates changed the title [WIP] WoS Tagged: Reimplement the core algorithms. WoS Tagged: Reimplement the core algorithms. Jul 5, 2023
@zoe-translates
Copy link
Collaborator Author

Behavioural differences from the current implementation:

  • Title fields are converted to Title Case only if the user enables the capitalizeTitles pref.
  • Tags are kept as-is, without letter-case conversion.
  • Detection is no longer based on checking the "PT" tag. It is now done by a mock import that exits early and doesn't save items.

Bugfixes:

  • Much less prone to crashing due to line noise (e.g. trailing spaces).
  • Continued lines no longer generate spurious space characters.
  • Publisher, conference, and organization (assignee) names are more robust against "forced" Title Case conversion (so "IEEE" in publisher / conference name won't become "Ieee").
  • Some fields removed (e.g. "C1", which is for creator affiliation, no longer interpreted as the place field for patents).

Underlying enhancements:

  • Enforced separation of concerns so that parsing (lines -> internal data structure), normalization (operations on internal data), and item generation (mapping internal data to item) are now better separated.

These are for journal name components that, although by default not
converted, may be turned into TitleCase by the config pref
`capitalizeTitles`.
@dstillman
Copy link
Member

I'm afraid this is just way too complicated. This translator went from being trivial to understand to being something that someone would have to study deeply to figure out what it's doing.

Translators really aren't supposed to be complicated programs in their own right where someone stepping into the code needs to figure out some custom software architecture each time — they're supposed to be simple, easy-to-understand, fairly procedural bits of code that someone uninitiated and/or inexperienced can make tweaks to in a few minutes. If the question is ever "should I do something clever/elegant/object-oriented/formal/functional-programming-based or should I not do that?", the answer should almost always be "not do that".

That certainly applies to creating an ItemMap object with tons of methods, and it possibly even applies to the getArrayJoiner()/etc. stuff. Maybe the latter somehow makes things much easier — I haven't studied this (which is sort of the point) — but if I'm someone fixing this translator a year from now, I feel like I'd much rather see this:

else if (field == "PU") {
		item.publisher = content;

than this:

PU: getArrayJoiner("publisher", true, true), // mitigate all-caps

and then have to figure what in the world getArrayJoiner() is.

Bug fixes are great, but the previous translator was written exactly how almost all our other translators are written, and unless there's an extremely compelling reason to change it, it should stay that way.

@zoe-translates
Copy link
Collaborator Author

Ah, am I correct in saying that we want less closures, jump tables, return function (...) { ... }; -- less indirect approaches, and doing things in a more direct way (else if, or case "PU": { ... })?

If that's right, I think I can combine the best of the two worlds.

I can still keep the three major steps fairly separate -

  1. line scanning ("lexical" analysis),
  2. internal "normalization" (removing redundant tags, consolidate related tags), and
  3. transforming internal data to items.

The very reason I picked them apart was that I suffered from exactly this problem of "fixing this translator a year from now" -- I was that person. The earlier version crashes upon innocuous-looking input (one extra space after file-end "EF" tag, bare unknown tag without content, etc.) and the error message wasn't helpful. It was really difficult for me to even locate where it went wrong, because those three major jobs were intertwined tightly, rather than separate as steps.

The main source of complexity in my code was caused by step 3, and now I can see how to simplify it -

  • Flatten the higher-order functions - it turned out they mostly just moved the complex and repetitive stuff from the control flow to the jump tables, with more indirection. So I can switch to a more procedural style and it will actually be easier to follow.
  • Instead of grouping different kinds of operations into three (!!) lookup table objects, just group them by the order in the switch-case or else-if control flow: first do the array-type output fields (creators, tags), then string fields requiring some non-trivial processing, and then the trivial fields.
  • Use less intimidating names (e.g. "this.cursor") and more friendly ones ("this.currentTag").

If that looks good to you, I can do it (but not today :)

- There is only one lookup table now, and it is used for static mapping
  of tags to Zotero item fields.
- The tags are processed in a loop with switch-case statements, rather
  than using dispatch tables.
- detectImport() now uses a simplified logic (check first 10 lines for
  PT or DT, similar to RIS.js), and "checkOnly" is no longer referenced
  anywhere in the code.
- Some minor fixes in data normalization.
@zoe-translates
Copy link
Collaborator Author

So by now, I've removed the indirect tables and replaced them with static mapping of input -> output fields. Each tag finds the code to process it in a switch-case group of statements.

The main difference from the earlier code is that there, the item was first populated, then normalized in completeItem(). Here, we first normalize, then (sort of mindlessly) populate the item.

For the older approach to work, the Zotero item was populated with some kind of hack: the creator field didn't really conform to the expected data structure, and it was difficult to explain how it was stitched up in completeItem().

       // If we have full names, drop the short ones
       if (item.creators[0]["AF"].length) {
               creators = item.creators[0]["AF"];
       } else {
               creators = item.creators[0]["AU"];
       }
       // Add other creators
       if (item.creators[1])
               item.creators = creators.concat(item.creators[1]);
       else
               item.creators = creators;

What was so special about item.creators[0] and item.creators[1] there? It was perplexing. I doubt "[the] uninitiated and/or inexperienced" could "make tweaks to in a few minutes" if that part of the code needed fixing.

Now, I can say for sure that each time a Zotero item field is populated in save(), it is populated with the expected type of data.

	case "AF":
	case "AU":
		addCreator(item, tagValueArray, type === "patent" ? "inventor" : "author");

Just like yours, my intention was also to make working on the code easier for some future person (myself included). But if the current version is still too complex for that purpose, I welcome your advice and ideas about possible solutions.

@dstillman dstillman merged commit 80c211d into zotero:master Jul 6, 2023
@dstillman
Copy link
Member

Much better!

@zoe-translates
Copy link
Collaborator Author

Hi @dstillman, thank you, but I should've let you know that there was more revisions in that direction. Because this is already closed, I rebased the further commits as #3073. Can you take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants