-
Notifications
You must be signed in to change notification settings - Fork 769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WoS Tagged: Reimplement the core algorithms. #3062
WoS Tagged: Reimplement the core algorithms. #3062
Conversation
This supersedes #3053. |
The previous iteration of the import translator for Web of Science Tagged Format is becoming a bit difficult to maintain because of the tight coupling of logic, data, and actions. To improve the separation of concerns, in the updated version several measures have been taken: - Use separate methods for distinct functions such as line parsing/validation (purely formal text-processing), intermediate transformations (data normalization based on semantics, etc.), and data-to-item mapping. - Use a lookup-table approach to help with mapping tagged data to object properties, which is easier to debug/maintain. It can also better support "polymorphism", the unfortunate fact that the same tag can mean different things depending on the item type. - Implement better text-processing (e.g. collapsing spaces when necessary, cleaning up line noise, more robust handling of author names, etc.) Overall the goal is to make the translator more robust and easier to reason with. TODO: To ensure best compatibility of behaviour, none of the test cases has been updated for now. After this commit, I'll introduce temporary devices to ensure the new code, with its new underlying structure, produces the same output as the old one (to the point of bug-compatible except for the most egregious). When that is achieved, the temporary compatibility devices will be removed, and the test cases will be updated and manually verified. After that, the new test cases will serve as the basis for incremental improvements of the new code.
- For non-title fields, especially those that tend to be in ALL CAPS, they are converted to "Title Case" while taking into account of some special forms such as IEEE, ACM, etc. - Minor readability fixes.
- Spurious spaces in original tests corrected. - Titles no longer converted to TitleCase; they respect the pref "capitalizeTitles". - Tags no longer unconditionally turned into lower case. - Some likely-misinterpreted fields removed.
Since there's no definite criteria of what constitutes a WoS tagged file, we simply do a mock doImport() for detection. If the mock import runs and would have saved a non-empty item, detectImport() returns true. detectImport() now uses the same parsing facilities as doImport() but with early exit and without saving any items.
- Move some getArrayJoiner() handlers to the simple handler category. - No need to cache the functions returned by getArrayJoiner(), which after simplification can no longer be reused. - Adjust the list of "special forms" in the wrapper to ZU.capitalizeTitle(), focus on what may appear in publisher names (WoS put those names in all-caps)
e76770a
to
cb27a4f
Compare
Behavioural differences from the current implementation:
Bugfixes:
Underlying enhancements:
|
These are for journal name components that, although by default not converted, may be turned into TitleCase by the config pref `capitalizeTitles`.
I'm afraid this is just way too complicated. This translator went from being trivial to understand to being something that someone would have to study deeply to figure out what it's doing. Translators really aren't supposed to be complicated programs in their own right where someone stepping into the code needs to figure out some custom software architecture each time — they're supposed to be simple, easy-to-understand, fairly procedural bits of code that someone uninitiated and/or inexperienced can make tweaks to in a few minutes. If the question is ever "should I do something clever/elegant/object-oriented/formal/functional-programming-based or should I not do that?", the answer should almost always be "not do that". That certainly applies to creating an
than this:
and then have to figure what in the world Bug fixes are great, but the previous translator was written exactly how almost all our other translators are written, and unless there's an extremely compelling reason to change it, it should stay that way. |
Ah, am I correct in saying that we want less closures, jump tables, If that's right, I think I can combine the best of the two worlds. I can still keep the three major steps fairly separate -
The very reason I picked them apart was that I suffered from exactly this problem of "fixing this translator a year from now" -- I was that person. The earlier version crashes upon innocuous-looking input (one extra space after file-end "EF" tag, bare unknown tag without content, etc.) and the error message wasn't helpful. It was really difficult for me to even locate where it went wrong, because those three major jobs were intertwined tightly, rather than separate as steps. The main source of complexity in my code was caused by step 3, and now I can see how to simplify it -
If that looks good to you, I can do it (but not today :) |
- There is only one lookup table now, and it is used for static mapping of tags to Zotero item fields. - The tags are processed in a loop with switch-case statements, rather than using dispatch tables. - detectImport() now uses a simplified logic (check first 10 lines for PT or DT, similar to RIS.js), and "checkOnly" is no longer referenced anywhere in the code. - Some minor fixes in data normalization.
So by now, I've removed the indirect tables and replaced them with static mapping of input -> output fields. Each tag finds the code to process it in a switch-case group of statements. The main difference from the earlier code is that there, the item was first populated, then normalized in For the older approach to work, the Zotero item was populated with some kind of hack: the // If we have full names, drop the short ones
if (item.creators[0]["AF"].length) {
creators = item.creators[0]["AF"];
} else {
creators = item.creators[0]["AU"];
}
// Add other creators
if (item.creators[1])
item.creators = creators.concat(item.creators[1]);
else
item.creators = creators; What was so special about Now, I can say for sure that each time a Zotero item field is populated in case "AF":
case "AU":
addCreator(item, tagValueArray, type === "patent" ? "inventor" : "author"); Just like yours, my intention was also to make working on the code easier for some future person (myself included). But if the current version is still too complex for that purpose, I welcome your advice and ideas about possible solutions. |
Much better! |
Hi @dstillman, thank you, but I should've let you know that there was more revisions in that direction. Because this is already closed, I rebased the further commits as #3073. Can you take a look? |
The previous iteration of the import translator for Web of Science Tagged Format is becoming a bit difficult to maintain because of the tight coupling of logic, data, and actions.
To improve the separation of concerns, in the updated version several measures have been taken:
Overall the goal is to make the translator more robust and easier to reason with.
TODO: To ensure best compatibility of behaviour, none of the test cases has been updated for now. After this commit, I'll introduce temporary devices to ensure the new code, with its new underlying structure, produces the same output as the old one (to the point of bug-compatible except for the most egregious). When that is achieved, the temporary compatibility devices will be removed, and the test cases will be updated and manually verified. After that, the new test cases will serve as the basis for incremental improvements of the new code.