Create nameScoringStrategy #214
Comments
There are a couple of opportunities for refinement, but this seems to work as intended.
@sarbajitdutta - A bug for ses9022 and 16614246:
Remaining work will be addressed in #289.
Closed
Background
The goal of this scoring strategy is to produce a reliable score for how closely any of the names in the Identity table match the targetAuthor's name as indexed in the article.
Sample data
PubMed
Scopus
Intended output
The goal is to be able to return something in the feature-generator that looks like this...
The scoring lookup table for this and other features needs to be stored in a single location such as application.properties. Use your judgment about formatting. Here's one option. Note we have a variable, a string value, and an integer value.
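As a rough illustration of the "variable, string value, integer value" idea, a lookup table could be kept as properties-style lines and parsed into a dictionary. This is only a sketch: the `variable.stringValue=integer` layout, the `load_scores` helper, and every score below are hypothetical placeholders, not the format or values used by the project.

```python
# Hypothetical lookup table. The key layout and the integer scores are
# placeholders chosen for illustration only.
SAMPLE_PROPERTIES = """
# variable.stringValue=integerScore
nameMatchMiddleType.full-exact=2
nameMatchMiddleType.exact-singleInitial=1
"""


def load_scores(text):
    """Parse 'variable.stringValue=integer' lines into {(variable, value): score}."""
    scores = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, number = line.partition("=")
        variable, _, value = key.strip().partition(".")
        scores[(variable, value)] = int(number)
    return scores


SCORES = load_scores(SAMPLE_PROPERTIES)
print(SCORES[("nameMatchMiddleType", "exact-singleInitial")])
```

Keeping all three parts in one key/value line means a score can be changed without touching the matching code.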
institutionalAuthorName
institutionalAuthorName is the set of possible names as recorded in the Identity table. These are stored in primaryName and alternateNames in the Identity table.
articleAuthorName
articleAuthorName is the name as recorded in the publication metadata.
Pseudocode
A. Decide whether to use Scopus
forename and givenName. Now, go to 5.
forename and givenName. Now, go to 5.
Match target author (nth) in PubMed to target author (nth) in Scopus.
Is length of given-name in Scopus greater than forename in PubMed?
forename and givenName. Now, go to 5.
surname and given-name.
B. Score the targetAuthor
How many cases where targetAuthor=TRUE were selected?
C. Preprocess all names
Retrieve article.firstName and all distinct cases of identity.firstName and identity.middleName where targetAuthor=TRUE.
Preprocess identity.firstName, identity.middleName, and article.firstName
Retrieve article.lastName where targetAuthor=TRUE and all distinct cases of identity.lastName for our target author from identity. Preprocess identity.lastName and article.lastName.
D. Score the last name
Attempt full exact match where identity.lastName = article.lastName.
Combine identity.middleName and identity.lastName into mergedName. Now attempt match of mergedName against article.lastName.
Attempt partial match where "%" + identity.lastName + "%" = article.lastName
Attempt match where length of identity.lastName is >= 4 characters and levenshteinDistance between identity.lastName and article.lastName is <=1.
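The four last-name attempts above can be sketched as an ordered cascade that stops at the first hit. This is a minimal sketch under assumptions: names are taken to be already preprocessed, mergedName is assumed to be a plain concatenation with no separator, the "%" wildcards are read as substring containment, and the returned labels (other than full-exact) are hypothetical names invented for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_last_name(identity_last, article_last, identity_middle=""):
    """Return a label for the first rule that matches (labels are hypothetical)."""
    if not identity_last or not article_last:
        return "no-match"
    if identity_last == article_last:
        return "full-exact"
    # Assumption: mergedName is middle name followed directly by last name.
    if identity_middle and identity_middle + identity_last == article_last:
        return "merged-middle-last"
    # "%" + lastName + "%" read as: identity last name occurs inside article last name.
    if identity_last in article_last:
        return "partial"
    if len(identity_last) >= 4 and levenshtein(identity_last, article_last) <= 1:
        return "fuzzy"
    return "no-match"
```

For example, `match_last_name("smith", "smyth")` falls through to the Levenshtein rule, while `match_last_name("lee", "lea")` is rejected because the identity name is shorter than four characters.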
E. Determine if identity.middleName is available to match against
Identities with no middle name can be divided into two groups:
This logic will help us figure out which case is happening.
F. Score the first name in cases where identity.middleName is null
Overview:
Attempt match where identity.firstName = article.firstName
Attempt match where identity.firstName is a left-anchored substring of article.firstName
Attempt match where article.firstName is a left-anchored substring of identity.firstName
Attempt match where first three characters of identity.firstName = first three characters of article.firstName
Attempt match where identity.firstName is longer than 4 characters and Levenshtein distance between identity.firstName and article.firstName is 1.
Attempt match where first character of identity.firstName = first character of article.firstName
Else output the following:
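The section F attempts, tried in the order listed, can be sketched as follows. This is an illustration only, not the project's implementation: the function name, the assumption that names are already preprocessed, and all returned labels are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_first_name_no_middle(identity_first, article_first):
    """Try each section F rule in order; labels are hypothetical placeholders."""
    if not identity_first or not article_first:
        return "no-match"
    if identity_first == article_first:
        return "full-exact"
    if article_first.startswith(identity_first):
        return "identity-is-prefix-of-article"
    if identity_first.startswith(article_first):
        return "article-is-prefix-of-identity"
    if identity_first[:3] == article_first[:3]:
        return "first-three-characters"
    if len(identity_first) > 4 and levenshtein(identity_first, article_first) == 1:
        return "fuzzy"
    if identity_first[0] == article_first[0]:
        return "first-character"
    return "no-match"
```

Because the rules run from strictest to loosest, a pair such as "katherine" and "catherine" reaches the Levenshtein rule, while "james" and "john" only matches on the first character.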
G. Score the first and middle name
Context:
Preprocessing: ignore/discard name variants in which it's pretty clear that one name variant has a middle name that is an abbreviation of another.
Attempt match where identity.firstName + identity.middleName = article.firstName
Attempt match where identity.firstName + "%" + identity.middleName = article.firstName
Attempt match where identity.firstName + identity.middleInitial = article.firstName
Attempt match where identity.firstName + "%" + identity.middleInitial = article.firstName
Attempt match where identity.firstInitial + identity.middleInitial = article.firstName or where identity.firstInitial + " " + identity.middleInitial = article.firstName.
Attempt match where identity.firstInitial + identity.middleName = article.firstName
Attempt match where identity.firstName + identity.middleName + "%" = article.firstName
Attempt match where identity.firstName + identity.middleInitial + "%" = article.firstName
Attempt match where identity.firstName = article.firstName
Attempt match where identity.middleInitial + identity.firstInitial = article.firstName
If there's more than one capital letter in identity.firstName or identity.middleName, attempt match where any capitals in identity.firstName + any capital letters in identity.middleName = article.firstName
If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName = article.firstName
If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName + identity.middleName = article.firstName
Attempt match where identity.firstName + "%" = article.firstName
Attempt match where "%" + identity.firstName = article.firstName
Attempt match where identity.middleName = article.firstName
Attempt match where identity.middleName + "%" = article.firstName
Attempt match where "%" + identity.middleName = article.firstName
Attempt match where levenshteinDistance between identity.firstName + identity.middleName and article.firstName is <=2.
Attempt match where length of identity.firstName is >= 4 characters and levenshteinDistance between identity.firstName and article.firstName is <=1.
Attempt match where first three characters of identity.firstName = first three characters of article.firstName.
Attempt match where identity.firstInitial + "%" + identity.middleName = article.firstName
Attempt match where identity.middleName + identity.firstInitial = article.firstName
Attempt match where article.firstName is only one character and identity.firstName = first character of article.firstName.
Attempt match where first character of identity.firstName = first character of article.firstName.
Else, we have no match of any kind.
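Several of the section G attempts are pattern comparisons in which "%" acts as a wildcard. A minimal sketch of the first few rules is shown below, assuming "%" means "any run of characters, including none" (as in SQL LIKE), that names are already preprocessed, and that names themselves contain no "%". The helper names and the rule labels are hypothetical, and only the rules from the start of the list through the middleInitial + firstInitial rule are covered.

```python
import re


def like(pattern, text):
    """SQL-LIKE style match: '%' stands for any run of characters (possibly empty)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text) is not None


def match_first_and_middle(first, middle, article_first):
    """Return the label of the first matching rule, or None (labels are hypothetical)."""
    fi, mi = first[:1], middle[:1]  # first and middle initials
    rules = [
        ("first+middle", first + middle),
        ("first%middle", first + "%" + middle),
        ("first+middleInitial", first + mi),
        ("first%middleInitial", first + "%" + mi),
        ("firstInitial+middleInitial", fi + mi),
        ("firstInitial middleInitial", fi + " " + mi),
        ("firstInitial+middle", fi + middle),
        ("first+middle%", first + middle + "%"),
        ("first+middleInitial%", first + mi + "%"),
        ("first", first),
        ("middleInitial+firstInitial", mi + fi),
    ]
    for label, pattern in rules:
        if like(pattern, article_first):
            return label
    return None
```

Keeping the rules in an ordered list makes the priority explicit: for identity "john" / "paul", an article name of "john paul" is caught by the wildcard rule, while "jp" only matches at the initials rule.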
H. Middle name score modification
If middleNameMatchType = full-exact and matching middle name is one character, override that score to nameMatchMiddleType=exact-singleInitial.
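The override in section H amounts to a small post-processing step. A sketch, assuming the match type is carried as a plain string and the matched middle name is available alongside it (the function name is hypothetical):

```python
def adjust_middle_match(match_type, matched_middle_name):
    """Downgrade a full-exact middle-name match when only a single initial matched."""
    if match_type == "full-exact" and len(matched_middle_name) == 1:
        return "exact-singleInitial"
    return match_type
```

For instance, `adjust_middle_match("full-exact", "j")` yields the single-initial type, while a full-exact match on "james" is left unchanged.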