Create nameScoringStrategy #214

Closed
paulalbert1 opened this issue May 23, 2018 · 4 comments

paulalbert1 commented May 23, 2018

Background

The goal of this scoring strategy is to produce a reliable score for how closely any of the names in the Identity table match the targetAuthor's name as indexed in the article.

Sample data

PubMed

<Author ValidYN="Y">
<LastName>Chen</LastName>
<ForeName>Kang</ForeName>
<Initials>K</Initials>
</Author>

Scopus

<author seq="1">
<author-url>https://api.elsevier.com/content/author/author_id/8938650800</author-url>
<authid>8938650800</authid>
<authname>Smith C.</authname>
<surname>Smith</surname>
<given-name>Catherine C.</given-name>
<initials>C.C.</initials>
</author>

Intended output

The goal is to be able to return something in the feature-generator that looks like this...

	"authorNameEvidence": {
		"institutionalAuthorName": {
			"firstName": "Curtis",
			"firstInitial": "C",
			"middleName": "Del",
			"middleInitial": "D",
			"lastName": "Cole"
		},
		"articleAuthorName": {
			"firstName": "Curtis",
			"lastName": "Del Cole"
		},
		"nameMatchFirstType": "full-exact",
		"nameMatchFirstScore": 2,
		"nameMatchMiddleType": "inferredInitials-exact",
		"nameMatchMiddleScore": 1,
		"nameMatchLastType": "full-exact",
		"nameMatchLastScore": 2,
		"nameMatchModifier": "combinedMiddleNameLastName",
		"nameMatchModifierScore": 1
	},

The scoring lookup table for this and other features needs to be stored in a single location such as application.properties. Use your judgment about formatting. Here's one option. Note that each row has a variable, a string value, and a numeric value.

nameMatchFirstType   full-exact                                   2
nameMatchFirstType   inferredInitials-exact                       1
nameMatchFirstType   full-fuzzy                                   0
nameMatchFirstType   noMatch                                     -1
nameMatchFirstType   full-conflictingAllButInitials              -2
nameMatchFirstType   full-conflictingEntirely                    -3
nameMatchFirstType   nullTargetAuthor-MatchNotAttempted          -3
nameMatchLastType    full-exact                                   2
nameMatchLastType    full-fuzzy                                   1
nameMatchLastType    full-conflictingEntirely                    -3
nameMatchLastType    nullTargetAuthor-MatchNotAttempted          -3
nameMatchMiddleType  full-exact                                   2
nameMatchMiddleType  exact-singleInitial                          1.5
nameMatchMiddleType  inferredInitials-exact                       1
nameMatchMiddleType  noMatch                                      0
nameMatchMiddleType  full-fuzzy                                   0
nameMatchMiddleType  full-conflictingEntirely                    -2
nameMatchMiddleType  nullTargetAuthor-MatchNotAttempted          -2
nameMatchMiddleType  identityNull-MatchNotAttempted               0
nameMatchModifier    incorrectOrder                              -1
nameMatchModifier    articleSubstringOfIdentity-lastName         -1
nameMatchModifier    articleSubstringOfIdentity-firstMiddleName  -1
nameMatchModifier    identitySubstringOfArticle-lastName         -2
nameMatchModifier    identitySubstringOfArticle-firstName        -1
nameMatchModifier    identitySubstringOfArticle-middleName       -1
nameMatchModifier    identitySubstringOfArticle-firstMiddleName   1
nameMatchModifier    combinedMiddleNameLastName                   1
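
To illustrate how a table in this shape could be consumed, here is a minimal Python sketch that reads whitespace-separated rows into a dictionary keyed by (variable, match type). The names `RAW_TABLE`, `parse_score_table`, and `SCORES` are hypothetical, and values are parsed as floats only because one row (exact-singleInitial) carries 1.5; the real storage format is still open.

```python
from typing import Dict, Tuple

# A few rows copied from the table above. The whitespace-separated layout is
# just the option proposed in this issue, not a settled format.
RAW_TABLE = """
nameMatchFirstType   full-exact                  2
nameMatchFirstType   inferredInitials-exact      1
nameMatchMiddleType  exact-singleInitial         1.5
nameMatchLastType    full-conflictingEntirely   -3
nameMatchModifier    combinedMiddleNameLastName  1
"""


def parse_score_table(text: str) -> Dict[Tuple[str, str], float]:
    """Map (variable, match type) to its numeric score."""
    scores: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        variable, match_type, value = parts
        scores[(variable, match_type)] = float(value)
    return scores


SCORES = parse_score_table(RAW_TABLE)
```

A lookup such as `SCORES[("nameMatchLastType", "full-conflictingEntirely")]` then yields the score for a given match type.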

institutionalAuthorName

institutionalAuthorName is the set of possible names as recorded in the Identity table. These are stored in primaryName and alternateNames in the Identity table.

articleAuthorName

articleAuthorName is the name as recorded in the publication metadata.

Pseudocode

A. Decide whether to use Scopus

  1. Is use.scopus.articles=true?
  • if no, we're using the PubMed name fields (LastName and ForeName). Now, go to 5.
  • if yes, go to 2
  2. Does the number of authors in Scopus equal the number of authors in PubMed?
  • if no, we're using the PubMed name fields (LastName and ForeName). Now, go to 5.
  • if yes, go to 3
  3. Match target author (nth) in PubMed to target author (nth) in Scopus.

  4. Is given-name in Scopus longer than ForeName in PubMed?

  • if no, we're using the PubMed name fields (LastName and ForeName). Now, go to 5.
  • if yes, we're using the Scopus name fields (surname and given-name). Now, go to 5.
  5. Using author data from PubMed and Scopus according to the above logic, create two fields for all authors: firstName and lastName. Let's call these article.firstName and article.lastName
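
The source-selection steps in section A might be sketched as follows. This is only an illustration: the function name `choose_article_name`, the tuple representation of an author, and the parameters are all assumptions, and it resolves one author at a time rather than building fields for every author.

```python
from typing import List, Optional, Tuple

Name = Tuple[str, str]  # (firstName, lastName); a hypothetical representation


def choose_article_name(
    use_scopus: bool,
    pubmed_authors: List[Name],
    scopus_authors: Optional[List[Name]],
    target_index: int,
) -> Name:
    """Return (article.firstName, article.lastName) for the nth author."""
    pubmed_name = pubmed_authors[target_index]
    # Steps 1 and 2: Scopus must be enabled and list the same number of authors.
    if not use_scopus or not scopus_authors:
        return pubmed_name
    if len(scopus_authors) != len(pubmed_authors):
        return pubmed_name
    # Step 3: pair authors by position.
    scopus_name = scopus_authors[target_index]
    # Step 4: prefer Scopus only when its given name is longer.
    if len(scopus_name[0]) > len(pubmed_name[0]):
        return scopus_name
    return pubmed_name
```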

B. Score the targetAuthor

How many cases where targetAuthor=TRUE were selected?

  • if 0, assign the following values to appear in positiveEvidence, and then stop:
nameMatchFirstType: nullTargetAuthor-MatchNotAttempted
nameMatchMiddleType: nullTargetAuthor-MatchNotAttempted
nameMatchLastType: nullTargetAuthor-MatchNotAttempted
  • if 1, go to C.
  • if >1, we need to score all cases where targetAuthor=TRUE. You will need to do the following for all cases where targetAuthor=TRUE. (Side note: when calculating the overallScore for an article, you should take the highest score for each attribute.) Go to C.
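
For the case with more than one targetAuthor, the side note about taking the highest score per attribute could look roughly like this. The function name `best_scores` and the dictionary-per-candidate shape are assumptions made for the sketch.

```python
from typing import Dict, List


def best_scores(candidates: List[Dict[str, float]]) -> Dict[str, float]:
    """Keep the highest score seen for each attribute across candidates."""
    best: Dict[str, float] = {}
    for scores in candidates:
        for attribute, value in scores.items():
            if attribute not in best or value > best[attribute]:
                best[attribute] = value
    return best
```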

C. Preprocess all names

Retrieve article.firstName and all distinct cases of identity.firstName and identity.middleName where targetAuthor=TRUE.

Preprocess identity.firstName, identity.middleName, and article.firstName

  • If any names are in quotes or parentheses in identity.firstName or identity.middleName, pull them out so they can be matched against.
    • Wing Tak "Jack" --> add "Jack" to identity.firstName, add "Wing Tak" to identity.firstName
    • Qihui (Jim) --> add Jim to identity.firstName, add Qihui to identity.firstName
  • Remove any periods, spaces, or dashes from identity.firstName, identity.middleName, and article.firstName. For example:
    • "Chi-chao" --> "Chichao"
    • "Minh-Nhut Yvonne" --> "MinhNhutYvonne"
    • "Eliot A." --> "EliotA"
  • Disregard any cases where one variant of identity.firstName is a substring of another case of identity.firstName. Disregard any cases where one variant of identity.middleName is a substring of another case of identity.middleName. For example:
    • "Cary" (keep) vs "C" (disregard)
  • Null for middle name should only be included as possible value if none of the names or aliases contain a middle name
  • A given target author might have N different first names and M different middle names.
  • Go to D
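
As a rough sketch of the first and middle name preprocessing above, assuming nicknames appear only in double quotes or parentheses and that substring checks ignore case. All function names here are hypothetical.

```python
import re
from typing import List, Set


def expand_nicknames(name: str) -> List[str]:
    """Split 'Wing Tak "Jack"' into ['Wing Tak', 'Jack']."""
    found = re.findall(r'"([^"]+)"|\(([^)]+)\)', name)
    base = re.sub(r'"[^"]*"|\([^)]*\)', "", name).strip()
    extracted = [quoted or parenthesized for quoted, parenthesized in found]
    return [n for n in [base] + extracted if n]


def normalize(name: str) -> str:
    """Remove periods, spaces, and dashes."""
    return re.sub(r"[.\s-]", "", name)


def drop_substring_variants(variants: Set[str]) -> Set[str]:
    """Discard any variant that is a substring of a different variant."""
    return {
        v
        for v in variants
        if not any(
            v.lower() != other.lower() and v.lower() in other.lower()
            for other in variants
        )
    }


def preprocess_first_names(raw_names: List[str]) -> Set[str]:
    """Expand nicknames, normalize, then drop redundant short variants."""
    variants = {
        normalize(part) for raw in raw_names for part in expand_nicknames(raw)
    }
    return drop_substring_variants(variants)
```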

Retrieve article.lastName where targetAuthor=TRUE and all distinct cases of identity.lastName for our target author from identity. Preprocess identity.lastName and article.lastName.

  • Remove any periods from both identity.lastName and article.lastName
  • Remove any of the following endings from identity.lastName:
    • ", Jr"
    • ", MD PhD"
    • ", MD-PhD"
    • ", PhD"
    • ", MD"
    • ", III"
    • ", II"
    • ", Sr"
    • " Jr"
    • " MD PhD"
    • " MD-PhD"
    • " PhD"
    • " MD"
    • " III"
    • " II"
    • " Sr"
  • Remove any periods, spaces, dashes, or quotes.
  • For example:
    • "Del Cole" --> "DelCole"
    • "Garcia-Marquez" --> "GarciaMarquez"
    • "Capetillo Gonzalez de Zarate" --> "CapetilloGonzalezdeZarate"
  • We're going to attempt a match on all of these. Go to D.
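
The last name preprocessing might be sketched like this. The names `SUFFIXES`, `strip_suffix`, and `preprocess_last_name` are hypothetical, suffixes are stripped only for identity names as described above, and "quotes" is read here as double quotes only (apostrophes are left alone), which is an assumption.

```python
import re

# Endings from the list above, longest first so that "MD PhD" is tried
# before "PhD" or "MD" on their own.
SUFFIXES = ["MD PhD", "MD-PhD", "PhD", "III", "MD", "II", "Jr", "Sr"]


def strip_suffix(name: str) -> str:
    """Remove one trailing ', Jr' / ' Jr' style ending, if present."""
    for suffix in SUFFIXES:
        for ending in (", " + suffix, " " + suffix):
            if name.endswith(ending):
                return name[: -len(ending)]
    return name


def preprocess_last_name(name: str, is_identity: bool = True) -> str:
    """Normalize a last name; suffixes are stripped only for identity names."""
    name = name.replace(".", "")
    if is_identity:
        name = strip_suffix(name)
    # Assumption: only double quotes are removed, not apostrophes.
    return re.sub(r'[\s"-]', "", name)
```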

D. Score the last name

Attempt full exact match where identity.lastName = article.lastName.

  • Example: Cole (identity.lastName) = Cole (article.lastName)
  • If match, output following and go to E:
nameMatchLastType: full-exact
  • If no match, go to next.

Combine identity.middleName and identity.lastName into mergedName, then attempt a match against article.lastName.

  • Example: Garcia (identity.middleName) + Marquez (identity.lastName) = GarciaMarquez (article.lastName)
  • If match: stop scoring middle and last name; move on to only score first name; output following:
nameMatchMiddleType: full-exact
nameMatchLastType: full-exact
nameMatchModifier: combinedMiddleNameLastName
  • If no match, go to next

Attempt partial match where "%" + identity.lastName + "%" = article.lastName

  • Example: Cole (identity.lastName) = Del Cole (article.lastName)
  • If match, output following and go to E:
nameMatchLastType: full-exact
nameMatchModifier: identitySubstringOfArticle-lastName
  • If no match, go to next.

Attempt match where identity.lastName >= 4 characters and levenshteinDistance between identity.lastName and article.lastName is <=1.

  • Example: Kaushal (identity.lastName) = Kaushai (article.lastName)
  • If match, output following and go to E:
nameMatchLastType: full-fuzzy
  • If no match, output following and go to E:
nameMatchLastType: full-conflictingEntirely
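
The last name cascade in section D could be expressed as an if/else chain along these lines. This is a sketch, not the project's implementation: `levenshtein` and `score_last_name` are hypothetical helpers, inputs are assumed to be preprocessed already, and matching is case-insensitive.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def score_last_name(
    identity_last: str, article_last: str, identity_middle: Optional[str] = None
) -> Dict[str, str]:
    """Apply the section D checks in order and stop at the first match."""
    ident, art = identity_last.lower(), article_last.lower()
    if ident == art:
        return {"nameMatchLastType": "full-exact"}
    if identity_middle and identity_middle.lower() + ident == art:
        return {
            "nameMatchMiddleType": "full-exact",
            "nameMatchLastType": "full-exact",
            "nameMatchModifier": "combinedMiddleNameLastName",
        }
    if ident and ident in art:
        return {
            "nameMatchLastType": "full-exact",
            "nameMatchModifier": "identitySubstringOfArticle-lastName",
        }
    if len(ident) >= 4 and levenshtein(ident, art) <= 1:
        return {"nameMatchLastType": "full-fuzzy"}
    return {"nameMatchLastType": "full-conflictingEntirely"}
```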

E. Determine if identity.middleName is available to match against

Identities with no middle name can be divided into two groups:

  • Case 1: the name has been accidentally omitted
  • Case 2: the user does not have a middle name

This logic will help us figure out which case is happening.

  1. Is identity.middleName null in all name variants?
  • If yes, go to F
  • If no, go to 2
  2. Let's decide if we can ignore at least one of the name variants. To do so, they have to be very similar. Are the last names of any two name variants identical?
  • If yes, go to 3
  • If no, go to F
  3. Is one first name variant a substring of another (e.g., Jon vs. Jonathan)?
  • If yes, we can opt to ignore the name variant that does not have a middle name; return to 1 to repeat this process with remaining names
  • If no, name variant needs to be considered separately. Name variants without a middle name should be sent to F. Name variants with a middle name should be sent to G. May the best scoring name variant win!
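
One possible reading of the routing logic in section E is sketched below. The `NameVariant` type and `route_variants` function are hypothetical, and the interpretation (a variant without a middle name is dropped when a variant with a middle name shares its last name and has an overlapping first name) is an assumption about the intent.

```python
from typing import List, NamedTuple, Optional, Tuple


class NameVariant(NamedTuple):
    first: str
    middle: Optional[str]
    last: str


def route_variants(
    variants: List[NameVariant],
) -> Tuple[List[NameVariant], List[NameVariant]]:
    """Return (variants to score in section F, variants to score in section G)."""
    with_middle = [v for v in variants if v.middle]
    without_middle = [v for v in variants if not v.middle]
    if not with_middle:
        return without_middle, []  # step 1: no middle name anywhere
    to_f = []
    for bare in without_middle:
        # Steps 2 and 3: same last name, and one first name contains the other.
        covered = any(
            bare.last.lower() == full.last.lower()
            and (
                bare.first.lower() in full.first.lower()
                or full.first.lower() in bare.first.lower()
            )
            for full in with_middle
        )
        if not covered:
            to_f.append(bare)
    return to_f, with_middle
```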

F. Score the first name in cases where identity.middleName is null

Overview:

  • All of the matching that follows should be evaluated as a series of ifElse statements (once we get a match, we stop). The goal is to match as early on in the process as possible.
  • Matching should be case-insensitive; however, we will pull out some characters below based on case.

Attempt match where identity.firstName = article.firstName

  • Example: Paul (identity.firstName) = Paul (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: identityNull-MatchNotAttempted

Attempt match where identity.firstName is a left-anchored substring of article.firstName

  • Example: Paul (identity.firstName) = PaulJames (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: identityNull-MatchNotAttempted
nameMatchModifier: identitySubstringOfArticle-firstName

Attempt match where article.firstName is a left-anchored substring of identity.firstName

  • Example: Paul (identity.firstName) = P (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: identityNull-MatchNotAttempted

Attempt match where first three characters of identity.firstName = first three characters of article.firstName

  • Example: Paul (identity.firstName) = Pau (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-fuzzy
nameMatchMiddleType: identityNull-MatchNotAttempted

Attempt match where identity.firstName is longer than 4 characters and Levenshtein distance between identity.firstName and article.firstName is 1.

  • Example: Paula (identity.firstName) = Pauly (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-fuzzy
nameMatchMiddleType: identityNull-MatchNotAttempted

Attempt match where first character of identity.firstName = first character of article.firstName

  • Example: Paul (identity.firstName) = Peter (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-conflictingAllButInitials
nameMatchMiddleType: identityNull-MatchNotAttempted

Else, we have no match of any kind.

  • Example: Pascale vs. Curtis
  • Output the following and stop:
nameMatchFirstType: full-conflictingEntirely
nameMatchMiddleType: identityNull-MatchNotAttempted
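
The section F cascade could be sketched as a single if/elif chain. The helper names are hypothetical and inputs are assumed to be preprocessed and non-empty.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def score_first_name_no_middle(identity_first: str, article_first: str) -> Dict[str, str]:
    """Score the first name when the identity has no middle name."""
    ident, art = identity_first.lower(), article_first.lower()
    if not ident or not art:
        raise ValueError("both names must be non-empty")
    modifier = None
    if ident == art:
        first_type = "full-exact"
    elif art.startswith(ident):
        first_type = "full-exact"
        modifier = "identitySubstringOfArticle-firstName"
    elif ident.startswith(art):
        first_type = "inferredInitials-exact"
    elif ident[:3] == art[:3]:
        first_type = "full-fuzzy"
    elif len(ident) > 4 and levenshtein(ident, art) == 1:
        first_type = "full-fuzzy"
    elif ident[0] == art[0]:
        first_type = "full-conflictingAllButInitials"
    else:
        first_type = "full-conflictingEntirely"
    result = {
        "nameMatchFirstType": first_type,
        "nameMatchMiddleType": "identityNull-MatchNotAttempted",
    }
    if modifier:
        result["nameMatchModifier"] = modifier
    return result
```

Note that with the checks in this order, Paul vs. Pau is caught by the earlier left-anchored substring check and comes back as inferredInitials-exact, so the three-character check only fires for pairs such as Paula vs. Pauline.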

G. Score the first and middle name

Context:

  • We need to score first and middle names in conjunction with each other, because PubMed and Scopus combine them into a single field; also, it's often the case that first and middle names are conflated in identity systems of record.
  • All of the matching that follows should be evaluated as a series of ifElse statements (once we get a match, we stop). The goal is to match as early on in the process as possible.
  • It's conceivable that we can use the firstName from one alias in conjunction with the middleName of another to match to our article. We can use any combination of first and middle names to do the matching.

Preprocessing: ignore/discard name variants in which it's pretty clear that one name variant has a middle name that is an abbreviation of another.

  • Example: there are two name variants for ajdannen:
    • "Andrew[firstName] + Jess[middleName] + Dannenberg[lastName]"
    • "Andrew[firstName] + J[middleName] + Dannenberg[lastName]"
  • In cases where one of the middle names is a left-anchored substring of the other, discard/ignore the shorter one.

Attempt match where identity.firstName + identity.middleName = article.firstName

  • Example: Paul (identity.firstName) + James (identity.middleName) = PaulJames (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: full-exact

Attempt match where identity.firstName + "%" + identity.middleName = article.firstName

  • Example: Paul (identity.firstName) + James (identity.middleName) = PaulaJames (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: full-exact
nameMatchModifier: identitySubstringOfArticle-firstMiddleName

Attempt match where identity.firstName + identity.middleInitial = article.firstName

  • Example: Paul (identity.firstName) + J (identity.middleInitial) = PaulJ (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: inferredInitials-exact

Attempt match where identity.firstName + "%" + identity.middleInitial = article.firstName

  • Example: Paul (identity.firstName) + J (identity.middleInitial) = PaulaJ (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: inferredInitials-exact
nameMatchModifier: identitySubstringOfArticle-firstMiddleName

Attempt match where identity.firstInitial + identity.middleInitial = article.firstName or where identity.firstInitial + " " + identity.middleInitial = article.firstName.

  • Example: P (identity.firstInitial) + J (identity.middleInitial) = PJ (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: inferredInitials-exact

Attempt match where identity.firstInitial + identity.middleName = article.firstName

  • Example: M (identity.firstInitial) + Carrington (identity.middleName) = MCarrington (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: full-exact

Attempt match where identity.firstName + identity.middleName + "%" = article.firstName

  • Example: Paul (identity.firstName) + James (identity.middleName) = PaulJamesA (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact 
nameMatchMiddleType: full-exact
nameMatchModifier: identitySubstringOfArticle-firstMiddleName

Attempt match where identity.firstName + identity.middleInitial + "%" = article.firstName

  • Example: Paul (identity.firstName) + J (identity.middleInitial) = PaulJZ (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: inferredInitials-exact
nameMatchModifier: identitySubstringOfArticle-firstMiddleName

Attempt match where identity.firstName = article.firstName

  • Example: Paul (identity.firstName) = Paul (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: noMatch

Attempt match where identity.middleInitial + identity.firstInitial = article.firstName

  • Example: J (identity.middleInitial) + P (identity.firstInitial) = JP (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: inferredInitials-exact
nameMatchModifier: incorrectOrder

If there's more than one capital letter in identity.firstName or identity.middleName, attempt match where any capitals in identity.firstName + any capital letters in identity.middleName = article.firstName

  • Example: KS (identity.initialsInFirstName) + C (identity.initialsInMiddleName) = KSC (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: inferredInitials-exact

If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName = article.firstName

  • Example: KS (identity.initialsInFirstName) = KS (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: noMatch

If there's more than one capital letter in identity.firstName, attempt match where any capitals in identity.firstName + identity.middleName = article.firstName

  • Example: KS (identity.initialsInFirstName) + Clifford (identity.middleName) = KSClifford (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: full-exact

Attempt match where identity.firstName + "%" = article.firstName

  • Example: Robert (identity.firstName) = RobertR (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: noMatch
nameMatchModifier: identitySubstringOfArticle-firstName

Attempt match where "%" + identity.firstName = article.firstName

  • Example: Cary (identity.firstName) = MCary (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-exact
nameMatchMiddleType: noMatch
nameMatchModifier: identitySubstringOfArticle-firstName

Attempt match where identity.middleName = article.firstName

  • Example: Clifford (identity.middleName) = Clifford (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: noMatch
nameMatchMiddleType: full-exact

Attempt match where identity.middleName + "%" = article.firstName

  • Example: Clifford (identity.middleName) = CliffordKS (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: noMatch
nameMatchMiddleType: full-exact
nameMatchModifier: identitySubstringOfArticle-middleName

Attempt match where "%" + identity.middleName = article.firstName

  • Example: Clifford (identity.middleName) = KunSungClifford (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: noMatch
nameMatchMiddleType: full-exact
nameMatchModifier: identitySubstringOfArticle-middleName

Attempt match where levenshteinDistance between identity.firstName + identity.middleName and article.firstName is <=2.

  • Example: Manney (identity.firstName) + Carrington (identity.middleName) = MannyCarrington (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-fuzzy
nameMatchMiddleType: full-fuzzy

Attempt match where identity.firstName >= 4 characters and levenshteinDistance between identity.firstName and article.firstName is <=1.

  • Example: Nassar (identity.firstName) = Nasser (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-fuzzy
nameMatchMiddleType: noMatch

Attempt match where first three characters of identity.firstName = first three characters of article.firstName.

  • Example: Massimiliano (identity.firstName) = Massimo (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-fuzzy
nameMatchMiddleType: noMatch

Attempt match where identity.firstInitial + "%" + identity.middleName = article.firstName

  • Example: M (identity.firstInitial) + Carrington (identity.middleName) = MannyCarrington (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: full-exact
nameMatchModifier: identitySubstringOfArticle-firstMiddleName

Attempt match where identity.middleName + identity.firstInitial = article.firstName

  • Example: Carrington (identity.middleName) + M (identity.firstInitial) = CarringtonM (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: full-exact
nameMatchModifier: incorrectOrder

Attempt match where article.firstName is only one character and first character of identity.firstName = article.firstName.

  • Example: Jessica (identity.firstName) = J (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: inferredInitials-exact
nameMatchMiddleType: noMatch

Attempt match where first character of identity.firstName = first character of article.firstName.

  • Example: Jessica (identity.firstName) = Jochen (article.firstName)
  • If match output the following and stop:
nameMatchFirstType: full-conflictingAllButInitials
nameMatchMiddleType: noMatch

Else, we have no match of any kind.

  • Example: Pascale vs. Curtis
  • Output the following and stop:
nameMatchFirstType: full-conflictingEntirely
nameMatchMiddleType: full-conflictingEntirely
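
Because section G is a long ordered list of checks, one way to keep it maintainable is a table of (predicate, outcome) pairs evaluated in order. The sketch below covers only a handful of the rules above, in their listed relative order; the names `RULES`, `FALLBACK`, and `score_first_and_middle` are hypothetical. Since most rules are omitted, an input that would match an omitted rule falls through to a later one here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Tuple[Callable[[str, str, str], bool], Dict[str, str]]


def _types(first: str, middle: str, modifier: Optional[str] = None) -> Dict[str, str]:
    """Build the output fields for one rule."""
    out = {"nameMatchFirstType": first, "nameMatchMiddleType": middle}
    if modifier:
        out["nameMatchModifier"] = modifier
    return out


# Each predicate receives (firstName, middleName, article.firstName), lowercased.
RULES: List[Rule] = [
    (lambda f, m, a: f + m == a, _types("full-exact", "full-exact")),
    (lambda f, m, a: f + m[0] == a, _types("full-exact", "inferredInitials-exact")),
    (lambda f, m, a: f[0] + m[0] == a,
     _types("inferredInitials-exact", "inferredInitials-exact")),
    (lambda f, m, a: f[0] + m == a, _types("inferredInitials-exact", "full-exact")),
    (lambda f, m, a: f == a, _types("full-exact", "noMatch")),
    (lambda f, m, a: m[0] + f[0] == a,
     _types("inferredInitials-exact", "inferredInitials-exact", "incorrectOrder")),
    (lambda f, m, a: f[0] == a[0], _types("full-conflictingAllButInitials", "noMatch")),
]
FALLBACK = _types("full-conflictingEntirely", "full-conflictingEntirely")


def score_first_and_middle(
    identity_first: str, identity_middle: str, article_first: str
) -> Dict[str, str]:
    """Return the outcome of the first rule that matches, else the fallback."""
    f, m, a = identity_first.lower(), identity_middle.lower(), article_first.lower()
    if not (f and m and a):
        raise ValueError("all three names must be non-empty")
    for predicate, outcome in RULES:
        if predicate(f, m, a):
            return dict(outcome)
    return dict(FALLBACK)
```

Adding one of the remaining checks then means inserting one more pair at the right position in `RULES`, which keeps the ordering explicit.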

H. Middle name score modification

If nameMatchMiddleType = full-exact and the matching middle name is one character, override nameMatchMiddleType to exact-singleInitial.
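
As a small sketch of this override, assuming the evidence is held in a dictionary and the matched middle name is passed in separately (both are assumptions, as is the function name):

```python
from typing import Dict, Optional


def apply_single_initial_override(
    evidence: Dict[str, str], matched_middle: Optional[str]
) -> Dict[str, str]:
    """Downgrade a full-exact middle match when the middle name is one character."""
    updated = dict(evidence)  # leave the caller's dictionary untouched
    if (
        updated.get("nameMatchMiddleType") == "full-exact"
        and matched_middle is not None
        and len(matched_middle) == 1
    ):
        updated["nameMatchMiddleType"] = "exact-singleInitial"
    return updated
```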

@paulalbert1

This will override issues #111 and #132, and possibly #127.

@paulalbert1

There are a couple of opportunities for refinement, but this seems to work as intended.

@paulalbert1

@sarbajitdutta - A bug for ses9022 and 16614246:

  1. We should be returning this:
nameMatchModifier: identitySubstringOfArticle-lastName
  2. Instead of "lastName": "Somersankarakaya", we should return the name exactly as recorded in the Identity table: "lastName": "Somersan-Karakaya"
      "evidence": {
        "acceptedRejectedEvidence": null,
        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Selin",
            "firstInitial": "S",
            "middleName": null,
            "middleInitial": null,
            "lastName": "Somersankarakaya"
          },
          "articleAuthorName": {
            "firstName": "Selin",
            "firstInitial": "S",
            "middleName": null,
            "middleInitial": null,
            "lastName": "Somersan"
          },
          "nameScoreTotal": -1,
          "nameMatchFirstType": "full-exact",
          "nameMatchFirstScore": 2,
          "nameMatchMiddleType": "identityNull-MatchNotAttempted",
          "nameMatchMiddleScore": 0,
          "nameMatchLastType": "full-conflictingEntirely",
          "nameMatchLastScore": -3,
          "nameMatchModifier": null,
          "nameMatchModifierScore": 0
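
One way to avoid returning the preprocessed form, sketched under the assumption that the name is wrapped in a small value object (the `LastName` class is hypothetical), is to keep the raw string for output and derive the normalized form only for matching:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LastName:
    raw: str  # exactly as recorded in the Identity table; use this in output

    @property
    def normalized(self) -> str:
        """Lowercased form without periods, spaces, dashes, or double quotes."""
        return re.sub(r'[.\s"-]', "", self.raw).lower()


name = LastName("Somersan-Karakaya")
```

Matching would compare `name.normalized`, while the evidence JSON would report `name.raw`.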

@paulalbert1

Remaining work will be addressed in #289.
