Create authorAffiliationScoringStrategy #47
Comments
michaelbales1
changed the title
Leverage data on institutional affiliation to improve phase 1 matching
Leverage data on institutional affiliation to improve phase two matching
Apr 28, 2015
michaelbales1
changed the title
Leverage data on institutional affiliation to improve phase two matching
Leverage data on past institutional affiliation to improve phase two matching
Jun 5, 2015
jl987-Jie
added a commit
that referenced
this issue
Mar 26, 2017
Added data from the provided file to MLab's MongoDB server.
paulalbert1
changed the title
Leverage data on past institutional affiliation to improve phase two matching
Output score when targetAuthor has institutional affiliation which matches affiliation in Identity table
Jun 20, 2018
paulalbert1
changed the title
Output score when targetAuthor has institutional affiliation which matches affiliation in Identity table
Create targetAuthorAffiliationScore
Jul 13, 2018
paulalbert1
changed the title
Create targetAuthorAffiliationScore
Create authorAffiliationScore
Jul 15, 2018
paulalbert1
changed the title
Create authorAffiliationScore
Create authorAffiliationScoringStrategy
Jul 16, 2018
@sarbajitdutta - There's a bug for ses9022 and 16614246: the institutional affiliation in Scopus is null, so the score should be 0 rather than -3.
Also, we should match against all affiliations; we're currently matching only the first. Finally, we should incorporate the home institution from application.properties.
I think this is fixed.
Overview
With this scoring strategy, we're trying to account for the extent to which the affiliations of all authors on an article affect the likelihood that a given targetAuthor wrote it.
To do this, we need to ask and answer several questions.
About Scopus data
There are currently 276,666 institutional affiliation records in the Identity table, which represent 3,861 unique institutions. These come from several sources, which use a controlled vocabulary.
We've looked up the Scopus Institution ID for the 1,786 institutions that are most often cited as a current or historical affiliation. Collectively, these represent 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus Institution ID could be. Note that a given institution such as Weill Cornell might have multiple institution IDs.
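The ~99% figure can be checked directly from the counts quoted above. A quick arithmetic sketch (the variable names are only illustrative):

```python
# Counts quoted in the "About Scopus data" section.
total_affiliations = 276_666      # affiliation records in the Identity table
mapped_affiliations = 273_006     # records covered by a looked-up Scopus ID
unique_institutions = 3_861       # distinct institutions in the Identity table
mapped_institutions = 1_786       # institutions with a looked-up Scopus ID

affiliation_coverage = mapped_affiliations / total_affiliations
institution_coverage = mapped_institutions / unique_institutions

print(f"{affiliation_coverage:.1%} of affiliation records covered")   # 98.7%
print(f"{institution_coverage:.1%} of unique institutions looked up")  # 46.3%
```

So fewer than half of the unique institutions account for nearly all of the affiliation records.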
Values in application.properties
Desired output
Variables
TargetAuthor
Case 1: Target author has affiliation statements in Scopus and PubMed
Notes:
Case 2: Target author has affiliation statements in Scopus only
Case 3: Target author has affiliation statements only in PubMed
Non-target author
Case 4: Non-target author(s) have one or more affiliation statements in Scopus
Notes:
Case 5: Non-target author(s) have an affiliation statement in PubMed but not Scopus
We don't consider this case.
Pseudocode
Evaluate targetAuthor
Decide which source to use for scoring.
We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.
1. As set in application.properties, is use.scopus.articles=true?
2. Does article have a Scopus affiliation for targetAuthor?
3. Does candidate article have a PubMed affiliation for targetAuthor?
4. Return the following:
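The source decision in steps 1 to 3 could be sketched roughly as follows. This is only an illustration: the function name, the `Source` enum, and the parameter names are hypothetical, and `use_scopus_articles` simply mirrors the use.scopus.articles property.

```python
from enum import Enum
from typing import Optional, Sequence


class Source(Enum):
    """Hypothetical labels for the affiliation source chosen for scoring."""
    SCOPUS = "scopus"
    PUBMED = "pubmed"
    NONE = "none"


def choose_affiliation_source(
    use_scopus_articles: bool,
    scopus_affiliation_ids: Sequence[str],
    pubmed_affiliation: Optional[str],
) -> Source:
    # Steps 1 and 2: Scopus is preferred, but only when it is enabled
    # and the article actually carries a Scopus affiliation.
    if use_scopus_articles and scopus_affiliation_ids:
        return Source.SCOPUS
    # Step 3: otherwise fall back to PubMed if a non-empty statement exists.
    if pubmed_affiliation and pubmed_affiliation.strip():
        return Source.PUBMED
    # Neither source has an affiliation for this author.
    return Source.NONE
```

For example, `choose_affiliation_source(True, [], "Weill Cornell Medical College")` falls back to PubMed because the Scopus affiliation is empty.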
Evaluate Scopus affiliation
1. Get list of institutions (these are strings) from identity.Institution for target person. Also, get Scopus institution IDs from homeInstitution-scopusInstitutionIDs in application.properties.
2. Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor.
3. Use values from identity.Institution to look up Scopus institutional identifiers in the InstitutionAfid table. For example, look up "Weill Graduate School of Medical Sciences of Cornell University" to get its Scopus institution ID(s).
4. Attempt match between article and identity.
If there's a positive match between article and identity, output the following:
For EACH positive match between article and identity, output the following:
If match, go to 7.
If no match, go to 5.
5. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-scopusInstitutionIDs (stored in application.properties). Look for overlap between the two.
If there's any one positive match between article and identity, output the following for all matches:
While there can be multiple matches, the maximum score returned for this type of match should be 1.
If no match, go to 6.
6. There's no match. Output:
Test case: meb7002 and 22667600
Go to 7.
7. If PubMed affiliation exists, output that (but don't score it):
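The Scopus matching cascade in steps 4 to 6 might look like the sketch below. It returns a match type and the matched IDs rather than a score, since the score values are not listed here. The function name and the labels ("individual", "collaborating", "no-match", "no-data") are hypothetical, and the ID "11111111" is a made-up placeholder.

```python
from typing import Iterable, List, Tuple


def match_scopus_affiliation(
    article_ids: Iterable[str],
    identity_ids: Iterable[str],
    home_ids: Iterable[str],
    collaborating_ids: Iterable[str],
) -> Tuple[str, List[str]]:
    article = [str(i) for i in article_ids]
    # Assumption: a null/empty Scopus affiliation is not a mismatch
    # (per the comment thread, it should not be penalized).
    if not article:
        return ("no-data", [])

    # Step 4: IDs known for the person, plus the home institution IDs.
    known = {str(i) for i in identity_ids} | {str(i) for i in home_ids}
    individual = [i for i in article if i in known]
    if individual:
        # Every positive match is reported, not just the first.
        return ("individual", individual)

    # Step 5: fall back to institution-level collaborating institutions.
    collaborating = {str(i) for i in collaborating_ids}
    collab = [i for i in article if i in collaborating]
    if collab:
        # All matches are reported; the score for this type is capped.
        return ("collaborating", collab)

    # Step 6: nothing matched.
    return ("no-match", [])


print(match_scopus_affiliation(["60007997", "11111111"], ["60007997"], [], []))
```

Returning every matched ID keeps the output consistent with matching against all affiliations rather than only the first.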
Evaluate PubMed affiliation
1. Get list of institutions (these are strings) from identity.institutions for person under consideration.
2. Get article.affiliation for targetAuthor.
3. Preprocess.
Get list of stopwords from the institution-Stopwords field in application.properties.
Remove stopwords, commas, and dashes from article.affiliation and identity.institutions.
Ignore any words inside parentheses. These are typically countries and are not included in affiliation statements.
4. Attempt match between article.affiliation and identity.institutions. The logic here is that all the keywords from an identity.institutions entry appear somewhere in article.affiliation.
Here's how we do this match. Grab each affiliation and see if all the keywords are represented in a single affiliation. For example, suppose an author has a known affiliation in identity.institutions of "Weill Cornell Medical College". And, suppose the article affiliation is "Department of Pharmacology, Medical College of Weill Cornell." This would be a match because all the words in the identity affiliation are represented in the article affiliation.
If there's a match, output the following:
Maximum of one match.
If there's no match, go to 5.
5. Attempt match against homeInstitution-keywords.
Get homeInstitution-keywords from application.properties.
Look for cases where the homeInstitution keywords are present in the affiliation string in any order. Here's how we do this. Take any group of terms from homeInstitution, e.g., "weill|cornell". In order for this to be a match, both terms must be present in any order, with any case.
If there's a match, output the following:
Maximum of one match.
If there's no match, go to 6.
6. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-keywords (stored in application.properties). Look for overlap between the two.
If there's any one positive match between article and identity, output the following for all matches:
While there can be multiple matches, the maximum score returned for this type of match should be 1.
If there's no match, go to 7.
7. There's no match. Output:
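The PubMed steps above (preprocess, keyword match, home institution, collaborating institutions) can be sketched as follows. Several details are assumptions rather than part of the spec: all function names and return labels are hypothetical, comparison is lowercased in every step, trailing periods are stripped from words, and collaborating keywords are assumed to use the same pipe-separated group format as homeInstitution-keywords.

```python
import re
from typing import Iterable, Sequence, Set


def tokenize(text: str, stopwords: Set[str]) -> Set[str]:
    """Step 3: drop parenthesized text, commas, dashes, and stopwords."""
    text = re.sub(r"\([^)]*\)", " ", text)   # ignore words inside parentheses
    text = re.sub(r"[,\-]", " ", text)       # remove commas and dashes
    words = (w.strip(".").lower() for w in text.split())  # assumption: lowercase, no periods
    return {w for w in words if w and w not in stopwords}


def group_matches(affiliation: str, group: str) -> bool:
    """True if every term in a group such as "weill|cornell" is present."""
    lowered = affiliation.lower()
    terms = [t.strip().lower() for t in group.split("|") if t.strip()]
    return bool(terms) and all(t in lowered for t in terms)


def match_pubmed_affiliation(
    article_affiliations: Sequence[str],
    identity_institutions: Iterable[str],
    home_groups: Sequence[str],
    collaborating_groups: Sequence[str],
    stopwords: Set[str],
) -> str:
    article_tokens = [tokenize(a, stopwords) for a in article_affiliations]
    # Step 4: all keywords of one known institution must appear
    # within a single article affiliation.
    for institution in identity_institutions:
        keywords = tokenize(institution, stopwords)
        if keywords and any(keywords <= tokens for tokens in article_tokens):
            return "individual"
    # Step 5: home institution keyword groups, any order, any case.
    if any(group_matches(a, g) for a in article_affiliations for g in home_groups):
        return "home"
    # Step 6: collaborating institution keywords (format assumed, see above).
    if any(group_matches(a, g) for a in article_affiliations for g in collaborating_groups):
        return "collaborating"
    # Step 7: no match.
    return "no-match"
```

With the stopword set `{"of"}`, the affiliation "Department of Pharmacology, Medical College of Weill Cornell." matches the known institution "Weill Cornell Medical College" at step 4, as in the example above.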
Evaluate nonTargetAuthor
Decide which source to use
We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.
1. As set in application.properties, is use.scopus.articles=true?
2. Does article have any Scopus affiliation for nonTargetAuthor?
3. Does candidate article have any PubMed affiliation for nonTargetAuthor?
4. Return the following:
Evaluate Scopus affiliation
1. Preprocessing
A. Create scopusIDsNonTargetAuthor-Article.
B. Create scopusIDsNonTargetAuthor-Identity-KnownInstitutions.
C. Create scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions.
2. Determine overlap.
Compute the following:
countScopusIDNonTargetAuthor-Affiliations - non-unique count of all Scopus affiliation IDs for all authors
countScopusIDsNonTargetAuthor-Article-KnownInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-KnownInstitutions
countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions
countScopusIDsNonTargetAuthor-Article-NoMatch - count of cases in which none of the above are true
3. Compute overall score.
Get nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight and nonTargetAuthor-institutionalAffiliation-maxScore from application.properties.
4. Output values
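The overlap counts from step 2 can be sketched like this. The function name is hypothetical, and the sketch stops at the counts: how the weight and maxScore combine them into an overall score is not spelled out here, so it is not shown. The IDs "11111111" and "22222222" are made-up placeholders.

```python
from typing import Dict, Iterable


def count_non_target_overlap(
    article_ids: Iterable[str],
    known_institution_ids: Iterable[str],
    collaborating_institution_ids: Iterable[str],
) -> Dict[str, int]:
    known = set(known_institution_ids)
    collaborating = set(collaborating_institution_ids)
    counts = {
        "countScopusIDNonTargetAuthor-Affiliations": 0,
        "countScopusIDsNonTargetAuthor-Article-KnownInstitution": 0,
        "countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution": 0,
        "countScopusIDsNonTargetAuthor-Article-NoMatch": 0,
    }
    # article_ids is deliberately not de-duplicated: the total is a
    # non-unique count across all non-target authors.
    for affiliation_id in article_ids:
        counts["countScopusIDNonTargetAuthor-Affiliations"] += 1
        in_known = affiliation_id in known
        in_collaborating = affiliation_id in collaborating
        if in_known:
            counts["countScopusIDsNonTargetAuthor-Article-KnownInstitution"] += 1
        if in_collaborating:
            counts["countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution"] += 1
        if not in_known and not in_collaborating:
            counts["countScopusIDsNonTargetAuthor-Article-NoMatch"] += 1
    return counts


result = count_non_target_overlap(
    ["60007997", "60007997", "11111111", "22222222"],
    known_institution_ids=["60007997"],
    collaborating_institution_ids=["11111111"],
)
print(result)
```

Note that in this sketch an ID appearing in both the known and collaborating sets is counted in both categories; whether that should happen is a design decision the spec leaves open.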
Evaluate PubMed affiliation
At this time, we're not evaluating PubMed affiliation for nonTargetAuthors.