Create averageClusteringScoringStrategy #232

Closed
paulalbert1 opened this issue Jul 12, 2018 · 0 comments
Background

Clustering is generally reliable. We can use shared associations to ensure that articles in the same cluster have scores that are more similar to one another.

However, when clustering fails, it can seriously undermine accuracy (examples: amc2056, cos2006, rjk9003).

In such cases, we can compare the firstName of targetAuthors across a given cluster to infer how reliable that cluster is. If the names are consistent, we use the full cluster score. If they are not, we discount the cluster score.

Properties

Store these in application.properties.

clusterReliabilityScoreFactor: 3
clusterScore-Factor: 0.4

Pseudocode

  1. If an article is in its own cluster with no other members, do not use this strategy.

  2. Is use.gold.standard.evidence=true in application.properties?

  • If yes, go to step 3.
  • If no, go to step 4.
  3. Add up all evidence scores for that article including, if they exist, acceptedArticleScore or rejectedArticleScore. We will call this totalArticleScore-WithoutClustering. Go to step 5.

  4. Add up all evidence scores for that article excluding acceptedArticleScore and rejectedArticleScore. We will call this totalArticleScore-WithoutClustering. Go to step 5.

  5. Take the average of the totalArticleScore-WithoutClustering values in a given cluster. We will call this clusterScore-Average.
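
The scoring steps so far can be sketched in Python as follows. This is a minimal illustration, not ReCiter's implementation: the dict-per-article layout, the function names, and the evidence names nameScore and emailScore are hypothetical. Only acceptedArticleScore, rejectedArticleScore, and the use.gold.standard.evidence switch come from the description above.

```python
from statistics import mean

# Evidence names treated as gold-standard evidence (taken from the issue text).
GOLD_STANDARD_KEYS = {"acceptedArticleScore", "rejectedArticleScore"}


def total_score_without_clustering(evidence_scores, use_gold_standard_evidence):
    """Sum one article's evidence scores.

    `evidence_scores` is a hypothetical dict of evidence name -> score.
    Gold-standard scores are included only when the flag is true.
    """
    return sum(
        score
        for name, score in evidence_scores.items()
        if use_gold_standard_evidence or name not in GOLD_STANDARD_KEYS
    )


def cluster_score_average(cluster, use_gold_standard_evidence):
    """Average totalArticleScore-WithoutClustering over the articles in a cluster.

    Returns None for a single-article cluster, where the strategy is not used.
    """
    if len(cluster) < 2:
        return None
    return mean(
        total_score_without_clustering(scores, use_gold_standard_evidence)
        for scores in cluster
    )


# Illustrative values only; these are not real ReCiter scores.
cluster = [
    {"nameScore": 4.0, "emailScore": 2.0, "acceptedArticleScore": 3.0},
    {"nameScore": 5.0, "emailScore": 1.0},
]
print(cluster_score_average(cluster, use_gold_standard_evidence=True))   # 7.5
print(cluster_score_average(cluster, use_gold_standard_evidence=False))  # 6.0
```

Returning None for a one-article cluster is one way to signal "do not use this strategy"; the caller would then leave that article's score untouched.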

  6. For every article in a given cluster, retrieve all instances of articleAuthorName.firstName. For example:

firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RK]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RulBin]
firstName=[RyeoJin]
firstName=[RyeoJin]
firstName=[RyoonHo]
  7. Remove all capital letters. For example:
firstName=[aewon]
firstName=[aewon]
firstName=[aewon]
firstName=[aewon]
firstName=[aewon]
firstName=[]
firstName=[ockum]
firstName=[ockum]
firstName=[ockum]
firstName=[ockum]
firstName=[ockum]
firstName=[ulin]
firstName=[yeoin]
firstName=[yeoin]
firstName=[yoono]
  8. Remove names that no longer contain any letters. For example:
firstName=[aewon]
firstName=[aewon]
firstName=[aewon]
firstName=[aewon]
firstName=[aewon]
firstName=[ockum]
firstName=[ockum]
firstName=[ockum]
firstName=[ockum]
firstName=[ockum]
firstName=[ulin]
firstName=[yeoin]
firstName=[yeoin]
firstName=[yoono]
  9. Count the total number of names remaining (totalNameCount) and the count of the most frequent name (maxIdenticalNameCount). For example, totalNameCount is 14 and maxIdenticalNameCount is 5 ("aewon" and "ockum" each occur 5 times).

  10. Compute clusterReliabilityScore using this formula:

(maxIdenticalNameCount / totalNameCount) ^ clusterReliabilityScoreFactor

For example: (5/14)^3 ≈ 0.0456.
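
The name-based reliability calculation can be sketched in Python as follows. This is a sketch under stated assumptions rather than the project's code: the function name is hypothetical, and returning None when no names survive is an assumption, since the issue does not say what should happen in that case.

```python
from collections import Counter


def cluster_reliability_score(first_names, reliability_factor=3):
    """Compute (maxIdenticalNameCount / totalNameCount) ** reliability_factor."""
    # Drop every uppercase letter, e.g. "RaeKwon" -> "aewon" and "RK" -> "".
    stripped = (
        "".join(ch for ch in name if not ch.isupper()) for name in first_names
    )
    # Discard names that no longer contain any letters.
    remaining = [name for name in stripped if any(ch.isalpha() for ch in name)]
    if not remaining:
        return None  # Assumption: this case is not specified in the issue.
    total_name_count = len(remaining)
    max_identical_name_count = Counter(remaining).most_common(1)[0][1]
    return (max_identical_name_count / total_name_count) ** reliability_factor


# The first names from the example above.
first_names = (
    ["RaeKwon"] * 5 + ["RK"] + ["RockBum"] * 5
    + ["RulBin"] + ["RyeoJin"] * 2 + ["RyoonHo"]
)
print(round(cluster_reliability_score(first_names), 4))  # 0.0456
```

A cluster in which every surviving name is identical scores 1.0, so its cluster score would be applied in full.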

  11. Now determine, article by article, how much clusterScore-Average should affect each article's score.

Retrieve clusterScore-Factor from application.properties.

clusterScore-Factor: 0.4

For each article, calculate clusterScore-Discrepancy.

clusterScore-Discrepancy = (totalArticleScore-WithoutClustering - clusterScore-Average) * clusterScore-Factor * clusterReliabilityScore
  12. Calculate totalArticleScore-nonStandardized:
totalArticleScore-nonStandardized = totalArticleScore-WithoutClustering - clusterScore-Discrepancy
  13. Output the following at the article level:
totalArticleScore-nonStandardized = 7.2 /* example */
clusterScore-Average = 7.5 /* example */
clusterScore-Discrepancy = 0.3 /* example */
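
The per-article adjustment can be sketched in Python as follows. The function name and the returned dict are hypothetical. The sketch assumes the multiplier is the computed clusterReliabilityScore (a value between 0 and 1), so that an unreliable cluster pulls an article's score toward the cluster average less strongly, as the Background describes.

```python
def adjust_for_cluster(total_without_clustering, cluster_average,
                       cluster_score_factor=0.4, cluster_reliability_score=1.0):
    """Pull an article's score toward its cluster average.

    Assumption: the multiplier is the computed clusterReliabilityScore,
    not the exponent stored as clusterReliabilityScoreFactor.
    """
    discrepancy = (
        (total_without_clustering - cluster_average)
        * cluster_score_factor
        * cluster_reliability_score
    )
    return {
        "totalArticleScore-nonStandardized": total_without_clustering - discrepancy,
        "clusterScore-Average": cluster_average,
        "clusterScore-Discrepancy": discrepancy,
    }


# Illustrative values only: an article scoring 10.0 in a cluster averaging 7.5.
reliable = adjust_for_cluster(10.0, 7.5, cluster_reliability_score=1.0)
unreliable = adjust_for_cluster(10.0, 7.5, cluster_reliability_score=(5 / 14) ** 3)
print(round(reliable["totalArticleScore-nonStandardized"], 4))    # 9.0
print(round(unreliable["totalArticleScore-nonStandardized"], 4))  # 9.9544
```

With a fully consistent cluster the article moves 40% of the way toward the cluster average; with the inconsistent names from the example above it barely moves.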