Documentation on the score #50

cristianpb · 2020-05-13T19:10:31Z

The score should also be easy for a person to interpret, preferably as a percentage.

There are two options: do the scoring in Node, or do it with the Elasticsearch scripting language.

I never explored this route for matchID: there we had both the parallelization already used in matchID and, above all, pandas, which provides vectorized mathematical functions and a few functions implemented in C. Even so, in matchID, advanced scoring (which notably recomputes quite a few Levenshtein distances) could take 25 to 50% of the computation time.

Here you have the choice between scripting in Node or using the ES scripting language, which handles parallelization itself inside its Java core. I think the choice is less obvious, and that it needs to be tested.

Roughly, score = f(Pa, Pb), where Pa and Pb are a person from dataset a and a person from dataset b.
More precisely:
f ~ f_noms(Na,Nb) x f_lieu(La,Lb) x f_date(Da,Db) x f_sexe(Sa,Sb)

where each f (names, place, date, sex) is normalized to 1.
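
To make the product form concrete, here is a minimal sketch that multiplies partial scores and reports a percentage. The function name `combined_score` and the dictionary keys are hypothetical, chosen for illustration; the partial scores are assumed to be already computed and to lie in [0, 1].

```python
from math import prod
from typing import Mapping


def combined_score(partial_scores: Mapping[str, float]) -> float:
    """Multiply partial scores (each in [0, 1]) and return a percentage."""
    for name, value in partial_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
    return round(100 * prod(partial_scores.values()), 1)


# Hypothetical partial scores for one (Pa, Pb) pair.
example = {"names": 0.9, "place": 0.8, "date": 1.0, "sex": 1.0}
print(combined_score(example))
```

Because the terms are multiplied, a single partial score of 0 brings the whole score to 0, which is worth keeping in mind when a field is simply missing rather than contradictory.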

For names, the idea is to match at the token level and at the character level. I had not included phonetic matching, but I think it should be added.
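
As an illustration of the two levels, here is a pure-Python sketch: a character-level similarity based on Levenshtein distance, and a token-level similarity built on top of it. The function names are hypothetical and this is not the matchID implementation; in particular, the token matching is a simple best-match average and may pair two tokens with the same counterpart.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def token_similarity(name_a: str, name_b: str) -> float:
    """Average, over the tokens of the shorter name, of the best
    character-level match found among the tokens of the other name."""
    tokens_a = name_a.lower().split()
    tokens_b = name_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    if len(tokens_a) > len(tokens_b):
        tokens_a, tokens_b = tokens_b, tokens_a
    best = [max(char_similarity(ta, tb) for tb in tokens_b) for ta in tokens_a]
    return sum(best) / len(best)
```

With this sketch, token order and extra first names do not lower the score: `token_similarity("Jean Pierre Martin", "martin jean")` compares only the two tokens of the shorter name, each of which has an exact counterpart.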

For places, it depends on the level of description: commune label vs. commune code, etc.
So here again a custom score is needed:

commune correction: (label, dep) => commune (fuzzy match on the label)
score of 1 if the commune histories share at least one entry
score based on ~ the inverse of the geographic distance
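
The distance-based term can be sketched as follows. The matchID recipe quoted further down uses the shape 40/(40+distance) rounded to two decimals; this sketch reuses that shape, and assumes (it is not stated in the recipe) that the distance is in kilometres and computed with the haversine formula on (latitude, longitude) pairs. The function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Optional, Tuple

EARTH_RADIUS_KM = 6371.0
Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))


def distance_score(a: Optional[Point], b: Optional[Point],
                   scale_km: float = 40.0) -> float:
    """Score in [0, 1]: 1 at zero distance, 0.5 at scale_km, 0 if a point is missing."""
    if a is None or b is None:
        return 0.0
    return round(scale_km / (scale_km + haversine_km(a, b)), 2)
```

The `scale_km` parameter is the distance at which the score drops to 0.5, so it controls how quickly neighbouring communes stop counting as a match.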

The rest is more trivial.

We also added a pass over the number of candidates (Pa, Pb1, Pb2, etc.) for cases like Mamadou Diallo, where a very common name yields many candidates.
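
The thread does not say how this pass was computed. One possible way to do it, given purely as an assumption and not as what matchID does, is to multiply each candidate's score by its share of the total score among all candidates for the same Pa. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (record_id, candidate_id, score in [0, 1])


def penalize_ambiguous(candidates: List[Candidate]) -> List[Candidate]:
    """Scale each score by its share of the total score for the same record.

    A lone candidate keeps its score; several equally good candidates
    each see their score divided by their number.
    """
    totals = defaultdict(float)
    for record_id, _, score in candidates:
        totals[record_id] += score

    adjusted = []
    for record_id, candidate_id, score in candidates:
        total = totals[record_id]
        share = score / total if total > 0 else 0.0
        adjusted.append((record_id, candidate_id, round(score * share, 3)))
    return adjusted
```

With this choice, a record with a single candidate at 0.9 stays at 0.9, while a record with two candidates at 0.8 each ends up with 0.4 for both, which reflects that neither can be preferred.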

https://github.com/matchID-project/backend/blob/dev/conf/recipes/matching.yml

scoring:
    steps:
      - scoring_name_lev:
      - scoring_location:
      - scoring_date:
      - scoring_sex:
      - scoring_final:

with, for example:

  scoring_location:
    steps:   
      - eval:
        # scoring location
        - matchid_hit_score_location_lv_city: levenshtein_norm(matchid_location_city,hit_matchid_location_city)      
        - matchid_hit_score_location_lv_city_src: levenshtein_norm(matchid_location_city_src,hit_matchid_location_city_src)      
        - matchid_hit_score_location_lv_country: levenshtein_norm(matchid_location_country,hit_matchid_location_country)      
        - matchid_hit_score_location_citycode_history: 1 if (len([x for x in matchid_location_citycode_history if x in hit_matchid_location_citycode_history])>0) else 0     
        - matchid_hit_score_location_citycode: 1 if ((len(matchid_location_citycode)>0) & (matchid_location_citycode == hit_matchid_location_citycode)) else 0
        - matchid_hit_score_location_depcode: 1 if (matchid_location_depcode == hit_matchid_location_depcode) else 0      
        - matchid_hit_score_location_countrycode: 1 if (matchid_location_countrycode == hit_matchid_location_countrycode) else 0   
        - matchid_hit_distance: distance(matchid_location_city_geopoint_2d,hit_matchid_location_city_geopoint_2d)
        - matchid_hit_score_location_distance: 0 if (matchid_hit_distance == "") else round(100*40/(40+matchid_hit_distance))/100
        - matchid_hit_score_location: round(0.5*max(matchid_hit_score_location_citycode,matchid_hit_score_location_citycode_history,max(matchid_hit_score_location_lv_city,matchid_hit_score_location_lv_city_src),matchid_hit_score_location_distance)+0.25*max(matchid_hit_score_location_depcode, matchid_hit_score_location_citycode_history)+0.25*max(matchid_hit_score_location_countrycode,matchid_hit_score_location_lv_country,matchid_hit_score_location_citycode_history),2)
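
For readers who find the last expression of the recipe hard to parse, here is a plain-Python reading of it. The dataclass and function names are hypothetical; the field names drop the `matchid_hit_score_location_` prefix, and each component is assumed to be already computed and in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class LocationComponents:
    """Partial location scores, each in [0, 1]."""
    lv_city: float = 0.0
    lv_city_src: float = 0.0
    lv_country: float = 0.0
    citycode_history: float = 0.0
    citycode: float = 0.0
    depcode: float = 0.0
    countrycode: float = 0.0
    distance: float = 0.0


def location_score(c: LocationComponents) -> float:
    """Weighted sum: 50% city level, 25% department level, 25% country level."""
    # City level: best evidence among codes, history, fuzzy labels and distance.
    city = max(c.citycode, c.citycode_history, c.lv_city, c.lv_city_src, c.distance)
    # Department level: a shared code in the commune history also counts.
    department = max(c.depcode, c.citycode_history)
    # Country level: code, fuzzy label, or a shared code in the history.
    country = max(c.countrycode, c.lv_country, c.citycode_history)
    return round(0.5 * city + 0.25 * department + 0.25 * country, 2)
```

Read this way, a shared entry in the commune code history is enough to reach a score of 1 on its own, while matching only the country yields 0.25.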

cristianpb linked a pull request May 20, 2020 that will close this issue

Feat/bulk scoring #62

Merged

cristianpb closed this as completed in #62 May 20, 2020

rhanka mentioned this issue Oct 20, 2024

[Snyk] Upgrade fuzzball from 2.1.2 to 2.1.3 #433

Closed
