2024-05_school_mapping #3

sumanashrestha · 2024-05-27T02:53:25Z

sumanashrestha
May 27, 2024
Maintainer

Discussion thread for school mapping problem

Problem statement: https://github.com/moest-np/incubator/tree/main/2024-05_school_mapping

UPDATE

The school lists come from two separate legacy systems, and we want to perform analysis on the merged data. Yes, for future inputs we should make efforts to ensure the collected data is clean; however, we need to deal with the existing data.
It is understood that it will not be possible to map every school from source A to source B. The ask is to automate wherever possible; manual mapping will be performed for the remaining ones. The target should be to minimize manual work.
school_list_A only contains Samudayik / Government schools. A type field has been added to school_list_B, so the subset of school_list_B with school_type = 2 (Samudayik / Government schools) can be used.
A mapping between the district fields in list_a and list_b has been added to jilla.tsv.

np-n · 2024-05-31T05:03:54Z

np-n
May 31, 2024

I agree that the solution is to translate the school names of one source (translating source A would be better), but applying text pre-processing before the translation will help increase the match percentage, for example:

Removing the district from the school name; it often comes after a comma (सिंहदेवी आधारभूत विद्यालय, इलाम -> सिंहदेवी आधारभूत विद्यालय).
Splitting the text on whitespace and mapping words to English using domain knowledge to increase the match score: प्रा वि -> primary school, आधारभूत -> basic, माध्यमिक -> secondary, विद्यालय -> school.
Applying regex to remove noise from Source B, i.e. the text that follows School, Vidyalaya etc. (e.g. Chokmagu English Boarding School Pvt. Ltd. -> Chokmagu English Boarding School).
Converting both source and destination to a consistent format (use either Vidyalaya or School, and either secondary or madhyamic, but not both).

So, creating a new row for each school in source A using the above preprocessing techniques and matching it against source B should help increase the match percentage.
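The preprocessing steps above could look roughly like the following. This is only a sketch: TOKEN_MAP, NOISE and normalize_name are hypothetical names, the token map is a tiny illustrative sample rather than a complete vocabulary, and the noise pattern only covers the "Pvt. Ltd." case from the example.

```python
import re

# Hypothetical, deliberately small token map; extend it from domain knowledge.
TOKEN_MAP = {
    "विद्यालय": "school",
    "माध्यमिक": "secondary",
    "vidyalaya": "school",
    "madhyamik": "secondary",
}

# Company-style suffixes treated as noise (assumption: only these two).
NOISE = re.compile(r"\b(pvt|ltd)\b\.?", re.IGNORECASE)


def normalize_name(name):
    # Keep only the part before the first comma (drops a trailing district).
    name = name.split(",", 1)[0]
    # Remove noise such as "Pvt." and "Ltd.".
    name = NOISE.sub(" ", name)
    # Lowercase each token and map known words to a common vocabulary.
    tokens = [TOKEN_MAP.get(tok.lower(), tok.lower()) for tok in name.split()]
    return " ".join(tokens)


print(normalize_name("सिंहदेवी आधारभूत विद्यालय, इलाम"))
print(normalize_name("Chokmagu English Boarding School Pvt. Ltd."))
```

Running the same function over both lists before matching keeps the two sides in one consistent format.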

1 reply

ashim-mahara Jun 2, 2024

The comma is good if implemented everywhere. We have to assume that the tsvs were created from a database and as such combined multiple columns to create the text. The combining character could have been a ' ,'. If not, then it is trivial to remove the comma and text after it.
We can do a tf/idf search to find the most recurring strings and remove them for the task.
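A simple way to find the most recurring strings is to count in how many names each token appears. The sketch below uses plain document frequency as a stand-in for a full tf/idf computation; frequent_tokens, drop_tokens and the min_share cutoff are hypothetical, and the sample names are made up for illustration.

```python
from collections import Counter


def frequent_tokens(names, min_share=0.5):
    """Return tokens that occur in at least min_share of the names."""
    doc_freq = Counter()
    for name in names:
        # Count each token once per name (document frequency).
        doc_freq.update(set(name.lower().split()))
    cutoff = min_share * len(names)
    return {tok for tok, count in doc_freq.items() if count >= cutoff}


def drop_tokens(name, stop_tokens):
    """Remove the frequent tokens so only the distinctive part remains."""
    return " ".join(t for t in name.lower().split() if t not in stop_tokens)


names = [
    "shree janata secondary school",
    "shree kalika basic school",
    "himalaya secondary school",
]
stop = frequent_tokens(names, min_share=0.6)
print(sorted(stop))
print(drop_tokens(names[0], stop))
```

The cutoff needs tuning on the real lists; set too low, it would also strip tokens that distinguish one school from another.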

AdhikariN · 2024-05-31T07:40:52Z

AdhikariN
May 31, 2024

With the data we have in hand, there is no way to be confident about the match or join we create. These datasets highlight a wider issue: addresses. Why don't we have proper addresses? People navigate their way from place to place by asking other people. Say you had a proper address for every single house, building etc. I know we are very far away from that, but at least the government can start using proper addresses for its organisations/services. I would suggest getting the District Education Officers to collate the data in a different way so you can match and use it with confidence; I am sure they will have plenty of time to do that.

1 reply

ashim-mahara Jun 2, 2024

I agree that we don't have a dataset to quantitatively measure the performance of the proposed approaches. I would advise Hon. Minister @sumanashrestha (I hope that is the proper honorific, forgive me if it isn't) and associates to first create a sample 1-1 mapping to validate the approaches. This will help in iterating over the problem more reliably.
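Once a hand-verified sample mapping exists, each approach can be scored against it. A minimal sketch, assuming both the sample and the predictions are dicts from a list A id to a list B id (the evaluate function and the ids below are hypothetical):

```python
def evaluate(predicted, gold):
    """Score predicted A->B pairs against a hand-verified sample.

    Only predictions for A ids present in the gold sample are judged,
    since nothing is known about the rest.
    """
    judged = {a: b for a, b in predicted.items() if a in gold}
    correct = sum(1 for a, b in judged.items() if gold[a] == b)
    precision = correct / len(judged) if judged else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {1: 10, 2: 20, 3: 30, 4: 40}
predicted = {1: 10, 2: 21, 3: 30, 9: 90}
precision, recall = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision says how often a proposed match is right, and recall says how much of the sample was matched at all; both are needed to compare approaches.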

NavTheRaj · 2024-05-31T07:44:23Z

NavTheRaj
May 31, 2024

I agree with @np-n on the text pre-processing, which might significantly increase the matches.

Additionally, we could use Soundex (especially Daitch-Mokotoff Soundex, which suits non-English words better), Metaphone and similar algorithms, with some weight, to determine the similarity.
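To illustrate the idea of phonetic keys, here is a sketch of classic American Soundex. Note that this is not the Daitch-Mokotoff variant, which uses a different and larger rule table; the sample spellings are made up, and the function only handles ASCII letters, so it would be applied to romanized names.

```python
def soundex(word):
    """Classic American Soundex code (one letter + three digits)."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = codes.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


print(soundex("saraswati"), soundex("sarswoti"))
```

Two spellings that share a code could receive a bonus in the weighted score, rather than the code being used as the only matching criterion.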

I ran an initial draft without preprocessing the data and got these results; the output is incomplete because of the untidy data. Here it is if anyone wants to look and get analytics from it.

Script Used :

import nepali_roman
import pandas as pd
from rapidfuzz import fuzz

# Load data
school_list_A = pd.read_csv('../data/school_list_A.tsv', sep='\t')
school_list_B = pd.read_csv('../data/school_list_B.tsv', sep='\t')


# Normalize text function
def normalize_text(text):
    return text.strip().lower() if isinstance(text, str) else ''


# Function to romanize Devanagari text using nepali_roman
# (the output is a plain romanization, not strict Velthuis)
def devanagari_to_velthuis(text):
    return nepali_roman.romanize_text(text) if isinstance(text, str) else ''


# Apply normalization and transliteration
school_list_A['velthuis_normalized'] = school_list_A['velthuis'].apply(normalize_text)
school_list_A['district1_velthuis'] = school_list_A['district1'].apply(devanagari_to_velthuis).apply(normalize_text)
school_list_B['name_normalized'] = school_list_B['name'].apply(normalize_text)
school_list_B['old_name1_normalized'] = school_list_B['old_name1'].fillna('').apply(normalize_text)
school_list_B['old_name2_normalized'] = school_list_B['old_name2'].fillna('').apply(normalize_text)
school_list_B['old_name3_normalized'] = school_list_B['old_name3'].fillna('').apply(normalize_text)
school_list_B['district_normalized'] = school_list_B['district'].apply(normalize_text)


# Function to match schools
def match_school(row, school_list_B):
    district_a = row['district1_velthuis']
    velthuis_a = row['velthuis_normalized']

    # Check if district_a is empty or not a string
    if not district_a:
        return None, None

    # Filter Source B by district
    # regex=False: treat the district as a literal substring, not a pattern
    filtered_b = school_list_B[school_list_B['district_normalized'].str.contains(district_a, na=False, regex=False)]

    # Gather names to match against
    names_b = filtered_b[
        [
            'school_id',
            'name_normalized',
            'old_name1_normalized',
            'old_name2_normalized',
            'old_name3_normalized',
            'district_normalized'
        ]]

    # Perform fuzzy matching
    matches = []
    for index, b_row in names_b.iterrows():
        names_score = [
            fuzz.ratio(velthuis_a, b_row['name_normalized']),
            fuzz.ratio(velthuis_a, b_row['old_name1_normalized']),
            fuzz.ratio(velthuis_a, b_row['old_name2_normalized']),
            fuzz.ratio(velthuis_a, b_row['old_name3_normalized'])
        ]

        district_score = fuzz.ratio(district_a, b_row['district_normalized'])
        best_name_score = max(names_score)
        # Weighted score
        # Name of the school = 70 %
        # District = 30 %
        # Tune it as needed
        combined_score = 0.7 * best_name_score + 0.3 * district_score
        matches.append((b_row['school_id'], combined_score))

    # Select the best match
    if matches:
        best_match = max(matches, key=lambda x: x[1])
        return best_match[0], best_match[1]
    else:
        return None, None


# Apply matching function
matches = school_list_A.apply(lambda row: match_school(row, school_list_B), axis=1)
school_list_A['school_id_b'], school_list_A['match_score'] = zip(*matches)

# Filter out low-confidence matches if necessary, tune as needed
threshold = 70
matched_schools = school_list_A[school_list_A['match_score'] >= threshold]

# Output the mapping
mapping = matched_schools[['school_id', 'school_id_b']].copy()
mapping.rename(columns={'school_id': 'school_id_a'}, inplace=True)
mapping.to_csv('school_mapping.csv', index=False)

Result :
school_mapping.csv

0 replies

LuluW8071 · 2024-05-31T08:23:05Z

LuluW8071
May 31, 2024

Also, some of the Nepali school names in the data do not appear to be separated by the \t delimiter when viewed, though they are few in number.

0 replies

Aadesh-Baral · 2024-05-31T15:14:59Z

Aadesh-Baral
May 31, 2024

I don't know what's the use case here but it would be great if we can bring all these schools on the open data platform like OpenStreetMap.

0 replies

Wahesh · 2024-05-31T16:04:05Z

Wahesh
May 31, 2024

Output of just 100 matches. It takes quite some time to match. I tried multiprocessing, but that did not quite work as expected.
school_matches-3.csv
Use the filter to see that as the confidence increases towards 80-90, the matches become better. Best at 100.

Steps:

I translated the Nepali School Names using Google Translate and the result is much better than velthuis
Split the School Name and match with the second file using Fuzzy.
One school could have multiple matches.

What can be done better:
Concatenate the first name and district name, then match this against the database, where a new column holds the first name and district.
This will greatly reduce the number of candidate matches. I will try this again sometime.
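The concatenated-key idea could be sketched as follows. This uses difflib from the standard library as a stand-in for the fuzzy-matching library, and make_key, best_match and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher


def make_key(name, district):
    """Combine name and district into a single comparison key."""
    return f"{name} {district}".lower().strip()


def similarity(a, b):
    """Similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_match(key_a, rows_b):
    """Return (school_id, score) of the closest key in rows_b."""
    best_id, best_score = None, 0.0
    for school_id, key_b in rows_b:
        score = similarity(key_a, key_b)
        if score > best_score:
            best_id, best_score = school_id, score
    return best_id, best_score


rows_b = [
    ("B1", make_key("Janata School", "Ilam")),
    ("B2", make_key("Janata School", "Jhapa")),
    ("B3", make_key("Kalika School", "Ilam")),
]
print(best_match(make_key("Janta School", "Ilam"), rows_b))
```

Because the district is part of the key, two schools with the same name in different districts no longer tie.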

0 replies

prajwoldhungana · 2024-05-31T19:10:04Z

prajwoldhungana
May 31, 2024

I reviewed the data and here are my observations and recommendations.
1. The data are not clean, standardized, or consistently patterned. E.g. in list B the name column holds both the school name and address, and in list A the school column holds both the school name and address. This will create problems while mapping.
2. Translating Nepali names into English through Google Translate or Velthuis will not give correct data. E.g. the list A velthuis column shows Bhojapur as bhojapura.
3. I would suggest getting new, clean data. Create a new form with a form creator like Omniuse, Google Forms or Microsoft Forms. Publish the form and ask each school admin team to fill it in with up-to-date details. Or we can ask volunteers to enter the data from lists A and B into the new form.
For this process I created a sample form on Omniuse Nepal - https://omniusenepal.com/form_details/MTE3Ng== . Now school admins or volunteers can enter the data. Once the data is entered, it can be downloaded as an Excel sheet and used as required.

OmniuseSchoolData.xlsx

0 replies

bibekhadka · 2024-06-01T04:15:46Z

bibekhadka
Jun 1, 2024

Why does it have to be a relational database? Is there a possibility to have the data published in ElasticSearch? The focus then shifts to writing elegant queries to fetch accurate data rather than establishing and maintaining relationships in data.

3 replies

xrawone Jun 1, 2024

🥴

ashim-mahara Jun 2, 2024

Unrelated. The problem is creating 1-1 mapping.

LuluW8071 Jun 2, 2024

This makes a simple mapping problem even more confusing for open source users💀

gyawaliamit7 · 2024-06-02T20:27:54Z

gyawaliamit7
Jun 2, 2024

Can someone help clarify a few points here?

Regarding using school_id in both datasets, I've noticed that the same school_id corresponds to completely different school names and districts. Does it really have any significance?
Concerning the problem statement, would it be correct to say that we aim to establish a single, centralized data source that provides information on school names and their locations across Nepal?
Assuming the above points are accurate, I believe we already have a robust school_list_b. We can utilize the same columns that have been defined in that list. This allows us to refine the problem statement to:
Check if a given school name (Velthuis) along with the district of school_list_a is present in school_list_b. If it is present, we can disregard it as a duplicate entry. If it is missing, we should add a new row to dataset_b.

cc @sumanashrestha

1 reply

sumanashrestha Jun 3, 2024
Maintainer Author

> Regarding using school_id in both datasets, I've noticed that the same school_id corresponds to completely different school names and districts. Does it really have any significance?

The ids in list_a and list_b are unrelated. The ask is to create a mapping between them.

> Concerning the problem statement, would it be correct to say that we aim to establish a single, centralized data source that provides information on school names and their locations across Nepal?

These are extracted from two separate legacy systems; see the updated original post for the intent.

> Assuming the above points are accurate, I believe we already have a robust school_list_b. We can utilize the same columns that have been defined in that list. This allows us to refine the problem statement to:
> Check if a given school name (Velthuis) along with the district of school_list_a is present in school_list_b. If it is present, we can disregard it as a duplicate entry. If it is missing, we should add a new row to dataset_b.

Yes, this is one potential solution. Other approaches are also welcome if they yield better results.

anishjoshi1999 · 2024-06-03T10:57:16Z

anishjoshi1999
Jun 3, 2024

1 reply

sapradhan Jun 3, 2024

> Nepali school names contain characters that are not properly separated by the '\t' delimiter when viewed

Could you point out which lines? I see 6 columns on each line, and it imports fine with the Python csv reader, VS Code Data Wrangler, LibreOffice Calc and DuckDB.

> and there is inconsistency in the usage of quotation marks within the 'velthuis' column

It is not inconsistent quoting; it's because of how the Velthuis mapping works -
श = "s
ङ = "n

> It's preferable to use CSV instead of TSV

It's just the choice of delimiter - comma, tab, pipe, caret. Since commas are present in the values, with tab as the delimiter you don't have to quote as frequently.
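One detail worth knowing when reading such a file with Python's csv module: by default it treats `"` as a quote character, so Velthuis values such as "s can be misread. A minimal sketch that turns quote handling off; whether the repository files need this is an assumption, and the sample row is made up.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text, keeping double quotes as literal characters."""
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return list(reader)


# Illustrative row: the velthuis value contains literal double quotes.
sample = 'school_id\tvelthuis\n1\t"srii "saaradaa\n'
rows = read_tsv(sample)
print(rows)
```

With csv.QUOTE_NONE the reader splits on tabs only and leaves every `"` in the field untouched.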

Wranitz · 2024-06-04T13:55:59Z

Wranitz
Jun 4, 2024

As the conversation shifts towards usability and district-level action, a centralized Linux server with a Windows server architecture is necessary. Transitioning all local computers to a Linux-based environment to run applications like Microsoft 365 would provide an easy and scalable solution for future use, including in various ministries.

Considering Linux is prudent, especially with the impending End of Life for Windows 10 in 2025. This transition could reduce expenditures and mitigate security issues from hacking groups that utilize AI. It's uncertain if this has been considered in the recent budget speech.

0 replies

cegorah · 2024-06-05T03:58:57Z

cegorah
Jun 5, 2024

Was just walking around and spotted the thread.
I'm not deep into the problem and not sure I understand the task completely, but it is quite easy to use embeddings to measure the distance between sets. It should be much more efficient than fuzzy search due to the transformer's attention mechanism.

https://github.com/nowalab/nepali-word-embeddings — for nepali language
https://sbert.net/examples/applications/paraphrase-mining/README.html — for paraphrase mining

Use the Google Geocoding API for the addresses and save lat/long into PostGIS.
Look for the nearest subsets from both sets by location, approx. 2-3 km in radius.
Match subsets one by one and sort by distance.
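The radius step can be prototyped without PostGIS. The sketch below assumes coordinates are already available for both lists (the geocoding call itself is not shown), and haversine_km, nearby and the sample points are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearby(point, candidates, radius_km=3.0):
    """Return (distance, id) pairs within radius_km, nearest first."""
    lat, lon = point
    hits = []
    for cand_id, c_lat, c_lon in candidates:
        dist = haversine_km(lat, lon, c_lat, c_lon)
        if dist <= radius_km:
            hits.append((dist, cand_id))
    return sorted(hits)


candidates = [("B1", 27.71, 85.30), ("B2", 27.90, 85.30), ("B3", 27.70, 85.30)]
print(nearby((27.70, 85.30), candidates))
```

Name matching would then only run inside each small neighbourhood, which keeps the number of comparisons low.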

2 replies

sapradhan Jun 11, 2024

I'm not much into word embeddings, but aren't they more about the synonym-antonym spectrum?
This problem is more about phonetic/sounds-alike matching:
source A has school names in Devanagari script and source B has school names in transliterated Roman script.

cegorah Jun 20, 2024

@sapradhan
Embedding similarity compares whole sequences with each other.
So it captures much more detail than simple phonetic matching.
To be completely fair, you can load the whole dataset into ChatGPT and ask it to get rid of duplicates. :)

sharad461 · 2024-06-10T10:23:41Z

sharad461
Jun 10, 2024

Hi @sumanashrestha, I've been attempting a few methods to solve this. Here's some progress: Google Sheets link. You can sort the results by the confidence column conf. I think some matches are better than others, but this should help manual verification quite a bit. Those that don't match after verification can then be manually matched. I have retained relevant columns. There are columns for ids from both the files (listA_school_id, listB_school_id) and some columns to make simple verification possible. If there is any glaring mistake or question or comments, please let me know.

I'll keep working on this some more to hopefully find better results, but if we want to standardize the task we'll need to construct good evaluation sets. This will also allow leaderboards, etc. if we want to add a competitive edge. The data itself is unclean and not too descriptive, so I don't know how far we can push the accuracy. If you want to take a look at the code, I can initiate a pull request to the repository.

Edit: If this does feel like an explore-worthy direction (and in the absence of an evaluation set), another possible way forward would be for us to get ids of verified correct matches in the file above (by manual verification), so that I can exclude them in new experiments. This could be done iteratively to reduce manual mapping. Right now there's the risk that experimenting with data features might cause current potential matches to unalign. We can stop iterating when the returns are not worth the manual verification effort.

4 replies

sumanashrestha Jun 13, 2024
Maintainer Author

Thank you for your effort. This looks promising. Please follow these guidelines - https://github.com/moest-np/incubator/blob/main/2024-05_school_mapping/CONTRIBUTING.md and submit a PR with your implementation.

sharad461 Jun 15, 2024

Thanks for getting back. I'll do that soon.

In the meantime I'm working on an evaluation set and I could use some help. This is also a call to everyone in this thread. Whoever wants to contribute, the task would be to manually match 50 to 100 schools depending on your bandwidth. If we have around 1000 samples, we should be good to go. Interested people please leave a reply.

sapradhan Jun 16, 2024

@sharad461 I made a copy of the output sheet you provided and added a manual classifier column.
I have manually classified about 50 high confidence, 50 medium confidence, 50 low confidence ones. The classifier field is editable to public, anybody interested can contribute.

sharad461 Jun 17, 2024

Hey, thanks. I had collected some 800 "synthetic" evaluation samples like this. The issue is: this current school matching method will always have a near-perfect score on this "synthetic" set (even when, in truth, it's correct only like half the time) and almost all methods I'm using are similar, so unless we can make use of the "No" rows as well, this will not work. Making use of the "No" rows will, again, require manual matching. But thanks. Lmk if you have other ideas.
