2024-05_school_mapping #3
Replies: 13 comments 13 replies
-
I agree that the solution to the problem is to translate the school names in one source document (translating source A would be better). Before translating, though, applying text preprocessing will help increase the match rate, such as:
So, creating a new row for each school in source A using the preprocessing techniques above, and matching it against source B, should definitely help increase the match rate.
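To make the preprocessing idea concrete, here is a minimal sketch in Python. The specific steps (Unicode normalization, lowercasing, replacing punctuation with spaces, collapsing whitespace, expanding abbreviations) and the `ABBREVIATIONS` table are my own assumptions for illustration, not the list from the comment above. Characters are filtered by Unicode category rather than with `\w`, so that Devanagari vowel signs are kept.

```python
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"sec": "secondary", "sch": "school", "pri": "primary"}


def normalize_name(name: str) -> str:
    """Return a normalized form of a school name for matching."""
    # NFKC folds compatibility variants (e.g. full-width letters).
    text = unicodedata.normalize("NFKC", name).lower()
    # Keep letters (L), combining marks (M) and digits (N); everything
    # else, including punctuation, becomes a space.
    cleaned = "".join(
        ch if unicodedata.category(ch)[0] in "LMN" else " " for ch in text
    )
    # split() also collapses runs of whitespace.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_name("  Shree Janata  Sec. Sch. "))
```

The normalized string could be stored next to the original name in source A and used as the matching key against source B.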
-
With the data we have in hand, there is no way to be confident about the matches or joins we create. These datasets highlight a wider issue: addresses. Why don't we have proper addresses? People navigate from place to place by asking other people. Imagine having a proper address for every single house, building, etc. I know we are very far from that, but the government could at least start using proper addresses for its organisations and services. I would suggest getting the District Education Officers to collate the data in a different way so that you can match and use it with confidence; I am sure they will have plenty of time to do that.
-
I agree with @np-n that text preprocessing might significantly increase the matches. Additionally, we could use Soundex (especially Daitch-Mokotoff Soundex, which suits non-English words better), Metaphone, and similar algorithms, each with some weight, to determine the similarity. I ran an initial draft without preprocessing the data and got these results, which show how few matches come out because of the untidy data. Here it is if anyone wants to look and get the analytics for it. Script used:
Result:
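For anyone who wants to experiment with the weighted phonetic idea, below is a rough sketch. It is not the script referred to above. It uses classic American Soundex as a simpler stand-in for Daitch-Mokotoff, so it only works on Latin-script (transliterated) names, and it blends the phonetic score with `difflib.SequenceMatcher` from the standard library. The function names and the default weight of 0.3 are assumptions.

```python
from difflib import SequenceMatcher

# Classic Soundex digit groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Classic American Soundex code (Latin letters only)."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


def name_similarity(a: str, b: str, phonetic_weight: float = 0.3) -> float:
    """Blend character similarity with token-level Soundex overlap."""
    string_score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    codes_a = {soundex(tok) for tok in a.split()} - {""}
    codes_b = {soundex(tok) for tok in b.split()} - {""}
    union = codes_a | codes_b
    phonetic_score = len(codes_a & codes_b) / len(union) if union else 0.0
    return (1 - phonetic_weight) * string_score + phonetic_weight * phonetic_score


if __name__ == "__main__":
    print(soundex("Janta"), soundex("Janata"))
    print(name_similarity("Janta", "Janata"))
```

The weight would need tuning against manually verified pairs; swapping in a Daitch-Mokotoff or Metaphone implementation only changes the `soundex` function.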
-
Also, some of the Nepali-script school names in the data are not separated by |
-
I don't know what the use case is here, but it would be great if we could bring all these schools onto an open data platform like OpenStreetMap.
-
Output of just 100 matches. It takes quite some time to match. I tried multiprocessing, but that did not quite work as expected. Steps:
What can be done better:
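One possible way to cut the matching time, sketched here under assumptions: compare names only within a "block" of records that share some key, instead of comparing every row of source A with every row of source B. The field names `id`, `name`, and `district` and the threshold are hypothetical; I have not checked which columns the two sources actually share.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def best_matches(source_a, source_b, block_key="district", threshold=0.8):
    """Return (id_a, id_b, score) for the best in-block match of each A row."""
    # Index source B by the blocking key so each A row is compared
    # only against B rows in the same block.
    blocks = defaultdict(list)
    for row in source_b:
        blocks[row[block_key]].append(row)

    matches = []
    for row_a in source_a:
        best, best_score = None, 0.0
        for row_b in blocks.get(row_a[block_key], []):
            score = SequenceMatcher(None, row_a["name"], row_b["name"]).ratio()
            if score > best_score:
                best, best_score = row_b, score
        if best is not None and best_score >= threshold:
            matches.append((row_a["id"], best["id"], round(best_score, 3)))
    return matches


if __name__ == "__main__":
    a = [{"id": "a1", "name": "janata secondary school", "district": "kaski"}]
    b = [
        {"id": "b1", "name": "janata secondary school", "district": "kaski"},
        {"id": "b2", "name": "janata secondary school", "district": "lalitpur"},
    ]
    print(best_matches(a, b))
```

Blocking trades recall for speed: a record with a wrong or missing key never meets its true match, so the key has to be reliable in both sources.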
-
I reviewed the data, and here are my observations and recommendations
-
Why does it have to be a relational database? Is there a possibility of publishing the data in Elasticsearch? The focus then shifts to writing elegant queries to fetch accurate data rather than establishing and maintaining relationships in the data.
-
Can someone help clarify a few points here?
-
As the conversation shifts towards usability and district-level action, a centralized Linux server with a Windows server architecture is necessary. Transitioning all local computers to a Linux-based environment to run applications like Microsoft 365 would provide an easy and scalable solution for future use, including in various ministries. Considering Linux is prudent, especially with the impending End of Life for Windows 10 in 2025. This transition could reduce expenditures and mitigate security issues from hacking groups that utilize AI. It's uncertain if this has been considered in the recent budget speech.
-
Was just walking around and spotted the thread. https://github.com/nowalab/nepali-word-embeddings (for the Nepali language)
-
Hi @sumanashrestha, I've been attempting a few methods to solve this. Here's some progress: Google Sheets link. You can sort the results by the confidence column.
I'll keep working on this to hopefully find better results, but if we want to standardize the task we'll need to construct good evaluation sets. This would also allow leaderboards, etc., if we want to add a competitive edge. The data itself is unclean and not very descriptive, so I don't know how far we can push the accuracy. If you want to take a look at the code, I can open a pull request against the repository.
Edit: If this does feel like a direction worth exploring (and in the absence of an evaluation set), another possible way forward would be for us to collect the ids of verified correct matches in the file above (by manual verification), so that I can exclude them from new experiments. This could be done iteratively to reduce manual mapping. Right now there's a risk that experimenting with data features might cause current potential matches to become unaligned. We can stop iterating when the returns are no longer worth the manual verification effort.
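The iterative loop described in the edit could look roughly like the sketch below. It is only an illustration, not the code behind the sheet; the function name and the id formats are hypothetical.

```python
def remaining_candidates(ids_a, ids_b, verified_pairs):
    """Drop ids that already appear in a manually verified match.

    verified_pairs is an iterable of (id_a, id_b) tuples confirmed by hand.
    """
    pairs = list(verified_pairs)
    done_a = {a for a, _ in pairs}
    done_b = {b for _, b in pairs}
    left_a = [i for i in ids_a if i not in done_a]
    left_b = [i for i in ids_b if i not in done_b]
    return left_a, left_b


if __name__ == "__main__":
    ids_a = ["a1", "a2", "a3"]
    ids_b = ["b1", "b2", "b3"]
    verified = [("a1", "b2")]  # confirmed in an earlier round
    print(remaining_candidates(ids_a, ids_b, verified))
```

Each round, new experiments run only on the remaining ids, so verified matches cannot become unaligned when features change, and the loop can stop once a round yields too few new verified pairs.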
-
Discussion thread for school mapping problem
Problem statement: https://github.com/moest-np/incubator/tree/main/2024-05_school_mapping
UPDATE