WDQS scaling issues meetings in February #1806

Closed
dpriskorn opened this issue Feb 8, 2022 · 8 comments
Labels
community things related to the Scholia community events events relevant to Scholia performance the way Scholia treats the machines using it question something looking for an answer usability trying to minimize bad experiences while using Scholia

Comments

@dpriskorn

Context

https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/KPA3CTQG2HCJO55EFZVNINGVFQJAHT4W/

Question

Is it a good idea to participate and deliver our perspective?

@dpriskorn dpriskorn added the question something looking for an answer label Feb 8, 2022
@Daniel-Mietchen Daniel-Mietchen added community things related to the Scholia community events events relevant to Scholia performance the way Scholia treats the machines using it usability trying to minimize bad experiences while using Scholia labels Feb 8, 2022
@Daniel-Mietchen Daniel-Mietchen added this to the 28 February 2022 milestone Feb 8, 2022
@Daniel-Mietchen
Member

@dpriskorn Probably yes.

For the record, this relates to

@fnielsen
Collaborator

fnielsen commented Feb 9, 2022

  1. WDQS scaling community meeting 1/2: SPARQL query features - Thursday, February 17 · 18:00 UTC
  2. WDQS scaling community meeting 2/2: RDF store backend needs - Monday, February 21 · 18:00 UTC

@Daniel-Mietchen
Member

@dpriskorn
Author

dpriskorn commented Feb 10, 2022

I created the request for a bot flag for So9qBot earlier and met resistance, presumably because of fear of breaking or overloading Blazegraph and thus rendering WDQS unusable for everyone.

Since I helped raise the issue in the Wikidata Telegram channel and with the product manager, a lot has happened.

WMF started analyzing the issue in depth and now we have a disaster playbook. 🥳

WMF also very recently began the process of evaluating alternatives to Blazegraph, and now we have a rough timeline of 2-3 years until the problem is completely solved, assuming a competent team of engineers is dedicated to the task and funded appropriately.

I intend to finish the original idea of asseeibot soon so that it can add new items for each DOI found in Wikipedia, which we are currently missing in Wikidata.
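
As a rough illustration of the DOI-harvesting step described above, here is a minimal sketch. It is not asseeibot's actual code: the function name `extract_dois` and the assumption that DOIs appear as a `doi=` parameter inside citation templates are hypothetical choices made for this example.

```python
import re

# Assumption: DOIs appear in citation templates as "|doi=10.xxxx/suffix".
# The pattern stops at whitespace, "|" or "}" so template syntax is not captured.
DOI_PARAM = re.compile(r"\|\s*doi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE)


def extract_dois(wikitext: str) -> list[str]:
    """Return the unique DOIs found in wikitext, in order of first appearance.

    DOIs are case-insensitive, so they are upper-cased here to make
    case variants of the same DOI deduplicate.
    """
    seen: dict[str, None] = {}
    for match in DOI_PARAM.finditer(wikitext):
        seen.setdefault(match.group(1).upper(), None)
    return list(seen)
```

Each DOI returned this way would still need to be checked against Wikidata before a new item is created, which is outside the scope of this sketch.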

This bot, if approved by the community, will increase the number of scientific items by 10-15% at a pace the community can control.
If all direct references to the DOIs found are also imported, the number of items will probably double to 80M over time.

I also plan to create an importer for refcat from IA:

This first release of the Refcat dataset contains over 1.3 billion citations extracted from over 60 million metadata records and over 120 million scholarly artifacts [...]

This will bring our collection of papers in Wikidata up to a level around 120M in total. 🎉

Nobody knows whether these two imports will push Blazegraph past the breaking point, but IMO that is not a big issue: only 1% of all queries in WDQS are affected, and simply having the data exist, curated and editable by any scientist in the world, will probably be an epic game changer. I predict that a majority of scientists will want to be well represented in our graph within 2-5 years.

If WDQS breaks, the playbook is enacted and we can continue working, but without SPARQL support from WMF during an interim period. This will of course affect Scholia a lot. We would have to set up our own SPARQL endpoint with the data somewhere. My hope is that we will succeed in getting a working endpoint up in less than 2 weeks with the help of IA and others in the Scholia and WikiCite communities.
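
To make the endpoint switch concrete, here is a small sketch of how a client could build SPARQL request URLs against a configurable endpoint. The helper `build_query_url` is hypothetical and is not how Scholia is actually wired. It assumes the endpoint accepts queries via the standard `query` GET parameter and, like WDQS, a `format=json` parameter; other stores may need an `Accept` header instead.

```python
from urllib.parse import urlencode

# The public WDQS endpoint; a self-hosted replacement would be passed in instead.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(query: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Build a GET URL that submits `query` to the given SPARQL endpoint."""
    params = urlencode({"query": query, "format": "json"})
    return f"{endpoint}?{params}"
```

With the endpoint kept as a single parameter, moving to an interim endpoint would be a configuration change rather than a change to every query.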

The upside of this for Scholia is that all the new papers and citations would take us to a level of completeness on par with the closed commercial databases currently used by scientists, which completely lack both the openness and empowerment of Wikidata and the graph-powered features for finding author networks and new papers efficiently.

According to a study I read, it takes roughly 3 months on average for a scientific article to be included in Wikipedia, so my bot alone would keep us only 3 months behind the bleeding edge of science publications.

We would have to find another approach to get closer to real time import of scientific papers as they are published.

@Daniel-Mietchen
Member

The notes from the first meeting on Thursday Feb 17 about SPARQL query features sit at https://etherpad.wikimedia.org/p/R5n382Ld0Vvykc7Ak3iH .

I could not attend but @fnielsen and @dpriskorn did, and I went through the notes later on, particularly adding examples of Scholia queries that time out:

The next meeting on RDF store backend needs is today at 18:00 UTC, and I'll try to be there.

@Daniel-Mietchen
Member

The WDQS scaling call just ended, and I found it useful. Notes in https://etherpad.wikimedia.org/p/yPUhyhbmXglC_Magx0Go .

@fnielsen
Collaborator

Blogpost: "What SPARQL keywords do we use in Scholia?"

@Daniel-Mietchen
Member

Official summary: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS-scaling-update-feb-2022
