WDQS scaling issues meetings in February #1806

dpriskorn · 2022-02-08T07:59:56Z

Context

https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/KPA3CTQG2HCJO55EFZVNINGVFQJAHT4W/

Question

Is it a good idea to participate and deliver our perspective?

Daniel-Mietchen · 2022-02-08T08:57:40Z

@dpriskorn Probably yes.

For the record, this relates to

Assess implications of Blazegraph failure playbook for Scholia #1721
and we are having a hackathon next week, as per
Scholia hackathon on 14 and 16 February 2022 #1807

fnielsen · 2022-02-09T14:56:24Z

WDQS scaling community meeting 1/2: SPARQL query features - Thursday, February 17 · 18:00 UTC
WDQS scaling community meeting 2/2: RDF store backend needs - Monday, February 21 · 18:00 UTC

Daniel-Mietchen · 2022-02-09T15:24:39Z

See also https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Feb_2022_scaling_community_meetings

dpriskorn · 2022-02-10T04:57:06Z

I created the request for a bot flag for So9qBot earlier and met resistance, presumably because of fear of breaking or overloading Blazegraph and thus rendering WDQS unusable for everyone.

Since I helped raise the issue in the Wikidata Telegram channel and with the product manager, a lot has happened.

WMF started analyzing the issue in depth and now we have a disaster playbook. 🥳

WMF also very recently began the process of evaluating alternatives to Blazegraph, and we now have a rough timeline of 2-3 years until the problem is completely solved, assuming a competent team of engineers is dedicated to the task and funded appropriately.

I intend to finish the original idea of asseeibot soon so that it can add new items for each DOI found in Wikipedia, which we are currently missing in Wikidata.
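A minimal sketch of the DOI-harvesting step described above, not asseeibot's actual code. It assumes the page wikitext is already available as a string; the regular expression, `extract_dois`, and `missing_dois` are hypothetical names chosen for illustration, and the pattern only covers the common `10.<registrant>/<suffix>` shape.

```python
import re

# Hypothetical pattern for the common DOI shape "10.<registrant>/<suffix>".
# The suffix stops at whitespace and at characters that usually end a
# template parameter or link in wikitext.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s|}\]<>\"]+")


def extract_dois(wikitext):
    """Return the unique DOIs found in wikitext, lowercased, in first-seen order."""
    seen = {}
    for match in DOI_PATTERN.finditer(wikitext):
        # Strip sentence punctuation that the pattern may have swallowed.
        doi = match.group(0).rstrip(".,;").lower()
        seen.setdefault(doi, None)
    return list(seen)


def missing_dois(found, known):
    """Return the DOIs in `found` that are not in `known` (case-insensitive)."""
    known_lower = {doi.lower() for doi in known}
    return [doi for doi in found if doi.lower() not in known_lower]


if __name__ == "__main__":
    sample = (
        "{{cite journal |doi=10.1000/ABC123 |title=X}} and "
        "{{doi|10.1234/foo.bar}}. Also doi:10.1000/abc123."
    )
    found = extract_dois(sample)
    # `known` stands in for DOIs that already have a Wikidata item.
    print(missing_dois(found, known=["10.1000/abc123"]))
```

DOIs are compared case-insensitively here, so the same paper cited with different capitalization is counted once.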

This bot, if approved by the community, will increase the number of scientific items by 10-15% at a pace the community can control.
If all direct references to the DOIs found are also imported, the number of items will probably double to 80M over time.

I also plan to create an importer for refcat from IA:

This first release of the Refcat dataset contains over 1.3 billion citations extracted from over 60 million metadata records and over 120 million scholarly artifacts [...]

This will bring our collection of papers in Wikidata up to a level around 120M in total. 🎉
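The numbers above can be checked with a little arithmetic. This is only a sketch: the baseline of roughly 40M scientific items is not stated in this thread and is inferred from "double to 80M", so treat it as an assumption.

```python
# Assumed baseline, inferred from "double to 80M"; not a figure given in the thread.
BASELINE_ITEMS = 40_000_000


def growth_range(baseline, low_pct, high_pct):
    """Return the (low, high) number of new items for a percentage range."""
    return (baseline * low_pct // 100, baseline * high_pct // 100)


if __name__ == "__main__":
    low, high = growth_range(BASELINE_ITEMS, 10, 15)
    print(f"New items from DOIs found in Wikipedia: {low:,} to {high:,}")
    print(f"After importing direct references: {2 * BASELINE_ITEMS:,}")
    print(f"Refcat target relative to baseline: {120_000_000 / BASELINE_ITEMS:.1f}x")
```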

Nobody knows whether these two imports will push Blazegraph past the breaking point, but that is not a big issue IMO: only 1% of all queries in WDQS are affected, and the data simply existing, curated and editable by any scientist in the world, will probably be an epic game changer. I predict that a majority of scientists will want to be well represented in our graph within 2-5 years.

If WDQS breaks, the playbook is enacted and we can continue working, but without SPARQL support from WMF during an interim period. This will of course affect Scholia a lot. We would have to set up our own SPARQL endpoint with the data somewhere. My hope is that we will succeed in getting a working endpoint up in less than 2 weeks with the help of IA and others in the Scholia and WikiCite communities.
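On the client side, switching to such an interim endpoint could be as simple as trying a list of endpoints in order. The sketch below is one way to do that and is not how Scholia is implemented: `FALLBACK_ENDPOINT` is a placeholder URL, and `pick_endpoint` and `http_probe` are hypothetical helpers.

```python
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder for a self-hosted endpoint; this URL is an assumption.
FALLBACK_ENDPOINT = "https://sparql.example.org/sparql"


def http_probe(url, timeout=10.0):
    """Return True if the endpoint answers a trivial ASK query with HTTP 200."""
    query = urllib.parse.urlencode({"query": "ASK {}"})
    request = urllib.request.Request(
        f"{url}?{query}",
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical user agent string for illustration.
            "User-Agent": "endpoint-check-sketch/0.1",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError and socket timeouts.
        return False


def pick_endpoint(endpoints, is_up=http_probe):
    """Return the first endpoint for which is_up(url) is true, else None."""
    for url in endpoints:
        if is_up(url):
            return url
    return None
```

Calling `pick_endpoint([WDQS_ENDPOINT, FALLBACK_ENDPOINT])` prefers WDQS and falls back only when the probe fails. The probe is passed in as a parameter so the selection logic can be exercised without network access.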

The upside of this for Scholia is that all the new papers and citations would take us to a level of completeness on par with the closed commercial databases scientists currently use, which completely lack both the openness and empowerment of Wikidata and the graph-powered features for finding author networks and new papers efficiently.

According to a study I read, it takes roughly 3 months on average for a scientific article to be included in Wikipedia, so my bot alone would keep us only 3 months behind the bleeding edge of science publications.

We would have to find another approach to get closer to real time import of scientific papers as they are published.

Daniel-Mietchen · 2022-02-21T13:22:46Z

The notes from the first meeting on Thursday Feb 17 about SPARQL query features sit at https://etherpad.wikimedia.org/p/R5n382Ld0Vvykc7Ak3iH .

I could not attend but @fnielsen and @dpriskorn did, and I went through the notes later on, particularly adding examples of Scholia queries that time out:

The next meeting on RDF store backend needs is today at 18:00 UTC, and I'll try to be there.

Daniel-Mietchen · 2022-02-21T19:11:27Z

The WDQS scaling call just ended, and I found it useful. Notes in https://etherpad.wikimedia.org/p/yPUhyhbmXglC_Magx0Go .

fnielsen · 2022-02-21T20:20:44Z

Blogpost: "What SPARQL keywords do we use in Scholia?"

Daniel-Mietchen · 2022-02-28T17:16:25Z

Official summary: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS-scaling-update-feb-2022

dpriskorn added the question label Feb 8, 2022

Daniel-Mietchen added the community, events, performance, and usability labels Feb 8, 2022

Daniel-Mietchen added this to the 28 February 2022 milestone Feb 8, 2022

Daniel-Mietchen mentioned this issue Feb 16, 2022

Look into using Comunica as meta query engine #1850

Open

Daniel-Mietchen closed this as completed Feb 28, 2022
