adding missing annotations to corpus #670

caifand · 2020-09-25T21:01:06Z

data/corpus/softcite_corpus.tei.xml contains the following articles that have no software annotations in them:

article_with_no_mention_in_softcite_corpus_tei_xml.csv
10.20955%2Fr%2F2018.1-16
10.1257%2Fjep.4.1.99
PMC4863732
10.1002%2Fpam.22030
10.1257%2Fjep.24.4.187
PMC1538888
PMC4644012
10.1007%2Fs10290-016-0264-y
10.1111%2Fcaje.12091
10.1111%2Fjems.12230
10.1257%2F089533002320951064
PMC3371863
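
A list like the one above could be reproduced with a short script. The following is only a minimal sketch under assumptions that are not confirmed in this thread: that each article is a `TEI` element inside a `teiCorpus` root in the TEI namespace, that software mentions are marked as `<rs type="software">`, and that the article identifier sits in the first `idno` element. The sample document and its identifiers (`doc-a`, `doc-b`, `ExampleTool`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus uses the standard TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"


def articles_without_software(xml_text):
    """Return the ids of TEI entries that contain no <rs type="software">."""
    root = ET.fromstring(xml_text)
    missing = []
    for tei in root.iter(TEI + "TEI"):
        has_software = any(
            rs.get("type") == "software" for rs in tei.iter(TEI + "rs")
        )
        if not has_software:
            # Assumption: the first <idno> holds the article identifier.
            idno = tei.find(".//" + TEI + "idno")
            missing.append(idno.text if idno is not None else None)
    return missing


# Hypothetical two-article corpus used only for illustration.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc><idno>doc-a</idno></fileDesc></teiHeader>
    <text><body><p>We used <rs type="software">ExampleTool</rs>.</p></body></text></TEI>
  <TEI><teiHeader><fileDesc><idno>doc-b</idno></fileDesc></teiHeader>
    <text><body><p>Data come from a database.</p></body></text></TEI>
</teiCorpus>"""

print(articles_without_software(SAMPLE))
```

For the real file, the same function could be applied to the contents of data/corpus/softcite_corpus.tei.xml, after adjusting the element and attribute names to whatever the corpus actually uses.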

Here is a list of the manual annotations in the articles listed above, extracted from our RDF data, to be added to the TEI XML corpus.
I've noticed that quite a few of them are annotated as software entities with a low certainty score.

In particular, the articles 10.1257%2F089533002320951064 and PMC4863732 (currently in softcite_corpus.tei.xml) only have annotations that are not software entities (they were marked as web_platform during our review). So we can actually remove these two articles from softcite_corpus.tei.xml, and perhaps from the other corpora as well.
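
If the two articles were to be removed, it could be done along these lines. This is a sketch only, assuming (without confirmation from the corpus itself) that each article is a `TEI` element directly under the corpus root, in the TEI namespace, and that its identifier appears as the text of the first `idno` element in the same percent-encoded form used in this issue. The function name `drop_articles` and the sample corpus are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI = "{" + TEI_URI + "}"
# Serialize with TEI as the default namespace instead of an "ns0:" prefix.
ET.register_namespace("", TEI_URI)


def drop_articles(xml_text, ids_to_drop):
    """Return the corpus XML with the listed articles removed."""
    root = ET.fromstring(xml_text)
    # Copy the child list first so removal does not disturb iteration.
    for tei in list(root.findall(TEI + "TEI")):
        # Assumption: the first <idno> holds the article identifier.
        idno = tei.find(".//" + TEI + "idno")
        if idno is not None and idno.text in ids_to_drop:
            root.remove(tei)
    return ET.tostring(root, encoding="unicode")


# Hypothetical three-article corpus used only for illustration.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc><idno>10.1257%2F089533002320951064</idno></fileDesc></teiHeader></TEI>
  <TEI><teiHeader><fileDesc><idno>PMC4863732</idno></fileDesc></teiHeader></TEI>
  <TEI><teiHeader><fileDesc><idno>PMC1538888</idno></fileDesc></teiHeader></TEI>
</teiCorpus>"""

TO_DROP = {"10.1257%2F089533002320951064", "PMC4863732"}
cleaned = drop_articles(SAMPLE, TO_DROP)
print(cleaned)
```

Working on a copy of the file and diffing the result would make it easy to confirm that only the intended entries disappear.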

@kermitt2 Could you help take a look? Thanks!


kermitt2 · 2020-09-29T18:59:25Z

Hi @caifand !

Thank you for tracking down all these removed annotations. I did indeed remove them all during my full corpus review. However, I left the unannotated texts in the TEI XML file as a way to keep a few interesting negative examples.

Here is a quick review of these cases to explain why I removed them:

10.20955%2Fr%2F2018.1-16

"The Ethereum network is currently the leader in the field of smart contracts."

Ethereum is a cryptocurrency, like Bitcoin, so I considered that it is similarly not software. Ethereum is a complete infrastructure with virtual machines, a network, etc. Since Bitcoin is not annotated, it would be inconsistent to annotate Ethereum as software.

However, specific software within these infrastructures is annotated: see document 10.20955%2Fr.2018.1-16, where "Bitcoin wallet" is actually software used to store Bitcoins and is annotated.

10.1257%2Fjep.4.1.99

"It is written in BASIC, a close analogue to FORTRAN."

As a general rule, I did not annotate programming languages per se (written in BASIC, in FORTRAN, ...), only software tools for a programming language (like a C compiler, a Java virtual machine, etc.).

PMC4863732

PubMed MEDLINE is a database/web platform (it refers to the data and the service)

10.1002%2Fpam.22030

"Data come from the Integrated Public Use Microdata Series (IPUMS) database"

Integrated Public Use Microdata Series (IPUMS) is a database and an online platform.

10.1257%2Fjep.24.4.187

"Two kinds of web applications have a presence in the market. Some departments are in institutions that use a university-wide platform,"

Weird stuff... it talks about web platforms, and this is very generic; in particular, it does not refer to a specific piece of software.

PMC1538888

"already in SCOP90 (SCOP version 1.55,<90%sequence identity non-redundant set)"

What was annotated is "SCOP" and this is a database - Structural Classification of Proteins (SCOP) database.

PMC4644012

"The Gram-negative coccobacilli were initially identified as Pasturella pneumotropica by the VITEK 2 system, software version 06.01 (BioMerieux, France) using the GN card, with bionumber 0001010210040001 and an excellent identification (probability 99%)."

This one is tough, I remember! What is annotated is not software; it's a medical/laboratory device that contains software.
https://www.biomerieux-usa.com/clinical/vitek-2-healthcare
This kind of device name is not annotated in the rest of the corpus, so basically I did the same... The issue with this particular snippet is that the mention of the device includes a software version number (implicitly, the version number of the device's software).

But other mentions of the same device are not annotated:

"Analysis performed using VITEK 2 (BioMerieux, France) and agar dilution as per the Clinical and Laboratory Standards Institute (1)"

So I prioritized consistency with the rest of the corpus.

10.1007%2Fs10290-016-0264-y

"Source: Datastream, own calculations"

Datastream is a database. There is no mention of how the author performed the "own calculations" on the data.

10.1111%2Fcaje.12091

"Specifically, the data are CIF imports measured in US$, taken from International Financial Statistics’ Direction of Trade CD-ROM, deflated by U.S. CPI for All Urban Consumers (CPI-U), all items, 1982 to 1984 = 100."

"Direction of Trade CD-ROM" -> not a software

10.1111%2Fjems.12230

"This information, communicated via WOM or eWOM ,1"

The footnote indicates that this is not a software but a generic name for social web platforms talking about commercial products.

1 WOM refers to product-related commentary shared between friends, family, neighbors, etc. Moreover, advances in information technology and the digital revolution both facilitate and amplify the exchange of information on products via social networking sites and other online fora, such as Facebook, Twitter, forums, blogs, etc., referred to as eWOM.

10.1257%2F089533002320951064

Elsevier "ScienceDirect" is a web platform.

PMC3371863

"Table 7 summarizes the TAIR 9 annotations (TAIR, 2009) for all three groups of a total of 226 predicated ASM regions."

I considered TAIR a database (https://www.arabidopsis.org/index.jsp). There are tools/software for exploiting the TAIR database (see the tools on the website), so there are grounds to distinguish them from the database.
