adding missing annotations to corpus #670

Open
caifand opened this issue Sep 25, 2020 · 1 comment

Comments

@caifand
Contributor

caifand commented Sep 25, 2020

data/corpus/softcite_corpus.tei.xml contains the following articles that have no software annotations in them:

article_with_no_mention_in_softcite_corpus_tei_xml.csv
10.20955%2Fr%2F2018.1-16
10.1257%2Fjep.4.1.99
PMC4863732
10.1002%2Fpam.22030
10.1257%2Fjep.24.4.187
PMC1538888
PMC4644012
10.1007%2Fs10290-016-0264-y
10.1111%2Fcaje.12091
10.1111%2Fjems.12230
10.1257%2F089533002320951064
PMC3371863
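
A list like the one above could be produced with a short script. The following is only a minimal sketch, not the script actually used: it assumes each article is a `<TEI>` element in the TEI namespace, that software mentions are marked as `<rs type="software">`, and that the article identifier sits in an `<idno>` element in the header. The `<idno>` location, the function name, and the sample data (including `PMC0000000` and `ExampleTool`) are hypothetical and should be adapted to the real layout of softcite_corpus.tei.xml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

TEI_NS = "http://www.tei-c.org/ns/1.0"


def articles_without_software(corpus_xml: str) -> list[str]:
    """Return decoded ids of <TEI> entries with no <rs type="software">."""
    root = ET.fromstring(corpus_xml)
    missing = []
    for tei in root.iter(f"{{{TEI_NS}}}TEI"):
        # Assumption: the first <idno> in the entry holds the article id.
        idno = tei.find(f".//{{{TEI_NS}}}idno")
        raw_id = (idno.text or "").strip() if idno is not None else ""
        # Assumption: software mentions are encoded as <rs type="software">.
        has_software = any(
            rs.get("type") == "software" for rs in tei.iter(f"{{{TEI_NS}}}rs")
        )
        if not has_software:
            # Ids such as 10.1257%2Fjep.4.1.99 are percent-encoded DOIs.
            missing.append(unquote(raw_id))
    return missing


# Hypothetical two-article corpus, for illustration only.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <teiHeader><fileDesc><sourceDesc>
      <idno type="origin">10.1257%2Fjep.4.1.99</idno>
    </sourceDesc></fileDesc></teiHeader>
    <text><body><p>It is written in BASIC.</p></body></text>
  </TEI>
  <TEI>
    <teiHeader><fileDesc><sourceDesc>
      <idno type="origin">PMC0000000</idno>
    </sourceDesc></fileDesc></teiHeader>
    <text><body><p>We used <rs type="software">ExampleTool</rs>.</p></body></text>
  </TEI>
</teiCorpus>"""

print(articles_without_software(SAMPLE))
```

On the real file, the same function could be called with the contents of data/corpus/softcite_corpus.tei.xml read as a string.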

Here is a list of the manual annotations in the articles listed above, extracted from our RDF data, to be added to the TEI XML corpus.
I've noticed that quite a few of them are annotated as software entities with a low certainty score.

In particular, the articles 10.1257%2F089533002320951064 and PMC4863732 (currently in softcite_corpus.tei.xml) only have annotations that are not software entities (marked as web_platform during our review). So we could remove these two articles from softcite_corpus.tei.xml, and perhaps from the other corpora as well.
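
For illustration, articles whose annotations are all non-software could be picked out as in the sketch below. It assumes the annotations have already been exported from the RDF data as `(article_id, label)` pairs. That export format, the function name, the `software` label, and the `PMC_hypothetical` record are all assumptions made for the example; only `web_platform` and the two article ids come from the review described above.

```python
from collections import defaultdict


def articles_with_only_non_software(annotations, software_label="software"):
    """Return sorted ids of articles that have no annotation labelled as software."""
    labels_by_article = defaultdict(set)
    for article_id, label in annotations:
        labels_by_article[article_id].add(label)
    return sorted(
        article_id
        for article_id, labels in labels_by_article.items()
        if software_label not in labels
    )


# Illustrative records only; the real annotations live in the RDF data.
records = [
    ("10.1257%2F089533002320951064", "web_platform"),
    ("PMC4863732", "web_platform"),
    ("PMC_hypothetical", "software"),
    ("PMC_hypothetical", "web_platform"),
]

print(articles_with_only_non_software(records))
```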

@kermitt2 Could you help take a look? Thanks!

@kermitt2
Member

Hi @caifand !

Thank you for tracking all these removed annotations. I did indeed remove them all during my full corpus review. However, I left the un-annotated texts in the TEI XML file so that the corpus includes some interesting negative examples.

Here is a quick review of these cases to explain why I removed them:

  • 10.20955%2Fr%2F2018.1-16

"The Ethereum network is currently the leader in the field of smart contracts."

Ethereum is a cryptocurrency, like Bitcoin, so I considered that it is similarly not software. Ethereum is a complete infrastructure with virtual machines, a network, etc. Since Bitcoin is not annotated, it would be inconsistent to annotate Ethereum as software.

However, specific software within this kind of infrastructure is annotated: see document 10.20955%2Fr.2018.1-16, where "Bitcoin wallet" is software used to store Bitcoins and is annotated.

  • 10.1257%2Fjep.4.1.99

"It is written in BASIC, a close analogue to FORTRAN."

As a general rule, I did not annotate programming languages per se (written in BASIC, in FORTRAN, ...), only software tools for a language (such as a C compiler, a Java virtual machine, etc.).

  • PMC4863732

PubMed MEDLINE is a database/web platform (it refers to the data and the service)

  • 10.1002%2Fpam.22030

"Data come from the Integrated Public Use Microdata Series (IPUMS) database"

Integrated Public Use Microdata Series (IPUMS) is a database and an online platform.

  • 10.1257%2Fjep.24.4.187

"Two kinds of web applications have a presence in the market. Some departments are in institutions that use a university-wide platform,"

This one is odd: it talks about web platforms in a very generic way, not about a specific piece of software.

  • PMC1538888

"already in SCOP90 (SCOP version 1.55, <90% sequence identity non-redundant set)"

What was annotated is "SCOP" and this is a database - Structural Classification of Proteins (SCOP) database.

  • PMC4644012

"The Gram-negative coccobacilli were initially identified as Pasturella pneumotropica by the VITEK 2 system, software version 06.01 (BioMerieux, France) using the GN card, with bionumber 0001010210040001 and an excellent identification (probability 99%)."

This one is tough, I remember! What is annotated is not software; it is a medical/laboratory device that contains software.
https://www.biomerieux-usa.com/clinical/vitek-2-healthcare
Device names of this kind are not annotated in the rest of the corpus, so I did the same here. The issue with this particular snippet is that the mention of the device includes a software version number (implicitly, the version of the software embedded in the device).

But other mentions of the same device are not annotated:

"Analysis performed using VITEK 2 (BioMerieux, France) and agar dilution as per the Clinical and Laboratory Standards Institute (1)"

So I prioritized consistency with the rest of the corpus.

  • 10.1007%2Fs10290-016-0264-y

"Source: Datastream, own calculations"

Datastream is a database. There is no mention of how the author made the "own calculations" on the data.

  • 10.1111%2Fcaje.12091

"Specifically, the data are CIF imports measured in US$, taken from International Financial Statistics’ Direction of Trade CD-ROM, deflated by U.S. CPI for All Urban Consumers (CPI-U), all items, 1982 to 1984 = 100."

"Direction of Trade CD-ROM" -> not a software

  • 10.1111%2Fjems.12230

"This information, communicated via WOM or eWOM ,1"

The footnote indicates that this is not a software but a generic name for social web platforms talking about commercial products.

1 WOM refers to product-related commentary shared between friends, family, neighbors, etc. Moreover, advances in information technology and the digital revolution both facilitate and amplify the exchange of information on products via social networking sites and other online fora, such as Facebook, Twitter, forums, blogs, etc., referred to as eWOM.

  • 10.1257%2F089533002320951064

Elsevier "ScienceDirect" is a web platform.

  • PMC3371863

"Table 7 summarizes the TAIR 9 annotations (TAIR, 2009) for all three groups of a total of 226 predicated ASM regions."

I considered TAIR as a database (https://www.arabidopsis.org/index.jsp). There are tools/software for exploiting the TAIR database (see tools on the web site), so there is ground to distinguish them from the database.
