adding missing annotations to corpus #670
Hi @caifand! Thank you for tracking all these removed annotations. I did indeed remove them all during my full corpus review. However, I left the unannotated texts in the TEI XML file as a source of interesting negative examples. Here is a quick review of these cases to explain why I removed them:
"The Ethereum network is currently the leader in the field of smart contracts." Ethereum is a cryptocurrency, like Bitcoin, so I considered that it is similarly not software. Ethereum is a complete infrastructure with virtual machines, a network, etc. So, since Bitcoin is not annotated, it would be inconsistent to annotate Ethereum as software. However, particular pieces of software within these infrastructures are annotated: see document 10.20955%2Fr.2018.1-16, where "Bitcoin wallet" is actually software used to store Bitcoins and is annotated.
"It is written in BASIC, a close analogue to FORTRAN." As a general rule, I did not annotate programming languages per se (written in BASIC, in FORTRAN, ...), but I did annotate software tools for a language (like a C compiler, a Java virtual machine, etc.).
PubMed MEDLINE is a database/web platform (it refers to the data and the service)
"Data come from the Integrated Public Use Microdata Series (IPUMS) database" Integrated Public Use Microdata Series (IPUMS) is a database and an online platform.
"Two kinds of web applications have a presence in the market. Some depart- wo kinds of web applications have a presence in the market. Some departments are in institutions that use a university-wide platform," Weird stuff... It talks about web platforms in a very generic way; in particular, it does not refer to a specific piece of software.
"already in SCOP90 (SCOP version 1.55,<90%sequence identity non-redundant set)" What was annotated is "SCOP", and this is a database: the Structural Classification of Proteins (SCOP) database.
"The Gram-negative coccobacilli were initially identified as Pasturella pneumotropica by the VITEK 2 system, software version 06.01 (BioMerieux, France) using the GN card, with bionumber 0001010210040001 and an excellent identification (probabil-ity99%)." This one is tough, I remember! What is annotated is not software; it is a medical/laboratory device that contains software. But other mentions of the same device are not annotated: "Analysis performed using VITEK 2 (BioMerieux, France) and agar dilution as per the Clinical and Laboratory Standards Institute (1)" So I prioritized consistency with the rest of the corpus.
"Source: Datastream, own calculations" Datastream is a database. There is no mention of how the author made the "own calculations" on the data.
"Specifically, the data are CIF imports measured in US$, taken from International Financial Statistics’ Direction of Trade CD-ROM, deflated by U.S. CPI for All Urban Consumers (CPI-U), all items, 1982 to 1984 = 100." "Direction of Trade CD-ROM" is not software.
"This information, communicated via WOM or eWOM ,1" The footnote indicates that this is not a software but a generic name for social web platforms talking about commercial products.
Elsevier "ScienceDirect" is a web platform.
"Table 7 summarizes the TAIR 9 annotations (TAIR, 2009) for allthree groups of a total of 226 predicatedASM regions." I considered TAIR as a database (https://www.arabidopsis.org/index.jsp). There are tools/software for exploiting the TAIR database (see tools on the web site), so there is ground to distinguish them from the database.
data/corpus/softcite_corpus.tei.xml contains the following articles that have no software annotations in them (article_with_no_mention_in_softcite_corpus_tei_xml.csv):
10.20955%2Fr%2F2018.1-16
10.1257%2Fjep.4.1.99
PMC4863732
10.1002%2Fpam.22030
10.1257%2Fjep.24.4.187
PMC1538888
PMC4644012
10.1007%2Fs10290-016-0264-y
10.1111%2Fcaje.12091
10.1111%2Fjems.12230
10.1257%2F089533002320951064
PMC3371863
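A list like the one above could be produced with a short script. The following is only a minimal sketch, not the procedure actually used for this issue. It assumes that each article is a `TEI` element in the TEI namespace, that software mentions are marked up as `<rs type="software">`, and that the article identifier is stored as an `xml:id` attribute on `fileDesc`. These markup details, the function name `documents_without_software`, and the sample data are illustrative assumptions and should be checked against the real corpus file.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed namespaces; verify against the actual corpus file.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def documents_without_software(corpus_xml: str) -> List[str]:
    """Return identifiers of TEI documents with no <rs type="software"> element."""
    root = ET.fromstring(corpus_xml)
    missing = []
    for doc in root.iter(f"{{{TEI_NS}}}TEI"):
        # Assumption: a software mention is an <rs> element with type="software".
        has_software = any(
            rs.get("type") == "software" for rs in doc.iter(f"{{{TEI_NS}}}rs")
        )
        if not has_software:
            # Assumption: the article identifier is the xml:id of <fileDesc>.
            file_desc = doc.find(f".//{{{TEI_NS}}}fileDesc")
            doc_id = file_desc.get(XML_ID) if file_desc is not None else None
            missing.append(doc_id or "(no id)")
    return missing


# Hypothetical two-document corpus, for illustration only.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc xml:id="doc-a"/></teiHeader>
    <text><body><p>We used <rs type="software">ExampleTool</rs>.</p></body></text></TEI>
  <TEI><teiHeader><fileDesc xml:id="doc-b"/></teiHeader>
    <text><body><p>Data come from a database.</p></body></text></TEI>
</teiCorpus>"""

if __name__ == "__main__":
    print(documents_without_software(SAMPLE))
```

For the real corpus, the file contents would be read from data/corpus/softcite_corpus.tei.xml and passed to the same function.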
Here is a list of manual annotations in the articles listed above, extracted from our RDF data, to be added to the TEI XML corpus.
I've noticed that quite a few of them are annotated as software entities with a low certainty score.
In particular, the articles 10.1257%2F089533002320951064 and PMC4863732 (currently in softcite_corpus.tei.xml) only have annotations that are not software entities (marked as web_platform during our review). So we can actually remove these two articles from softcite_corpus.tei.xml, and perhaps from other corpora correspondingly. @kermitt2 Could you help take a look? Thanks!