Consistency: annotate implementation & environment as separate entities? #642

Open
caifand opened this issue Oct 23, 2019 · 8 comments

@caifand
Contributor

caifand commented Oct 23, 2019

This example is from #637

The 95% CIs of the difference of percentage changes were evaluated using the <rs type="software">R package proCIs</rs>.

Here, "R package proCIs" is annotated as a single entity.
In most other cases, however, the package and the environment are annotated separately. For instance:

Hierarchical clustering and heatmap plots were generated with <rs type="software" xml:id="PMC4478705-software-3">R</rs> (<rs corresp="#PMC4478705-software-3" type="creator">R Development Core Team</rs>, <rs corresp="#PMC4478705-software-3" type="version">2012</rs>) using the library '<rs type="software">seriation</rs>'

<rs type="software">Monmlp</rs> is the implementation of ANN in <rs type="software">R</rs>.

Thus, <rs type="software">rgp</rs> is an implementation of GP methods in the <rs type="software">R</rs> environment. <ref type="bibr">29</ref> Package <rs type="software">rgp</rs> results are simple representations of the problem without being exposed to a priori information.

The package <rs type="software">fscaret</rs> allows semiautomatic feature selection, working as a wrapper for the caret package in <rs type="software">R</rs>.

(In the final example, the package name caret is missing its annotation :)

To me it's reasonable to annotate the software environment and the package separately.
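
To make the "separate entities" reading concrete, here is a minimal sketch of how the annotations in the second example above could be read back into software entities with their linked attributes. It assumes the sentence is wrapped in a hypothetical `<s>` element so it parses as well-formed XML, and the function name `collect_entities` is made up for illustration; it is not part of any existing tooling for this dataset.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <s> wrapper is an assumption, added only so the fragment parses.
SENTENCE = (
    "<s>Hierarchical clustering and heatmap plots were generated with "
    '<rs type="software" xml:id="PMC4478705-software-3">R</rs> '
    '(<rs corresp="#PMC4478705-software-3" type="creator">R Development Core Team</rs>, '
    '<rs corresp="#PMC4478705-software-3" type="version">2012</rs>) '
    "using the library '<rs type=\"software\">seriation</rs>'</s>"
)


def collect_entities(xml_text):
    """Return one dict per software mention, with attributes linked via corresp."""
    root = ET.fromstring(xml_text)
    entities = []
    by_id = {}
    # First pass: every rs of type "software" is its own entity.
    for rs in root.iter("rs"):
        if rs.get("type") == "software":
            entity = {"name": rs.text, "attributes": {}}
            entities.append(entity)
            ident = rs.get(XML_ID) or rs.get("id")
            if ident:
                by_id[ident] = entity
    # Second pass: attach creator/version spans to the entity they point at.
    for rs in root.iter("rs"):
        target = rs.get("corresp")
        if target and target.lstrip("#") in by_id:
            by_id[target.lstrip("#")]["attributes"][rs.get("type")] = rs.text
    return entities


for entity in collect_entities(SENTENCE):
    print(entity["name"], entity["attributes"])
```

With separate annotation, R and seriation come out as two entities, and the creator and version attach only to R.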

@caifand
Contributor Author

caifand commented Oct 23, 2019

A more granular case is commands/functions within a programming environment, which are close to individually authored scripts. They are not consistently annotated in the dataset at the moment (although the supporting language environment often is). Some existing examples:

We used the <rs type="software">MATLAB</rs> command <rs type="software">fmin- search</rs> with multiple starting points to compute the maximum likelihood estimate for this value.
 linear regression with robust standard errors using the <rs type="software">STATA</rs> command "cluster (cluster variable)"was used-which relaxes the independence assumption and requires only that the observations should be independent across the clusters (STATA 2013)

Would we want to leave them to crowd judgment?

@caifand
Contributor Author

caifand commented Oct 23, 2019

Similarly, the concern about annotating programming languages may be addressed under this category of issues because:

  1. They are not consistently annotated in the current dataset.
  2. @kermitt2 keeps the ones that refer to an implementation framework in the dataset and excludes the ones that do not serve this function (if I understand correctly).

Then what about Java in this case?

<p>The Java GUI interface of <rs type="software">FastPval</rs> is shown in <ref type="figure">Supplementary Figure S</ref>2a-c. In the 'Method' field, the user can either choose '<rs type="software">FastPval</rs>' or the traditional 'Exact' method to calculate P-values.

Thinking about future annotation, the way we currently decide which of these are valid annotations is still open to subjective interpretation (i.e., do annotators understand the programming language as some sort of framework? They first need to interpret the function of the programming language as implied by the textual context). We can, though, give some examples to prompt such understanding.

@caifand
Contributor Author

caifand commented Oct 29, 2019

The same issue applies to mentions of unnamed "chunks of code" implemented in a certain software environment (borrowed from #637):

Data analysis and model fitting were performed using <rs type="software">custom scripts</rs> written in <rs id="software-1" type="software">Igor Pro</rs> <rs corresp="#software-1" type="version">6</rs> (<rs corresp="#software-1" type="creator">WaveMetrics</rs>).</p>

Second, since <rs type="software">Matlab</rs> <rs id="software-0" type="software">routines</rs> applying Bayesian methods to the spatial lag, spatial error and spatial ...

@jameshowison
Contributor

jameshowison commented Oct 29, 2019

I think the principle applied here should be socio-technical :) Ultimately we are interested in improving credit for software contributions, including motivating sharing and coalescence.

So I see three general categories (which have different names in different ecosystems).

  1. "Included code": code that is part of some other code and always distributed with it (e.g., the print function in R or Python).
  2. "Distributed code": code that is separately distributed (e.g., readr in R). Note that distribution might be via a package manager, a download, or even code printed in an appendix (that could be copied into a file and used by a potential user).
  3. "Personal code": code that is separate from its platform but not yet distributed (e.g., personal scripts).

I propose that we do not annotate "Included code" as software_name, but we do code "Distributed code" and "Personal code".
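
As a rough sketch only, the proposed rule could be written down like this. The enum, the function name, and the example mapping are hypothetical, introduced here purely for illustration; deciding which category a mention falls into is still the annotator's judgment call.

```python
from enum import Enum


class CodeCategory(Enum):
    INCLUDED = "included"        # always shipped as part of other code
    DISTRIBUTED = "distributed"  # separately distributed
    PERSONAL = "personal"        # separate from its platform, not yet distributed


def annotate_as_software_name(category):
    """Apply the proposal: skip included code, annotate the other two."""
    return category is not CodeCategory.INCLUDED


# Examples taken from the categories above (the mapping itself is illustrative).
EXAMPLES = {
    "print": CodeCategory.INCLUDED,
    "readr": CodeCategory.DISTRIBUTED,
    "custom scripts": CodeCategory.PERSONAL,
}

for mention, category in EXAMPLES.items():
    print(mention, annotate_as_software_name(category))
```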

@jameshowison
Contributor

And programming languages or frameworks should be coded (since they are distributed and should be credited).

@kermitt2
Member

The reasons why I separated a programming language introduced as an aspect of the implementation of a mentioned software from a programming language mentioned on its own as a framework are actually very practical:

  • it can be helpful for disambiguating a software mention to know its programming language as an attribute (like knowing its creator); for instance, names like "Axiom" or "Atom" are ambiguous software names, but unique for a given programming language

  • it's hard to have 2 fields of the same type when labeling a software entity, so if we have two software names where one is the programming language and the second one is the main software implemented in this programming language, it's complicated to encode and to predict for a sequence labeling algorithm
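
A small sketch of the second point, assuming a plain BIO token-labeling scheme. The label names, in particular a separate `language` field, and the helper `bio_labels` are assumptions made for illustration; they are not the labels actually used by the model.

```python
def bio_labels(tokens, spans):
    """Turn (start, end, field) token spans into one BIO label per token."""
    labels = ["O"] * len(tokens)
    for start, end, field in spans:
        labels[start] = "B-" + field
        for i in range(start + 1, end):
            labels[i] = "I-" + field
    return labels


tokens = ["using", "the", "R", "package", "proCIs"]

# Both names labeled with the same field: the label sequence alone cannot
# say which one is the language and which one is the main software.
same_type = bio_labels(tokens, [(2, 3, "software"), (4, 5, "software")])

# Language carried as its own (hypothetical) field of the software mention.
with_attribute = bio_labels(tokens, [(2, 3, "language"), (4, 5, "software")])

print(same_type)
print(with_attribute)
```

In the first labeling the two fields are indistinguishable by type, which is the encoding and prediction difficulty described above; the second keeps them apart at the cost of an extra field.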

@caifand
Contributor Author

caifand commented Nov 26, 2019

@kermitt2 Per your second point above, would it make sense to annotate the programming language and the software as two entities? What would be the concern? For example, would it be technically knotty to have one entity serve as an attribute of another entity in the serialized corpus?

@jameshowison
Contributor

jameshowison commented Nov 27, 2019 via email
