-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: 🐛 fix all ImportError exceptions for the current datasets
Datasets sometimes use libraries in their script. We install all of them (as of today). Note that https://github.com/TREMA-UNH/trec-car-tools could not be added using poetry since the package is in subdirectory. See python-poetry/poetry#755.
- Loading branch information
1 parent
1fc1a17
commit 1088f23
Showing
24 changed files
with
4,524 additions
and
98 deletions.
There are no files selected for viewing
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Maven template | ||
target/ | ||
pom.xml.tag | ||
pom.xml.releaseBackup | ||
pom.xml.versionsBackup | ||
pom.xml.next | ||
release.properties | ||
dependency-reduced-pom.xml | ||
buildNumber.properties | ||
.mvn/timing.properties | ||
|
||
# Distribution / packaging | ||
.Python | ||
env/ | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
|
||
# Sphinx documentation | ||
python3/_build/ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
language: python | ||
python: | ||
- "3.5" | ||
before_install: | ||
- sudo apt-get -qq update | ||
- sudo apt-get install -y maven | ||
python: | ||
- "3.6" | ||
- "3.7" | ||
install: | ||
- pip install -r python3/requirements.txt | ||
script: | ||
- pip install python3/ | ||
- pushd trec-car-tools-example; mvn install; popd | ||
|
||
- curl http://trec-car.cs.unh.edu/datareleases/v2.0/test200.v2.0.tar.xz | tar -xJ | ||
- pages=test200/test200-train/train.pages.cbor outlines=test200/test200-train/train.pages.cbor-outlines.cbor paragraphs=test200/test200-train/train.pages.cbor-paragraphs.cbor bash .travis/test.sh | ||
|
||
- curl http://trec-car.cs.unh.edu/datareleases/v1.5/test200-v1.5.tar.xz | tar -xJ | ||
- pages=test200/train.test200.cbor outlines=test200/train.test200.cbor paragraphs=test200/train.test200.cbor.paragraphs bash .travis/test.sh |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
#!/bin/bash | ||
|
||
set -ex | ||
|
||
python3/test.py pages $pages >/dev/null | ||
python3/test.py outlines $outlines >/dev/null | ||
python3/test.py paragraphs $paragraphs >/dev/null | ||
|
||
cd trec-car-tools-example/ | ||
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="header ../$pages" >/dev/null | ||
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="pages ../$pages" >/dev/null | ||
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="outlines ../$outlines" >/dev/null | ||
mvn org.codehaus.mojo:exec-maven-plugin:1.5.0:java -Dexec.mainClass="edu.unh.cs.treccar_v2.read_data.ReadDataTest" -Dexec.args="paragraphs ../$paragraphs" >/dev/null |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
BSD 3-Clause License | ||
|
||
Copyright (c) 2017, Laura Dietz and Ben Gamari | ||
All rights reserved. | ||
|
||
Redistribution and use in source and binary forms, with or without | ||
modification, are permitted provided that the following conditions are met: | ||
|
||
* Redistributions of source code must retain the above copyright notice, this | ||
list of conditions and the following disclaimer. | ||
|
||
* Redistributions in binary form must reproduce the above copyright notice, | ||
this list of conditions and the following disclaimer in the documentation | ||
and/or other materials provided with the distribution. | ||
|
||
* Neither the name of the copyright holder nor the names of its | ||
contributors may be used to endorse or promote products derived from | ||
this software without specific prior written permission. | ||
|
||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | ||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE | ||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE | ||
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | ||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | ||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | ||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,162 @@ | ||
# TREC Car Tools | ||
|
||
[![Travis badge](https://travis-ci.org/TREMA-UNH/trec-car-tools.svg?branch=master)](https://travis-ci.org/TREMA-UNH/trec-car-tools) | ||
|
||
Development tools for participants of the TREC Complex Answer Retrieval track. | ||
|
||
Data release support for v1.5 and v2.0. | ||
|
||
Note that in order to allow to compile your project for two trec-car format versions, the maven artifact Id was changed to `treccar-tools-v2` with version 2.0, and the package path changed to `treccar_v2` | ||
|
||
|
||
Current support for | ||
- Python 3.6 | ||
- Java 1.8 | ||
|
||
If you are using [Anaconda](https://www.anaconda.com/), install the `cbor` | ||
library for Python 3.6: | ||
``` | ||
conda install -c laura-dietz cbor=1.0.0 | ||
``` | ||
|
||
## How to use the Python bindings for trec-car-tools? | ||
|
||
1. Get the data from [http://trec-car.cs.unh.edu](http://trec-car.cs.unh.edu) | ||
2. Clone this repository | ||
3. `python setup.py install` | ||
|
||
Look out for test.py for an example on how to access the data. | ||
|
||
|
||
## How to use the java 1.8 (or higher) bindings for trec-car-tools through maven? | ||
|
||
add to your project's pom.xml file (or similarly gradel or sbt): | ||
|
||
~~~~ | ||
<repositories> | ||
<repository> | ||
<id>jitpack.io</id> | ||
<url>https://jitpack.io</url> | ||
</repository> | ||
</repositories> | ||
~~~~ | ||
|
||
add the trec-car-tools dependency: | ||
|
||
~~~~ | ||
<dependency> | ||
<groupId>com.github.TREMA-UNH</groupId> | ||
<artifactId>trec-car-tools-java</artifactId> | ||
<version>17</version> | ||
</dependency> | ||
~~~~ | ||
|
||
compile your project with `mvn compile` | ||
|
||
|
||
|
||
|
||
## Tool support | ||
|
||
This package provides support for the following activities. | ||
|
||
- `read_data`: Reading the provided paragraph collection, outline collections, and training articles | ||
- `format_runs`: writing submission files | ||
|
||
|
||
## Reading Data | ||
|
||
If you use python or java, please use `trec-car-tools`, no need to understand the following. We provide bindings for haskell upon request. If you are programming under a different language, you can use any CBOR library and decode the grammar below. | ||
|
||
[CBOR](cbor.io) is similar to JSON, but it is a binary format that compresses better and avoids text file encoding issues. | ||
|
||
Articles, outlines, paragraphs are all described with CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through `ParaLink`s. | ||
|
||
|
||
~~~~~ | ||
Page -> $pageName $pageId [PageSkeleton] PageType PageMetadata | ||
PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage | ||
PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors | ||
RedirectNames -> [$pageName] | ||
DisambiguationNames -> [$pageName] | ||
DisambiguationIds -> [$pageId] | ||
CategoryNames -> [$pageName] | ||
CategoryIds -> [$pageId] | ||
InlinkIds -> [$pageId] | ||
InlinkAnchors -> [$anchorText] | ||
PageSkeleton -> Section | Para | Image | ListItem | ||
Section -> $sectionHeading [PageSkeleton] | ||
Para -> Paragraph | ||
Paragraph -> $paragraphId, [ParaBody] | ||
ListItem -> $nestingLevel, Paragraph | ||
Image -> $imageURL [PageSkeleton] | ||
ParaBody -> ParaText | ParaLink | ||
ParaText -> $text | ||
ParaLink -> $targetPage $targetPageId $linkSection $anchorText | ||
~~~~~ | ||
|
||
You can use any CBOR serialization library. Below a convenience library for reading the data into Python (3.5) | ||
|
||
- `./read_data/trec_car_read_data.py` | ||
Python 3.5 convenience library for reading the input data (in CBOR format). | ||
-- If you use anaconda, please install the cbor library with `conda install -c auto cbor=1.0` | ||
-- Otherwise install it with `pypi install cbor` | ||
|
||
## Ranking Results | ||
|
||
Given an outline, your task is to produce one ranking for each section $section (representing an information need in traditional IR evaluations). | ||
|
||
Each ranked element is an (entity,passage) pair, meaning that this passage is relevant for the section, because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section. | ||
|
||
The section is represented by the path of headings in the outline `$pageTitle/$heading1/$heading1.1/.../$section` in URL encoding. | ||
|
||
The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted. | ||
|
||
The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted. | ||
|
||
|
||
The results are provided in a format that is similar to the "trec\_results file format" of [trec_eval](http://trec.nist.gov/trec_eval). More info on how to use [trec_eval](http://stackoverflow.com/questions/4275825/how-to-evaluate-a-search-retrieval-engine-using-trec-eval) and [source](https://github.com/usnistgov/trec_eval). | ||
|
||
Example of ranking format | ||
~~~~~ | ||
Green\_sea\_turtle\Habitat Pelagic\_zone 12345 0 27409 myTeam | ||
$qid $entity $passageId rank sim run_id | ||
~~~~~ | ||
|
||
|
||
|
||
## Integration with other tools | ||
|
||
It is recommended to use the `format_runs` package to write run files. Here an example: | ||
|
||
|
||
with open('runfile', mode='w', encoding='UTF-8') as f: | ||
writer = configure_csv_writer(f) | ||
for page in pages: | ||
for section_path in page.flat_headings_list(): | ||
ranking = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p) for p,s,r in ranking] | ||
format_run(writer, ranking, exp_name='test') | ||
|
||
f.close() | ||
|
||
This ensures that the output is correctly formatted to work with `trec_eval` and the provided qrels file. | ||
|
||
Run [trec_eval](https://github.com/usnistgov/trec_eval/blob/master/README) version 9.0.4 as usual: | ||
|
||
trec_eval -q release.qrel runfile > run.eval | ||
|
||
The output is compatible with the eval plotting package [minir-plots](https://github.com/laura-dietz/minir-plots). For example run | ||
|
||
python column.py --out column-plot.pdf --metric map run.eval | ||
python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval | ||
|
||
Moreover, you can compute success statistics such as hurts/helps or a paired-t-test as follows. | ||
|
||
python hurtshelps.py --metric map run.eval run2.eval | ||
python paired-ttest.py --metric map run.eval run2.eval | ||
|
||
|
||
|
||
|
||
<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/3.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Dataset" property="dct:title" rel="dct:type">TREC-CAR Dataset</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="trec-car.cs.unh.edu" property="cc:attributionName" rel="cc:attributionURL">Laura Dietz, Ben Gamari</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike 3.0 Unported License</a>.<br />Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="www.wikipedia.org" rel="dct:source">www.wikipedia.org</a>. |
Oops, something went wrong.