sebastianruder · sebastianruder · Jan 15, 2019 · Dec 16, 2018 · Dec 16, 2018 · Dec 17, 2018
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,5 @@
 _site/
 Gemfile*
+venv
+.idea
+structured.json
diff --git a/README.md b/README.md
@@ -135,6 +135,14 @@ These are tasks and datasets that are still missing:
 - Semi-supervised learning
 - Frame-semantic parsing (FrameNet full-sentence analysis)
 
+### Exporting into a structured format
+
+You can extract all the data into a structured, machine-readable JSON format with parsed tasks, descriptions and SOTA tables. 
+
+The instructions are in [structured/README.md](structured/README.md).
+
 ### Instructions for building the site locally
 
 Instructions for building the website locally using Jekyll can be found [here](jekyll_instructions.md).
+
+
diff --git a/english/grammatical_error_correction.md b/english/grammatical_error_correction.md
@@ -14,15 +14,19 @@ The [CoNLL-2014 shared task test set](https://www.comp.nus.edu.sg/~nlp/conll14st
 
 The shared task setting restricts that systems use only publicly available datasets for training to ensure a fair comparison between systems. The highest published scores on the the CoNLL-2014 test set are given below. A distinction is made between papers that report results in the restricted CoNLL-2014 shared task setting of training using publicly-available training datasets only (_**Restricted**_) and those that made use of large, non-public datasets (_**Unrestricted**_).
 
+**Restricted**:
 
 | Model           | F0.5  |  Paper / Source | Code |
 | ------------- | :-----:| --- | :-----: |
-|_**Restricted**_ |      
 | CNN Seq2Seq + Quality Estimation (Chollampatt and Ng, EMNLP 2018) | 56.52 | [Neural Quality Estimation of Grammatical Error Correction](http://aclweb.org/anthology/D18-1274) | [Official](https://github.com/nusnlp/neuqe/) |
 | SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  56.25 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
 | Transformer (Junczys-Dowmunt et al., 2018) | 55.8 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| [Official](https://github.com/grammatical/neural-naacl2018) |
 | CNN Seq2Seq (Chollampatt and Ng, 2018)| 54.79 | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |
-|_**Unrestricted**_  |
+
+**Unrestricted**:
+
+| Model           | F0.5  |  Paper / Source | Code |
+| ------------- | :-----:| --- | :-----: |
 | CNN Seq2Seq + Fluency Boost (Ge et al., 2018) |  61.34 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |
 
 _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.
@@ -32,12 +36,17 @@ _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: use
 
 [Bryant and Ng, 2015](http://aclweb.org/anthology/P15-1068) released 8 additional annotations (in addition to the two official annotations) for the CoNLL-2014 shared task test set ([link](http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip)).
 
+**Restricted**:
+
 | Model           | F0.5  |  Paper / Source | Code |
 | ------------- | :-----:| --- | :-----: |
-|_**Restricted**_     |           
 | SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  72.04 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
 | CNN Seq2Seq (Chollampatt and Ng, 2018)| 70.14 (measured by Ge et al., 2018) | [ A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |
-|_**Unrestricted**_            |
+
+**Unrestricted**:
+
+| Model           | F0.5  |  Paper / Source | Code |
+| ------------- | :-----:| --- | :-----: |
 | CNN Seq2Seq + Fluency Boost (Ge et al., 2018) |  76.88 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |
 
 _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.
@@ -47,13 +56,19 @@ _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: use
 
 [JFLEG test set](https://github.com/keisks/jfleg) released by [Napoles et al., 2017](http://aclweb.org/anthology/E17-2037) consists of 747 English sentences with 4 references for each sentence. Models are evaluated with [GLEU](https://github.com/cnap/gec-ranking/) metric ([Napoles et al., 2016](https://arxiv.org/pdf/1605.02592.pdf)).
 
+**Restricted**:
+
 | Model           | GLEU  |  Paper / Source | Code |
 | ------------- | :-----:| --- | :-----: |
-|_**Restricted**_     |           
 | SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  61.50 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
 | Transformer (Junczys-Dowmunt et al., 2018) | 59.9 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| NA |
 | CNN Seq2Seq (Chollampatt and Ng, 2018)| 57.47 | [ A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |
-|_**Unrestricted**_           |
+
+
+**Unrestricted**:
+
+| Model           | GLEU  |  Paper / Source | Code |
+| ------------- | :-----:| --- | :-----: |
 | CNN Seq2Seq + Fluency Boost and inference (Ge et al., 2018) |  62.37 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |
 
 _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.

diff --git a/structured/README.md b/structured/README.md
@@ -0,0 +1,34 @@
+# Exporting NLP-progress into a structured format
+
+Parse and export the unstructured information from Markdown into a structured JSON format. 
+
+## Installation
+
+Requires Python 3.6+.
+
+Create a virtualenv and install requirements (you can also use conda):
+
+```shell
+virtualenv -p python3 venv
+source venv/bin/activate
+
+pip install -r requirements.txt
+```
+
+## Running
+
+From the NLP-progress root directory (where the LICENCE file is), run:
+
+```shell
+python structured/export.py <one or more directories or files>
+```
+
+For example, to export all the data in the `english/` directory:
+
+```shell
+python structured/export.py english
+```
+
+By default, the output is written to `structured.json`; you can override this with the `--output` parameter.
+
+