[update] several edits #92

Merged
merged 2 commits into from
May 12, 2016
Changes from 1 commit
[update] several edits
gsilvapt committed May 11, 2016
commit 9924c474973e68c7f80e035ddf5907ee5315ecca
7 changes: 5 additions & 2 deletions _config.yml
@@ -1,6 +1,6 @@
title: Data Patterns
short_title: Data Patterns
description: "A collection of tips, tricks and patterns for data work."
short_title: Labs Handbook
description: "A collection of guides, counselling and other advise to participate in OKFN Labs"
Member

We never use "OKFN" anymore except in URLs. Should always be "Open Knowledge Labs" ;-)

Member

"counselling" is odd phrasing. I would just have:

"Guides and advice on participating in Open Knowledge Labs and its projects"

baseurl: ""
issues_url: http://github.com/okfn/datapatterns/issues
url: "http://datapatterns.org/"
@@ -21,6 +21,9 @@ devs:
name: Dan Fowler
github: danfowler

name: Gustavo Silva
github: gsilvapt

permalink: pretty
openknowledgeribbon: true

File renamed without changes.
172 changes: 172 additions & 0 deletions core-guides/core-data-curators.md
@@ -0,0 +1,172 @@
---
title: Welcome, Data Curators
---

As mentioned in the introductory page, the key problem with open data is not *creation* but *curation*. The role of a Data Curator is therefore to take data from known (quality-proven and reliable) sources and transform it into data packages.

This guide introduces the key routines we go through before publishing any data package.

# Getting data

The first step is, naturally, finding data to work with. There are two common ways you can do this:

1. Follow the [GitHub issue tracker](https://github.com/datasets/registry/issue) and look for a package with 3-star priority, or

2. Get your own data and start preparing it.

Which one to start with is really up to you. The difference, however, is that some packages are needed more urgently than others, and you can use your knowledge and skills to tackle some of the project's priorities.
Member

line-breaking.

Recommend we do not do line breaks - a single paragraph is a single line.

This is a very minor point and we can correct later - but good to know.

Contributor Author

I think it doesn't change the outlook but I also did that because it is easy to scroll through text in vim. I can fix it though.

Contributor Author

Ahh, I have it working as normal in my blog. Fixing soon.


Other than that, the workflow and requirements are pretty much the same.

## Following urgencies
Member

Odd phrasing in English. Not quite sure what you mean by "urgencies".


The main advantage of working from a previously opened issue is that you are introduced to work that is already under way. Chances are someone has already taken care of finding the best source and converting it to CSV or XLSX, so your work is reduced significantly.

## Your Own Taste

You can also search for topics of your preference, or even suggest them on the issues page if they are not there already.


# Preparing the Data

Now, back to the workflow. This is the troublesome part. The job is, first, to find the best and most reliable source, convert it to CSV or XLSX, and then work on preparing the data package. If you do not know what `CSV` is, we advise you to read the
Member

We never want XLSX I think.

Contributor Author

The audience I thought of for these guides would be non-power users. Personally, I can simply download data from the web, whether it is Excel, CSV or an API. But for a regular user, perhaps things are not that straightforward. I can remove it if you think it is better ;)

Member

The point is we won't be converting to XLSX. Downloading as XLSX is fine but converting to it: no.

[Data Guides](data-guides/).

## Convert to .CSV
Member

Generally we want this scripted.


As mentioned before, the first step is to get a source `csv` file. You can do that, for instance, by clicking the right tab on [World Bank Open Data](http://data.worldbank.org/). See the picture below:

![image]({{ site.url }}/images/export.jpg)

Chances are that many other data sources have a similar export option.
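
Where possible, we prefer this step to be scripted so that others can reproduce it. As a minimal sketch (the source URL is a hypothetical placeholder, and the `archive/` path follows the directory structure described in the next section), the download step could look like this in Python:

```python
# Minimal sketch of a scripted download step; the URL and paths are placeholders.
import requests

SOURCE_URL = "http://example.org/gini-index.csv"  # hypothetical export link for your source
ARCHIVE_PATH = "archive/source.csv"


def download_source(url=SOURCE_URL, destination=ARCHIVE_PATH):
    """Fetch the raw source file and keep an untouched copy under archive/."""
    response = requests.get(url)
    response.raise_for_status()
    with open(destination, "wb") as handle:
        handle.write(response.content)


if __name__ == "__main__":
    download_source()
```

Keeping the raw download next to the processing script makes it easy for others to re-run and verify the whole pipeline.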

## Directory structure
Member

To what extent are we duplicating the general guide to creating a data package? Not a problem right now - let's get this in, but worth keeping in mind.

Contributor Author

Yeap, I mentioned this in some issues. I think I did not understand very clearly what the purposes of both guides are. Sorry for that!


In order to keep everything organized and as "universal" as possible, we have settled on a structure that pleases most of us and that every contributor should follow:

```
dir
    data
        id-name.csv
    archive
        source.csv
    scripts
        README.md
        py.py
        requirements.txt - optional
    README.md
    datapackage.json
    Makefile
```

In summary, place the final CSV you will use for your data package under `data`, and the source data in the `archive` folder. If you need a script to clean and wrangle any part of the dataset, put it under `scripts`, preferably named `process.py` (a preference rather than a strict convention). The `dir/README.md` should contain information about the package, its source and licenses (if applicable), while `scripts/README.md` should describe the script and any particular details about it.

*Note for Python users:* Do not forget to create the `requirements.txt` if you use any special Python package.

As for the Makefile, which is the easiest part, the common Makefile structure is as follows:

```
version="0.1.0"
DATADIR=data
SCRIPTDIR=scripts

all: data

# Recipe lines must be indented with a tab character.
data:
	python $(SCRIPTDIR)/process.py

clean:
	rm -f $(DATADIR)/*

.PHONY: all data clean
```

## The ugly job of quality assurance
Member

Let's keep these titles simpler and with less commentary ;-)

e.g. this could just be:

"Quality Assurance"


At this stage, you are ready to load the source `csv` file and check for any inconsistencies, blank spaces and other things that matter for making the data package machine readable. Here is a list of things you must pay attention to:

* The common structure of these packages is `COUNTRY,YEAR,VALUE`. This is not fixed, though: World Bank Open Data usually provides a structure for which we generally use `COUNTRY,COUNTRY CODE,YEAR,VALUE`.
* We prefer `.` rather than `,` as the decimal separator. We also want to avoid certain symbols such as `%`, `&`, `#`, `;`, `:` and a few others that can interfere with the data package.
* If data is not available, mark that cell as `0` or `NaN`.
* The data package file name should be `your-package.csv`: we use `-` (hyphens) rather than spaces.

By now you can understand why programming skills help here. The workflow at this stage is: 1) download the source `CSV` file, which you can do directly in your programming environment (in Python, R, ...); 2) prepare a small, quick Python script that searches for these small inconsistencies and removes or changes them so that there are no problems in the end; and 3) run the `Makefile` in the terminal (on Linux, simply change to the directory of the package - `cd ~/path/to/package` - and then run `make`; you should see no error messages) to ensure the script works flawlessly.
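
As a rough sketch of what such a cleaning script could look like - using pandas, with hypothetical file paths and a hypothetical `Value` column name that you should adapt to your own source:

```python
# Rough quality-assurance sketch with pandas; paths and column names are placeholders.
import pandas as pd

df = pd.read_csv("archive/source.csv")

# Normalise column names and strip stray whitespace from text cells.
df.columns = [col.strip() for col in df.columns]
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].str.strip()

# Report rows with missing values so they can be set to 0 or NaN deliberately.
missing = df[df.isnull().any(axis=1)]
if not missing.empty:
    print("Rows with missing values:")
    print(missing)

# Make sure the value column is numeric (decimal point, not comma).
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

df.to_csv("data/package-name.csv", index=False)
```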

### Unpivoting Tables

This is usually the most difficult part - initially! If you read the [Data Guides](data-guides/), you know by now what pivot tables are. Even though they are more human-friendly, if you look at one carefully you can see they are the exact opposite of a clean, machine-readable format.

They usually come in a wide tabular format with lots of columns - and most of the cells need some work.

You can work through these manually in Excel, or using Python, R or whatever suite you prefer.

In Python, we recommend using the [pandas](http://pandas.pydata.org/) package, as it eases the work significantly.

To "unpivot" a table, you can simply run the following snippet:

```python
import pandas as pd

# Load the source file and unpivot it: one row per (Country, Year) pair.
df = pd.read_csv('source/source.csv')
df = pd.melt(df, id_vars=['Country'], var_name="Year", value_name="Value")
# Sort by country and year, then write the final package data.
df = df.sort_values(['Country', 'Year'], ascending=[True, True])
df.to_csv('data/package-name.csv', sep=",", index=False)
```

This is the code we usually run to unpivot and reorder the entire frame. It is just an example and it scales: under `id_vars` you can add as many variables as you want, as long as they match columns in the dataframe you have loaded.


## The JSON format

When you have your `CSV` file ready, you will want to create the `datapackage.json` file. There are two ways to do this:

* Manually, which implies you know [JSON](http://www.json.org/) and its structure (you should look at their website).
* Using the [Data Package Manager](https://github.com/okfn-oe/datapackage-validator), which creates the file and the main fields automatically.

In either case, we advise you to go through the document and make sure everything is correct. Some things to go by:

* Value fields should have `number` type.
* Year fields should have `date` type.
* Always add a description if you had to change anything in that field.
* Go to the bottom of the file (if you used the [Data Package Manager](https://github.com/okfn-oe/datapackage-validator)) and make sure the data package name, title and description are correct.
* You can use the GitHub repository link as the homepage.
* You can add a field called `maintainers` if you want to; `"maintainers": [{"name": "", "email": ""}]` is just an example.
* For the first version, you can leave it at `0.1.0`. Future, updated versions should see this changed.
* As for the license, we strongly advise you to use [ODC-PDDL-1.0](http://opendatacommons.org/licenses/pddl/1-0/).
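
For illustration only, here is a sketch of generating a minimal `datapackage.json` from Python along the lines of the checklist above; the package name, field names and repository URL are placeholders, not a definitive template:

```python
# Sketch of a minimal datapackage.json; every name and URL here is a placeholder.
import json

datapackage = {
    "name": "gini-index",
    "title": "GINI Index",
    "description": "GINI index figures, repackaged as a Data Package.",
    "version": "0.1.0",
    "homepage": "https://github.com/your-user/gini-index",
    "licenses": [{"id": "ODC-PDDL-1.0",
                  "url": "http://opendatacommons.org/licenses/pddl/1-0/"}],
    "resources": [{
        "path": "data/gini-index.csv",
        "schema": {"fields": [
            {"name": "Country", "type": "string"},
            {"name": "Year", "type": "date"},
            {"name": "Value", "type": "number"},
        ]},
    }],
}

with open("datapackage.json", "w") as handle:
    json.dump(datapackage, handle, indent=2)
```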


# Preparing to announce your data package

After making sure everything is in place, you are ready to announce your package. Go to the [datasets/registry issues page](https://github.com/datasets/registry/issues) and either look for the issue of the package you set out to resolve or, if you decided to go your own way, open a new issue named after your package.

Stuff to include in your post:
* Package Name (e.g. GINI Index)
* [Data Package Validator link](http://data.okfn.org/tools/validate) - just paste your GitHub repository here and copy the link afterwards
* [Data Package Viewer link](http://data.okfn.org/tools/view) - same as with the validator
* The link to your repository

If you do not have experience working with GitHub, the following topic will cover the basics of working with GitHub and Git
to publish datasets online.


Empty file.
38 changes: 38 additions & 0 deletions core-guides/getting-started.md
@@ -0,0 +1,38 @@
---
title: "Getting started"
---

As mentioned before, this is the place to start if you are interested in participating in the Core Datasets project.

The first and most immediate thing you should do is sign up on the [forum](https://discuss.okfn.org/) and introduce yourself. The forum acts as our central place of communication. However, we also discuss a lot on GitHub, since we keep track of the project's management through the [issues tab](https://github.com/datasets/registry/issues).

[This video](https://vimeo.com/133587259) will help you understand the core functionality of the platform where we host the forum. Worry not - you will find most of it very intuitive and will learn along the way. In fact, feel free to ask for help there if you feel lost.

In order to keep things organized on GitHub, we use a label system. The labels are very straightforward and easy to understand. It is important that you respect them; they are usually assigned by the Core Datasets Managers, a group of people who keep track of the project's needs and steer the project's direction accordingly.

Following the project's direction
Member

For markdown we want to use the ## method for headings rather than underlining.

---------------------------------

The label system also helps users to know where to contribute immediately.

## How to read labels

![image]({{ site.url }}/images/issues.jpg)

The labels have been named in a very intuitive way, with newcomers who want to contribute in mind.

You can find difficulty levels - usually associated with how hard it is to get the data from the source - or whether someone has already dealt with downloading and preparing the data, leaving only the package to prepare. There are other labels too, like `Indicator`, `format`, among others. However, the most crucial one is the priority of that dataset.


At this point, you should have learned how the community is structured and how you can contribute to the Core Datasets Project.

The next couple of tutorials will be slightly more technical, explaining how you can prepare a data package and what you should pay attention to.
Binary file added core-guides/images/create-repository-name.jpg
Binary file added core-guides/images/export.jpg
Binary file added core-guides/images/issues.jpg
Binary file added core-guides/images/remote-v-links.jpg
Binary file added core-guides/images/repo-create.jpg
66 changes: 66 additions & 0 deletions core-guides/index.md
@@ -0,0 +1,66 @@
---
title: Welcome to the Core Datasets guides!
---

If you wish to contribute to the Core Datasets project, this is the place to start. Reading through the [data guides](data-guides/) is also key, especially if you have never worked with data before. These tutorials, courses and instructions will get you to a fairly independent level, enough to start collaborating on some data packages.

## What is Core Datasets? Why is it any different from the rest?

[OKFN Labs](http://okfnlabs.org/) is the base project supporting many other initiatives. The idea is to build the tools needed so that others can create data packages and make them available. After all, that is [OKFN's mission and ultimate goal](https://okfn.org/about/).


One of the main projects is [Core Datasets](http://data.okfn.org/roadmap/core-datasets). The idea is to ensure we have **core** datasets available for everyone. This means we apply [open standards](http://opendefinition.org/), thus ensuring that everyone, or at least their computer, can read those datasets.

In this project, we are creating standardized datasets, in bulk, as CSV with JSON. All this data is grouped in a package and the maintainer must make their data package available - for instance, on [GitHub](https://github.com). This ensures anyone can participate and update each other's packages if needed.

*This project is community-based*, like many other initiatives, so we need your help. And, as we will discuss further on, it is part of the [Frictionless Data project](http://data.okfn.org/).

### Why do we need data packages?

Even though many organizations, public and private, like to say they agree with open principles, the reality is a bit different. In addition, if you need data for a project of yours, you will realize that each organization publishes data in a non-standardized manner, which makes for a very painful experience.

Core Datasets strikes back by providing all its datasets in the same format, following the same rules and standards. This reduces the trouble of finding and getting data, easing the entire experience. Core Datasets also empowers **all** individuals, since we strive to ensure every kind of software can read these data packages - which is why we say these packages must be, at least, *machine readable*. This means that, whether you use Excel, Python, R or whatever software you prefer, we guarantee you will be able to read your data source painlessly.

**Reminder:** We do not create data - it is a matter of curation, not creation. We all know many data sources, but as you will see later in the guides, one of the tasks is actually to find the **most reliable** source.

### Solution

If you want to collaborate on Core Datasets, you will be creating core datasets. That means you will:

* create clean, raw datasets that are easy to import (since we propose the use of CSV and JSON formats);
* create reliable and up-to-date datasets;
* open up knowledge;
* apply a standard structure to information.

Guides
------

The following list covers key aspects you need to understand before participating in this project. You can take these at your own pace.

If you have never worked with data before and you are apprehensive about your skill set, please dig into the [Data Guides](data-guides/).

* [Introduction](intro)
* [Key Principles](key-principles)
* [Who can contribute](who-can-contribute)
* [Get Started](getting-started)
* [Core Datasets Curators](core-data-curators)
* [Working with GitHub](working-with-git)
* [Core Datasets Roadmap](core-datasets-roadmap)
* [Your first package](first-package)
File renamed without changes.
36 changes: 36 additions & 0 deletions core-guides/key-principles.md
@@ -0,0 +1,36 @@
---
title: "Key Principles"
---

Underlying the Core Datasets project, there are a couple of principles that are key to understanding how we do things and why we do them.

Narrow Focus
----------------

Focus on Reference & Indicator data. We are not building something for all data or even most data.

Small
-------

Max 100,000-500,000 rows. Prefer several separate slices to one huge dataset.

Tabular
--------

All data should be in tabular form and serialized as CSV: one record per row, each with a unique id.
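
As a small illustration with made-up values, a table following this principle might look like:

```
id,country,year,value
1,Portugal,2014,34.5
2,Portugal,2015,33.9
3,Spain,2014,34.7
```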

Well Structured
----------

No blank lines at the top of the table, no footnotes inline in rows at the bottom.

Key Additional Information
---------

Simple, but sufficient, basic metadata (what, where, from, what license) plus data structure info (e.g. fields and their types)

Host it on GitHub
----------

So that everyone can participate too!
File renamed without changes.
12 changes: 12 additions & 0 deletions core-guides/who-can-contribute.md
@@ -0,0 +1,12 @@
---
title: "Who can contribute, and how?"
---

Everyone can contribute. There are no special requirements. Even if you are not a so-called power user, you can learn and develop the skills to work independently on new data packages. And, more importantly, we try to cover everything you need to get started!

The recommendation is to read the Data Guides first, which will help you get used to some of the concepts and technologies we use. After that, you will be ready to get involved with the community, and this guide will help you find your feet.

