[update] several edits #92

Merged
merged 2 commits into from
May 12, 2016
Changes from 1 commit
[update] several edits
gsilvapt committed May 11, 2016
commit 9924c474973e68c7f80e035ddf5907ee5315ecca
7 changes: 5 additions & 2 deletions _config.yml
@@ -1,6 +1,6 @@
title: Data Patterns
short_title: Data Patterns
description: "A collection of tips, tricks and patterns for data work."
short_title: Labs Handbook
description: "A collection of guides, counselling and other advise to participate in OKFN Labs"
Member

We never use "OKFN" anymore except in URLs. Should always be "Open Knowledge Labs" ;-)

Member

"counselling" is odd phrasing. I would just have:

"Guides and advice on participating in Open Knowledge Labs and its projects"

baseurl: ""
issues_url: http://github.com/okfn/datapatterns/issues
url: "http://datapatterns.org/"
@@ -21,6 +21,9 @@ devs:
name: Dan Fowler
github: danfowler

name: Gustavo Silva
github: gsilvapt

permalink: pretty
openknowledgeribbon: true

File renamed without changes.
172 changes: 172 additions & 0 deletions core-guides/core-data-curators.md
@@ -0,0 +1,172 @@
---
title: Welcome, Data Curators
---

As mentioned in the introductory page, the key problem with open data is not *creation* but *curation*. The role of a Data Curator is therefore to take data from known (quality-proven and reliable) sources and transform it into data packages.

This guide introduces the key routines we go through before publishing any data package.

# Getting data

The first step is, naturally, finding data to work with. There are two common ways you can do this:

1. Follow the [GitHub issue tracker](https://github.com/datasets/registry/issue) and look for a package with 3-star priority, or

2. Get your own data and start preparing it.

Which one to start with is really up to you. The difference, however, is that some packages are needed more urgently than others, and you can use your knowledge and skills to tackle some of the project's priorities.
Member

line-breaking.

Recommend we do not do line breaks - a single paragraph is a single line.

This is a very minor point and we can correct later - but good to know.

Contributor Author

I think it doesn't change the outlook but I also did that because it is easy to scroll through text in vim. I can fix it though.

Contributor Author

Ahh, I have it working as normal in my blog. Fixing soon.


Other than that, the workflow and requirements are pretty much the same.

## Following urgencies
Member

Odd phrasing in English. Not quite sure what you mean by "urgencies".


The main advantage of working from a previously opened issue is that you are introduced to work that is already under way. Chances are someone has already taken care of finding the best source and converting it to CSV or XLSX, so your work is reduced significantly.

## Your Own Taste

You can also search for topics of your preference, or even suggest them on the issues page if they are not there already.


# Preparing the Data

Now, back to the workflow. This is the troublesome part. The job is, first, to find the best and most reliable source, convert it to CSV or XLSX, and then work on preparing the data package. If you do not know what `CSV` is, we advise you to read the
Member

We never want XLSX I think.

Contributor Author

The audience I thought of for these guides would be non-power users. Personally, I can simply download data from the web, whether it is Excel, CSV or an API. But for a regular user, perhaps things are not that straightforward. I can remove it if you think it is better ;)

Member

The point is we won't be converting to XLSX. Downloading as XLSX is fine but converting to it: no.

[Data Guides](data-guides/).

## Convert to .CSV
Member

Generally we want this scripted.


As mentioned before, the first step is to get a source `csv` file. You can do that, for instance, by clicking the right tab on [World Bank Open Data](http://data.worldbank.org/). See the picture below:

![image]({{ site.url }}/images/export.jpg)

Chances are that many other data sources have a similar export option.
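
Where possible, we prefer this step to be scripted so that others can reproduce it. As a minimal sketch (the source URL is a hypothetical placeholder, and the `archive/` path follows the directory structure described in the next section), the download step could look like this in Python:

```python
# Minimal sketch of a scripted download step; the URL and paths are placeholders.
import requests

SOURCE_URL = "http://example.org/gini-index.csv"  # hypothetical export link for your source
ARCHIVE_PATH = "archive/source.csv"


def download_source(url=SOURCE_URL, destination=ARCHIVE_PATH):
    """Fetch the raw source file and keep an untouched copy under archive/."""
    response = requests.get(url)
    response.raise_for_status()
    with open(destination, "wb") as handle:
        handle.write(response.content)


if __name__ == "__main__":
    download_source()
```

Keeping the raw download next to the processing script makes it easy for others to re-run and verify the whole pipeline.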

## Directory structure
Member

To what extent are we duplicating the general guide to creating a data package? Not a problem right now - let's get this in, but worth keeping in mind.

Contributor Author

Yeap, I mentioned this in some issues. I think I did not understand very clearly what the purposes of both guides are. Sorry for that!


In order to keep everything organized and as "universal" as possible, we have settled on a structure that pleases most of us and that every contributor should follow:

```
dir
    data
        id-name.csv
    archive
        source.csv
    scripts
        README.md
        py.py
        requirements.txt - optional
    README.md
    datapackage.json
    Makefile
```

In summary, place the final CSV you will use for your data package under `data`, and the source data in the `archive` folder. If you need a script to clean and wrangle any part of the dataset, put it under `scripts`, preferably named `process.py` (a preference rather than a strict convention). The `dir/README.md` should contain information about the package, its source and licenses (if applicable), while `scripts/README.md` should describe the script and any particular details about it.

*Note for Python users:* Do not forget to create the `requirements.txt` if you use any special Python package.

As for the Makefile, which is the easiest part, the common Makefile structure is as follows:

```
version="0.1.0"
DATADIR=data
SCRIPTDIR=scripts

all: data

# Recipe lines must be indented with a tab character.
data:
	python $(SCRIPTDIR)/process.py

clean:
	rm -f $(DATADIR)/*

.PHONY: all data clean
```

## The ugly job of quality assurance
Member

Let's keep these titles simpler and with less commentary ;-)

e.g. this could just be:

"Quality Assurance"


At this stage, you are ready to load the source `csv` file and check for any inconsistencies, blank spaces and other things that matter for making the data package machine readable. Here is a list of things you must pay attention to:

* The common structure of these packages is `COUNTRY,YEAR,VALUE`. This is not fixed, though: World Bank Open Data usually provides a structure for which we generally use `COUNTRY,COUNTRY CODE,YEAR,VALUE`.
* We prefer `.` rather than `,` as the decimal separator. We also want to avoid certain symbols such as `%`, `&`, `#`, `;`, `:` and a few others that can interfere with the data package.
* If data is not available, mark that cell as `0` or `NaN`.
* The data package file name should be `your-package.csv`: we use `-` (hyphens) rather than spaces.

By now you can understand why programming skills help here. The workflow at this stage is: 1) download the source `CSV` file, which you can do directly in your programming environment (in Python, R, ...); 2) prepare a small, quick Python script that searches for these small inconsistencies and removes or changes them so that there are no problems in the end; and 3) run the `Makefile` in the terminal (on Linux, simply change to the directory of the package - `cd ~/path/to/package` - and then run `make`; you should see no error messages) to ensure the script works flawlessly.
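
As a rough sketch of what such a cleaning script could look like - using pandas, with hypothetical file paths and a hypothetical `Value` column name that you should adapt to your own source:

```python
# Rough quality-assurance sketch with pandas; paths and column names are placeholders.
import pandas as pd

df = pd.read_csv("archive/source.csv")

# Normalise column names and strip stray whitespace from text cells.
df.columns = [col.strip() for col in df.columns]
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].str.strip()

# Report rows with missing values so they can be set to 0 or NaN deliberately.
missing = df[df.isnull().any(axis=1)]
if not missing.empty:
    print("Rows with missing values:")
    print(missing)

# Make sure the value column is numeric (decimal point, not comma).
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

df.to_csv("data/package-name.csv", index=False)
```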

### Unpivoting Tables

This is usually the most difficult part - initially! If you read the [Data Guides](data-guides/), you know by now what pivot tables are. Even though they are more human-friendly, if you look at one carefully you can see they are the exact opposite of a clean, machine-readable format.

They usually come in a wide tabular format with lots of columns - and most of the cells need some work.

You can work through these manually in Excel, or using Python, R or whatever suite you prefer.

In Python, we recommend using the [pandas](http://pandas.pydata.org/) package, as it eases the work significantly.

To "unpivot" a table, you can simply run the following snippet:

```python
import pandas as pd

# Load the source file and unpivot it: one row per (Country, Year) pair.
df = pd.read_csv('source/source.csv')
df = pd.melt(df, id_vars=['Country'], var_name="Year", value_name="Value")
# Sort by country and year, then write the final package data.
df = df.sort_values(['Country', 'Year'], ascending=[True, True])
df.to_csv('data/package-name.csv', sep=",", index=False)
```

This is the code we usually run to unpivot and reorder the entire frame. It is just an example and it scales: under `id_vars` you can add as many variables as you want, as long as they match columns in the dataframe you have loaded.


## The JSON format

When you have your `CSV` file ready, you will want to create the `datapackage.json` file. There are two ways to do this:

* Manually, which implies you know [JSON](http://www.json.org/) and its structure (you should look at their website).
* Using the [Data Package Manager](https://github.com/okfn-oe/datapackage-validator), which creates the file and the main fields automatically.

In either case, we advise you to go through the document and make sure everything is correct. Some things to go by:

* Value fields should have `number` type.
* Year fields should have `date` type.
* Always add a description if you had to change anything in that field.
* Go to the bottom of the file (if you used the [Data Package Manager](https://github.com/okfn-oe/datapackage-validator)) and make sure the data package name, title and description are correct.
* You can use the GitHub repository link as the homepage.
* You can add a field called `maintainers` if you want to; `"maintainers": [{"name": "", "email": ""}]` is just an example.
* For the first version, you can leave it at `0.1.0`. Future, updated versions should see this changed.
* As for the license, we strongly advise you to use [ODC-PDDL-1.0](http://opendatacommons.org/licenses/pddl/1-0/).
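
For illustration only, here is a sketch of generating a minimal `datapackage.json` from Python along the lines of the checklist above; the package name, field names and repository URL are placeholders, not a definitive template:

```python
# Sketch of a minimal datapackage.json; every name and URL here is a placeholder.
import json

datapackage = {
    "name": "gini-index",
    "title": "GINI Index",
    "description": "GINI index figures, repackaged as a Data Package.",
    "version": "0.1.0",
    "homepage": "https://github.com/your-user/gini-index",
    "licenses": [{"id": "ODC-PDDL-1.0",
                  "url": "http://opendatacommons.org/licenses/pddl/1-0/"}],
    "resources": [{
        "path": "data/gini-index.csv",
        "schema": {"fields": [
            {"name": "Country", "type": "string"},
            {"name": "Year", "type": "date"},
            {"name": "Value", "type": "number"},
        ]},
    }],
}

with open("datapackage.json", "w") as handle:
    json.dump(datapackage, handle, indent=2)
```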


# Preparing to announce your data package

After making sure everything is in place, you are ready to announce your package. Go to the [datasets/registry issues page](https://github.com/datasets/registry/issues) and either look for the issue of the package you set out to resolve or, if you decided to go your own way, open a new issue named after your package.

Stuff to include in your post:
* Package Name (e.g. GINI Index)
* [Data Package Validator link](http://data.okfn.org/tools/validate) - just paste your GitHub repository here and copy the link afterwards
* [Data Package Viewer link](http://data.okfn.org/tools/view) - same as with the validator
* The link to your repository

If you do not have experience working with GitHub, the following topic will cover the basics of working with GitHub and Git
to publish datasets online.


Empty file.
38 changes: 38 additions & 0 deletions core-guides/getting-started.md
@@ -0,0 +1,38 @@
---
title: "Getting started"
---

As mentioned before, this is the place to start if you are interested in participating in the Core Datasets project.

The first and most immediate thing you should do is sign up on the [forum](https://discuss.okfn.org/) and introduce yourself. The forum acts as our central place of communication. However, we also discuss a lot on GitHub, since we keep track of the project's management through the [issues tab](https://github.com/datasets/registry/issues).

[This video](https://vimeo.com/133587259) will help you understand the core functionality of the platform where we host the forum. Worry not - you will find most of it very intuitive and will learn along the way. In fact, feel free to ask for help there if you feel lost.

In order to keep things organized on GitHub, we use a label system. The labels are very straightforward and easy to understand. It is important that you respect them; they are usually assigned by the Core Datasets Managers, a group of people who keep track of the project's needs and steer the project's direction accordingly.

Following the project's direction
Member

For markdown we want to use the ## method for headings rather than underlining.

---------------------------------

The label system also helps users to know where to contribute immediately.

## How to read labels

![image]({{ site.url }}/images/issues.jpg)

The labels have been named in a very intuitive way, with newcomers who want to contribute in mind.

You can find difficulty levels - usually associated with how hard it is to get the data from the source - or whether someone has already dealt with downloading and preparing the data, leaving only the package to prepare. There are other labels too, like `Indicator`, `format`, among others. However, the most crucial one is the priority of that dataset.


At this point, you should have learned how the community is structured and how you can contribute to the Core Datasets Project.

The next couple of tutorials will be slightly more technical, explaining how you can prepare a data package and what you should pay attention to.
Binary file added core-guides/images/create-repository-name.jpg
Binary file added core-guides/images/export.jpg
Binary file added core-guides/images/issues.jpg
Binary file added core-guides/images/remote-v-links.jpg
Binary file added core-guides/images/repo-create.jpg
66 changes: 66 additions & 0 deletions core-guides/index.md
@@ -0,0 +1,66 @@
---
title: Welcome to the Core Datasets guides!
---

If you wish to contribute to the Core Datasets project, this is the place to start. Reading through the [data guides](data-guides/) is also key, especially if you have never worked with data before. These tutorials, courses and instructions will get you to a fairly independent level, enough to start collaborating on some data packages.

## What is Core Datasets? Why is it any different from the rest?

[OKFN Labs](http://okfnlabs.org/) is the base project supporting many other initiatives. The idea is to build the tools needed so that others can create data packages and make them available. After all, that is [OKFN's mission and ultimate goal](https://okfn.org/about/).


One of the main projects is [Core Datasets](http://data.okfn.org/roadmap/core-datasets). The idea is to ensure we have **core** datasets available for everyone. This means we apply [open standards](http://opendefinition.org/), thus ensuring that everyone, or at least their computer, can read those datasets.

In this project, we are creating standardized datasets, in bulk, as CSV with JSON. All this data is grouped in a package and the maintainer must make their data package available - for instance, on [GitHub](https://github.com). This ensures anyone can participate and update each other's packages if needed.

*This project is community-based*, like many other initiatives, so we need your help. And, as we will discuss further on, it is part of the [Frictionless Data project](http://data.okfn.org/).

### Why do we need data packages?

Even though many organizations, public and private, like to say they agree with open principles, the reality is a bit different. In addition, if you need data for a project of yours, you will realize that each organization publishes data in a non-standardized manner, which makes for a very painful experience.

Core Datasets strikes back by providing all its datasets in the same format, following the same rules and standards. This reduces the trouble of finding and getting data, easing the entire experience. Core Datasets also empowers **all** individuals, since we strive to ensure every kind of software can read these data packages - which is why we say these packages must be, at least, *machine readable*. This means that, whether you use Excel, Python, R or whatever software you prefer, we guarantee you will be able to read your data source painlessly.

**Reminder:** We do not create data - it is a matter of curation, not creation. We all know many data sources, but as you will see later in the guides, one of the tasks is actually to find the **most reliable** source.

### Solution

If you want to collaborate on Core Datasets, you will be creating core datasets. That means you will:

* create clean, raw datasets that are easy to import (since we propose the use of CSV and JSON formats);
* create reliable and up-to-date datasets;
* open up knowledge;
* apply a standard structure to information.

Guides
------

The following list covers key aspects you need to understand before participating in this project. You can take these at your own pace.

If you have never worked with data before and you are apprehensive about your skill set, please dig into the [Data Guides](data-guides/).

* [Introduction](intro)
* [Key Principles](key-principles)
* [Who can contribute](who-can-contribute)
* [Get Started](getting-started)
* [Core Datasets Curators](core-data-curators)
* [Working with GitHub](working-with-git)
* [Core Datasets Roadmap](core-datasets-roadmap)
* [Your first package](first-package)
File renamed without changes.
36 changes: 36 additions & 0 deletions core-guides/key-principles.md
@@ -0,0 +1,36 @@
---
title: "Key Principles"
---

Underlying the Core Datasets project, there are a couple of principles that are key to understanding how we do things and why we do them.

Narrow Focus
----------------

Focus on Reference & Indicator data. We are not building something for all data or even most data.

Small
-------

Max 100,000-500,000 rows. Prefer several separate slices to one huge dataset.

Tabular
--------

All data should be in tabular form and serialized as CSV: one record per row, each with a unique id.
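
As a small illustration with made-up values, a table following this principle might look like:

```
id,country,year,value
1,Portugal,2014,34.5
2,Portugal,2015,33.9
3,Spain,2014,34.7
```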

Well Structured
----------

No blank lines at the top of the table, no footnotes inline in rows at the bottom.

Key Additional Information
---------

Simple, but sufficient, basic metadata (what, where, from, what license) plus data structure info (e.g. fields and their types)

Host it on GitHub
----------

So that everyone can participate too!
File renamed without changes.
12 changes: 12 additions & 0 deletions core-guides/who-can-contribute.md
@@ -0,0 +1,12 @@
---
title: "Who can contribute, and how?"
---

Everyone can contribute. There are no special requirements. Even if you are not a so-called power user, you can learn and develop the skills to work independently on new data packages. And, more importantly, we try to cover everything you need to get started!

The recommendation is to read the Data Guides first, which will help you get used to some of the concepts and technologies we use. After that, you will be ready to get involved with the community, and this guide will help you find your feet.

