# [update] several edits #92

@@ -0,0 +1,172 @@

---
title: Welcome, Data Curators
---

As mentioned in the introductory page, the key issue with open data is not a matter of *creation*, but of *curation*. The role of a Data Curator is to take data from known (quality-proven and reliable) sources and transform it into data packages.

This guide introduces the key routines we usually follow before publishing any data package.

# Getting data

The first step is, naturally, finding data to work with. There are two common ways to do this:

1. Follow the [GitHub issue tracker](https://github.com/datasets/registry/issue) and look for a package with 3-star priority, or

2. Get your own data and start preparing it.

Which one you should go for first is really up to you. The difference, however, is that some datasets are more needed than others, and you can use that knowledge and skill to tackle some of the project's priorities.

> **Reviewer:** Line-breaking: recommend we do not do line breaks - a single paragraph is a single line. This is a very minor point and we can correct later, but good to know.
>
> **Author:** I think it doesn't change the outlook, but I also did that because it is easy to scroll through text in vim. I can fix it, though.
>
> **Author:** Ahh, I have it working as normal in my blog. Fixing soon.

Other than that, the workflow and requirements are pretty much the same.

## Following priorities

> **Reviewer:** Odd phrasing in English. Not quite sure what you mean by "urgencies".

The main advantage of working on a previously opened issue is that you are introduced to work that is already under way. Chances are someone has already taken care of finding the best source and converting it to CSV, so your work is reduced significantly.

## Your Own Taste

You can also search for topics of your preference, or even suggest them on the issues page if they are not there already.

# Preparing the Data

Now, back to the workflow. This is the troublesome part. The job is to first find the best and most reliable source, convert it to CSV, and then work on preparing the data package. If you do not know what `CSV` is, we advise you to read the [Data Guides](data-guides/).

> **Reviewer:** We never want XLSX I think.
>
> **Author:** The audience I thought of for these guides would be non-power users. To me, I can simply download data from the web, whether it is Excel, CSV, or an API. But for a regular user, perhaps things are not that straightforward. I can remove it if you think it is better ;)
>
> **Reviewer:** The point is we won't be converting to XLSX. Downloading as XLSX is fine, but converting to it: no.

## Convert to CSV

> **Reviewer:** Generally we want this scripted.

As mentioned before, the first step is to have a source `csv` file. You can get one, for instance, by clicking the right tab in the [World Bank of Data](http://data.worldbank.org/). See the picture below:

 | ||
|
||
Chances are, many world data references have a similar option.

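If the source only offers an Excel download, the conversion itself can be scripted so it stays reproducible. Here is a minimal pandas sketch; the file names are hypothetical, and reading `.xlsx` files requires the `openpyxl` package:

```python
import pandas as pd

# Hypothetical file names: adapt them to your package layout.
# Reading .xlsx files requires the openpyxl package to be installed.
df = pd.read_excel('archive/source.xlsx')
df.to_csv('archive/source.csv', index=False)
```
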
## Directory structure

> **Reviewer:** To what extent are we duplicating the general guide to creating a data package? Not a problem right now - let's get this in - but worth keeping in mind.
>
> **Author:** Yeap, I mentioned this in some issues. I think I did not understand very clearly what the purposes of both guides are. Sorry for that!

To keep everything organized and as "universal" as possible, we have settled on a structure that pleases most of us, and that any contributor should also follow:

```
dir
  data
    id-name.csv
  archive
    source.csv
  scripts
    README.md
    process.py
    requirements.txt - optional
  README.md
  datapackage.json
  Makefile
```

In summary, the final CSV that you will use for your data package goes under `data`, and the source data goes into the `archive` folder. If you need a script to clean and wrangle any part of the dataset, put it under `scripts`, preferably with the name `process.py` (a preference, not a strict convention). The top-level `README.md` should contain information about the package, its source, and licenses (if applicable); `scripts/README.md` should describe the script and any particulars about it.

*Note for Python users:* Do not forget to create the `requirements.txt` if you use any special Python package.

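For instance, a script that relies on pandas for the download and cleaning steps might ship a `requirements.txt` as small as this (the version pins are illustrative):

```
pandas>=1.0
openpyxl>=3.0
```
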
As for the `Makefile`, which is the easiest part, the common structure is as follows:

```
version="0.1.0"

DATADIR=data
SCRIPTDIR=scripts

all: data

data:
	python $(SCRIPTDIR)/process.py

clean:
	rm -f $(DATADIR)/*

.PHONY: all data clean
```

## Quality Assurance

> **Reviewer:** Let's keep these titles simpler and with less commentary ;-) e.g. this could just be "Quality Assurance".

At this stage, you are ready to load the source `csv` file and check for any inconsistencies, blank spaces, and other issues that matter for making the data package machine readable. Here is a list of things you must pay attention to:

* The common structure of these packages is `COUNTRY,YEAR,VALUE`. This is not fixed, though: with the structure the World Bank of Data usually provides, we generally use `COUNTRY,COUNTRY CODE,YEAR,VALUE`.
* We prefer `.` rather than `,` to separate decimal values. We also want to avoid certain symbols, such as `%`, `&`, `#`, `;`, `:`, and a few others that can interfere with the data package.
* If data is not available, mark that cell as `0` or `NaN`.
* The data package file name should be `your-package.csv`: we prefer `-` (hyphens) in place of spaces.

By now you can understand why we use programming skills here. The workflow at this stage is: 1) download the source `CSV` file, which you can do directly in your programming environment (in Python, R, ...); 2) prepare a small, quick Python script to find these small inconsistencies and remove or change them so that nothing breaks in the end; and 3) run the `Makefile` in the terminal to ensure the script works flawlessly (on Linux, change to the package directory - `cd ~/path/to/package` - and run `make`; you should see no error messages).

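As an illustration, a minimal `scripts/process.py` covering the checks above could look like the sketch below; the paths and column names are assumptions and must be adapted to the actual source:

```python
import pandas as pd

# Hypothetical path: the raw download lives under archive/.
df = pd.read_csv('archive/source.csv')

# Normalize headers to the COUNTRY,COUNTRY CODE,YEAR,VALUE structure
# (assumes the source has exactly these four columns, in this order).
df.columns = ['Country', 'Country Code', 'Year', 'Value']

# Drop symbols such as % & # ; : that can interfere with the package.
df['Country'] = df['Country'].str.strip().str.replace('[%&#;:]', '', regex=True)

# Use '.' as the decimal separator and mark unavailable data as 0 (or NaN).
df['Value'] = df['Value'].astype(str).str.replace(',', '.', regex=False)
df['Value'] = pd.to_numeric(df['Value'], errors='coerce').fillna(0)

df.to_csv('data/package-name.csv', index=False)
```
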
### Unpivoting Tables

This is usually the most difficult part - initially! If you read the [Data Guides](data-guides/), you know by now what pivot tables are. Even though they are more human-friendly, if you examine one carefully, you can see they are the exact opposite of a clean, machine-readable format.

They usually come in a tabular format with lots of columns - and most of the cells need some work.

You can work on these manually in Excel, or use Python, R, or whatever suite you prefer.

In Python, we recommend the [pandas](http://pandas.pydata.org/) package, as it eases the work significantly.

To "unpivot" a table, you can simply run the following snippet:

```python
import pandas as pd

df = pd.read_csv('archive/source.csv')
df = pd.melt(df, id_vars=['Country'], var_name='Year', value_name='Value')
df = df.sort_values(['Country', 'Year'], ascending=[True, True])
df.to_csv('data/package-name.csv', sep=',', index=False)
```

This is the code we usually run to unpivot and reorder the entire frame. It is just an example, and it scales: under `id_vars`, you can add as many variables as you want, as long as they have matching data in the dataframe you have loaded.

## The JSON format

When you have your `CSV` file ready, you will surely want to create the `datapackage.json` file. There are two ways to do this:

* Manually, which implies you know [JSON](http://www.json.org/) and its structure (you should look at their website).
* Using the [Data Package Manager](https://github.com/okfn-oe/datapackage-validator), which creates the file and the main fields automatically.

In either case, we advise you to go through the document and make sure everything is correct. Some things to go by:

* Value fields should have the `number` type.
* Year fields should have the `date` type.
* Always add a description if you had to change anything in that field.
* Go to the bottom of the file (if you used the [Data Package Manager](https://github.com/okfn-oe/datapackage-validator)) and make sure the data package name, title, and description are correct.
* You can use the GitHub repository link as the homepage.
* You can add a field called `maintainers` if you want to; `"maintainers": [{"name": "", "email": ""}]` is just an example.
* For the first version, you can leave it as `0.1.0`. Future, updated versions should see this changed.
* As for the license, we truly advise you to use [ODC-PDDL-1.0](http://opendatacommons.org/licenses/pddl/1-0/).

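For reference, a pared-down `datapackage.json` that follows these points might look like the sketch below. Every name, path, and URL is a placeholder, and the exact field layout can vary with the spec version:

```json
{
  "name": "package-name",
  "title": "Your Package Title",
  "description": "What the data covers and which source it comes from.",
  "homepage": "https://github.com/your-user/package-name",
  "version": "0.1.0",
  "license": "ODC-PDDL-1.0",
  "resources": [
    {
      "path": "data/package-name.csv",
      "schema": {
        "fields": [
          {"name": "Country", "type": "string"},
          {"name": "Year", "type": "date"},
          {"name": "Value", "type": "number"}
        ]
      }
    }
  ]
}
```
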
# Preparing to announce your data package

After making sure everything is in place, you are ready to announce your package. Go to the [datasets/registry issues page](https://github.com/datasets/registry/issues) and either look for the package issue you set out to resolve or, if you decided to go on your own, open a new issue titled after your package.

Things to include in your post:

* Package name (e.g. GINI Index)
* [Data Package Validator link](http://data.okfn.org/tools/validate) - just paste your GitHub repository here and copy the link afterwards
* [Data Package Viewer link](http://data.okfn.org/tools/view) - same as with the validator
* The link to your repository

If you do not have experience working with GitHub, the following topic will cover the basics of working with GitHub and Git to publish datasets online.

@@ -0,0 +1,38 @@

---
title: "Getting started"
---

As mentioned before, this is the place to start if you are interested in participating in the Core Datasets project.

The first and most immediate thing you should do is sign up on the [forum](https://discuss.okfn.org/) and introduce yourself. The forums act as our central place of communication. However, we also discuss a lot on GitHub, since we keep track of the project's management through the [issues tab](https://github.com/datasets/registry/issues).

[This video](https://vimeo.com/133587259) will help you understand the core functionalities of the platform where we host the forums. Worry not: you will find most of them very intuitive and will learn along the way. In fact, feel free to ask for help there if you feel lost.

To keep things organized on GitHub, we use the label system. The labels are very straightforward and easy to understand. It is important that you respect them; they are usually assigned by the Core Datasets Managers, a group of people who keep track of the project's needs and try to steer the project accordingly.

Following the project's direction
---------------------------------

> **Reviewer:** For markdown we want to use

The label system also helps users know where to contribute immediately.

## How to read labels

 | ||
|
||
The label tabs have been named in a very intuitive way, with others in mind who may come along wanting to contribute.

You can find difficulty levels - usually associated with how hard it can be to get the data from the source - or whether someone has already dealt with downloading and preparing the data, leaving only the package itself to prepare. There are other indicators too, like `Indicator`, `format`, and others. However, the most crucial one is the priority of that dataset.

At this point, you should have learned how the community is structured and how you can contribute to the Core Datasets Project.

The next couple of tutorials will be slightly more technical, explaining how you can prepare a data package and what you should pay attention to.

@@ -0,0 +1,66 @@

---
title: Welcome to the Core Datasets guides!
---

If you wish to contribute to the Core Datasets project, this is the place to start. Reading through the [data guides](data-guides/) is also key, especially if you have never worked with data before. These tutorials, courses, and instructions will get you to a fairly independent level, enough to start collaborating on some data packages.

## What is Core Datasets? Why is it any different from the rest?

[Open Knowledge Labs](http://okfnlabs.org/) is the base project supporting many other initiatives. The idea is to build the tools needed so that others can create data packages and make them available. After all, that is [Open Knowledge's mission and ultimate goal](https://okfn.org/about/).

One of the main projects is [Core Datasets](http://data.okfn.org/roadmap/core-datasets). The idea is to ensure we have **core** datasets available for everyone. This means we apply [open standards](http://opendefinition.org/), ensuring that everyone, or at least their computer, can read each dataset.

In this project, we are creating standardized datasets, in bulk, as CSV with JSON. All this data is grouped in a package, and the maintainer must make the data package available - for instance, on [GitHub](https://github.com). This ensures anyone can participate and update each other's packages if needed.

*This project is community-based*, as are many other initiatives, so we need your help. And, as we will discuss further on, it is part of the [Frictionless Data Project](http://data.okfn.org/).

### Why do we need data packages?

Even though many organizations, public and private, like to say they agree with open principles, the reality is a bit different. In addition, if you need data for a project of yours, you will realize that each organization publishes data in a non-standardized manner, which makes for a very painful experience.

Core Datasets strikes back by providing datasets in the same format for all of them, following the same rules and standards. This reduces the trouble of finding and getting data, easing the entire experience. Core Datasets also empowers **all** individuals, since we strive to ensure every kind of software can read these data packages - we say these packages must be, at least, *machine readable*. This means that whether you use Excel, Python, R, or whatever software you prefer, we guarantee you will be able to read your data source painlessly.

**Reminder:** We do not create data - it is a matter of curation, not creation. We all know many data sources, but as you will see later in the guides, one of the tasks is actually to find the **most reliable** source.

### Solution

If you want to collaborate on Core Datasets, you will be creating core datasets. That means you will:

* Create clean, raw datasets that are easy to import (since we propose the use of the CSV and JSON formats);
* Create reliable and up-to-date datasets;
* Open up knowledge;
* Apply a standard structure to information.

Guides
------

The following list covers the key aspects you have to understand before participating in this project. You can take these at your own pace.

If you have never worked with data before and are apprehensive about your skill set, please dig into the [Data Guides](data-guides/).

* [Introduction](intro)
* [Key Principles](key-principles)
* [Who can contribute](who-can-contribute)
* [Get Started](getting-started)
* [Core Datasets Curators](core-data-curators)
* [Working with GitHub](working-with-git)
* [Core Datasets Roadmap](core-datasets-roadmap)
* [Your first package](first-package)

@@ -0,0 +1,36 @@

---
title: "Key Principles"
---

Underlying the Core Datasets project are a couple of principles that are key to understanding how we do things and why we do them.

Narrow Focus
------------

Focus on Reference & Indicator data. We are not building something for all data or even most data.

Small
-----

A maximum of 100,000-500,000 rows. Prefer several separate slices to one huge dataset.

Tabular
-------

All data should have tabular form and be serialized as CSV: one record per row, each with a unique id.

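For example, a few made-up rows showing the shape:

```
id,country,year,value
1,Afghanistan,2019,48.7
2,Albania,2019,78.5
3,Algeria,2019,76.1
```
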
Well Structured
---------------

No blank lines at the top of the table, no footnotes inline in rows at the bottom.

Key Additional Information
--------------------------

Simple but sufficient basic metadata (what, where, from where, under what license) plus data structure info (e.g. fields and their types).

Host it on GitHub
-----------------

So that everyone can participate too!

@@ -0,0 +1,12 @@

---
title: "Who and how can I contribute"
---

Everyone can contribute; there are no special requirements. Even if you are not a so-called power user, you can learn and develop the skills to work independently on new data packages. And, more importantly, we try to cover everything you need to get started!

We recommend reading the Data Guides first; they will help you get used to some of the concepts and technologies we use. After that, you will be ready to get involved with the community, and this guide will help you find your feet.

> **Reviewer:** We never use "OKFN" anymore except in URLs. Should always be "Open Knowledge Labs" ;-)
>
> **Reviewer:** "counselling" is odd phrasing. I would just have: "Guides and advice on participating in Open Knowledge Labs and its projects".