[UP] Suggestion to fix issue 79 and 71 #83

Merged
merged 4 commits into from
May 4, 2016
4 changes: 4 additions & 0 deletions _layouts/pattern.html
@@ -0,0 +1,4 @@
---
layout: default
type: pattern
---
70 changes: 70 additions & 0 deletions glossary/index.md
@@ -0,0 +1,70 @@
---
title: Glossary
---

## Open Data
Open data is data that can be used, reused and redistributed freely by anyone for any purpose. More details can be found at [`opendefinition.org`](http://www.opendefinition.org/).

## Machine-readable
Machine-readable formats are ones from which computer programs can easily extract data. PDF documents are not machine readable: computers can display the text nicely, but have great difficulty understanding the context that surrounds it.

## BitTorrent
BitTorrent is a protocol for distributing the bandwidth for transferring very large files between the computers which are participating in the transfer. Rather than downloading a file
from a specific source, BitTorrent allows peers to download from each other.

## CSV
Comma-separated values. A very simple, open format for tabular data which can be exported and imported by all spreadsheet applications and is easily manipulable with command line tools.

## curl
[curl](http://curl.haxx.se/) - a command line tool for transferring data to and from online systems over standard internet protocols including FTP and HTTP. Very powerful and great for working with `Web API`s from the command line.

## DAP
See `Data Access Protocol`.

## Data Access Protocol
A system that allows outsiders to be granted access to databases without overloading either system.

## Etherpad
A piece of software for collaborative real-time editing of text. See [http://etherpad.org/](http://etherpad.org/).

## Attribution Licence
A licence that requires attributing the original source of the licensed material.

## Attribution License
See `Attribution Licence`.

## API
See `Application Programming Interface`.

## Application Programming Interface
A way computer programmes talk to one another. Can be understood in terms of how a programmer sends instructions between programmes.

## Web API
An `API` that is designed to work over the Internet.

## Share-alike License
See `Share-alike Licence`.

## Share-alike Licence
A licence that requires users of a work to provide the content under the same or similar conditions as the original.

## Public domain
No copyright exists over the work. Does not exist in all jurisdictions.

## Open standards
Generally understood as technical standards which are free from licensing restrictions. Can also be interpreted to mean standards which are developed in a vendor-neutral manner.

## Anonymization
See `Anonymisation`.

## Anonymisation
The process of treating data such that it cannot be used for the identification of individuals.

## IP rights
See `Intellectual property rights`.

## Intellectual property rights
Monopolies granted to individuals for intellectual creations.

## Tab-separated values
Tab-separated values (TSV) are a very common form of text file format for sharing tabular data. The format is extremely simple and highly `machine-readable`.
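
As a rough illustration of why delimited text counts as machine-readable, the sketch below parses the same small table from TSV and from CSV using Python's standard `csv` module. The sample rows and the helper name `read_delimited` are made up for this example and are not part of the glossary.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
TSV_TEXT = "city\tpopulation\nAlpha\t1000\nBeta\t250\n"
CSV_TEXT = "city,population\nAlpha,1000\nBeta,250\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# The only difference between the two formats here is the delimiter.
tsv_rows = read_delimited(TSV_TEXT, "\t")
csv_rows = read_delimited(CSV_TEXT, ",")
```

Both calls yield the same rows. All values come back as strings, so numeric columns still need converting before any arithmetic.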
File renamed without changes.
32 changes: 16 additions & 16 deletions pattern/liberating-html-tables.md
@@ -9,31 +9,31 @@ In this section, we look at some quick tricks for liberating data from HTML tabl
Screenscraping HTML Tables Using Google Spreadsheets
----------------------------------------------------

The Google spreadsheet formula
The Google spreadsheet formula:

```
*=importHTML("","table",N)*
=importHTML("","table",N)
```
will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N'th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – [`List of largest United Kingdom settlements by population`](http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population) (found using a search on Wikipedia for uk city population):

![image](../images/wikipediaTable.jpg)
will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N'th table in the page (counting starts at 1) as the target table for data scraping.

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula `*=importHTML*` into one of the cells:
So for example, have a look at the following Wikipedia page – [List of largest United Kingdom settlements by population](http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population) (found using a search on Wikipedia for UK city population):
![image](http://farm9.staticflickr.com/8303/7850933084_b188c02992_o_d.jpg)

![image](../images/gssImportFormula.jpg)
Grab the URL, fire up a new Google spreadsheet, and start to enter the formula `=importHTML` into one of the cells:
![image](http://farm9.staticflickr.com/8284/7850932578_b5db80ed9d_o_d.jpg)

Autocompletion works a treat, so finish off the expression and add in the URL and table number:

![image](../images/gssImportFormulaFull.jpg)

```
=ImportHtml("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",1)
```excel
=importHTML("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",2)
```

The table numbers are not always obvious - start with 1 and increment the table number until you get the correct one.
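
Outside a spreadsheet, the same "try table 1, then 2, ..." search can be sketched in Python. This is a minimal illustration using only the standard library, not how `importHTML` itself works: the class `TableCollector`, the helpers `preview_tables` and `nth_table`, and the sample HTML are all hypothetical names and data invented here. It assumes well-formed markup and does not handle nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table> in a page, in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self.tables[-1].append(self._row)
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def _collect(html):
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


def preview_tables(html):
    """Return (index, first row) pairs, counting tables from 1."""
    return [(i, t[0] if t else []) for i, t in enumerate(_collect(html), start=1)]


def nth_table(html, n):
    """Return the n-th table (counting from 1) as a list of rows."""
    tables = _collect(html)
    if not 1 <= n <= len(tables):
        raise IndexError(f"page has {len(tables)} table(s), asked for {n}")
    return tables[n - 1]


# Hypothetical page: a one-cell layout table followed by the data table.
SAMPLE_HTML = """
<table><tr><th>Menu</th></tr></table>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1000</td></tr>
</table>
"""

previews = preview_tables(SAMPLE_HTML)
data_table = nth_table(SAMPLE_HTML, 2)
```

Printing `previews` shows the header row of each table next to its number, which makes it easier to spot the right index than incrementing blindly.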
![image](http://farm9.staticflickr.com/8438/7850932674_ef1514b761_o_d.jpg)

As if by magic, a data table appears in the spreadsheet, pulled in directly from the Wikipedia page:
![image](http://farm9.staticflickr.com/8425/7850932816_b5598830e0_o_d.jpg)

![image](../images/gssImportedHTMLTable.jpg)
If the data in the HTML table is updated, the data in the spreadsheet will also be updated when you refresh or call the spreadsheet page.

<div class="alert alert-info">Any questions? Got stuck? <a class="btn btn-large btn-info" href="http://ask.schoolofdata.org">Ask School of Data!</a></div>

If the data in the HTML table is updated, the data in the spreadsheet will also be updated when you refresh or call the spreadsheet page.
39 changes: 0 additions & 39 deletions pattern/patterns/liberating-html-tables.md

This file was deleted.
