-
Notifications
You must be signed in to change notification settings - Fork 1
/
instructor-notes.html
271 lines (269 loc) · 9.67 KB
/
instructor-notes.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
<!DOCTYPE html>
<html>
<head>
<title>Author Carpentry</title>
<link rel="stylesheet" href="css/site.css">
</head>
<body>
<header>
<a href="http://authorcarpentry.github.io"><img src="img/AClogo.jpg" alt="Author Carpentry logo"></a>
</header>
<nav>
<ul>
<li>
<a href=".">Lesson</a>
</li>
<li>
<a href="00-getting-started.html">Topic 1</a>
</li>
<li>
<a href="01-working-with-openrefine.html">Topic 2</a>
</li>
<li>
<a href="mailto:[email protected]">Contact Us</a>
</li>
</ul>
</nav>
<section>
<h1 id="learning-objectives">
Learning Objectives
</h1>
<ul>
<li>
Introduce concept of facets
</li>
<li>
Show split columns by defined separator
</li>
<li>
Show power of include / exclude, sort by name / count
</li>
<li>
Show the power of clustering algorithms to reveal data patterns, data snafus
</li>
<li>
If time, show call to an API, a web service (JSON example here from a locality georeferencing service)
</li>
<li>
If time, show how to parse the JSON returned from the service.
</li>
<li>
Show the power of undo / redo.
</li>
<li>
Can be used to introduce
</li>
<li>
scripting
</li>
<li>
regular expressions
</li>
<li>
APIs
</li>
</ul>
<hr />
<h1 id="lesson">
Lesson
</h1>
<h2 id="opening-refine">
Opening Refine
</h2>
<p>
If you are using a different browser, or OpenRefine does not automatically open for you when you click the .exe file, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
</p>
<h2 id="creating-a-project">
Creating a Project
</h2>
<p>
Start the program. (Double-click on the google-refine.exe file. Java services will start on your machine, and Refine will open in your Firefox browser).
</p>
<p>
Note the file types Open Refine handles: TSV, CSF, *SV, Excel (.xls .xlsx), JSON, XML, RDF as XML, Google Data documents. Support for other formats can be added with Google Refine extensions.
</p>
<p>
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, I’ve modified the Portal_rodents csv file. I added several columns: scientificName, locality, county, state, country and I generated several more columns in the lesson itself (JSON, decimalLatitude, decimalLongitude). Data in locality, county, country, JSON, decimalLatitude and decimalLongitude are contrived and are in no way related to the original dataset. When doing this demo, the columns: JSON, decimalLatitude, decimalLongitude can be deleted, and then recreated if time, with a call to a locality service, and subsequent parsing of the JSON data returned by the service.
</p>
<p>
<strong>Once Refine is open, you’ll be asked if you want to Create, Open, or Import a Project.</strong>
</p>
<ul>
<li>
Click Browse, find Portal_rodents_19772002_scinameUUIDs.csv
</li>
<li>
Click next to open Portal_rodents_19772002_scinameUUIDs.csv
</li>
<li>
Refine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click “update preview.”
</li>
<li>
If all looks well, click <em>Create Project.</em>
</li>
</ul>
<h2 id="faceting">
Faceting
</h2>
<p>
<em>Exploring data by applying multiple filters</em>
</p>
<p>
OpenRefine supports faceted browsing as a mechanism for
</p>
<ul>
<li>
seeing a big picture of your data, and
</li>
<li>
filtering down to just the subset of rows that you want to change in bulk.
</li>
</ul>
<p>
Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture on that column, and allows you to filter to some subset of rows for which their cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.
</p>
<p>
<a href="https://github.com/OpenRefine/OpenRefine/wiki/Faceting">More on faceting</a>
</p>
<ul>
<li>
Scroll over to the scientificName column
</li>
<li>
Click the down arrow and choose > Facet > Text facet
</li>
<li>
In the left margin, you’ll see a box containing every unique, distinct value in the scientificName column and Refine shows you how many times that value occurs in the column (a count), and allows you to sort (order) your facets by name or count.
</li>
<li>
Edit. Note that at any time, in any cell of the Facet box, or data cell in the Refine window, you have access to “edit” and can fix an error immediately. Refine will even ask you if you’d like to make that same correction to every value it finds like that one (or not).
</li>
</ul>
<h2 id="cluster">
Cluster
</h2>
<p>
One of the most magical bits of Refine, the moment you realize what you’ve been missing. Refin<br /> e has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
</p>
<p>
<a href="https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth">More on clustering</a>
</p>
<p>
In OpenRefine, clustering refers to the operation of “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings “New York” and “new york” are very likely to refer to the same concept and just have capitalization differences. Likewise, “Gödel” and “Godel” probably refer to the same person.
</p>
<ul>
<li>
In this example, in the scientificName Text Facet we created in the step above, click the <em>Cluster</em> button.
</li>
<li>
In the resulting pop-up window, you can change the algorithm method, and keying function. Try different combinations to see the difference.
</li>
<li>
For example, with this dataset, the <em>nearest neighbor</em> method with the <em>PPM</em> keying function shows the power of clustering the best.
</li>
<li>
Intentional errors in these scientific names have been introduced to show how errors (typos) in any position can be found with this method. All errors can then be fixed by simply entering the correct value in the box on the right. Often, the algorithm has guessed correctly.
</li>
<li>
After corrections are made in this window, you can either Merge and Close the Cluster pop-up, or Merge and Re-cluster.
</li>
</ul>
<h2 id="split-leading---trailing-whitespace-undo---redo">
Split / Leading - Trailing Whitespace / Undo - Redo
</h2>
<p>
If data in a column needs to be split into multiple columns, and the strings in the cells are separated by a common separator (say a comma, or a space), you can use that separator to divide up the bits into their own columns.
</p>
<ul>
<li>
Go to the drop-down tab at the top of the column that you need to split into multiple columns
</li>
<li>
For example, go to the scientificName column > from drop-down choose Edit Column > Split into several columns
</li>
<li>
In the pop-up, for separator, remove the comma, put in a space
</li>
<li>
Remove the check in the box that says “remove column after splitting”
</li>
<li>
You’ll get two extra columns called, in this case: scientificName 1, scientificName 2
</li>
<li>
This will reveal an error in a few names that have spaces at the beginning (so-called leading white space).
</li>
<li>
These can be easily removed with another Refine feature in the column drop-down choices. See drop-down: Edit cells > Common transforms > Remove leading and trailing whitespace
</li>
<li>
To Undo create columns, look just above the scientificName cluster in the left side of the screen. Click where it says Undo / Redo. Click back one step (all steps, all changes are saved here). Just go back to the previous step and click. The extra columns will be gone.
</li>
</ul>
<h2 id="call-a-service-this-example-is-set-up-to-georeference-locality-data-but-could-use-any-service.">
Call a Service (this example is set up to georeference locality data, but could use any service).
</h2>
<ul>
<li>
For this demo, the instructor may find a web service appropriate to demonstrate.
</li>
</ul>
<h2 id="scripts">
Scripts
</h2>
<ul>
<li>
Refine saves every change, every edit you make to the dataset in a file you can save on your machine.
</li>
<li>
IF you had 20 files to clean, and they all had the same type of errors, and all files had the same columns, you could save the script, open a new file to clean, paste in the script and run it. Voila, clean data.
</li>
<li>
In the Undo / Redo section, click Extract, save the bits desired using the check boxes. Save the code in a .txt file. To run these steps on a new dataset, import the new dataset into Refine, open the Extract / Apply section, paste in the .txt file, click Apply.
</li>
</ul>
<h2 id="export">
Export
</h2>
<ul>
<li>
Save your work when you are done by exporting it in the desire format. Save your files with meaningful names, no spaces. Refine does not change your original dataset.
</li>
</ul>
<h4 id="time-estimate-for-this-demo.">
Time estimate for this demo.
</h4>
<ul>
<li>
Takes about 20 - 30 minutes to do a good demo.
</li>
<li>
If students are going to install and then try this tool out on the provided dataset or their own dataset, it will take longer.
</li>
<li>
Mac users with the newest operating system will have to allow this to run by “allowing everything” to run. They can change the setting back after the exercise.
</li>
<li>
Some students will run into issues with
<ul>
<li>
unzipping
</li>
<li>
finding the .exe file once the software has been unzipped
</li>
</ul>
</li>
</ul>
</section>
<footer>
<span>© Author Carpentry</span>
<span><a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img
alt="Creative Commons License" style="vertical-align: middle;"
src="https://i.creativecommons.org/l/by/4.0/80x15.png" /></a></span>
<span>This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution
4.0 International License</a></span>
</footer>
</body>
</html>