-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathassignment2-451.html
376 lines (300 loc) · 15.9 KB
/
assignment2-451.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 431/631 451/651 Data-Intensive Distributed Computing (Winter 2019) at the University of Waterloo">
<meta name="author" content="Adam Roegiest">
<title>Data-Intensive Distributed Computing</title>
<!-- Bootstrap core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<!-- Just for debugging purposes. Don't actually copy these 2 lines! -->
<!--[if lt IE 9]><script src="../../assets/js/ie8-responsive-file-warning.js"></script><![endif]-->
<script src="js/ie-emulation-modes-warning.js"></script>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
<li><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"><img width="250" src="images/waterloo_logo.png" alt="University of Waterloo logo"/></div>
<h1>Assignments <br/><small>Data-Intensive Distributed Computing (Winter 2019)</small></h1>
</div>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="assignment0-451.html">0</a></li>
<li><a href="assignment1-451.html">1</a></li>
<li><a href="assignment2-451.html">2</a></li>
<li><a href="assignment3-451.html">3</a></li>
<li><a href="assignment4-451.html">4</a></li>
<li><a href="assignment5-451.html">5</a></li>
<li><a href="assignment6-451.html">6</a></li>
<li><a href="assignment7-451.html">7</a></li>
<li><a href="project-451.html">Final Project</a></li>
</ul>
</div>
<h3>Assignment 2: Counting in Spark <small>due 2:30pm January 31</small></h3>
<p>In this assignment you will do two things:</p>
<ol>
<li>"Port" the MapReduce implementations of the bigram frequency count
program from <a href="http://bespin.io">Bespin</a> over to Spark (in
Scala).</li>
<li>"Port" the MapReduce implementations
of <a href="assignment1-451.html">assignment 1</a> over to Spark (in Scala).</li>
</ol>
<h4 style="padding-top: 10px">Bigram Relative Frequency</h4>
<p>Your starting points
are <code>ComputeBigramRelativeFrequencyPairs</code>
and <code>ComputeBigramRelativeFrequencyStripes</code> in
package <code>io.bespin.java.mapreduce.bigram</code> (in Java).
You are welcome to build on the <code>BigramCount</code> (Scala)
implementation <a href="https://github.com/lintool/bespin/blob/master/src/main/scala/io/bespin/scala/spark/bigram/BigramCount.scala">here</a>
for tokenization and "boilerplate" code like command-line argument
parsing. To be consistent in tokenization, you should use (i.e., import)
the <code>Tokenizer</code> trait
<a href="https://github.com/lintool/bespin/blob/master/src/main/scala/io/bespin/scala/util/Tokenizer.scala">here</a>.</p>
<p>Put your code in the
package <code>ca.uwaterloo.cs451.a2</code>. Since
you'll be writing Scala code, your source files should go
into <code>src/main/scala/ca/uwaterloo/cs451/a2/</code>.
The repository is designed so that Scala/Spark code will also
compile with the same Maven build command:</p>
<pre>
$ mvn clean package
</pre>
<p>Following the Java implementations, you will write both a "pairs"
and a "stripes" implementation in Spark. Note that although Spark has a
different API than MapReduce, the algorithmic concepts are still very
much applicable. Your pairs and stripes implementation should follow
the same logic as in the MapReduce implementations. In particular,
your program should only take one pass through the input data.</p>
<p>Make sure your implementation runs in the Linux student CS
environment on the Shakespeare collection and also on sample Wikipedia
file <code>/data/cs451/enwiki-20180901-sentences-0.1sample.txt</code>
on HDFS in the Datasci cluster. See
the <a href="software.html">software page</a> for more details on
accessing the Datasci.</p>
<p>You can verify the correctness of your algorithm by comparing the
output of the MapReduce implementation with your Spark
implementation.</p>
<p>Clarification on terminology: informally, we often refer to
"mappers" and "reducers" in the context of Spark. That's a shorthand
way of saying map-like transformations
(<code>map</code>, <code>flatMap</code>, <code>filter</code>, <code>mapPartitions</code>,
etc.) and reduce-like transformations
(e.g., <code>reduceByKey</code>, <code>groupByKey</code>, <code>aggregateByKey</code>,
etc.). Hopefully it's clear from lecture that while Spark represents a
generalization of MapReduce, the notions of per-record processing
(i.e., map-like transformation) and grouping/shuffling (i.e.,
reduce-like transformations) are shared across both frameworks.</p>
<p>We are going to run your code in the Linux student CS environment
as follows (we will make sure the collection is there):</p>
<pre>
$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyPairs \
target/assignments-1.0.jar --input data/Shakespeare.txt \
--output cs451-lintool-a2-shakespeare-bigrams-pairs --reducers 5
$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyStripes \
target/assignments-1.0.jar --input data/Shakespeare.txt \
--output cs451-lintool-a2-shakespeare-bigrams-stripes --reducers 5
</pre>
<p>We are going to run your code on the Datasci cluster as follows
(note the addition of the <code>--num-executors</code>
and <code>--executor-cores</code> options):</p>
<pre>
$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyPairs \
--num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
--input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
--output cs451-lintool-a2-wiki-bigram-pairs --reducers 8
$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyStripes \
--num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
--input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
--output cs451-lintool-a2-wiki-bigram-stripes --reducers 8
</pre>
<p><b>Important:</b> Make sure that your code accepts the command-line
parameters above!<p>
<p>When you run a Spark job (in distributed mode), you need to specify how much cluster
resource to request. The option <code>--num-executors</code> specifies
the number of executors, each with a certain number of cores specified
by <code>--executor-cores</code>. So, in the above commands, we
request a total of 8 workers (2 executors, 4 cores each).</p>
<p>The <code>--reducers</code> flag is the amount of parallelism that
you set in your program in the reduce stage. If the total number of
workers is larger than <code>--reducers</code>, some of the workers
will be sitting idle, since you've allocated more workers for the job
than the parallelism you've specified in your
program. If <code>--reducers</code> is larger than the number of
workers, on the other hand, then your reduce tasks will queue up at
the workers, i.e., a worker will be assigned more than one reduce
task. In the above example we set the two equal.</p>
<p>Note that the setting of these two parameters should not affect the
correctness of your program. The setting above is a reasonable middle
ground between having your jobs finish in a reasonable amount of time
and not monopolizing cluster resources.</p>
<p>A related but still orthogonal concept is partitions. Partitions
describes the physical division of records across workers during
execution. When reading from HDFS, the number of HDFS blocks
determines the number of partitions in your RDD. When you apply a
reduce-like transformation, you can optionally specify the number of
partitions (or Spark applies a default) — in this case, the
number of partitions is equal to the number of reducers.</p>
<h4 style="padding-top: 10px">PMI</h4>
<p>Your starting points for PMI computations in Spark should be your
solutions to assignment 1. Write two programs, <code>PairsPMI</code>
and <code>StripesPMI</code> that go in
package <code>ca.uwaterloo.cs451.a2</code>,
in <code>src/main/scala/ca/uwaterloo/cs451/a2/</code>.</p>
<p>There are obviously going to be differences in the MapReduce and
Spark implementations, but we want you to preserve the "spirit" of the
"pairs" vs. "stripes" approach in your respective
implementations. That is, the pairs implementation keeps track of each
co-occurring counts independently, while the stripes implementation
groups all co-occurring terms with respect to a term. If you have
questions, please ask.</p>
<p>We are going to run your code in the Linux student CS environment
as follows (we will make sure the collection is there):</p>
<pre>
$ spark-submit --class ca.uwaterloo.cs451.a2.PairsPMI \
target/assignments-1.0.jar --input data/Shakespeare.txt \
--output cs451-lintool-a2-shakespeare-pmi-pairs --reducers 5 --threshold 10
$ spark-submit --class ca.uwaterloo.cs451.a2.StripesPMI \
target/assignments-1.0.jar --input data/Shakespeare.txt \
--output cs451-lintool-a2-shakespeare-pmi-stripes --reducers 5 --threshold 10
</pre>
<p>We are going to run your code on the Datasci cluster as follows
(we'll use the same simple Wikipedia collection
at <code>/data/cs451/simplewiki-20180901-sentences.txt</code>
from assignment 1):</p>
<pre>
$ spark-submit --class ca.uwaterloo.cs451.a2.PairsPMI \
--num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
--input /data/cs451/simplewiki-20180901-sentences.txt \
--output cs451-lintool-a2-wiki-pmi-pairs --reducers 8 --threshold 10
$ spark-submit --class ca.uwaterloo.cs451.a2.StripesPMI \
--num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
--input /data/cs451/simplewiki-20180901-sentences.txt \
--output cs451-lintool-a2-wiki-pmi-stripes --reducers 8 --threshold 10
</pre>
<p>Hints:</p>
<ul>
<li>Use broadcast variables.</li>
<li>Spark accumulators may be helpful</li>
</ul>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>All implementations should be in
package <code>ca.uwaterloo.cs451.a2</code>; your
Scala code should be
in <code>src/main/scala/ca/uwaterloo/cs451/a2/</code>.
There are no questions to answer in this assignment unless there is
something you would like to communicate with us, and if so, put it
in <code>assignment2.md</code>.</p>
<p>When grading, we will pull your repo and build your code:<p>
<pre>
$ mvn clean package
</pre>
<p>And run using exactly the commands above. Make sure that your code
runs in the Linux Student CS environment (even if you do development
on your own machine), which is where we will be doing the
grading. "But it runs on my laptop!" will not be accepted as an excuse
if we can't get your code to run.</p>
<p>Specifically, we will clone your repo and use the below check
scripts:</p>
<ul>
<li><a href="assignments/check_assignment2_public_linux.py"><code>check_assignment2_public_linux.py</code></a>
in the Linux Student CS environment.</li>
<li><a href="assignments/check_assignment2_public_datasci.py"><code>check_assignment2_public_datasci.py</code></a> on the Datasci cluster.</li>
</ul>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and going through the steps above.</p>
<p>That's it!</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>This assignment is worth a total of 40 points, broken down as
follows:</p>
<ul>
<li>The bigram relative frequency implementations are worth 20
points total, 5 point for each of the following conditions: pairs on
Linux Student CS, pairs on the Datasci cluster, stripes on Linux
Student CS, stripes on the Datasci cluster.</li>
<li>The PMI implementations are worth 20 points total, 5 point for
each of the following conditions: pairs on Linux Student CS, pairs
on the Datasci cluster, stripes on Linux Student CS, stripes on
the Datasci cluster.</li>
</ul>
<p>There are no points explicitly for hidden test cases: the points
are folded into the distribution above.</p>
<h4 style="padding-top: 10px">Reference Running Times</h4>
<p>To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations.
Keep in mind that these are machine dependent and can vary depending on the server/cluster load.</p>
<table style="border: 1px solid black; width:50%;padding: 5px;">
<tr style="border: 1px solid black">
<th style="border: 1px solid black; text-align: center;">Class name</th>
<th style="border: 1px solid black; text-align: center;">Running time Linux</th>
<th style="border: 1px solid black; text-align: center;">Running time Datasci</th>
</tr>
<tr style="border: 1px solid black">
<td style="border: 1px solid black; padding: 2px;">ComputeBigramRelativeFrequencyPairs</td>
<td style="border: 1px solid black; padding: 2px">30 seconds</td>
<td style="border: 1px solid black; padding: 2px">6 minutes 30 seconds</td>
</tr>
<tr style="border: 1px solid black">
<td style="border: 1px solid black; padding: 2px">ComputeBigramRelativeFrequencyStripes</td>
<td style="border: 1px solid black; padding: 2px">25 seconds</td>
<td style="border: 1px solid black; padding: 2px">2 minutes 30 seconds</td>
</tr>
<tr style="border: 1px solid black">
<td style="border: 1px solid black; padding: 2px;">PairsPMI</td>
<td style="border: 1px solid black; padding: 2px">45 seconds</td>
<td style="border: 1px solid black; padding: 2px">10 minutes</td>
</tr>
<tr style="border: 1px solid black">
<td style="border: 1px solid black; padding: 2px">StripesPMI</td>
<td style="border: 1px solid black; padding: 2px">25 seconds</td>
<td style="border: 1px solid black; padding: 2px">3 minutes</td>
</tr>
</table>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
<div style="padding-bottom: 100px"></div>
</div><!-- /.container -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>