Lecture 15 Big Data Spark.srt
1
00:00:06,899 --> 00:00:17,770
All right, today we're going to talk about Spark. Spark is essentially a

2
00:00:17,770 --> 00:00:24,400
successor to MapReduce; you can think of it as a kind of evolutionary step beyond

3
00:00:24,400 --> 00:00:31,210
MapReduce, and one reason we're looking at it is that it's widely used today for

4
00:00:31,210 --> 00:00:37,260
data center computations; it's turned out to be very popular and very useful.

5
00:00:37,260 --> 00:00:41,589
One interesting thing it does, which we'll pay attention to, is that it
6
00:00:41,589 --> 00:00:47,550
generalizes the two stages of MapReduce, the map and the reduce, into a

7
00:00:47,550 --> 00:00:57,579
complete notion of multi-step data flow graphs, and this is both helpful for

8
00:00:57,579 --> 00:01:02,139
flexibility for the programmer, it's more expressive, and it also gives the system,

9
00:01:02,139 --> 00:01:07,530
the Spark system, a lot more to chew on when it comes to optimization and

10
00:01:07,530 --> 00:01:12,759
dealing with faults, dealing with failures. And also, from the
11
00:01:12,759 --> 00:01:16,539
programmer's point of view, it supports iterative applications, applications that,

12
00:01:16,539 --> 00:01:21,909
you know, loop over the data, effectively much better than MapReduce does. You

13
00:01:21,909 --> 00:01:27,579
can cobble together a lot of stuff with multiple MapReduce applications running

14
00:01:27,579 --> 00:01:36,149
one after another, but it's all a lot more convenient in Spark. Okay, so I

15
00:01:36,149 --> 00:01:41,909
think I'm just gonna start right off with an example application. This is the
16
00:01:41,909 --> 00:01:52,840
code for PageRank, and I'll just copy this code, with a few changes, from

17
00:01:52,840 --> 00:01:56,789
some sample source code in

18
00:01:57,520 --> 00:02:02,680
the Spark source. I guess it's actually a little bit hard to read; let

19
00:02:02,680 --> 00:02:06,479
me just, give me a second, I'll try to make it bigger.
20
00:02:14,860 --> 00:02:20,120
All right, okay, so if this is too hard to read, there's a copy of it

21
00:02:20,120 --> 00:02:26,570
in the notes, and it's an expansion of the code in section 3.2.2 in the paper,

22
00:02:26,570 --> 00:02:33,500
PageRank, which is an algorithm that Google uses, a pretty famous algorithm for

23
00:02:33,500 --> 00:02:42,380
calculating how important different web search results are. What PageRank is

24
00:02:42,380 --> 00:02:46,700
trying to do, well, actually, PageRank is sort of widely
25
00:02:46,700 --> 00:02:51,350
used as an example of something that doesn't actually work that well in

26
00:02:51,350 --> 00:02:56,510
MapReduce, and the reason is that PageRank involves a bunch of sort of

27
00:02:56,510 --> 00:03:01,130
distinct steps, and worse, PageRank involves iteration: there's a loop in it

28
00:03:01,130 --> 00:03:06,130
that's got to be run many times, and MapReduce just has nothing to say

29
00:03:06,130 --> 00:03:15,860
about iteration. The input for this version of PageRank is just a
30
00:03:15,860 --> 00:03:23,360
giant collection of lines, one per link in the web, and each line then has two

31
00:03:23,360 --> 00:03:28,550
URLs: the URL of the page containing a link, and the URL of the link that that

32
00:03:28,550 --> 00:03:33,920
page points to. And, you know, the intent is that you get this file by

33
00:03:33,920 --> 00:03:38,390
crawling the web and collecting together all the links in

34
00:03:38,390 --> 00:03:46,790
the web. The input is absolutely enormous. And as just a sort of silly
35
00:03:46,790 --> 00:03:53,180
little example for us, for when I actually run this code, I've given some

36
00:03:53,180 --> 00:03:56,959
example input here, and this is the way the input would really look: it's just

37
00:03:56,959 --> 00:04:03,290
lines, each line with two URLs, and I'm using u1 as the URL of a page and u3,

38
00:04:03,290 --> 00:04:09,489
for example, as the URL of a link that that page points to, just for convenience.

39
00:04:09,489 --> 00:04:15,230
And so in the web graph that this input file represents, there are only three pages
40
00:04:15,230 --> 00:04:22,610
in it: one, two, three. I can just interpret the links: there's a link from

41
00:04:22,610 --> 00:04:27,419
one to three, there's a link from one back to itself,

42
00:04:27,419 --> 00:04:32,710
there's a web link from two to three, there's a web link from two back to

43
00:04:32,710 --> 00:04:39,190
itself, and there's a web link from three to one. Just a very simple graph

44
00:04:39,190 --> 00:04:45,100
structure. What PageRank is trying to do is, you know, estimate the importance
45
00:04:45,100 --> 00:04:50,620
of each page. What that really means is that it's estimating the importance

46
00:04:50,620 --> 00:04:56,979
based on whether other important pages have links to a given page, and what's

47
00:04:56,979 --> 00:05:01,150
really going on here is that it's kind of modeling the estimated probability that

48
00:05:01,150 --> 00:05:08,199
a user who clicks on links will end up on each given page. So it has this user

49
00:05:08,199 --> 00:05:14,289
model in which the user has an 85 percent chance of following a link from the
50
00:05:14,289 --> 00:05:19,150
user's current page, following a randomly selected link from the user's current

51
00:05:19,150 --> 00:05:25,900
page to wherever that link leads, and a 15 percent chance of simply switching to some

52
00:05:25,900 --> 00:05:29,080
other page, even though there's not a link to it, as you would if you, you know,

53
00:05:29,080 --> 00:05:38,949
entered a URL directly into the browser. And the idea is that the PageRank

54
00:05:38,949 --> 00:05:45,400
algorithm kind of runs this repeatedly; it sort of simulates the user looking at
55
00:05:45,400 --> 00:05:51,610
a page and then following a link, and kind of adds the from-page's importance

56
00:05:51,610 --> 00:05:55,720
to the target page's importance, and then sort of runs this again. And

57
00:05:55,720 --> 00:06:02,889
in a system like PageRank on Spark, it's going to kind of run this

58
00:06:02,889 --> 00:06:09,030
simulation for all pages in parallel, iteratively.

59
00:06:09,900 --> 00:06:14,680
And the idea is that it's going to keep track, the algorithm's gonna keep
60
00:06:14,680 --> 00:06:19,560
track of the PageRank of every single page, or every single URL, and update it

61
00:06:19,560 --> 00:06:24,610
as it sort of simulates random user clicks, so that eventually those

62
00:06:24,610 --> 00:06:31,529
ranks will converge on kind of the true final values. Now,
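The random-surfer update described above can be sketched in plain Python (this is my own illustration, not the lecture's Scala/Spark code; the graph is the three-page example from the demo, and the 0.15/0.85 split is the damping model just described):

```python
# Hypothetical plain-Python sketch of the PageRank update the lecture
# describes; Spark runs this same loop as parallel transformations.

# The example graph: u1 -> u3, u1 -> u1, u2 -> u3, u2 -> u2, u3 -> u1
links = {"u1": ["u3", "u1"], "u2": ["u3", "u2"], "u3": ["u1"]}
ranks = {page: 1.0 for page in links}          # every page starts at rank 1.0

for _ in range(10):                            # iterate until ranks settle
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        # a page splits its current rank evenly among the pages it links to
        for target in outlinks:
            contribs[target] += ranks[page] / len(outlinks)
    # 15% chance of jumping anywhere, 85% chance of following a link
    ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}

print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

Run for a few iterations and, as in the demo, u1 ends up with the highest rank: it is the only page that u3 links to, and u3 in turn collects rank from both u1 and u2.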
63
00:06:31,529 --> 00:06:37,259
because it's iterative, although you can code this up in MapReduce, it's a

64
00:06:37,259 --> 00:06:45,439
pain; it can't be just a single MapReduce program, it has to be, you know,

65
00:06:45,439 --> 00:06:51,359
multiple calls to a MapReduce application, where each call sort of

66
00:06:51,359 --> 00:06:55,739
simulates one step in the iteration. So you can do it in MapReduce, but it's a

67
00:06:55,739 --> 00:07:00,479
pain, and it's also kind of slow, because MapReduce is only thinking about one
68
00:07:00,479 --> 00:07:05,279
map and one reduce, and it's always reading its input from disk, from

69
00:07:05,279 --> 00:07:09,089
the GFS filesystem, and always writing its output, which would be these

70
00:07:09,089 --> 00:07:17,069
sort of updated per-page ranks; every stage also writes those updated per-page

71
00:07:17,069 --> 00:07:23,009
ranks to files in GFS. So there's a lot of file I/O if you run this as sort

72
00:07:23,009 --> 00:07:31,279
of a sequence of MapReduce applications. All right, so we have here this, um,
73
00:07:31,279 --> 00:07:35,869
there's a PageRank code that came with, um, came with Spark. I'm actually gonna

74
00:07:35,869 --> 00:07:40,009
run it for you, I'm gonna run the whole thing for you,

75
00:07:40,009 --> 00:07:44,359
this code shown here, on the input that I've shown, just to see what the final

76
00:07:44,359 --> 00:07:50,259
output is, and then I'll look through it and we're going to go step by step and

77
00:07:52,679 --> 00:08:02,619
show how it executes. All right, so here's, you should see a screen share now of
78
00:08:02,619 --> 00:08:10,199
a terminal window, and I'm showing you the input file that I'm gonna hand to this

79
00:08:10,199 --> 00:08:17,019
PageRank program. And now here's how I run it: I've, you know, I've downloaded a

80
00:08:17,019 --> 00:08:23,529
copy of Spark to my laptop, and it turns out to be pretty easy; since it's a

81
00:08:23,529 --> 00:08:29,229
precompiled version of it, I can just run it, it just runs in the Java Virtual Machine, I

82
00:08:29,229 --> 00:08:33,789
can run it very easily. So actually, downloading Spark and running
83
00:08:33,789 --> 00:08:37,559
simple stuff turns out to be pretty straightforward. So I'm gonna run the

84
00:08:37,559 --> 00:08:43,418
code that I showed, on the input, and we'll see a lot

85
00:08:43,419 --> 00:08:52,089
of junk error messages go by, but in the end Spark runs the program and prints

86
00:08:52,089 --> 00:08:56,399
the final result, and we get these three ranks for the three pages I have, and

87
00:08:56,399 --> 00:09:01,889
apparently page one has the highest rank.

88
00:09:02,819 --> 00:09:09,160
I'm not completely sure why, but that's what the algorithm ends up doing.
89
00:09:09,160 --> 00:09:13,439
So, you know, of course we're not really that interested in the algorithm itself

90
00:09:13,439 --> 00:09:26,470
so much as in how Spark executes it. All right, so in order to

91
00:09:26,470 --> 00:09:33,339
understand what the programming model is in Spark, because it's perhaps not quite

92
00:09:33,339 --> 00:09:40,779
what it looks like, I'm gonna hand the program line by line to the Spark

93
00:09:40,779 --> 00:09:49,240
interpreter. So you can just fire up this spark-shell thing and type code to it
94
00:09:49,240 --> 00:09:57,730
directly, so I've sort of prepared a version of the program that I

95
00:09:57,730 --> 00:10:05,800
can run a line at a time here. So the first line is this line in which it

96
00:10:05,800 --> 00:10:11,019
reads, or asks Spark to read, this input file, and it's, you know, the input

97
00:10:11,019 --> 00:10:15,990
file I showed with the three pages in it.
98
00:10:16,110 --> 00:10:23,110
Okay, so one thing to notice here is that when Spark reads a file, what it's

99
00:10:23,110 --> 00:10:29,769
actually doing is reading a file from a GFS-like distributed file system, and it

100
00:10:29,769 --> 00:10:36,579
happens to be HDFS, the Hadoop file system, but this HDFS file system is very

101
00:10:36,579 --> 00:10:40,720
much like GFS. So if you have a huge file, as you would with a file with all

102
00:10:40,720 --> 00:10:46,720
the URLs, all the links in the web, in it, HDFS is gonna split that file up
103
00:10:46,720 --> 00:10:51,730
among lots and lots of, you know, chunks; it's gonna shard the file over

104
00:10:51,730 --> 00:10:57,329
lots and lots of servers. And so what reading the file really means is that

105
00:10:57,329 --> 00:11:02,740
Spark is gonna arrange to run a computation on each of many, many

106
00:11:02,740 --> 00:11:10,209
machines, each of which reads one chunk, or one partition, of the input file. And

107
00:11:10,209 --> 00:11:16,209
in fact, actually, the system, or HDFS, ends up splitting big
108
00:11:16,209 --> 00:11:19,319
files typically into many more partitions

109
00:11:19,319 --> 00:11:23,860
than there are worker machines, and so every worker machine is going to end up

110
00:11:23,860 --> 00:11:28,990
being responsible for looking at multiple partitions of the input file.
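The partition-to-worker assignment being described can be pictured with a tiny Python sketch (my own illustration; real HDFS splits files by fixed-size byte chunks, typically tens of megabytes, not by a handful of lines, and Spark's scheduler also considers data locality):

```python
# Hypothetical sketch: split an input into partitions, then deal the
# partitions out round-robin to a smaller number of workers.
lines = [f"line-{i}" for i in range(12)]       # stand-in for a big file

num_partitions = 6                             # more partitions ...
num_workers = 3                                # ... than workers

# chop the "file" into equal-sized partitions
size = len(lines) // num_partitions
partitions = [lines[i * size:(i + 1) * size] for i in range(num_partitions)]

# each worker ends up responsible for multiple partitions
assignment = {w: [] for w in range(num_workers)}
for p in range(num_partitions):
    assignment[p % num_workers].append(p)

print(assignment)   # every worker handles two of the six partitions
```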
111
00:11:28,990 --> 00:11:37,670
This is all a lot like the way map works in MapReduce. Okay, so this is the first line

112
00:11:37,670 --> 00:11:44,180
in the program, and you may wonder what the variable lines actually holds. So I

113
00:11:44,180 --> 00:11:50,840
printed the result, what lines points to. It turns out that even

114
00:11:50,840 --> 00:11:55,880
though it looks like we've typed a line of code that's asking the system to read

115
00:11:55,880 --> 00:12:02,090
a file, in fact it hasn't read the file, and won't read the file for a while. What
116
00:12:02,090 --> 00:12:07,030
we're really building here with this code, what this code is doing, is not

117
00:12:07,030 --> 00:12:13,130
causing the input to be processed; instead, what this code does is build a

118
00:12:13,130 --> 00:12:19,250
lineage graph. It builds a recipe for the computation we want, a little kind

119
00:12:19,250 --> 00:12:23,450
of lineage graph like the one you see in Figure 3 in the paper. So what this code is

120
00:12:23,450 --> 00:12:27,320
doing is just building the lineage graph, building the computation recipe,
121
00:12:27,320 --> 00:12:32,960
and not doing the computation. The computation's only gonna actually start

122
00:12:32,960 --> 00:12:39,080
to happen once we execute what the paper calls an action, which is a function, like

123
00:12:39,080 --> 00:12:44,390
collect for example, to finally tell Spark, oh look, I actually want the output now,

124
00:12:44,390 --> 00:12:50,360
please go and actually execute the lineage graph and tell me what the

125
00:12:50,360 --> 00:12:53,840
result is. So what lines holds is actually a piece
126
00:12:53,840 --> 00:13:01,220
of the lineage graph, not a result. Now, in order to understand what the computation

127
00:13:01,220 --> 00:13:07,730
will do when we finally run it, we can actually ask Spark at this point, we can

128
00:13:07,730 --> 00:13:14,840
ask the interpreter to please go ahead and, you know, actually

129
00:13:14,840 --> 00:13:19,780
execute the lineage graph up to this point and tell us what the results are.
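The lazy pattern being described, where transformations only record a recipe and an action finally runs it, can be mimicked with a small Python toy (my own sketch; the class and method names are made up and this is not Spark's actual API):

```python
# Hypothetical toy "RDD": transformations only record a recipe (the
# lineage), and nothing executes until an action like collect() is called.
class ToyRDD:
    def __init__(self, recipe):
        self.recipe = recipe          # a zero-argument function: the lineage

    def map(self, fn):
        # build a NEW recipe on top of the old one; run nothing yet
        return ToyRDD(lambda: [fn(x) for x in self.recipe()])

    def collect(self):
        # the action: only now is the whole recipe actually executed
        return self.recipe()

def text_file(lines):
    # "reading a file" also just records a recipe
    return ToyRDD(lambda: list(lines))

lines = text_file(["u1 u3", "u2 u3", "u3 u1"])
pairs = lines.map(lambda s: tuple(s.split()))  # still nothing has run
print(pairs.collect())                         # now the recipe executes
```

The real system does far more at this point (compiling to bytecode, scheduling workers near the data), but the shape is the same: collect walks the lineage, everything before it is bookkeeping.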
130
00:13:19,780 --> 00:13:24,350
And you do that by calling an action; I'm going to call collect, which just
131
00:13:24,350 --> 00:13:31,070
prints out all the results of executing the lineage graph so far. And what we're

132
00:13:31,070 --> 00:13:34,790
expecting to see here is, you know, all we've asked it to do so far, the lineage

133
00:13:34,790 --> 00:13:38,120
graph just says please read a file, so we're expecting to see that the final

134
00:13:38,120 --> 00:13:44,020
output is just the contents of the file, and indeed that's what we get. And what

135
00:13:44,020 --> 00:13:48,890
this lineage graph, this

136
00:13:48,890 --> 00:13:57,350
one-transformation lineage graph, results in, is just the sequence of lines, one at

137
00:13:57,350 --> 00:14:03,620
a time. So it's really a set of lines, a set of strings, each of which contains
138
00:14:03,620 --> 00:14:10,010
one line of the input. All right, so that's the first line of the program. The second

139
00:14:10,010 --> 00:14:19,760
line is... [Question:] Is collect essentially just just-in-time compilation of the symbolic

140
00:14:19,760 --> 00:14:25,280
execution chain? Yeah, yeah, that's what's going on. So what collect

141
00:14:25,280 --> 00:14:30,280
does is, actually a huge amount of stuff happens if you call collect:

142
00:14:30,280 --> 00:14:37,100
it tells Spark to take the lineage graph and produce Java bytecodes
143
00:14:37,100 --> 00:14:40,940
that describe all the various transformations, you know, which in this

144
00:14:40,940 --> 00:14:45,170
case isn't very much, since we're just reading a file. But when

145
00:14:45,170 --> 00:14:50,810
you call collect, Spark will figure out where the data you want is by looking at

146
00:14:50,810 --> 00:14:57,380
HDFS, it'll, you know, just pick a set of workers to process the different

147
00:14:57,380 --> 00:15:01,580
partitions of the input data, it'll compile each
148
00:15:01,580 --> 00:15:05,660
transformation in the lineage graph into Java bytecodes, it sends the bytecodes

149
00:15:05,660 --> 00:15:10,850
out to all the worker machines that Spark chose, and those worker machines

150
00:15:10,850 --> 00:15:18,050
execute the bytecodes, and the bytecodes say, oh, you know, they tell

151
00:15:18,050 --> 00:15:24,770
each worker to read its partition of the input, and then finally collect goes

152
00:15:24,770 --> 00:15:32,120
out and fetches all the resulting data back from the workers. And so again, none
153
00:15:32,120 --> 00:15:34,910
of this happens until you actually run an action, and we sort of

154
00:15:34,910 --> 00:15:39,170
prematurely ran collect just now; you wouldn't ordinarily do that, I did it just because I

155
00:15:39,170 --> 00:15:43,460
wanted to see what the output is, to understand what the transformations are

156
00:15:43,460 --> 00:15:51,490
doing. Okay, if you look at the code that I'm showing,
158
00:16:01,779 --> 00:16:06,369
output of the first transformation, which is the set of strings corresponding to

159
00:16:06,369 --> 00:16:11,740
lines in the input. We're gonna call map, we've asked the system to call map on that,

160
00:16:11,740 --> 00:16:16,660
and what map does is it runs a function over each element of the input, that is,

161
00:16:16,660 --> 00:16:22,019
in this case, each line of the input, and that little function is the s-arrow

162
00:16:22,019 --> 00:16:27,160
whatever, which basically describes a function that calls the split function
163
00:16:27,160 --> 00:16:34,990
on each line. split just takes a string and returns an array of strings, broken at

164
00:16:34,990 --> 00:16:39,730
the places where there are spaces. And the final part of this line, which refers

165
00:16:39,730 --> 00:16:44,740
to parts 0 and 1, says that for each line of input we want the output of

166
00:16:44,740 --> 00:16:51,040
this transformation to be the first string on the line and then the second string

167
00:16:51,040 --> 00:16:54,189
on the line. So we're just doing a little transformation to turn these strings

168
00:16:54,189 --> 00:16:59,019
into something that's easier to deal with.
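That first map transformation, split each line on whitespace and keep the first and second fields, looks like this in plain Python (my own sketch of the same step; the lecture's version is the equivalent Scala in the Spark sample source):

```python
# Hypothetical Python equivalent of the first map: turn each
# "from-URL to-URL" line into a (from, to) pair of strings.
def line_to_link(line):
    parts = line.split()            # split on runs of whitespace
    return (parts[0], parts[1])     # keep the page URL and the link URL

input_lines = ["u1 u3", "u1 u1", "u2 u3", "u2 u2", "u3 u1"]
links = [line_to_link(line) for line in input_lines]
print(links)
```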