Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

功能--->函数 #126

Merged
merged 1 commit into from
Aug 29, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion ch10.md
Original file line number Diff line number Diff line change
Expand Up @@ -562,7 +562,7 @@ top5.each{|count, url| puts "#{count} #{url}" } # 5

​ 由于它们将工作流显式建模为数据从几个处理阶段穿过,所以这些系统被称为**数据流引擎(dataflow engines)**。像MapReduce一样,它们在一条线上通过反复调用用户定义的函数来一次处理一条记录,它们通过输入分区来并行化载荷,它们通过网络将一个函数的输出复制到另一个函数的输入。

​ 与MapReduce不同,这些功能不需要严格扮演交织的Map与Reduce的角色,而是可以以更灵活的方式进行组合。我们称这些函数为**算子(operators)**,数据流引擎提供了几种不同的选项来将一个算子的输出连接到另一个算子的输入:
​ 与MapReduce不同,这些函数不需要严格扮演交织的Map与Reduce的角色,而是可以以更灵活的方式进行组合。我们称这些函数为**算子(operators)**,数据流引擎提供了几种不同的选项来将一个算子的输出连接到另一个算子的输入:

- 一种选项是对记录按键重新分区并排序,就像在MapReduce的混洗阶段一样(请参阅“[分布式执行MapReduce](#分布式执行MapReduce)”)。这种功能可以用于实现排序合并连接和分组,就像在MapReduce中一样。
- 另一种可能是接受多个输入,并以相同的方式进行分区,但跳过排序。当记录的分区重要但顺序无关紧要时,这省去了分区散列连接的工作,因为构建散列表还是会把顺序随机打乱。
Expand Down