nixos/hadoop: package rewrite and module improvements #141143
Conversation
Force-pushed from 97fd78f to 0486861
This pull request has been mentioned on NixOS Discourse. There might be relevant details there.
outputHashMode = "recursive";
outputHash = dependencies-sha256;
};
with lib;
Adding it once at the top level is better than adding with lib separately for both common and hadoop_3_3.
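For illustration, a minimal sketch of the suggestion (the bodies of common and hadoop_3_3 here are hypothetical stand-ins, not the PR's code):

{ lib }:

with lib; # a single top-level with, instead of one inside common and another inside hadoop_3_3

let
  # hypothetical stand-in for the shared definition;
  # licenses resolves through the top-level with lib
  common = { pname = "hadoop"; meta.license = licenses.asl20; };
in
{
  hadoop_3_3 = common // { version = "3.3.1"; };
}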
pname = "hadoop"; | ||
version = "3.3.1"; | ||
sha256 = "1b3v16ihysqaxw8za1r5jlnphy8dwhivdx2d0z64309w57ihlxxd"; | ||
untarDir = "${pname}-${version}"; |
untarDir is needed here; it's used by libPatches below.
- Make hadoop 3 the default hadoop package
  - hadoop_3 -> 3.3.1
  - hadoop_3_2 -> 3.2.2
  - hadoop_2 -> 2.10.1
The existing tests for HDFS and YARN only check whether the services come up and expose their web interfaces. The new combined hadoop test also checks whether the services and roles work together as intended. It spins up an HDFS+YARN cluster and submits a demo YARN application that uses the HDFS cluster for storage and the YARN cluster for compute.
mkdir -p $out/share/doc/hadoop
cp -dpR * $out/
mv $out/*.txt $out/share/doc/hadoop/
for n in $(find $out/lib/${untarDir}/bin -type f ! -name "*.*"); do
Why not use the bash builtin glob *.* instead? (Subshells are generally frowned upon for the added execution time and not-so-good error propagation.)
It's doing the opposite of *.* here: we want to exclude files with a . in the name. ! -name "*.*" passed to find means files that don't match *.*.
How about extended globbing then? !(*.*) (see https://www.linuxjournal.com/content/bash-extended-globbing).
You need to set shopt -s extglob before the expression.
I just tested, and it returns directories as well, so you will need to add a directory check in the for loop.
I feel the extglob is a little more readable, but it's mostly a style preference. If you feel strongly about keeping the find, just let me know.
Are you sure it returns directories? -type f ensures find only returns files. As for readability, maybe it's just a matter of familiarity, but I find find's syntax of "find in this path all files that don't match the name pattern *.*" more readable.
[illustris@illustris-thinkpad:/dev/shm]$ perf stat bash -c 'for n in $(find ./bin -type f ! -name "*.*"); do echo $n; done'
./bin/mapred
./bin/yarn
./bin/test-container-executor
./bin/container-executor
./bin/hadoop
./bin/oom-listener
./bin/hdfs
Performance counter stats for 'bash -c for n in $(find ./bin -type f ! -name "*.*"); do echo $n; done':
5.25 msec task-clock:u # 0.899 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
447 page-faults:u # 0.085 M/sec
9,854,259 cycles:u # 1.878 GHz
15,911,728 instructions:u # 1.61 insn per cycle
3,607,606 branches:u # 687.434 M/sec
103,533 branch-misses:u # 2.87% of all branches
0.005837019 seconds time elapsed
0.004626000 seconds user
0.001255000 seconds sys
The glob, on the other hand, returns a single line containing the paths of all matches. This then needs to be split, and each entry tested to see whether it's a file or a directory.
[illustris@illustris-thinkpad:/dev/shm]$ perf stat bash -c 'shopt -s extglob; for n in bin/!\(*.*\); do echo $n; done | tr " " "\n" | while read line; do [ -d $line ] || echo $line; done'
bin/container-executor
bin/hadoop
bin/hdfs
bin/mapred
bin/oom-listener
bin/test-container-executor
bin/yarn
Performance counter stats for 'bash -c shopt -s extglob; for n in bin/!\(*.*\); do echo $n; done | tr " " "\n" | while read line; do [ -d $line ] || echo $line; done':
5.33 msec task-clock:u # 1.019 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
720 page-faults:u # 0.135 M/sec
7,124,080 cycles:u # 1.337 GHz
10,574,787 instructions:u # 1.48 insn per cycle
2,309,161 branches:u # 433.261 M/sec
58,271 branch-misses:u # 2.52% of all branches
0.005228103 seconds time elapsed
0.003550000 seconds user
0.002430000 seconds sys
To me the extglob alternative seems a lot harder to read. If you have a better way of writing it, please let me know. In terms of performance, both are within the margin of error.
Sorry for not being clear. The extglob returns directories, and those need a test in the for loop.
I don't feel too strongly about the extglob.
Interesting that the performance is better with the find!
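For the record, a minimal sketch of what the extglob variant could look like in the builder snippet quoted above (the loop body is a placeholder; the PR keeps the find version):

postInstall = ''
  # sketch only, not the code under review
  shopt -s extglob
  for n in $out/lib/${untarDir}/bin/!(*.*); do
    # !(*.*) also matches directories, so filter them out explicitly;
    # find's -type f handled this in the original loop
    [[ -f $n ]] || continue
    echo "$n" # placeholder for the real per-binary processing
  done
'';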
I have one minor nit, but otherwise it's looking good.
Thank you for your contribution!
Motivation for this change
The current system of building hadoop creates a monolithic fixed-output "build-deps" derivation by running maven in a loop. This makes updating the packages and using custom builds of hadoop much more difficult, and it forces expensive full rebuilds for minor changes. Most of the difficulties in building the package from source stem from maven's unusual way of doing things, such as producing different checksums for the same files or downloading dynamically linked binaries at build time. The new package can directly accept upstream builds from apache, or binaries from your own custom builds.
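As a rough sketch of the new approach (the URL pattern and install steps are assumptions mirroring the snippets quoted above, not the PR's exact code):

{ stdenv, fetchurl }:

stdenv.mkDerivation rec {
  pname = "hadoop";
  version = "3.3.1";

  # consume the official binary tarball instead of re-running maven
  # inside a fixed-output "build-deps" derivation
  src = fetchurl {
    url = "mirror://apache/hadoop/common/hadoop-${version}/hadoop-${version}.tar.gz";
    sha256 = "1b3v16ihysqaxw8za1r5jlnphy8dwhivdx2d0z64309w57ihlxxd";
  };

  installPhase = ''
    mkdir -p $out/lib/${pname}-${version}
    cp -dpR * $out/lib/${pname}-${version}/
  '';
}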
The HDFS and YARN modules in their present state require too much manual configuration to spin up a cluster. The changes in this PR add many sane defaults that make it possible to start a cluster with very little manual configuration. See nixos/tests/hadoop/hadoop.nix for an example.
The existing tests for HDFS and YARN simply check whether the namenode, datanode, resourcemanager and nodemanager services start up and expose their web UIs. This is not enough to check whether the services are able to communicate, store data and run workloads. The newly added test covers this end to end: it brings up a combined HDFS+YARN cluster and runs a demo YARN application that uses HDFS for storage and YARN for compute.
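As a hedged sketch, the kind of minimal NixOS configuration these defaults aim to enable could look like this (the exact option paths are assumptions based on the module's structure, not copied from the PR):

{ ... }:
{
  services.hadoop = {
    # fs.defaultFS is the standard core-site key; host and port are illustrative
    coreSite."fs.defaultFS" = "hdfs://namenode:8020";
    hdfs.namenode.enable = true;
    hdfs.datanode.enable = true;
    yarn.resourcemanager.enable = true;
    yarn.nodemanager.enable = true;
  };
}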
Things done
Package:
- Updated the hadoop package to the latest 3.x release
- Added hadoop2, pointing to the latest hadoop 2.x release

Module:
- Removed HADOOP_HOME from service config, as it is now correctly set by the package
- Always set HADOOP_CONF_DIR

Tests

Todo

Future work
In its current state, the module doesn't make it easy to spin up an HA HDFS cluster with QJM. Usually this would require a series of manual steps to initialize the cluster. In subsequent PRs I'll try to make a 1-click deployment of a production-ready HA hadoop cluster possible.
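For context, an HA setup with QJM needs hdfs-site entries along these lines (standard Hadoop keys; the hostnames and the hdfsSite option path are illustrative assumptions), plus manual namenode formatting and standby bootstrapping, which is the part the module can't yet automate:

services.hadoop.hdfsSite = {
  # one nameservice with two namenodes and a three-node journalnode quorum
  "dfs.nameservices" = "ns1";
  "dfs.ha.namenodes.ns1" = "nn1,nn2";
  "dfs.namenode.shared.edits.dir" = "qjournal://jn1:8485;jn2:8485;jn3:8485/ns1";
};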
While building from source is very inconvenient with nix's currently limited support for maven, it would be nice to provide the option to build hadoop from source eventually.
- For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
- Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
- Tested basic functionality of all binary files (usually in ./result/bin/)