TeraSort Benchmark

This repository provides an execution wrapper for the Hadoop TeraSort benchmark. TeraSort is a widely used benchmark for Hadoop distributions. As the name implies, it is a sorting benchmark. The TeraSort package includes 3 map/reduce applications: TeraGen (data generator), TeraSort (map/reduce sorting) and TeraValidate (sort validation). For more information see:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html

RUNTIME PARAMETERS

Use of run.sh requires the Hadoop cluster to be provisioned and accessible to the user via the 'hadoop' command. The following runtime parameters and environment metadata may be specified (using run.sh arguments):

* collectd_rrd: If set, collectd rrd stats will be captured from --collectd_rrd_dir. To do so, when testing starts, existing directories in --collectd_rrd_dir will be renamed to .bak, and upon test completion any directories not ending in .bak will be zipped and saved along with other test artifacts (as collectd-rrd.zip). The user MUST have sudo privileges to use this option (a sketch of this capture flow follows the parameter list)
* collectd_rrd_dir: Location where collectd rrd files are stored - default is /var/lib/collectd/rrd
* hadoop_conf_dir: Hadoop conf directory (see the 'settings_*' output parameters below). Default is [hadoop_home]/etc/hadoop
* hadoop_examples_jar: Path to the JAR file containing the TeraSort programs. Default is [hadoop_home]/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
* hadoop_heapsize: Optional explicit heap size in megabytes for the hadoop daemon. The value is assigned to the HADOOP_HEAPSIZE environment variable
* hadoop_home: The base directory where hadoop is installed. Default is /usr/local/hadoop. If [hadoop_home]/bin is not already included in PATH, it will be added
* meta_compute_service: The name of the compute service this test pertains to. May also be specified using the environment variable bm_compute_service
* meta_compute_service_id: The id of the compute service this test pertains to. Added to saved results. May also be specified using the environment variable bm_compute_service_id
* meta_cpu: CPU descriptor - if not specified, it will be set using the 'model name' attribute in /proc/cpuinfo
* meta_instance_id: The compute service instance type this test pertains to (e.g. c3.xlarge). May also be specified using the environment variable bm_instance_id
* meta_map_reduce_version: The map reduce version in place on the Hadoop cluster. Default is 2. Options are 1 or 2
* meta_memory: Memory descriptor - if not specified, the system memory size will be used
* meta_hdfs_nodes: Number of data nodes in the HDFS cluster. If not specified, an attempt will be made to determine this value using 'hdfs dfsadmin -report'. However, only the 'hdfs' user has access to this information via that command; if it is not accessible, this parameter defaults to 1
* meta_os: Operating system descriptor - if not specified, it will be taken from the first line of /etc/issue
* meta_provider: The name of the cloud provider this test pertains to. May also be specified using the environment variable bm_provider
* meta_provider_id: The id of the cloud provider this test pertains to. May also be specified using the environment variable bm_provider_id
* meta_region: The compute service region this test pertains to. May also be specified using the environment variable bm_region
* meta_resource_id: An optional benchmark resource identifier. May also be specified using the environment variable bm_resource_id
* meta_run_id: An optional benchmark run identifier. May also be specified using the environment variable bm_run_id
* meta_storage_config: Storage configuration descriptor. May also be specified using the environment variable bm_storage_config
* meta_storage_volumes: Number of storage volumes attached to each HDFS node
* meta_storage_volume_size: Size of each storage volume in GB
* meta_test_id: Identifier for the test. May also be specified using the environment variable bm_test_id
* no_purge: If set, 'teragen_dir', 'terasort_dir' and 'teravalidate_dir' will not be deleted upon test completion
* output: The output directory to use for writing test data (logs and artifacts). If not specified, the current working directory will be used
* tera_args: Optional -D arguments for teragen, terasort and teravalidate - multiple OK, e.g. "dfs.block.size=134217728". Any of the *_args parameters may contain the following tokens:
    {cpus}  : number of CPU cores on the host
    {nodes} : number of nodes in the hdfs cluster (per meta_hdfs_nodes)
    {rows}  : number of rows (teragen_rows)
    {gb}    : total size of the generated rows in GB
  Additionally, values may be an expression, e.g. mapred.map.tasks={cpus}*{nodes}, or a ternary expression - e.g. one reduce task per 64 GB of data, but no fewer than the number of cluster nodes: mapreduce.job.reduces={nodes}>({gb}/64)?{nodes}:({gb}/64) (see the example following this list)
* teragen_args: Optional -D arguments for teragen - multiple OK, e.g. "dfs.block.size=134217728"
* teragen_balance: When set, the hdfs cluster will be balanced to this threshold following execution of teragen (sudo -u hdfs hdfs balancer -threshold [teragen_balance]). The user must have sudo privilege. The balance operation can be very slow - to speed it up, increase the hdfs-site.xml property dfs.datanode.balance.bandwidthPerSec from its default of 8 Mb/s (1048576 bytes/sec) to something higher. The threshold value should be between 3 and 50 (see the balancer example following this list)
* teragen_dir: hdfs directory where teragen input should be written - if it already exists, it will be deleted prior to testing. Default is 'terasort-input'
* teragen_rows: Number of 100 byte rows to generate - default is 10 billion (= 1 TB). This parameter will be validated against the available HDFS capacity prior to test execution
* terasort_args: Optional -D arguments for terasort - multiple OK, e.g. "mapred.map.tasks=4"
* terasort_dir: hdfs directory where terasort output should be written - if it already exists, it will be deleted prior to testing. Default is 'terasort-output'
* teravalidate_args: Optional -D arguments for teravalidate - multiple OK
* teravalidate_dir: hdfs directory where teravalidate output should be written - if it already exists, it will be deleted prior to testing. Default is 'terasort-validate'
* verbose: Show verbose output
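The collectd capture flow described for the collectd_rrd parameter amounts to roughly the following shell sequence. This is an illustrative sketch of the behavior described above, not the harness's actual code:

    # set aside pre-existing rrd directories before testing (requires sudo)
    cd /var/lib/collectd/rrd
    for d in */; do sudo mv "$d" "${d%/}.bak"; done

    # ... benchmark runs; collectd writes fresh rrd directories ...

    # after testing, archive every directory that does not end in .bak
    sudo zip -r collectd-rrd.zip . -x '*.bak/*'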
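The tokens and expressions supported by the *_args parameters let -D settings scale with the environment. The following invocation combines the examples from the parameter descriptions above; the specific values are illustrative, not tuning recommendations:

    # 10 billion 100-byte rows = 1 TB of input data
    # map tasks scale with cores and nodes; reduce tasks = max({nodes}, {gb}/64)
    ./run.sh --teragen_rows 10000000000 \
      --tera_args "dfs.block.size=134217728" \
      --terasort_args "mapred.map.tasks={cpus}*{nodes}" \
      --terasort_args "mapreduce.job.reduces={nodes}>({gb}/64)?{nodes}:({gb}/64)"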
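When teragen_balance is set, the harness runs the balancer command shown in the parameter description. The equivalent manual steps are sketched below; hdfs dfsadmin -setBalancerBandwidth applies the same limit as the hdfs-site.xml property at runtime, and the 100 MB/s value shown is an illustrative assumption:

    # raise the per-datanode balancing bandwidth (in bytes/sec) at runtime;
    # same limit as the hdfs-site.xml property dfs.datanode.balance.bandwidthPerSec
    sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth 104857600

    # balance the cluster to a 10% utilization threshold, as run.sh
    # does when invoked with --teragen_balance 10
    sudo -u hdfs hdfs balancer -threshold 10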
DEPENDENCIES

This benchmark has the following dependencies:

* hadoop: Hadoop command - the cluster should be provisioned and enabled for the current user invoking run.sh
* hdfs: Used to interact with hdfs
* java: Java JDK
* php: Test automation scripts (/usr/bin/php)
* zip: Used to compress test artifacts
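A simple preflight check for these dependencies might look like the following; this is a convenience sketch, not part of the harness:

    # verify each required command is on the PATH before invoking run.sh
    for cmd in hadoop hdfs java php zip; do
      command -v "$cmd" >/dev/null 2>&1 || echo "missing dependency: $cmd"
    done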
TEST ARTIFACTS

This benchmark generates the following artifacts:

* collectd-rrd.zip: collectd RRD files (see --collectd_rrd)
* terasort-conf.zip: Compressed XML configuration files for teragen, terasort and teravalidate (teragen.xml, terasort.xml and teravalidate.xml)
* terasort-logs.zip: Compressed stdout, stderr and log files for teragen, terasort and teravalidate (teragen.out, teragen.log, terasort.out, terasort.log, teravalidate.out, teravalidate.log)

SAVE SCHEMA

The following columns are included in CSV files/tables generated by save.sh. Indexed MySQL/PostgreSQL columns are identified by *. Columns without descriptions are documented as runtime parameters above. Data types and indexing used are documented in save/schema/terasort.json. Columns can be removed using the save.sh --remove parameter (see the example following the USAGE section).

* benchmark_version: [benchmark version]
* collectd_rrd: [URL to zip file containing collectd rrd files]
* compression: [true if compression used (i.e. mapreduce.map.output.compress)]
* compression_codec: [intermediate compression codec class name - if used (i.e. mapreduce.map.output.compress.codec)]
* hadoop_heapsize
* hadoop_version: [Hadoop version]
* iteration: [iteration number (used with incremental result directories)]
* java_version: [version of java present]
* meta_compute_service
* meta_compute_service_id*
* meta_cpu: [CPU model info]
* meta_cpu_cache: [CPU cache]
* meta_cpu_cores: [# of CPU cores]
* meta_cpu_speed: [CPU clock speed (MHz)]
* meta_instance_id*
* meta_hdfs_nodes
* meta_hostname: [system under test (SUT) hostname]
* meta_memory
* meta_memory_gb: [memory in gigabytes]
* meta_memory_mb: [memory in megabytes]
* meta_os_info: [operating system name and version]
* meta_provider
* meta_provider_id*
* meta_region*
* meta_resource_id
* meta_run_id
* meta_storage_config*
* meta_storage_volumes
* meta_storage_volume_size
* meta_test_id*
* purge_dirs
* settings_core: [JSON string representing settings in core-site.xml]
* settings_hdfs: [JSON string representing settings in hdfs-site.xml]
* settings_mapred: [JSON string representing settings in mapred-site.xml]
* settings_yarn: [JSON string representing settings in yarn-site.xml]
* tera_args: [JSON]
* teragen_args: [JSON]
* teragen_balance
* teragen_blocksize: [TeraGen block size setting in MB]
* teragen_cmd: [TeraGen runtime command - including teragen_args]
* teragen_dir
* teragen_failed: [1 if teragen failed]
* teragen_gbs: [GB/s written during teragen]
* teragen_map_tasks: [# of TeraGen map tasks]
* teragen_reduce_tasks: [# of TeraGen reduce tasks]
* teragen_replication: [replication factor for teragen]
* teragen_rows
* teragen_time: [total run duration of TeraGen in seconds]
* terasort_args: [JSON]
* terasort_blocksize: [TeraSort block size setting in MB]
* terasort_cmd: [TeraSort runtime command - including terasort_args]
* terasort_conf: [URL to terasort-conf.zip (if --store option used)]
* terasort_dir
* terasort_failed: [1 if terasort failed]
* terasort_gbs: [GB/s processed during terasort]
* terasort_logs: [URL to terasort-logs.zip (if --store option used)]
* terasort_map_tasks: [# of TeraSort map tasks]
* terasort_map_time: [duration of the terasort map phase]
* terasort_reduce_p1: [seconds in reduce shuffle phase (0-33%)]
* terasort_reduce_p2: [seconds in reduce sort phase (34-66%)]
* terasort_reduce_p3: [seconds in reduce phase (67-100%)]
* terasort_reduce_tasks: [# of TeraSort reduce tasks]
* terasort_replication: [replication factor for terasort]
* terasort_time: [total run duration of TeraSort in seconds]
* teravalidate_args: [JSON]
* teravalidate_blocksize: [TeraValidate block size setting in MB]
* teravalidate_cmd: [TeraValidate runtime command - including teravalidate_args]
* teravalidate_dir
* teravalidate_failed: [1 if teravalidate failed]
* teravalidate_replication: [replication factor for teravalidate]
* teravalidate_time: [total run duration of TeraValidate in seconds]
* test_started*: [when the test started]
* test_stopped: [when the test ended]

USAGE

# run 1 test iteration with some metadata
./run.sh --meta_compute_service_id aws:ec2 --meta_instance_id c3.xlarge --meta_region us-east-1 --meta_test_id aws-1214

# run with an explicit hadoop_examples_jar
./run.sh --hadoop_examples_jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar

# run 3 test iterations using a specific output directory
for i in {1..3}; do mkdir -p ~/terasort-testing/$i; ./run.sh --output ~/terasort-testing/$i; done

# save.sh saves results to CSV, MySQL, PostgreSQL, BigQuery or via HTTP
# callback. It can also save artifacts (HTML, JSON and text results) to S3,
# Azure Blob Storage or Google Cloud Storage

# save results to CSV files
./save.sh

# save results from the 3 iteration test example above
./save.sh ~/terasort-testing

# save results to a PostgreSQL database
./save.sh --db postgresql --db_user dbuser --db_pswd dbpass --db_host db.mydomain.com --db_name benchmarks

# save results to BigQuery and artifacts to S3
./save.sh --db bigquery --db_name benchmark_dataset --store s3 --store_key THISIH5TPISAEZIJFAKE --store_secret thisNoat1VCITCGggisOaJl3pxKmGu2HMKxxfake --store_container benchmarks1234
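As noted in the SAVE SCHEMA section, columns can be dropped from saved results with the save.sh --remove parameter. The invocation below is a hypothetical sketch; whether --remove is repeated per column or accepts a delimited list is an assumption:

# save to CSV, omitting two metadata columns (column names from the schema above)
./save.sh --remove meta_cpu_cache --remove meta_memory_mb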