Add JOB benchmark dataset [1/N] (imdb dataset) #12497
Thanks @doupache for paving the way! I have a few nit suggestions.
Thanks @austin362667 for the suggestions. IMDB is more suitable than JOB as it's specific and avoids confusion. Job can be used in many different contexts. Adding 'progress' to the title is also a good idea 👍
Thanks @doupache -- I started the CI jobs, and I will try and test this out manually locally over the next few days
Thanks @austin362667 and @alamb. I have updated the PR and learned some Cargo tips from @austin362667.
#1
cd benchmarks && cargo build
#2
cargo run --bin imdb -- convert --input ./data/imdb/ --output ./data/imdb/ --format parquet
I also tested all 21 parquet files like the following. The schema is from the original dataset.
# create table
CREATE EXTERNAL TABLE name (
id INTEGER NOT NULL PRIMARY KEY,
name STRING NOT NULL,
imdb_index STRING,
imdb_id INTEGER,
gender STRING,
name_pcode_cf STRING,
name_pcode_nf STRING,
surname_pcode STRING,
md5sum STRING
)
STORED AS PARQUET
LOCATION '../benchmarks/data/imdb/name.parquet';
# read
SELECT * FROM name LIMIT 5;
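Repeating the check above for every table by hand is tedious, so here is a minimal sketch of how the statements could be generated instead. `emit_check_sql` is a hypothetical helper, not something this PR adds, and it assumes DataFusion can infer the schema from each parquet file, so the column list is left out.

```shell
# Hypothetical helper (not part of this PR): print a CREATE EXTERNAL TABLE and
# a small SELECT for every *.parquet file in the given directory.
# Assumption: the schema is inferred from the parquet file, so no columns are listed.
emit_check_sql() {
    local dir="$1" file table
    for file in "$dir"/*.parquet; do
        # Skip the unexpanded glob when the directory has no parquet files.
        [ -e "$file" ] || continue
        # Table name is the file name without the .parquet extension.
        table="$(basename "$file" .parquet)"
        printf "CREATE EXTERNAL TABLE \"%s\" STORED AS PARQUET LOCATION '%s';\n" "$table" "$file"
        printf "SELECT * FROM \"%s\" LIMIT 5;\n" "$table"
    done
}
```

One possible use is `emit_check_sql ./data/imdb > check.sql`, then feeding `check.sql` to a SQL client such as datafusion-cli.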
Sure! I think maybe it's because historical reasons that |
Thanks @doupache and @austin362667 -- I tried this out and it worked great locally!
I ran it locally and it made a bunch of parquet files 👍
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks/data/imdb$ ls -l
total 12672680
-rw-r--r--@ 1 andrewlamb staff 70M May 8 2014 aka_name.csv
-rw-r--r--@ 1 andrewlamb staff 32M Sep 23 13:37 aka_name.parquet
-rw-r--r--@ 1 andrewlamb staff 37M May 8 2014 aka_title.csv
-rw-r--r--@ 1 andrewlamb staff 15M Sep 23 13:37 aka_title.parquet
-rw-r--r--@ 1 andrewlamb staff 1.3G May 8 2014 cast_info.csv
-rw-r--r--@ 1 andrewlamb staff 351M Sep 23 13:37 cast_info.parquet
-rw-r--r--@ 1 andrewlamb staff 206M May 8 2014 char_name.csv
-rw-r--r--@ 1 andrewlamb staff 105M Sep 23 13:37 char_name.parquet
-rw-r--r--@ 1 andrewlamb staff 45B May 8 2014 comp_cast_type.csv
-rw-r--r--@ 1 andrewlamb staff 517B Sep 23 13:37 comp_cast_type.parquet
-rw-r--r--@ 1 andrewlamb staff 17M May 8 2014 company_name.csv
-rw-r--r--@ 1 andrewlamb staff 8.5M Sep 23 13:37 company_name.parquet
-rw-r--r--@ 1 andrewlamb staff 92B May 8 2014 company_type.csv
-rw-r--r--@ 1 andrewlamb staff 650B Sep 23 13:37 company_type.parquet
-rw-r--r--@ 1 andrewlamb staff 2.3M May 8 2014 complete_cast.csv
-rw-r--r--@ 1 andrewlamb staff 1.1M Sep 23 13:37 complete_cast.parquet
-rw-r--r--@ 1 andrewlamb staff 1.2G Sep 23 13:32 imdb.tgz
-rw-r--r--@ 1 andrewlamb staff 1.9K May 8 2014 info_type.csv
-rw-r--r--@ 1 andrewlamb staff 1.9K Sep 23 13:37 info_type.parquet
-rw-r--r--@ 1 andrewlamb staff 3.6M May 8 2014 keyword.csv
-rw-r--r--@ 1 andrewlamb staff 2.0M Sep 23 13:37 keyword.parquet
-rw-r--r--@ 1 andrewlamb staff 85B May 8 2014 kind_type.csv
-rw-r--r--@ 1 andrewlamb staff 605B Sep 23 13:37 kind_type.parquet
-rw-r--r--@ 1 andrewlamb staff 261B May 8 2014 link_type.csv
-rw-r--r--@ 1 andrewlamb staff 767B Sep 23 13:37 link_type.parquet
-rw-r--r--@ 1 andrewlamb staff 89M May 8 2014 movie_companies.csv
-rw-r--r--@ 1 andrewlamb staff 25M Sep 23 13:37 movie_companies.parquet
-rw-r--r--@ 1 andrewlamb staff 919M May 8 2014 movie_info.csv
-rw-r--r--@ 1 andrewlamb staff 293M Sep 23 13:37 movie_info.parquet
-rw-r--r--@ 1 andrewlamb staff 34M May 8 2014 movie_info_idx.csv
-rw-r--r--@ 1 andrewlamb staff 11M Sep 23 13:37 movie_info_idx.parquet
-rw-r--r--@ 1 andrewlamb staff 89M May 8 2014 movie_keyword.csv
-rw-r--r--@ 1 andrewlamb staff 27M Sep 23 13:37 movie_keyword.parquet
-rw-r--r--@ 1 andrewlamb staff 641K May 8 2014 movie_link.csv
-rw-r--r--@ 1 andrewlamb staff 274K Sep 23 13:37 movie_link.parquet
-rw-r--r--@ 1 andrewlamb staff 306M May 8 2014 name.csv
-rw-r--r--@ 1 andrewlamb staff 135M Sep 23 13:37 name.parquet
-rw-r--r--@ 1 andrewlamb staff 381M May 8 2014 person_info.csv
-rw-r--r--@ 1 andrewlamb staff 143M Sep 23 13:37 person_info.parquet
-rw-r--r--@ 1 andrewlamb staff 160B May 8 2014 role_type.csv
-rw-r--r--@ 1 andrewlamb staff 646B Sep 23 13:37 role_type.parquet
-rw-r--r--@ 1 andrewlamb staff 4.2K Nov 28 2014 schematext.sql
-rw-r--r--@ 1 andrewlamb staff 194M May 8 2014 title.csv
-rw-r--r--@ 1 andrewlamb staff 88M Sep 23 13:37 title.parquet
The only thing I think we should do is add imdb to the list of benchmarks in the bench.sh help text, but we can do that as a follow-on PR
**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch: TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem: TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
parquet: Benchmark of parquet reader's filtering speed
sort: Benchmark of sorting speed
clickbench_1: ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended: ClickBench "inspired" queries against a single parquet (DataFusion specific)
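The directory listing earlier in this comment pairs every CSV with a parquet file. A quick way to confirm that after a conversion run could look like the sketch below. `missing_parquet` is a hypothetical helper, not part of bench.sh, and it only checks that a file with the matching name exists, not that its contents are valid.

```shell
# Hypothetical helper (not part of bench.sh): list every CSV in a directory
# that has no parquet file with the same base name next to it.
missing_parquet() {
    local dir="$1" csv
    for csv in "$dir"/*.csv; do
        # Skip the unexpanded glob when the directory has no CSV files.
        [ -e "$csv" ] || continue
        # Print the CSV name only when the matching .parquet file is absent.
        [ -f "${csv%.csv}.parquet" ] || basename "$csv"
    done
}
```

Running `missing_parquet benchmarks/data/imdb` prints nothing when every table was converted.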
Thanks @doupache for the contribution and @alamb for the review~
I'll:
1. add imdb help text, and fix what @andygrove pointed out in PR [2/N] #1252
2. use UInt32 for non-negative ids
3. use a single context
Let's keep improving things in the next PR. Thanks @austin362667
* imdb dataset
* cargo fmt
* we should also extract the tar after download
* we should not skip last col
Which issue does this PR close?
Partially closes #12311
cd benchmarks/
./bench.sh data imdb
All IMDB tables are now generated in benchmarks/data/imdb/*.parquet
Rationale for this change
Add the IMDB dataset for JOB benchmarking.
What changes are included in this PR?
Download the dataset and convert it to parquet files.
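As a rough illustration of the extract step (one of the commits notes that the tar must also be extracted after download), a minimal sketch follows. It is not the code in bench.sh: `extract_if_needed`, its arguments, and its messages are hypothetical, and it assumes the archive is a gzip-compressed tar with the CSV files at its top level.

```shell
# Hypothetical sketch (not the bench.sh implementation): unpack the downloaded
# archive into the data directory unless CSV files are already there.
extract_if_needed() {
    local archive="$1" dest="$2"
    mkdir -p "$dest"
    # Treat the presence of any CSV file as "already extracted".
    if ls "$dest"/*.csv > /dev/null 2>&1; then
        echo "skip: CSV files already present in $dest"
        return 0
    fi
    # Assumption: the archive is a gzip-compressed tar.
    tar -xzf "$archive" -C "$dest"
    echo "extracted $archive into $dest"
}
```

Skipping when CSVs already exist keeps a repeated run cheap, at the cost of not noticing a partially extracted archive.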
Are these changes tested?
Yes, manually. Just as the tpch dataset is generated with:
./bench.sh data tpch
the imdb dataset is generated by running:
./bench.sh data imdb
Are there any user-facing changes?
no