Command line tool

Preparation

The CLI is a command-line interface for working with Yosegi. The following tools are provided.

  • bin/setup.sh # gather the Yosegi-related jars
  • bin/yosegi.sh # create and show Yosegi data

yosegi.sh needs several jars, so build the jar files before using it.

$ mvn package

Then fetch the Yosegi jars and store them in the proper directories.

$ bin/setup.sh # get Yosegi jars from Maven repository (bin/setup.sh -h for help)

yosegi.sh

help

Output usage.

  • Example
$ ./bin/yosegi.sh help
setup  setup yosegi lib dir.
create create file.
cat read yosegi file.
schema view yosegi file schema.
fstats view yosegi file stats.
cstats view column stats.
stest run storage perfomance test.
help view help. 

create

Create a Yosegi file.

| args | Required | detail |
|---|---|---|
| -f,--format | true | Input data format. Currently only json is supported. |
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |
| -s,--schema | false | Schema for the input data, if the input data format requires one. |

Example of execution

$ bin/yosegi.sh create -i src/example/src/main/resources/sample_json.txt -f json -o /tmp/sample.yosegi
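The input file can also be generated from a script. The sketch below assumes, without confirmation from this document, that the json input format expects one JSON object per line, as the sample output of cat suggests. `write_json_lines` is a hypothetical helper, and the records mirror the sample data shown for cat.

```python
import json
import os
import tempfile


def write_json_lines(records, path):
    """Write one JSON object per line (assumed layout for `create -f json`)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Records mirroring the sample data shown in the cat example.
records = [
    {"summary": {"total_price": 550, "total_weight": 412},
     "number": 5, "price": 110, "name": "apple", "class": "fruits"},
    {"summary": {"total_price": 800, "total_weight": 600},
     "number": 10, "price": 80, "name": "orange", "class": "fruits"},
]

input_path = os.path.join(tempfile.gettempdir(), "sample_json.txt")
write_json_lines(records, input_path)
```

The resulting file could then be passed to create with -i, together with -f json and an output path.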

cat

Read a Yosegi file and output its contents.

| args | Required | detail |
|---|---|---|
| -e,--expand | false | Use expand function. |
| -f,--format | true | Output data format. Currently only json is supported. |
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |
| -p,--projection_pushdown <projection_pushdown> | false | Use projection pushdown. Format: "[ [ "column1" , "[column1-child]" , "column1-child-child" ] [ "column2" , ... ] ... ]" |
| -s,--schema | false | Schema for the output data, if the output data format requires one. |
| -x,--flatten | false | Use flatten function. |

Example of execution

$ bin/yosegi.sh cat -i /tmp/sample.yosegi -o '-' # show whole data
{"summary":{"total_price":550,"total_weight":412},"number":5,"price":110,"name":"apple","class":"fruits"}
{"summary":{"total_price":800,"total_weight":600},"number":10,"price":80,"name":"orange","class":"fruits"}

$ bin/yosegi.sh cat -i /tmp/sample.yosegi -o '-' -p '[["name"]]' # show part of data
{"name":"apple"}
{"name":"orange"}
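Hand-writing the JSON array for -p is easy to get wrong, so it can help to build it from a list. The sketch below is an illustration only: `projection_arg` and `parse_cat_output` are hypothetical helpers, and it assumes that each inner list is a path from a top-level column down to a nested child, as the format description indicates. The cat output is treated as one JSON object per line, matching the example above.

```python
import json


def projection_arg(column_paths):
    """Serialize column paths, e.g. [["summary", "total_price"], ["name"]],
    into the JSON array-of-arrays string passed to -p."""
    return json.dumps(column_paths, separators=(",", ":"))


def parse_cat_output(text):
    """Parse cat output (one JSON object per line) into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Argument list for: bin/yosegi.sh cat -i /tmp/sample.yosegi -o - -p '[["name"]]'
command = ["bin/yosegi.sh", "cat", "-i", "/tmp/sample.yosegi",
           "-o", "-", "-p", projection_arg([["name"]])]

# Output copied from the example above.
sample_output = '{"name":"apple"}\n{"name":"orange"}\n'
names = [row["name"] for row in parse_cat_output(sample_output)]
print(command[-1])
print(names)
```

Passing the command as a list (for example to `subprocess.run`) avoids shell quoting problems with the brackets and double quotes in the projection string.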

schema

This command outputs the schema of the Yosegi file.

| args | Required | detail |
|---|---|---|
| -f,--format | true | Output data format. Currently only hive is supported. |
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |

Example of execution

$ bin/yosegi.sh schema -i /tmp/sample.yosegi -o '-' -f hive
struct<summary:struct<total_price:int,total_weight:int>,number:int,price:int,name:string,class:string>
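The hive-format schema is a single nested type string, which is awkward to inspect by eye for wide tables. The following is a minimal sketch that turns it into nested dictionaries. `split_top_level` and `parse_hive_type` are hypothetical helpers; only `struct<...>` is expanded, and any other type (including `array<...>` or `map<...>`) is kept as a plain string.

```python
def split_top_level(fields):
    """Split on commas that are not nested inside <...>."""
    parts, depth, current = [], 0, []
    for ch in fields:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_hive_type(type_string):
    """Turn struct<...> into a nested dict; leave any other type as a string."""
    prefix = "struct<"
    if type_string.startswith(prefix) and type_string.endswith(">"):
        result = {}
        for field in split_top_level(type_string[len(prefix):-1]):
            name, _, field_type = field.partition(":")
            result[name] = parse_hive_type(field_type)
        return result
    return type_string


# Schema string copied from the example above.
schema = ("struct<summary:struct<total_price:int,total_weight:int>,"
          "number:int,price:int,name:string,class:string>")
parsed = parse_hive_type(schema)
print(parsed["summary"])
```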

fstats

Output statistics for the file.

| args | Required | detail |
|---|---|---|
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |

Example of execution

$ bin/yosegi.sh fstats -i /tmp/sample.yosegi  -o "-"
Line count=2 , Average record size=113.500000 , Average record real size=139.500000 , Average record per field=7.000000 , Field count=14 , Raw data size=227 , Real data size=279 , Logical data size=75 , cardinality=-1 , stats report count=1 , Average field size=16.214286 , Average field real size=19.928571 , Compress late=1.229075 , Average row count per stats report count=14.000000 , Average cardinality=-1.000000
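The fstats output is a single line of `key=value` pairs separated by " , ", which can be split into a dictionary for further processing. The sketch below uses a hypothetical `parse_stats` helper on a subset of the pairs from the example above. In this sample, "Compress late" equals Real data size divided by Raw data size, and "Average record size" equals Raw data size divided by Line count; these are observations from the sample values, not documented definitions.

```python
def parse_stats(line):
    """Parse 'key=value , key=value , ...' into a dict of floats."""
    stats = {}
    for part in line.split(" , "):
        key, _, value = part.partition("=")
        stats[key.strip()] = float(value)
    return stats


# A subset of the pairs from the fstats example above.
sample = ("Line count=2 , Average record size=113.500000 , Field count=14 , "
          "Raw data size=227 , Real data size=279 , Compress late=1.229075")
stats = parse_stats(sample)

# Ratios recomputed from the raw counters (relationships observed in the sample).
ratio = stats["Real data size"] / stats["Raw data size"]
average_record_size = stats["Raw data size"] / stats["Line count"]
print(round(ratio, 6), average_record_size)
```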

cstats

Output statistics for each column.

| args | Required | detail |
|---|---|---|
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |

Example of execution

$  bin/yosegi.sh cstats -i /tmp/sample.yosegi  -o "-"
/ROOT/summary<SPREAD> : Field count=2 , Raw data size=0 , Real data size=0 , Logical data size=0 , cardinality=-1 , stats report count=1 , Average field size=0.000000 , Average field real size=0.000000 , Compress late=NaN , Average row count per stats report count=2.000000 , Average cardinality=-1.000000
/ROOT/summary/total_price<INTEGER> : Field count=2 , Raw data size=48 , Real data size=48 , Logical data size=8 , cardinality=-1 , stats report count=1 , Average field size=24.000000 , Average field real size=24.000000 , Compress late=1.000000 , Average row count per stats report count=2.000000 , Average cardinality=-1.000000
/ROOT/summary/total_weight<INTEGER> : Field count=2 , Raw data size=48 , Real data size=48 , Logical data size=8 , cardinality=-1 , stats report count=1 , Average field size=24.000000 , Average field real size=24.000000 , Compress late=1.000000 , Average row count per stats report count=2.000000 , Average cardinality=-1.000000
/ROOT/number<INTEGER> : Field count=2 , Raw data size=48 , Real data size=48 , Logical data size=8 , cardinality=-1 , stats report count=1 , Average field size=24.000000 , Average field real size=24.000000 , Compress late=1.000000 , Average row count per stats report count=2.000000 , Average cardinality=-1.000000
/ROOT/price<INTEGER> : Field count=2 , Raw data size=48 , Real data size=48 , Logical data size=8 , cardinality=-1 , stats report count=1 , Average field size=24.000000 , Average field real size=24.000000 , Compress late=1.000000 , Average row count per stats report count=2.000000 , Average cardinality=-1.000000
/ROOT/name<STRING> : Field count=2 , Raw data size=25 , Real data size=77 , Logical data size=19 , cardinality=3 , stats report count=1 , Average field size=12.500000 , Average field real size=38.500000 , Compress late=3.080000 , Average row count per stats report count=2.000000 , Average cardinality=3.000000
/ROOT/class<STRING> : Field count=2 , Raw data size=10 , Real data size=10 , Logical data size=24 , cardinality=1 , stats report count=1 , Average field size=5.000000 , Average field real size=5.000000 , Compress late=1.000000 , Average row count per stats report count=2.000000 , Average cardinality=1.000000
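Each cstats line starts with the column path and its type in angle brackets, followed by " : " and the same `key=value` pairs. A minimal sketch for splitting such a line is shown below; `LINE_PATTERN` and `parse_column_line` are hypothetical names, and the pattern is inferred from the example output rather than from a format specification.

```python
import re

# <path><TYPE> : key=value , key=value ...  (layout inferred from the example)
LINE_PATTERN = re.compile(r"^(?P<path>/\S*)<(?P<type>[A-Z_]+)> : (?P<stats>.+)$")


def parse_column_line(line):
    """Return (column path, column type, stats dict) for one cstats line."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unexpected cstats line: " + line)
    stats = {}
    for part in match.group("stats").split(" , "):
        key, _, value = part.partition("=")
        stats[key.strip()] = float(value)  # float() also accepts "NaN"
    return match.group("path"), match.group("type"), stats


# A subset of the pairs from the /ROOT/name line in the example above.
sample = ("/ROOT/name<STRING> : Field count=2 , Raw data size=25 , "
          "Real data size=77 , Logical data size=19 , cardinality=3")
path, column_type, stats = parse_column_line(sample)
print(path, column_type, stats["Real data size"])
```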

stest

Measure reading and writing performance.

| args | Required | detail |
|---|---|---|
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -n,--maxSpreadCount <spread_count> | false | Max spread count. |
| -p,--projection_pushdown <projection_pushdown> | false | Use projection pushdown. |
| -o,--output | true | Output file path. "-" is standard output. |

The output format is as follows.

<Column Path> , <Name of ColumnBinaryMaker> , <Name of ICompressor> , <Row count> , <Raw data size> , <Real data size> , <Logical data size> , <Write CPU time(msec)> , <Read CPU time(msec)>

Example of execution

$ bin/yosegi.sh stest -i /tmp/sample.yosegi  -o "-"
/summary/total_price,jp.co.yahoo.yosegi.binary.maker.OptimizeDumpLongColumnBinaryMaker,jp.co.yahoo.dataplatform.yosegi.compressor.DefaultCompressor,2,22,22,4,1.089,1.548
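The nine comma-separated fields can be mapped onto named attributes following the output format given above. This is a sketch only: `StestRow` and `parse_stest_line` are hypothetical names, and the parser assumes that column paths and class names contain no commas.

```python
from typing import NamedTuple


class StestRow(NamedTuple):
    column_path: str
    column_binary_maker: str
    compressor: str
    row_count: int
    raw_data_size: int
    real_data_size: int
    logical_data_size: int
    write_cpu_msec: float
    read_cpu_msec: float


def parse_stest_line(line):
    """Parse one stest output line into a StestRow."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    return StestRow(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]), int(fields[5]), int(fields[6]),
        float(fields[7]), float(fields[8]),
    )


# Line copied from the example above.
sample = ("/summary/total_price,"
          "jp.co.yahoo.yosegi.binary.maker.OptimizeDumpLongColumnBinaryMaker,"
          "jp.co.yahoo.dataplatform.yosegi.compressor.DefaultCompressor,"
          "2,22,22,4,1.089,1.548")
row = parse_stest_line(sample)
print(row.column_path, row.write_cpu_msec, row.read_cpu_msec)
```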

merge

Merge Yosegi files.

| args | Required | detail |
|---|---|---|
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |
| -p,--projection_pushdown <projection_pushdown> | false | Use projection pushdown. |
| -x,--flatten | false | Use flatten function. |

Example of execution

$ bin/yosegi.sh merge -i "/tmp/sample.yosegi,/tmp/sample.yosegi"  -o "/tmp/merge_sample.yosegi"
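In the example above, the input files are passed to -i as a single comma-separated string. When the file list comes from a script, the argument can be assembled as in the sketch below. `merge_command` is a hypothetical helper, and rejecting paths that contain a comma is a precaution that assumes the separator cannot be escaped.

```python
def merge_command(input_paths, output_path, script="bin/yosegi.sh"):
    """Build the argument list for a merge of several Yosegi files."""
    if not input_paths:
        raise ValueError("at least one input file is required")
    for path in input_paths:
        if "," in path:
            raise ValueError("path contains the separator ',': " + path)
    return [script, "merge", "-i", ",".join(input_paths), "-o", output_path]


command = merge_command(
    ["/tmp/sample.yosegi", "/tmp/sample.yosegi"],
    "/tmp/merge_sample.yosegi",
)
print(" ".join(command))
```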

to_arrow

Create an Apache Arrow file from a Yosegi file.

| args | Required | detail |
|---|---|---|
| -e,--expand | false | Use expand function. |
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |
| -p,--projection_pushdown <projection_pushdown> | false | Use projection pushdown. |
| -x,--flatten | false | Use flatten function. |

Example of execution

$ ./bin/yosegi.sh to_arrow -i /tmp/sample.yosegi -o "/tmp/sample.arrow"

Example of using the data with [pyarrow](https://arrow.apache.org/docs/python/)

  • sample.py
import pyarrow as pa

# Open the Arrow file written by to_arrow.
reader = pa.ipc.open_file("/tmp/sample.arrow")

# Print the row count and the value counts of "name" for each record batch.
for i in range(reader.num_record_batches):
    rb = reader.get_batch(i)
    print(rb.num_rows)
    df = rb.to_pandas()
    print(df["name"].value_counts())

Running the command

$ python sample.py
2
apple     1
orange    1
Name: name, dtype: int64

from_arrow

Create a Yosegi file from an Apache Arrow file.

| args | Required | detail |
|---|---|---|
| -e,--expand | false | Use expand function. |
| -h,--help | false | Output usage. |
| -i,--input | true | Input file path. "-" is standard input. |
| -o,--output | true | Output file path. "-" is standard output. |

Example of execution

$ ./bin/yosegi.sh  from_arrow -i /tmp/sample.arrow -o /tmp/sample.yosegi