sampl
like "head" or "tail" but picks 10 random lines

sampl is a command-line tool for randomly picking a number of lines from
large data files. By default, it picks 10 lines, like the Unix commands
"head" and "tail".

sampl is useful for getting an idea of what's in a data file without
worrying whether the beginning of the file is representative of the rest.

sampl is fast because it doesn't scan the entire file(s); it simply picks
the lines found at random positions.

Installation
============

Requires a standard installation of OCaml.

    $ make
    $ make install     # Installation directory defaults to $HOME/bin.

PREFIX and BINDIR are supported, so to install sampl under /usr/local,
run:

    $ sudo make PREFIX=/usr/local install

Uninstallation:

    $ make uninstall

Usage
=====

"sampl" is typically used as a replacement for "head" on large data files
whose first records are often not representative of typical lines.

    $ ./sampl --help
    Usage: ./sampl [OPTIONS] FILE1 [FILE2 ...]
    sampl picks 10 random lines from the input files without scanning
    the entire file.
    Options:
      -i
              Keep outputting random lines indefinitely instead of stopping
              after 10 lines. See also -n.
      -n NUMBER
              Specify sample size, i.e. the number of lines to pick.
              The default is 10 lines. See also -i.
      -s NUMBER
              Specify seed of the random number generator.
              By default, the seed is initialized randomly from system resources.
      -version
              Print sampl version and exit.
      -help  Display this list of options
      --help  Display this list of options

Algorithm for picking a random line
===================================

1. A random byte position is chosen in the input files,
   just as if the files were concatenated as one big file.
2. The position is moved forward to the nearest beginning of a line.
3. The line is read and printed to stdout.
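
The three steps above can be sketched in OCaml roughly as follows. This is
an illustration only and is not taken from sampl's source code. It assumes a
single input file, always skips the remainder of the line in which the random
position lands, and retries when the position lands in the last line; sampl
itself may handle these cases differently. The names line_after, pick_line
and sample are hypothetical.

```ocaml
(* Illustrative sketch only; not sampl's actual implementation. *)

(* Steps 2 and 3: from byte offset [pos], skip the rest of the current line,
   then read the next complete line. Returns None when [pos] falls inside
   the last line of the file (assumption: no wrap-around). *)
let line_after (ic : in_channel) (pos : int) : string option =
  seek_in ic pos;
  match input_line ic with
  | exception End_of_file -> None
  | _rest_of_current_line ->
    (match input_line ic with
     | exception End_of_file -> None
     | line -> Some line)

(* Step 1 plus retry: choose a random byte offset in the file. Random.int64
   is used because Random.int rejects bounds of 2^30 or more, which large
   files exceed. Gives up after 100 attempts (arbitrary limit). *)
let pick_line (ic : in_channel) : string option =
  let size = in_channel_length ic in
  let rec attempt tries_left =
    if size = 0 || tries_left = 0 then None
    else
      let pos = Int64.to_int (Random.int64 (Int64.of_int size)) in
      match line_after ic pos with
      | Some _ as picked -> picked
      | None -> attempt (tries_left - 1)
  in
  attempt 100

(* Pick up to [n] lines from the file at [path]. The same line may appear
   more than once. The file is opened in binary mode so that byte offsets
   passed to seek_in are exact. *)
let sample (path : string) (n : int) : string list =
  let ic = open_in_bin path in
  let picked = List.init n (fun _ -> pick_line ic) in
  close_in ic;
  List.filter_map Fun.id picked
```

With these definitions, calling Random.self_init () and then
List.iter print_endline (sample "data.txt" 10) would print up to 10 lines
from a hypothetical file data.txt. Note that under the assumptions of this
sketch, the first line of the file is never returned.
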

This algorithm favors lines that follow long lines, because a long line
covers more byte positions than a short one. Therefore, the sample is
really "random" only if the length of a line is statistically independent
of any property of the line that follows.
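
To make this bias concrete, here is a small OCaml sketch that computes how
often each line would be picked in a simplified model. The model is an
assumption, not a description of sampl's exact behavior: a line is picked
whenever the random byte lands anywhere in the line before it (newline
included), and positions in the last line are retried. The function name
pick_probabilities is hypothetical.

```ocaml
(* Illustrative model only; not sampl's actual implementation. *)

(* Given the lines of a file in order, return each pickable line paired
   with its selection probability. The weight of a line is the byte length
   of the line before it, plus one for the newline. The first line has no
   predecessor, so it does not appear in the result. *)
let pick_probabilities (lines : string list) : (string * float) list =
  let rec pairs = function
    | prev :: (next :: _ as rest) ->
      (next, float_of_int (String.length prev + 1)) :: pairs rest
    | _ -> []
  in
  let weighted = pairs lines in
  let total = List.fold_left (fun acc (_, w) -> acc +. w) 0.0 weighted in
  List.map (fun (line, w) -> (line, w /. total)) weighted
```

For the four lines "a", "bbbbbbbbb", "c", "d", the weights are 2, 10 and 2
for the second, third and fourth lines, so "c", which follows the long
line, is picked five times as often as either of the other two.
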

Also, the same line can be picked several times during a single run of the
program.