In gathering our requirements, the most common request by far was the ability to retrieve all logs that contain a given string, so a number of tools are available to do just that. These tools are generally installed in the path of an access node capable of full interaction with the Hadoop cluster.
logsearch - search for log lines which contain a literal string.
logmultisearch - search for lines which contain any or all of a list of literal strings.
loggrep - search for lines which match a regular expression.
logcat - returns all data for a given time range.
Each of these tools returns a formatted, sorted set of logs in plain text.
All of these tools use a common set of options, described here.
Note: All of these options support glob style matching. This includes '?' to match one character, '*' to match any set of characters, and more. More details are available in [the documentation.](http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path\))
DC is the data centre top-level code that allows you to delineate different geographical sources.

SERVICE is the name of the service to be searched. For example, web or web/pod1, and so on.

COMPONENT is the name of the component of the service that you're interested in. Some examples are app* and {webhead*,app*}.
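Since these arguments accept glob patterns, it can help to see how such patterns expand. As a rough local illustration (assuming bash, whose filename globbing and brace expansion behave similarly to Hadoop's globStatus for these simple patterns; the file names are made up):

```shell
# Local illustration only: ordinary files stand in for component names.
demo=$(mktemp -d)
cd "$demo"
touch app1.log app2.log webhead1.log

a=$(echo app*)              # '*' matches any set of characters
b=$(echo {webhead*,app*})   # braces match any of the listed alternatives
echo "$a"
echo "$b"
```

Here app* picks up app1.log and app2.log, while {webhead*,app*} picks up all three files.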
START and END are the start (inclusive) and end (exclusive) of the time range you want logs for. Each can be specified either as a number of milliseconds since 1970, or as a date string. The date strings are parsed by the date command, so there is a lot of flexibility there. Some examples of valid times are 1339684481000, '2012-02-28 12:00:00', '1 hour ago', '1pm last tuesday', and '1pm last tuesday EST'. Don't forget to put multi-word dates in quotes so they're not treated as a list of arguments.
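Since START and END also accept milliseconds since 1970, the same date command that parses the strings can generate those numbers for you. A small sketch (assumes GNU coreutils date, as found on a typical Linux access node):

```shell
# Turn a date string into milliseconds since 1970 (GNU date: %s%3N)
ms=$(date -u -d '1970-01-01 00:00:02' +%s%3N)
echo "$ms"    # 2000: two seconds after the epoch, in milliseconds

# Relative strings work the same way, e.g. to precompute a START value:
date -d '1 hour ago' +%s%3N
```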
OUTPUT_DIR is where, in HDFS, the output of the command will go. Relative paths are relative to your home directory. You can also redirect to standard out by using '-' as the target.
-dateFormat specifies what format dates should be written in. The options are RFC822, RFC3164, RFC5424 (the default), or any valid format string for FastDateFormat. Here are some examples of what the formats look like:

RFC822   2012-06-06T12:34:56.789+0000
RFC3164  Jun 06 12:34:56  (the RFC calls for a space-padded day, but our apps use a zero-padded day)
RFC5424  2012-06-06T12:34:56.789+00:00
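In the samples above, the visible difference between the RFC822 and RFC5424 outputs is the UTC offset style (+0000 versus +00:00). GNU date can produce both, which can be handy when building tooling around the output (this only illustrates the offset styles, not the tools' own formatter):

```shell
# RFC822-style offset (+0000) vs RFC5424-style offset (+00:00), via GNU date
date -u -d @0 '+%Y-%m-%dT%H:%M:%S%z'     # 1970-01-01T00:00:00+0000
date -u -d @0 '+%Y-%m-%dT%H:%M:%S%:z'    # 1970-01-01T00:00:00+00:00
```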
All of these tools write their results to a number of files in the output directory in HDFS (unless you specify standard out). To see the data in your console, you can use the command
hdfs dfs -cat OUTPUT_DIR/part-*
If you'd like to copy the output to your local disk, use
hdfs dfs -get OUTPUT_DIR/part-* LOCAL_DIR
logsearch is, by far, the most optimized and efficient of the tools. This tool will do a simple, straight string match and return lines that contain the requested string.
Example:
logsearch ERROR dc1 web/pod1 '*' '1 hour ago' 'now' demo
logsearch WARN dc1 web/pod1 '*' '16 hour ago' 'now' demo2
logsearch -i AUTH dc1 web/pod1 applog '2013-02-27 00:00:00' '2013-02-27 01:00:00' test-search
Usage: logsearch [OPTIONS] STRING DC SERVICE COMPONENT START END OUTPUT_DIR
Options:
-v Verbose output.
-i Make search case insensitive.
-r Force remote sort.
-l Force local sort.
-dateFormat=FORMAT Valid formats are RFC822, RFC3164 (zero padded day),
RFC5424 (default), or any valid format string for FastDateFormat.
-fieldSeparator=X The separator to use to separate fields in intermediate
files. Defaults to 'INFORMATION SEPARATOR ONE' (U+001F).
STRING
is the string to search for in the line. Lines which contain this string anywhere (excluding the timestamp at the start of the line) will be returned.
logmultisearch is a moderately optimized tool, designed to provide basic AND and OR search functionality. It is not as efficient as logsearch.
Example:
logmultisearch -v 'ERROR|WARN' dc1 web/pod1 applog '30 minutes ago' 'now' -
Usage: logmultisearch [OPTIONS] (STRINGS_DIR|STRINGS_FILE|STRING)
DC SERVICE COMPONENT START END OUTPUT_DIR
Note:
If OUTPUT_DIR is '-', then results are written to stdout.
Options:
-v Verbose output.
-i Make search case insensitive.
-r Force remote sort.
-l Force local sort.
-a Enable AND searching.
-dateFormat=FORMAT Valid formats are RFC822, RFC3164 (zero padded day),
RFC5424 (default), or any valid format string for FastDateFormat.
-fieldSeparator=X The separator to use to separate fields in intermediate
files. Defaults to 'INFORMATION SEPARATOR ONE' (U+001F).
logmultisearch takes the strings to search for in one of three different formats. If the given option is the name of a directory on the local filesystem, then all files in that directory are assumed to contain search strings, one per line. If the option is the name of a file on the local filesystem, then that file is assumed to contain all of the search strings, one per line. Otherwise, that argument is assumed to be the only search string. Any line matching one or more of the given search strings will be returned.
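For example, to search for several literal strings at once, you might put them in a file, one per line (the file name and terms here are purely illustrative):

```shell
# One literal search string per line
cat > /tmp/terms.txt <<'EOF'
ERROR
WARN
FATAL
EOF
wc -l < /tmp/terms.txt    # 3 search strings
```

On an access node you could then run something like: logmultisearch /tmp/terms.txt dc1 web/pod1 '*' '1 hour ago' 'now' demo. By default, any line containing at least one of the strings is returned; with -a, all of the strings must appear.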
loggrep is the most functional tool, supporting full regular expressions, but is by far the slowest of the three search tools.
Example:
loggrep -v 'URL*' dc1 web/pod1 applog '30 minutes ago' 'now' -
Usage: loggrep [OPTIONS] REGEX DC SERVICE COMPONENT START END OUTPUT_DIR
Note:
If OUTPUT_DIR is '-', then results are written to stdout.
Options:
-v Verbose output.
-i Make search case insensitive.
-r Force remote sort.
-l Force local sort.
-dateFormat=FORMAT Valid formats are RFC822, RFC3164 (zero padded day),
RFC5424 (default), or any valid format string for FastDateFormat.
-fieldSeparator=X The separator to use to separate fields in intermediate
files. Defaults to 'INFORMATION SEPARATOR ONE' (U+001F).
REGEX
is a Java style regular expression. Any log line that matches the regular expression will be returned. Note that the timestamp is not included in the line when checking for a match. For more information on Java regular expressions, see the JavaDoc for Pattern.
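One subtlety worth noting from the loggrep example above: 'URL*' is a regular expression, not a glob, so it means 'UR' followed by zero or more 'L' characters. A quick local illustration with grep (POSIX ERE behaves like a Java regex for this pattern; the log lines are made up):

```shell
# Both lines match: the regex 'URL*' needs only 'UR', since L* can match zero L's
printf 'fetched URI /a\nfetched URL /b\n' | grep -cE 'URL*'    # 2
```

To match the literal text 'URL' followed by anything, you would write 'URL.*' instead.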
The logcat tool provides a straight dump of all content within a given time range.
Example:
logcat -v dc1 web applogs '30 minutes ago' 'now' -
logcat -v dc1 web applogs '30 minutes ago' 'now' - | grep ERROR | wc -l
logcat -v dc1 web applogs '2013-01-16 12:00:00' '2013-01-17 12:00:00' -
Usage: logcat [OPTIONS] DC SERVICE COMPONENT START END OUTPUT_DIR
Note:
If OUTPUT_DIR is '-', then results are written to stdout.
Options:
-v Verbose output.
-r Force remote sort.
-l Force local sort.
-dateFormat=FORMAT Valid formats are RFC822, RFC3164 (zero padded day),
RFC5424 (default), or any valid format string for FastDateFormat.
-fieldSeparator=X The separator to use to separate fields in intermediate
files. Defaults to 'INFORMATION SEPARATOR ONE' (U+001F).
Sometimes you want more than just a dump of the logs. So let's look at how the tools work.
Each of the provided tools does the same two things:
- Call a MapReduce job, which writes logs in an intermediate format to a temp directory.
- Call a Pig job to format the timestamps, sort the log lines, and write the output to the final directory.
If you want to do something other than format the dates and sort the output, then you can follow this same pattern and substitute your own Pig script.
The tools listed above are all Perl scripts, and you can use them as a starting point for your own tools.
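The two-phase shape is easy to imitate. As a purely local sketch of the pattern (ordinary shell commands stand in for the MapReduce and Pig jobs; none of the paths, timestamps, or file names are real):

```shell
US=$(printf '\x1f')   # U+001F, the tools' default field separator
TMP=$(mktemp -d)      # stands in for the temp directory in HDFS

# Phase 1 (stand-in for the MapReduce job): unsorted intermediate records
printf '1339684482000\x1fhost1\x1fline B\n' > "$TMP/part-00000"
printf '1339684481000\x1fhost2\x1fline A\n' > "$TMP/part-00001"

# Phase 2 (stand-in for the Pig job): sort on the timestamp field and
# write the final output
sort -t "$US" -k1,1n "$TMP"/part-* > "$TMP/final.txt"
cat "$TMP/final.txt"
```

Swapping out phase 2 is where a custom Pig script would go if you want something other than the default format-and-sort behaviour.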