Ad-hoc Log Summaries in Unix
The Unix command-line interface (CLI) is known for its simple commands that achieve powerful results in combination.
Below, I explore how we can summarize logs using common Unix commands.
While performance can vary by machine and tool version, I have found these tools to generally work well with logs hundreds of gigabytes in size. When building a query, I recommend working with a smaller sample of your logs first (head -n 1000 is your friend).
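For example, assuming the logs live in a file such as access.log (a placeholder name), a query can be refined against the first thousand lines before being run over the full file:
# Iterate on the query against a small sample first...
head -n 1000 access.log | grep -c "GET"
# ...then run the same pipeline over the whole file
grep -c "GET" access.log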
The Setup
For the following examples, I will simulate logs using this Zsh command:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}")
Outputting:
2022-12-19T23:44:33,357606000-08:00; GET A; duration=100
2022-12-19T23:44:33,364918000-08:00; GET B; duration=500
2022-12-19T23:44:33,372200000-08:00; GET C; duration=100
2022-12-19T23:44:33,379211000-08:00; GET D; duration=200
2022-12-19T23:44:33,386160000-08:00; GET B; duration=10
2022-12-19T23:44:33,393175000-08:00; GET C; duration=8
2022-12-19T23:44:33,399998000-08:00; GET C; duration=8
2022-12-19T23:44:33,406752000-08:00; GET D; duration=200
2022-12-19T23:44:33,413488000-08:00; GET D; duration=200
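The only non-obvious part is ${req:s/_/; duration=}: zsh's s modifier replaces the underscore in each item with '; duration=', so, for example, B_500 becomes B; duration=500. A quick sketch of the modifier on its own:
req=B_500
echo "${req:s/_/; duration=}"
# prints: B; duration=500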
The Basics: grep, sort, cut and uniq
A good first approach to counting requests is to use the following commands:
- grep/egrep to filter logs based on a regular expression (see the pre-filtering sketch below)
- cut to extract structured log parts
- sort to sort output for use by uniq and to sort the final results by frequency
- uniq to group unique results
For well-structured logs, cut can extract relevant log parts. We then pipe the output through sort and uniq -c to calculate the frequency, followed by sort -nr to sort results in descending order:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}") \
| cut -d ';' -f 2 \
| sort \
| uniq -c \
| sort -nr
3 GET D
3 GET C
2 GET B
1 GET A
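grep can also act as a pre-filter in front of the same counting pipeline. A minimal sketch, restricting the summary to B and C requests (the pattern is purely illustrative):
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}") \
| grep -E "GET (B|C);" \
| cut -d ';' -f 2 \
| sort \
| uniq -c \
| sort -nr
3 GET C
2 GET B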
If logs are less structured or their format is unknown, grep -o can extract the relevant information by printing only the part of each line that matches a regular expression:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}") \
| grep -Eo "GET [^;]+" \
| sort \
| uniq -c \
| sort -nr
This produces the same result:
3 GET D
3 GET C
2 GET B
1 GET A
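On real logs the resulting frequency table can be long; since sort -nr puts the most frequent entries first, appending head trims the output to a top-N summary (access.log is again a placeholder):
# Ten most frequent request types in a (hypothetical) access.log
grep -Eo "GET [^;]+" access.log \
| sort \
| uniq -c \
| sort -nr \
| head -n 10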
Stream editing using awk
awk, while slightly more complex, allows for more advanced log analysis. Here, we compute the average request duration per request type:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(gdate -Ins); GET ${req:s/_/; duration=}") \
| awk -F '[;=]' '
/GET .*/ {
duration[$2]+= $4
count[$2]++
}
END {
for (req in count)
print (duration[req] / count[req]), req
}' \
| sort -nr
255 GET B
200 GET D
100 GET A
38.6667 GET C
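The same pattern extends to other aggregates. A sketch (not part of the original example) that also tracks the maximum duration per request type:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}") \
| awk -F '[;=]' '
/GET .*/ {
duration[$2] += $4
count[$2]++
# keep the largest duration seen so far for this request type
if ($4 > max[$2]) max[$2] = $4
}
END {
for (req in count)
print (duration[req] / count[req]), max[req], req
}' \
| sort -nr
255 500 GET B
200 200 GET D
100 100 GET A
38.6667 100 GET C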
Stream editing using perl
Although Perl is not part of the POSIX standard, it provides more advanced regular expression handling than awk and also functions as a stream editor with the correct flags:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}") \
| perl -lane '
BEGIN { our (%durations, %counts) }
/(GET [^ ]+); duration=(\d+)/ and do {
$durations{$1} += $2;
$counts{$1} += 1;
};
END {
while (my ($req,$count)=each %counts) {
$avg=$durations{$req}/$count;
print "$avg $req";
}
}' \
| sort -nr
255 GET B
200 GET D
100 GET A
38.6666666666667 GET C
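As an example of that richer regular expression support, the same pipeline can use named captures (read back through the %+ hash), which are clearer than positional $1/$2. A sketch, not from the original one-liner:
(for req in A_100 B_500 C_100 D_200 B_10 C_8 C_8 D_200 D_200
echo "$(date -Ins); GET ${req:s/_/; duration=}") \
| perl -lne '
/(?<req>GET [^ ]+); duration=(?<dur>\d+)/ and do {
$durations{$+{req}} += $+{dur};
$counts{$+{req}}++;
};
END {
print $durations{$_} / $counts{$_}, " $_" for keys %counts;
}' \
| sort -nr
This produces the same output as the version above.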