[nycbug-talk] Any web stat program that collects data on time to serve

Edward Capriolo edlinuxguru at gmail.com
Tue Aug 2 13:23:14 EDT 2011


On Tue, Aug 2, 2011 at 12:52 PM, Jesse Callaway <bonsaime at gmail.com> wrote:

>
>
> On Tue, Aug 2, 2011 at 10:14 AM, Edward Capriolo <edlinuxguru at gmail.com>wrote:
>
>>
>>
>> On Mon, Aug 1, 2011 at 5:16 PM, Jesse Callaway <bonsaime at gmail.com>wrote:
>>
>>>
>>>
>>> On Mon, Aug 1, 2011 at 5:12 PM, Chris Snyder <chsnyder at gmail.com> wrote:
>>>
>>>> On Mon, Aug 1, 2011 at 5:08 PM, Chris Snyder <chsnyder at gmail.com>
>>>> wrote:
>>>> >
>>>> > As you've discovered, Apache doesn't log the request separate from the
>>>> > response, so a log analyzer is no help here.
>>>>
>>>> But wait -- this isn't strictly true. Apache can be made to log the
>>>> time taken to serve the request, in microseconds. It just doesn't do
>>>> so in the standard log format.
>>>>
>>>> http://httpd.apache.org/docs/2.2/mod/mod_log_config.html#formats
>>>>
>>>> But getting awstats or another log analyzer to pay attention is another
>>>> story.
>>>
>>>
>>> correctamundo...
>>>
>>> Gotta go with SEC (simple event correlator) or collectd for the easiest
>>> way. Otherwise you're writing your own filter program for the Apache
>>> logs... which, I mean, is kinda cool: you can just put a pipe character
>>> at the start of the logfile name, like in Perl, and Apache feeds the log
>>> lines to your program. But...
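>>>
>>> A minimal sketch of the pipe trick (the filter script and the "timed"
>>> format nickname are hypothetical; the leading "|" in the CustomLog
>>> target is stock Apache syntax):
>>>
>>>     # httpd.conf -- assumes a LogFormat nicknamed "timed" that ends in
>>>     # %D; Apache spawns the program and writes every log line to its
>>>     # stdin instead of appending to a file
>>>     CustomLog "|/usr/local/bin/tts-filter.sh" timed
>>>
>>>     #!/bin/sh
>>>     # /usr/local/bin/tts-filter.sh -- keep only requests that took
>>>     # over a second (%D, in microseconds, is the last field)
>>>     exec awk '$NF > 1000000' >> /var/log/httpd/slow-requests.log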
>>>
>>>
>>>
>>> --
>>> -jesse
>>>
>>
>> To be clear I am using:
>>
>>     LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %T %D" with_time
>>     CustomLog /opt/awstats-7.0/wwwroot/cgi-bin/gui-access-perf.log with_time
>>
>> %T is the time taken to serve in whole seconds; %D is the same in
>> microseconds.
>>
>>
>> http://httpd.apache.org/docs/2.2/mod/mod_log_config.html#formats
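>>
>> A request logged in that format comes out something like this (the
>> request itself is made up; the point is that %T and %D land in the last
>> two fields):
>>
>>     10.0.0.5 - - [02/Aug/2011:10:14:00 -0400] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0" 0 7210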
>>
>> I know I can script something and make my own report, but I really do not
>> want to. I find that when you write these things yourself you end up taking
>> care of them indefinitely. I was hoping to find some tool that would break
>> down %D by page: hits, average(time_to_serve), max(time_to_serve), 95th
>> percentile(time_to_serve).
>>
>>
> Could you commit to the Apache SNMP module? Then you might possibly be able
> to pawn off maintenance at some point. Er... nah, that wouldn't really work
> per-page. Just thinking out loud.
>
> --
> -jesse
>

I am not trying to show off, but I like closing up threads. I buckled and
just wrote it myself :(. I used a mix of shell, Hadoop, and Hive. This is
the gist of it:

# contents of produce_tts_stats.sh (run with: sh produce_tts_stats.sh)

# pull the request URL ($7) and the trailing %D field ($NF) out of the log
awk '{ print $7 "\t" $NF }' gui-access-perf.log > gui-access-perf_1

# swap the fresh extract into the table's warehouse directory
hadoop dfs -rm /user/hive/warehouse/edward.db/time_to_serve/gui-access-perf_1
hadoop dfs -copyFromLocal gui-access-perf_1 /user/hive/warehouse/edward.db/time_to_serve

# one-time step: create the table the extract gets loaded into
hive -e "use edward; create table time_to_serve (url string, tts bigint)
row format delimited fields terminated by '\t';"

hive -e "
use edward;
set mapred.map.tasks=1;
set hive.cli.print.header=true;
select url,count(1) as count, max(tts) as tts_max ,min(tts) as tts_min
,avg(tts) as tts_avg from time_to_serve group by url order by tts_avg limit
4000000;
" > outfile

[edward at etl02 ~]$ head outfile
url                            count  tts_max  tts_min  tts_avg
/                              21429  39520    37       72.10341126510804
/robots.txt                    1      74       74       74.0
/w00tw00t.at.ISC.SANS.DFind:)  1      77       77       77.0
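
For the 95th percentile I originally asked about, Hive's built-in
percentile() UDAF should drop into the same kind of query (a sketch I have
not run against this table; percentile() wants an integral column, which is
why tts is a bigint):

hive -e "
use edward;
select url, percentile(tts, 0.95) as tts_p95
from time_to_serve
group by url;
"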

It is a couple more steps to wire up with cron, and there was not really
enough data to justify distributed computing. Hive was a nice fit though,
because it handled all the grouping and aggregation I did not want to code
up by hand.
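
The cron side is just an entry along these lines (schedule and paths are
made up):

# crontab -- rebuild the time-to-serve report nightly at 01:00
0 1 * * * sh /home/edward/produce_tts_stats.sh >/dev/null 2>&1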

Edward

