Apache Today


Apache Guide: Logging, Part 4 -- Log-File Analysis
By Rich Bowen. Sep 18, 2000, 15:11 UTC.


In the first sections of this series, I've talked about what goes into the standard log files, and how you can change the contents of those files.

This week, we're looking at how to get meaningful information back out of those log files.

The Challenge

The problem is that although there is an enormous amount of information in the log files, it's not much good to the people that pay your salary. They want to know how many people visited your site, what they looked at, how long they stayed, and where they found out about your site. All of that information is (or might be) in your log files.

They also want to know the names, addresses, and shoe sizes of those people, and, hopefully, their credit card numbers. That information is not in there, and you need to be able to explain to your employer that the only way to get it is to ask your visitors for it explicitly, and to be willing to be told 'no.'

What Your Log Files Can Tell You

There is a lot of information available to put in your log files, including the following:

Address of the remote machine
This is almost the same as "who is visiting my web site," but not quite. More specifically, it tells you the hostname or IP address of the machine that your visitor is connecting from.

Time of visit
When did this person come to my web site? This can tell you something about your visitors. If most of your visits come between the hours of 9 a.m. and 4 p.m., then you're probably getting visits from people at work. If it's mostly 7 p.m. through midnight, people are looking at your site from home.

Single records, of course, give you very little useful information, but across several thousand 'hits', you can start to gather useful statistics.

Resource requested
What parts of your site are most popular? Those are the parts that you should expand. Which parts of the site are completely neglected? Perhaps those parts of the site are just really hard to get to. Or, perhaps they are genuinely uninteresting, in which case you should spice them up a little. Of course, some parts of your site, such as your legal statements, are boring and there's nothing you can do about it, but they need to stay on the site for the two or three people that want to see them.

What's broken?
And, of course, your logs tell you when things are not working as they should be. Do you have broken links? Do other sites have links to your site that are not correct? Are some of your CGI programs malfunctioning? Is a robot overwhelming your site with thousands of requests per second? (Yes, this has happened to me. In fact, it's the reason that I did not get this article in on time last week!)
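All four of the items above sit in fixed fields of a standard Common Log Format entry, so a short script can pull them apart. Here's a minimal sketch in Python; the sample log line is invented for illustration:

```python
import re

# One Common Log Format entry looks like:
#   host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ '            # address of the remote machine
    r'\[(?P<time>[^\]]+)\] '             # time of the visit
    r'"(?P<request>[^"]*)" '             # resource requested
    r'(?P<status>\d{3}) (?P<bytes>\S+)'  # the status code shows what's broken
)

# A made-up sample entry:
line = '192.0.2.7 - - [18/Sep/2000:15:11:00 +0000] "GET /index.html HTTP/1.0" 200 5120'
m = CLF.match(line)
if m:
    print(m.group('host'), m.group('time'), m.group('request'), m.group('status'))
```

Real analysis tools do essentially this, just faster and over millions of lines, before aggregating the fields into reports.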

What your log files don't tell you

HTTP is a stateless, anonymous protocol. This is by design, and is not, at least in my opinion, a shortcoming of the protocol. If you want to know more about your visitors, you have to be polite and actually ask them, and be prepared not to get reliable answers. This is amazingly frustrating for marketing types. They want to know the average income, number of kids, and hair color of their target demographic, or something like that, and they don't like to be told that this information is simply not in the log files. It is quite beyond your control to get it out of them. Explain to them that HTTP is anonymous.

And even what the log files do tell you is occasionally suspect. For example, I have numerous entries in my log files from a machine on the AOL network that visited my web site today. But because of the way that AOL works, this might be one person visiting my site many times, or it might be many people visiting my site one time each. AOL does something called proxying, and you can see from the machine address that it is a proxy server. A proxy server is a machine that one or more people sit behind. They type an address into their browser, and the browser makes that request to the proxy server. The proxy server fetches the page (generating the log file entry on my web site) and then passes that page back to the requesting machine. This means that I never see the request from the originating machine, only the request from the proxy.

Another implication of this is that if, 10 minutes later, someone else sitting behind that same proxy requests the same page, they don't generate a log file entry at all. They type in the address, and that request goes to the proxy server. The proxy sees the request and thinks, "I already have that document in memory. There's no point asking the web site for it again." And so, instead of asking my web site for the page, it gives the client the copy that it already has. So not only is the address field suspect, but the number of requests is also suspect.
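The effect is easy to see with a handful of hypothetical entries: neither the raw hit count nor the distinct-address count is a true visitor count. (The host names below are invented.)

```python
# Three hypothetical hits; two arrive via the same proxy server,
# which may be hiding one user or a hundred.
hits = [
    'proxy.example.net - - [...] "GET /a.html HTTP/1.0" 200 120',
    'proxy.example.net - - [...] "GET /b.html HTTP/1.0" 200 340',
    'dialup.example.com - - [...] "GET /a.html HTTP/1.0" 200 120',
]
distinct = {h.split()[0] for h in hits}
print(len(hits))      # 3 raw hits
print(len(distinct))  # 2 distinct addresses; neither number is "visitors"
```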

So, Um, What Good are These Logs?

It might sound like the data that you receive is so suspect as to be useless. This is in fact not the case. It should just be taken with a grain of salt. The number of hits that your site receives is almost certainly not really the number of visitors that came to your site. But it's a good indication. And it still gives you some useful information. Just don't rely on it for exact numbers.

How Do I Get Useful Statistics?

So, to the real meat of all of this. How do you actually generate statistics from your Web-server logs?

There are two main approaches that you can take here. You can either do it yourself, or you can use one of the many existing applications that do it for you.

Unless you have custom log files that don't look anything like the Common Log Format, you should probably get one of the available apps out there. There are some excellent commercial products, and some really good free ones, so you just need to decide what features you are looking for.

So, without further ado, here are some of the great apps out there that can help you with this task.

The Analog web site claims that about 29 percent of all web sites that use any log analysis tool at all use Analog, and that this makes it the most popular log analysis tool in the world. This fascinated me in particular, because until last week I had never heard of it. I suppose that this is because I was happy with what I was using, and had never really looked for anything else.

The example report, which you can see on the Analog web site, seemed very thorough, and to contain all of the stats that I might want. In addition to the pages and pages of detailed statistics, there was a very useful executive summary, which will probably be the only part that your boss will really care about.

Another log analysis tool that I have been introduced to in the past few months is WebTrends. WebTrends provides astoundingly detailed reports on your log files, giving you all sorts of information that you did not know you could get out of these files. And there are lots of pretty graphs generated in the report.

WebTrends has, in my opinion, two counts against it.

The first is that it is really expensive; you can look up the actual price on their web site.

The other is that it is painfully slow. A 50MB log file from one site for which I am responsible (one month's traffic) took about 3 hours to grind through to generate the report. Admittedly, it's doing a heck of a lot of stuff. But, for the sake of comparison, the same log file took about 10 minutes using WWWStat. Some of this is just the difference between Perl's ability to grind through text files and C's ability. But 3 hours seemed a little excessive.

Now that I've mentioned it, WWWStat is the package that I've been using for about 6 years now. It's fast, full-featured, and it's free. What more could you want? You can get it from its web site, and there is a companion package (linked from the same page) that generates pretty graphs.

It is very easy to automate WWWStat so that it generates your log statistics every night at midnight, and then generates monthly reports at the end of each month.
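On a typical Unix system, a nightly cron entry is all that automation takes. The following crontab line is only a sketch; the wwwstat path, log location, and output file are placeholders, and you should check the package's own documentation for its actual invocation:

```
# minute hour day-of-month month day-of-week  command
# Run at midnight every night; paths are placeholders.
0 0 * * *  /usr/local/bin/wwwstat /var/log/apache/access_log > /var/www/stats/index.html
```

A second entry run on the first of each month can roll the daily output into a monthly report in the same way.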

It may not be as full-featured as WebTrends, but it has given me all the stats that I've ever needed.

Another fine product, Wusage, is now in version 7. I've used it on and off through the years, and have always been impressed by not only the quality of the software but also the amazing responsiveness of the technical support staff.

You can get Wusage from its vendor's web site.

Or, You Can Do it Yourself

If you want to do your own log parsing and reporting, the best tool for the task is going to be Perl. In fact, Perl's name (Practical Extraction and Report Language) is a tribute to its ability to extract useful information from logs and generate reports. (In reality, the name "Perl" came before the expansion of it, but I suppose that does not detract from my point.)

The Apache::ParseLog module, available from your favorite CPAN mirror, makes parsing log files simple, and so takes all the work out of generating useful reports from those logs.

For detailed information about how to use this module, install it and read the documentation. Once you have installed the module, you can get at the documentation by typing perldoc Apache::ParseLog.

Trolling through the source code for WWWStat is another good way to learn about Perl log file parsing.
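To give a flavor of the do-it-yourself route, here is a small "most popular pages" report. It's sketched in Python rather than Perl purely so the example stays short and self-contained; the log lines are invented, and a real script would read from your access_log instead:

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+'
)

def top_pages(lines, n=10):
    """Count successful GET requests per URL: a 'most requested pages' report."""
    tally = Counter()
    for line in lines:
        m = CLF.match(line)
        if not m or not m.group('status').startswith('2'):
            continue  # skip malformed lines and non-2xx responses
        request = m.group('request').split()
        if len(request) == 3 and request[0] == 'GET':
            tally[request[1]] += 1  # request[1] is the URL path
    return tally.most_common(n)

# Invented sample entries; the 404 is excluded from the popularity count.
sample = [
    '192.0.2.7 - - [18/Sep/2000:15:11:00 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.8 - - [18/Sep/2000:15:12:30 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '192.0.2.9 - - [18/Sep/2000:15:13:02 +0000] "GET /missing.html HTTP/1.0" 404 290',
]
for url, count in top_pages(sample):
    print(count, url)
```

The same loop, with the counter keyed on the status field or the hour of the visit instead of the URL, produces the "what's broken" and "when do people visit" reports discussed earlier.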

And that's about it

Not much more to say here. I'm sure that I've missed someone's favorite log parsing tool, and that's to be expected. There are hundreds of them on the market. It's really a question of how much you want to pay and what sort of reports you need.

Thanks for Listening

Thanks for reading. Let me know if there are any subjects that you'd like to see articles on in the future.


Related Stories:
E-Commerce Solutions: Template-Driven Pages, Part 2 (Sep 13, 2000)
Apache Guide: Logging, Part 3 -- Custom Logs (Sep 05, 2000)
Apache Guide: Logging, Part II -- Error Logs (Aug 28, 2000)
Apache Guide: Logging with Apache -- Understanding Your access_log (Aug 21, 2000)
Apache Guide: Apache Authentication, Part 3 (Aug 07, 2000)
Apache Guide: Apache Authentication, Part 1 (Jul 24, 2000)


Talkbacks:
  What about webalizer?
I have used Webalizer in the past; it seems much more user friendly than most log analysers. The reports are very concise and tell you everything you need to know. Check out its web site for more information.

Setup is a doddle.

It's also GPL copyrighted. 'just thought it deserved a mention.   
  Sep 18, 2000, 18:38:06
If you are unwilling to pay dollars, you can head over to webalizer and get that. It is available on pretty much any platform, being based on gd. It is highly configurable. It is very speedy (runs nicely on my P90). I advise you to get the 2.0 prerelease, though. GIF output is no longer supported due to the silly patent issues.
  Sep 18, 2000, 19:29:46
  Log file analysis and virtual hosting
Any tips on log file analysis in a virtual server environment?
What's the best way of going through dozens of different log files?   
  Sep 19, 2000, 13:11:19
  A little $$$ gets you nice ones

For a few bucks (less than $300) check out:

+ Summary

+ Urchin


-- Niraj
  Sep 19, 2000, 13:33:30
   Re: Log file analysis and virtual hosting
I've used webalizer on virtual servers with great success. It is pretty easy to install and configure, is free, and is simple to automate with a little Perl or shell script.
  Sep 25, 2000, 14:20:02
  another tool
How about WebAlizer? I have been using it for some time. It has nice graphing and reports. I like it somewhat more than Analog.   
  Sep 25, 2000, 15:53:25
I downloaded analog yesterday and found it incredibly easy to use.

The simple way is:
analog logfile.log
and it will produce a summary in html form.

It is customisable to the extent that analog basically becomes an API.
You can do something like 20 different reports / graphs, over time intervals (5 min - 1 year).

You can include / exclude files, dirs based on simple regexp.

It is very fast, taking between 2 min (p100 64 Mb RAM), and 10 sec (dual p500 1GB Ram) for 100Mb of log files.

It can include any format of log files, and you can specify custom formats.


Your output can be html, or simple text files for importing into a spreadsheet.

Anyway, it only took me a few hours to produce output in the *exact* format that I wanted, and I found a few new formats that I didn't know I needed.

  Sep 27, 2000, 23:02:44
  webalizer rules!
i struggled and struggled to figure out how to use analog. what a
pain in the butt for a mere mortal. then i stumbled onto
webalizer. i had it up and running in no time flat on my linux box
for virtual hosts.

the reports in webalizer are just as detailed as in analog plus they
are better looking and easier to read..

i got the v2 prerelease so i could do the reverse dns and it has run w/o
a hitch. don't waste your time with anything else. PLUS webalizer is a
floor wax AND a dessert topping!

  Sep 28, 2000, 03:20:06
   Re: Log file analysis and virtual hosting
Give weblog a go.

Set up and then make copies of it for each site.

Then when you are ready for your stats run, call each copy of in turn.   
  Oct 1, 2000, 00:41:16
   Re: Log file analysis and virtual hosting
> Any tips on log file analysis in a virtual server environment?

Once you've used the obvious "HostnameLookups Off" (which you probably already
do - it speeds up your Web serving immensely) and logging to separate log files
for each virtual host (much safer), you do end up with a real problem -
dozens (in my case, something like 900) sets of logs to analyse PLUS you have
to do after-the-fact DNS lookups because you've turned off lookups when
originally logging.

Hence, anything that doesn't have its own DNS cache file to cache lookups
(instead relying on direct gethostbyname() calls, often to the same sets of
hosts, night after night) is a worthless Web log analyser IMHO. And, no,
I don't use logresolve that comes with Apache for that very reason - it doesn't
cache DNS lookups - arrgh. Makes it hopeless really.

Yes, you *could* analyse your Web logs without doing any DNS lookups, but you
always find people want country/host info and whinge if you don't have it
(though "country" is dubious, what with many UK sites, for example, using .com).

I picked Analog myself, and over recent years have crafted a shell script
I use which does the following "unusual" things (so unusual, that I suspect
not many other people run Analog like this):

1. I mirror our Web logs from disparate live Web servers onto our intranet
(e.g. /net/logs/ / or whatever).

2. I run my master shell script from an NFS server - yes, the one that hosts /net/logs.

3. I nominate 6 of my fastest workstation boxes (it can be more) on the
intranet to run the master script in "client mode" (i.e. with params that
tell it that it's going to do an analysis of one domain). Yes, all workstations
mount /net/logs.

4. Using simple file locking, I then go down my list of domains to process
and remsh from the NFS server to the free workstations (backgrounded of
course), palming off each domain until all are busy.

5. I then wait for one to finish (lock file is removed) and remsh the next
domain to that freed up workstation - and then sit around in a checking loop
looking for the next freed up workstation.

To speed DNS lookups, I copy the master DNS cache file (from the end of last
night's run) onto the system disk of each workstation and point analog to
use that file for DNS cache lookups.

When all the runs are complete, I sort/merge the 6 workstation DNS files
(remember, analog updates them as well as reads from them)
and keep the 500,000 most recent lookups, which are copied back to the
master DNS cache file, ready for the next day's run. Yes, I've tried sharing
the master DNS cache file across NFS, but very slow to do so.

At the end of the day, I can do the 900 sites with DNS lookups in about 7 hours
(sounds slow, but some DNS lookups can "stall" for minutes remember),
which means the workstations are usually free before 9am, ready for our
intranet users to use them.

P.S. I hate this "slashcode" backslash bug when you use double quotes like
in this sentence. I did *not* type the backslashes !
  Oct 10, 2000, 21:39:12
I need to know when the log file starts over. I can tell from my server that it restarts at some point. Some of the files I have looked at start at 0000 GMT and others start at 2000 GMT, and I am not sure what causes the change. Is this a configurable option in httpd.conf, or is the restart time built into the daemon?

  Oct 13, 2000, 03:14:53
  FunnelWeb another good one
Another analyzer that is worth taking a look at is FunnelWeb. A bit cheaper than WebTrends, and it creates MUCH better looking graphs (important for the PHB [Dilbert reference :-)]) and has more output options (including PDF). It comes on multiple platforms (including Linux), and the Windows version had bug fix releases once every few weeks.
  Nov 10, 2000, 15:49:30
  NetTracker - Another Option
I would like to throw NetTracker out there as a possible analysis tool. I've had experience with WebTrends, Analog and Accrue, and I've found NetTracker to be one of the easiest to set up and administer and one of the best in the quality and quantity of reports and information it delivers.
  Jan 5, 2001, 01:25:21
  analog --- excellent !
I downloaded Analog 5.1 today, installed it, and used it, all in only 30 minutes.

It is really easy to use. I like it very much.

Thanks to all.
  Feb 1, 2002, 05:56:03

All times are recorded in UTC.
Copyright 2002 INT Media Group, Incorporated. All Rights Reserved.