[stats-dev] log loader

Jim C. Nasby decibel at distributed.net
Fri Apr 22 12:51:07 EDT 2005

On Fri, Apr 22, 2005 at 03:03:04AM -0500, Jeff Lawson wrote:
> > A few words about the pre-processor: if we use a C
> > pre-processor, it's obviously faster, but at the cost of
> > portability.
> The performance difference might not be that bad if you ensure that the
> limiting rate is the I/O of the database insertions, and not your other
> pre-processing activities.  For example, by using threading to continue to
> parse log lines while you're waiting for the database to do the bulk insert
> statement that is executing in another thread.  Of course threading in Perl
> is a rather rarely used feature and some consider it to still be a little
> experimental.

Nerf was actually suggesting that we just insert via Perl and not do a
bulk copy. IMO we should use a bulk copy, as it will be much faster.
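To make the bulk-copy option concrete, here is a minimal sketch of
turning parsed log rows into PostgreSQL's COPY text format (tab-delimited
fields, \N for NULL, backslash escapes). It is written in Python purely
for illustration; the column layout and sample rows are hypothetical and
not the real log schema.

```python
import io

def copy_escape(value):
    """Render one field in PostgreSQL COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    text = str(value)
    # Escape the backslash first so the later escapes are not doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def rows_to_copy_text(rows):
    """Join rows into a tab-delimited, newline-terminated COPY payload."""
    out = io.StringIO()
    for row in rows:
        out.write("\t".join(copy_escape(field) for field in row))
        out.write("\n")
    return out.getvalue()

# Hypothetical parsed rows: (timestamp, email, project_id, work_units)
rows = [
    ("2005-04-22 12:00:00", "user@example.com", 8, 12),
    ("2005-04-22 12:00:01", "tab\there@example.com", 8, None),
]
payload = rows_to_copy_text(rows)
```

The payload could then be streamed to the server in a single
COPY ... FROM STDIN, rather than issuing one INSERT per log line.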

I was also thinking about the possibility of threading, as well as
loading multiple logs at once when we're 'catching up'. The stats-proc
hourly process currently does this:

find logs (6 seconds)
scp (6 seconds)
bunzip (2 seconds)
filter (1-2 seconds)
copy (3 seconds)
integrate.sql (15 seconds)
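
As a rough illustration of the threading idea Jeff describes (keep
parsing while another thread is busy with the bulk load), here is a small
producer/consumer sketch in Python. The parser, the batch size, and the
"loader" that only records batches are all stand-ins I made up; a real
loader would hand each batch to the database instead.

```python
import queue
import threading

BATCH_SIZE = 3      # tiny for demonstration; a real loader would use far more
_DONE = object()    # sentinel telling the loader thread to stop

def parse_line(line):
    """Hypothetical parser: split a comma-separated log line into fields."""
    return tuple(line.rstrip("\n").split(","))

def loader(batches, sink):
    """Consumer thread: stands in for the bulk insert by recording batches."""
    while True:
        batch = batches.get()
        if batch is _DONE:
            break
        sink.append(batch)  # a real loader would send this to the database

def load_log(lines):
    """Parse lines in this thread while the loader thread consumes batches."""
    batches = queue.Queue(maxsize=2)  # bounded, so parsing cannot run far ahead
    loaded = []
    worker = threading.Thread(target=loader, args=(batches, loaded))
    worker.start()
    batch = []
    for line in lines:
        batch.append(parse_line(line))
        if len(batch) == BATCH_SIZE:
            batches.put(batch)
            batch = []
    if batch:
        batches.put(batch)  # flush the final partial batch
    batches.put(_DONE)
    worker.join()
    return loaded

result = load_log(["a,1\n", "b,2\n", "c,3\n", "d,4\n"])
```

The bounded queue is the design choice that matters here: if the
database is the slow side, the parser blocks instead of buffering the
whole log in memory.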

The 'integrate' step for the log database should be faster than it is on
stats. The 'find logs' step should be changed so that it reads the
directory into an array once; that way, when we're catching up, we don't
have to rescan it for every log. The scp step could also be sped up by
fetching multiple files at once (about 2 of its 6 seconds are connection
setup).
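
A quick sketch of what those two changes might buy when catching up,
again in Python for illustration. The file suffix is a made-up
placeholder, the filter step is taken at its 2-second upper bound, and
the estimate assumes the find and the scp connection setup are each paid
once per run instead of once per log, with everything else scaling
linearly.

```python
import os

def pending_logs(directory, suffix=".log.bz2"):
    """Read the directory once and return matching names in sorted order.

    The suffix is a hypothetical naming convention, not the real one.
    """
    return sorted(name for name in os.listdir(directory)
                  if name.endswith(suffix))

def catch_up_seconds(n_logs, batched):
    """Rough run time for n_logs, using the stage timings listed above."""
    find = 6
    scp_setup, scp_transfer = 2, 4     # the 6-second scp, split as described
    rest = 2 + 2 + 3 + 15              # bunzip, filter, copy, integrate.sql
    if batched:
        # Directory scan and connection setup are paid once per run.
        return find + scp_setup + n_logs * (scp_transfer + rest)
    return n_logs * (find + scp_setup + scp_transfer + rest)

naive = catch_up_seconds(10, batched=False)   # 340
better = catch_up_seconds(10, batched=True)   # 268
```

By this estimate the saving is about 8 seconds per additional log, which
is modest next to the 15 seconds spent in integrate.sql.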

I guess based on this maybe it doesn't make sense to screw around with
Jim C. Nasby, Database Consultant           decibel at distributed.net
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
