[stats-dev] log loader

Jim C. Nasby decibel at distributed.net
Fri Apr 22 13:13:43 EDT 2005


On Thu, Apr 21, 2005 at 07:13:02PM -0500, Chris Hodson wrote:
> Recent activity has spurred me to think about this project again.  For
> those of you who just joined us (or have short memories), the idea is to
> have a database that mimics the raw logs that the master has.  There are
> many reasons this would be useful.
> 
> I'm going to recap some of the issues that have been decided and some
> that (in my mind) haven't.  Feel free to jump in with opinions about
> anything at any time.
> 
> Decided:
> 	* Single table for all entries

Actually, I was thinking the other day that this would be a good place
to use table inheritance. My idea is that we'd have a main table that
has all the info common to every project (i.e., participant_id, client
version, result timestamp, etc.), and each project or each grouping of
projects (e.g., OGR) would get a table that inherits from the common
table and adds its own fields as needed. This also has the benefit of
giving us per-project partitioning (the common table wouldn't actually
have any data in it). If we do per-project tables, we can also
eliminate storing project_id, which at 4 bytes per row starts to add up.
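
As a rough sketch of the inheritance layout, here is a small Python
helper that prints the kind of DDL this would mean. Every table and
column name in it (log_common, log_ogr, stub_id, nodecount, the column
types) is made up for illustration and is not an agreed schema; only
the INHERITS structure is the point.

```python
# Hypothetical names and types throughout; not an agreed schema.
COMMON_COLUMNS = [
    ("participant_id", "integer NOT NULL"),
    ("client_version", "integer NOT NULL"),
    ("result_time", "timestamp NOT NULL"),
]


def parent_ddl(name="log_common"):
    """DDL for the common table that holds only the shared columns."""
    cols = ",\n".join(f"    {col} {typ}" for col, typ in COMMON_COLUMNS)
    return f"CREATE TABLE {name} (\n{cols}\n);"


def child_ddl(name, extra_columns, parent="log_common"):
    """DDL for a per-project table: inherits the common columns and adds
    its own. No project_id column, since the table itself implies it."""
    cols = ",\n".join(f"    {col} {typ}" for col, typ in extra_columns)
    return f"CREATE TABLE {name} (\n{cols}\n) INHERITS ({parent});"


if __name__ == "__main__":
    print(parent_ddl())
    print(child_ddl("log_ogr", [("stub_id", "integer"),
                                ("nodecount", "bigint")]))
```

With this layout, rows are loaded into the child tables only, and a
query against the parent table also reads the children.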

> 	* Email, and client version will be normalized
> 	* All records (even hackers, worms, etc) will be included
> 	* Program to do the loading will be written in perl
> 
> Open issues:
> 	* How much pre-processing?  Sanity only?
Only enough pre-processing to throw out bad logs. We do need to
pre-process because of emails with commas, and other changes in log
formats. It shouldn't be hard to just modify the existing logmod.cpp for
this. Basically it just needs to output all fields instead of just some,
and not strip time info out of the logs.
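
To illustrate the comma problem only (this is not logmod.cpp, and the
field layout is an assumption): if the email is the one field that can
contain commas, a line can be split from both ends and whatever is left
in the middle is the email. The field counts and the sample line below
are hypothetical.

```python
def split_log_line(line, fields_before=2, fields_after=4):
    """Split a comma-separated log line whose email may contain commas.

    Assumes (hypothetically) that the email is the only field that can
    hold commas, with `fields_before` fields in front of it and
    `fields_after` fields behind it. Returns the list of fields, or
    None if the line has too few fields to be valid.
    """
    line = line.rstrip("\r\n")
    # Peel the leading fields off the front.
    head = line.split(",", fields_before)
    if len(head) <= fields_before:
        return None
    # Peel the trailing fields off the back; the remainder is the email.
    tail = head[-1].rsplit(",", fields_after)
    if len(tail) <= fields_after:
        return None
    return head[:fields_before] + tail


if __name__ == "__main__":
    sample = "04/22/05 13:13:43,10.0.0.1,foo,bar@example.com,25,1234,1,90"
    print(split_log_line(sample))
    print(split_log_line("truncated,line"))
```

Lines that come back as None are the ones to throw out as bad logs.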

> 	* Should the pre-processor be written in C or part of the perl program?
C

> 	* Is there any daily processing to be done?
For now I think daily is fine, but we should design with hourly in mind.

> 	* Where will the lookup info be stored?  E.g., will it use the same email -> id lookup table that the rest of stats uses?

No. It needs to be different because this database won't take retires
into account, and it shouldn't throw out bad email addresses.

> 	* Will this be independent of any other stats work or will it fit in?

Eventually stats could feed off this, and probably should.

> A few words about the pre-processor; If we use a C pre-processor, it's
> obviously faster, but at a cost of portability.  This portability loss
> is both in having to compile the C program (not a huge deal), but also
> in the database loading.  If we go with the perl solution, it would make
> sense to use DBI and load the data while it's already split and in
> memory.  The advantage is that there are DBD modules that can talk to
> many RDBMSs (including ODBC) which would ease the adoption by anyone
> outside of dnet who wanted to start doing their own logging on a
> perproxy.  Just my $.02.

We don't want to use perl to put data in the database. Bulk loading
with COPY will be much faster than inserting row by row.
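
For the COPY route, the pre-processor only has to emit rows in
PostgreSQL's COPY text format: tab-separated fields, \N for NULL, and
backslash escapes for tabs, newlines and backslashes inside a field. A
minimal sketch of that output step follows; the function names and the
sample row are made up for illustration.

```python
import sys


def copy_escape(value):
    """Render one field in PostgreSQL COPY text format."""
    if value is None:
        return r"\N"
    text = str(value)
    # Escape the backslash first so the escapes added below
    # are not themselves doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def write_copy_rows(rows, out):
    """Write rows as tab-separated lines, one row per line."""
    for row in rows:
        out.write("\t".join(copy_escape(v) for v in row) + "\n")


if __name__ == "__main__":
    # Hypothetical row: timestamp, email, client version, missing value.
    write_copy_rows(
        [["2005-04-22 13:13:43", "foo,bar@example.com", 90, None]],
        sys.stdout,
    )
```

The resulting stream can be fed straight to COPY ... FROM STDIN, so
commas in emails need no special handling at load time.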

> I welcome any comments or questions.
> 
> -Nerf

-- 
Jim C. Nasby, Database Consultant           decibel at distributed.net
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

