[stats-dev] Attempting to speed movedata.sql for ogr-verification

Jim C. Nasby jim at nasby.net
Sat May 10 11:36:45 EDT 2003

In a nutshell, this is what I've been (less clearly) suggesting. I
believe what you've outlined is exactly what we need to be doing to
avoid that 4 hour query every night.

I think the only part of the process that might still be painful is
marking stubs as incomplete due to a new retire. Currently, I think that
will require hitting OGR_results to see what work was done by the email
being retired. But, the good news is that we only average about 5
retires a day, so as long as we can do this in such a way as to avoid a
table-scan, we'll be fine.
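To make that concrete, here's a rough sketch of the lookup I have in mind, using sqlite3 just for illustration (the table and column names here are my guesses, not necessarily the real schema):

```python
import sqlite3

# Sketch (hypothetical schema): find the stubs worked by the email being
# retired without scanning OGR_results, by keeping an index on stats_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ogr_results (stub TEXT, stats_id INTEGER, pass INTEGER);
    -- This index is what turns the per-retire lookup into an index probe
    -- instead of a full table scan.
    CREATE INDEX idx_ogr_results_stats_id ON ogr_results (stats_id);
""")
conn.executemany("INSERT INTO ogr_results VALUES (?, ?, ?)", [
    ("25/1-2-4", 101, 1),
    ("25/1-2-4", 102, 2),
    ("25/1-3-5", 101, 1),
])

# Stubs touched by the participant being retired; at ~5 retires a day an
# indexed probe like this keeps the nightly job cheap.
retired_id = 101
stubs = sorted(r[0] for r in conn.execute(
    "SELECT DISTINCT stub FROM ogr_results WHERE stats_id = ?",
    (retired_id,)))
print(stubs)
```

The point is just that the retire path only ever touches the handful of rows for that stats_id, never the whole table.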

I think the first step is probably to get ogr-verify to only grab the
retires for that day (make sure you don't use now(); use the day you
just loaded).
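Something along these lines, i.e. filter on the day you just loaded rather than on the clock (again, table/column names are my invention):

```python
import sqlite3

# Sketch (hypothetical schema): grab only the retires for the day just
# loaded, not everything up to now().
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retires (old_id INTEGER, new_id INTEGER, retire_date TEXT)")
conn.executemany("INSERT INTO retires VALUES (?, ?, ?)", [
    (101, 201, "2003-05-09"),
    (102, 202, "2003-05-10"),
    (103, 203, "2003-05-10"),
])

# The day just loaded, passed in explicitly -- NOT date('now'), which
# would drift whenever the load runs late or replays an old day.
loaded_day = "2003-05-10"
new_retires = conn.execute(
    "SELECT old_id, new_id FROM retires"
    " WHERE retire_date = ? ORDER BY old_id",
    (loaded_day,)).fetchall()
print(new_retires)
```

That way a late or re-run load still picks up exactly the retires belonging to the day it loaded.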

On Wed, May 07, 2003 at 12:10:07PM -0700, Benjamin Gavin wrote:
> OK,
>   I've been thinking about this for a bit longer, and after running some
> tests against an unloaded blower, these queries are insignificant compared
> to the total runtime of the process.  What we actually need to speed up is
> the verification/tracking/summary steps.  So, if you'll excuse my rampant
> musings, I'm going to propose a slightly different approach to the
> process.  The goals of this approach are the following:
> 1.  Track progress incrementally, to avoid rebuilding data tables whose
> contents are largely (90%+) unchanged from one day to the next.
> 2.  Reliably handle participant retires
> 3.  Reliably track "spammers", and allow their work to be effectively
> removed from the verification process. (not really addressed in this
> email)
> To accomplish the above goals, I believe that we need to do the following:
> 1.  We must accurately track retires, for the initial load (seeding the
> verification database) we can pull this data from the stats database.  For
> subsequent loads, we'll need to gather the list of "new retires" for that
> day.
> 2.  It is imperative that we minimize access to the ogr_summary table,
> preferably by incrementally updating the contents (and of the ogr_complete
> table, if possible).
> 3.  We must enforce that the results of pass1 and pass2 are in complete
> agreement.  If they do not agree, we must continue until we find at least
> two responses that agree (and at least one of those must come from an
> 8014+ client).
> 4.  We must enforce that pass1 and pass2 are completed by two distinct
> "effective stats_id" participants (i.e. including retires).
> Some of the above we are already doing, others we are doing but at great
> cost (no incremental updates).  In order to solve this I propose a stepped
> approach:
> 1.  Load the day's logdata
> 2.  Load the day's retires
> 3.  Update all current stats_ids with the latest retire data.  This
> requires that we track both the originating participant id and the
> "effective stats_id".
> 4.  For all stubs where the stats_ids match for pass1 && pass2, determine
> whether another valid stats_id can satisfy that pass, or whether it needs
> to be left incomplete.
> 5.  If the stub was marked as incomplete, update the ogr_completion table
> as necessary.
> 6.  Process the incoming logdata, updating ogr_summary pass1/2 with the
> new incoming data as appropriate, updating ogr_complete as necessary.  This
> would also be the place to verify that the results of pass1 && pass2
> agree.
> I'll leave it at that for now, thoughts/comments?
> Ben (TheJet)
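One tangent on step 3 above: resolving a participant to their "effective stats_id" is really just chasing retire links to a fixed point. A sketch (the function and dict names are mine, not anything in the code today):

```python
# Sketch of step 3: resolve each originating participant id to an
# "effective stats_id" by following the chain of retires.

def effective_id(stats_id, retires):
    """Follow old_id -> new_id retire links until we reach a live id."""
    seen = set()
    while stats_id in retires:
        if stats_id in seen:           # guard against a retire cycle
            raise ValueError("retire cycle at %d" % stats_id)
        seen.add(stats_id)
        stats_id = retires[stats_id]
    return stats_id

# retires maps an old participant id to the id it was retired into.
retires = {101: 201, 201: 301}

print(effective_id(101, retires))   # chained retire: 101 -> 201 -> 301
print(effective_id(999, retires))   # never retired: unchanged
```

Step 4's "distinct effective stats_id" check for a stub then reduces to effective_id(pass1_id) != effective_id(pass2_id).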

Jim C. Nasby (aka Decibel!)                    jim at nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
To unsubscribe, send 'unsubscribe stats-dev' to majordomo at lists.distributed.net
