[stats-dev] Attempting to speed movedata.sql for ogr-verification
Jim C. Nasby
jim at nasby.net
Sat May 10 11:36:45 EDT 2003
In a nutshell, this is what I've been (less clearly) suggesting. I
believe what you've outlined is exactly what we need to be doing to
avoid that 4 hour query every night.
I think the only part of the process that might still be painful is
marking stubs as incomplete due to a new retire. Currently, I think that
will require hitting OGR_results to see what work was done by the email
being retired. But the good news is that we only average about 5
retires a day, so as long as we can do this in such a way as to avoid a
table-scan, we'll be fine.
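
A minimal sketch of the "no table-scan" lookup, using Python's sqlite3 as a stand-in for the real database. The column names (stub_id, participant_id, result) and the index name are assumptions for illustration, not the actual OGR_results schema. The point is only that an index led by the participant id lets us find the stubs touched by a retired email without reading the whole table.

```python
import sqlite3

def stubs_touched_by(conn, participant_id):
    """Return the distinct stub ids that have results from this participant."""
    rows = conn.execute(
        "SELECT DISTINCT stub_id FROM ogr_results "
        "WHERE participant_id = ? ORDER BY stub_id",
        (participant_id,),
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema.
conn.execute(
    "CREATE TABLE ogr_results (stub_id TEXT, participant_id INTEGER, result INTEGER)"
)
# Index led by participant_id so the lookup is a search, not a full scan.
conn.execute(
    "CREATE INDEX ogr_results_participant ON ogr_results (participant_id, stub_id)"
)
conn.executemany(
    "INSERT INTO ogr_results VALUES (?, ?, ?)",
    [
        ("24/1-2-3", 42, 100),
        ("24/1-2-4", 42, 200),
        ("24/1-2-3", 7, 100),
        ("24/1-3-5", 7, 300),
    ],
)

touched = stubs_touched_by(conn, 42)

# Ask the database how it would run the lookup; the last column of each
# plan row is a human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT stub_id FROM ogr_results "
    "WHERE participant_id = ?",
    (42,),
).fetchall()
plan_details = [row[-1] for row in plan]
```

With roughly 5 retires a day, running one indexed lookup per retired id should stay cheap even if the results table is large.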
I think the first step is probably to get ogr-verify to only grab the
retires for that day (make sure you don't use now(); use the day you
just loaded).
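
Roughly what I mean by not using now(), as a sketch in Python with sqlite3 standing in for the real database. The retires table and its columns (id, retire_to, retire_date) are assumed names, not the real schema. The load date is passed in as a parameter, so re-running an old day's load picks up that day's retires rather than today's.

```python
import sqlite3

def retires_for_day(conn, load_date):
    """Return (id, retire_to) pairs retired on load_date (ISO 'YYYY-MM-DD').

    load_date is the day whose logs were just loaded, supplied by the
    caller. The query never consults the database clock.
    """
    return conn.execute(
        "SELECT id, retire_to FROM retires WHERE retire_date = ? ORDER BY id",
        (load_date,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema.
conn.execute(
    "CREATE TABLE retires (id INTEGER PRIMARY KEY, retire_to INTEGER, retire_date TEXT)"
)
conn.executemany(
    "INSERT INTO retires VALUES (?, ?, ?)",
    [
        (101, 5, "2003-05-08"),
        (102, 5, "2003-05-09"),
        (103, 9, "2003-05-09"),
    ],
)

new_retires = retires_for_day(conn, "2003-05-09")
```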
On Wed, May 07, 2003 at 12:10:07PM -0700, Benjamin Gavin wrote:
> OK,
> I've been thinking on this for a bit longer, and after running some
> tests against an unloaded blower, I found that these queries are
> insignificant compared to the total runtime of the process. What we
> actually need to speed up is the verification/tracking/summary steps. So,
> if you'll excuse my rampant musings, I'm going to propose a slightly
> different approach to the process. The goals of this approach are the
> following:
>
> 1. Track progress on an incremental basis to avoid rebuilding data tables
> whose contents are very (90%+) similar from day to day.
>
> 2. Reliably handle participant retires
>
> 3. Reliably track "spammers", and allow their work to be effectively
> removed from the verification process. (not really addressed in this
> email)
>
> To accomplish the above goals, I believe that we need to do the following:
>
> 1. We must accurately track retires. For the initial load (seeding the
> verification database) we can pull this data from the stats database. For
> subsequent loads, we'll need to gather the list of "new retires" for that
> day.
>
> 2. It is imperative that we minimize access to the ogr_summary table,
> preferably by incrementally updating the contents (and of the ogr_complete
> table, if possible).
>
> 3. We must enforce that the results of pass1 and pass2 are in complete
> agreement. If they do not agree, we must continue until we find at least
> two responses that agree (and at least one of those must come from an 8014+
> client).
>
> 4. We must enforce that pass1 and pass2 are completed by two distinct
> "effective stats_id" participants (i.e. including retires).
>
> Some of the above we are already doing, others we are doing but at great
> cost (no incremental updates). In order to solve this I propose a stepped
> approach:
>
> 1. Load the day's logdata
>
> 2. Load the day's retires
>
> 3. Update all current stats_ids with the latest retire data. This
> requires that we track both the originating participant id and the
> "effective stats_id".
>
> 4. For all stubs where the stats_ids match for pass1 && pass2, determine
> whether another valid stats_id can satisfy that pass, or if it needs to be
> left incomplete.
>
> 5. If the stub was marked as incomplete, update the ogr_completion table
> as necessary.
>
> 6. Process the incoming logdata, updating ogr_summary pass1/2 with the
> new incoming data as proper, updating ogr_complete as necessary. This
> would also be the place to verify that the results of pass1 && pass2
> agree.
>
> I'll leave it at that for now, thoughts/comments?
>
> Ben (TheJet)
>
>
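
The pass1/pass2 rules quoted above (two agreeing results, from two distinct "effective stats_id" participants, at least one from an 8014+ client) could be checked along these lines. This is only a sketch under assumptions: retires are modelled as a plain dict from retired id to the id it retired to, each result is a (participant_id, result, client_version) tuple, "agree" is taken to mean the result values are equal, and "8014+" is read as a client version number of 8014 or higher. All function and field names are hypothetical.

```python
def effective_id(participant_id, retire_to):
    """Follow retire links to the id that currently receives the credit."""
    seen = set()
    # The 'seen' set guards against an accidental retire loop.
    while participant_id in retire_to and participant_id not in seen:
        seen.add(participant_id)
        participant_id = retire_to[participant_id]
    return participant_id

def stub_is_verified(results, retire_to, min_version=8014):
    """Return True if some pair of results satisfies all three rules.

    results: list of (participant_id, result, client_version) tuples.
    """
    for i, (p1, r1, v1) in enumerate(results):
        for p2, r2, v2 in results[i + 1:]:
            if r1 != r2:
                continue  # the two passes must agree
            if effective_id(p1, retire_to) == effective_id(p2, retire_to):
                continue  # same participant once retires are applied
            if max(v1, v2) >= min_version:
                return True  # at least one result from a new enough client
    return False

# Participant 101 has retired into participant 5.
retire_to = {101: 5}

same_person = [(101, 777, 8010), (5, 777, 8014)]
with_third = same_person + [(9, 777, 8012)]
disagree = [(1, 777, 8014), (2, 778, 8014)]

checks = (
    stub_is_verified(same_person, retire_to),
    stub_is_verified(with_third, retire_to),
    stub_is_verified(disagree, retire_to),
)
```

In the first case the stub stays incomplete because both results collapse to the same effective id after the retire; adding a result from a third participant lets the stub complete.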
--
Jim C. Nasby (aka Decibel!) jim at nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"