[stats-dev] Attempting to speed movedata.sql for ogr-verification

Benjamin Gavin virtual_olympus at yahoo.com
Wed May 7 13:10:07 EDT 2003


OK,
  I've been thinking on this for a bit longer, and after running some
tests against an unloaded blower, these queries are insignificant compared
to the total runtime of the process.  What we actually need to speed up is
the verification/tracking/summary steps.  SO, if you'll excuse my rampant
musings, I'm going to propose a slightly different approach to the
process.  The goals of this approach are the following:

1.  Track progress on an incremental basis to avoid rebuilding data tables
where the data on a daily basis is very (90%+) similar.

2.  Reliably handle participant retires

3.  Reliably track "spammers", and allow their work to be effectively
removed from the verification process. (not really addressed in this
email)

To accomplish the above goals, I believe that we need to do the following:

1.  We must accurately track retires, for the initial load (seeding the
verification database) we can pull this data from the stats database.  For
subsequent loads, we'll need to gather the list of "new retires" for that
day.

2.  It is imperative that we minimize access to the ogr_summary table,
preferably by incrementally updating the contents (and of the ogr_complete
table, if possible).

3.  We must enforce that the results of pass1 and pass2 are in complete
agreement.  If they do not agree, we must continue until we find at least
two responses that agree (and at least one of those must come from a 8014+
client).

4.  We must enforce that pass1 and pass2 are completed by two distinct
"effective stats_id" participants (i.e. including retires).

Some of the above we are already doing, others we are doing but at great
cost (no incremental updates).  In order to solve this I propose a stepped
approach:

1.  Load the days logdata

2.  Load the days retires

3.  Update all current stats_ids with the latest retire data.  This
requires that we track both the originating participant id and the
"effective stats_id".

4.  For all stubs where the stats_ids match for pass1 && pass2 determine
whether another valid stats_id can satisfy that pass, or if it needs to be
left incomplete.

5.  If the stub was marked as incomplete, update the ogr_completion table
as necessary.

6.  Process the incoming logdata, updating ogr_summary pass1/2 with the
new incoming data as proper, updating ogr_complete as necessary.  This
would also be the place to verify that the results of pass1 && pass2
agree.

I'll leave it at that for now, thoughts/comments?

Ben (TheJet)


--- Chris Hodson <nerf at slacker.com> wrote:
> How about doing this processing during the hourly runs?  It would make
> the
> hourly runs a bit longer, but would speed up the daily run.  We could
> also
> do the normalization during this step.  Comments?
> 
> -----Original Message-----
> From: owner-stats-dev at lists.distributed.net
> [mailto:owner-stats-dev at lists.distributed.net] On Behalf Of Benjamin
> Gavin
> Sent: Friday, May 02, 2003 10:38 AM
> To: stats-dev at distributed.net
> Subject: [stats-dev] Attempting to speed movedata.sql for
> ogr-verification
> 
> 
> OK,
>   So I was looking through the queries that are part of the
> ogr-verification
> process and trying to determine if we could speed any of them up a bit. 
> As
> an example, I took the query which attempts to insert non-matching stubs
> into the primary ogr_stubs table (i.e. those which came in as returns,
> but
> don't currently exist in the ogr_stubs table).  I don't have any real
> good
> performance numbers between the two, but initial tests are showing that
> the
> new query is slightly faster than the old one:
> 
> Old Query:
> SELECT DISTINCT L.stub_marks, 0
> FROM logdata L
> WHERE NOT EXISTS (SELECT * FROM ogr_stubs WHERE
> 			L.stub_marks = ogr_stubs.stub_marks);
> 
> Explain Output:
>  Unique  (cost=3055.91..3059.66 rows=50 width=26)
>    ->Sort  (cost=3055.91..3057.16 rows=500 width=26)
>        Sort Key: stub_marks, 0
>          ->Seq Scan on logdata l  (cost=0.00..3033.50 rows=500 width=26)
>             Filter: (NOT (subplan))
>             SubPlan
>               ->Index Scan using stubs_marks on ogr_stubs
>                 (cost=0.00..3.01 rows=1 width=26)
>                   Index Cond: ($0 = stub_marks)
> 
> New Query:
> SELECT DISTINCT L.stub_marks, 0
> FROM logdata L LEFT OUTER JOIN ogr_stubs O ON L.stub_marks =
> O.stub_marks
> WHERE O.stub_marks IS NULL;
> 
> Explain Output:
>  Unique  (cost=3095.83..3103.33 rows=100 width=46)
>    ->Sort  (cost=3095.83..3098.33 rows=1000 width=46)
>       Sort Key: l.stub_marks, 0
>         ->Nested Loop  (cost=0.00..3046.00 rows=1000 width=46)
>            Filter: ("inner".stub_marks IS NULL)
>              ->Seq Scan on logdata l  (cost=0.00..20.00 rows=1000
> width=26)
>              ->Index Scan using stubs_marks on ogr_stubs o
>                (cost=0.00..3.01 rows=1 width=20)
>                   Index Cond: ("outer".stub_marks = o.stub_marks)
> 
> 
> You'll notice that the cost estimate for the second query is higher, but
> in
> my initial (admittedly minor) testing, the second tended to return as
> much
> as 20% faster.  I'm fairly certain that the database load was rather
> high at
> the time, and the numbers varied as much as 100% between runs, so we'll
> need
> to do a bit more testing to see if we can eek out a bit more performance
> from these.
> 
> Ben Gavin
> thejet at distributed.net
> 
> 
> =====
> /******************************/
> Benjamin Gavin
> Virtual.Olympus Software
> Virtual_Olympus at yahoo.com
> http://www.virtual-olympus.com/ /******************************/
> 
> 
> __________________________________
> Do you Yahoo!?
> The New Yahoo! Search - Faster. Easier. Bingo. http://search.yahoo.com
> 
> --
> To unsubscribe, send 'unsubscribe stats-dev' to
> majordomo at lists.distributed.net
> 
> 
> 
> --
> To unsubscribe, send 'unsubscribe stats-dev' to
> majordomo at lists.distributed.net
> 


=====
/******************************/
Benjamin Gavin
virtual.olympus software
ben at virtual-olympus.com
http://www.virtual-olympus.com/
/******************************/

__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com

--
To unsubscribe, send 'unsubscribe stats-dev' to majordomo at lists.distributed.net



More information about the stats-dev mailing list