[stats-dev] Incremental ogr-verify

stats-dev at lists.distributed.net stats-dev at lists.distributed.net
Sun Jun 15 16:16:27 EDT 2003


I was going to send this to just nerf for implimentation, but I figured
I might as well let everyone have a look at it.

Note: I'm lazy, so I've used SELECT ... INTO a lot. Only thing is, when
you do this in pgsql, I think it creates the table WITH OIDs. This means
we have an extra int column we don't need. It would be useful to know if
there was a speed difference between SELECT ... INTO and doing seperate
CREATE TABLE and INSERT INTO statements.

Here's what needs to happen:

1) Grab only new data from stats_participant:

I just added a created column to make this easier.

Change get_idlookup.sh to use this SQL:

SELECT email, id, ... AS stats_id
    FROM stats_participant
    WHERE retire_to = STATS_DATE
        OR (created >= STATS_DATE AND created < dateadd(dd, 1, STATS_DATE)

STATS_DATE is the date of the stats-day that we're processing. That
query will return all records where someone retired on that day, as well
as all new participants that were registered on that day.

2) Store the list of new retires

Two ways this can be done:
- add retire_date column to OGR_idlookup
- use a table to store the retires that happened for STATS_DATE

There's normally very few retires that happen each day, so the second
one makes more sense.

changes to create_id_lookup 1.15:

TRUNCATE TABLE retire_today;
(Table will need to be created as id, retire_to, but don't do
create/drop in the script)
INSERT INTO retire_today(id, retire_to)
    SELECT id, retire_to
        FROM import_id
        WHERE retire_to >= 1 AND retire_to != id
        -- I think we eventually want retire_to to store id when someone
        -- isn't retired
;

Don't screw with OGR_idlookup; remove the truncate and drop indexes.

INSERT INTO OGR_idlookup(email, id, stats_id)
    SELECT email, id, id
        FROM import_id i
        WHERE NOT EXISTS (SELECT 1 FROM OGR_lookup l WHERE l.id = i.id)
        -- Might be better in the future to export two lists; new
        -- participants and new retires. This protects us in the
        -- meantime
;

-- Update for retires
UPDATE OGR_idlookup
    SET stats_id = retire_to
    FROM import_id i
    WHERE OGR_idlookup.id = i.id
        AND i.retire_to >= 1
        AND i.retire_to != i.id
;

3) Update existing data in OGR_summary. I think this can happen either
before or after movedata.sql, but it makes sense to do it before.

Build a temporary table of all work by all newly retired
participants. You'll have to be careful about indexes, etc.; we
absolutely do not want to tablescan OGR_results.

SELECT DISTINCT stub_id, nodecount, id
    INTO TEMP retired_stub_id
    FROM OGR_results o, retire_today r
    WHERE o.id = r.id
;

Need to experiment to see what indexes we want on retired_stub_id.
Probably want one on stub_id, nodecount.

Build a list of new information.

SELECT stub_id, nodecount
        , (SELECT count(*)
                FROM retired_stub_id rsi
                WHERE rsi.stub_id = r.stub_id
                    AND rsi.nodecount = r.nodecount
            ) AS ids
        , max(p.version) AS max_version
    INTO TEMP retired_new_info
    FROM ogr_results r, platform p
        , (SELECT DISTINCT stub_id, nodecount FROM retired_stub_id) rs
    WHERE r.stub_id = rs.stub_id
        AND r.nodecount = rs.nodecount
        AND r.id NOT IN (SELECT id
                                FROM retired_stup_id rsi
                                WHERE rsi.stub_id = r.stub_id
                                    AND rsi.nodecount = r.nodecount
                            )
        AND p.platform_id = r.platform_id
;

UPDATE OGR_results
    SET participants = participants - n.ids
        , max_client = n.max_version
    FROM retired_new_info n
    WHERE r.stub_id = n.stub_id
;

4) Update OGR_summary with new work

Basically, build a temp table like what summarize does now, and use that
info to update OGR_summary:

SELECT stub_id, nodecount, count(DISTINCT l.stats_id) AS ids, max(p.version) AS version
    INTO TEMP day_summary
    FROM day_results dr, OGR_lookup l, platform p
    WHERE l.id = dr.id
        AND p.platform_id = dr.platform_id
    GROUP BY stub_id, nodecount
;

You'll need to load the max function from stats-sql/functions.sql.
UPDATE OGR_summary
    SET participants = participants + ids
        , max_client = max(max_client, version)
    FROM day_summary ds
    WHERE OGR_summary.stub_id = ds.stub_id
        AND OGR_summary.nodecount = ds.nodecount
;

INSERT INTO OGR_summary(stub_id, nodecount, participants, max_client)
    SELECT stub_id, nodecount, ids, version
    FROM day_summary ds
    WHERE NOT EXISTS (SELECT 1
                            FROM OGR_summary s
                            WHERE s.stub_id = ds.stub_id
                                AND s.nodecount = ds.nodecount
                        )
;
-- 
Jim C. Nasby (aka Decibel!)                    jim at nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


More information about the stats-dev mailing list