[stats-dev] Incremental ogr-verify
stats-dev at lists.distributed.net
stats-dev at lists.distributed.net
Sun Jun 15 16:16:27 EDT 2003
I was going to send this to just nerf for implimentation, but I figured
I might as well let everyone have a look at it.
Note: I'm lazy, so I've used SELECT ... INTO a lot. Only thing is, when
you do this in pgsql, I think it creates the table WITH OIDs. This means
we have an extra int column we don't need. It would be useful to know if
there was a speed difference between SELECT ... INTO and doing seperate
CREATE TABLE and INSERT INTO statements.
Here's what needs to happen:
1) Grab only new data from stats_participant:
I just added a created column to make this easier.
Change get_idlookup.sh to use this SQL:
SELECT email, id, ... AS stats_id
FROM stats_participant
WHERE retire_to = STATS_DATE
OR (created >= STATS_DATE AND created < dateadd(dd, 1, STATS_DATE)
STATS_DATE is the date of the stats-day that we're processing. That
query will return all records where someone retired on that day, as well
as all new participants that were registered on that day.
2) Store the list of new retires
Two ways this can be done:
- add retire_date column to OGR_idlookup
- use a table to store the retires that happened for STATS_DATE
There's normally very few retires that happen each day, so the second
one makes more sense.
changes to create_id_lookup 1.15:
TRUNCATE TABLE retire_today;
(Table will need to be created as id, retire_to, but don't do
create/drop in the script)
INSERT INTO retire_today(id, retire_to)
SELECT id, retire_to
FROM import_id
WHERE retire_to >= 1 AND retire_to != id
-- I think we eventually want retire_to to store id when someone
-- isn't retired
;
Don't screw with OGR_idlookup; remove the truncate and drop indexes.
INSERT INTO OGR_idlookup(email, id, stats_id)
SELECT email, id, id
FROM import_id i
WHERE NOT EXISTS (SELECT 1 FROM OGR_lookup l WHERE l.id = i.id)
-- Might be better in the future to export two lists; new
-- participants and new retires. This protects us in the
-- meantime
;
-- Update for retires
UPDATE OGR_idlookup
SET stats_id = retire_to
FROM import_id i
WHERE OGR_idlookup.id = i.id
AND i.retire_to >= 1
AND i.retire_to != i.id
;
3) Update existing data in OGR_summary. I think this can happen either
before or after movedata.sql, but it makes sense to do it before.
Build a temporary table of all work by all newly retired
participants. You'll have to be careful about indexes, etc.; we
absolutely do not want to tablescan OGR_results.
SELECT DISTINCT stub_id, nodecount, id
INTO TEMP retired_stub_id
FROM OGR_results o, retire_today r
WHERE o.id = r.id
;
Need to experiment to see what indexes we want on retired_stub_id.
Probably want one on stub_id, nodecount.
Build a list of new information.
SELECT stub_id, nodecount
, (SELECT count(*)
FROM retired_stub_id rsi
WHERE rsi.stub_id = r.stub_id
AND rsi.nodecount = r.nodecount
) AS ids
, max(p.version) AS max_version
INTO TEMP retired_new_info
FROM ogr_results r, platform p
, (SELECT DISTINCT stub_id, nodecount FROM retired_stub_id) rs
WHERE r.stub_id = rs.stub_id
AND r.nodecount = rs.nodecount
AND r.id NOT IN (SELECT id
FROM retired_stup_id rsi
WHERE rsi.stub_id = r.stub_id
AND rsi.nodecount = r.nodecount
)
AND p.platform_id = r.platform_id
;
UPDATE OGR_results
SET participants = participants - n.ids
, max_client = n.max_version
FROM retired_new_info n
WHERE r.stub_id = n.stub_id
;
4) Update OGR_summary with new work
Basically, build a temp table like what summarize does now, and use that
info to update OGR_summary:
SELECT stub_id, nodecount, count(DISTINCT l.stats_id) AS ids, max(p.version) AS version
INTO TEMP day_summary
FROM day_results dr, OGR_lookup l, platform p
WHERE l.id = dr.id
AND p.platform_id = dr.platform_id
GROUP BY stub_id, nodecount
;
You'll need to load the max function from stats-sql/functions.sql.
UPDATE OGR_summary
SET participants = participants + ids
, max_client = max(max_client, version)
FROM day_summary ds
WHERE OGR_summary.stub_id = ds.stub_id
AND OGR_summary.nodecount = ds.nodecount
;
INSERT INTO OGR_summary(stub_id, nodecount, participants, max_client)
SELECT stub_id, nodecount, ids, version
FROM day_summary ds
WHERE NOT EXISTS (SELECT 1
FROM OGR_summary s
WHERE s.stub_id = ds.stub_id
AND s.nodecount = ds.nodecount
)
;
--
Jim C. Nasby (aka Decibel!) jim at nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
More information about the stats-dev
mailing list