[RC5] Stats suggestion

bwilson at fers.com
Fri Jan 5 18:13:26 EST 2001


Current storage usage (averages)

33.2 bytes per participant per day per project they contributed to during
that day.  Today, this is about 96.5MB.
131.6 bytes per person per project without regard for days.  Today this is
about 10.5MB.
320 bytes per person regardless of project or days.  Today this is about
100.7MB.
20 bytes per person per team they have belonged to.  Today this is 6.2MB.

These numbers are based only on OGR, but should still apply when RC5-64 is
migrated into the new structure.

Adding fields to hold exponential-average would add about 10 bytes per
participant per project for each exponential average we wish to track.
This is about a 7% increase (131 -> 141 bytes per person per project),
which is reasonable.  10.5MB becomes 11.3MB.

Adding a field to store the chosen half-life would add about 4 bytes per
participant, increasing from 100.7MB to 101.9MB.  Not sure if we want to
enable this feature, but I'll keep calculating as if we are.

Updating one or more exponential-average fields would require one pass over
the Email_Rank table covering all participants, whether they had
contributed that day or not.  I don't see a good place to fold that into
an existing query, but it seems worth it in the grand scheme of things.  I
see no reason I couldn't calculate multiple exponential averages in a
single pass, even if we allow a participant-chosen half-life.  I might be
able to squeeze it in with the ranking code - which would be nice, because
it would then be simpler to rank by these averages as well.
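The single-pass update is simple enough to sketch.  Roughly, in Python
(the field names and fixed half-lives here are illustrative, not the real
schema):

```python
HALF_LIVES = [7.0, 30.0]                        # example half-lives, in days
DECAY = [0.5 ** (1.0 / h) for h in HALF_LIVES]  # per-day decay factor each

def daily_update(decayed_avgs, work_today):
    """Fold one day's work into each decayed average.

    decayed_avgs is parallel to HALF_LIVES; work_today is 0 for
    participants who contributed nothing, which is why the pass has
    to touch every row, contributor or not.
    """
    return [d * avg + (1.0 - d) * work_today
            for d, avg in zip(DECAY, decayed_avgs)]
```

The sanity check I'd use: an average left alone for exactly one half-life
should fall to half its value.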

If we did allow a participant-selected half-life, we'd probably build in a
mechanism to recalculate from the dawn of time whenever the half-life
changes.  Recalculating the average for a single participant is no big
deal with this model, even though it has to review all of that
participant's history.  It's about the same cost as the phistory pages.
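A recalc for one participant is just a replay of their daily totals.
Something like this sketch (assuming we can pull one row per day out of
the history):

```python
def recalc_from_history(daily_work, half_life):
    """Rebuild a participant's decayed average from scratch after a
    half-life change, by replaying every day since the dawn of time."""
    d = 0.5 ** (1.0 / half_life)
    avg = 0.0
    for work in daily_work:   # one entry per day, zeros included
        avg = d * avg + (1.0 - d) * work
    return avg
```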

Just to settle this once and for all, it's highly likely that I would go
back to the dawn of time and calculate the decaying averages correctly
from the beginning.  This is the kind of thing where I write a script that
will calculate from day X to day X+1 and run it a certain number of times
in a row each day (depending on how long it takes and what else is going
on).  When it catches up to today, then we'll show the fields on the stats
and keep them up to date from then on.
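The catch-up script would advance one day per invocation until it reaches
today.  A hypothetical sketch, with in-memory dicts standing in for the
real stats tables:

```python
import datetime

HALF_LIFE = 14.0                     # illustrative choice
DECAY = 0.5 ** (1.0 / HALF_LIFE)

state = {"date": datetime.date(2001, 1, 1), "avgs": {}}  # pid -> avg
history = {}                         # (pid, date) -> work done that day

def advance_one_day(state, history, participants):
    """Fold one day of history into every participant's average,
    then step the computed date forward by one day."""
    day = state["date"]
    for pid in participants:
        work = history.get((pid, day), 0.0)   # sparse: missing == 0
        prev = state["avgs"].get(pid, 0.0)
        state["avgs"][pid] = DECAY * prev + (1.0 - DECAY) * work
    state["date"] = day + datetime.timedelta(days=1)
```

Run it N times a day until state["date"] catches up to today, then the
same routine becomes the normal nightly update.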

If I do not, the most likely approach is to seed the average with each
participant's overall rate.  The intent would be to get it as close as is
practical to the true value from day 1.

Remember that part of distributed.net's goals in providing stats at all is
to encourage consistent participation which is not unhealthy to the
network.  "Bursty" contributions like megaflushes should serve little
purpose, though they will not be actively penalized.  I'll be looking for
half-life terms that would respond fairly quickly (a week or two) to new
computers being added, but which will not "flutter" too much with the
natural variances such as weekends (when many participants contribute less
or none).
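A toy simulation (numbers purely illustrative: 100 units on weekdays, 20
on weekends) shows the trade-off - a half-life around two weeks keeps the
weekend "flutter" small, where a two-day half-life swings wildly:

```python
def weekly_flutter(half_life, weeks=52):
    """Relative peak-to-trough swing of the decayed average over one
    steady-state week of a weekday/weekend work pattern."""
    d = 0.5 ** (1.0 / half_life)
    avg, trace = 0.0, []
    for day in range(weeks * 7):
        work = 100.0 if day % 7 < 5 else 20.0   # Mon-Fri vs Sat-Sun
        avg = d * avg + (1.0 - d) * work
        trace.append(avg)
    week = trace[-7:]                           # one settled week
    return (max(week) - min(week)) / (sum(week) / 7)
```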

I could even conceive of a model where someday we reward people for the
stability of their contribution (variance from their own average?) as much
as for the quantity of work done.  How does *that* strike you?  }:8)
__
Bruce Wilson, Manager, FERS Business Services
bwilson at fers.com, 312.245.1750, http://www.fers.com/
PGP KeyID: 5430B995, http://www.lasthome.net/~bwilson/
"A good programmer is someone who looks
both ways before crossing a one-way street."




Ben Clifford <benc at hawaga.org.uk>
Sent by: owner-rc5 at lists.distributed.net
2001-01-05 15:37
Please respond to rc5


        To:     rc5 at lists.distributed.net
        cc:
        Subject:        Re: [RC5] Stats suggestion



> Now, what would be *really* nice is if people could choose their own
> halflife (or halflives) for their own records (maybe two counters of that)
> and then a global ranked counter with a fixed halflife (perhaps a 15-day
> or 30-day). Might not be too hard, seeing as how that only requires an
> extra number or two in the user data for each own-record halflife.

The value could be stored in decay-constant form - this would save
computation during the stats run, making it only a little more
compute-intensive than with a fixed constant.
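To illustrate (a hypothetical sketch): store the per-day decay factor at
the moment the participant picks a half-life, so the nightly run only
multiplies and never calls pow():

```python
import math

def decay_factor(half_life_days):
    """Computed once, when the half-life is chosen, then stored."""
    return math.exp(-math.log(2.0) / half_life_days)   # == 0.5 ** (1/h)

# nightly run, per participant:  avg = d * avg + (1 - d) * work_today
```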

Can someone in the know comment on how much data (in bytes) is stored per
participant and per participant-day at the moment?



--
To unsubscribe, send 'unsubscribe rc5' to majordomo at lists.distributed.net
rc5-digest subscribers replace rc5 with rc5-digest




