[stats-dev] PHP output caching and "dead" projects

Benjamin Gavin virtual_olympus at yahoo.com
Fri Nov 21 12:44:43 EST 2003

  I've been looking at ways to improve the performance of the stats
system.  I've identified a couple of problem areas which are currently
affecting the speed of the stats system:

1.  Page content is generated dynamically, at every request:  This is
perhaps the largest problem.  Our stats database updates only once every
night, and performing potentially intensive database queries to service
request after request is not helping either the database server or the

2.  "Dead" projects continue to take as many resources as (or more than)
currently active projects:  From the database side of the fence, inactive
projects cost us a large chunk, especially if those older projects stay
"popular" for a long period of time.  This is largely due to database
cache thrashing and the like, but also due to the sheer volume of data
that needs to be kept around long term.  If we can't get rid of the data
completely, we can certainly try to limit the number of times it is

So, in the interest of finding a solution to this, and since nobody seems
to agree with me that just eliminating the data from the database
completely is a valid option...  I have arrived at a scheme for caching
the output of the various system pages which could be utilized firstly for
the "dead" projects (to alleviate #2), and potentially for the "live"
projects as well (to alleviate #1).

The scheme that I have arrived at looks as follows:

1.  Maintain a list of "dead" projects, or include a field in the database
for "closed" status. (this may exist already, but the documentation on the
DB schema is sparse)

2.  If the requested project is in the list of "dead" projects (or for all
projects long term), then check to see if the cache file already exists
for the current page request.  If so, serve it up from the cache,
otherwise regenerate the page and place the result in the cache.
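
As a rough illustration of step 2, here is a minimal sketch of the
"serve from cache, otherwise regenerate" check.  It is not the code from
project.inc or footer.inc; the function name cache_fetch_or_build() and
the $render callback are made up for the example, and it is written for
current PHP (7+) rather than the PHP in use on the stats box.

```php
<?php
// Hypothetical helper (not from the stats codebase): return the cached
// page if one exists, otherwise build it with $render and store it.
function cache_fetch_or_build(string $cacheFile, callable $render): string
{
    if (is_file($cacheFile)) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;                 // cache hit, no database work
        }
    }

    ob_start();                             // capture everything the page echoes
    try {
        $render();
    } catch (Throwable $e) {
        ob_end_clean();                     // discard the partial page
        throw $e;
    }
    $html = (string) ob_get_clean();

    $dir = dirname($cacheFile);
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        return $html;                       // cannot cache, still serve the page
    }

    // Write to a temp file and rename it, so a concurrent reader never
    // picks up a half-written cache file.
    $tmp = $cacheFile . '.' . getmypid() . '.tmp';
    if (file_put_contents($tmp, $html) !== false) {
        rename($tmp, $cacheFile);
    }
    return $html;
}
```

The caller would echo the returned string.  For a "dead" project the
database is then only touched on the first request for each distinct page.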

In my preliminary testing (on my local box), which is certainly less beefy
than blower, qualitative response times seem to have improved by 40-50%
(sometimes 100-200% for team/participant list pages).

The caching structure is as follows:

Directory: /cache/[project_id]/[page name]

File: SHA1(normalized query string).html

The nice thing about using the normalized query string is that it
automatically handles things like password-protected team member pages.
Unless the person knows the correct team password, they will not be able
to retrieve the cached page with the team member information.  The cache
directories could be placed in a location which is not accessible through
the web root as well to avoid people "lucky guessing" the filenames.
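
To make the layout above concrete, here is a small sketch of how the
cache file name could be derived.  The normalization rule is an
assumption on my part (sort the parameters by name and re-encode them);
the function names, the integer project id, and stripping a ".php"
suffix from the page name are likewise only illustrative.

```php
<?php
// Assumed normalization: sort parameters by name so that
// "?a=1&b=2" and "?b=2&a=1" map to the same cache file.
function normalize_query(array $params): string
{
    ksort($params, SORT_STRING);
    return http_build_query($params);
}

// Build /cache/[project_id]/[page name]/SHA1(normalized query string).html
// under a caller-supplied cache root.
function cache_path(string $cacheRoot, int $projectId, string $pageName, array $params): string
{
    // basename() keeps a crafted page name from escaping the cache root.
    $page = basename($pageName, '.php');
    return sprintf(
        '%s/%d/%s/%s.html',
        rtrim($cacheRoot, '/'),
        $projectId,
        $page,
        sha1(normalize_query($params))
    );
}
```

Because every request parameter feeds into the hash, a request carrying a
different team password resolves to a different file name and so never
sees the page cached for the correct password.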

That leaves two remaining pieces:

1.  An Exception List: Those pages which should never be cached, e.g.
participant editing, team joining, etc.

2.  Stats Proc Changes: If we implement caching for all projects, then we
would need to add a final step to the stats proc routines which clears out
the web caches when the stats run is complete.
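
Both pieces are small.  A sketch under stated assumptions: the page names
in the exception list are placeholders (I have not listed the real ones),
and clear_cache() is a hypothetical helper that the stats proc could call,
via a small script, as its final step.

```php
<?php
// Placeholder exception list; the real page names would go here.
const NEVER_CACHE = ['participant_edit', 'team_join'];

function is_cacheable(string $pageName): bool
{
    return !in_array($pageName, NEVER_CACHE, true);
}

// Delete every cached file and subdirectory under $cacheRoot, leaving the
// root itself in place.  Returns the number of files removed.
function clear_cache(string $cacheRoot): int
{
    if (!is_dir($cacheRoot)) {
        return 0;
    }
    $removed = 0;
    $entries = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($cacheRoot, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::CHILD_FIRST   // files before their directory
    );
    foreach ($entries as $entry) {
        if ($entry->isDir()) {
            rmdir($entry->getPathname());
        } elseif (unlink($entry->getPathname())) {
            $removed++;
        }
    }
    return $removed;
}
```

If only the "dead" projects are cached, the clearing step is unnecessary,
since their pages never change after the project closes.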

Just FYI, adding caching was about 20 lines of code in project.inc and 10
in footer.inc.  A better implementation would be to split the caching
logic into its own include and link it to the files which could
reasonably be cached.  It would also be good to include the notion of a
"page error" which would cause the page not to be cached due to a database
error, improper authentication, etc.
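
One possible shape for the "page error" idea, as a sketch only: the class
and method names are invented for illustration, and it assumes the cache
directory already exists.  The page calls begin() at the top, markError()
wherever something goes wrong, and finish() at the bottom.

```php
<?php
// Hypothetical wrapper around output buffering with a "page error" flag.
final class PageCache
{
    private static bool $pageError = false;

    // Start capturing output for a new page.
    public static function begin(): void
    {
        self::$pageError = false;
        ob_start();
    }

    // Called on a database error, failed authentication, and so on.
    public static function markError(): void
    {
        self::$pageError = true;
    }

    // Stop capturing and return the page; write it to the cache only
    // when no error was flagged while it was being generated.
    public static function finish(string $cacheFile): string
    {
        $html = (string) ob_get_clean();
        if (!self::$pageError && is_dir(dirname($cacheFile))) {
            file_put_contents($cacheFile, $html, LOCK_EX);
        }
        return $html;
    }
}
```

The visitor still gets the error page; it just never lands in the cache,
so the next request tries the database again.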

So... thoughts, comments, etc?

Ben [TheJet]
