[stats-dev] PHP output caching and "dead" projects

Chris Jones fiddles at distributed.net
Fri Nov 21 18:11:29 EST 2003

On 21-Nov-2003, Benjamin Gavin wrote:
> OK,
>   I've been looking at ways to improve the performance of the stats
> system.  I've identified a couple of problem areas which are currently
> affecting the speed of the stats system:
> 1.  Page content is generated dynamically, at every request:  This is
> perhaps the largest problem.  Our stats database updates only once every
> night, and performing potentially intensive database queries to service
> request after request is not helping either the database server or the
> users.
> 2.  "Dead" projects continue to take as many (or more) resources than
> currently active projects:  From the database side of the fence, inactive
> projects cost us a large chunk, especially if those older projects stay
> "popular" for a long period of time.  This is largely due to database
> cache thrashing and the like, but also due to the sheer volume of data
> that needs to be kept around long term.  If we can't get rid of the data
> completely, we can certainly try to limit the number of times it is
> queried.

Simply keep the "dead" projects purely in cached form. I can't see the data
being used much anymore, especially CSC.

> So, in the interest of finding a solution to this, and since nobody seems
> to agree with me that just eliminating the data from the database
> completely is a valid option...  I have arrived at a scheme for caching
> the output of the various system pages which could be utilized firstly for
> the "dead" projects (to alleviate #2), and potentially for the "live"
> projects as well (to alleviate #1).
> The scheme that I have arrived at looks as follows:
> 1.  Maintain a list of "dead" projects, or include a field in the database
> for "closed" status. (this may exist already, but the documentation on the
> DB schema is sparse)
> 2.  If the requested project is in the list of "dead" projects (or for all
> projects long term), then check to see if the cache file already exists
> for the current page request.  If so, serve it up from the cache,
> otherwise regenerate the page and place the result in the cache.
> In my preliminary testing (on my local box), which is certainly less beefy
> than blower, qualitative response times seem to have improved by 40-50%
> (sometimes 100-200% for team/participant list pages).
> The caching structure is as follows:
> Directory: /cache/[project_id]/[page name]
> File: SHA1(normalized query string).html
> The nice thing about using the normalized query string is that it
> automatically handles things like password protected team member pages. 
> Unless the person knows the correct team password, they will not be able
> to retrieve the cached page with the team member information.  The cache
> directories could be placed in a location which is not accessible through
> the web root as well to avoid people "lucky guessing" the filenames.
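
A rough sketch of the file naming described above, written in Python rather
than the PHP the stats pages use, purely to keep it short. The function names,
the /cache root, and the "sort the parameters" normalization are assumptions
made for illustration; the original message does not say how the query string
is normalized.

```python
import hashlib
from pathlib import Path
from urllib.parse import parse_qsl, urlencode

# Assumed cache root; per the message it could live outside the web root.
CACHE_ROOT = Path("/cache")


def normalize_query(query_string: str) -> str:
    """Sort the parameters so equivalent requests share one cache key.

    Sorting is an assumed normalization, not necessarily the one Ben used.
    """
    pairs = parse_qsl(query_string, keep_blank_values=True)
    return urlencode(sorted(pairs))


def cache_path(project_id: int, page_name: str, query_string: str) -> Path:
    """Build /cache/[project_id]/[page name]/SHA1(normalized query).html."""
    normalized = normalize_query(query_string)
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return CACHE_ROOT / str(project_id) / page_name / f"{digest}.html"
```

Because the password is part of the query string, a request with a different
(or missing) password hashes to a different file name, which is the property
the message relies on for protected team pages.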

> That leaves two remaining pieces:
> 1.  An Exception List: Those pages which should never be cached, e.g.
> participant editing, team joining, etc
> 2.  Stats Proc Changes: If we implement caching for all projects, then we
> would need to add a final step to the stats proc routines which clears out
> the web caches when the stats run is complete.
> Just FYI, adding caching was about 20 lines of code in project.inc and 10
> in footer.inc.  A better implementation would be to split the caching
> logic into its own include and link it to the files which could
> reasonably be cached.  It would also be good to include the notion of a
> "page error" which would cause the page not to be cached due to a database
> error, improper authentication, etc.
> So... thoughts, comments, etc?
> Ben [TheJet]
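
The remaining pieces (serve from cache or regenerate, the exception list, the
"page error" rule, and clearing the cache after a stats run) could fit together
roughly as below. This is a Python sketch under stated assumptions, not the
project.inc/footer.inc code: the page names in NEVER_CACHE are made up, and
render() is a stand-in for whatever builds the page.

```python
import shutil
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical names for pages that must never be cached (editing, joining).
NEVER_CACHE = {"participant_edit", "team_join"}


def serve(cache_file: Path, page_name: str,
          render: Callable[[], Tuple[str, bool]]) -> str:
    """Return the page body, using the cache where that is allowed.

    render() is a placeholder for the real page code. It returns (html, ok),
    where ok is False on a "page error" (database error, bad password, ...).
    """
    if page_name in NEVER_CACHE:
        html, _ = render()
        return html
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html, ok = render()
    if ok:  # pages that ended in an error are served but never stored
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        tmp = cache_file.with_suffix(".tmp")
        tmp.write_text(html, encoding="utf-8")
        tmp.replace(cache_file)  # rename so readers never see a partial file
    return html


def clear_project_cache(cache_root: Path, project_id: int) -> None:
    """Final step of a stats run: drop every cached page for the project."""
    shutil.rmtree(cache_root / str(project_id), ignore_errors=True)
```

For "dead" projects clear_project_cache() would never need to run, so each
page hits the database at most once; for live projects it would run once per
nightly update.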

Chris Jones
fiddles at distributed.net
