Drupal scalability

The company I work for currently runs numerous Drupal websites, which are now beginning to experience performance issues. The hardware in the production environment "seems" adequate, but nobody really knows what the problem is or how to address it. The straw that broke the camel's back, as far as acceptable performance goes, seems to be the last website we put into production, which shares this same server infrastructure. The typical response time of a Drupal page, measured in hundreds of milliseconds instead of tens, has also been a long-standing issue that everyone seems to pretend isn't an issue.

Current environment

In our current production environment we have one database server, with a replicated backup on standby, and two load-balanced application servers. The applications that run on this platform require shared file storage, so the database server also acts as an NFS server (NFSv4) for this purpose. The secondary database server periodically synchronizes the NFS files and acts as a standby NFS server (although with some potential for data loss). This affords some resiliency to a single node failure, whether of an application or database server. The network is 100Mbps Fast Ethernet.
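A minimal sketch of how that periodic synchronization might look (the hostname, path, and interval are hypothetical, ours differ):

    # Cron entry on the secondary server: pull the NFS export from the
    # primary every 15 minutes. Anything written between runs is the
    # window for data loss mentioned above.
    */15 * * * * rsync -a --delete db1:/export/sites/ /export/sites/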

Some testing of NFSv4 before deployment showed that high throughput was attainable, up to 94% of network bandwidth if memory serves correctly (a rough reconstruction of that test is sketched after the list below). Other than this, no load testing has been done in the production environment and little workload data is available, so only generalizations can be made about performance bottlenecks and solutions. Our workload is composed of numerous individual applications (websites), each of which requires an identical infrastructure. As I see it, four categories of options exist:

  • Scale up - buy faster hardware resources.
  • Scale out - buy more hardware resources.
  • Reconfigure what we have to better utilize resources.
  • Accept the limitations, duplicate the environment, and split up the workload.
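For reference, the pre-deployment NFS throughput test amounted to something like the following (mount point and sizes are illustrative, not the originals):

    # Write a 1GB file over the NFS mount and time it; on 100Mbps Ethernet,
    # roughly 11-12 MB/s corresponds to ~94% of theoretical bandwidth
    time dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=1024
    # Drop the local page cache (or remount) before reading back, so the
    # test measures the network rather than local memory
    time dd if=/mnt/nfs/testfile of=/dev/null bs=1M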

Analysis

Since we have load-balanced app servers, application scale-out can easily be accomplished by commissioning more app servers; no architecture changes would be required. The CPU workload of the app servers is relatively small and their disk workload is nearly zero, as all data is transferred over the network. This suggests that the network and/or the database/NFS server are the bottlenecks. On the database server, CPU workload is fairly low, as are disk reads, which suggests we have utilized caching effectively at the database and NFS layers. Disk writes and the network are the remaining suspects.

Memory doesn't seem to be an issue in any case. The database and application servers have 16GB and 12GB respectively, and on each about 75% is being used for disk cache.
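These observations are impressionistic, so before committing to any of the options below it would be worth confirming them with the standard tools, along these lines:

    # CPU, run queue and swap activity, sampled every 5 seconds
    vmstat 5
    # Per-disk utilization, reads vs. writes (from the sysstat package)
    iostat -x 5
    # Per-interface network throughput
    sar -n DEV 5
    # How much RAM is actually going to the page cache
    free -m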

Options

I see numerous options, listed below roughly in order of increasing complexity.

  1. Upgrade the database/NFS interconnect to Gigabit Ethernet. The performance benefit is unknown, but I speculate it would yield two to three times the throughput for database and NFS traffic. I've been unable to find any data on latency differences, which I expect contribute greatly to app-server-to-database performance. A cheap $70 gigabit switch would allow us to test (see the sketch below) before purchasing a proper managed switch.
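    Before and after swapping in the test switch, a quick comparison could be made with iperf and ping (the hostname db1 is hypothetical):

      # On the database server:
      iperf -s
      # On an app server: raw TCP throughput to the database server
      iperf -c db1
      # A 100-packet sample of round-trip latency
      ping -c 100 db1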

  2. Review the RAID configuration of the database servers and reconfigure. I suspect RAID 5 is in use and recommend changing to RAID 10. No discussion. "For 15 years a lot of the world's best database experts have been arguing back and forth with people (vendors and others) about the pros and cons of RAID-3, -4 and -5." There will likely be some immediate performance benefit with the current workload, but this change should allow scaling up to a much larger workload with no other changes. As for implementation, this is obviously a big, invasive change that requires rebuilding the server. If it is acceptable to run without a standby database while the primary is rebuilt, downtime should only be the few minutes needed to switch over to the secondary database server. If not, we'll need to commission a third database server to act as the backup during this time. If the hardware doesn't support RAID 10, configure as many mirror pairs as the disks allow and stripe across them using Linux software RAID or LVM, as sketched below.
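    If we land in the software-RAID case, a minimal sketch with four spare disks (device names are hypothetical) might be:

      # Build two RAID 1 mirror pairs with mdadm
      mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
      mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1
      # Stripe across the mirrors with LVM (-i 2 = two stripes), i.e. RAID 10
      pvcreate /dev/md0 /dev/md1
      vgcreate vg_db /dev/md0 /dev/md1
      lvcreate -n lv_db -i 2 -I 64 -l 100%FREE vg_db
      mkfs -t ext3 /dev/vg_db/lv_db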

  3. Consider alternatives for shared file storage: either additional hardware to host NFS separately from the database, or alternatives to NFS. Even a separate set of disks in the database server, dedicated to NFS, may provide sufficient separation. Any of these options removes load from the database disks. Alternatives range from something as simple as a dedicated NFS server appliance to something as complex as a cluster of application servers connected to a high-speed SAN using a clustered filesystem.
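    The cheapest variant, a dedicated set of disks in the database server, could look roughly like this (device name, export path and subnet are hypothetical):

      # Put the shared files on their own spindles, away from the database
      mkfs -t ext3 /dev/md2
      mkdir -p /export/sites
      mount /dev/md2 /export/sites
      # Export to the app server subnet only, then reload the export table
      echo '/export/sites 10.0.0.0/24(rw,sync,no_subtree_check)' >> /etc/exports
      exportfs -ra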

  4. Investigate using Squid as a reverse proxy, run on each app server. The Boosting Apache Performance by using Reverse Proxies article from Linux Gazette has a nice diagram of how a reverse proxy is utilized. Each app server has spare disk and memory that could be used for caching. This would offload much of the mundane workload and static data transfer from Apache and the shared file storage. Depending on the amount of caching, the NFS load on the database server may be significantly reduced. This may also allow continued use of the simple NFS protocol.
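    As a minimal sketch, assuming Apache is moved to port 8080 on the same machine, squid.conf on each app server would contain roughly the following (Squid 2.6-style directives; these need checking against whatever version we deploy):

      # Listen on port 80 in accelerator mode, hand cache misses to Apache
      http_port 80 accel vhost
      cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=apache
      # Only pass requests for our own sites through to the origin
      acl our_sites dstdomain .example.com
      http_access allow our_sites
      cache_peer_access apache allow our_sites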

    As a side effect, the response time of Drupal pages will likely decrease dramatically (as long as cache-control headers are set appropriately; see the sketch below). Most of the response time for a full page, as a user perceives it, comes from loading the various images, stylesheets and scripts, not the HTML itself. Shaving even 50% off the load time of each of these resources really adds up, and Squid should be able to do better than 50%. See Squid on top of Apache on same machine from Groups.Drupal.org.
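    Getting those cache-control headers right is mostly a matter of enabling mod_expires for static resources in the Apache configuration; the lifetimes below are guesses to be tuned, not recommendations:

      # Requires mod_expires: let Squid and browsers cache static assets,
      # while leaving the generated HTML alone
      <IfModule mod_expires.c>
          ExpiresActive On
          ExpiresByType image/gif "access plus 1 week"
          ExpiresByType image/png "access plus 1 week"
          ExpiresByType text/css "access plus 1 day"
          ExpiresByType application/x-javascript "access plus 1 day"
      </IfModule>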

  5. Investigate other methods of caching in the application, such as memcached for Drupal (a minimal sketch follows).
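    As a sketch, assuming the contrib memcache module is installed (the $conf keys below are the ones that module expects, but verify against its README):

      # Start a memcached instance on each app server (512MB, default port)
      memcached -d -m 512 -p 11211 -u nobody

    And in each site's settings.php:

      $conf['cache_inc'] = './sites/all/modules/memcache/memcache.inc';
      $conf['memcache_servers'] = array('localhost:11211' => 'default');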

  6. Scale out the database in a way that is transparent to the application, using something like Continuent's Tungsten.

  7. Modify the application to work with asynchronously replicated databases.

That's all for now...

UPDATE: Read Part 2 of Drupal scalability.