Drupal scalability, Part 3

We were finally able to upgrade the database server disks last September, before the frenzy of PCI compliance consumed everyone's spare cycles. The old configuration was a three-disk RAID 5 array; the new one is a six-disk RAID 10 array of 15k RPM disks. I finally got to write a TPS report:

OLD: 90-100 TPS
NEW: 400-415 TPS sustained, ~800 TPS peak
===

So we're at about 400 writes per second, up from 90. To make fair comparisons I stuck with the same simple ApacheBench tests. Keeping concurrency low enough to get reasonable response times, we now get 8 requests per second compared to 5. Yeah, that's a whopping 8 home pages we can serve per second. Not that great, but response time improved a little. It took a couple of days to shuffle data around and rebuild the RAID array, and when the system came back up I got to see what the new array could do while the database replicated everything that had happened in the meantime. It was nice to see 400 writes per second, but the end results were discouraging. I only did a little more informal testing, and if I remember correctly it looked like the network might be the next bottleneck. We weren't getting a new switch anytime soon, so I gave up. Then came the PCI Data Security Standard (PCI DSS).
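If you want to watch a write rate like that yourself, here's a rough sketch of one way to do it on Linux: diff the "writes completed" counter in /proc/diskstats. The device name and sampling interval below are placeholders, so point them at your own array; iostat from the sysstat package reports the same numbers (its tps and w/s columns) without any scripting.

    #!/usr/bin/env python
    # Rough write-TPS sampler: diffs the "writes completed" counter in
    # /proc/diskstats. DEVICE and INTERVAL are assumptions -- point DEVICE
    # at whatever block device actually backs the database files.
    import time

    DEVICE = "sda"
    INTERVAL = 5.0  # seconds between samples

    def completed_writes(device):
        """Return the cumulative completed-write count for a block device."""
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    return int(fields[7])  # 8th field: writes completed
        raise ValueError("%s not found in /proc/diskstats" % device)

    prev = completed_writes(DEVICE)
    while True:
        time.sleep(INTERVAL)
        cur = completed_writes(DEVICE)
        print("%s: %.0f writes/sec" % (DEVICE, (cur - prev) / INTERVAL))
        prev = cur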

Faster network

In the process of complying with PCI requirements the network topology was completely changed. These servers don't have a separate management network, so the old topology was a pain to manage: everything was NATed and hidden in a separate subnet behind an F5 Big-IP load balancer, which made things like mail, central syslogging, and backups a challenge. The new topology instead puts the web servers and the load balancer in the same DMZ subnet, with the load balancer configured to "front bounce" requests. The new DMZ subnet is bigger than the old one, so we ended up on a new switch: a Gigabit Ethernet switch!

Results

We hadn't actually intended to move to a different switch, but when I realized it had happened I fired up ApacheBench again. Now instead of ~8 requests per second we get ~20. Not only that, response time went way down: 400ms at a concurrency level of 6, down from 1800ms. The old setup took almost 4 seconds to serve a page with 20 concurrent requests; now it's about 1 second. Looking at the threshold of what humans will tolerate, we might be able to get away with 4-6 seconds if we push it, which is what you get at 100 concurrent requests. The old configuration would have been crippled and unusably slow under that load.

Concurrency level   Old response time   New response time
1                   1200ms              200ms
4                   1500ms              300ms
6                   1800ms              400ms
20                  3900ms              1100ms
40                  -                   2300ms
100                 -                   6200ms
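A sweep like the one in the table above is easy to script, by the way. Here's a rough sketch that runs ApacheBench (ab) at each concurrency level and pulls the mean figures out of its output; the URL and the request count are placeholders rather than what I actually tested with.

    #!/usr/bin/env python
    # Run ApacheBench at several concurrency levels and summarize the results.
    # The URL and request count below are placeholders.
    import re
    import subprocess

    URL = "http://www.example.com/"
    LEVELS = [1, 4, 6, 20, 40, 100]
    REQUESTS = 500

    for c in LEVELS:
        p = subprocess.Popen(["ab", "-n", str(REQUESTS), "-c", str(c), URL],
                             stdout=subprocess.PIPE)
        out = p.communicate()[0].decode()
        # ab prints lines like:
        #   Requests per second:    20.15 [#/sec] (mean)
        #   Time per request:       397.52 [ms] (mean)
        rps = re.search(r"Requests per second:\s+([\d.]+)", out).group(1)
        ms = re.search(r"Time per request:\s+([\d.]+) \[ms\] \(mean\)", out).group(1)
        print("concurrency %3d: %s req/sec, %s ms mean response time" % (c, rps, ms))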

Next steps

After a recent discussion about perceived performance, I realize we need some hard numbers. The next step is to add some good monitoring so we can characterize normal load, at both the system and database level, plus response time from the web servers for the various sites (a rough sketch of a simple response-time probe follows the list below). Then what can we do?

  • How much is serving NFS from the same box impacting the database?
  • How is performance affected by writing replication logs to the same filesystem as the database files?
  • Consider a purpose-built database server, perhaps with a Solid State Disk (SSD) or two.
  • Consider using a different platform for the database and/or NFS. Solaris ZFS looks promising, especially combined with SSDs. Numbers in the neighborhood of 3000 IOPS have been quoted, and it would take a lot of fast 15k disks to match that. Like 36 (the rough arithmetic is sketched after this list).
  • Consider using Varnish as a reverse proxy. Why not Squid? Because Varnish was designed from the start as a reverse proxy, while Squid is first and foremost a general-purpose forward caching proxy.
  • Switch to Pressflow, a modified Drupal distribution (API-compatible with the same major Drupal version).
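About the "36 disks" figure above: that's back-of-the-envelope, not a vendor number. Here's one way to land in that neighborhood; the per-spindle IOPS and the RAID 10 write penalty are ballpark assumptions, not measurements.

    # Back-of-the-envelope: how many 15k spindles does ~3000 write IOPS take?
    # Assumptions: roughly 170 random IOPS per 15k RPM disk, and RAID 10
    # sending every write to both halves of a mirror (a 2x write penalty).
    TARGET_IOPS = 3000
    IOPS_PER_DISK = 170.0
    RAID10_WRITE_PENALTY = 2

    disks = TARGET_IOPS / (IOPS_PER_DISK / RAID10_WRITE_PENALTY)
    print(disks)  # ~35.3, so call it 36 spindles

As for the response-time monitoring mentioned before the list, it doesn't have to be fancy to be useful. Something like this sketch, run from cron, would be a start; the URLs are made up, and the output could be redirected to a log file or fed into whatever graphing tool is handy.

    #!/usr/bin/env python
    # Dumb response-time probe, meant to be run from cron.
    # The URLs are placeholders; one timestamped line per site goes to stdout.
    import time
    import urllib2  # urllib.request on Python 3

    URLS = [
        "http://www.example.com/",
        "http://intranet.example.com/",
    ]

    for url in URLS:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        start = time.time()
        try:
            urllib2.urlopen(url).read()
            print("%s %s %.0f ms" % (stamp, url, (time.time() - start) * 1000))
        except Exception as e:
            print("%s %s ERROR %s" % (stamp, url, e))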