Measuring the performance improvement of Mercurial (NFS vs local disk)

There were several reasons for moving Mercurial off of NFS and onto local disk:

  1. The Mercurial developers were concerned about race conditions and concurrent writes and reads causing inconsistency between hosts. This became evident when stale file handles started appearing in our Apache logs.
  2. An extension we wrote (pushlog) was also being served off of NFS. This is a problem not because we have multiple hosts writing at once, but because the file is kept in memory for the lifetime of the hgweb-serving WSGI process, and we've seen requests to the pushlog occasionally be served stale information (see the sketch after this list).
  3. During times of peak activity there was non-trivial IOWait, which caused clone times to increase.
  4. NetApp licenses aren’t cheap. 😉
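
To illustrate the second point, here is a minimal, hypothetical sketch of how a per-process in-memory cache ends up serving stale data. This is not the actual pushlog extension; the file path, cache variable, and WSGI wiring are invented for illustration.

    # Hypothetical sketch only: each serving process reads the pushlog file
    # once and then holds it in memory, so a later write to the file (e.g. by
    # another host over NFS) is never reflected in this process's responses.
    _PUSHLOG_CACHE = None  # survives for the lifetime of the WSGI process

    def load_pushlog(path="/repos/example/.hg/pushlog"):  # illustrative path
        global _PUSHLOG_CACHE
        if _PUSHLOG_CACHE is None:
            with open(path, "rb") as f:
                _PUSHLOG_CACHE = f.read()
        return _PUSHLOG_CACHE  # stale if the file changed after the first read

    def application(environ, start_response):
        # Minimal WSGI app: every response is built from the cached bytes.
        body = load_pushlog()
        start_response("200 OK", [("Content-Type", "application/octet-stream"),
                                  ("Content-Length", str(len(body)))])
        return [body]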

This took a lot of effort and coordination with the release engineering team to ensure that downtime was kept minimal and there were no feature or performance regressions along the way.

A large part of the transition was rewriting the Puppet module that we use to deploy Mercurial. The module is now available on GitHub for people to comment on and use. Of course, pull requests are appreciated.

Now, on to some stats! (After the numbers there’s a quick sketch of how figures like these can be computed from the raw request times.)

Before:

  • Sample size: 1063259
  • Mean: 16.9414 seconds per Hg operation (?cmd=getbundle)
  • Minimum: 0.000112 seconds
  • Maximum: 1353.56 seconds
  • 95th percentile: 19.551 seconds per Hg operation

After:

  • Sample size: 536678
  • Mean: 15.7102 seconds per Hg operation (?cmd=getbundle)
  • Minimum: 0.000123 seconds
  • Maximum: 1236.69 seconds
  • 95th percentile: 19.5716 seconds per Hg operation
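
As an aside, summary statistics like these are easy to compute once the per-request ?cmd=getbundle durations have been extracted from the Apache logs. Here’s a minimal sketch (the log-parsing step is assumed and not shown, and the simple nearest-rank percentile used here may not match exactly how our figures were produced):

    import statistics

    def summarize(durations):
        """Summary statistics (in seconds) for a list of per-request durations."""
        ordered = sorted(durations)
        p95 = ordered[int(round(0.95 * (len(ordered) - 1)))]
        return {
            "sample_size": len(ordered),
            "mean": statistics.mean(ordered),
            "minimum": ordered[0],
            "maximum": ordered[-1],
            "95th_percentile": p95,
        }

    # Example with made-up durations; real input would come from the logs.
    print(summarize([0.12, 3.4, 16.2, 19.8, 25.0]))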

From that data we don’t see any significant performance improvement or degradation. I’m hoping that in the future we’ll be able to measure end-to-end client clone times, where the improvement should be more visible. With the local disks we’ll be able to stream uncompressed copies at line speed, which should result in 180-second clone times.