We have quite a bit of infrastructure around this, including Tinderbox Pushlog (TBPL) and more. This post deals with that infrastructure and the problems we face while trying to scale the ‘try’ repository.
A few statistics:
- The try repository currently has 17943 heads. These heads are never removed.
- The try repository is about 3.6 GB in size.
- Due to Mercurial’s on-wire HTTP protocol, this number of heads causes HTTP cloning to fail.
- There are roughly 81000 HTTP requests to try per day.
- To fix problems (mentioned below), the try repository is deleted and re-cloned from mozilla-central every few months.
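As a rough illustration of how a head count like the one above could be measured, here is a minimal sketch. It assumes `hg` is on the PATH and that a local clone is available; the function names and the repository path argument are hypothetical and not part of our tooling.

```python
import subprocess


def count_heads_from_output(output: str) -> int:
    """Count non-empty lines, one per head node printed by hg."""
    return sum(1 for line in output.splitlines() if line.strip())


def count_heads(repo_path: str) -> int:
    """Count topological heads in the repository at repo_path.

    The revset heads(all()) selects every changeset with no children.
    On a repository the size of try this may take a while to run.
    """
    result = subprocess.run(
        [
            "hg", "--repository", repo_path,
            "log", "-r", "heads(all())",
            "--template", "{node}\n",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_heads_from_output(result.stdout)
```

Calling `count_heads("/path/to/clone")` on a local clone would return the number of heads as an integer.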
There are a number of problems associated with such a repository. One particularly nasty one has been present through several years of Mercurial development, and has been tricky in that it is seemingly unreproducible. The scenario is something like:
- A user pushes (‘hg push’) some changes to a new head on try
- The push process takes a long time (sometimes anywhere from 10 minutes to hours)
- A developer might issue an interrupt signal (ctrl+C), which causes the client to gracefully hang up and exit (this typically has no effect on the server)
- Subsequent pushes will hang with something similar to ‘remote: waiting for lock on repository /repo/hg/mozilla/try/ held by hgssh1.dmz.scl3.mozilla.com:23974’
- When this happens, an hg process running on the server has the following characteristics:
  - A ‘hg serve’ process runs single-threaded using 100% CPU
  - strace-ing and ltrace-ing reveal that the process is not making any system calls or external library calls
  - perf reveals that the process is spending all of its time inside some ambiguous python function
  - pdb shows that the process is spending all of its time in a function that (at some point in the stack trace) is going through ancestor calculations
  - The process will eventually exit cleanly
- As operators, there is nothing we can do to alleviate the situation once the repository gets into this state. We simply inform developers and monitor the situation.
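Since monitoring is all we can do once the lock is stuck, it can help to pull the lock holder's host and PID out of the message automatically. The following is a minimal sketch that assumes the message format quoted above; `LockHolder` and `parse_lock_holder` are hypothetical names, not part of Mercurial.

```python
import re
from typing import NamedTuple, Optional


class LockHolder(NamedTuple):
    repo: str
    host: str
    pid: int


# Matches the "waiting for lock" line quoted above. The quote characters
# around host:pid are optional, and both straight and curly forms are accepted.
_LOCK_RE = re.compile(
    r"waiting for lock on repository (?P<repo>\S+) "
    r"held by ['\u2018\u2019]?(?P<host>[^:'\u2018\u2019\s]+):(?P<pid>\d+)"
)


def parse_lock_holder(line: str) -> Optional[LockHolder]:
    """Return the repository, host, and PID named in a lock-wait line.

    Returns None if the line is not a lock-wait message.
    """
    match = _LOCK_RE.search(line)
    if match is None:
        return None
    return LockHolder(
        repo=match.group("repo"),
        host=match.group("host"),
        pid=int(match.group("pid")),
    )
```

The PID can then be checked on the named host to confirm that it belongs to the ‘hg serve’ process pegging a CPU.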
There have been several ideas on ways to alleviate the problem:
- Periodically reset ‘try’. This is considered bad because 1) it loses history, and 2) it is disruptive to developers, who might have to re-submit try jobs.
- Reset try on the SSH servers, but keep the old try repositories on the HTTP servers. This could create unforeseen problems as those repositories grow even larger on the HTTP servers. Resetting the HTTP copies as well (staggered from the SSH server resets) would avoid that risk, but would still lose history.
- Create bundle files out of pushes to ‘try’, then host these in an accessible location (S3, an HTTP webroot, etc). I will detail this method in a future blog post.
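To give a feel for the bundle idea ahead of that future post, here is a minimal sketch of how a command for bundling a single pushed head could be assembled. It is an illustration under assumptions, not the method we will deploy: it uses the standard `hg bundle --rev ... --base ...` form, and the function name, directory layout, and `.hg` file naming are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def build_bundle_command(
    repo_path: str, head_node: str, base_node: str, out_dir: str
) -> List[str]:
    """Build an `hg bundle` command for one pushed head.

    base_node is assumed to be a changeset the consumer already has
    (for example, one present in mozilla-central), so the bundle holds
    only the changesets between base_node and head_node.
    """
    # Hypothetical naming scheme: one bundle file per pushed head.
    bundle_path = PurePosixPath(out_dir) / f"{head_node}.hg"
    return [
        "hg", "--repository", repo_path,
        "bundle",
        "--rev", head_node,
        "--base", base_node,
        str(bundle_path),
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`, and the file it writes could then be uploaded to whichever location is chosen. A consumer with the base changeset would apply it with `hg unbundle`.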
For now, though, try will need to be reset periodically as a countermeasure to the hangs described in this post. A reproducible test case might allow us to track down the bug or inefficiency in Mercurial and fix this problem for good. If you’d like to help us with this, please ping fubar or me (bkero) on irc.mozilla.org.