When Everything Falls Apart: Stories of Version Control System Scaling
This slide deck uses the Shower presentation system: roll over the previews to see the notes, and click any slide to go into presentation mode. Then use the arrow keys to navigate.
Some stories showing the trials and triumphs of Version Control administration
Ben, Kero, Linux.Conf.Au 2015, Auckland New Zealand, 2015-01-16
This is a simple cover slide with an image in the middle
System Administrator, VCS
Image by Chris Heilmann
Introduce yourself, who you are and why you are the person to give this talk
Scaling version control systems
Primarily detailing Mercurial
Getting kicked off of Github
Image by @jonrohan
Repositories: 3,445 (1,223 unique)
Commits: 32,123,211 (865,594 unique)
2TB+ transfer per day
Happy Meal clones served daily
Biggest consumer: ourselves
Tested platforms: > 12
We primarily use Mercurial for Firefox development, although FirefoxOS uses quite a bit of Git on Github. We have a few thousand repositories and tens of millions of commits, although interestingly fewer than a million of those are unique. We have over a thousand fresh clones every day, plus many more partial updates on top of that. We test on almost any platform you can think of, too.
We use other version control systems too. Git is interesting, and the first story will be about that. The others we host, but we don't particularly use them extensively and haven't had to scale them up. Interestingly, the RCS repository is the mail aliases file that Postfix uses, which dates all the way back to when it came from Netscape.
We have two SSH masters and ten mirrors that serve HTTP traffic. Unpictured is a load balancer cluster that strips the SSL from incoming HTTPS traffic and sends it to us. Also unpictured is the NFS server that holds the repository data.
Okay, you must be thinking: when is he going to get on with it?
Know what you're hosting
1st story
Here's the first story; it has to do with git.
I was minding my own business, manning the BOFH control center, as one does, when a bug came in.
This was in the early days of git.mozilla.org. I was in IT, and IT was manually creating git repositories for people, since Release Engineering wanted veto power over anything they thought might affect their availability. It was a bit suspicious that Github kicked him off. I wondered why. Could it have to do with the 1.7 GB of space taken up by the repository? It's not *THAT* big. Then again, I've never had a 1.7 GB repository on Github, so maybe they're cracking down on people who use too much space.
So I log into github, look at the repository. Yup, it's been disabled due to "excessive use of resources". Story checks out. I don't think he can delete it either, so it's like a cone of shame that that particular developer has to wear for the rest of his career. Let's let someone else deal with it.
Bug comment calling me out
Oh crap, he called me out. I guess *I* have to do it.
And it's *IMPORTANT*. Now, this was before (and quite possibly how) I learned the difference between something being important and something being urgent. So I guess I'd better do this right away.
I'm thinking "Okay, this host is mostly idle. 24 cores, 60 GB RAM, and a load average of 0.5. The repository is only 1.7 GB and it only belongs to one user. The load from it shouldn't be that bad. I'll just create it and give it to him.
Bug update with credentials
So I make it for him. I give him the URI he can push to, and tell him the URL where he can access the web portion. Mark the bug RESOLVED/FIXED, dust my hands off, and be done with the whole thing, right? That's what I thought.
All was not well. On the early morning of the 22nd (the developer was in Europe) we began to get some alerts about availability. Looking at the load graphs, we can see that indeed there were some problems. Nagios was reporting that the host was using critical levels of swap memory. This would happen for a while until *SOMETHING* completed, then the host would go back to its normal barely-loaded self.
So I SSH in to the server and take a look at the repository. Whoa! This doesn't seem to just be 1.7 GB anymore.
I look in the commit log to see a bunch of 'automerge' commits. It looks like what he's doing is automerging changes from Mercurial and applying them to this git mirror of Mercurial. That means whenever a commit lands in any of the Mozilla-Central repositories, it will be mirrored here.
Commit log (highlighting dates)
And look at the timestamps on these. 30 seconds apart? 1 minute apart? Now some of you are starting to get the picture of what's going on here.
...and here's what happened to the server. I don't have another Htop interface to show you, so I'm just going to show you this rough analogue of what happened. When I logged in, all the cores were pegged and the host was running into swap. This led to an unusable shell. Somehow sshd was still getting runs in the process queue, but interactive shells would hang. I needed to run 'ssh $HOST $command' to be able to execute any commands. So I turned the repository read-only and waited for the load to die down. After that I started hunting for fixes.
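(As an aside: there are a few ways to make a git repository refuse pushes, and a pre-receive hook that always fails is one common trick. This is just a sketch of the idea, not necessarily exactly what we did.)

    #!/bin/sh
    # hooks/pre-receive in the bare repository (illustrative)
    # Reject every push while we sort the repository out.
    echo "Repository is temporarily read-only." >&2
    exit 1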
git-config man page line count
Being a Gentoo ricer in a past life I'm used to reading thick man pages about tuning flags. Still, this one was particularly dense. I knew it was either packing operations or garbage collection that were causing it, but it was difficult to say which. Additionally this was happening on a live system, so any mistune here could increase load or memory usage, potentially hardlocking the box. No bueno.
pack.windowMemory git-config man page
Awesomely, git repositories can have their own tuning options in the repo/.git/config file in addition to the global /etc/gitconfig. Digging through the man page and checking online, I found a few good options. Setting pack.windowMemory would limit the amount of memory used, which was really valuable for us since it meant we didn't OOM our box with several concurrent operations on this repository. By the way, the default windowMemory limit is 0, which means unlimited.
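For example, in the repository's own config file (the 256m value here is purely illustrative, not the number we settled on):

    [pack]
        # Default is 0, i.e. unlimited memory per pack window
        windowMemory = 256m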
gc.auto git-config man page
Additionally, gc.auto is a tuning option that tells the repository how many loose objects to allow before starting to pack them. Interestingly the pack can occur during a seemingly innocuous operation, such as 'git status'. This turned out to be a huge win for us, since we could have a little more control over when gc operations occurred. We ended up tuning this to 1,000 objects. Using these two options, we rsynced the repository to a secondary host, then ran a manual pack operation on it.
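Roughly, that looked like this. The 1,000-object threshold is the one we settled on; the paths, hostname, and the choice of a plain 'git gc' for the manual pack are illustrative:

    # In the problem repository: pack once ~1,000 loose objects accumulate
    git config gc.auto 1000

    # Copy the repository to a secondary host and repack it there,
    # away from the production box
    rsync -a /srv/git/repo.git/ secondary-host:/srv/git/repo.git/
    ssh secondary-host 'cd /srv/git/repo.git && git gc'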
The repository had shrunk. From 208GB down to 28GB. The pack operations had an effect. It's important to note that although you can put the whole repo into one single pack file, that has other performance problems as well. Fortunately for us we didn't encounter them after this.
After these changes the load on the server fell. That's certainly not the last of the issues, but that was a big one.
Phew. After we made the changes we watched the host very carefully to look at how it was performing. Tuning these two knobs seemed to have done the trick. There were no more load spikes, and no loss of availability from the host at all. Gentoo ricers: 1, stubborn git repos: 0
Okay, this next one is also a story of melting servers, but it takes a different form. You'll see what I mean in a bit. I'm going to start off by explaining a little bit about the history of managed software and CI at Mozilla!
So the year was 2003, and neither Github nor pull requests would be invented for several more years. But developers still existed, and they needed somewhere to put their code. Thus, we had version control systems.
Greg Borenstein, Github
Mozilla, just after learning about this new hotness called Continuous Integration, decided that it was the only sane way to coordinate all of its developers. The way it worked was this: developers wrote code and generated patches. These all got bundled into Mercurial changesets, which were then bundled into changegroups.
By the way this thing is called a Build Indicator, and people build all sorts of things like this. This one belongs to Github, you can check it out if you grab the link from my slides.
Wikipedia: Build Indicators page
By the way, this is slightly absurd. There's a whole wikipedia page devoted to the build light indicators that folks build for their CI systems.
So the developers write code, then they commit it to their checked out copy of our source code. Then push it to a repository called 'try' in a new head (like a Git branch). We called it try because developers would push their code there to try to build it before putting it into the main repo (mozilla-central) and potentially breaking everybody else's checkouts.
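In Mercurial terms that push looks roughly like this (URL shown as an example); the -f is there because you are deliberately creating a new head on the remote:

    hg commit -m "my speculative change"
    hg push -f ssh://hg.mozilla.org/try/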
So the code's in the try repository. So far so good. Now it gets checked out by build hosts on a ton of different platforms. Windows, Linux, OS9. Whatever you can think of.
So their code gets checked in, and it builds fine. They go to a site later and look at the build status. It's all good.
Documentation excerpt from
Now here's one of the tricky parts. One of the original ideas for Mercurial is that history would be immutable. This sounds good in principle, and has a lot of advantages. Scalability is not one of those advantages. To their credit, they did add the concept of phases in Mercurial 2.1. This lets you have private or "non-publishing" repositories, whose heads ARE mutable. Unfortunately for us we were kind of behind, and were stuck on 2.0.2 for about a year. When the developers push these changegroups in, they create Mercurial heads. As you might expect, the authors didn't consider how to clean these up. That's IT's problem. So you end up with something like this...
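(Quick aside before the picture: making a repository non-publishing is just a one-line setting in the server-side hgrc once you're on Mercurial 2.1 or later.)

    # .hg/hgrc on the served repository
    [phases]
    publish = false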
Which results in this. Think of a head like a branch. Technically they're different, but for the purpose of this presentation they're analogous. Let's see a show of hands. Who here thinks this is an expected use case for your version control system? Okay, now who thinks this is likely to break things? Yup, we're going to get to that. But first, let's see how these numbers came to be.
30 push 4 times per week
70 push 2 times per week
(30 * 4) + (70 * 2) = 260 pushes per week
= 1040 pushes/month
= 12480 pushes/year
Let's say we have 100 Firefox developers. They're all writing code and pushing it to this repository a few times per week. 30 of those 100 push to it 4 times per week, and 70 of them push twice a week. That's 260 pushes. Which works out to about a thousand per month, then 12 and a half thousand per year.
Scene from the Lemmings game for PC
At about 10,000 heads things start to fall over. Well, sometimes. It doesn't happen with every operation. Since nearly all of these heads are idle, they don't encounter some problems but still encounter others. When the web side and SSH side were hosted on the same box, we had availability issues on both sides. Devs couldn't push code, and our systems couldn't check it out either. Since we moved the HTTP side to separate hosts, those are fine but developers still have issues pushing to try. So much so that we made a wiki page explaining what to do.
MDN Developer documentation
Here's that page. I've blurred out a little bit so that it won't steal my thunder and make you bored for the rest of the story. You'll see that it says if you're experiencing excessive wait, file a bug asking IT to reset it. Developers sometimes file these, sometimes poke us on IRC. When that happens...
Image credit to
We are not impressed. Happening only once a year, it wasn't really worth the engineering effort to solve the problem properly. This makes developers grumpy too, because then they have to submit their changes to try again to be run. When I say 'reset try', what I mean is that we delete the try repository and make a fresh clone from mozilla-central. This is what I refer to as the nuke-and-pave approach. It's akin to a very blunt hammer. This makes everybody grumpy: IT because they need to dredge out documentation on how to do this, developers because they'll need to push again, and Release Engineering because they need to go remove all the old build jobs that would try to check out code that isn't there anymore.
Growth of Try heads over time
Image credit to
We don't have 100 active code contributors to Firefox anymore; we have many more. From 2012 to 2015 our codebase doubled in size, which means we get to the 10,000-head point in much less than a year. We needed a real solution to this problem instead of the nuke-and-pave style 'try reset' approach I mentioned earlier. Compounding this, there was a dictate that we weren't to delete any more try heads, so long-term that was no longer even on the table as an option.
Happens on pushes when heads > 10,000
45+ minutes to return, sometimes never
Process: 'hg serve'
1 core pegged
No strace output
No ltrace output
Killing it yields no traceback
If killed, happens on (most) subsequent runs
So let's look at the symptoms. This only happens on the try repository, and only when it has a lot of heads. It takes a long time to return, if at all. When it's running it is pegging a core, but it's not making any system calls, nor any library calls. If killed it yields no traceback. Finally, if we kill it, it will most likely happen again during the next run. What is it?
Image by @jonrohan
This bug seemed to defy all scrutiny. Clearly if we were to get to the bottom of it we'd need to try harder. So try harder we did.
Custom GDB script
The only way left to debug this was to use GDB. We'd have to attach to the running process and demand it dump its stack trace.
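It boils down to something like the following (the file name is illustrative; the py-bt command comes from the Python gdb extensions, shipped as python-debuginfo on Red Hat systems):

    # stack-dump.gdb -- assumes gdb is already attached to the 'hg serve' process
    bt
    py-bt
    detach
    quit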
This gdb script is fairly straightforward. It assumes that gdb is already running and attached to a running python process. The first line prints the system stacktrace, the second line prints the python stacktrace, the third detaches from the running process, and the fourth kills the gdb session. This works, and gives the output...
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
#0 0x0000003cb3c8373c in set_contains (so=0x1b19050, key=268353) at Objects/setobject.c:1867
#1 0x0000003cb3cd4130 in cmp_outcome (f=&lt;value optimized out&gt;, throwflag=&lt;value optimized out&gt;) at Python/ceval.c:4241
#2 file '.../mercurial/ancestor.py', in '__iter__'
#11 file '.../mercurial/branchmap.py', in 'update'
#15 file '.../mercurial/branchmap.py', in 'updatecache'
#19 file '.../mercurial/localrepo.py', in 'branchmap'
#22 file '.../mercurial/localrepo.py', in 'branchtip'
#25 file '.../mercurial/hgweb/webutil.py', in 'nodeinbranch'
#28 file '.../mercurial/hgweb/webcommands.py', in 'changelist'
So here's a highlight of the output that you get if you take that gdb script and run it against one of those 'hg serve' processes stuck in a loop. Ignore the first two lines, that's the system section. Yup, as expected it's running Python. If you look down a little more you'll see the python traceback. What it's doing is performing a cache update of the branchmap. To do that it's iterating over every ancestor. As you might expect, on a repository with 10,000 heads this takes a bit of time.
Why is it updating the cache?
Mercurial Bug 4255
Here's the bug about fixing it upstream. Submitted in May, and as of now it has the 'CONFIRMED' status. We're looking to help them fix it upstream, but it might require some considerable changes to the code for this to happen.
File bug upstream
GeneralDelta compression format
Find ways to change caching behavior
Plan new, more scalable system
What are we doing now to mitigate the problem? We've filed bugs upstream, we've switched to a better compression format, and we know what behavior triggers these hangs. That's a huge step in the right direction from where we were before. Long-term, we think we'll need to build a new system to address the scaling problems we've experienced. What I just talked about is a big part of that.
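For the curious, generaldelta is a Mercurial repository format option, so an existing repository has to be re-created with it turned on; something along these lines (paths illustrative):

    # Re-create the repository with generaldelta revlogs
    hg clone --pull --config format.generaldelta=true old-repo new-repo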
NASA, Apollo 17
The result is that we've duct-taped our lunar buggy. *POINT TO DUCT TAPE*. But it's not 1972 anymore, and although the system still has a lot of life we can coax out of it, we want to refactor to a system we think is going to scale better and create better workflows for our developers.
Need to replace this old system
More web-scalable (needz MongoDB)
Closer to a pull-request model
Leverages Mercurial bundles
Stores bundles in scalable object store
Ideally should require minimal retooling from other groups
So we're designing a new system to replace Try. We need something that meets the needs of our developers in a way that the old system doesn't. It should deal with many developers committing in short amounts of time, we should never have to destroy history, we shouldn't have to deal with giant cache rebuilds anymore, etc. This in conjunction with MozReview (our pull-request style system) should allow us to get out of the developer's way and give them the self-service resources they need to thrive.
Know what you're hosting
Don't put all your eggs in one basket
Don't assume your approach is going to work forever
You don't live in a vacuum
So in conclusion: know what you're hosting. Ask questions until you're certain you know exactly what you're getting yourself into. Don't host everything together if you can help it. Although it's very easy to just have "the git box", if someone comes along and gives you an unwelcome surprise it can have serious consequences, such as affecting production uses of that service. Don't assume that the way you do things now is always going to work. Just because a piece of software is convenient to set up doesn't mean it's going to scale. Lastly, don't pretend that you live in a vacuum. There are a lot of people out there that have very similar issues to what you have. Be active on IRC, read and ask questions on mailing lists. We wouldn't have been able to get as far as we have with all of this without the continued help and support from the community. Sure, we ran into some issues before anybody else did, but having an active and open community is invaluable for diagnosing and resolving issues.
http://planet.mozilla.org/releng/
http://gregoryszorc.com/blog/
http://bke.ro/
If you want to find out more, here are some resources to do that. You can check out the Release Engineering section over at Planet Mozilla (or check out the whole Planet, there are a lot of interesting topics that go on there). You can also check out the blog of my teammate, gps. He writes a lot about developer productivity and is an expert on Mercurial internals. You can also read my blog, up at bke.ro. It has some more personal articles on there, although from time to time I do cover version-control related topics.