When Everything Falls Apart: Stories of Version Control System Scaling

Scaling Version Control

High-five at the Mozilla Festival!

A bit about me

Ben Kero
  • Ben Kero
  • System Administrator, VCS
  • Release Engineering
Image by Chris Heilmann

What's this about?

Heisencat
  • Scaling version control systems
  • Primarily detailing Mercurial
  • Headaches, Heisenbugs
  • Getting kicked off of GitHub
Image by @jonrohan

A bit of background

Some statistics

Mercurial
  • Primarily Mercurial
  • Repositories: 3,445 (1,223 unique)
  • Commits: 32,123,211 (865,594 unique)
  • 2TB+ transfer per day
  • 1,000 clones served daily
  • Biggest consumer: ourselves
  • Tested platforms: > 12

We also use…

  • Git
  • Subversion
  • CVS
  • Bazaar
  • RCS

Infrastructure (Hg)

[Diagram: two Shell hosts and ten HTTP daemon hosts]

Get to the stories

Know what you're hosting

1st story

BOFH control center

BOFH Control Centre

Original bug request

Initial bug requesting repo creation

Github repo disabled pic

command-line showing a 208 GB repository

Bug comment calling me out

Bug author requests I handle it

and it's important...

Bug author says it's important

Mostly idle host

Htop showing a mostly idle host

Bug update with credentials

Bug detailing how to access the new repo

Load spike graph

Graphite load graph of the server after new repo added

SSH in, du'd repo, 208GB

command-line showing a 208 GB repository

Commit log

git commit log of the repository

Commit log (highlighting dates)

git commit log of the repository (highlighting frequency of commits)

Brick animated GIF

What I had basically done was throw a brick into a washing machine

git-config man page line count

git-config man page line-count (2601)

pack.windowMemory git-config man page

git-config man page pack.windowMemory section

gc.auto git-config man page

git-config man page gc.auto section
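
The slides name the settings that were consulted (pack.windowMemory and gc.auto) but not the values chosen. As a minimal sketch, assuming a hypothetical repository path and an illustrative memory cap, the commands for limiting pack memory and then repacking could be assembled like this. The helper name `gc_commands` and its default value are assumptions for illustration, and the commands are only built, not executed.

```python
from typing import List


def gc_commands(repo_path: str, window_memory: str = "1g") -> List[List[str]]:
    """Build (but do not run) git commands that cap pack memory, then repack.

    repo_path and window_memory are illustrative; the talk does not state
    which values were actually used.
    """
    git = ["git", "-C", repo_path]
    return [
        # pack.windowMemory caps the memory each pack-objects thread may use
        # for its delta window; it accepts k/m/g suffixes.
        git + ["config", "pack.windowMemory", window_memory],
        # Garbage-collect and repack the repository.
        git + ["gc"],
    ]


for argv in gc_commands("/path/to/repo.git", "512m"):
    print(" ".join(argv))
```

Each list can be handed to `subprocess.run(argv, check=True)` once the values have been reviewed for the host in question.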

And we waited…

18 hours later…

28GB size after GC

After GC, it was much smaller! 28GB

Load drop after GC

The load flattened out afterwards

Phew

What else ya got?

Statue of Yoda Image by seanness

2003

Stop-light CI system

A stop-light CI system indicator Image by Greg Borenstein, GitHub

Wikipedia: Build Indicators page

Slightly absurd wiki page about build light indicators

Try explanation

Statue of Yoda Image by seanness

Build Farm

Our CI build farm Image by Matthew Murray

S'all good, man

Saul Goodman ('It's all good, man') Image by Irmin Wehmeier

Immutable changes

Mercurial documentation Documentation excerpt from Mercurial Wiki

Number of heads

Mercurial documentation

Head Math

  • 100 developers
  • 30 push 4 times per week
  • 70 push 2 times per week
  • (30 * 4) + (70 * 2) = 260 pushes per week
  • = 1040 pushes/month
  • = 12480 pushes/year
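
The arithmetic above can be reproduced in a few lines. This is a small sketch that follows the slide's own rounding (4 weeks per month, 12 months per year); the function and constant names are made up for illustration. Using 52 weeks per year instead would give 13,520 pushes.

```python
WEEKS_PER_MONTH = 4    # the slide's rounding, not calendar-exact
MONTHS_PER_YEAR = 12


def pushes_per_week(groups):
    """Sum developers * pushes-per-week over (developers, rate) pairs."""
    return sum(devs * rate for devs, rate in groups)


weekly = pushes_per_week([(30, 4), (70, 2)])
monthly = weekly * WEEKS_PER_MONTH
yearly = monthly * MONTHS_PER_YEAR

print(weekly, monthly, yearly)  # 260 1040 12480
```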

Try limit lemmings

Mercurial documentation Scene from the Lemmings game for PC

Wiki page try section

Wiki page MDN Developer documentation

Don't poke the bear

Wiki page Image credit to LadyOfHats

Growth of Try heads over time

Gaining more developers Image credit to OpenHUB

Symptoms

  • Happens on pushes when heads > 10,000
  • 45+ minutes to return, sometimes never
  • Process: 'hg serve'
  • 1 core pegged
  • No strace output
  • No ltrace output
  • Killing it yields no traceback
  • If killed, happens on (most) subsequent runs
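
To get a feel for how quickly a repository reaches the 10,000-head mark, here is a rough projection. It assumes, and this is an assumption rather than something the slides state outright, that every push adds one head that is never pruned, and it uses a rate of 1,040 pushes per month for 100 developers. The function name is hypothetical.

```python
import math


def months_until(threshold: int, pushes_per_month: int, starting_heads: int = 0) -> int:
    """Whole months until the head count reaches the threshold.

    Assumes one new, never-pruned head per push (an illustrative assumption).
    """
    remaining = max(threshold - starting_heads, 0)
    return math.ceil(remaining / pushes_per_month)


print(months_until(10_000, 1040))  # 10
```

Under those assumptions a fresh repository would cross the threshold in under a year, which is why the problem shows up long after the system appears to be working fine.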

Heisencat

Heisencat, the bug defied scrutiny Image by @jonrohan

Next debugging steps

  • python26-debuginfo
  • GDB
  • Custom GDB script

GDB command and output

      bt
      py-bt
      detach
      quit
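
One way to run a command file like the one above against a stuck process is GDB's batch mode. The sketch below only builds the argument list; the script filename and the helper name `gdb_attach_argv` are hypothetical, and `py-bt` is assumed to be available through the Python GDB helpers installed alongside the debuginfo package.

```python
from typing import List


def gdb_attach_argv(pid: int, script_path: str) -> List[str]:
    """Build a gdb command line that attaches to pid and runs a command file.

    -p attaches to a running process, -batch exits after the commands finish,
    and -x names the file of GDB commands to execute.
    """
    return ["gdb", "-p", str(pid), "-batch", "-x", script_path]


# Example: a hypothetical PID and script name.
print(" ".join(gdb_attach_argv(12345, "hg-backtrace.gdb")))
```

Passing the result to `subprocess.run` (typically as root or the owning user) captures both the C-level and Python-level backtraces without an interactive session.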
      

GDB Output

      GNU gdb (GDB) Red Hat Enterprise Linux (7.2-64.el6_5.2)
      ...
      #0  0x0000003cb3c8373c in set_contains (so=0x1b19050, key=268353) at Objects/setobject.c:1867
      #1  0x0000003cb3cd4130 in cmp_outcome (f=, throwflag=) at Python/ceval.c:4241
      #2 file '.../mercurial/ancestor.py', in '__iter__'
      #11 file '.../mercurial/branchmap.py', in 'update'
      #15 file '.../mercurial/branchmap.py', in 'updatecache'
      #19 file '.../mercurial/localrepo.py', in 'branchmap'
      #22 file '.../mercurial/localrepo.py', in 'branchtip'
      #25 file '.../mercurial/hgweb/webutil.py', in 'nodeinbranch'
      #28 file '.../mercurial/hgweb/webcommands.py', in 'changelist'
      

Unhidden wiki

Wiki doc excerpt about cache invalidation Mozilla Wiki: TryServer

Why is it updating the cache?

Upstream bug snapshot Mercurial Bug 4255

So what now?

  • File bug upstream
  • GeneralDelta compression format
  • Find ways to change caching behavior
  • Plan new, more scalable system

The Result

Upstream bug snapshot NASA, Apollo 17

The New Hotness

  • Need to replace this old system
  • More web-scalable (needz MongoDB)
  • Closer to a pull-request model
  • Multi-homing
  • Leverages Mercurial bundles
  • Stores bundles in scalable object store
  • Ideally should require minimal retooling from other groups

In Review

  • Know what you're hosting
  • Don't put all your eggs in one basket
  • Don't assume your approach is going to work forever
  • You don't live in a vacuum

Further Reading

http://planet.mozilla.org/releng/

http://gregoryszorc.com/blog/

http://bke.ro/

Thanks

Red panda (Firefox) Photo by Yortw