A Note on Last Night's Downtime

Riley D.
Roll20 Team
Last night we had a very ill-timed and unfortunate bout of downtime. It started around 9:45 PM CST and lasted until around 11:45 PM CST. During that time a good chunk of users were unable to access their campaigns.
First off, I just want to say that we take downtime really seriously. We know that you plan your games well in advance and that often even 10 or 15 minutes of downtime can ruin a game experience. Our goal is to have 100% uptime. Obviously we're a relatively new service that's growing quickly so there will be bumps in the road, but we want those bumps to be the exception for Roll20, not the rule. Along those lines, you can view the current status of all of our servers as well as the historic data going back 3 months at our status site: http://status.roll20.net.
Regarding last night's downtime in particular, the core issue stemmed from the number of users that were accessing the site, paired with the size of all the campaigns that were being played. This caused our real-time service provider, Firebase, to go down. Once that happened, they immediately responded and began the process of bringing the server back up. However, a new issue arose which they had not encountered before, preventing them from being able to return the server to a working state. At that time they made the decision to move Roll20 to a new server on their network, a process which took about 30 minutes to complete. Once that was finished, data was migrated, and the server started to respond again.
So, all told service was completely degraded from around 9:45 PM - 10:45 PM, then from 10:45 - 11:45 service was partially degraded. Around Midnight CST service was fully restored for all users.
I've been working with Firebase since last night and into this morning to determine the root cause of the issue, and to develop a plan of action so we can keep this from happening again. We've identified a few things that we can do to fix the problem, and they're confident that we shouldn't experience issues today even if we have similar levels of activity. We'll continue to work with Firebase to make sure that things are running smoothly today and over the coming week.
Know that your ability to enjoy a game on Roll20 is our top priority, and we sincerely apologize for any games last night that were interrupted or canceled due to these issues.
Thanks, Riley (on behalf of the whole Roll20 Team)
Ha... I got a power surge, so my internet went down for about ten minutes. When I got back on, we got to play for another 5 minutes until this hit -3-' That's some bad timing on my part. Glad you fixed it though.
John M.
KS Backer
As annoying as it was to have my game interrupted, I really appreciate the prompt action taken from the Dev Team and the folks at Firebase. Thanks for all your hard work.
Big thanks for the quick response ^.^
I appreciate the detailed account of what happened, it's understandable and I'm glad that Roll20 is growing. As it turned out, we were just discussing wrapping up for the night anyway, so we only lost about 5 minutes of in-game activity.
Ahh the downtime really threw a wrench into my scheduled campaign last night but it's just a game and these things are to be expected given the growth of Roll20 and its community. Thanks Riley for taking the time to personally respond to my post from last night and for keeping in touch with the user base about the issue. I really can't ask for more than that. Many in your position would have just kept silent, or maybe just posted something generic to calm their users. I am sure it is not easy to run something as large as Roll20 is growing to be, and overall you guys have done a great job so far! I will keep happily handing over my few bucks every month if things continue in this fashion for the future. This platform has allowed me to play for hours with those around the world and quench my D&D tooth. Thank you, Bryant
This all happened just before our normal game time. We started 90 minutes late, but things turned out OK.
Thanks, Team. Glad you guys were able to respond to the Twitter comments left by your Roll20 users to assure us how you guys were committed to fixing the problem. You all are great at what you do and know that I, personally, am happy that you are my tabletop provider.
It's great that you take the time to let us know what's going on. I ran a game a few hours ago and we didn't encounter any problems, so it seems you've got it all back up and running.
Seems we've been hit again :(
Not again =(
It seems Firebase is "restarting servers" without notifying Roll20.
Any other information on that?  Seems interesting.. though rather unfortunate for the folks at Roll20.  They'll end up with the flak from the less-informed players.
Just follow @FirebaseStatus on Twitter. Any service outage you see on Roll20 is because of Firebase.
Riley D.
Roll20 Team
It should be coming back up right now. If you're trying to get in, just try to load in once -- mashing reload only makes it take longer :-)
Riley D.: If you're listening, I highly recommend building an RTS in-house so you guys can control your own downtime. It would probably be cheaper too.
Riley D.
Roll20 Team
Devin T. said: Riley D: If you're listening I highly recommend building an RTS in-house so you guys can control your own downtime. It would probably be cheaper too.
If we thought we could pull it off any better than they are, we'd be doing it. Roll20 is basically bringing Firebase down at this point, not the other way around :P
Riley D.: If you guys change your mind I'd be willing to do a free hour of consulting. I'm an Erlang developer, so with a brief description of your requirements I could let you guys know how you could build it in Erlang. devin@devintorr.es
Riley D.
Roll20 Team
Followup: http://app.roll20.net/forum/post/67423/followup-on-downtime-tonight-2-slash-17/#post-67423
Thanks for the offer, Devin. We may contact you if for no other reason than to pick your brain :-)