Community Forums: Downtime on 12/4: Post-Mortem | Roll20: Online virtual tabletop

1449502130

Edited 1449508297

Roll20 Team

On Friday afternoon/evening, we experienced some unexpected downtime. This downtime affected all 30 of our real-time shards for approximately 5-10 minutes, and then two shards in particular were affected for an extended period (off and on over the course of about 2 hours).

First off, we want to say that downtime is always something we strive to avoid. We know that games on Roll20 are often scheduled weeks in advance, and getting a group together can be hard enough without piling another difficulty on top of it in the form of your tools not working. We take any downtime very seriously. We think it's important to let you know when it happens, why it happened, and what we're doing to prevent it from happening in the future.

The cause of this downtime was our real-time service provider, Firebase. They are the ones who provide the infrastructure that allows us to have 10,000 players online playing tabletop games together at the same time. Most of the time, this service works well (99.95% uptime is the goal, and nearly every month that goal is met). Sometimes, there are minor hiccups. Unfortunately, on Friday night there was a major outage for those two shards, which affected approximately 7% of all games.

In addition to this outage affecting these games, the entire site was slowed down by requests piling up while they were unable to reach Firebase via its REST API. We use that API to do things like create a new game, add a player to a game, or copy things between games using the Transmogrifier. This led to some 504 (timeout) errors, as well as in-game problems such as the image library search responding slowly or not at all.

By 6:30 PM, approximately 5 minutes after the downtime started, our technical team had been alerted and was responding. By 6:40 PM, we had made some changes on our end to alleviate the strain of the two shards being down, so that the 504 errors stopped and the image library became responsive again.
In addition, 28 out of our 30 shards were now back online and operating normally. We continued to work with Firebase throughout the rest of the evening to get the remaining two shards online. By 8:10 PM, service had been restored to all shards. Later in the evening, the two shards experienced a slowdown from approximately 9:30 PM to 10:00 PM; however, the changes we had made previously prevented this slowdown from affecting the rest of the site.

We did our best to communicate these issues on our Twitter feed, @roll20app, which is always the best source of information about downtime and site-wide issues. You can also always check on how we're doing on our status page, at http://status.roll20.net .

Here are a few things that we're doing now to help keep this particular issue from happening again:

We're working with Firebase to get a better idea in advance of when there will be small amounts of downtime. The initial 5 minutes or so of downtime across all shards was due to a planned database restart on their end which we were not made aware of in advance. Our goal is to always know about planned downtime in advance and communicate it to you so you can plan accordingly.

We're re-tooling pieces of the Image Library search with the aim of making it more responsive and placing less strain on the site as a whole, so that queries return more quickly and the image search can remain online even when there are technical issues elsewhere.

We're investigating other ways that we can more quickly and clearly communicate to the whole community when there are issues, and what's currently being done to handle them.
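For a sense of scale on the figures quoted in this post, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and assumes that games are spread evenly across shards; neither is stated above, and the function names are purely illustrative. How Firebase actually measures its uptime goal is not something this post covers.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the post.
# Assumptions (not from the post): a 30-day month, and games spread
# evenly across shards. Function names are illustrative only.

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Minutes of downtime a period can absorb while still meeting the target."""
    return (1.0 - uptime_target) * days * MINUTES_PER_DAY


def shard_fraction(affected: int, total: int) -> float:
    """Share of shards affected, as a fraction between 0 and 1."""
    return affected / total


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.9995)
    share = shard_fraction(2, 30)
    print(f"Monthly downtime budget at 99.95%: {budget:.1f} minutes")
    print(f"Share of shards affected: {share:.1%}")
```

Under those assumptions, a 99.95% goal leaves a little over 20 minutes of downtime per month, and two shards out of 30 is just under 7%, which lines up with the share of games mentioned above. The two affected shards were down "off and on" over roughly 2 hours, so how much of that window counts against the goal depends on how the intermittent periods are measured.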
On a personal note, I just happened to be out of town on vacation when all of this happened, so I'd like to thank the members of the Roll20 team (in particular, Steve and Stephanie) who were able to be around to start fixing things and to communicate to the community what was happening until I could get back to my computer to help. As we continue to expand the Roll20 team in the future, this type of coverage will only get better for us, which is a great thing for the entire community.

Finally, I'd like to again apologize to anyone whose games were disrupted in any way; our goal is to bring people together to enjoy tabletop gaming, and we're always striving to find ways to make sure we can meet that goal with the highest standard possible.

1449504166
