Last night we had a very ill-timed and unfortunate bout of downtime. It started around 9:45 PM CST and lasted until around 11:45 PM CST. During that time a good chunk of users were unable to access their campaigns.
First off, I just want to say that we take downtime really seriously. We know that you plan your games well in advance and that often even 10 or 15 minutes of downtime can ruin a game experience. Our goal is to have 100% uptime. Obviously, we're a relatively new service that's growing quickly, so there will be bumps in the road, but we want those bumps to be the exception for Roll20, not the rule. Along those lines, our status site at http://status.roll20.net shows the current status of all of our servers as well as historical data going back 3 months.
Regarding last night's downtime in particular, the core issue stemmed from the number of users that were accessing the site, paired with the size of all the campaigns that were being played. This caused our real-time service provider, Firebase, to go down. Once that happened, they immediately responded and began the process of bringing the server back up. However, a new issue arose which they had not encountered before, preventing them from being able to return the server to a working state. At that time they made the decision to move Roll20 to a new server on their network, a process which took about 30 minutes to complete. Once that was finished, data was migrated, and the server started to respond again. So, all told, service was completely degraded from around 9:45 PM to 10:45 PM CST, then partially degraded from 10:45 PM to 11:45 PM CST. Around midnight CST, service was fully restored for all users.
I've been working with Firebase since last night and into this morning to determine the root cause of the issue, and to develop a plan of action so we can keep this from happening again. We've identified a few things that we can do to fix the problem, and Firebase is confident that we shouldn't experience issues today even if we have similar levels of activity.
We'll continue to work with Firebase to make sure that things are running smoothly today and over the coming week. Know that your ability to enjoy a game on Roll20 is our top priority, and we sincerely apologize for any games last night that were interrupted or canceled due to these issues.
Thanks,
Riley (on behalf of the whole Roll20 Team)