A Note from the CTO, Mike Todd: Stability, Accountability, and Our Path Forward

1769807428

Edited 1769808152
Mike T.
Roll20 Team
Hi everyone,

I’m Mike Todd, formerly the CTO of DriveThruRPG and now, as of this past December, the CTO of Roll20. As a long-time TTRPG player and an engineer, I know that when you sit down for a session, the tech should stay out of the way. You're there to play a game, tell a story, and have fun with friends. Not to troubleshoot a VTT.

Lately, we haven’t been meeting that standard. Recently, we’ve had a few incidents that have caused instability for some of you. I want to be open with you all about what’s happening behind the screen and how we’re fixing it.

The Perfect Storm

The experience has been less than ideal recently, and we know that the frustration has landed squarely on you. Some of the issues we’ve seen were triggered by instability in external services like Cloudflare (the service that serves images in the VTT) and Firebase (one of our primary database services), but the truth is that we should have been better prepared to deal with those realities. Relying on third-party infrastructure does not absolve us of our responsibility to you. In fact, it raises that bar.

Infrastructure & Stability: To put it bluntly, Cloudflare has been less stable than we need it to be, evidenced by the global outage in November that impacted almost all of the Internet. We’ve seen continued issues with their service even after that, and we are evaluating options to switch to a different, more stable provider for this part of our infrastructure. We are also actively researching alternatives to Firebase to further harden our architecture.

The January Rush: I think we can agree that growth is great for our hobby, but that added strain puts every tech "bottleneck" under a magnifying glass. This month, those bottlenecks were put to the test because this is the busiest January we’ve had in years.

Owning Our Issues

Yes, there were some external issues, but I have to say we’ve had some misses that were entirely on us.
One example is that we released the new D&D sheet in a buggy state. Last January we spent over a month in a laser-focused "bug-squishing" mode, which fixed over 500 bugs and made the sheet a lot more stable. Our team has worked hard to make this a better experience for everyone, and that hard work has paid off. But while the new D&D sheet is in a much better place, there are still some smaller bugs remaining, as well as one BBEG: intermittent issues when multiple people have the same sheet open at once. This is a complex concurrency challenge, and it is the top priority for our back-end engineers right now.

Much more recently (this very week), we identified a wide-ranging issue, which has been the team’s primary focus this week. If I can lapse into tech speak for a moment, we noticed a memory usage creep on our web servers (Kubernetes pods, for the geeks out there) that was causing some of those instances to go into swap. This created a frustrating experience for some users that was often intermittent: you might have had a laggy session while your friend in the same game felt nothing, or one page load might have timed out while the next was nearly instantaneous. It was a "luck of the draw" issue based on which of Roll20’s server instances you hit.

My Infrastructure Philosophy

Whenever something in our infrastructure breaks, I have a standard three-phase response:

Fix it: Put out the immediate fire.
Instrument it: Set up monitoring so we know before it happens again.
Automate it: Build self-healing measures so the system corrects itself without human intervention.

The Road Ahead

At times internal bugs and external outages happen concurrently, making them a nightmare to disentangle. But we have to admit that, regardless of the source of the problem, the result is the same: your game night was interrupted, and ultimately that’s our responsibility.
If Cloudflare or other services are unreliable, then it’s on us to find a way to make them work or move to another service that is more reliable. In addition, we need to ensure all aspects of our systems can detect and alleviate those problems when they arise, so that your experience is not degraded.

Now that we have identified and addressed the primary cause of that memory usage creep, we are seeing immediate results: reports of “server 500” errors (a specific type of error), image loading failures, and spontaneous logouts have dropped significantly. We also have many reports from people saying that things that weren’t working a few days ago are working now.

But we aren't stopping there. In addition to keeping a close eye on things over this weekend to make sure your games run smoothly, here are our action items for the coming weeks to ensure this stability sticks:

Hardening Infrastructure: We are working directly with Cloudflare engineers as they investigate the recent instability on their end. We are also investigating the possibility of moving that infrastructure back to AWS (Amazon Web Services).
Active Monitoring & Auto-Healing: We are in the process of adding layers of additional monitoring and "auto-healing" protocols. Our goal is for the system to detect and fix issues before you notice something is wrong.
“WebGL Context Lost” Investigation: This is an error some people were experiencing in the VTT. We believe it is resolved by the Kubernetes fixes, but we are staying alert in case more reports come in.
Firebase Alternatives: We are actively researching alternatives to Firebase.

I know we've fallen short, and we are committed to doing better and being transparent with you as we navigate these challenges. If you’ve been affected by these issues, then I apologize to you and hope you can give us some time to make this right. We owe it to you.
Thanks for being part of this community, and for sticking with us as we work through these problems and continue striving to be a better partner for your games.

Sincerely,
Mike Todd
CTO
While I appreciate the desire for stability, I would suggest starting internally. The sudden release of the page design and new menu structure was annoying and off-putting. I relied on the "new messages" icon at the top of my game page, and I was late seeing new messages because it had been moved. I eventually found it in a location I think the least intuitive for any long-time user: at the bottom of a new pop-out menu panel, isolated from the menu links in a mostly-empty panel. 
5 minutes ago I ended my subscription, not because of the overall tech problems or some minor details, but because as a Pathfinder player I don't feel taken seriously on this platform. This whole Demiplane business is, as we say in German, "a shot in the oven". Most players I have talked with don't want two platforms that don't work together; they want one. The various questions I asked, whether here in the forums, on Discord, or by email, were never adequately answered. So I think you shouldn't just think about the technical problems but about the structural and communication ones, too. For me and my group, this whole Demiplane disaster led to us switching to a more PF-friendly system. If you want a D&D-only system, just say so ...
Your new update made Roll20 inaccessible in OperaGX. Y'all managed to break your service so that it is unusable in an entire BROWSER. Please fix.
1769906116

Edited 1769906305
Gauss
Forum Champion
Plaz said: Your new update made Roll20 inaccessible in OperaGX. Y'all managed to break your service so that it is unusable in an entire BROWSER. Please fix.

Hi Plaz,

Which update are you referencing? I suggest posting in the Bug Reports forum so that the issue you are having can be discussed. With that said, Roll20 only has two supported browsers, Chrome and Firefox. Worse, OperaGX is one of the worst browsers to use with Roll20. It regularly has issues with many websites, not just Roll20. Because of this, it is unlikely OperaGX will ever be a supported browser.
Thorsten P. said: 5 minutes ago I ended my subscription, not because of the overall tech problems or some minor details, but because as a Pathfinder player I don't feel taken seriously on this platform. This whole Demiplane business is, as we say in German, "a shot in the oven". Most players I have talked with don't want two platforms that don't work together; they want one. The various questions I asked, whether here in the forums, on Discord, or by email, were never adequately answered. So I think you shouldn't just think about the technical problems but about the structural and communication ones, too. For me and my group, this whole Demiplane disaster led to us switching to a more PF-friendly system. If you want a D&D-only system, just say so ...

Amen. I do get the whole pivot to D&D thing, but it doesn't excuse the mishandling of Demiplane. Additionally, I suspect that Demiplane "integration" is harming D&D now that new D&D products are handled through Demiplane. I also suspect that as Roll20 tries to straddle operations between Roll20 and Demiplane, this diverts resources from site maintenance and improvement.
Thank you for your open words, Mike, much appreciated. As a UC engineer myself, I fully understand what you are fighting with. It is not just money I am spending on Roll20 but also many hours of my life, and I want them to be worth it. I don't want to spend them disappointed by problems that keep my game from being a great time with my players, or wondering what might go wrong this time whenever I join a scheduled session. Nevertheless, it is not just the tech that has gone awry. How things have been communicated, in the past and up to today, also needs improvement.