Updated on: February 6, 2024
In a recent interview with SafetyDetectives, Brian Rue, co-founder and CEO at Rollbar, discussed the origins and distinctive features of Rollbar, a leading error monitoring tool. Rue highlighted the importance of a streamlined workflow for developers, allowing rapid deployment and quick issue resolution in production. Rollbar sets itself apart by prioritizing real-time error detection, deduplication, and prioritization. Rue also explored the evolving role of machine learning and AI in error monitoring, envisioning a future where AI not only detects but also fixes errors. Addressing the challenges of increasing software complexity, he shared insights on strategies for handling multiple services and large code bases. Rue concluded by dispelling common misconceptions about error monitoring, emphasizing the necessity of a detect-triage-fix workflow for effective issue resolution.
Can you introduce yourself and talk about what inspired the creation of Rollbar?
I’m Brian Rue, co-founder and CEO at Rollbar. I’ve been drawn to coding since I was a kid. I started with QBasic when I was little and then moved to the LAMP stack in my teen years.
I started building applications to scale in 2008 with a team I was on at Lolapps. We grew from nothing to 2 million users within a few months. The platform was constantly changing, and we were adopting new features and changes.
That meant that the faster we could get things from idea to code to production, the faster we’d grow. We realized that we could go faster by taking away most of the preproduction testing. It’s usually faster to fix things after they come up in production than to try to prevent them from breaking in the first place.
So our release process became: get it working locally in development, then ship to production. That meant logging into production, tailing the logs, pressing deploy, and watching for any new errors. If there were any, we’d fix them and deploy again, repeating until everything was clear.
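That “watch for new errors” step can be sketched as a simple diff of error signatures seen before and after a deploy. This is a minimal illustration, not Rollbar’s implementation; the signature strings and function name are hypothetical:

```python
def new_errors_after_deploy(before, after):
    """Return error signatures seen after a deploy but not before.

    `before` and `after` are iterables of signature strings, e.g.
    "TypeError in checkout.charge" (a made-up format for illustration).
    """
    return sorted(set(after) - set(before))

# Usage: if this returns anything, fix and deploy again.
baseline = ["KeyError in cart.add", "Timeout in search.query"]
current = ["KeyError in cart.add", "TypeError in checkout.charge"]
print(new_errors_after_deploy(baseline, current))
# prints ['TypeError in checkout.charge']
```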
This worked great as a small team and enabled us to ship extremely quickly, going from an idea to production in a couple of hours. As we grew from 4 or 5 devs to around 40, we built some tooling that brought that workflow into our bug tracker, allowing us to scale it up to the larger team.
So, my first goal with Rollbar was to take this type of workflow where you’re not afraid to deploy because you know that you can quickly fix whatever comes up, and make that possible for every developer.
What makes Rollbar stand out in the field of error monitoring tools?
Rollbar, first and foremost, is optimized for developers to find and fix fast.
The old way of dealing with problems in software production is that you have to find it, and you have to fix it. Your errors probably exist somewhere in Jira; they’ve been filed by support, reported by customers, or exist in your logs. But you have to find them, and then you fix them.
What Rollbar solves is finding them for you: we detect, prioritize, and manage all the things that are wrong with your code as it’s running in production.
To make that possible, you’ve got to do three things:
- Detect those issues at runtime. We do that with lightweight SDKs, so we get the meaningful errors directly from the code, along with way more context than you get from logs.
- Deduplicate those errors: In any large-scale system, you can get thousands or even millions of errors. No one wants to deal with that many tasks; it’s just impossible. But the truth is, you don’t have to, because there aren’t a million different errors. There are maybe tens or hundreds of unique problems in your code. So the main function of any error monitoring system is to accurately turn that big list into a small, actionable list that can be prioritized, managed, and assigned.
- Work in real time: The fix needs to be as close to instant as possible, so that the time between when errors are affecting customers and when you as a developer see them is a second or two rather than several minutes. That enables a better process, but you also want it when you’re interacting with the application yourself: you want to be able to say, “I clicked here and I saw the error happen there, so I know that’s the same thing.” A couple of seconds of latency versus tens of seconds or minutes is critical for responding quickly.
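The deduplication step above can be sketched as grouping raw occurrences by a fingerprint built from the exception type and the innermost stack frames. This is a simplified illustration, not Rollbar’s actual grouping algorithm; the payload shape and field names are hypothetical:

```python
import hashlib

def fingerprint(error):
    """Group occurrences by exception type plus the top in-app frames.

    `error` is a dict with "type" and "frames" keys — a hypothetical,
    simplified shape; real SDK payloads carry much more context.
    """
    top_frames = tuple(error["frames"][:3])  # innermost frames only
    key = error["type"] + "|" + "|".join(top_frames)
    return hashlib.sha1(key.encode()).hexdigest()

def deduplicate(occurrences):
    """Collapse a big list of occurrences into unique items with counts."""
    items = {}
    for occ in occurrences:
        items.setdefault(fingerprint(occ), []).append(occ)
    return items

occurrences = [
    {"type": "TypeError", "frames": ["checkout.charge", "api.handler"]},
    {"type": "TypeError", "frames": ["checkout.charge", "api.handler"]},
    {"type": "KeyError", "frames": ["cart.add", "api.handler"]},
]
items = deduplicate(occurrences)
print(len(items))  # 2 unique problems from 3 occurrences
```

The actionable list is then the unique items sorted by occurrence count, rather than the raw firehose of events.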
From the beginning, Rollbar has prioritized making the platform very fast. Today, the average time from when an error occurs in production to when we’ve deduplicated it, indexed it, and sent you a notification in Slack is about three seconds.
What is the role of machine learning and AI in the future of error monitoring?
From early 2019 through 2021, we implemented machine learning into the current generation of error deduplication. This was a high-volume, low-latency deployment of machine learning. Rather than fixing errors, it focused on accurately categorizing millions or billions of errors into a smaller, more manageable list.
Continuing to improve the accuracy and efficiency of that categorization is one ongoing area of research. The origin of the error monitoring category was going from “you find, you fix” to “found for you, but you still fix it.” The next generation is “found for you and fixed for you.”
There’s still a ways to go before this is reliably production-usable, but it demos really nicely. I have a demo script where I point GPT-4 at a Rollbar item and say, “Hey, describe the issue, propose a fix, send me a pull request, and improve the pull request.”
I think that’s the future. There’s going to be more code overall: more code written by developers with help from AI, and more code written directly by AI.
That means there are more errors to solve, right? More code, more errors. So I think we’ll see developers using AI to detect and fix these issues, and we’ll also see AI producing the fixes and playing more of the role that the developer plays manually today.
How do you see the role of error monitoring evolving with the increasing complexity of software applications?
We see a couple of different ways of coping with complexity.
If you have the complexity where there are more services, you might say, “I’ve got this big monolith, and I want to break it into multiple services.”
Now you have less complexity per service, but the combined complexity of many services. In that world, the challenge is that you might need to trace errors across multiple services to really understand the full picture.
You might have a case where there’s one service that breaks and causes eight other services to have issues at the same time. Zooming up to the level of the company, I don’t want all eight teams plus the central team to be debugging the same issue. I want one central team to fix it, and the other eight teams to hear, “FYI, the team that owns the database is fixing it.” You don’t need everyone to panic at the same time, and that’s the complexity you get when you have multiple services.
When you just have one service, you have a whole bunch of different developers and a whole bunch of different teams all working within the same massive code base. In that situation, you need to be able to ask: who owns this code? There’s probably code from multiple teams in the stack trace. So which of those frames are most relevant to the error, who should own it, and who should manage it?
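That ownership question can be sketched as a CODEOWNERS-style lookup over the stack trace, walking innermost-first until a frame matches a team’s module prefix. This is a minimal sketch; the mapping, team names, and module names are all made up for illustration:

```python
# Hypothetical mapping from module prefix to owning team.
OWNERS = {
    "billing.": "payments-team",
    "search.": "search-team",
    "db.": "infra-team",
}

def likely_owner(stack_modules):
    """Walk the stack innermost-first; the first frame that matches a
    known prefix decides which team the error is routed to."""
    for module in stack_modules:
        for prefix, team in OWNERS.items():
            if module.startswith(prefix):
                return team
    return "unowned"

print(likely_owner(["billing.invoice", "framework.dispatch"]))
# prints payments-team
```

Innermost-first is a heuristic: the frame closest to the failure is usually the most relevant, with shared framework code falling through to later matches.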
How do you stay ahead of the curve in such a rapidly evolving tech landscape?
My first thought is that it isn’t really evolving that fast, with a few notable exceptions.
There are plenty of new things, new frameworks, new languages, and new libraries, but a lot of them take a long time to gain adoption. For example, Python 3 launched in 2008, and Python 2 didn’t reach end-of-life until 2020. That’s 12 years, which is a pretty long time.
When looking at the front end, there’s a constant churn of new frameworks being created. I see it almost as a pendulum swinging between rendering on the server side, then rendering on the client side, and then moving back to rendering on the server side.
I think there are some really interesting things that have come out, but they’re not that fast-moving, right? I think Rust is amazing, but quite a bit of time has passed between when Rust started and where it is now, and I think it’s still fairly early in its adoption.
My one big exception is LLMs. When ChatGPT first became visible, it was like, wow, this has really taken a big leap forward. There was a roughly six-month period when I was constantly asking: how fast is this going to go? Is it going to replace everything instantly?
For me, the answer was that I needed to understand it at a technical level. I asked questions like:
- What does this thing actually do?
- How do I use it?
- What does it make possible?
- How does it throw out old assumptions?
I needed to find answers to these questions before I could decide how we could stay ahead of the curve and implement it in our systems.
What are the common misconceptions about error monitoring that you encounter?
The main one is people think they already have it covered. They’ll say, I already have errors in my logs, or I already have a list of errors that shows up in an APM solution like Datadog or New Relic. Therefore, that must mean I already have error monitoring.
The question is, how do you discover issues? Do you find out about issues because alerts on your overall production error rate have gone off, or are you finding out about them because your customers are telling you things are broken? If that’s what’s happening, then you’re already behind the eight ball.
When error monitoring is working, an alert should be very localized: just this one thing is broken, just this one time. It should fire in your canary release, or just when a feature is turned on.
Each alert should lead to an action: I decide this doesn’t need to be solved right now, or it needs to be solved later by another person, or I’m going to turn off that feature in production right now.
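That triage decision can be sketched as a rule that maps an alert to one of those actions. This is a toy illustration with a hypothetical item shape and made-up thresholds, not a prescribed policy:

```python
def triage(item):
    """Map an alert to an action, roughly mirroring the choices above.

    `item` is a hypothetical dict with "occurrences" and
    "behind_feature_flag" fields; the threshold of 100 is arbitrary.
    """
    if item["behind_feature_flag"] and item["occurrences"] > 100:
        return "disable-feature-flag-now"
    if item["occurrences"] > 100:
        return "fix-now"
    return "assign-for-later"

print(triage({"occurrences": 3, "behind_feature_flag": False}))
# prints assign-for-later
```

The point is not the specific rules but that every alert terminates in an explicit decision rather than noise.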
I think the main misconception is that people don’t realize the detect-triage-fix workflow can exist. They’re saying, my errors are in my logs, and therefore I must have error monitoring. It’s night and day between a workflow where you have to go look through logs to find errors and one where they’re found for you.