Fixing Severe Product Performance Issues
Nearly every fast-growing product that’s pushing technical boundaries will face crippling instability and user-perceived performance issues. Here’s how to systematically isolate and resolve them.
In a perfect world, software companies would avoid critical product performance issues through automated testing. However, few companies have the time, money, or cultural discipline for that until they’ve raised tens of millions in venture capital or have at least millions in annual revenue. I’d love to see more teams set up great automation frameworks early on, but the reality is that this work trades off directly against finding product-market fit. It’s hard to know what tests to write, or how hard to work on them, when you’re not sure which features will be in the product tomorrow. As a result, it’s somewhat inevitable that startups building novel solutions will experience not one but many moments like the ones I’m describing. Here are the hallmarks:
Customers are complaining, loudly, that the product doesn’t “work” the way it did in the demonstrations—or they aren’t talking to you anymore.
User adoption is stalling even though you know you’re solving an important problem with a strong solution and are signing up new customers.
Engineering knows there’s significant tech debt and feels like no one is paying attention.
Product (and QA if it exists at this stage) is overwhelmed trying to clarify all the conditions under which issues are occurring.
Solving these challenges is never easy, but after experiencing such moments in nearly every startup I’ve worked at or advised, it’s clear that there are some learnable patterns for making the badness go away while other approaches can cause it to linger for months or even years.
First, set user-experience latency goals
Even the best engineers I’ve ever met won’t want to do this at first, but trust me, this is where you want to start, or you’ll burn two weeks just to return to this moment. Engineering will say that logging and dashboards are already in place, but frequently those metrics are not tracking the slowness, or latency, in the system as experienced by real people. It’s great that requests to the API return in 237 milliseconds, but if the user can’t load important content for 37 seconds—even if the problem stems from that API call—the issue won’t get solved, and engineering and product will talk past each other.
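One way to make that gap concrete is to log both numbers side by side. Here’s a minimal Python sketch; `call_search_api` and `render_results` are hypothetical placeholders standing in for your own client code:

```python
import time

def call_search_api(query):
    """Placeholder for a single backend request."""
    ...

def render_results(response):
    """Placeholder for client-side work: pagination, retries, parsing, rendering."""
    ...

def load_search_results(query):
    """Time both the raw API call and the full user-perceived action."""
    start = time.monotonic()
    response = call_search_api(query)
    api_seconds = time.monotonic() - start

    render_results(response)
    experience_seconds = time.monotonic() - start  # what the user actually waited

    # The gap between these two numbers is where teams talk past each other.
    print(f"api={api_seconds:.3f}s  user_experience={experience_seconds:.3f}s")
```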
To be clear, great debugging requires both styles of measurements, but the first step is to set user-experience-based performance goals and then work backward. These can sound like:
90% of searches with five or fewer keywords return in one second or less
99% of analytics dashboards with three or fewer parameters load in two seconds or less
95% of all page loads take 1.5 seconds or less
Hopefully you’re noticing two commonalities. The first is that we never say 100%, because there’s too much entropy in complex systems to guarantee perfect performance, or at least the time needed to get there is generally not worth the tradeoffs required. Second, we’re being precise about the inputs. We’re defining conditions that we know are common, based on our usage analytics (such as the total number of keywords, the parameters in a given dashboard, or the styles of queries), so that we can define the experience as the user is likely to encounter it. We’re also setting a minimum bar for how fast those actions should be; faster is always welcome, but the surge to fix the issues can stand down once we hit these targets. Without such concrete targets, performance debugging can be endless, and it quickly hits diminishing returns for the business and for engineering morale.
In case you’re struggling to establish benchmarks, I’ve found that in consumer products pretty much all interactions need to take less than one or two seconds, and in emerging-technology use cases: <1 second feels snappy, <3 seconds feels solid, <6 seconds feels acceptable but not great, and <15 seconds is about the upper limit before the user disengages entirely. If you know you’ll be above these thresholds no matter how hard the team works, find a user-experience workaround (for example, “Email me when my dashboard is ready”).
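To make goals like the ones above checkable rather than aspirational, compute the percentile from your own user-experience logs, filtered to the stated conditions. A minimal Python sketch, with made-up sample numbers standing in for your real measurements:

```python
import math

def percentile(values, pct):
    """Return the value at or below which roughly pct% of the observations fall."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

# "90% of searches with five or fewer keywords return in one second or less"
# search_latencies is assumed to come from user-experience logging, already
# filtered to the conditions named in the goal.
search_latencies = [0.4, 0.7, 0.9, 1.1, 0.6, 0.8, 2.5, 0.5, 0.9, 0.7]
p90 = percentile(search_latencies, 90)
print(f"p90={p90:.2f}s  goal_met={p90 <= 1.0}")
```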
Additionally, isolate the conditions in your experience thresholds:
Which environment(s) does this need to work in?
On which data sets and at what scale must this work?
With what level of user scale? How many of the actions described will occur per user per unit of time?
What user or customer-specific parameters are expected to be invoked in these circumstances?
Repeat this for as many core parts of the product experience as are necessary to declare victory on these critical performance issues.
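One lightweight way to keep each goal and its conditions in a single, shareable place is a small record per scenario. The sketch below uses a Python dataclass with illustrative field names and values; adapt it to however your team actually tracks these:

```python
from dataclasses import dataclass

@dataclass
class ExperienceGoal:
    """One user-experience target plus the conditions it must hold under."""
    action: str               # the user-facing action being measured
    percentile: int           # e.g., 90 means "90% of these actions"
    threshold_seconds: float  # the time the percentile must come in under
    environment: str          # which environment(s) this must work in
    dataset_scale: str        # data sets and scale this must hold for
    concurrent_users: int     # expected user scale for these actions
    notes: str = ""           # customer-specific parameters, caveats, etc.

# Hypothetical example for the search goal above.
search_goal = ExperienceGoal(
    action="search with five or fewer keywords",
    percentile=90,
    threshold_seconds=1.0,
    environment="production, primary region",
    dataset_scale="accounts with up to 10M documents",
    concurrent_users=200,
    notes="must include accounts with custom permission rules",
)
```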
Create a dashboard for the chosen metric(s)
Every member of the team must have the ability to see the progress against the goal. If you have sufficient usage of your product, the dashboard itself will have enough data flowing through it to let you know how close you are to success and which approaches are or are not working. If not, you will need to build some load simulation tests at least for your core experience goals.
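If you do need to simulate load, even a crude script can stand in until real traffic exists. Here’s a minimal sketch using only the Python standard library; the URL, concurrency, and request counts are hypothetical stand-ins for your own experience goals:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SEARCH_URL = "https://staging.example.com/search?q=quarterly+report"  # hypothetical
CONCURRENT_USERS = 50
REQUESTS_PER_USER = 20

def one_user(_):
    """Issue a series of requests and record how long each took end to end."""
    latencies = []
    for _ in range(REQUESTS_PER_USER):
        start = time.monotonic()
        with urllib.request.urlopen(SEARCH_URL, timeout=30) as resp:
            resp.read()  # wait for the full payload, as a user would
        latencies.append(time.monotonic() - start)
    return latencies

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        per_user = list(pool.map(one_user, range(CONCURRENT_USERS)))
    all_latencies = sorted(t for user in per_user for t in user)
    p90 = all_latencies[int(len(all_latencies) * 0.9) - 1]  # 90th percentile
    print(f"requests={len(all_latencies)}  p90={p90:.2f}s")
```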
This dashboard step is critical; otherwise you end up with one person out in the field saying things are slow while everyone else back home logs into a different environment with, for example, low-scale data or less resource contention, and the two groups aren’t speaking the same language. When the ground truth is not aligned, the urgency and level of hard work necessary for success don’t materialize.
Generate a set of hypotheses
Once the experience goals are agreed upon, engineering needs to brainstorm a set of hypotheses for why the performance is the way it is. These need to be documented for everyone to see. This aligns engineering debugging effort and accounts for the fact that our first hypothesis is rarely correct.
It also protects against a mentality I’ve seen even among the best, where, once the current hypothesis is checked and disproven, the debugging stops for lack of fresh ideas about the source of the issue. When the current hypothesis is clearly articulated for all to see, teams can run one idea fully to ground, rule it out, and then move on to the next one, or even parallelize debugging when it’s a team effort.
During this phase, it also makes sense to identify that next tier of performance logging that will assist in the debugging. Based on the hypotheses, what would the team need to be able to measure to form a complete stack of knowledge, end to end, of how the problem manifests? This is where system- and API-level transparency play a massive and critical role.
The only reason to delay this deeper instrumentation until after the dashboard step is that it takes too long to instrument every portion of the backend systems up front (if that’s already done, kudos to you and your team). With that instrumentation in place, your team can trigger the conditions users have experienced that led to the slowness, and the engineering team can analyze and inspect the hotspots. Inevitably, this leads to more hypotheses, which should also get documented and assigned so the team can make systematic progress.
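As a sketch of what that next tier of logging can look like, here’s a minimal Python timing decorator. It’s a stand-in for whatever tracing tooling you actually adopt (OpenTelemetry, StatsD, and the like), and the decorated function names are hypothetical; the point is simply to make each backend step individually measurable so hypotheses can be confirmed or ruled out:

```python
import functools
import logging
import time

logger = logging.getLogger("perf")

def timed(span_name):
    """Log how long each call takes, tagged with a span name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                logger.info("span=%s duration_ms=%.1f", span_name, elapsed_ms)
        return wrapper
    return decorator

# Hypothetical usage on the steps a hypothesis points at:
@timed("permissions_check")
def check_permissions(user_id, resource_id):
    ...

@timed("search_query")
def run_search(query):
    ...
```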
Assign your best problem solver, not your best engineer
Frequently when we encounter challenges like these, we look to whoever built the piece of the system we think is the problem to identify the issues. While this occasionally pays off, the reality is that not all engineers are great at finding and tracing systemic, multi-cause issues to their sources. When you have people with both sets of talents, double their equity and increase their comp, but in all other circumstances, ask yourself: Who on the team has the best instincts for how the pieces of the system come together and doesn’t stop until they find root causes?
Frequently this is someone who doesn’t take anyone else’s ground truth for granted and views themself as something of a jack-of-all-trades but is also adept at wading through lots of code and understanding how parts make up the whole. This debugging captain will need help from team members who are more expert on certain aspects of the code base as they proceed, but it’s crucial that the engineer driving the investigation has the correct personality traits.
Set a daily sync on the issue
If it’s costing the business dearly, then it’s worth it to take 10 or 15 minutes each day, outside of other meetings, to check in together on the debugging progress. This is an engineering-led meeting, but it likely has participants from across product, engineering, support, customer success, and any other tactically relevant stakeholders. It’s not for the CEO, the customer, or anyone else who isn’t part of solving or communicating directly about the solution.
These are not rigid Agile standups. They are a time to review progress against the debugging, make requests of teammates, check alignment against the list of hypotheses being tested, or even brainstorm new options to check.
Don’t give in to “but there’s no repro”
If the metrics from step one make clear that there’s a serious problem in terms of magnitude (for example, painfully slow) or impact (slow for a critical client in the middle of expansion negotiations), then it’s entirely unhelpful to give in to the mantra of “I can’t do anything until this issue has a set of clear reproduction steps.”
It’s lazy for product, customer success, or QA to fail to provide repro steps when the conditions are well understood, but it’s equally lazy for engineering to bury its head in the sand when there is clear proof of an existential issue merely because the teams closest to users can’t explain the exact conditions that cause the problem.
Let me be clear—what I just said is a wildly unpopular view that will lead to product-engineering tension, but it’s tension the team must work through together, not ignore in situations of great crisis. To unlock better thinking at these impasses, I find it can be helpful to zoom out with engineering leaders or debugging captains: “What in our architecture could be contributing to this?” or “What are all the steps required to complete this action under the hood?” “Which ones are most susceptible to variation?”
Having this discussion while simultaneously providing all possible signal on user behavior can often lead to the missing epiphany. When all is lost, someone needs to figure out every single part of the architecture that is implicated from the logs and start reading code and examining hotspots under load until new hypotheses can be formulated.
Dig systematically until the root cause is uncovered
Many issues can be fixed without getting to root causes, but this is shortsighted. As a matter of discipline, insist that the debugging captain write down the root cause on the same document where the active hypotheses are tracked. If the team doesn’t know the root cause, someone must keep digging.
It’s a cliché, but the deeper you can go, the higher-leverage the fix can be. You won’t necessarily fix the root cause itself, but you need to understand how the issue will manifest again if you don’t, and be confident that those conditions are rare enough to accept.
For example, I once worked on a product where the number of security groups an enterprise account created affected search times. Most clients had a handful of security groups, or at most a few dozen. The marquee client experiencing the crippling instability had almost 100,000, and the next largest had about 1,000. In this context, it did not make sense to re-architect the system to accommodate 100,000 security groups, but we did need to come up with a fix further from the root cause that would work for that specific client’s circumstances.
Resist the temptation to refactor it all
An additional benefit to having a debugging captain who is not tied to code they wrote is that once the root cause is known, the team can have a thoughtful discussion about exactly how much needs to be fixed to deal with the problem and protect the product in the future.
Unfortunately, the same trait that leads great engineers to want to constantly improve the code they see is also the trait that causes root-cause fixes to go too far. We will never have perfect software systems at this phase of a company’s journey—or if we do, it will almost certainly come at the cost of not building product features that have value to our end customers.
The symptom of this failure mode is when a small fix would do but the team undertakes a massive refactor of a critical portion of the code base without a serious discussion with engineering and product leadership about whether it’s worth it. Sometimes it is!
However, I find that what frequently happens instead is that someone starts the rewrite, does not actually fix the bug right away, and then weeks or months go by while the issue persists, causing massive damage to both the business and the trust between product and engineering personnel. Meanwhile, the refactor breaks more critical pieces of the system because the company’s architecture and test automation haven’t evolved enough at this stage to support such deep changes without starting a chain reaction of bigger problems.
In short, be deliberate about how much needs to be fixed. My rule of thumb for venture-backed companies is to make sure the fix will carry the team through about one or two orders of magnitude of growth in usage, customers, or complexity, as this generally means the business is unblocked, the team can blitz-scale, and the issue can be “fixed” again when engineering and product talent is more plentiful after the next raise.
Product should be involved, a little
These tradeoff decisions are part of why product can’t be entirely hands-off. A lot of popular product practitioners will tell you that this post doesn’t even belong in a product-focused newsletter because fixing these issues is theoretically engineering’s job.
I agree that most of product’s time should not be spent on this, or else many other important aspects of the product job will suffer; however, product serves a few key roles here:
Identifying the problem and establishing what the user-experience performance thresholds need to be
Communicating with customers or with customer success about the progress in resolving the challenges
Weighing in on critical tradeoffs about how long to spend on this style of debugging, with regard to which risks are acceptable and therefore how deep a solution is needed
Being available to answer key questions about the benefits and pitfalls of different proposed solutions and how those will impact customers
I have rarely sought to spend significant time on performance debugging issues as a product leader or PM, and I don’t encourage product managers to wade into this zone of work if it’s not necessary. However, it’s hard to miss that the organizational and root-cause-thinking traits that lead to product success can be a true asset to engineering in these stressful all-hands-on-deck situations.
With this in mind, my recommendation is for product to be proactive and willing to dedicate at least a little time per day to supporting engineering in these crisis situations, as it helps everyone solve faster and get back to the value-generation work.