<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://educated-guess.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://educated-guess.com/" rel="alternate" type="text/html" /><updated>2024-06-12T14:14:13+00:00</updated><id>https://educated-guess.com/feed.xml</id><title type="html">Educated Guess</title><subtitle>Educated Guess Blog is a product of Educated Guess Limited and Priell Holdings Limited, 2024 (c)</subtitle><entry><title type="html">How to not run out of runway - the seed stage growth model</title><link href="https://educated-guess.com/growth/2020/08/18/how-to-not-run-out-of-runway.html" rel="alternate" type="text/html" title="How to not run out of runway - the seed stage growth model" /><published>2020-08-18T10:03:49+00:00</published><updated>2020-08-18T10:03:49+00:00</updated><id>https://educated-guess.com/growth/2020/08/18/how-to-not-run-out-of-runway</id><content type="html" xml:base="https://educated-guess.com/growth/2020/08/18/how-to-not-run-out-of-runway.html"><![CDATA[<p>There’s lots of startup advice floating around. I’m going to give you a very practical one that’s often missed — how to plan your early growth. The seed round is usually devoted to finding your product-market fit, meaning you start with no or little product and should end with a demonstrably retentive and growing population of users. It’s this growth stretch that I’ll tell you how to plan — and once you do that, you’ll have a clear timeline for your seed:</p>

<p><img src="/assets/seed-stages.png" alt="Stages of growth" />
<em>The three steps of a classic seed round: Search for PMF, Early growth, and fundraising.</em></p>

<p>While there are differences between companies, a Series A is broadly raised when you have early traction, evaluated through growth rate and ARR (or a potentially-monetizable userbase).</p>

<ul>
  <li>What’s a good userbase? One that could or is generating about $1M ARR.</li>
  <li>Why $1M ARR? Because your series A will be in the vicinity of $10M valuation, and x10 is generally the multiple used at this stage. There are many variations, but rarely order-of-magnitude ones. (2024 note: This post was written pre-pandemic)</li>
  <li>What’s a good growth rate? Paul Graham says <a href="http://www.paulgraham.com/growth.html">5%-7% w/w</a> and David Sacks says <a href="https://soundcloud.com/twistartups/twist-e1084#t=44:49">15%-20% per month</a> (which is slightly slower). You can easily find 100 other benchmarks for the numbers above on Twitter and VC blogs/podcasts — but unless you’re going into investing, there’s no use spending so much time on those; just use these numbers as a guide and then do your best.</li>
</ul>
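<p>As a quick sanity check on those benchmarks, 5% w/w compounds to roughly 24% per month, which is why 15%-20% per month is the slower of the two targets:</p>

```python
# Compounding check: convert the 5% week-over-week benchmark into an
# equivalent monthly growth rate (52 weeks / 12 months per year).
weekly = 0.05
monthly_equivalent = (1 + weekly) ** (52 / 12) - 1
print(f"{monthly_equivalent:.1%}")  # ~23.5%, above the 15%-20%/month benchmark
```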

<p>Taken together, all these numbers mean this is what your seed growth plan should look like: You raised enough capital for X months, say 18. Just before that money runs out, you’d better start raising your Series A with traction at around $1M ARR. Maybe you’d like a 3-month buffer for the raise and in case you miss your growth goals. With a growth rate of 5% w/w, how long does it take to reach $1M ARR? Should you start at month 12 after your seed? Or month 3? How long does that leave you to reach PMF? What if you missed the start of growth by a few months? How fast do you need to grow now to catch up? How many users does that mean you need to bring in every week?</p>

<p>Build an Excel file tracking these exponents. Fill in your target end date (3 months before you run out of cash, when you need to start raising your next round). Put in the starting number of users you will get from your first blast/launch (perhaps users on your waiting list or a major first push — like 100 or so). Then set your goal number of users and your predicted weekly churn rate. Figure out when growth will start (i.e. when you launch). What happens if you give yourself 4 months to grow? Is that a realistic growth rate, or will you need to hit rarely-seen &gt;10% w/w growth? How about 10 months? Do you have 10+3 months left in your runway, or should you start growth against weekly goals today?</p>
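<p>Here is a minimal sketch of that spreadsheet as a script (all the inputs are illustrative assumptions, not recommendations: launch size, goal, growth and churn rates):</p>

```python
# A minimal version of the seed growth spreadsheet described above.
# All inputs are illustrative assumptions.

def weeks_to_goal(start_users, goal_users, weekly_growth, weekly_churn=0.0):
    """Number of weeks of compounding until the user count reaches the goal."""
    net = 1 + weekly_growth - weekly_churn
    if net <= 1:
        raise ValueError("net weekly growth must be positive")
    users, weeks = float(start_users), 0
    while users < goal_users:
        users *= net
        weeks += 1
    return weeks

# Launch with ~100 users, aim for ~10,000 (roughly $1M ARR if a user is
# worth ~$100/year), at 5% w/w growth and 1% weekly churn:
print(weeks_to_goal(100, 10_000, 0.05, 0.01))  # well over two years
```

Plugging in your own launch size, churn and timeline immediately answers the "month 3 or month 12" question above.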

<p>You can create a table of exact week-by-week growth goals that you can just print and share with your team. You can record how many users you actually brought in each week, and whether you are ahead of or behind goal. Focusing just on this, as Paul Graham says, will guide you to success or to a better idea for your startup.</p>

<blockquote>
  <p>Focusing on hitting a growth rate reduces the otherwise bewilderingly multifarious problem of starting a startup to a single problem.</p>
</blockquote>

<p><em>— P. Graham</em></p>

<p>This model is the basic one, and you can go fancy and add more options — splitting out by organic and paid growth, by channel, or adding in varying revenue; You can easily model some low and capped virality (e.g, every person might lead to up to 3 more people with 15% probability).</p>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[There’s lots of startup advice floating around. I’m going to give you a very practical one that’s often missed — how to plan your early growth. The seed round is usually devoted to finding your product-market fit, meaning you start with no or little product and should end with a demonstrably retentive and growing population of users. It’s this growth stretch that I’ll tell you how to plan — and once you do that, you’ll have a clear timeline for your seed:]]></summary></entry><entry><title type="html">What I learned about fundraising during a pandemic</title><link href="https://educated-guess.com/growth/2020/05/26/what-i-learned-about-fundraising-during-the-pandemic.html" rel="alternate" type="text/html" title="What I learned about fundraising during a pandemic" /><published>2020-05-26T08:41:07+00:00</published><updated>2020-05-26T08:41:07+00:00</updated><id>https://educated-guess.com/growth/2020/05/26/what-i-learned-about-fundraising-during-the-pandemic</id><content type="html" xml:base="https://educated-guess.com/growth/2020/05/26/what-i-learned-about-fundraising-during-the-pandemic.html"><![CDATA[<p>Unlike most posts, I hope this one isn’t going to be useful to anyone within a few weeks - hopefully, we can all go back to normal. But for those of you out there raising right now, here’s a few bits from how the seed round for Cord went.</p>

<h2 id="ego-cover">Ego cover</h2>

<p>When the COVID-19 mess hit London, I told my co-founder Jackson - we either raise now or we have to wait this thing out. We weren’t planning on raising immediately: We wanted to spend a few months building some functionality. But waiting an unknown period of time seemed like too big a risk. “At the very least if this fails,” I said, “we could never tell if it was because we suck at fundraising or whether it was the adverse conditions. We’ll have an ‘ego insurance policy’.”</p>

<p>We ended up being very fortunate and the fundraising went better than expected, with the main downside being that we had to disappoint a lot of exceptionally talented and close supporters that we couldn’t fit into this round. But this isn’t about self-congratulation - the actual work is building the company and product. So I’d rather focus on what I learned instead:</p>

<ul>
  <li>The one advantage you have right now is that there are no face-to-face meetings, so there is no commute/wait in the lobby/order coffee overhead. I fit 7-8 pitches a day while spending 7 hours with our 2-year-old so that my wife could work. Normally you can barely fit 3-4 pitches into a full day of work when you have to schlep between VC offices and angel meetings in coffee shops. That means you can make things move incredibly fast and create the momentum that absolutely seems to matter to raising effectively.</li>
  <li>Pitch worldwide! Since you will do all the meetings including the final investment committee (IC) online anyway, you don’t need to plan your round around geographies - leverage that to meet as many VCs as you can. We met funds on both US coasts, throughout Europe, in London and Israel - just over 20 in two weeks, on top of angels from all of these geographies.</li>
  <li>On how to get these meetings - this applies in better times as well: We started out with a list of about 300 relevant funds and angels that I collected from various sources. We then figured out who could intro us to whom using LinkedIn, and reached out to everyone we could for warm intros at once.</li>
  <li>VCs will often set a 30m meeting on the calendar but have an hour to spend with you if you’re interesting: Don’t schedule the next VC meeting right after the last one. Also leave time for the occasional loo break!</li>
  <li>That focus on squeezing as many meetings as possible into as short a time as possible means no time for changing your deck, though. Leave a day a week or some time on the weekend to aggregate feedback and make changes.</li>
  <li>Technology is the only conduit for passing info to your investors right now. So be on your A game in terms of tooling. Invest in paid Zoom, don’t use Google Hangouts: Any browser-based video conferencing app means that when you share your screen to show your deck or demos it’ll cover your video tab and you won’t be able to see your audience’s faces and read their expressions. (This was actually impossible even in Zoom in the IC meetings with bigger firms that had more than a few people on the call…). Jackson got a lot of great responses about his neat home office setup. And of course - quiet room, be on time, good WiFi, and have a very easy fallback to a phone call at the ready.</li>
  <li>No one really needed to hear ‘why this is relevant for Coronavirus’. If you need that in your pitch I don’t think that’s a great sign. We don’t know how long it’ll last, and VCs play a very long game. Most wouldn’t want to feel like the pandemic was the reason your startup moved from ‘pass’ to ‘invest’ - so just don’t give them that reason in the first place.</li>
  <li>Use visuals to engage - you can’t see if they’re actually with you or on email while on the call. So share your screen and use that: Draw on it, show stuff, and keep referring to the stuff you’re showing. Your talking head can get quite boring, unlike in real life, and you need to ensure their attention doesn’t wander.</li>
  <li>This is one thing I’m hoping doesn’t change after the pandemic - there’s definitely a more effective process for raising.</li>
</ul>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[Unlike most posts, I hope this one isn’t going to be useful to anyone within a few weeks - hopefully, we can all go back to normal. But for those of you out there raising right now, here’s a few bits from how the seed round for Cord went.]]></summary></entry><entry><title type="html">Check these 4 things before you share your experiment’s results with your team</title><link href="https://educated-guess.com/growth/2020/03/02/check-these-4-things-before-you-share-your-experiments-results.html" rel="alternate" type="text/html" title="Check these 4 things before you share your experiment’s results with your team" /><published>2020-03-02T11:13:09+00:00</published><updated>2020-03-02T11:13:09+00:00</updated><id>https://educated-guess.com/growth/2020/03/02/check-these-4-things-before-you-share-your-experiments-results</id><content type="html" xml:base="https://educated-guess.com/growth/2020/03/02/check-these-4-things-before-you-share-your-experiments-results.html"><![CDATA[<p>I’ve made my share of analysis mistakes, some costing weeks of team members working on the wrong thing, or worse - missing a great opportunity. But through that I’ve collected a mental checklist of what are the common pitfalls in analysis, whether of experiments (A/B tests) or observational data (just looking at logs, like # of users or conversion rates).</p>

<p>If you test for these, or just caveat what you didn’t test for, you can be better prepared for executive questions and make sure the actions people take based on your analysis aren’t tainted by an incorrect conclusion. You can use Radical to leave notes next to the dashboard, notebook or chart about what you checked for, or involve your team in commenting, offering alternative hypotheses, etc.</p>

<p>The checklist:</p>

<h2 id="1-heterogeneous-treatment-effects">1. Heterogeneous treatment effects</h2>

<p>That mouthful basically means you see an effect that is an average of different effects for different populations of users. For example, maybe the new version of your website is much better at converting men than women: The conversion rate for males is 23% higher than it used to be, while it is slightly worse (-2%) for females. But if your website’s audience is made of 80% women, you will think the new version was a good idea: The average conversion rate is up 3%. In reality, most of your users like it slightly less, and it is only better for a small segment of users.</p>

<p><img src="/assets/5f0640956b681217a498f06a_Picture1.png" alt="Different strokes" /></p>

<p><em>Few users very happy + most users slightly sad might look like a good result</em></p>

<p>To avoid this, there is often no way but to segment your users by whatever data you have (e.g. demographics, geography, etc.) and look for major differences in effect size between groups. However, if you overdo this, you need to be careful of multiple comparisons (essentially, discovering non-existent effects by chance). Yeah, I know - 🤷 - no free lunch in Statistics! In some cases though, you can test for this more directly by plotting the distribution of effects: what % of users sees what effect on their habits. For instance, when measuring a repeat activity like “Average number of days the user visits your service in a month”, you can check whether an experiment shifted everyone up by 3% (Orange line) or whether there was an average effect of +3% but a wide distribution of actual effects, with some users hurt by the change (Purple). Then, try to figure out what differentiates these users.</p>
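<p>The averaging trap in the example above takes a few lines to reproduce (the numbers are the ones from the example: +23% for men, -2% for women, with women 80% of the audience):</p>

```python
# Per-segment effects and population shares from the example above.
effects = {"men": 0.23, "women": -0.02}
shares = {"men": 0.20, "women": 0.80}

# Weighted average: the single blended number a naive analysis reports.
average_effect = sum(effects[s] * shares[s] for s in effects)
print(f"{average_effect:+.0%}")  # +3%, even though most users are worse off
```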

<p><img src="/assets/5f0640a53488bbcc02f92ba1_Picture2.png" alt="Variance of effects" /></p>

<p><em>Orange experiment had the same effect for everyone; Purple experiment had a range of effects</em></p>

<h2 id="2-mix-shifts">2. Mix shifts</h2>

<p>A related issue I check for comes up when plotting data over time. One of the worst errors is an incorrect trend: If the data is “directionally” correct but wrong by a constant (so, you think something’s up when it’s up, just not as up as you think), that’s usually okay for most business decisions. But when you think something is down when it’s up, you might turn off a successful activity or lean into a harmful one! Mix shifts occur when there is both a heterogeneous effect (so two or more groups have different outcomes, e.g. retention rates) and, over time, one group is growing as a proportion of the total. This creates a false trend over time coming from the shift in the populations rather than a shift in behaviour. In the example above, if your website started churning its women users (who’ve had a worse experience after the change you made), you’d think the overall conversion rate is improving — when really it’s just that the remaining men convert better, not an overall improvement in the success of the business.</p>
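<p>A tiny simulation makes the false trend concrete (the rates and mixes below are made up for illustration): each group converts at a constant rate, yet the blended rate climbs purely because the mix shifts.</p>

```python
# Mix shift: per-group conversion is flat, but the aggregate "improves"
# as the F share shrinks period over period. Illustrative numbers only.
conv = {"M": 0.25, "F": 0.10}
mix_over_time = [
    {"M": 0.2, "F": 0.8},
    {"M": 0.4, "F": 0.6},
    {"M": 0.6, "F": 0.4},
]
aggregate = [sum(conv[g] * mix[g] for g in conv) for mix in mix_over_time]
print(aggregate)  # rises every period although no group actually changed
```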

<p><img src="/assets/5f0640b799c4b9e170a3ab9f_Picture3.png" alt="Mix shift" /></p>

<p><em>A change in populations from F to M might make the same experiment look like it’s improving</em></p>

<h2 id="3-checking-exposure">3. Checking exposure</h2>

<p>A very subtle topic but a reality of online experimentation is that sometimes we separate out when we log that a person was randomized into an experimental group from when their actual experience changes. Here’s an example: On my e-commerce website, I might have an experiment where half the users who log in will receive a 10% discount upon checkout, and half won’t. Naively, I may divide the user population into the two experimental groups, A and B, the minute they enter the website, and log this in our database. Then, I run an analysis of which group, A or B, had a higher checkout rate, and compute the power of the experiment - essentially making sure I had enough users go through the experiment to draw a statistically significant conclusion about its effect. Right?</p>

<p>Well, almost. The thing is, I logged which experimental group users were in when they first visited the website. But a large portion of them (say, 90%) don’t even end up buying anything. So they never actually see the coupon — they have the exact same experience as the control group that didn’t get a coupon at all. As a result, of course, their buying behaviour doesn’t change. But the math behind computing the power of the experiment is misleading here: We’re diluting the effect of the experiment with a lot of people who wouldn’t buy regardless of the coupon, so we end up thinking the experiment was ineffective, or we fail to reach our desired level of statistical significance, and ditch it. This is called ‘overexposure’ - you’re registering people into the experiment who didn’t actually get the treatment.</p>

<p><img src="/assets/5f0640cd7e08428e13365944_Picture4.png" alt="Over-exposure and under-exposure" /></p>

<p><em>Over-exposure and under-exposure both make experiments look not stat. sig. when they are</em></p>

<p>In the rarer, converse case, we might be exposing the user before logging that they were in an experimental group - this could happen in complex setups where, e.g., we only logged the exposure when they actually checked out, not when they first saw the coupon in their cart - and as a result we underexpose and also dilute the effect of our experiment (or worse, draw the wrong conclusion if the exposure is correlated with some behaviour).</p>

<p>The bottom line is to always be clear on where the user first encounters any possible effect of the experiment, and to log their inclusion in an experimental group (or control) only then.</p>
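<p>The dilution is easy to quantify. Under the simplifying assumption that unexposed users behave exactly like control, the lift measured over everyone logged at the door is the true lift scaled down by the exposure rate (all numbers below are illustrative):</p>

```python
# Over-exposure dilution: only 10% of users logged into the experiment
# ever reach the coupon, so the measured lift shrinks 10x.
true_lift = 0.05       # assumed lift in checkout rate among exposed users
exposure_rate = 0.10   # fraction of logged users who actually saw the coupon
base_rate = 0.08       # checkout rate without the coupon

# Blended rate over the whole "treated" group, exposed or not.
treated_rate = base_rate + exposure_rate * true_lift
measured_lift = treated_rate - base_rate
print(measured_lift)   # 10x smaller than the lift among exposed users
```

Detecting a 10x smaller lift at the same significance level needs roughly 100x the sample, which is why the diluted experiment looks ineffective.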

<h2 id="4-multiple-comparison">4. Multiple comparison</h2>

<p>Statistical power is a complicated topic, and it’s well known that we can incorrectly discover an effect (as Randall Munroe of XKCD so well put it in an experiment testing for the link between different coloured jelly beans and acne). But we can also miss an effect that’s there because our experiment is underpowered (so, for example, it’s extremely hard to discover that something you did improved conversion rates by 0.1%, because that requires tens of thousands of users). And while a 0.1% improvement is not in general critical to the success of the business and we shouldn’t focus on it, 0.1% degradations happen all the time, and are hard to catch. Here’s an example. Imagine we’re on a growth team running various experiments on the experience in our app, and many of these happen to have great outcomes on our daily active users.</p>

<p>But, we’re not silly - we also check for effects on long term retention. If an experiment had a statistically significant negative effect on retention, we would reject it. However, long term retention is a noisy metric, with fewer users and less power to measure it on, and the effects are often subtler. They often pan out slightly negative or slightly positive on average, but with confidence intervals that mean they are statistically insignificant: We can’t be certain they’re negative. What happens, though, is that many of them might really be negative. We might be unknowingly accruing small, negative effects on retention. Hence, ‘death by a thousand (statistically insignificant) cuts’ (coined by a fellow Facebook analytics genius).</p>

<p><img src="/assets/5f0640fb3424bf3aaca0ee84_Picture5.png" alt="~0+~0+~0+~0 != 0" /></p>

<p><em>In each experiment separately, the effect to performance is slightly negative but not stat. sig. Taken together though, this is clearly harmful</em></p>

<p>The main way to protect against this is with a long term holdout. A long-term holdout is a set of users who don’t receive any new experiments even as these are rolled out to the rest of the user population. You can then measure the effect of all the other experiments taken together on the long-term holdout and see whether the other users’ experience diverged significantly from the holdout — essentially the holdout is a control for the effect of many experiments together.</p>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[I’ve made my share of analysis mistakes, some costing weeks of team members working on the wrong thing, or worse - missing a great opportunity. But through that I’ve collected a mental checklist of what are the common pitfalls in analysis, whether of experiments (A/B tests) or observational data (just looking at logs, like # of users or conversion rates).]]></summary></entry><entry><title type="html">‘Additive’ and ‘Substractive’ measures, or how to goal for stuff that shouldn’t happen</title><link href="https://educated-guess.com/growth/2019/10/23/additive-or-substractive-measures-goaling-on-stuff-that-never-happens.html" rel="alternate" type="text/html" title="‘Additive’ and ‘Substractive’ measures, or how to goal for stuff that shouldn’t happen" /><published>2019-10-23T18:01:17+00:00</published><updated>2019-10-23T18:01:17+00:00</updated><id>https://educated-guess.com/growth/2019/10/23/additive-or-substractive-measures-goaling-on-stuff-that-never-happens</id><content type="html" xml:base="https://educated-guess.com/growth/2019/10/23/additive-or-substractive-measures-goaling-on-stuff-that-never-happens.html"><![CDATA[<p>In a data-driven organisation, most of the stuff we measure is about adding up events. Revenue is made of sales events, active user counts are 
made up of adding (unique) visitation events, and marketing results are about adding up how many people saw everything we did, or clicked on it [1]. Usually, these events are so frequent the measurement becomes ‘smooth’ - it’s a ‘line’ we can forecast to predict next year’s goal. We can talk about the 0.4% increase caused by a change we made in the product or operation.</p>

<p>But this isn’t the only way we make a company succeed. All companies also deal with preventing negative events. Legal, PR teams, security, site 
reliability and devops teams explicitly deal with risk reduction as part of their responsibilities. Even teams that aren’t used to thinking in 
risk reduction terms are also responsible for not screwing things up - engineering, client support teams and operational teams that need to 
prevent downtime, churn and business mishaps.</p>

<p>How do you measure “risk reduced”? How do you value work that’s meant to prevent bad things from happening? Is this even something that we 
should goal on?</p>

<p>Yes, we should absolutely goal on everything that’s crucial to the success of the business. Goals guide prioritization and add clarity. 
Teams whose main prerogative in a given half is to reduce risk should goal on it. But there’s a method for doing it. When setting goals on 
these ‘subtractive’ metrics (meaning, we want to reduce bad stuff to 0, rather than increasing, e.g, revenue to $100M), we need to recognize 
a few things:</p>

<p><strong>Sparsity inhibits feedback:</strong> Most of these are hard because they involve rare events. If your goal was to reduce, for instance, the number of 
meetings taken at a large company, you’d have no problems measuring it and targeting a reduction of 10% this half, because with thousands of 
meetings a day you can track your progress every day, identify which meetings involve the most people or are most frequent and justify tackling 
these to meet the goal. But it’s hard to effectively goal on preventing something that never happened or only happened once from happening 
again - there’s no ongoing feedback to guide you if you’re on the right track.</p>

<p><strong>No causation:</strong> They are also hard because they are unexpected. Additive metrics are usually the ones where a proactive action we take 
(e.g run an ad campaign) creates the outcome we measure. What I term ‘subtractive’ metrics are usually reactive (prevent churn, prevent 
lawsuits, …). We don’t take the action, we just build the forts - it’s even harder to track progress when, once the fort is up, the 
enemies never attack because they see the fort! Since we never see the counterfactual, the forts always look like an unjustified expense [2].</p>

<p><strong>No place for the meter:</strong> Additive goals are usually defined in terms of an important outcome that results from activity ‘funneling’ into 
some place where we set up measurement. For example, measuring revenue is looking at the finance department’s spreadsheet, where all the business activity ended up. User counts are webserver hits triggering code flows that end in a DB. There’s a path where actions trigger a measurement. An instructive exception is perhaps sales in an enterprise business - all sales people need to be disciplined (and incentivized) to manually key in which clients are at what step of the funnel to enable tracking. (Even that’s being more and more automated by CRM software.) Subtractive measures, however, are usually defined broadly: We want to prevent any incident, no matter who or what its cause is, so they are hard to measure automatically because they can happen “anywhere”.</p>

<p>The solution to measuring, and setting goals on, ‘rare, unexpected, bad outcomes’, then, is tackling these three challenges.</p>

<p>First, we need to make what we measure a more common aspect of the problem we want to solve. Instead of tracking just bad events when they occur, 
we track near misses. These are events which could’ve been disastrous but were caught somewhere along the way. They may have been caught by a 
system or process we already put in place to prevent them - demonstrating that it’s working. Or, they were caught before causing damage by 
chance, and could inform us of a potential hole. These near-misses are more frequent than full on crashes, so they give us a denser time 
series, with more information about whether preventative measures are worth the investment, and which ones to set up.</p>

<p>Next, we should measure proxies of the disaster we’re trying to prevent, which history and intuition tell us are correlated with it. 
Essentially, we’re looking for the build-up of water before the dam collapses. Facebook sends you fewer notifications if you stop opening them. This signals you’re not interested and the app should throttle notifications - before you turn them off altogether,
a practically irreversible action. Most B2Bs track engagement with the product as an early indicator of churn, and good devops teams 
‘canary’ a new release to risk very few servers and users before rolling it out to 100%. Answering both (1) and (2), we also broaden the 
definition as much as reasonable to encompass smaller mishaps as much as the large ones. A spate of unsophisticated phishing e-mails may 
indicate we’re now a target for an imminent big cybersecurity attack.</p>

<p>Finally, addressing point (3) has unfortunately no other solution I know of than accountability. In great organisations, everyone has the 
discipline to help measure these incidents (especially with the harder to notice ‘near misses’). This only works when people are 
incentivised to do so and when major figures in the company are deeply invested in driving this process. That’s why we should never 
set goals on having some (low) X number of incidents - that would directly go against the incentives of measuring as many as you can. 
Instead, we focus on reducing the Time-To-Detect (TTD), Time-To-Mitigate (TTM) and % of incident followup actions executed, believing 
these together with a strong DERP (Detection, Escalation, Remediation, Prevention) process [3] take care of improving the incident rate 
itself. The number of incidents (as well as the false alarm rate for detection systems) would both still be measured, of course, but 
they would not be a goal [4].</p>

<p>To measure, goal and motivate work on preventative tooling in a data-driven, prioritized and cost-effective way you don’t need much 
more than a spreadsheet to (yes, manually) track the incidents and near-misses or proxy events that happened, their detection and 
resolution times - but you do need to start with a process and a commitment to follow it through.</p>

<p>[1] This was well put as “all of data science is just counting and normalizing - the hard part is figuring out what the denominator should be”.</p>

<p>[2] Black Swan by Nassim Taleb opens with a great hypothetical about an unsung heroic FAA agent who, after campaigning for years, manages to get a regulation enacted requiring all cockpit cabins to be locked with a keypad, starting Sep 10th, 2001…</p>

<p>[3] At Facebook, the incident review process was exactly the same across dozens of offices and thousands of teams, orchestrated, audited,
updated and motivated by the global VP of Engineering, Jay Parikh.</p>

<p>[4] They’re impossible to goal on anyway, given the business creates new processes, products and systems so rapidly and these grow in their 
own varying rates.</p>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[In a data-driven organisation, most of the stuff we measure is about adding up events. Revenue is made of sales events, active user counts are made up of adding (unique) visitation events, and marketing results are about adding up how many people saw everything we did, or clicked on it [1]. Usually, these events are so frequent the measurement becomes ‘smooth’ - it’s a ‘line’ we can forecast to predict next year’s goal. We can talk about the 0.4% increase caused by a change we made in the product or operation.]]></summary></entry><entry><title type="html">Accuracy of small-sample interpolated empirical distribution functions: a simulation</title><link href="https://educated-guess.com/stats/2019/09/11/accuracy-of-small-sample-interpolated-empirical-distribution-functions-a-simulation.html" rel="alternate" type="text/html" title="Accuracy of small-sample interpolated empirical distribution functions: a simulation" /><published>2019-09-11T10:24:41+00:00</published><updated>2019-09-11T10:24:41+00:00</updated><id>https://educated-guess.com/stats/2019/09/11/accuracy-of-small-sample-interpolated-empirical-distribution-functions-a-simulation</id><content type="html" xml:base="https://educated-guess.com/stats/2019/09/11/accuracy-of-small-sample-interpolated-empirical-distribution-functions-a-simulation.html"><![CDATA[<p>When you have a few data points, and no model of the distribution they are drawn from, you can still use what you have to estimate how that distribution behaves. For example, you may wonder what the 95th percentile of that distribution is.</p>

<p>There are several approaches for going from the data you have to a distribution that (hopefully) approximates the true distribution they were drawn from well:</p>

<ul>
  <li>Using some parametric model (i.e. assuming they are drawn from some known distribution like Normal, Gamma, mixtures of these, …) and a method of estimating the model parameters from the data (MLE, etc.)</li>
  <li>Using some non-parametric estimation procedure [1]. The method I analyze here is the one that linearly interpolates between your data points, which is the default when you use numpy.percentile() in Python or quantile(type=4) in R [2].</li>
</ul>

<h2 id="what-does-it-look-like">What does it look like?</h2>

<p>Say you only have two samples: You intercepted two spies and their ages were 19 and 33. You’d like to know what the probability of a spy being 28 years old is, or what the 95th percentile of spy ages is, but you only have these two spies. The classical empirical CDF from these two would assume:</p>

<p>(a) There is 0% chance of a spy being under 19 years old.</p>

<p>(b) There is 0% chance of a spy being over 33 years old.</p>

<p>(c) The distribution has two point masses at 19 and 33, so all percentiles between 0 and 50 are 19 and all percentiles between 50-100 are 33, and there’s essentially zero probability of 
having a spy aged between 20 and 30 for instance.</p>

<p>This sounds like an unnatural assumption, and indeed it’s odd to use an empirical CDF with just two points in practice. Empirical CDFs are really useful in proofs where there are $latex n$ samples and $latex n \rightarrow \infty$: they converge to the true distribution, they cleanly handle discrete distributions and discontinuities, they converge fast, and in many cases it’s easy to show how the rate of convergence behaves (asymptotics).</p>

<p>But in scenarios where you actually want to use a few data points to generate some other information about the likely distribution, a simple yet practical assumption is using a linearly 
interpolated CDF. This would assume:</p>

<p>(a) There is 0% chance of a spy being under 19 years old.</p>

<p>(b) There is 0% chance a spy is over 33 years old.</p>

<p>(c) All percentiles between 0% and 100% map to ages distributed uniformly between 19 and 33 - i.e., the 50th percentile is a spy being 26 years old, and the interquartile range (25th-75th percentiles) is that a spy is between 22.5 and 29.5 years old.</p>
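<p>You can check the two-spy numbers above directly in numpy, whose default percentile method is exactly this linear interpolation between the samples:</p>

```python
import numpy as np

ages = [19, 33]  # the two intercepted spies
# numpy's default ('linear') method interpolates between the samples,
# treating the smallest as the 0th percentile and the largest as the 100th
print(np.percentile(ages, [25, 50, 75]))  # → [22.5 26.  29.5]
```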

<p>If you had 4 points, you essentially assume the first is the 0th percentile of the distribution, the second is the 33rd percentile, the third is the 66th percentile and the last is the 
100th (maximum value); The rest of the percentiles are interpolated in a piecewise-linear manner between each two points. For example, this chart shows a random, uniform draw of 4 points 
from $latex [0,1]$, the resulting interpolated CDF (in red), the original CDF of the Uniform[0,1] distribution (blue), and the resulting error in estimating each of the percentiles.</p>

<p><img src="/assets/unif_draw.png" alt="Uniform draw" /></p>

<p>My random draw of four points from $latex [0,1]$ was [0.98980661, 0.02112006, 0.5617029, 0.60289351]. As it happened to have two points quite close to the ends of the range $latex [0,1]$ (.989 and .021), it approximates most of the range pretty well, with the maximal error being around percentile 33: because the interpolated distribution assumes the 33rd percentile is .561 while in truth it is .33, the maximal error is around $latex .561-.33=.23$.</p>
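<p>For reproducibility, here is how that error curve can be recomputed with numpy (the percentile grid and variable names are my own, not taken from the original chart code):</p>

```python
import numpy as np

draw = np.array([0.98980661, 0.02112006, 0.5617029, 0.60289351])
ps = np.arange(0, 101)              # percentiles 0..100
interp_q = np.percentile(draw, ps)  # interpolated empirical quantiles
true_q = ps / 100                   # Uniform[0,1] quantile function is the identity
errors = np.abs(interp_q - true_q)
# the worst percentile and its error - around the 33rd, off by ~0.23
print(ps[errors.argmax()], errors.max())
```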

<p>You now understand what small-sample, linearly interpolated empirical CDFs are! Let’s assess their accuracy, using simulations.</p>

<h2 id="accuracy-of-small-sample-interpolated-ecdfs">Accuracy of small-sample interpolated ECDFs</h2>

<p>This leads us to estimating the error. In another draw I would’ve had a $latex \frac{1}{2^3} = \frac{1}{8}$ chance of having all 4 points randomly generated from the distribution fall under 0.5, or all 4 of them fall over 0.5. In those cases, the maximal error would be more than 0.5 (e.g., if all four are under 0.5, my 100th percentile would be less than 0.5, whereas the real 100th percentile of Uniform[0,1] is 1). What should we use as the “error” function summarising how far an interpolated ECDF built from a draw of $latex N$ samples is from some distribution? And how is that error distributed (i.e., how likely are we, when drawing $latex N$ samples, to get an error of X? How does that change with $latex N$, or with the (unknown) distribution we’ve drawn from?). We’re going to answer all of these questions next.</p>

<p>There are, as always, many options for summarizing the error, but I focus on two natural ones:</p>

<p>(Approximate) $latex L_{\infty}$: This asks what the maximal error is: what’s the percentile of the original distribution that our distribution is furthest from. The problem with calculating this in a simulation is that many distributions are unbounded - i.e., there’s some chance they’ll generate arbitrarily high (or low negative) numbers. However, our interpolated ECDF always has a clear minimum and maximum value: the lowest and highest sample - these are its 0th and 100th percentiles, whereas for an unbounded distribution one or both of these will be infinity (or minus infinity), so the $latex L_{\infty}$ as I define it would be infinite. So instead, I just measure, for bins of 1% width, the maximal distance between the value of the original CDF for that percentile and our simulated interpolated one - essentially the ‘maximum’ value of the yellow line in the chart above. This is similar to the Kolmogorov-Smirnov statistic for ECDFs, but is easier for me to calculate for this interpolated CDF in practice.</p>

<p>(Approximate) $latex L_{1}$: This averages the distances between the CDF and the approximate CDF. It captures more information about the range of errors, at the loss of info about how great the error is maximally.</p>
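<p>Both summaries can be approximated in a few lines of numpy; <code>linf_l1</code> and its percentile grid are my own naming and choices, sketching the definitions above rather than reproducing the notebook:</p>

```python
import numpy as np

def linf_l1(draw, true_quantile, ps=np.arange(1, 100)):
    """Approximate L-infinity and L1 error between the interpolated ECDF of
    `draw` and a known distribution, over 1%-wide percentile bins (1..99)."""
    err = np.abs(np.percentile(draw, ps) - true_quantile(ps / 100))
    return err.max(), err.mean()

# against Uniform[0,1], whose quantile function is the identity:
rng = np.random.default_rng(1)
linf, l1 = linf_l1(rng.uniform(size=4), lambda p: p)
```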

<p><img src="/assets/normal_draw.png" alt="Normal draw" /></p>

<p>Two draws from the Standard Normal distribution gave me [~-2.5, ~0.5]. The $latex L_{\infty}$ here is attained at the 99th percentile, where the difference is almost 2. The $latex L_{1}$ is ~1, because on average the error is close to 1.</p>

<h2 id="simulations">Simulations</h2>

<p>The theory of empirical CDFs promises some pretty quick convergence in many cases for the original (not interpolated) method. But how does it behave in practice in low-sample scenarios? Below are charts with the mean $latex L_{\infty}$ (blue lines), mean $latex L_{1}$ (red lines) and standard deviation of the $latex L_{\infty}$ (grey line) for simulations drawing 2-30 samples from the Normal, Exponential(1), Poisson(1), Beta(2,2), Gamma(2,2) and Uniform(0,1) distributions. <a href="https://github.com/Nimster/statistics/blob/master/Accuracy_analysis.ipynb">The notebook that does this is here</a>, so you can easily generate ones for your own distribution.</p>

<p><img src="/assets/all_draw.png" alt="Many distributions" /></p>
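<p>As a numpy-only sketch of one panel of this simulation (Uniform[0,1]; the function name, repetition count and seed are my own choices - see the linked notebook for the full version):</p>

```python
import numpy as np

def mean_linf_uniform(n, reps=2000, seed=0):
    """Mean approximate L-infinity error (percentiles 1-99, 1% bins) of
    interpolated ECDFs built from n Uniform[0,1] samples."""
    rng = np.random.default_rng(seed)
    ps = np.arange(1, 100)
    true = ps / 100  # Uniform[0,1] quantile function
    errs = [np.abs(np.percentile(rng.uniform(size=n), ps) - true).max()
            for _ in range(reps)]
    return float(np.mean(errs))

# the mean maximal error shrinks as the sample grows
print(mean_linf_uniform(2), mean_linf_uniform(30))
```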

<p>For practical cases, I also recommend, if you have e.g. 30 samples, to repeatedly pick a random subset of say 5 of them and see how well they approximate the CDF you get with all 30 points, to give you a sense of the scale of the error (and perhaps fit that line to project from 30 onwards). Regardless, some rules of thumb emerge: the maximal error in estimating percentiles 1-99 drops from an average of 2-3 times the variance of the distribution with 2 samples to ~1 times the variance, or just under that, with 30 samples.</p>

<h2 id="footnotes">Footnotes</h2>

<p>
[1] For instance, the Empirical CDF is the standard way to treat this mathematically, and Kernel Density Estimation is a better, but more complicated method that also requires some tuning.</p>]]></content><author><name></name></author><category term="stats" /><summary type="html"><![CDATA[Accuracy of small-sample interpolated empirical distribution functions: a simulation When you have a few data points, and no model of the distribution they are drawn from, you can still use what you have to estimate how that distribution behaves. For example, you may wonder what’s a 95th percentile of that distribution is.]]></summary></entry><entry><title type="html">Understand, Identify, Execute</title><link href="https://educated-guess.com/growth/2019/08/27/understand-identify-execute.html" rel="alternate" type="text/html" title="Understand, Identify, Execute" /><published>2019-08-27T20:02:39+00:00</published><updated>2019-08-27T20:02:39+00:00</updated><id>https://educated-guess.com/growth/2019/08/27/understand-identify-execute</id><content type="html" xml:base="https://educated-guess.com/growth/2019/08/27/understand-identify-execute.html"><![CDATA[<p>The three words in the title are on posters all across Facebook’s campus. They’re a framework for thinking in stages about product development, and around choosing KPIs.</p>

<p><img src="/assets/uie.png" alt="Understand, Identify, Execute poster" /></p>

<p><em>Understand, Identify, Execute poster</em></p>

<p>The toy example used to illustrate the different steps was imagining you had a prehistoric family to take care of. Understanding means choosing the right high-level objectives. In the prehistoric family, ‘survive’ is the most important one, and ‘get food’ is a key driver for that. A failure in understanding is if you had instead thought ‘having fun’ is the most important bit and drove your family to extinction by successfully chasing that strategy.</p>

<p>Identify is about finding the right levers that push us the furthest forward in the chosen direction. Maybe you’ve identified that to get food you can fish in the small pond right across from your family’s cave. But maybe you’re figuratively (and literally) fishing in the wrong pond. Maybe a bit more searching and thinking will lead you to see there’s a grove just around the pond with plenty of wildlife you can hunt more easily, supplying more food with every catch. A failure in identification means not attacking the strategy with the most effective tactic.</p>

<p>Failed identification might cause us to work really hard and achieve little, or worse - use the wrong KPI, making us think we progressed a lot while actually creating a Goodhart’s-law effect where we optimize into a bad outcome - like spamming clients to the point that they ignore our e-mails. Identification is hard work!</p>

<p>Execution is obvious - it’s whether you effectively actuate the desired action. Failure in execution means the team fumbled or took too long.</p>

<p>This framework helped us check we went through all stages in developing a product, which we would do iteratively. It’s legitimate to take weeks off of some of the team’s work (sometimes the entire team, for a new product) to spend on ‘understand work’ that doesn’t ship anything impactful but just maps out the area, the needs of target audiences and the options to address them. Facebook re-‘understood’ Newsfeed several times (leading to changes in the core metric from time spent, to weighted user feedback - a measure of actively engaging with content, e.g. shares, likes and comments - to ‘meaningful connections’ [1]), and each time identified the levers to move these metrics the most (show videos→time spent; show clickbaity content→WUF; show specific types of friend and group content→meaningful connections), which overall grew the company. Sometimes this may have led to over-optimization in one area [2], but it’s overall better than not having moved, or than having moved at random or immeasurably (i.e., at the very least we hit the targets we set ourselves, and increased Facebook’s overall growth, retention and revenue by monitoring these while driven by each of these focuses).</p>

<p>[2] Some critics see this, often simplistically, as the root of some of Facebook’s issues or as more than a tradeoff that has to be made when you rank items. I don’t subscribe to that view, nor is it the purpose of this blog to discuss Facebook’s merits and faults - but rather to stay at the level of a very useful product development methodology</p>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[The three words in the title are on posters all across Facebook’s campus. They’re a framework for thinking in stages about product development, and around choosing KPIs.]]></summary></entry><entry><title type="html">The false dichotomy of Quant-Qual analysis and the real tradeoff of data-driven product development</title><link href="https://educated-guess.com/growth/2019/07/17/the-false-dichotomy-of-qual-quant-analysis.html" rel="alternate" type="text/html" title="The false dichotomy of Quant-Qual analysis and the real tradeoff of data-driven product development" /><published>2019-07-17T14:52:40+00:00</published><updated>2019-07-17T14:52:40+00:00</updated><id>https://educated-guess.com/growth/2019/07/17/the-false-dichotomy-of-qual-quant-analysis</id><content type="html" xml:base="https://educated-guess.com/growth/2019/07/17/the-false-dichotomy-of-qual-quant-analysis.html"><![CDATA[<p>Some people intuitively assume (perhaps inspired by the ‘rational-humanist’ dichotomy) that there’s a spectrum with taking data-driven decisions on one hand, and ‘qualitative’/intuitive/user-feedback-driven insights on the other. That’s incorrect. Data-driven product development doesn’t trade off qualitative insights: The best teams I’ve seen (e.g, at Facebook) are integrating both. Qualitative, case-by-case feedback from user interviews provides the observations around which we build hypotheses. We then test these with the data. 
Using only data will leave us blind to most phenomena, and using only qualitative insights will drive us randomly, without a sense of the magnitude - hence priority - of each opportunity. Many of my best optimizations in growth came from observing users use the product - then measuring the scale of a problem they encounter. First Round just posted a great discussion of the value of qualitative research.</p>

<p>There is one key trade-off to being data driven, however. That trade-off is speed. It both directly costs resources to invest in measuring the state and the effect of almost everything you do, and it slows you down indirectly because data-driven decision making takes time: You need to gather evidence from queries (DB, surveys, experiments, etc.), and then process it with decision-makers - that’s a lot of work. This is why we also want to be wise and measured about what to measure.</p>

<blockquote>
  <p>“Not everything you can count counts, and not everything that counts is countable”.</p>
</blockquote>

<p>Some concrete examples:</p>

<p>A) ‘Fail fast’ in experiments: stop experiments once they show either:</p>

<p>i. There’s a better outcome - even before we have narrow confidence intervals around how much better it is [1].</p>

<p>ii. The confidence intervals are narrow enough to show us the magnitude of the effect is not worthwhile pursuing. If an experiment happens to show we can move time spent between minus one second to adding 3 seconds per user per week, who cares?</p>

<p>B) Don’t try to measure every possible metric, or always the most correct version of a metric, or even the outcome of every decision. Arbitrary precision requires arbitrary time: avoid analysis paralysis and focus on the few decisions that matter. Jeff Bezos says it well here: there are only a few decisions that matter, for which you want to invest your time in deciding correctly.</p>
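<p>To make (i) and (ii) concrete, here’s a sketch of such a fail-fast check for a conversion experiment (the function name, the normal-approximation confidence interval and the <code>min_effect</code> threshold are my illustrative choices; a production rule would also correct for sequential peeking):</p>

```python
import math

def fail_fast(conv_a, n_a, conv_b, n_b, min_effect, z=1.96):
    """Return 'stop' (effect too small to matter), 'ship' (confidently
    better), or 'keep running', from conversions in control (a) / test (b)."""
    pa, pb = conv_a / n_a, conv_b / n_b
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    lo, hi = (pb - pa) - z * se, (pb - pa) + z * se
    if hi < min_effect:  # (ii) even the CI's upper bound isn't worth pursuing
        return "stop"
    if lo > 0:           # (i) better than control, even if the CI is still wide
        return "ship"
    return "keep running"

print(fail_fast(650, 1000, 740, 1000, min_effect=0.01))    # ship
print(fail_fast(500, 10000, 505, 10000, min_effect=0.02))  # stop
```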

<p>[1] This is the opposite of the practice researchers do for a scientific journal publication (except under the minimization of harm doctrine in some human/animal-subject studies)</p>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[Some people intuitively assume (perhaps inspired by the ‘rational-humanist’ dichotomy) that there’s a spectrum with taking data-driven decisions on one hand, and ‘qualitative’/intuitive/user-feedback-driven insights on the other. That’s incorrect. Data-driven product development doesn’t trade off qualitative insights: The best teams I’ve seen (e.g, at Facebook) are integrating both. Qualitative, case-by-case feedback from user interviews provides the observations around which we build hypotheses. We then test these with the data. Using only data will leave us blind to most phenomena, and using just qualitative insights will drive us randomly without a sense of magnitude, hence priority, of the opportunity. Many of my best optimizations in growth came from observing users use the product - then measuring the scale of a problem they encounter. 
First round just posted a great discussion of the value of qualitative research.]]></summary></entry><entry><title type="html">Why your experiment’s impact is probably greater than you think</title><link href="https://educated-guess.com/growth/2019/07/17/why-your-experiments-impact-is-greater-than-you-think.html" rel="alternate" type="text/html" title="Why your experiment’s impact is probably greater than you think" /><published>2019-07-17T14:52:40+00:00</published><updated>2019-07-17T14:52:40+00:00</updated><id>https://educated-guess.com/growth/2019/07/17/why-your-experiments-impact-is-greater-than-you-think</id><content type="html" xml:base="https://educated-guess.com/growth/2019/07/17/why-your-experiments-impact-is-greater-than-you-think.html"><![CDATA[<p>When we experiment, we try and address hypotheses we believe are blockers (or enablers) for conversion success. Eg if your signup form for a service converts 65%, one can imagine many blockers for the 35% of users that don’t convert:</p>

<ul>
  <li>They don’t understand the offer;</li>
  <li>They don’t trust you;</li>
  <li>It’s too long to fill out;</li>
  <li>It requests details they don’t feel comfortable providing;</li>
  <li>They get stuck unable to answer one of the questions (there’s an answer you hadn’t considered);</li>
  <li>It’s too slow;</li>
  <li>It’s buggy;</li>
  <li>They don’t feel it’s valuable for them to complete this stage</li>
  <li>…</li>
</ul>

<p>When we experiment, we try to address one such blocker and remove it. Let’s imagine we believe (or used qualitative research to find out) that 
some people find the form too long. You experiment with removing a few fields - but find it only grows the conversion rate by 0.5%. Does that mean 
form length was only an issue for 0.5% of your users? No.</p>

<p>Multiple blockers cause the first few fixes to register less impact, with impact increasing as we keep working on the same blocker, until it plateaus. What happens is that most people are not affected by just one blocker to completing a process/funnel. They might both feel it’s too long and not trust the process, or not trust the process and not understand what we’re asking. As a result, when you start fixing the ‘trust’ issues, if you haven’t touched anything else yet, you can only affect the people whose only blocker is trust. Even if you “fixed” trust completely, there are still blockers for the rest of the population. As you remove the next blocker, say “understanding”, you now reap the rewards both of having fixed the trust issue beforehand and of removing the understanding blocker. The diagram illustrates the dynamic at play here, with hypotheses called “Trust”, “Time” (I don’t have time to fill this out), “Price” and “Understand” (I don’t understand why I need it).</p>

<p><img src="/assets/flappy-experiments.jpeg" alt="Flappy birds of experiments" /></p>
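<p>A toy simulation makes the overlap effect tangible (the blocker names and rates below are illustrative assumptions, not measurements from the post):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# each user may suffer several blockers at once, drawn independently
rates = {"trust": 0.20, "length": 0.15, "understanding": 0.10}
has = {b: rng.random(n) < p for b, p in rates.items()}

def conversion(fixed):
    """A user converts only if every blocker affecting them has been fixed."""
    blocked = np.zeros(n, dtype=bool)
    for b, mask in has.items():
        if b not in fixed:
            blocked |= mask
    return 1 - blocked.mean()

base = conversion(set())
lift_length_first = conversion({"length"}) - base
lift_length_after_trust = conversion({"length", "trust"}) - conversion({"trust"})
# fixing 'length' first lifts conversion far less than the 15% it affects,
# and the very same fix is worth more once 'trust' is also fixed
print(lift_length_first, lift_length_after_trust)
```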

<p>You can see many dynamics are possible, including something you think you’ve solved (stopped being a bottleneck) becoming once more a bottleneck 
after other things are solved. Users/clients also change preference, or the audience of users changes and we are exposed to people with a 
different mix of preferences, resulting in shifting the “blockers”. All of this is to say that it’s important to revisit hypotheses you think 
you’ve solved before, and it’s important to acknowledge that you won’t see all the fruits of every effort immediately, but sometimes a 
successful ‘clearing’ of the path to user success is actually reliant on a lot of previous blockers that were removed in a way that was 
impossible to measure. Essentially - every experiment’s impact is tempered by all the other issues so that you see only a part of its effect.</p>]]></content><author><name></name></author><category term="growth" /><summary type="html"><![CDATA[When we experiment, we try and address hypotheses we believe are blockers (or enablers) for conversion success. Eg if your signup form for a service converts 65%, one can imagine many blockers for the 35% of users that don’t convert:]]></summary></entry><entry><title type="html">Organizing research into seven output types</title><link href="https://educated-guess.com/management/2017/11/25/Organizing-research-into-seven-output-types.html" rel="alternate" type="text/html" title="Organizing research into seven output types" /><published>2017-11-25T18:24:15+00:00</published><updated>2017-11-25T18:24:15+00:00</updated><id>https://educated-guess.com/management/2017/11/25/Organizing-research-into-seven-output-types</id><content type="html" xml:base="https://educated-guess.com/management/2017/11/25/Organizing-research-into-seven-output-types.html"><![CDATA[<p>This post is about a range of options for “deliverables” - types of output - from applied researchers in the industry. I found it helpful to be very explicit with people you hire and when prospecting projects about what the output of your research would be, what the different types trade off, and which of these you expect from researchers on your team (or the business units you work with expect from you).</p>

<p>A key point for me is that in the context of a business need for impact, these outputs are found on different spots along two axes: Friction to impact, and Leverage.</p>

<p><img src="/assets/types-of-research.png" alt="The seven output types" /></p>

<p>By ‘friction to impact’, I mean how much work is needed to actually have business impact from providing this deliverable. A report needs to be read, understood and a decision or action taken by its audience to be impactful; In the drawer it helps no one. A dashboard needs to be regularly used, while a fully automated, integrated system (e.g, some recommendation algorithm) already delivers its impact once integrated into the product.</p>

<p>Leverage, on the other hand, describes the scope of problems the research can address. A very generic method can be applied to hundreds of cases; an SDK can be used across many products (at least those in the same programming language); but a single analysis is limited to one decision at a set point in time (though that decision may turn out to be quite large!)</p>

<p>While not comprehensive, these deliverable types cover much of the output from a wide variety of substantive research topics and disciplines:</p>

<h2 id="a-descriptive-analysis">A) Descriptive analysis</h2>

<p>This type is most similar to a published paper: It summarizes an exploration into a topic. It rests on driving a decision as its mode of delivering impact. Because often other people will make the actual decision and the changes it supports, the most important thing about a descriptive analysis is communicating it well. The expectation, should you end your project with a descriptive analysis, is that you are able to convince other people to take action with it. It doesn’t matter how sophisticated the method (in fact, a great deal of making the decision easier to take is that the method is not sophisticated where unnecessary, or decision-makers will justly suspect a type of method-dependence). It doesn’t matter how long and thorough it is (again, the summarized delivery must be incredibly short). A lot of what matters is the delivery.</p>

<p>Examples are an analysis highlighting a new way to measure the business unit that is more aligned with the actual goal it’s trying to achieve, or highlighting an opportunity that was missed in the data, or describing a new process, or deciding between multiple strategic options, etc.</p>

<h2 id="b-a-method">B) A Method</h2>

<p>This is also often encapsulated in a paper, but this output aims to generalize an approach to many possible datasets or topics. A new method might have enormous potential impact - imagine you found a more efficient sorting algorithm - but until patented or applied to a substantive topic of interest, the method remains un-impactful for the business, however impactful it might be for the progress of science. The responsibility is with the researcher to ensure that many people repeatedly adopt the new method. This may be easier to bring about than the change a descriptive analysis is trying to drive, if the method is a widely applicable, practical one. It requires a lot of insight into what others are missing and an interest in persuading others to use the new method in their research.</p>

<h2 id="c-a-dataset">C) A Dataset</h2>

<p>A new dataset may be the result of research, either through collection of new data or through some preliminary analysis and transformation of existing data. It needs to be used in another analysis, dashboard or other output to become impactful.</p>

<h2 id="d-a-dashboard">D) A Dashboard</h2>

<p>The automation of a descriptive analysis is a dashboard. It is only impactful if people use the dashboard, and so the dashboard needs to be useful: It needs to be easy to understand and use, quick to load, comprehensive in the insight it provides, and be an important part of the process that people undertake on an ongoing basis. The dashboard can be a new or more refined measurement, a monitor helping a new process happen efficiently, providing ongoing insight, etc. You trade the ‘one-off’ nature of a descriptive analysis for continuous ongoing impact, at the cost of effort: Creating the analysis in more robust code that is able to apply it automatically from new data as a report (possibly an interactive one).</p>

<h2 id="e-a-framework">E) A Framework</h2>

<p>An API, code library/package or generalized executable which isn’t integrated in a product. Like a Method (B) embedded in code, this is again a step closer to leveraging the impact of your research but doesn’t, unless used within other systems, actually generate impact. However, the ease of integration and the general applicability of an API solving a set of problems create enormous potential for impact with highly reduced friction compared to only a method: With a framework, you don’t need to read a complex paper, and then implement it: You just call a function in an SDK or a web API to get the application of the research within your product.</p>

<h2 id="f-a-proof-of-concept">F) A Proof-of-concept</h2>

<p>The proof of concept is a step on the way to a standalone system (G) that does not provide impact by itself, but is meant to demonstrate that it is possible to build a standalone automated system with the desired impact. In some cases the scale of impact of a POC is tiny, and a scalable automated system is needed to grow it. Either way, a POC is similar to a method in the sense that if no standalone automated system is actually built around the POC’s code, no business impact is achieved. It’s hard work to convince other researchers to use your method in their research, and it’s hard to convince business stakeholders to take the action you propose in your descriptive analysis, or to use a dashboard you built. But my experience is that it’s probably even harder to cobble together an engineering team to build your POC into a product/system, without very strong buy-in and an explicit understanding in advance that the scope of your work is going to be a POC. There are many labs, teams and companies organized around researchers building POCs and engineers building them into systems [1]. In my experience this was the root of several organizational problems, and I strongly and expressly avoid POCs as an accepted form of output for my own research teams: all other forms are legit, but I don’t find POCs are ever useful except as a milestone the researcher then goes on to turn into one of the other forms - usually a framework or a standalone/integrated automated system.</p>

<h2 id="g-a-standalone-automated-system">G) A Standalone, automated system</h2>

<p>This research output is a piece of code, either standalone or (preferably) fully integrated with a key system/product that utilizes some research product to deliver impact. This is often the most impactful form of research, because it both increases the frequency and duration of the impact, and reduces the barriers to adoption to virtually zero. In the classic examples - e.g a classifier making decisions by itself, or a system automatically exploring a space of experiments, or an algorithm deciding the optimal ad bid - it essentially automates and streamlines the decision-making that humans might do with dashboards or by reviewing analyses. It does away with a lot of the dependencies to achieving impact relied upon in all of the previous forms of research: The need to communicate, convince people, or rely on their ongoing use of your research, and instead takes on the responsibility to deliver the impact itself.</p>

<p><img src="/assets/types-of-research.png" alt="The seven output types" /></p>

<p>In organizing these outputs, I think two axes are interesting: One is the potential scope of impact, or “reach” of the output. The other is the barriers/friction to creating the impact, and how dependent it is on others or the researcher’s future action. These are somewhat but not entirely inverse. The actual impact also, of course, widely varies with the substantive topic at hand, and of course depends on the research being correct, relevant, successful, etc. So for instance, it is perfectly sensible to choose to work on, and drive more impact from a “low-reach”, one-off descriptive analysis of how to use your company’s spare cash than a complicated method that makes some specific computations more efficient by 0.1% in rare cases, even though in my theory I place ‘methods’ as some of the highest-potential-scope output. But when several of these are applicable, and in the same topic, this may explain why I prefer a system with its “0 friction to impact” and decent reach, over a Proof-of-Concept with only marginally more potential applications and much more friction.</p>

<p>A great applied (industry) research organization constantly drives to lengthen the lever its projects create - moving towards bigger-scope, generalized outputs - while simultaneously impacting the business by driving friction down, taking accountability for the impact and ensuring it is realized, often with follow-up work by the same researcher.</p>

<p>Another aspect that I’ve ignored so far, is the amount of effort required. I avoid ordering these outputs by effort, however - I don’t see how to argue about it coherently nor is it the same for different people with their different preferences and skills. The only general case I see is that a POC must take less effort than a full system, by definition. For the rest, the effort required is on different, incomparable scales.</p>

<p>The key point I try to make is a conscious choice of output needs to be made - and very clearly communicated - as a researcher or manager, and it needs to be explicitly understood by your colleagues and business partners. Significant difficulties that researchers, their managers and their clients experience, in situations when they are all new to the relationship, arise when the researcher expected to build a POC when the business partner wanted a standalone automated system, or the researcher wanted to work on methods published as papers when their managers wanted the team to build frameworks. So I hope this post gave you a language to explore these.</p>

<p>–</p>

<p>[1] In some cases in the industry, there’s a separation and violation of the original meanings of roles: Engineers were sophisticated implementors figuring out solutions to practical problems - like landing vehicles on the moon or devising steam engines. They were doing applied research, and they definitely didn’t either abstain from the learning and experimentation of research nor the drive to build a fully working solution. I find in some cases in tech we’ve created two classes of people that want to do either one (exploration) or the other (building proven things to spec) but not both, and the result is a problem of division of labor (that I might discuss in another post).</p>]]></content><author><name></name></author><category term="management" /><summary type="html"><![CDATA[This post is about a range of options for “deliverables” - types of output - from applied researchers in the industry. I found it helpful to be very explicit with people you hire and when prospecting projects about what the output of your research would be, what the different types trade off, and which of these you expect from researchers on your team (or the business units you work with expect from you).]]></summary></entry><entry><title type="html">The measured life of 1944</title><link href="https://educated-guess.com/personal/2012/07/12/the-measured-life-of-1944.html" rel="alternate" type="text/html" title="The measured life of 1944" /><published>2012-07-12T10:08:36+00:00</published><updated>2012-07-12T10:08:36+00:00</updated><id>https://educated-guess.com/personal/2012/07/12/the-measured-life-of-1944</id><content type="html" xml:base="https://educated-guess.com/personal/2012/07/12/the-measured-life-of-1944.html"><![CDATA[<p>My grandfather, Ernest Friedlander, was the quintessential Yekke [1] engineer at a time when their culture played a major part in Israel. He was trained in Germany, like so many of the engineers and architects of that time. 
In fact, the story goes that when a talk was given at the assembly of engineers in Tel Aviv the audience would not even notice when the speaker would mention “… That method, devised in Eintausendneunhundertzwanzigsieben, is used in…” – interposing the German numbers in the Hebrew text.</p>

<p>My grandfather passed away long before I was born, but only recently, when shuffling through his old papers, did my father encounter this – and it immediately brought back vivid memories of his childhood.</p>

<p><img src="/assets/weights.jpeg" alt="Chart of weight" /></p>

<p>The graph, done by hand, is a bit hard to make out with its faded labels and shorthand German. Nevertheless, a single clue almost entirely solves the puzzle: at the very top left corner, enclosed in a box, it reads ‘8.7.44, 3600g’. That date happens to be my father’s date of birth, and 3.6 kg (about 8 lbs) was his weight at birth. The amazing part comes when you realize what Ernest had done.</p>

<p>He weighed my father daily from the day he was brought home, a week after he was born - that is the red line. He also recorded, three times a day, my father’s feedings - both by Brust (breastfeeding) and then with a supplement, marked in red - and these make up the upper half of each page of the chart. The black line, then, is the net weight, without the food. Finally, he calculated and drew a regression (trend) line to track the increase in weight. The four pages here are just the small sample my father had framed, recording the first 11 Woche (weeks); there are quite a few more, going up until breastfeeding stopped.</p>

<p>A remarkably tedious and meticulous job, carried on without a break for long months - this is exactly the defining characteristic of the Yekke. But it is also a reminder that the advanced data culture we live in today is in fact nothing new. Life was carefully measured even before this guy came around.</p>

<p>Lessons about the history of visualization, and about the way an engineer expresses his care and devotion, aside - for me this discovery was touching because it hints that my fascination with statistics and data - quite an anomaly in my family - might in fact be a trait passed on to me through my grandfather’s genes, a part of him that I carry on in my everyday life.</p>

<p>[1] A culture of Jewish German nationals who emigrated to Israel mostly in the 1930s, escaping a rapidly Nazifying Germany. Having been mostly secular and deeply entwined in the German middle class of the time, they are stereotyped as pedantic, punctual, and so pragmatic and rational as to seem unaffectionate.</p>]]></content><author><name></name></author><category term="personal" /><summary type="html"><![CDATA[My grandfather, Ernest Friedlander, was the quintessential Yekke [1] engineer at a time when their culture played a major part in Israel. He was trained in Germany, like so many of the engineers and architects of that time. In fact, the story goes that when a talk was given at the assembly of engineers in Tel Aviv the audience would not even notice when the speaker would mention “… That method, devised in Eintausendneunhundertzwanzigsieben, is used in…” – interposing the German numbers in the Hebrew text.]]></summary></entry></feed>