Subject: A stitch in time saves nine (or nine hundred) // Taguchi loss to society model

From: Moore, Danny

To: All Staff

Date: 5 September 2022 at 10:08 am

All,

Last week brought a sharp reminder of the old English proverb: “a stitch in time saves nine.”

Sloppy work (being kind) on the physical network setup for a NYC-based platform customer led to an escalating storm which evolved from “how many engineers does it take to configure an internet circuit” to “how many CXOs does it take to manage the fallout.”

We need to be super tuned in to these types of incidents as a firm. 

The phenomenon where small, easy-to-avoid issues at the coal face sink whole companies is well documented in business literature, so it is neither new nor peculiar to Options. Research by the Japanese manufacturing-systems guru and business statistician Genichi Taguchi in the 1970s demonstrated that with complex systems the “stitch in time” proverb grossly understates the ripple-out effect from root causes, or root sloppiness. He proposed a “loss to society” model in which the long-term cost to a manufacturing firm grows with the square of the deviation from target, i.e. it blows up far faster than the proverb’s 9x once problems get out into the wild.
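For reference, the textbook form of Taguchi’s quality loss function (generic symbols here, nothing specific to our business) is quadratic in that deviation:

    L(y) = k (y - T)^2

where y is the delivered quality characteristic, T the target and k a cost constant. Double the deviation and the loss quadruples; let a defect escape far enough into the wild and both the deviation and the bill keep growing.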

The difference: with the stitch-in-time proverb there is only one ripped garment, so the downside is fixed, one pair of trousers in tatters. In manufacturing, a fault in a car’s brake system that makes the brakes fail, or an engine that spontaneously bursts into a fireball, can lead to unbounded downside risk for the car company, its customers, and society at large. For our clients a jittery internet circuit during an investor pitch can be the economic equivalent of the fireball: they look sloppy and lose investors. We need to understand that nuance of our sector.

Our own little drama last week illustrates how quickly impact can escalate beyond the proverb’s 9x. Getting it right would have taken an extra hour (?) and, more importantly, a bit more diligence and attention to detail; caring is generally free. Once the issue rippled out it started to suck in days of support time, management got sucked into escalation meetings and troubleshooting, and eventually we needed to bring in an external contractor in a grand gesture to audit and rebuild the setup.

On the client management side we’re buying everyone in the client office a catered breakfast on Tuesday, and Jake is taking the CEO to dinner later next week. The dinner alone will cost us more in time and expense than it would have taken to get the build right in the first place.

The cost of setting up the client’s internet has likely spiralled well past 100x what it should have been, and made everyone involved properly miserable heading into the holiday weekend. Worse than that, we look “dumb” as a firm. We’ve set up what, 500 office internet circuits in the last decade? Struggling with something so basic raises an eyebrow about our whole story.

Sloppiness at the detail level leads to misery at the team level. That feeling of misery (continuous “firefighting”) is the manifestation of spiralling cost.

In my mind, understanding the loss-to-society model is the key to winning long-term competitive battles in tech sectors. Read up on Ford in the mass-production era, or Bezos and Musk today, and you’ll find the concept is central to their business philosophies.

There are four powerful levers management teams can pull to turn this phenomenon to their advantage and come out ahead:

  • Right first time culture, attention to detail, standards, build automation and setup cross-checks; 
  • Investment in supporting infrastructure; 
  • Troubleshooting and crisis management; 
  • Customer face time and lots of it, at the right levels. 

There is huge long-term leverage in attention to detail at the coal face and getting it right first time. Powerful leverage comes from defining standards, automating builds to those standards, and running automated cross-checks after the build. Begin by finding the engineers who give that extra 10% on the attention-to-detail front. As Elon Musk would say, “every manual step is a bug”… and Deming asserted that “every failure highlights an opportunity to improve the process.”
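To make the cross-check lever concrete, here is a minimal sketch of the idea in Python. The file names, hostnames and settings are made up for illustration; this is not our actual build tooling.

    # Minimal sketch: compare an as-built device config against the agreed
    # standard and flag every deviation before handing the circuit to the client.
    # File names and settings are hypothetical.
    import json

    def load(path):
        """Load a JSON snapshot of key settings (e.g. exported by build tooling)."""
        with open(path) as f:
            return json.load(f)

    def cross_check(standard, as_built):
        """Return human-readable deviations of the as-built config from the standard."""
        deviations = []
        for key, expected in standard.items():
            actual = as_built.get(key, "<missing>")
            if actual != expected:
                deviations.append(f"{key}: expected {expected!r}, found {actual!r}")
        return deviations

    if __name__ == "__main__":
        standard = load("standards/office_internet.json")    # the agreed build standard
        as_built = load("snapshots/nyc_client_router.json")  # what actually got configured
        problems = cross_check(standard, as_built)
        if problems:
            print("Build does NOT match standard:")
            for p in problems:
                print("  -", p)
        else:
            print("Build matches standard - safe to hand over.")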

Supporting infrastructure is huge in engineering. In our world, setting up the monitoring (Nagios) and validating that the logs are clean proactively identifies all sorts of problems. The same goes for Splunk cross-checks or desired-state management via Puppet. Proactive start-of-day checks, including log reviews, also fall under supporting infrastructure.
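As a flavour of what a start-of-day check can look like, here is a rough sketch; the log path and endpoint are hypothetical examples, not our monitoring estate.

    # Rough start-of-day sketch: scan overnight logs for errors and confirm key
    # endpoints answer a TCP connect. Log path and endpoints are hypothetical.
    import socket

    LOG_FILE = "/var/log/syslog"                   # example log location
    ENDPOINTS = [("office-gw.example.com", 443)]   # example endpoints to probe

    def log_errors(path, keywords=("ERROR", "CRIT")):
        """Return log lines containing any of the given keywords."""
        with open(path, errors="ignore") as f:
            return [line.rstrip() for line in f if any(k in line for k in keywords)]

    def reachable(host, port, timeout=3):
        """True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for line in log_errors(LOG_FILE):
            print("LOG:", line)
        for host, port in ENDPOINTS:
            print(f"{host}:{port}", "OK" if reachable(host, port) else "UNREACHABLE")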

Customer face time is self-explanatory. If the office internet is flaky or cars are spontaneously bursting into fireballs, the customers will tell you. This feedback loop is critical to keeping management teams fully tuned in to what’s happening at the coal face. Companies tend to face a lot of entropy where senior management spend more and more time looking inwards, having meetings, giving investor presentations and playing golf, and lose touch with the customer’s reality. Ever watch “Undercover Boss”?

The net effect is that the by-now-disenfranchised support engineer or account manager, who is getting it in the neck directly, is very aware of client issues, whereas the executive management team and senior engineers who need the real-world feedback are living in their wee bubbles and have no idea what’s going on in the client’s world. Systematically getting the management team back in front of clients was a key H2 goal from the last board meeting. We’re running at a huge customer face-time deficit after the pandemic.

My personal view is that troubleshooting and crisis management are possibly the most undervalued skills in sectors like ours: the curiosity to have a poke around the log files and run a few tests to make sure everything is working as it should, to quickly validate the config against the standard, to be alert to quirks in the context. It all nets out to the difference between identifying the root issue in 30 minutes vs 3 months, and done right it cuts the aggregate team effort to manage the incident by 99%. As we used to say when we were in the software business: “don’t manage the problem, fix it”.

Genius troubleshooters begin by checking the log files or running an end-to-end test, and end by looking for other examples of the same root cause across their universe. A surprising number of highly rated engineers and executives do neither.
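That second habit, sweeping the rest of the estate for the same root cause, is easy to mechanise. A rough sketch follows; the log layout and the error signature are hypothetical, purely for illustration.

    # Rough sketch: once a root cause is identified at one site, sweep the other
    # sites' collected logs for the same signature. Layout and signature are
    # hypothetical.
    from pathlib import Path

    SIGNATURE = "DUPLEX MISMATCH"        # fingerprint of the root cause just fixed
    LOG_ROOT = Path("/data/site-logs")   # one subdirectory of logs per client site

    def affected_sites(root, signature):
        """Yield (site, log file) pairs whose logs contain the signature."""
        for site_dir in sorted(root.iterdir()):
            if not site_dir.is_dir():
                continue
            for log_file in site_dir.glob("*.log"):
                if signature in log_file.read_text(errors="ignore"):
                    yield site_dir.name, log_file.name
                    break  # one hit per site is enough to flag it

    if __name__ == "__main__":
        hits = list(affected_sites(LOG_ROOT, SIGNATURE))
        print(f"{len(hits)} other site(s) show the same signature:")
        for site, fname in hits:
            print(f"  {site} ({fname})")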

To finish, it’s worth highlighting why I’ve taken the time to draft this memo. As a team we need to be super tuned in to the downside of letting standards slip and, on the other hand, to the huge competitive advantage that can be gained from managing these issues better than our competitors.

My view is that [relative to competitors] we are much stronger across all four pillars outlined above. My view is also that weaknesses in the first three areas will lead to Pico’s demise in the medium term, and are what presented us with the opportunity to buy Fixnetix. Being ahead is no reason to get complacent. Critically, we need to play our own game and focus on running our business in a way that guarantees long-term success through this decade and the next. Feeling good relative to sloppy competitors gives false comfort and is dangerous to long-term aspirations.

It’s worth noting that this was the first time in the last few years a client CEO has contacted me on LinkedIn to highlight service issues. I sit on other boards where I get contacted by frustrated customers every few weeks, in one case twice this past weekend alone. By that measure we’re running a tight ship, but we need to put the work in to keep it that way.

Hope this helps. 

Cheers,

Danny 

— 

P.S.: happy to pop on a call with any business or engineering nerds out there who want to dig into this in more detail.