
Unlocking value from your company data

Every company is sitting on data going back many years, data that is easy to dismiss as dated and useless. What if, though, after dusting off the layers, we could use this legacy experience to help navigate the future?

In this piece, using a fictional company, I am going to walk through how to assess the type and quality of the data you have, and how to determine whether it can be used, through machine learning (ML) models, to surface new opportunities or operating metrics.

In my Think Like a CTO book, I go through in detail all the different types of data a company can produce and the metrics used to classify it. Let us quickly revisit the five main lenses through which to view data.


#1 Volume / Age

This is an easy one – how much data do you have, and how far back does it go? The format does not matter at this point. For example, what is the oldest backup of a given system? Does your accounting software, tracking every sale/invoice, go back 5, 10, 20 or more years? CRM software? Customer usage data? Even data sitting in Excel counts – it does not have to be pretty; what matters is that you have it.

#2 Velocity

This is the rate at which new data is produced within the organization, and how quickly that data is processed into something meaningful. This could be the number of sales transactions in a day, visitors to a website, tickets opened – anything that points to the operation of the business.

#3 Variety

What different formats have evolved over the years? Is the data in flat files (XML, CSV, XLS, PDF, etc.) or in something more structured, such as a database (tables/rows/columns), that can be exported?

From this, we determine how structured the data is, which indicates the level of effort required to consume it. We may have PDFs or images of old invoices, which would require some sort of OCR to extract the data, compared with the ease of a CSV file of the same data.
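To make that effort gap concrete, here is a minimal Python sketch (the file names are hypothetical): the CSV path yields records in a couple of lines, while the scanned image has to go through OCR, via something like the pytesseract wrapper around Tesseract, before any real parsing can begin.

```python
# A minimal sketch of the effort gap between formats. File names are
# hypothetical. The structured path is a few lines of stdlib code; the
# scanned path needs OCR before any real parsing can begin.
import csv

# Structured: a CSV of invoices parses straight into records.
with open("invoices_2004.csv", newline="") as f:
    invoices = list(csv.DictReader(f))

# Unstructured: a scanned invoice must be OCR'd first...
from PIL import Image  # pip install pillow
import pytesseract     # pip install pytesseract (needs Tesseract installed)

raw_text = pytesseract.image_to_string(Image.open("invoice_1998_scan.png"))
# ...and raw_text still needs regexes/heuristics before it becomes the
# same rows and columns the CSV gave us for free.
```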

Incidentally, this includes physical data too – the box of printed invoices/purchase orders. This can be a huge, and often overlooked, treasure trove of data. Years ago it wasn’t practical, or necessary, to digitize this data, so it sat in a literal box on a shelf, waiting to age out. It still has value.

#4 Veracity

It’s all well and good to have this data, but how reliable is it? Was the data accurate at the time, or does it capture a moment of what should have been, instead of what was? (CRM systems suffer from this, especially when it comes to tracking sales.)

More common is software that logged the wrong data due to some bug (say, recording an IP address in the belief it was the user’s, when it was really the server’s). Knowing these periods will help determine how reliable we can assume certain data is.

#5 Ownership

While not strictly part of the classic four V’s of data classification, one additional thing we look at as part of due diligence is confirming who actually owns the data. Just because you have the data doesn’t mean it is yours to do with as you see fit.

You may have bought data for a specific use (demographics, for example) that was only licensed to be used once, yet you still have a copy of it. You may hold client data as part of their engagement. Such data can be of limited use, but it can still inform outcomes.


All this data, in aggregate, tells the history of your organization. Like the rings of a tree, each one marking a different moment in time, each data format is a different part of the journey. This data may hold some secrets to the future.

The data is documentary evidence of known outcomes. Based on all the inputs that drive the business, there is a given outcome. At a high level, there are the obvious revenue and costs of the business, but even then, what really moved the needle to impact the result? That can sometimes be hard to attribute.

In machine learning parlance, what we have is a whole set of supervised training data – that is, a varied collection of data points (features) that all add up to a known, verified outcome (a label). The relationship may be hard to spot by eye, but spotting it is exactly what ML models are good at.

This data is hugely valuable: only you have it, and it can’t be bought – it has to be experienced. Let us look at an example and see what insights could potentially be unlocked.


“Alan’s Roofers LLC” is a fictional company that has been repairing and installing residential roofs for the last 30 years. The company has grown from a one-man band to a large crew, with over 100 trucks, servicing a typical tier-2 town in the USA.

The bookkeeping in the initial years was manual, and as volume increased they moved to a PC running accounting software to track job orders and then invoicing. Eventually it moved online, integrated not only with the bank, but with the HR system and suppliers.

A typical job order, including the ones from the early days, held a surprising amount of data: the address of the customer, the date of the job, the type of job (installation vs. repair), and the materials used. It also tied to an invoice, billing the customer for the amount. As the years went on, more detail was added, including which roofer(s) serviced the job.

Alan, like most business owners, ran the business on a few metrics and gut instinct. He knew that the average roof of a given type could last from 15 to 30 years, so there was going to be a long wait for a renewal job. He also didn’t have much insight into when repairs would come in, though his instinct told him that after certain weather events there would be an uptick in requests.

Given all the data he is sitting on, could he do a lot better? Of course.

The first thing we could do is augment this data with external, public-domain data. With the address of the job, we can determine the type, price, and age of the house, and also the weather at the time. From there, we have the basis of what is called supervised training data that can be used to train an ML model.
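As a rough sketch of that augmentation step, assuming hypothetical CSV exports of the job orders, property records, and daily weather (all file and column names are illustrative), it might look something like this in pandas:

```python
# A hedged sketch: join historical job records with public property
# data and historical weather, keyed on address and date. All file
# and column names are assumptions for illustration.
import pandas as pd

jobs = pd.read_csv("job_orders.csv")         # address, job_date, job_type, invoice_total
homes = pd.read_csv("property_records.csv")  # address, year_built, assessed_value
weather = pd.read_csv("daily_weather.csv")   # date, wind_mph, rainfall_in

df = (
    jobs.merge(homes, on="address", how="left")
        .merge(weather, left_on="job_date", right_on="date", how="left")
)

# Each row is now a labelled example: features describing the house and
# the weather, plus a known outcome – the job that actually happened.
df["label_repair"] = (df["job_type"] == "repair").astype(int)
```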

This model could help us predict, after a given weather event, which types of homes are likely to need repairs. This could be a factor of the age of the house, the complexity of the roof, and the time since the last repair/renewal. Using this, after the event, Alan could orchestrate a highly concentrated marketing campaign aimed at the existing customers most likely to need assistance.
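Continuing the sketch above, a straightforward tabular classifier, such as scikit-learn’s random forest, could produce exactly that ranking; the feature list here is an assumption for illustration.

```python
# A minimal modelling pass over the augmented frame (df) from the
# previous sketch. The feature list is illustrative; a real project
# would add roof complexity, time since last renewal, and so on.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

features = ["year_built", "assessed_value", "wind_mph", "rainfall_in"]
X, y = df[features].fillna(0), df["label_repair"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Held-out AUC gives a rough sanity check before anyone spends
# marketing dollars on the model's ranking.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

After a storm, scoring the existing customer base with the model and mailing the top of the list is the concentrated campaign described above.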

Another area is the time from renewal to the first repair. Irrespective of the manufacturer’s recommendations for the material, this is real data for the given local climate. From this, Alan can not only predict upcoming work (and market accordingly) but also plan his workforce.
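Measuring that interval doesn’t even need a model to start with; a few lines of pandas over the same hypothetical job-order export give the empirical distribution (a rough first pass that ignores houses with multiple renewals):

```python
# A rough first pass at the renewal-to-first-repair interval per
# address. Column names are assumptions, and houses with several
# renewals would need more careful pairing than this.
import pandas as pd

jobs = pd.read_csv("job_orders.csv", parse_dates=["job_date"])

installs = jobs[jobs["job_type"] == "installation"].groupby("address")["job_date"].max()
repairs = jobs[jobs["job_type"] == "repair"].groupby("address")["job_date"].min()

gap = (repairs - installs).dropna()
gap = gap[gap > pd.Timedelta(0)]  # keep only repairs that followed the install

print("Median time to first repair:", gap.median())
```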

Renewals, for example: he can learn a lot about who is likely to need financing versus those who can pay outright, based on the age and neighborhood of the house. He can then price the job accordingly, maybe offering a discount to those likely to pay it off in full.

This is just the tip of the iceberg – knowing who worked on which job, he can also determine whether anyone is performing below or above standard, as every time they are on a job, the time-to-repair is impacted (for better or worse). There could be other insights too, through an unsupervised model, to see what clusters the ML can find.
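A hedged sketch of that unsupervised pass, reusing the augmented frame from earlier: k-means groups the jobs with no labels at all, and inspecting each cluster’s averages shows whether the groupings correspond to anything Alan would recognize (neighborhoods, crews, weather patterns).

```python
# Unsupervised clustering over the augmented frame (df) from earlier.
# Feature choice and k=5 are illustrative; in practice you would try
# several k values and sanity-check the clusters with domain knowledge.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["year_built", "assessed_value", "wind_mph", "invoice_total"]
X = StandardScaler().fit_transform(df[features].fillna(0))

df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# Per-cluster averages are the first interpretability tool.
print(df.groupby("cluster")[features].mean())
```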


Machine learning can surface relationships that are not obvious at first and, from those, help inform future events. It can also confirm the “gut instinct”, which, at the end of the day, is the human version of ML: “I don’t know specifically why this will happen, but experience tells me this is the likely outcome.” The reason gut instinct works at all is the legacy data on which it is based.

It is common sense – you are not going to trust the person with only a couple of jobs under their belt the same way you trust the one with 20 years of jobs behind them. Both have experience to offer, but one has far more data to draw upon.

Machine learning is exactly the same: the more experience you can throw at it, the more confidence you can have in its output.

With all the buzz around AI (ChatGPT, for example, which kindly produced all the images for this article), it can be hard to figure out where a business, particularly a traditional bricks-and-mortar company, can benefit.

The irony is, it is these very types of organizations, with their years of validated data, that can benefit the most from machine learning models.


by Alan Williamson

Chief Technology Officer
Java Champion | Author | Speaker
