Society is changing more rapidly than at any time in the past, as a consequence of the data revolution and its attendant technologies. Data are being captured at an unprecedented rate, through automatic measurement (e.g. traffic flow monitoring, intensive care units, the internet of things), accumulation of data sets as a side effect of other activities (e.g. supermarket purchases, travelcards, tax payments), and deliberate data capture (e.g. astronomical databases with billions of data points, genomic and other medical and biological data). As a consequence, massive data sets are being collected describing all aspects of society and indeed of nature. On top of all this sit advances in computer power permitting the development and application of highly sophisticated tools, models, and algorithms for extracting information from the large data sets. All of this enables understanding to be enhanced, insights to be gleaned, and optimal predictions and decisions to be made. Often these are in real time.
At least, that’s the promise of modern data technology.
But all of that promise is based on a critical assumption: that the data you have properly reflect the phenomenon you wish to study and the system about which you want to draw conclusions and make predictions. If you are wrong – if the data are distorted in some way – then the conclusions you draw, the decisions you make, and the actions you take can also be wrong. In some cases, the actions and decisions can be wrong with dramatic consequences, including the loss of fortunes and even of lives.
The critical assumption is a brave one. After all, one can never record everything; it is inevitable that some things are left unrecorded. While I can measure your height, weight, income, IQ, gender, fashion preferences, and so on, I cannot measure everything about you. Indeed, in general we cannot record all that is relevant, even if we can determine what is relevant and what is not. In sum, it is very likely that some data which have a bearing on our understanding and decisions are missing. This leads to the key question: do those missing data mean that our data analysis is misleading?
Here are a couple of examples, one well known and one less well known.
The Challenger Space Shuttle blew up soon after launch on 28 January 1986, killing all seven crew members. A teleconference the night before had decided to go ahead with the launch despite concerns about the forecast low temperature. But the data the teleconference had examined, showing the relationship between launch temperature and possible distress to the seals around the booster rocket segments, were only partial. They were missing critical information about launches which had suffered no seal distress. Including that information gives a completely different picture of the relationship. Including all of the data would almost certainly have led to a different action.
Rich Caruana and his colleagues described a machine-learning system for predicting the probability that patients who developed pneumonia would die from the illness. It was usually pretty accurate, except that it predicted that patients who had a history of asthma had a lower risk of dying of pneumonia. What the machine didn’t know was that patients with a history of asthma were regarded as having such a high risk that they were sent to the intensive care unit, where they received particularly careful treatment which reduced their chance of dying from pneumonia. Lack of awareness of the missing information could have had fatal consequences.
This problem of missing data arises in familiar ways in financial applications.
For example, it seems obvious that we should choose those funds or those fund managers which have performed best in the past. But focusing attention on just those could be highly misleading. After all, those that have performed exceptionally well will have done so for two (possible) reasons. One is that they might indeed have exceptional ability; they really are good and make sound decisions. The other is that they just got lucky. It is likely that the best performers became so because of a combination of these two causes. Indeed, even if the managers all had the same ability, some would appear to do well purely by chance: someone has to come out first. And that tells us what the problem is: the chance contribution to the performance is as likely to vanish at the next trading period as it is to repeat. This well-understood phenomenon, regression to the mean, means we should be cautious about projecting past performance of a selected group of funds into the future. Just focusing on the best means relevant data about the distribution of performance, and how much individual performance fluctuates over time, are missing.
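The regression-to-the-mean argument can be illustrated with a small simulation. The sketch below is a toy model, not a description of any real fund data: it assumes each manager's return in a period is a fixed "skill" term plus independent Gaussian "luck", and the function name `simulate_top_decile` and its parameters are invented purely for illustration.

```python
import random
import statistics


def simulate_top_decile(n_managers=1000, skill_sd=0.0, luck_sd=1.0, seed=0):
    """Toy model (an assumption, not real data): return = skill + luck.

    Selects the top 10% of managers by period-1 return and reports that
    group's mean return in period 1 and in period 2.
    """
    rng = random.Random(seed)
    # Persistent ability; skill_sd=0.0 means every manager is equally able.
    skill = [rng.gauss(0.0, skill_sd) for _ in range(n_managers)]
    # Luck is drawn afresh in each period, independently of skill.
    period1 = [s + rng.gauss(0.0, luck_sd) for s in skill]
    period2 = [s + rng.gauss(0.0, luck_sd) for s in skill]
    # Rank managers on past performance only, as a naive selector would.
    ranked = sorted(range(n_managers), key=lambda i: period1[i], reverse=True)
    top = ranked[: n_managers // 10]
    mean1 = statistics.mean(period1[i] for i in top)
    mean2 = statistics.mean(period2[i] for i in top)
    return mean1, mean2


if __name__ == "__main__":
    # Equal ability: the winners' past outperformance is pure luck.
    print("no skill differences:", simulate_top_decile(skill_sd=0.0))
    # Ability and luck of similar size: only part of the edge persists.
    print("skill plus luck:     ", simulate_top_decile(skill_sd=1.0))
```

With equal ability, the selected group looks impressive in the first period and close to average in the second. When skill does vary, part of the group's advantage persists, but less than the first-period figures suggest. Seeing only the winners' first-period returns hides exactly this.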
As time passes, so financial technology changes. At a personal banking level, recent decades have seen the shift from cheques and physical bank branches, to telephone banking, to internet-based transactions. At the trading level, tick sizes have changed, as has the ability to model the processes underlying higher frequencies of transactions in real time. Furthermore, economic conditions and the competitive environment change. All of this means that past data sets may no longer be relevant. It means we might have only limited data on which to model current conditions.
Fraud is another obvious financial area where data are dark. On the one hand, a fraudster will try to conceal information (not least his or her true identity) from the victim. On the other, the true extent of fraud is likely to be unknown – since, presumably, poor fraudsters tend to get caught more readily than good ones, with the very best getting away with it for years, or altogether. Bernie Madoff established Madoff Investment Securities in 1960, but wasn’t arrested until 2008, when he was already 70.
My forthcoming book Dark Data: Why What You Don’t Know Matters (to be published by Princeton University Press on 28 January 2020) looks at these and many more situations in which absent data have led to mistaken models. It gives a great many examples, ranging from medical catastrophes to some of the world’s biggest financial disasters. The book categorises the problems of incomplete data into fifteen types, showing how each of them can arise. But then it goes further, showing how to spot the dangers, and then how to sidestep them so that effective decisions can still be made.
But it does not stop even there. It then goes on to show how, with some understanding, advantage can be taken of ignorance of data to enable superior decisions: how one can deliberately hide data to enhance decisions.
If this sounds far-fetched, a straightforward and well-known example is the double-blind randomised clinical trial from medical research. Here patients are randomly allocated to different groups, and the identity of the group and the treatment each patient is receiving are concealed from both the patients and the treating physicians. The data are dark. Only later, once the experiment has been completed, is the randomisation code broken and the treatments revealed. This process enables accurate and reliable estimates of causal relationships to be established. By deliberately darkening the data, fresh light can be shed.
The starting point of all this is for you to ask yourself whether you are guilty of analysing the data you have, not the data you need. Whether your conclusions are wrong because you are missing crucial information. And whether your actions are inappropriate because they are taken on the basis of only partial, and mistaken, understanding.
Secure your seat to listen to the keynote “What you don’t know counts” by Professor David Hand on the 1st of November at The Quant Conference.