Society is changing more rapidly than at any time in the past, as a consequence of the data revolution and its attendant technologies. Data are being captured at an unprecedented rate, through automatic measurement (e.g. traffic flow monitoring, intensive care units, the internet of things), accumulation of data sets as a side effect of other activities (e.g. supermarket purchases, travelcards, tax payments), and deliberate data capture (e.g. astronomical databases with billions of data points, genomic and other medical and biological data). As a consequence, massive data sets are being collected describing all aspects of society and indeed of nature. On top of all this sit advances in computer power permitting the development and application of highly sophisticated tools, models, and algorithms for extracting information from the large data sets. All of this enables understanding to be enhanced, insights to be gleaned, and optimal predictions and decisions to be made. Often these are in real time.
At least, that’s the promise of modern data technology.
But all of that promise is based on a critical assumption: that the data you have properly reflect the phenomenon you wish to study and the system about which you want to draw conclusions and make predictions. If you are wrong – if the data are distorted in some way – then the conclusions you draw, the decisions you make, and the actions you take can also be wrong. In some cases, the actions and decisions can be wrong with dramatic consequences including the loss of fortunes and even of lives.
The critical assumption is a brave one. After all, one can never record everything; it is inevitable that some things are left unrecorded. While I can measure your height, weight, income, IQ, gender, fashion preferences, and so on, I cannot measure everything about you. Indeed, in general we cannot record all that is relevant, even if we can determine what is relevant and what is not. In sum, it is very likely that some data which have a bearing on our understanding and decisions are missing. This leads to the key question: do those missing data mean that our data analysis is misleading?
Here are a couple of examples, one well known and one less well known.
The Challenger Space Shuttle blew up soon after launch on 28 January 1986, killing all seven crew members. A teleconference the night before had decided to go ahead with the launch despite concerns about the forecast low temperature. But the data the teleconference had examined, showing the relationship between launch temperature and possible distress to the seals around the booster rocket segments, were only partial. They were missing critical information about launches which had suffered no seal distress. Including that information gives a completely different picture of the relationship. Including all of the data would almost certainly have led to a different action.
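To see how leaving out the trouble-free cases can hide a relationship, here is a minimal sketch in Python. The records are invented purely for illustration and are not the actual launch history; the temperature values, the 65-degree cut-off, and the function name `summarise` are all assumptions made for this example.

```python
# Invented records, for illustration only: (launch temperature, seal distress seen?)
launches = [
    (50, True), (55, True), (60, True), (70, True), (75, True), (80, True),
    (65, False), (68, False), (70, False), (72, False), (74, False), (75, False),
    (76, False), (78, False), (80, False), (82, False), (84, False), (85, False),
]

THRESHOLD = 65  # arbitrary cut between "cold" and "warm" launches


def summarise(records, threshold=THRESHOLD):
    """Return {band: (launches, launches_with_distress)} for each temperature band."""
    summary = {"cold": [0, 0], "warm": [0, 0]}
    for temperature, distress in records:
        band = "cold" if temperature < threshold else "warm"
        summary[band][0] += 1
        summary[band][1] += int(distress)
    return {band: tuple(counts) for band, counts in summary.items()}


# The partial view: only launches on which distress was observed.
distressed_only = [record for record in launches if record[1]]

partial = summarise(distressed_only)
full = summarise(launches)
```

In the partial view, distress appears equally often in the cold and warm bands (three launches each), so temperature looks irrelevant. In the full view, every cold launch in this toy data set showed distress, against only three of the fifteen warm launches. The contrast only becomes visible once the launches without distress are put back.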
Rich Caruana and his colleagues described a machine-learning system for predicting the probability that patients who developed pneumonia would die from the illness. It was usually pretty accurate, except that it predicted that patients who had a history of asthma had a lower risk of dying of pneumonia. What the machine didn’t know was that patients with a history of asthma were regarded as having such a high risk that they were sent to the intensive care unit, where they received particularly careful treatment which reduced their chance of dying from pneumonia. Lack of awareness of the missing information could have had fatal consequences.
This problem of missing data arises in familiar ways in financial applications.
For example, it seems obvious that we should choose those funds or those fund managers which have performed best in the past. But focusing attention on just those could be highly misleading. After all, those that have performed exceptionally well will have done so for two (possible) reasons. One is that they might indeed have exceptional ability; they really are good and make sound decisions. The other is that they just got lucky. It is likely that the best performers became so because of a combination of these two causes. Indeed, even if the managers all had the same ability, some would appear to do well purely by chance: someone has to come out first. And that tells us what the problem is: the chance contribution to the performance is as likely to vanish at the next trading period as it is to repeat. This well-understood phenomenon, regression to the mean, means we should be cautious about projecting past performance of a selected group of funds into the future. Focusing on just the best performers means that relevant data about the distribution of performance, and about how much individual performance fluctuates over time, are missing.
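Regression to the mean is easy to reproduce in a small simulation. The sketch below is illustrative only: it assumes each fund's return in a period is a fixed "skill" term plus independent, normally distributed "luck", and the function name, parameters, and default values are all hypothetical choices rather than a model of any real market.

```python
import random
import statistics


def simulate_selected_funds(n_funds=1000, top_k=50, skill_sd=0.0,
                            luck_sd=1.0, seed=0):
    """Pick the top_k funds on period-1 return; report their mean return
    in period 1 and in period 2.

    Assumed toy model: return = skill + luck, with luck redrawn every period.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0.0, skill_sd) for _ in range(n_funds)]
    period1 = [s + rng.gauss(0.0, luck_sd) for s in skill]
    period2 = [s + rng.gauss(0.0, luck_sd) for s in skill]

    # Select the best performers using period-1 results only.
    top = sorted(range(n_funds), key=lambda i: period1[i], reverse=True)[:top_k]

    mean_p1 = statistics.mean(period1[i] for i in top)
    mean_p2 = statistics.mean(period2[i] for i in top)
    return mean_p1, mean_p2


# All managers equally able: the period-1 "winners" are winners by luck alone.
luck_only = simulate_selected_funds(skill_sd=0.0)

# Ability and luck contribute equally: part of the advantage persists.
skill_and_luck = simulate_selected_funds(skill_sd=1.0)
```

With `skill_sd=0.0` the selected funds look outstanding in the first period, but their second-period average falls back towards that of the whole population, because nothing about them was exceptional apart from chance. With `skill_sd=1.0` the selected group stays above average in the second period, though well below its first-period figure.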
As time passes, so financial technology changes. At a personal banking level, recent decades have seen the shift from cheques and physical bank branches, to telephone banking, to internet-based transactions. At the trading level, tick sizes have changed, as has the ability to model the processes underlying higher frequencies of transactions in real time. Furthermore, economic conditions and the competitive environment change. All of this means that past data sets may no longer be relevant. It means we might have only limited data on which to model current conditions.