Correlation not Causation
Tyler Vigen is somewhat famous for his Spurious Correlations website (and new book). He has identified numerous data sets that are highly correlated but clearly unrelated. Examples include swimming pool deaths and Nicolas Cage films, the divorce rate in Maine and margarine consumption, and arcade revenue and computer science doctorates. OK, the last one kind of makes sense. The critical missing piece, of course, is causation. While these examples are amusing, they force us to remember that correlation does not mean causation. We need to be wary of this when analyzing data: a strong correlation may show up in the results without making any physical sense.
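To see how easily this happens, here is a minimal sketch in Python. The two series are made up for illustration (they are not Vigen's data): each one simply trends upward over 30 years with its own independent noise. The shared trend alone is enough to produce a high correlation, and the correlation largely disappears once we compare year-over-year changes instead of raw values.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def differences(xs):
    """Year-over-year changes, which remove a steady linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]

rng = random.Random(0)  # fixed seed so the run is repeatable
years = range(30)

# Hypothetical, unrelated series: both grow over time, noise is independent.
series_a = [100 + 5 * t + rng.gauss(0, 10) for t in years]
series_b = [20 + 2 * t + rng.gauss(0, 4) for t in years]

raw_r = pearson(series_a, series_b)
detrended_r = pearson(differences(series_a), differences(series_b))

print(f"correlation of raw values:           {raw_r:.2f}")
print(f"correlation of year-over-year change: {detrended_r:.2f}")
```

Nothing connects the two series except time, yet the raw values look tightly linked. Many of the amusing examples have this shape: two things that both happened to rise (or fall) over the same period.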
Machine learning and artificial intelligence algorithms work by identifying patterns in data. The first step in a predictive analytics effort is to provide a sample set of known good data (meaning the event we are trying to predict did not occur). The algorithm then analyzes this data to identify the correlations that characterize normal operation. Once the system is active, it raises an alarm or alert when it detects an anomaly. An operator then has to determine whether this is a false positive (a normal operating condition) or an actual event (further action required). An event that is judged to be normal operation is added to the good data set.
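The workflow above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a single process variable, and a "more than three standard deviations from the good-data mean" rule standing in for whatever pattern recognition a real product uses. The class name BaselineMonitor, its methods, and the readings are all hypothetical.

```python
import statistics

class BaselineMonitor:
    """Toy anomaly detector trained on known good readings (illustrative only)."""

    def __init__(self, good_readings, threshold=3.0):
        self.good = list(good_readings)   # the known good data set
        self.threshold = threshold        # allowed deviation, in standard deviations

    def is_anomaly(self, value):
        """Alarm when a reading is far from what the good data looks like."""
        mean = statistics.fmean(self.good)
        spread = statistics.stdev(self.good)
        return abs(value - mean) > self.threshold * spread

    def mark_normal(self, value):
        """Operator judged the alarm a false positive: add it to the good data."""
        self.good.append(value)

# Hypothetical known good readings for one process variable.
monitor = BaselineMonitor([50.1, 49.8, 50.3, 50.0, 49.9, 50.2])

alarm_before = monitor.is_anomaly(51.5)   # outside the learned band: alarm
monitor.mark_normal(51.5)                 # operator: this was normal operation
alarm_after = monitor.is_anomaly(51.5)    # the band has widened: no alarm

print(alarm_before, alarm_after)
```

The point of the feedback step is that the definition of "normal" keeps growing with operator experience, which is also why a careless decision to accept an alarm as normal can hide a real problem later.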
A common application for a predictive machine learning algorithm is a turbine generator at a power plant. Because of the industry's experience with rotating equipment, the process variables that should be included in the sample set are well known and understood. In many instances the correlations among them are expected as well. The challenge comes when the data is not as well understood (or when you are looking to further enhance your current system). In a coal-fired power plant, the height of the coal pile may be correlated with boiler efficiency, but the two are clearly not related.
One of the examples I commonly use is that of a peanut butter manufacturer. They were trying to better understand which variables lead to a given quality of peanut butter. They decided to mine all of the process data and look for patterns, rather than look at a predefined set of variables. As it turns out, this was a smart move, as the best predictors were not the ones expected. Of course, they still had to decide whether the result made sense (it did in this case).
Because this kind of experience is more common than not, it is best to start with a large data set. The data scientists will be adept at identifying the patterns and the correlations that exist. As the effort evolves, people will need to determine what should remain and what should be ignored. Any suspect correlation merits a discussion. It could be meaningful, or it could be a relationship similar to math doctorates and the amount of uranium stored at your plant.
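Here is a rough sketch of why that discussion matters when you mine everything. It is written in Python with entirely synthetic data: 500 hypothetical process tags and a quality measure, all pure random noise, with only 20 samples each. None of the tags has any relationship to the target, yet some of them will still correlate with it strongly just by chance. The tag names and the 0.5 cutoff are assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
n_samples = 20           # few samples per variable
n_variables = 500        # many candidate variables

# Synthetic quality measure and synthetic process tags, all independent noise.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
candidates = {
    f"tag_{i}": [rng.gauss(0, 1) for _ in range(n_samples)]
    for i in range(n_variables)
}

# Rank every tag by the strength of its correlation with the target.
ranked = sorted(
    ((abs(pearson(values, target)), name) for name, values in candidates.items()),
    reverse=True,
)
strongest_r, strongest_name = ranked[0]
flagged = [(name, r) for r, name in ranked if r > 0.5]

print(f"strongest |r|: {strongest_r:.2f} ({strongest_name})")
print(f"tags with |r| > 0.5: {len(flagged)} of {n_variables}")
```

Every tag flagged here is a coincidence by construction. In a real plant the list will be a mix of coincidences and genuine predictors, and it takes someone who knows the process to tell them apart.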