Work In Progress
Our Elements guide is still in progress, and therefore lacks full visual and technical assets. We hope to release them by summer of 2020. Thanks for reading Lingua Franca!
A correlation is a quick visual indicator of the relationship between a model’s decision and the data it was trained on. The meaning of ‘relationship’ varies by domain, but in general users want to know which input fields contributed most to the output. Because of the fuzziness in interpreting relationships, as well as the inherent limitations of causally linking information, correlations should be used sparingly for explanatory purposes.
Correlation analyses are promising tools in domains where one can assume largely linear relationships, or where the correlations shown carry little risk to the user (i.e. a misleading correlation would not have a large impact). Some measures of correlation (e.g. Pearson’s coefficient) are sensitive to outliers and tend to overstate their importance, and they generally carry less explanatory power when the user is trying to detect a low-probability event.
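To make the outlier caveat concrete, here is a minimal sketch using only the Python standard library. The data is synthetic and the helper names (`pearson`, `ranks`, `spearman`) are illustrative, not taken from any particular library. It compares Pearson's coefficient with a rank-based (Spearman) coefficient on two unrelated variables, before and after a single extreme point is added.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Rank of each value (0 = smallest); ties are not handled specially."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


rng = random.Random(0)

# Two independent variables: there is no real relationship between them.
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [rng.gauss(0, 1) for _ in range(200)]

# Add one extreme point to both variables.
xs_outlier = xs + [50.0]
ys_outlier = ys + [50.0]

pearson_clean = pearson(xs, ys)
pearson_outlier = pearson(xs_outlier, ys_outlier)
spearman_outlier = spearman(xs_outlier, ys_outlier)

print(f"Pearson, no outlier:    {pearson_clean:+.2f}")
print(f"Pearson, one outlier:   {pearson_outlier:+.2f}")
print(f"Spearman, one outlier:  {spearman_outlier:+.2f}")
```

In this setup a single point out of 201 is enough to push Pearson's coefficient close to 1, while the rank-based coefficient stays near zero. Which measure is appropriate still depends on the domain; the sketch only shows how differently the two react to the same data.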
However, all correlation analysis is by nature limited in explaining complex relationships, because it attempts to simplify the messiness of the real world. In many cases, the analysis can be ‘gamed’ or deceived simply by the way the data is collected and described. For example, say an event has two causes, weather and altitude, but your data records weather as several fields: temperature, condition, chance of rain, etc. Your correlation analysis will ‘split’ the importance of weather among those fields, so altitude may appear to be the most important factor in most model predictions. If weather were instead recorded as a single field, the opposite effect might be observed.
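The weather and altitude example can be simulated in a few lines. The sketch below is built on assumptions chosen purely for illustration: all fields are numeric and independent, "weather" is the normalized sum of three component fields, and the event depends slightly more on weather (weight 1.0) than on altitude (weight 0.8). None of these numbers come from real data.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)
n = 5000

# Hypothetical, independent input fields (all numeric for simplicity).
temperature = [rng.gauss(0, 1) for _ in range(n)]
condition = [rng.gauss(0, 1) for _ in range(n)]
rain_chance = [rng.gauss(0, 1) for _ in range(n)]
altitude = [rng.gauss(0, 1) for _ in range(n)]

# Assumed ground truth: "weather" combines the three fields (unit variance),
# and the event depends a little more on weather than on altitude.
weather = [
    (t + c + r) / 3 ** 0.5
    for t, c, r in zip(temperature, condition, rain_chance)
]
event = [w + 0.8 * a + rng.gauss(0, 0.1) for w, a in zip(weather, altitude)]

# View 1: weather recorded as three separate fields.
split_view = {
    "temperature": pearson(temperature, event),
    "condition": pearson(condition, event),
    "rain_chance": pearson(rain_chance, event),
    "altitude": pearson(altitude, event),
}

# View 2: weather recorded as a single field.
single_view = {
    "weather": pearson(weather, event),
    "altitude": pearson(altitude, event),
}

top_split = max(split_view, key=lambda name: abs(split_view[name]))
top_single = max(single_view, key=lambda name: abs(single_view[name]))

print("Top factor with weather split into fields:", top_split)
print("Top factor with weather as a single field:", top_single)
```

The underlying process is identical in both views; only the way weather is described changes, and with it the factor that ranks first.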
Statistical tools that predate neural networks can already analyze which factors correlate with an outcome (e.g. information gain for decision trees). Modern attempts at correlation analysis for complex neural networks include LIME, an algorithm that perturbs the input data to determine which input features the model’s output is most sensitive to. As mentioned above, these tools all carry both theoretical and practical caveats.
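The perturbation idea can be illustrated with a deliberately simplified sketch. This is not the LIME algorithm itself, which fits a weighted local surrogate model around the instance being explained; it only nudges one input at a time and measures how much the output moves. The `black_box` model and the `perturbation_sensitivity` helper are hypothetical and exist only for this example.

```python
import random
from statistics import mean


def black_box(features):
    """Hypothetical model: strong effect of x0, weaker effect of x1, x2 ignored."""
    x0, x1, x2 = features
    return 3.0 * x0 + 0.5 * x1 ** 2


def perturbation_sensitivity(model, instance, scale=0.1, samples=1000, seed=0):
    """Mean absolute change in the model output when one input is nudged.

    Each feature is perturbed on its own with small Gaussian noise while
    the other features stay fixed at their original values.
    """
    rng = random.Random(seed)
    baseline = model(instance)
    sensitivities = []
    for j in range(len(instance)):
        changes = []
        for _ in range(samples):
            perturbed = list(instance)
            perturbed[j] += rng.gauss(0, scale)
            changes.append(abs(model(perturbed) - baseline))
        sensitivities.append(mean(changes))
    return sensitivities


instance = [1.0, 1.0, 1.0]
sensitivity = perturbation_sensitivity(black_box, instance)

for name, value in zip(["x0", "x1", "x2"], sensitivity):
    print(f"{name}: {value:.3f}")
```

The resulting ranking is local: it describes the model's behavior around this one instance and can change elsewhere in the input space, which is one of the practical caveats of perturbation-based explanations.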
- Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing by Justin Matejka & George Fitzmaurice