Simpson's Paradox
2000/01/01
I have been aware of Simpson's paradox since 2000. Although well known to
statisticians and machine learning researchers, it is practically ignored
in business intelligence systems that are based on aggregated numbers (key-figure systems, German: Kennzahlensysteme), and practically all existing systems fall into this category.
Simpson's paradox arises when data at different aggregation levels yield apparently contradictory interpretations. Consider the following results from a hypothetical medical experiment as a common example:
Sex | Treatment | #Success | #Failure | %Success |
Male | 1 | 60 | 20 | 75% |
Male | 2 | 100 | 50 | 67% |
Female | 1 | 40 | 80 | 33% |
Female | 2 | 10 | 30 | 25% |
It seems clear that Treatment 1 is slightly better, since its success rate (%Success) is higher for both men and women.
However, if we combine the results of both sexes into one table, it looks like this:
Treatment | #Success | #Failure | %Success |
1 | 100 | 100 | 50% |
2 | 110 | 80 | 58% |
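Suddenly it seems as if Treatment 2 were slightly better, contrary to our previous finding. That is Simpson's paradox. The reversal arises because the groups are unevenly sized: Treatment 2 was given mostly to men (150 men versus 40 women), who have higher success rates under either treatment, while Treatment 1 was given mostly to women (120 women versus 80 men). A decision based on the last table would be wrong because in this case the summation discards significant information.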
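The arithmetic behind both tables can be checked with a few lines of Python. This is only a minimal sketch: the counts are copied from the tables above, while the names `counts`, `success_rate`, `per_group`, `totals` and `aggregated` are illustrative choices and not part of any BI system.

```python
# Counts copied from the tables above: (sex, treatment) -> (successes, failures)
counts = {
    ("Male", 1): (60, 20),
    ("Male", 2): (100, 50),
    ("Female", 1): (40, 80),
    ("Female", 2): (10, 30),
}


def success_rate(successes, failures):
    """Share of successes among all cases."""
    return successes / (successes + failures)


# Success rate for every (sex, treatment) combination
per_group = {key: success_rate(*cell) for key, cell in counts.items()}

# Sum successes and failures over both sexes, per treatment
totals = {}
for (sex, treatment), (s, f) in counts.items():
    ts, tf = totals.get(treatment, (0, 0))
    totals[treatment] = (ts + s, tf + f)

# Success rate per treatment after aggregation
aggregated = {t: success_rate(*cell) for t, cell in totals.items()}

for (sex, treatment), rate in per_group.items():
    print(f"{sex:6} treatment {treatment}: {rate:.0%}")
for treatment, rate in aggregated.items():
    print(f"All    treatment {treatment}: {rate:.0%}")
```

Within each sex Treatment 1 has the higher rate, yet after summation Treatment 2 comes out ahead, which is exactly the reversal described above.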
Given comprehensive real-life data from an active OLAP system (the aggregation
hierarchy for all attributes, a data snapshot at the highest level of detail,
and a description of how the system is used for strategic decisions), I
would be able to test whether the system is susceptible to Simpson's paradox,
determine the extent of that susceptibility, and give examples from the decision
processes where it leads to counterintuitive results. In this way the BI system can be made more robust against wrong decisions, and you would contribute to a better understanding of the paradox.
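One way such a test might look, as a rough sketch only: for a given attribute of the hierarchy, compare the ordering of two options in every subgroup with their ordering in the aggregate. The function `is_reversed` and the row fields `option`, `successes` and `failures` are hypothetical names chosen for illustration; a real OLAP system would supply this data through its own query interface, and the sketch assumes every row has a nonzero number of cases.

```python
from collections import defaultdict


def rate(cell):
    """Success rate of a [successes, failures] pair."""
    successes, failures = cell
    return successes / (successes + failures)


def is_reversed(rows, attribute, a, b):
    """True if option a beats b in every subgroup of `attribute` but loses
    in the aggregate, or the other way round."""
    by_group = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    total = defaultdict(lambda: [0, 0])
    for row in rows:
        # Add the row to its subgroup cell and to the overall cell
        for cell in (by_group[row[attribute]][row["option"]], total[row["option"]]):
            cell[0] += row["successes"]
            cell[1] += row["failures"]
    # Rate difference (a minus b) in each subgroup that contains both options
    diffs = [rate(g[a]) - rate(g[b]) for g in by_group.values() if a in g and b in g]
    if not diffs:
        return False
    overall = rate(total[a]) - rate(total[b])
    a_wins_everywhere = all(d > 0 for d in diffs)
    b_wins_everywhere = all(d < 0 for d in diffs)
    return (a_wins_everywhere and overall < 0) or (b_wins_everywhere and overall > 0)


# The hypothetical medical experiment from the tables above
rows = [
    {"sex": "Male", "option": 1, "successes": 60, "failures": 20},
    {"sex": "Male", "option": 2, "successes": 100, "failures": 50},
    {"sex": "Female", "option": 1, "successes": 40, "failures": 80},
    {"sex": "Female", "option": 2, "successes": 10, "failures": 30},
]
print(is_reversed(rows, "sex", 1, 2))
```

Running such a check over every attribute and every level of the aggregation hierarchy would give a first indication of where aggregated key figures contradict the detailed data. It only flags complete reversals, so it says nothing about weaker distortions.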