What Data Can You Trust?
In our post Belief or Facts, we encourage discussion based on facts. But can we trust the data that we see? Is it accurate? Is it complete? Even when there is no intent to distort or deceive, data can be misleading due to the complexities of gathering and interpreting it. Analysis by Deloitte found that more than two-thirds of survey respondents stated that the third-party data about them was only 0 to 50 percent correct. The COVID-19 pandemic has also exposed challenges in collecting and analyzing data. Are increases in cases due to improved testing availability? Are deaths classified based on positive tests or on symptoms? Are excess deaths a better measure of the pandemic’s impact?
One of the data sources we use, sentiment analysis, is a valuable tool, but it requires a rigorous methodology to produce meaningful results:
- Sources must be selected to ensure balanced results. Twitter generates a huge amount of data, but 80 percent of tweets come from 10 percent of users, i.e., about 2 percent of the population, potentially drowning out other opinions. To counter these effects, we curate a broad range of sources beyond the obvious social media platforms to maximize the number of individual posters sampled, and we normalize results to balance data volumes.
- Commonly used sentiment analysis algorithms perform reasonably well at determining whether a piece of text is generally positive or negative. However, they fail to accurately link the sentiment expressed to a specific topic. If every comment were guaranteed to be about exactly one thing, that wouldn’t matter much, but in the real world, people often touch on multiple topics in the same sentence. Unless sentiment analysis can assign sentiment to the correct topic, it produces a meaningless data soup. The proprietary technology we use can accurately analyze tens of thousands of interrelated topics.
- Even when the sentiment analysis of text is accurate, the results must be interpreted with care. The Harvard Business Review found that online reviews tend to over-represent extreme views. We counter this problem by sampling widely. For example, posts on a political candidate’s Facebook page attract passionate supporters and visceral haters, but related discussion elsewhere about news items and policies can provide more nuanced views. Similarly, we don’t rely on absolute values for opinions: trends are more reliable indicators, and we further validate data by comparing it with other surveys.
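The volume-balancing idea in the first point above can be sketched in a few lines. This is an illustrative toy, not our actual pipeline: the source names and sentiment scores are hypothetical, and real normalization involves far more than a per-source average. It simply shows how averaging within each source first keeps a high-volume source like Twitter from drowning out smaller ones.

```python
# Hypothetical sketch of per-source normalization. Scores range from
# -1.0 (negative) to 1.0 (positive); sources and values are made up.

def pooled_mean(posts):
    """Naive average over all posts: high-volume sources dominate."""
    scores = [score for _source, score in posts]
    return sum(scores) / len(scores)

def source_balanced_mean(posts):
    """Average sentiment within each source first, then across sources,
    so every source contributes equally regardless of its post volume."""
    by_source = {}
    for source, score in posts:
        by_source.setdefault(source, []).append(score)
    per_source = [sum(s) / len(s) for s in by_source.values()]
    return sum(per_source) / len(per_source)

# Nine negative tweets vs. one positive forum post and one positive review.
posts = [("twitter", -1.0)] * 9 + [("forum", 1.0), ("reviews", 1.0)]
print(pooled_mean(posts))           # ≈ -0.64: dominated by Twitter volume
print(source_balanced_mean(posts))  # ≈ 0.33: each source weighted equally
```

The same opinions produce opposite headline numbers depending on the weighting, which is why raw pooled averages can mislead when source volumes differ by orders of magnitude.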
Data analysis in a complex world will never be perfect, and some people are deliberately trying to deceive. The more we question the data behind the facts we are fed, the closer to the truth we will get.