Bad Data

Empower planners and individual travellers in the developing world to make smarter (i.e., safer, greener, cheaper) mobility choices to reduce alarming historical trends in injuries, environmental impacts and economic costs.

Principal Investigator
Prof Scott Ferson

Machine learning tools work well when data are abundant. Many statistical methods were invented for situations in which sample size is the limiting factor. But not all uncertainty comes from small sample sizes. Poor or variable precision, missing values, non-numerical information, dubious provenance, and contamination by outliers, errors, and lies are just a few of the causes of bad data.

Some basic questions about bad data seem not to have clear answers:

  1. When investing in empirical effort, should we get more or better data?
  2. Is it always smart to combine good data with bad data?
  3. What can we do if it is clear that our data were not collected randomly?
  4. What can be done with ludicrously small samples like n=2 or even n=1?
  5. If data aren't missing "at random", can we still draw any conclusions?
  6. Is it prudent to ignore, as statisticians so often do, the reported precision statements associated with measurements?