Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analyzing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses.[1] It was so named by John Tukey to contrast with confirmatory data analysis, the term used for the set of ideas about hypothesis testing, p-values, confidence intervals, and so on, which formed the key tools in the arsenal of practicing statisticians at the time.
EDA development
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
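The bias from testing a hypothesis on the same data that suggested it can be illustrated with a small simulation. The sketch below is an illustration only, not something taken from Tukey: the function names, the group count, the sample size, and the use of a z-test with known unit variance are all assumptions chosen for simplicity. Every group is drawn from the same distribution, so any "significant" difference is a false positive.

```python
import random
import statistics


def z_stat(a, b):
    """Two-sample z statistic for equal-sized groups, assuming unit variance."""
    n = len(a)
    return (statistics.fmean(a) - statistics.fmean(b)) / (2.0 / n) ** 0.5


def simulate_rejection_rates(trials=2000, groups=10, n=20, seed=0):
    """Return (same-data, fresh-data) false-positive rates under the null.

    All names and parameter values here are hypothetical choices for the demo.
    """
    rng = random.Random(seed)
    same_hits = 0
    fresh_hits = 0
    for _ in range(trials):
        # Null world: every group comes from the same N(0, 1) distribution.
        data = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(groups)]
        means = [statistics.fmean(g) for g in data]

        # Exploratory step: the data suggest comparing the two extreme groups.
        hi = max(range(groups), key=means.__getitem__)
        lo = min(range(groups), key=means.__getitem__)

        # Confirmatory test on the SAME data that suggested the hypothesis.
        if abs(z_stat(data[hi], data[lo])) > 1.96:
            same_hits += 1

        # Confirmatory test on FRESH data for the same two groups.
        fresh_hi = [rng.gauss(0, 1) for _ in range(n)]
        fresh_lo = [rng.gauss(0, 1) for _ in range(n)]
        if abs(z_stat(fresh_hi, fresh_lo)) > 1.96:
            fresh_hits += 1
    return same_hits / trials, fresh_hits / trials


if __name__ == "__main__":
    same_rate, fresh_rate = simulate_rejection_rates()
    print(f"same data:  {same_rate:.3f}")
    print(f"fresh data: {fresh_rate:.3f}")
```

Under these assumptions the same-data rejection rate should come out well above the nominal 5% level, while the fresh-data rate should stay close to it, which is the separation between exploration and confirmation that Tukey argued for.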
The objectives of EDA are to:
- Suggest hypotheses about the causes of observed phenomena
- Assess assumptions on which statistical inference will be based
- Support the selection of appropriate statistical tools and techniques
- Provide a basis for further data collection through surveys or experiments
Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.[2]
Techniques
Many tools are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.[3]
The principal graphical techniques used in EDA are:
The principal quantitative techniques are:
Graphical and quantitative techniques are:
History
Many EDA ideas can be traced back to earlier authors, for example:
- Francis Galton emphasized order statistics and quantiles.
- Arthur Bowley used precursors of the stemplot and five-number summary. Bowley actually used a "seven-figure summary" comprising the extremes, deciles, and quartiles along with the median; in his Elementary Manual of Statistics (3rd edn., 1920), p. 62, he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions".
- Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).
The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.
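The seven-figure summary attributed to Bowley above can be sketched in a few lines. This is a minimal illustration, not Bowley's own procedure: it assumes the "two deciles" are the first and ninth, and it uses the "inclusive" interpolation rule of Python's statistics.quantiles, which is only one of several quantile conventions. The function name seven_figure_summary is hypothetical.

```python
import statistics


def seven_figure_summary(values):
    """Return minimum, deciles, quartiles, median and maximum of the data.

    Assumptions: the two deciles are the 1st and 9th, and quantiles use
    the "inclusive" interpolation method.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    deciles = statistics.quantiles(data, n=10, method="inclusive")  # 9 cut points
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")  # 3 cut points
    return {
        "minimum": data[0],
        "lower decile": deciles[0],
        "lower quartile": q1,
        "median": statistics.median(data),
        "upper quartile": q3,
        "upper decile": deciles[8],
        "maximum": data[-1],
    }


if __name__ == "__main__":
    for name, value in seven_figure_summary(range(1, 101)).items():
        print(f"{name:>15}: {value}")
```

Dropping the two deciles from this result leaves the five-number summary mentioned in the same list entry.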
Software
- CMU-DAP (Carnegie-Mellon University Data Analysis Package, FORTRAN source for EDA tools with English-style command syntax, 1977).
- Data Desk, an EDA package from Data Description of Ithaca, New York.
- "ViSta: The Visual Statistics System" by F. W. Young, a free statistical system featuring highly dynamic, interactive visualizations for EDA.
- Fathom (for high-school and intro college courses).
- ioGAS, an EDA package from ioAnalytics with GIS software links, used in the mining and exploration industries.
- JMP, an EDA package from SAS Institute.
- KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse (knime.org).
- LiveGraph (free real-time data series plotter).
- Mondrian, a statistical data visualization tool with a link to R.
- SOCR provides a large number of free Internet-accessible tools for EDA.
- TinkerPlots (for upper elementary and middle school students).
See also
- Anscombe's quartet, on the importance of exploration
- Predictive analytics
- Structured data analysis (statistics)
Bibliography
- Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 0-471-09776-4.
- Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 0-471-09777-2.
- Tukey, John Wilder (1977). Exploratory Data Analysis. Addison-Wesley. ISBN 0-201-07616-0.
- Velleman, P F & Hoaglin, D C (1981). Applications, Basics and Computing of Exploratory Data Analysis. ISBN 0-87150-409-X.
References
- [1] "And roughly the only mechanism for suggesting questions is exploratory. And once they’re suggested, the only appropriate question would be how strongly supported are they and particularly how strongly supported are they by new data. And that’s confirmatory.", A conversation with John W. Tukey and Elizabeth Tukey, Luisa T. Fernholz and Stephan Morgenthaler, Statistical Science, Volume 15, Number 1 (2000), 79-94.
- [2] Konold, C. (1999). Statistics goes to school. Contemporary Psychology, 44(1), 81-82.
- [3] "Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques, and should be so taught.", John W. Tukey, We need both exploratory and confirmatory, The American Statistician, 34(1), (Feb., 1980), pp. 23-25.
- Leinhardt, G., Leinhardt, S., Exploratory Data Analysis: New Tools for the Analysis of Empirical Data, Review of Research in Education, Vol. 8 (1980), pp. 85-157.
- Theus, M., Urbanek, S. (2008), Interactive Graphics for Data Analysis: Principles and Examples, CRC Press, Boca Raton, FL, ISBN 978-1-58488-594-8
External links
Software
- DataDesk (free-to-try commercial EDA software for Mac and Windows)
- Experimental Data Analyst Mathematica application package for EDA
- FactoMineR (free exploratory multivariate data analysis software linked to R)
- GGobi (free interactive multivariate visualization software linked to R)
- KNIME (Konstanz Information Miner), an open-source data exploration platform
- MANET (free Mac-only interactive EDA software)
- Miner3D (EDA and visualization software)
- Mondrian (free interactive software for EDA)
- Orange (free component-based software for interactive EDA and machine learning)
- The Unscrambler (free-to-try commercial MVA software for Windows)
- Visalix (free interactive web application for EDA)
- ViSta (free interactive software based on Xlisp-Stat for EDA)
- Visulab (free interactive software for high dimensional non-spatial / non-temporal data with interactive EDA and visualization)
- VisuMap (EDA software for high dimensional non-linear data)
- XLisp-Stat (free software and Lisp based EDA development framework for Mac, PC and X Window)
Notes
- Notes on EDA from Andrew Zieffler