# Data analysis

### From Wikipedia, the free encyclopedia

**Data analysis** is a process of gathering, modeling, and transforming **data** with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, data analysis is often divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term *data analysis* is sometimes used as a synonym for data modeling, which is unrelated to the subject of this article.


## Nuclear and particle physics

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. They are then processed, in a step usually called *data reduction*, to apply calibrations and to extract physically significant information. Especially in large particle physics experiments, data reduction is most often an automatic, batch-mode operation carried out by purpose-written software. The resulting data *n-tuples* are then scrutinized by the physicists, who use specialized software tools such as ROOT or PAW to compare the results of the experiment with theory.

The theoretical models are often difficult to compare directly with the results of the experiments, so they are instead used as input to Monte Carlo simulation software such as Geant4, which predicts the response of the detector to a given theoretical event, producing **simulated events** that are then compared to experimental data.
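This compare-to-simulation workflow can be illustrated with a toy example, using plain NumPy in place of a full detector-simulation package such as Geant4 (the lifetime, resolution, and sample size here are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "theory": particle decay times follow an exponential with lifetime tau.
tau = 2.0
true_times = rng.exponential(tau, size=100_000)

# Toy "detector response": Gaussian measurement smearing with fixed resolution.
resolution = 0.1
simulated = true_times + rng.normal(0.0, resolution, size=true_times.size)
```

In practice the histogram of `simulated` would be compared bin by bin with the histogram of the experimental measurements; a mismatch points either at the theoretical model or at the detector description.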

See also: Computational physics.

## Social sciences

Qualitative data analysis (QDA), or qualitative research, is the non-quantitative analysis of data from non-numerical sources such as words, photographs, and observations.

## Phases in data analysis

The statistical analysis of data is a process with several phases, each with its own goal.

### Data cleaning

During data cleaning, erroneous entries are inspected and corrected where possible. In some cases it is easy to substitute suspect data with the correct values. However, when it is unclear what caused the erroneous data or what should replace it, no subjective replacements should be made, so that the quality of the data is preserved. Furthermore, no information should be thrown away at any stage of the data cleaning phase: when altering variables, the original values should be kept in a duplicate dataset or under a different variable name, so that the information always remains retrievable.^{[1]}
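A minimal sketch of this keep-the-originals convention, assuming pandas and a hypothetical `age` variable in which `-1` is a known error code:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 45, -1, 31]})  # -1 marks an erroneous entry

# Keep the original values under a different variable name ...
df["age_raw"] = df["age"]

# ... then blank out the suspect value instead of guessing a replacement.
df["age"] = df["age"].where(df["age"] >= 0)
```

The suspect entry becomes missing rather than being replaced by a subjective guess, and `age_raw` keeps the untouched values retrievable at every later stage.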

### Initial data analysis

The initial data analysis uses descriptive statistics to answer the following four questions^{[1]}:

- What is the quality of the data?
- What is the quality of the measurements?
- Did the implementation of the study fulfill the intentions of the research design?
- What are the characteristics of the data sample?

Each of these questions is addressed in turn below.

#### The quality of the data

The quality of the data can be assessed in several ways. First, the distribution of each variable before data cleaning is compared with its distribution after cleaning, to check whether cleaning has had unwanted effects on the data. Second, the missing observations in the data are analyzed to see whether they are missing at random and whether some form of imputation is needed. Third, extreme observations are examined to see whether they distort the distribution; if so, robust techniques can be applied.
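A minimal sketch of the first two checks, with made-up values in which `99.0` is an impossible entry that cleaning has set to missing:

```python
import numpy as np

# Hypothetical variable before and after data cleaning.
before = np.array([2.1, 2.4, 99.0, 2.2, np.nan, 2.3, 2.5])
after = np.array([2.1, 2.4, np.nan, 2.2, np.nan, 2.3, 2.5])

def summarize(x):
    """Basic distribution summary, ignoring missing values."""
    valid = x[~np.isnan(x)]
    return {"n": valid.size, "mean": valid.mean(),
            "std": valid.std(ddof=1), "missing": int(np.isnan(x).sum())}
```

A large shift in the mean or standard deviation between `summarize(before)` and `summarize(after)` shows how strongly the suspect entry influenced the distribution, while the change in the missing count feeds the missing-at-random check.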

#### The quality of the measurements

Even when the quality of the measurement instruments is not the main focus of the research, it can be checked during initial data analysis. One way to assess the quality of a measurement instrument is to perform an analysis of homogeneity (internal consistency). A homogeneity index such as Cronbach's α gives an indication of the reliability of the instrument.
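Cronbach's α can be computed directly from the item and total-score variances, α = k/(k−1) · (1 − Σ varᵢ / var_total); a minimal sketch with made-up item scores:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Four respondents answering three items on a similar scale.
scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]
alpha = cronbach_alpha(scores)  # ≈ 0.97: the items are highly consistent
```

Values near 1 indicate that the items measure the same underlying construct; conventionally, values below about 0.7 prompt a closer look at the instrument.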

#### The implementation of the design

In many cases, checking whether the randomization procedure has worked is the starting point for analyzing the implementation of the design. This can be done by checking whether the baseline variables are equally distributed across groups. Other ways of checking the implementation of the design are manipulation checks and the analysis of nonresponse and dropout.
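One possible balance check is a two-sample t-test on a baseline covariate; a sketch assuming SciPy is available and using made-up group values:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical baseline covariate (e.g. age) in two randomized groups.
group_a = np.array([34, 41, 29, 37, 45, 31], dtype=float)
group_b = np.array([36, 39, 30, 44, 28, 35], dtype=float)

t_stat, p_value = ttest_ind(group_a, group_b)
# A large p-value gives no evidence of a baseline imbalance,
# which is what a successful randomization should produce.
```

A small p-value on such a check would suggest the groups differ systematically at baseline, and would motivate the corrective steps described in the next subsection.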

#### Characteristics of the data sample

In this step, the findings of the initial data analysis are documented and possible corrective actions are taken. For instance, when the distribution of a variable is not normal, the data may need to be transformed or categorized. Furthermore, a decision should be made on how to handle missing data and outliers. If the randomization procedure seems to be defective, propensity scores can be calculated and included in the main analyses as a covariate.
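For the transformation case, a log transform is one common remedy for a strongly right-skewed variable; a minimal sketch with made-up values (the helper computes the standardized third moment):

```python
import numpy as np

def skewness(x):
    """Skewness as the mean standardized cubed deviation."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

raw = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)  # right-skewed
logged = np.log(raw)  # doubling values become evenly spaced
```

Because `raw` doubles at each step, its logarithm is evenly spaced and the skew vanishes; in a real analysis the transformed variable would then be carried into the main analyses.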

## See also

Wikiversity has learning materials about *Data analysis*.

## References

- ^ ^{a} ^{b} Adèr, H. J.; Mellenbergh, G. J.; Hand, D. J. (2008), "Chapter 14: Phases and initial steps in data analysis", *Advising on Research Methods: A Consultant's Companion*, Huizen, the Netherlands: Johannes van Kessel Publishing, p. 336, ISBN 978-90-79418-02-2


## Further reading

- ASTM International (2002). *Manual on Presentation of Data and Control Chart Analysis*, MNL 7A, ISBN 0803120931
- Godfrey, A. B. (1999). *Juran's Quality Handbook*, ISBN 007034003
- Lewis-Beck, Michael S. (1995). *Data Analysis: An Introduction*, Sage Publications Inc, ISBN 0803957726
- NIST/SEMATECH (2008). *Handbook of Statistical Methods*
- Pyzdek, T. (2003). *Quality Engineering Handbook*, ISBN 0824746147
- Richard Veryard (1984). *Pragmatic Data Analysis*. Oxford: Blackwell Scientific Publications. ISBN 0632013117