Data Science Introduction
Notes from the Johns Hopkins Coursera Data Science Specialisation
Types of Data Science Questions
- Descriptive
- Describe a set of data
- Commonly applied to census data (for example, describing the population)
- It's not trying to infer anything
- Exploratory
- Find relationships you don’t know about
- Exploratory analysis alone should not be used for generalising / predicting
- Correlation does not imply causation
- Inferential
- Use a relatively small sample of data to say something about a bigger population
- Estimate both the quantity of interest and the uncertainty around that estimate
- Predictive
- To use the data on some object to predict values for another object
- If X predicts Y it does not mean that X causes Y
- More data and simple models tend to work well
- Prediction is hard - especially when it involves the future
- Causal
- To find out what happens to one variable when you make another variable change
- Generally done through randomised studies
- Generally an average effect
- Mechanistic
- Understand the exact changes in variables that lead to changes in other variables for individual objects
- Incredibly hard to infer
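As a rough illustration of the inferential question type above, the sketch below uses a small sample to estimate a population mean and attaches an approximate uncertainty to that estimate. The population is simulated, and the function name `estimate_mean` and all numbers are assumptions made for the example, not part of the course material.

```python
import math
import random
import statistics

def estimate_mean(sample):
    """Return (estimate, standard_error) for the population mean."""
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, se

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm
population = [rng.gauss(170, 10) for _ in range(100_000)]

# The "relatively small sample" used to say something about the population
sample = rng.sample(population, 200)

estimate, se = estimate_mean(sample)
# Approximate 95% interval, using the normal approximation
low, high = estimate - 1.96 * se, estimate + 1.96 * se
print(f"estimate={estimate:.1f}, approx. 95% interval=({low:.1f}, {high:.1f})")
```

The interval is the "uncertainty around the estimate": a larger sample would narrow it, but the estimate would still only describe the population the sample was drawn from.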
What is Data?
“Data are values of qualitative or quantitative variables, belonging to a set of items”
- Set of items is sometimes called the population.
- Data is the second most important thing, the question being the most important.
- With that said, the data may limit the question later.
Experimental Design
- Know about the analysis plan
- Plan for data and code sharing
- Example data sharing plan
- Formulate your question in advance
- Confounding
- Pay attention to other variables that could be causing an observed correlation. A correlation between two variables does not mean that one causes the other (correlation does not imply causation)
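The confounding point can be sketched with a small simulation. Here a hypothetical third variable `z` drives both `x` and `y`, which have no direct effect on each other. The variable names, effect sizes, and the helper `pearson` are assumptions made for illustration only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000

# z is the confounder: it influences both x and y
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

# x and y look related even though neither causes the other
raw = pearson(x, y)

# Remove z's contribution (possible here only because we simulated it)
x_adj = [xi - zi for xi, zi in zip(x, z)]
y_adj = [yi - zi for yi, zi in zip(y, z)]
adjusted = pearson(x_adj, y_adj)

print(f"correlation ignoring z: {raw:.2f}")
print(f"correlation after removing z: {adjusted:.2f}")
```

In real data the confounder's effect is not known exactly, which is one reason randomised studies are generally used for causal questions.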