Raw data is a group of values that you can extract from a source. It becomes useful once it is processed to reveal patterns in what was extracted.
These patterns, also referred to as information, help you to interpret the data, make predictions, and identify unexpected changes in the future.
This information is then processed into knowledge. Knowledge is a large, organized collection of persistent and extensive information and experience that can be used to describe and predict phenomena in the real world.
Data analysis is the process by which you convert data into information and, thereafter, knowledge. Data analytics combines data analysis with making predictions.
There are several data analysis techniques available to make sense of data. One of them is statistics, which uses mathematical techniques on datasets.
Statistics is the science of collecting and analyzing a large amount of data to identify the characteristics of the data and its subsets. For example, you may want to study the medical history of a country to identify the most common causes of illness-related fatality. You can also dive deeper into some subgroups, such as people from different geographic areas, to identify whether there are specific patterns for people from each area.
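As a minimal sketch of this kind of subgroup analysis, the Python snippet below uses pandas on an invented table of fatality records; the column names and values are purely illustrative assumptions:

```python
import pandas as pd

# Invented fatality records: one row per recorded case.
records = pd.DataFrame({
    "area": ["north", "north", "south", "south", "south", "east"],
    "cause": ["heart disease", "stroke", "heart disease", "diabetes", "heart disease", "stroke"],
})

# Most common causes of illness-related fatality across the whole country.
print(records["cause"].value_counts())

# The same breakdown within each geographic subgroup.
print(records.groupby("area")["cause"].value_counts())
```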
Statistics is performed on datasets. Different data inside datasets have different characteristics and require different methods of processing. Some types of data, such as names and labels, are qualitative, which means they provide descriptive information. Others, such as counts and amounts, are quantitative, which means you can perform numerical operations, such as addition or multiplication, on their values.
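As a quick illustration (the values below are made up), qualitative values can be grouped and compared, while quantitative values support arithmetic:

```python
# Qualitative (descriptive) values: useful for grouping and comparison, not arithmetic.
smoking_status = "smoker"        # a label
country = "Canada"               # a name

# Quantitative values: numerical operations such as addition and multiplication apply.
age_years = 45
weight_kg = 80.5

age_in_ten_years = age_years + 10       # meaningful arithmetic
weight_in_pounds = weight_kg * 2.20462  # meaningful arithmetic

# Adding two labels would only concatenate text, not produce a meaningful quantity.
print(smoking_status, country, age_in_ten_years, round(weight_in_pounds, 1))
```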
Statistics can be further divided into two subcategories: descriptive statistics and inferential statistics.
Descriptive statistics are used to describe a collection of data. For example, the average age of people in a country is a descriptive statistic that describes one aspect of the country's residents.
Descriptive statistics on a single variable in a dataset are called univariate analysis, while descriptive statistics that look at two or more variables at the same time are called multivariate analysis. In particular, statistics that look at exactly two variables are called bivariate analysis.
The average age of a country's residents is an example of univariate analysis, while an analysis examining the interaction between GDP per capita, healthcare spending per capita, and age is a multivariate analysis.
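A small Python sketch can make this distinction concrete. The dataset below is invented, and the column names are assumptions chosen only for illustration:

```python
import pandas as pd

# Invented country-level dataset; the column names are illustrative assumptions.
countries = pd.DataFrame({
    "average_age": [38.5, 29.1, 44.2, 33.7],
    "gdp_per_capita": [42_000, 9_000, 51_000, 15_000],
    "healthcare_spending_per_capita": [4_100, 600, 5_600, 1_200],
})

# Univariate analysis: describe one variable at a time.
print(countries["average_age"].mean())

# Bivariate analysis: describe the relationship between two variables.
print(countries["average_age"].corr(countries["gdp_per_capita"]))

# Multivariate analysis: consider several variables together.
print(countries.corr())
```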
In contrast, inferential statistics allows datasets to be collected as a sample, or a small portion of measurements, from a larger group called a population. Inferential statistics are used to infer the properties of a population based on the properties of a sample.
For example, a survey of 10,000 people is a sample of the entire population of a country with 100 million people. Instead of collecting the age of every person in the country, you survey 10,000 people and use their average age as an estimate of the average age of the entire country.
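The sketch below simulates this idea in Python with a scaled-down, randomly generated population; the numbers are invented purely to show the mechanics of sampling:

```python
import random

random.seed(0)

# A scaled-down, simulated population of ages; in reality the full population is
# rarely measured, which is exactly why a sample is taken.
population = [random.randint(0, 90) for _ in range(1_000_000)]

# Survey 10,000 people drawn at random from that population.
sample = random.sample(population, 10_000)

sample_mean = sum(sample) / len(sample)
population_mean = sum(population) / len(population)

# The sample mean is used to infer the (normally unknown) population mean.
print(round(sample_mean, 2), round(population_mean, 2))
```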
The rest of this section covers the basic mathematical techniques of univariate and bivariate analysis and how to use them to describe and understand a given dataset. You will be introduced to the following methods, in this order:
Univariate Analysis Techniques
- Data Frequency Distribution
- Quantiles
- Central Tendency
- Dispersion
Bivariate Analysis Techniques
- Scatterplots
- Linear Trend Analysis and Pearson Correlation Coefficient
- Interpreting and Analyzing the Correlation Coefficient
- Time Series Data
Start with a simple question: what is data? Data is the recorded description or measurements of something in the real world. For example, a list of heights is data; that is, height is a measure of the distance between a person's head and their feet. The data is used to describe a unit of observation. In the case of these heights, a person is a unit of observation.
As you can imagine, there is a lot of data you can gather to describe a person—including their age, weight, and smoking preferences.
One or more of these measurements used to describe a specific unit of observation is called a data point, and each measurement in a data point is called a variable (often referred to as a feature). When you have several data points together, you have a dataset.
For example, you may have Person A, who is a 45-year-old smoker, and Person B, who is a 24-year-old non-smoker. Here, age is a variable; the age of Person A is one measurement and the age of Person B is another, with 45 and 24 being the measured values. A compilation of data points with measurements such as the ages, weights, and smoking habits of various people is called a dataset.
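In code, Person A and Person B could be represented as follows; this is a minimal sketch, and the dictionary structure is just one of many possible representations:

```python
# Each dictionary is a data point describing one unit of observation (a person);
# each key (age, smoker) is a variable, and each value is a measurement.
person_a = {"age": 45, "smoker": True}
person_b = {"age": 24, "smoker": False}

# A collection of data points forms a dataset.
dataset = [person_a, person_b]

for person in dataset:
    print(person["age"], person["smoker"])
```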
Data can be broken down into three main categories: structured, semi-structured, and unstructured.
Structured data has an atomic definition for all the variables, such as the data type, value range, and meaning for values. In many cases, even the order of variables is clearly defined and strictly enforced.
For example, the record of a student in a school registration card contains an identification number, name, and date of birth, each with a clear meaning and stored in order.
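A structured student record could be sketched in Python as shown below; the field names and values are illustrative assumptions, not taken from any particular school system:

```python
from dataclasses import dataclass
from datetime import date

# A structured record: every variable has a fixed type, meaning, and position.
@dataclass
class StudentRecord:
    identification_number: int
    name: str
    date_of_birth: date

record = StudentRecord(
    identification_number=10452,
    name="Jane Doe",
    date_of_birth=date(2006, 3, 14),
)
print(record)
```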
Unstructured data, on the other hand, does not have a definition as clear as structured data, and is thus harder to extract and parse. It may be a binary blob that comes from electronic devices, such as video and audio files. It may also be a collection of natural input tokens (words, emojis), such as social network posts and human speech.
Semi-structured data usually does not have a pre-defined format and meaning, but each of its measurement values is tagged with the definition of that measurement. For example, all houses have an address, but some may have a basement, a garage, or both. Owners may also add upgrades that could not be anticipated when the house's information was first recorded. All components in this data have clear definitions, but it is difficult to come up with a pre-defined list of all possible variables, especially those that may appear in the future. Thus, this house data is semi-structured.
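The house example could be represented as semi-structured, JSON-style records, where every value is tagged with its meaning but the set of tags varies from house to house; the addresses and fields below are invented for illustration:

```python
import json

# Semi-structured house records: each value is tagged with its meaning (the key),
# but the set of keys varies from house to house and may grow over time.
houses = [
    {"address": "12 Oak St", "basement": True},
    {"address": "7 Elm Ave", "garage": True, "solar_panels": True},
    {"address": "3 Pine Rd"},
]

print(json.dumps(houses, indent=2))
```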