One of the central concepts of data science is gaining insights from data. Statistics is an excellent tool for unlocking such insights in data. Statistics is a form of math, and it involves formulas, but it doesn’t have to be that scary even if you’ve never encountered it before.
Machine learning grew out of statistics. Many of the algorithms and models used in machine learning come from what is called statistical learning. Knowing some basic statistics is extremely helpful whether you are working deep in machine learning algorithms or just keeping up with the latest machine learning research.
We always have to fight for the right and useful datasets to solve the business problems presented to our team. It is important to choose the right dataset for data analytics or, for that matter, machine learning projects. Building a shared understanding of key terms and phrases such as “dataset” is an important part of establishing a data-driven culture. In this chapter we will talk about the importance of finding the right dataset and how to obtain datasets on which you can practice various concepts. Our intention is that when you move on to real-world projects, you carry over what you learned from these datasets, and that it helps you appreciate the importance of data.
DATASET
To understand what a dataset is, we must first discuss its components. A single row of data is called an instance. A dataset is a collection of instances that all share a common set of attributes. To solve a business problem, you will often use more than one dataset, each fulfilling a different role in the system. Data can come in many forms, but data analytics and machine learning models rely on four primary data types: numerical, categorical, time series, and text data.
Numerical data (or Quantitative data)
Numerical data, or quantitative data, is any form of measurable data such as your height, weight, or the cost of your phone bill. You can check whether a set of data is numerical by attempting to average the numbers or sort them in ascending or descending order. Counts that take whole-number values (e.g. 26 students in a class) are considered discrete, while measurements that can take any value within a range (e.g. a 3.6 percent interest rate) are considered continuous. Numerical data is not tied to any specific point in time; the values are simply raw numbers. Arithmetic operations can be performed on such data. Thus, it is meaningful to talk about:
2 * height
Cost_price + profit
Total_Cost / total_items
For example, the number of male students and the number of female students in a class are numerical data, and they can be added together to get the total number of students in the class.
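The arithmetic described above can be sketched in a few lines of Python. All values below are hypothetical and chosen only for illustration; the variable names mirror the expressions listed earlier.

```python
# Hypothetical values, used only to illustrate arithmetic on numerical data.
height_cm = 172.5        # continuous measurement
cost_price = 800.0       # continuous (currency amount)
profit = 150.0
total_cost = 950.0
total_items = 5          # discrete count
male_students = 14       # discrete count
female_students = 12     # discrete count

double_height = 2 * height_cm                    # 2 * height
selling_price = cost_price + profit              # Cost_price + profit
cost_per_item = total_cost / total_items         # Total_Cost / total_items
class_size = male_students + female_students     # total students in the class

print(double_height, selling_price, cost_per_item, class_size)
```

Each result is itself a meaningful number, which is what distinguishes numerical data from data that merely uses digits as labels.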
Numerical data is categorized into discrete data (which represents countable items) and continuous data (which represents measurements). Continuous data is further grouped into interval and ratio data. These data types are widely used for statistical analysis and research.
Discrete Data
Discrete data represents countable items, and its values can be grouped into a list that may be either finite or infinite. A practical example is counting the cups of water required to empty a bucket versus counting the cups of water required to empty an ocean: the former is countably finite, while the latter, although strictly finite, is so large that it behaves like a countably infinite list.
Data can be represented using various charts. Below is an example where the quarterly average collection of GST from 5 cities is captured in a stacked bar chart:
Continuous Data
This is a type of numerical data that represents measurements. Its values are described as intervals on the real number line rather than as counting numbers. For example, the Cumulative Grade Point Average (CGPA) in a 5-point grading system defines first-class students as those whose CGPA falls within 4.50 – 5.00, second class upper as 3.50 – 4.49, second class lower as 2.50 – 3.49, third class as 1.50 – 2.49, pass as 1.00 – 1.49 and fail as 0.00 – 0.99. A student may score 4.495, 2.125, 3.5 or any other possible number from 0 to 5. In this case, the continuous data is regarded as uncountably infinite: the range is bounded, but it contains more possible values than can be listed.
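As a rough sketch of how a continuous CGPA value maps onto these classes, the hypothetical function below treats each band's lower bound as inclusive. That is an assumption: the bands above are quoted to two decimals and do not say how a value such as 4.495 should be handled.

```python
def classify_cgpa(cgpa: float) -> str:
    """Map a continuous CGPA in [0, 5] to its degree class.

    Assumption: a band starts at its lower bound (inclusive), so any
    value below 4.50, such as 4.495, stays in the band underneath.
    """
    if not 0.0 <= cgpa <= 5.0:
        raise ValueError("CGPA must lie between 0 and 5")

    # (lower bound, label), checked from the highest band downwards
    bands = [
        (4.50, "first class"),
        (3.50, "second class upper"),
        (2.50, "second class lower"),
        (1.50, "third class"),
        (1.00, "pass"),
    ]
    for lower_bound, label in bands:
        if cgpa >= lower_bound:
            return label
    return "fail"


for score in (4.495, 2.125, 3.5):
    print(score, "->", classify_cgpa(score))
```

The input can be any real number in the range, while the output is one of six categories, which illustrates how continuous data is often binned for reporting.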
Continuous data may be subdivided into two types, namely interval data and ratio data.
Interval Data
This is a data type measured along a scale on which adjacent points are an equal distance apart. Interval data takes numerical values on which only addition and subtraction are meaningful.
For example, the temperature of a body measured in degrees Celsius or degrees Fahrenheit is regarded as interval data, because these scales do not have a true zero point. The temperature interval between 10 and 20 degrees Celsius is the same as the interval between 70 and 80 degrees Celsius. Examination marks and clock time are other good examples of an interval scale.
Trend analysis is a popular interval data analysis technique, used to draw trends and insights by capturing data over a certain period of time. In other words, a trend analysis on interval data is conducted by capturing data with an interval scale survey over multiple iterations, using the same question each time. Trends over several years can be evaluated by calculating the trend percentage: the current year's value divided by the base year's value, multiplied by 100. In the example below, let’s see the 5-year percentage trend in revenue (in USD) for Infosys.
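As a minimal sketch of the trend-percentage calculation, the following uses made-up revenue figures (they are not Infosys's actual revenue) and a hypothetical helper named `trend_percentages`. Each year's value is divided by the base year's value and multiplied by 100.

```python
def trend_percentages(values_by_year: dict, base_year: int) -> dict:
    """Return each year's value as a percentage of the base year's value."""
    base = values_by_year[base_year]
    if base == 0:
        raise ValueError("base year value must be non-zero")
    return {
        year: round(value / base * 100, 1)
        for year, value in sorted(values_by_year.items())
    }


# Made-up revenue figures for five years, for illustration only.
revenue = {2016: 100.0, 2017: 110.0, 2018: 125.0, 2019: 120.0, 2020: 150.0}

trend = trend_percentages(revenue, base_year=2016)
for year, pct in trend.items():
    print(year, pct)
```

The base year always comes out at 100; a value above 100 means growth relative to the base year and a value below 100 means decline.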
Ratio Data
Ratio data is a continuous data type similar to interval data, but with a true zero point. In other words, ratio data is interval data with a zero point. Temperature measured in degrees Celsius or degrees Fahrenheit is interval data, but temperature measured in Kelvin is ratio data, because 0 Kelvin is a true zero. Suppose we are considering temperature on the °C and °F scales, and we find that two persons’ body temperatures are 10°C and 20°C, or 10°F and 20°F, respectively. We cannot say that the second person’s body temperature is twice that of the first, because 0°C and 0°F are not true zeros: they do not mean the absence of temperature.
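To make the difference between the two scales concrete, here is a small sketch that converts the two Celsius readings from the example to Kelvin (K = °C + 273.15) and compares the ratios. The variable names are illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15


first_c, second_c = 10.0, 20.0

# Dividing Celsius readings suggests "twice as hot", which is misleading
# because 0 degrees Celsius is not the absence of temperature.
naive_ratio = second_c / first_c

# On the Kelvin scale the ratio is physically meaningful.
true_ratio = celsius_to_kelvin(second_c) / celsius_to_kelvin(first_c)

# Differences, by contrast, agree on both scales (up to float rounding).
diff_c = second_c - first_c
diff_k = celsius_to_kelvin(second_c) - celsius_to_kelvin(first_c)

print(f"naive ratio in Celsius: {naive_ratio:.2f}")
print(f"ratio in Kelvin:        {true_ratio:.4f}")
print(f"difference: {diff_c:.1f} degrees C, {diff_k:.1f} K")
```

The Kelvin ratio is only slightly above 1, which is why subtraction is safe on an interval scale but division requires a ratio scale.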
Cross-tabulation, in statistics, is a method to understand the relationship between multiple variables. The contingency table, also known as a crosstab, is used to show the relationship between multiple variables in a tabular format. Cross-tab is a popular choice for statistical data analysis. Since it is a reporting and analysis tool, it can be used with data at any level, including ordinal and nominal data. It treats all data as nominal data (nominal data is not measured; it is categorized). For example, you can analyze the relation between two categorical variables such as age group and intended purchase of electronic gadgets.
There are two questions asked here:
· What is your age?
· What electronic gadget are you likely to buy in the next six months?
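A contingency table for these two questions can be sketched with the Python standard library alone. The survey responses below are invented for illustration, and the age groups and gadget names are assumptions. Libraries such as pandas offer the same tabulation through `pandas.crosstab`.

```python
from collections import Counter

# Invented survey responses: (age group, gadget likely to be bought).
responses = [
    ("18-25", "smartphone"),
    ("18-25", "laptop"),
    ("18-25", "smartphone"),
    ("26-35", "laptop"),
    ("26-35", "smartwatch"),
    ("36-45", "smartphone"),
    ("36-45", "laptop"),
    ("36-45", "laptop"),
]

# Count how often each (age group, gadget) pair occurs.
counts = Counter(responses)

age_groups = sorted({age for age, _ in responses})
gadgets = sorted({gadget for _, gadget in responses})

# Print the table: one row per age group, one column per gadget.
print(f"{'age':<8}" + "".join(f"{gadget:>12}" for gadget in gadgets))
for age in age_groups:
    cells = "".join(f"{counts[(age, gadget)]:>12}" for gadget in gadgets)
    print(f"{age:<8}" + cells)
```

Each cell is a plain count, which is why a crosstab works for nominal and ordinal data: the categories are only grouped and counted, never added or averaged.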
General Characteristics/Features of Numerical Data
Quantitativeness: Numerical data is sometimes called quantitative data due to its quantitative nature. Unlike categorical data, which may take numeric values that carry only qualitative meaning, numerical data exhibits quantitative features.
Arithmetic Operation: One can perform arithmetic operations like addition and subtraction on numerical data. True to its quantitative character, almost all statistical analysis is applicable when analysing numerical data.
Estimation & Enumeration: Numerical data can be both estimated and enumerated. When the numerical data is precise, it may be enumerated; when it is not, the data is estimated. When computing the CGPA of a student, for instance, a 4.495623 CGPA is rounded up to 4.50.
Interval Difference: The difference between each interval on a numerical data scale is equal. For example, the difference between 5 minutes and 10 minutes on a wall clock is the same as the difference between 10 and 15 minutes.
Data Visualisation: Numerical data may be visualised in different ways depending on the type of data being investigated. Some of the data visualisation techniques used for numerical data include scatter plots, dot plots, stacked dot plots and histograms.
Examples of Numerical Data Visualization: Scatter Plot
A Scatter (XY) Plot has points that show the relationship between two sets of data. Scatter plots can show trends, clusters, patterns, and relationships in a cloud of data points, especially a very large one. Note: It's important to remember that correlation does not imply causation, and other unnoticed variables could be influencing the data in a chart.
In the above example, each dot shows one student’s marks scored versus the number of hours studied per week.
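The strength of the linear relationship that a scatter plot displays can also be summarised as a single number, the Pearson correlation coefficient. The sketch below computes it from scratch for invented hours-studied and marks data; the values and the helper name `pearson_r` are assumptions for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: hours studied per week and marks scored, one pair per student.
hours = [2, 4, 6, 8, 10]
marks = [50, 58, 65, 70, 82]

r = pearson_r(hours, marks)
print(f"correlation between hours and marks: {r:.3f}")
```

A value close to +1 indicates that marks tend to rise with hours studied in this sample. It says nothing about why, which is the point of the caution about causation above.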
Disadvantages of Numerical Data
· Preset answers that do not reflect how people feel about a subject.
· “Standard” questions from researchers may lead to structural bias.
· Results are limited.