One of the central concepts of data science is gaining insights from data. Statistics is an excellent tool for unlocking such insights in data. Statistics is a form of math, and it involves formulas, but it doesn’t have to be that scary even if you’ve never encountered it before.

Machine learning grew out of statistics. Many of the algorithms and models used in machine learning come from what is called statistical learning. Knowing some basic statistics is extremely helpful whether you are working deep in machine learning algorithms or just keeping up with the latest machine learning research.

We often have to fight for the right, useful datasets to solve the business problems presented to our team. Choosing the right dataset is important for data analytics and, for that matter, machine learning projects. Building a shared understanding of key terms such as "dataset" is an important part of establishing a data-driven culture. In this chapter we will talk about the importance of finding the right dataset and how to obtain datasets for practicing various concepts. Our intention is that when you move to real-world projects, you carry over what you learned from these datasets, which will help you appreciate the importance of data.

DATASET

To understand what a dataset is, we must first discuss its components. A single row of data is called an instance. A dataset is a collection of instances that all share a common set of attributes. To solve a business problem, you will usually use more than one dataset, each fulfilling a different role in the system. Data can come in many forms, but data analytics and machine learning models rely on four primary data types: numerical, categorical, time series, and text data.

**Numerical data (or Quantitative data)**

Numerical data, or quantitative data, is any form of measurable data, such as your height, your weight, or the cost of your phone bill. You can check whether a set of data is numerical by trying to average the values or sort them in ascending or descending order. Counts that take exact, whole values (e.g. 26 students in a class) are discrete, while measurements that can take any value within a range (e.g. a 3.6 percent interest rate) are continuous. Numerical data is not tied to any specific point in time; the values are simply raw numbers. Arithmetic operations can be performed on such data. Thus, it is meaningful to talk about:

2 * height

Cost_price + profit

Total_Cost / total_items

For example, the numbers of male and female students in a class can be recorded and then added together to get the total number of students in the class.
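The expressions above can be tried out directly. The following is a minimal sketch; every value in it is hypothetical and chosen only to show that arithmetic on numerical data gives meaningful results.

```python
# Hypothetical values, used only to illustrate arithmetic on numerical data.
height_cm = 170.0
cost_price = 80.0
profit = 20.0
total_cost = 450.0
total_items = 9
male_students = 14
female_students = 12

double_height = 2 * height_cm                      # 2 * height
selling_price = cost_price + profit                # Cost_price + profit
cost_per_item = total_cost / total_items           # Total_Cost / total_items
total_students = male_students + female_students   # discrete counts added together

print(double_height, selling_price, cost_per_item, total_students)
```

The same operations would make no sense on categorical values such as names or colours, which is one quick way to tell the two kinds of data apart.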

Numerical data is categorized into discrete data (countable items) and continuous data (measurements); continuous data is further grouped into interval and ratio data. These data types are widely used in statistical analysis and research.

### Discrete Data

Discrete data represents countable items. Its values can be arranged in a list, which may be either finite or infinite. As a practical illustration, compare counting the cups of water required to empty a bucket with counting the cups required to empty an ocean: the former is a finite count, while the latter is so large that it is treated as countably infinite.

Data can be represented using various charts. Below is an example where the quarterly average collection of GST from 5 cities is captured in a stacked bar chart:

**Continuous Data:**

This is a type of numerical data that represents measurements: its values are described as intervals on the real number line rather than as counting numbers. For example, the Cumulative Grade Point Average (CGPA) in a 5-point grading system defines first-class students as those whose CGPA falls within 4.50 – 5.00, second class upper as 3.50 – 4.49, second class lower as 2.50 – 3.49, third class as 1.50 – 2.49, pass as 1.00 – 1.49, and fail as 0.00 – 0.99. A student may score 4.495, 2.125, 3.5, or any other number from 0 to 5. In this case, the continuous data is uncountably infinite, even though it lies within a bounded range.
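The CGPA bands can be sketched as a small function that maps a continuous score to its class. This is an illustration only: the function name `classify_cgpa` is hypothetical, and it assumes each band starts at its stated lower bound, so a score such as 4.495 that falls between two printed ranges is placed in the lower band.

```python
def classify_cgpa(cgpa: float) -> str:
    """Map a CGPA on a 5-point scale to its class (assumed lower-bound cutoffs)."""
    if not 0.0 <= cgpa <= 5.0:
        raise ValueError("CGPA must be between 0 and 5")
    # Bands are checked from highest to lowest lower bound.
    bands = [
        (4.50, "first class"),
        (3.50, "second class upper"),
        (2.50, "second class lower"),
        (1.50, "third class"),
        (1.00, "pass"),
    ]
    for lower_bound, label in bands:
        if cgpa >= lower_bound:
            return label
    return "fail"


for score in (4.495, 2.125, 3.5):
    print(score, "->", classify_cgpa(score))
```

Because the score is continuous, any value between 0 and 5 is a valid input, while the output is one of only six categories.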

Continuous data may be subdivided into two types: interval data and ratio data.

*Interval Data*

This is a data type measured along a scale on which adjacent points are an equal distance apart. Interval data takes numerical values on which only addition and subtraction are meaningful; because there is no true zero, ratios of values are not.

For example, the temperature of a body measured in degrees Celsius or degrees Fahrenheit is regarded as interval data. These scales do not have a true zero point: 0°C does not mean the absence of temperature. The interval between 10°C and 20°C is the same as the interval between 70°C and 80°C. Examination marks and time of day are often treated as interval scales as well.

Trend analysis is a popular interval data analysis technique used to draw trends and insights by capturing data over a period of time. In other words, a trend analysis on interval data is conducted by capturing data with an interval-scale survey in multiple iterations, using the same question each time. Trends over several years can be evaluated by calculating the trend percentage: the current year's value divided by the base year's value, multiplied by 100. In the example below, let's see the 5-year percentage trend in revenue (in USD) for Infosys.
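The trend percentage calculation can be sketched in a few lines. The revenue figures below are hypothetical placeholders, not actual Infosys numbers, and the function name `trend_percentages` is an assumption made for this illustration.

```python
def trend_percentages(values_by_year: dict[int, float], base_year: int) -> dict[int, float]:
    """Return each year's value as a percentage of the base year's value."""
    base_value = values_by_year[base_year]
    if base_value == 0:
        raise ValueError("base year value must be non-zero")
    return {
        year: round(value / base_value * 100, 1)
        for year, value in values_by_year.items()
    }


# Hypothetical revenue figures (arbitrary units), for illustration only.
revenue = {2019: 200.0, 2020: 220.0, 2021: 250.0, 2022: 240.0, 2023: 300.0}
trend = trend_percentages(revenue, base_year=2019)
print(trend)
```

The base year always comes out as 100 percent, so a later value of 150 reads directly as "50 percent above the base year".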

*Ratio Data*

Ratio data is a continuous data type similar to interval data, but with a true zero point. In other words, ratio data is interval data with a zero point. Temperature measured in Kelvin is ratio data, whereas temperature in degrees Celsius or degrees Fahrenheit is not, because only 0 Kelvin is a true zero. Suppose we measure temperature on the °C or °F scale and find that two bodies are at 10°C and 20°C, or at 10°F and 20°F, respectively. We cannot say that the second body is twice as hot as the first, because 0°C and 0°F are not true zeros: they do not mean the absence of temperature.
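A short calculation makes the point concrete. This sketch converts the two Celsius readings from the example to Kelvin, using the standard offset of 273.15, and compares the ratios on each scale; the variable names are illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius to Kelvin, which has a true zero."""
    return celsius + 273.15


first_c, second_c = 10.0, 20.0

# Ratio on the Celsius scale: looks like "twice as hot", but is not meaningful.
ratio_celsius = second_c / first_c

# Ratio on the Kelvin scale: meaningful, because 0 K is a true zero.
ratio_kelvin = celsius_to_kelvin(second_c) / celsius_to_kelvin(first_c)

print(f"Celsius ratio: {ratio_celsius:.3f}")
print(f"Kelvin ratio:  {ratio_kelvin:.3f}")
```

On the Kelvin scale the second reading is only about 3.5 percent higher than the first, not 100 percent higher.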

Cross-tabulation, in statistics, is a method for understanding the relationship between multiple variables. The contingency table, also known as a crosstab, presents the relationship between two or more variables in tabular format. Cross-tab is a popular choice for statistical data analysis. Since it is a reporting and analysis tool, it can be used with data at any level, including ordinal and nominal; it treats all data as nominal data (nominal data is not measured, it is categorized), so ratio variables are first grouped into categories. For example, you can analyze the relation between two categorical variables such as age group and purchase of electronic gadgets.

There are two questions asked here:

- What is your age?
- What electronic gadget are you likely to buy in the next six months?
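A contingency table for these two questions can be sketched with the Python standard library alone. The survey responses below are hypothetical, and ages are assumed to have already been grouped into bands so that both variables are categorical.

```python
from collections import Counter

# Hypothetical survey responses: (age group, gadget likely to be bought).
responses = [
    ("18-25", "smartphone"),
    ("18-25", "smartphone"),
    ("18-25", "laptop"),
    ("26-40", "laptop"),
    ("26-40", "smartwatch"),
    ("41-60", "smartphone"),
    ("41-60", "smartwatch"),
    ("41-60", "smartwatch"),
]

counts = Counter(responses)
age_groups = sorted({age for age, _ in responses})
gadgets = sorted({gadget for _, gadget in responses})

# Rows are age groups, columns are gadgets, cells are response counts.
table = {
    age: {gadget: counts[(age, gadget)] for gadget in gadgets}
    for age in age_groups
}

print("age group".ljust(10) + "".join(g.ljust(12) for g in gadgets))
for age in age_groups:
    print(age.ljust(10) + "".join(str(table[age][g]).ljust(12) for g in gadgets))
```

Each cell counts how many respondents gave that combination of answers, and the cell counts add up to the total number of responses.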

**General Characteristics/Features of Numerical Data**

*Quantitativeness*: Numerical data is sometimes called quantitative data due to its quantitative nature. Unlike categorical data, which may use numbers as labels but describes qualitative characteristics, numerical data exhibits genuinely quantitative features.

*Arithmetic Operation*: One can perform arithmetic operations like addition and subtraction on numerical data. True to its quantitative character, almost all statistical analysis is applicable when analysing numerical data.

*Estimation & Enumeration:* Numerical data can be both estimated and enumerated. When the numerical data is precise, it may be enumerated; when it is not, it is estimated. When computing the CGPA of a student, for instance, a CGPA of 4.495623 is rounded up to 4.50.

*Interval Difference*: The difference between each interval on a numerical data scale is equal. For example, the difference between 5 minutes and 10 minutes on a wall clock is the same as the difference between 10 and 15 minutes.

*Data Visualisation*: Numerical data may be visualised in different ways depending on the type of data being investigated. Common visualisation techniques for numerical data include the scatter plot, dot plot, stacked dot plot, and histogram.

**Examples of Numerical Data Visualization: Scatter Plot**

A scatter (XY) plot has points that show the relationship between two sets of data. Scatter plots can reveal trends, clusters, patterns, and relationships in a cloud of data points, especially a very large one. Note: It's important to remember that correlation does not imply causation, and other unnoticed variables could be influencing the data in a chart.

In the above example, each dot shows one student’s marks scored versus the number of hours per week studied.
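The relationship a scatter plot shows visually can also be summarised as a number. The sketch below computes the Pearson correlation coefficient by hand for hours studied versus marks scored; the data points are hypothetical and the function name `pearson_r` is an assumption made for this illustration.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a sample has no variation")
    return covariance / (spread_x * spread_y)


# Hypothetical data: hours studied per week and marks scored by five students.
hours_per_week = [2, 4, 6, 8, 10]
marks_scored = [45, 55, 60, 72, 85]

r = pearson_r(hours_per_week, marks_scored)
print(f"correlation between hours studied and marks: {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship in this sample; as noted above, it does not by itself show that studying more causes higher marks.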

**Disadvantages of Numerical Data**

- Preset answers that do not reflect how people feel about a subject.
- “Standard” questions from researchers may lead to structural bias.
- Results are limited.