Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed. Machine learning focuses on developing computer programs that can access data and use it to learn for themselves. How does a system learn?
A computer program is said to learn from experience “E” with respect to some task “T” and some performance measure “P” if its performance on “T”, as measured by “P”, improves with “E”.
Figure 1: Machine Learning Workflow
The process of learning begins with observations or data (training data), such as examples, direct experience, or instruction. The system looks for patterns in this data so that it can make better decisions in the future based on the examples we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Machine learning algorithms are often categorized as supervised or unsupervised.
Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly.
Unsupervised machine learning algorithms are used when the training data is neither classified nor labeled. Unsupervised learning studies how systems can infer a function that describes hidden structure in unlabeled data. The system does not predict a known correct output; instead, it explores the data and draws inferences that describe its hidden structure.
Semi-supervised machine learning algorithms fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Semi-supervised learning is usually chosen when obtaining labeled data requires skilled and relevant resources.
Reinforcement machine learning is a learning method in which an agent interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. Simple reward feedback is required for the agent to learn which action is best; this is known as the reinforcement signal.
Still not clear about these methods? Not to worry. We will learn and practice them in the chapters to come. One thing to understand is that machine learning enables the analysis of massive quantities of data. It generally delivers faster, more accurate results for identifying profitable opportunities or dangerous risks, but it may also require additional time and resources to train properly.
Let’s understand them with an example:
· Suppose you had a basket and filled it with different kinds of fruits.
· Your task is to arrange them into groups.
· To make this easier to follow, let me name the fruits in our basket.
· We have four types of fruits: apples, bananas, cherries, and grapes.
Please note that reinforcement learning is out of scope for this tutorial. We will cover it in a separate tutorial.
· You have already learned the physical characteristics of the fruits from your previous work, so arranging fruits of the same type in one place is easy now.
· In data mining terminology, this earlier work is called the training data.
· You have already learned from your training data because it includes a response variable.
· The response variable is simply the decision variable: the value you want to predict.
· You can observe the response variable below (FRUIT NAME).
· Suppose you take a new fruit from the basket. You will look at the size, color, and shape of that particular fruit.
· If the size is big, the color is red, and the shape is round with a depression at the top, you will confirm the fruit as an apple and put it in the apple group.
· Likewise for the other fruits.
· The job of grouping the fruits is done, and that is the happy ending.
· You can observe in the table that a column is labeled “FRUIT NAME”. This is called the response variable.
· If you first learn from training data and then apply that knowledge to test data (the new fruit), this type of learning is called supervised learning.
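The supervised steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the `TRAINING_DATA` rows, the `count_matches` helper, and the `predict` function are hypothetical names introduced here, and the feature values simply mirror the size and color descriptions used in this tutorial. The "learning" is a nearest-match lookup, which predicts the label of the training example that shares the most feature values with the new fruit.

```python
# Training data: each row pairs observed features with the known fruit name.
# The fruit name plays the role of the response variable.
# These rows are illustrative assumptions based on the groupings in the text.
TRAINING_DATA = [
    ({"size": "big", "color": "red"}, "apple"),
    ({"size": "small", "color": "red"}, "cherry"),
    ({"size": "big", "color": "green"}, "banana"),
    ({"size": "small", "color": "green"}, "grape"),
]


def count_matches(features, known_features):
    """Count how many feature values the two fruits have in common."""
    return sum(1 for key in features if features[key] == known_features.get(key))


def predict(features, training_data=TRAINING_DATA):
    """Return the label of the training example that best matches the features."""
    best_row = max(training_data, key=lambda row: count_matches(features, row[0]))
    return best_row[1]


# A new, unlabeled fruit taken from the basket.
new_fruit = {"size": "big", "color": "red"}
print(predict(new_fruit))  # apple
```

Because the training rows carry the fruit name, the program can name a new fruit it has never seen. Without that labeled column there would be nothing for `predict` to return.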
Suppose you have a basket filled with different types of fruits, and your task is to arrange them into groups.
· This time, you don’t know anything about the fruits; this is the first time you have seen them. You have no clue about them.
· So, how will you arrange them?
· What will you do first?
· You will take each fruit and arrange it by considering a physical characteristic of that fruit.
· Suppose you consider color first.
· Then you will arrange the fruits using color as the base condition.
· The groups will then look something like this.
o RED COLOR GROUP: apples and cherries.
o GREEN COLOR GROUP: bananas and grapes.
· Now you take another physical characteristic, such as size.
o RED COLOR AND BIG SIZE: apples.
o RED COLOR AND SMALL SIZE: cherries.
o GREEN COLOR AND BIG SIZE: bananas.
o GREEN COLOR AND SMALL SIZE: grapes.
· The job is done, and that is the happy ending.
· Here you did not learn anything beforehand, meaning there was no training data and no response variable. This type of learning is called unsupervised learning.
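The grouping procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the `fruits` list and the `group_by` helper are hypothetical, and the code groups items by exact feature values, which mirrors the manual sorting in the example. Real clustering algorithms, such as k-means, instead discover groups from numeric similarity. Note that no fruit names appear anywhere in the data.

```python
from collections import defaultdict

# Unlabeled fruits: only physical characteristics, no fruit names.
# The values are illustrative and mirror the groupings in the text.
fruits = [
    {"color": "red", "size": "big"},
    {"color": "green", "size": "small"},
    {"color": "red", "size": "small"},
    {"color": "green", "size": "big"},
    {"color": "red", "size": "big"},
    {"color": "green", "size": "small"},
]


def group_by(items, keys):
    """Group items that share the same values for the given characteristics."""
    groups = defaultdict(list)
    for item in items:
        group_key = tuple(item[key] for key in keys)
        groups[group_key].append(item)
    return dict(groups)


# First pass: group on color only.
by_color = group_by(fruits, ["color"])

# Second pass: refine the groups by also considering size.
by_color_and_size = group_by(fruits, ["color", "size"])

for group_key, members in by_color_and_size.items():
    print(group_key, len(members))
```

The first pass yields two groups and the second yields four, matching the red/green and big/small groups described above. The program never learns what the groups are called; it only finds that they exist.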
The Purpose of Train/Test Sets
Why do we use train and test sets?
Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem. The training dataset is used to prepare a model, to train it.
We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.
Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data. As you can see, train/test analysis applies to supervised algorithms and not to unsupervised algorithms, because it depends on known output values to compare the predictions against.
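The train/test procedure described above can be sketched with the Python standard library alone. The helper names `train_test_split`, `predict`, and `accuracy` are hypothetical functions defined here, not library calls, and the 25% test fraction, the fixed seed, and the toy fruit rows are assumptions chosen for illustration. The model is the same simple nearest-match lookup idea, and accuracy (the fraction of correct predictions) stands in for the performance measure.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict(features, train_rows):
    """Predict the label of the training row sharing the most feature values."""
    def matches(row):
        return sum(1 for a, b in zip(features, row[0]) if a == b)
    return max(train_rows, key=matches)[1]


def accuracy(train_rows, test_rows):
    """Fraction of test rows whose withheld label equals the prediction."""
    correct = sum(
        1 for features, label in test_rows
        if predict(features, train_rows) == label
    )
    return correct / len(test_rows)


# Toy labeled dataset: (size, color) features with the fruit name as output.
rows = [
    (("big", "red"), "apple"),
    (("small", "red"), "cherry"),
    (("big", "green"), "banana"),
    (("small", "green"), "grape"),
] * 3

train_rows, test_rows = train_test_split(rows)
score = accuracy(train_rows, test_rows)
print(f"train={len(train_rows)} test={len(test_rows)} accuracy={score:.2f}")
```

The labels of the test rows are never shown to `predict`; they are used only afterwards, inside `accuracy`, to check the predictions. That separation is what makes the score an estimate of performance on unseen data.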
Figure 2: Exabyte and growth of data. (Source: IDC)
We live in the 21st century, and data is everywhere. Every second, huge amounts of data are produced, whether it is a text message you send or a picture you post on Instagram. From the dawn of time until 2005, humans had created 130 exabytes of data. By 2020, the total is expected to reach 40,900 exabytes. To put this in perspective, one letter takes about 1 byte of space, and one exabyte is 10^18 bytes. This is phenomenal growth in the data we create, and it is the reality of the world we live in. Our own capacity to process this data is very limited, and although machines can process far more, even they cannot process all of it. Machine learning gives us an opportunity here: machine learning algorithms can help us analyze this data and create value from it.