AI Starter: Everything you need to know about Keras to build your first deep learning model

28 min read · Apr 8, 2019

AI Starter — Part -1

Welcome to the first part of the AI Starter series. Here is a brief overview of what you can learn from each part of the series.

Part -1 A complete explanation of standard format and flow of how to code in Keras.

Part -2 Build a neural network in Keras from scratch.

Part -3 Build a Convolution neural network in Keras from scratch to perform multi-class classification — Levels Up

Part -4 What are the learned parameters in Machine learning

The different parts of the AI Starter series are inspired by the tutorials of many generous contributors who have made computer vision, machine learning, and deep learning concepts accessible to everyone in a fun and easy way. I have tried my best to link to all the blogs I follow to learn AI in the simplest way.

There are no well-defined steps for learning the concepts of artificial intelligence. It is easy to begin, but the journey becomes rigorous once the aim is to deliver results. It demands patience, persistence, time and a fearless attitude to stay in it.

These are my principles for staying in it:

1. Keep your basics crystal clear and practice until you get them right.
2. Read and stay updated every day.
3. Never stop asking questions; join a discussion group and keep participating.
4. Do not be afraid to clean, create or label data.
5. Fearlessly implement and experiment with what you have read, and push yourself to the next level.
6. Keep challenging yourself.
7. Keep contributing to our AI society.

Let us begin to understand the standard format and flow of how to code in Keras.


What is a deep learning framework?

While it is possible to build deep learning solutions from scratch, deep learning frameworks are a convenient way to build them quickly. Such frameworks provide different neural network architectures and their building blocks in popular languages so that developers can use them across multiple platforms.

Popular open-source deep learning (DL) frameworks include TensorFlow, Caffe, Keras, PyTorch, Caffe2, CNTK, MXNet, Deeplearning4j (DL4J), and many more.

Many of these frameworks support Python as the programming language of choice. Those supporting C++ include Caffe, DL4J, CNTK, MXNet, and TensorFlow. Torch was written in Lua and C, and PyTorch extends and improves on it with Python support. Paddle is a framework from Baidu.

How should one select a suitable deep learning framework?

Some obvious factors to consider are licensing, documentation, active community, adoption, programming language, modularity, ease of use, and performance. Keras and PyTorch are said to be easy to use, but you can also consider TensorFlow for its popularity and its low-level control.

More specifically, you should check the following:

  • Style: Imperative or symbolic. Torch, Chainer, and Minerva are examples of imperative-style DL frameworks. Symbolic-style DL frameworks include TensorFlow, Theano and CGT. Imperative programs perform computations as they are encountered along with the program flow. Symbolic programs define symbols and how they should be combined. They result in what we call a computational graph. Symbols themselves might not have initial values. Symbols acquire values after the graph is compiled and invoked with particular values. Imperative frameworks are more flexible since you’re closer to the language. In symbolic frameworks, there’s less flexibility since you write in a domain-specific language. However, symbolic frameworks tend to be more efficient, both in terms of memory and speed. At times, it might make sense to use a mix of both framework styles. For example, parameter updates are done imperatively and gradient calculations are done symbolically.
  • Core Development Environment: Programming language, intuitive API, fast compile times, tools, debugger support, abstracting the computational graph, graph visualization (TensorBoard), etc.
  • Neural Network Architecture: Support for Deep Autoencoders, Restricted Boltzmann Machines (RBMs), Convolutional Neural Networks, Recurrent Neural Networks, Long Short-Term Memory, Generative Adversarial Networks, etc.
  • Optimization Algorithms: Gradient Descent (GD), Momentum-based GD, AdaGrad, RMSProp, and Adam, etc.
  • Targeted Application Areas: Image recognition, video detection, voice/audio recognition, text analytics, Natural Language Processing (NLP), time series forecasting, etc.
  • Hardware Extensibility: Support for multiple CPUs, GPUs, GPGPUs or TPUs across multiple machines or clusters.
  • Optimized for Hardware: Execute in optimized low-level code by supporting CUDA, BLAS, etc.
  • Deployment: Framework should be easy to deploy in production (TensorFlow Serving).

What is Keras?

Keras is an open-source, high-level neural network API written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It is known for its user-friendliness and modularity (plug and play), which enable fast experimentation.

Here is an analogy. Just as a web application has a frontend (GUI) and a backend (piles of code), Keras is the frontend (a user-friendly code interface that provides a high-level abstraction over complex mathematical operations) and any one of TensorFlow, CNTK or Theano is the backend (the code that performs the complex mathematics).

Keras is extremely simple to use, so it is recommended for rapid prototyping. Being able to go from idea to result with the least possible delay is key to doing good research. Keras handles GPU acceleration implicitly. It is valued for its code modularity, minimalism, extensibility, and Python-nativeness. Keras lets you use TensorFlow as the backend without having to learn TensorFlow first.

Keras was developed with the objective of allowing people to write their own scripts without having to learn the backend in detail. Developers can use Keras to quickly build neural networks without worrying about the mathematical aspects of tensor algebra, numerical techniques, and optimization methods, which are handled by its backend (TensorFlow or Theano). After all, most users do not need to worry about the low-level performance of their scripts or the details of the algorithms.

However, one size does not fit all when it comes to machine learning applications: Keras can be limiting if you need to make low-level changes to your CNN model. For that, you need TensorFlow. Although it is harder to understand, once you get hold of the syntax you’ll be building your models in no time.

So, like everything, it all boils down to your requirements at hand. If you’re looking to fiddle around with Deep Neural Networks or just want to build a prototype — Keras is your calling. However, if you’re the one that likes to dive deep and get control of the low-level functionalities, you should spend some time exploring TensorFlow.

What is the standard format to code in Keras?

Figure: Seven steps to creating your deep learning network in Keras (image courtesy of its original author).

I explain each step one by one below. In the later parts of the AI Starter series you will find that coding your CNN is easy and fun.

Step-1: Analyze the dataset

What is a dataset?

“It is a collection of images, numbers, observations or facts together with their ground truth (class, label) or the location of the objects in them (annotation).”

In all the parts of the AI Starter series, we will work with images as our input data to train and test the convolutional neural network.

Learn by doing: Create a small dataset in 5 minutes

1. Create a folder and name it Dataset.
2. Create two subfolders inside Dataset and name them train and test.
3. Inside each of the train and test folders, create two subfolders and name them dog and cat.
4. Download 10 images of dogs and 10 images of cats from the internet.
5. Put 5 of the dog images in the dog folder under train, and the other 5 in the dog folder under test. Do the same for the cat images.
6. Rename the images according to their content:
  • Dog images in “Dataset/train/dog/”: dog_1.jpg, dog_2.jpg, dog_3.jpg, dog_4.jpg, dog_5.jpg
  • Dog images in “Dataset/test/dog/”: dog_6.jpg, dog_7.jpg, dog_8.jpg, dog_9.jpg, dog_10.jpg
  • Cat images in “Dataset/train/cat/”: cat_1.jpg, cat_2.jpg, cat_3.jpg, cat_4.jpg, cat_5.jpg
  • Cat images in “Dataset/test/cat/”: cat_6.jpg, cat_7.jpg, cat_8.jpg, cat_9.jpg, cat_10.jpg

You just learned to create a dog and cat classification dataset. Here the names of the subfolders of the train and test folders are the ground truths or labels: they tell which category or class (dog or cat) all the images in that folder belong to.
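
The folder layout above can also be created with a few lines of Python. This is only a minimal sketch using the standard library; the function names `dataset_folders` and `create_dataset_skeleton` are made up for this illustration, and downloading and renaming the images is still up to you.

```python
from pathlib import Path


def dataset_folders(root="Dataset", splits=("train", "test"), classes=("dog", "cat")):
    """Return the class folders root/<split>/<class> without touching the disk."""
    return [Path(root) / split / class_name
            for split in splits
            for class_name in classes]


def create_dataset_skeleton(root="Dataset"):
    """Create every class folder; folders that already exist are left alone."""
    folders = dataset_folders(root)
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `create_dataset_skeleton()` in an empty working directory should produce Dataset/train/dog, Dataset/train/cat, Dataset/test/dog and Dataset/test/cat.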

How to analyze a dataset?

In the above section, you created a small classification dataset of dogs and cats. That is an easy task, because such images are readily available on the internet.

Now suppose you need to create a dataset to classify between a specific breed of dogs (Bulldog) and cats (Siberian). For that, you might have to analyze a good number of benchmark datasets available online, such as the Kaggle Cats and Dogs Dataset. These datasets might contain classes for many animals, such as lion, panda, monkey and leopard, with dogs and cats among them. Since you need specific breeds, you will have to pick the relevant images from several datasets.

You will find many datasets online which might contain the images for the required classes (Dog or Cat). But making a selection of which one to choose depends completely on the expected outcome.

A very important point is to get familiar with the concepts of overfitting and underfitting.

Overfitting: Suppose you create a dataset with 2000 images of white kittens on green grass. The network gets very high accuracy on the training dataset but fails on test images where white kittens are roaming in a marketplace or sitting on a sofa. The network has become biased because of the biased training dataset: it has learned to give good accuracy only when the kittens are sitting on green grass.

Underfitting: When you choose a small network and train it on an extremely large, unbiased dataset, the network does not have the capacity to learn from such a huge dataset. In this case, you must use a bigger network.

One must also consider the following points:

  • Dataset characteristics: Multivariate, Univariate, Sequential, Time-series, Text.
  • Attribute characteristics: Real, Integers, Strings, Float, Binary.
  • Number of Instances: the total number of rows.
  • Number of Attributes: the total number of columns.
  • Associated Tasks: Classification, Regression, Clustering, or another problem.

Step-2: Prepare the dataset

I always recommend choosing an available dataset rather than creating one from scratch. Available datasets have already been benchmarked on many popular networks in well-known AI challenges, with a record of high accuracies.

But your problem statement may revolve around a very narrow, new or uncommon field. For example, suppose you need to classify the latest clothes designed by multiple fashion designers for a fashion week.

In that case, you will have to create a dataset of clothes designed by those designers, because it is not available online. You might have to organize a photoshoot to capture images of all the clothes. This is a classification dataset: you need to build a model (AI engine) that learns the unique features of the clothes of each participating fashion designer and later classifies clothes according to the designer who designed them.

One must also read about data augmentation, where images are rotated, translated, sheared or distorted to enlarge the dataset. But data augmentation must be done with utmost care: the purpose is to add useful variation, not merely to increase the number of distorted images.
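
To make the idea of augmentation concrete, here is a toy sketch in plain Python that is independent of Keras. It treats a grayscale image as a list of rows and produces two variants of it; the helper names `flip_horizontal` and `shift_right` are hypothetical and exist only for this illustration.

```python
def flip_horizontal(image):
    """Mirror every row of a 2D image given as a list of rows."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Translate the image to the right by `pixels` columns.

    Assumes 0 <= pixels <= image width; exposed columns are padded with `fill`.
    """
    return [[fill] * pixels + list(row[:len(row) - pixels]) for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = flip_horizontal(image)   # [[3, 2, 1], [6, 5, 4]]
shifted = shift_right(image, 1)    # [[0, 1, 2], [0, 4, 5]]
```

Each variant shows the same content with a small, controlled change, which is the kind of variation augmentation is meant to add.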

I am an image processing and computer vision enthusiast. I love crafting images using concepts of image processing and OpenCV functions. I have crafted and trained on millions of images. I always recommend creating your own functions to augment images in a controlled way. In Keras, such functions are called generators.

Data generation also involves processing steps such as scaling the images, shuffling them so that the CNN is trained evenly on every class throughout training, and feeding them in mini-batches (a small set of images at a time) or one image at a time.
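
The scaling, shuffling and mini-batching described above can be sketched as a plain Python generator. This is an illustration under simple assumptions, not the Keras implementation: each image is a flat list of pixel values between 0 and 255, and `batch_generator` is a name invented for this example.

```python
import random


def batch_generator(samples, labels, batch_size, seed=None):
    """Yield (batch_x, batch_y) pairs forever, reshuffling at every pass."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    while True:                       # training generators loop indefinitely
        rng.shuffle(indices)          # new order on every pass over the data
        for start in range(0, len(indices), batch_size):
            chosen = indices[start:start + batch_size]
            # Scale pixel values from [0, 255] to [0, 1].
            batch_x = [[pixel / 255.0 for pixel in samples[i]] for i in chosen]
            batch_y = [labels[i] for i in chosen]
            yield batch_x, batch_y
```

Because the generator never ends, the caller decides how many batches make up one epoch, for example by calling `next()` a fixed number of times.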

Keras, being a high-level API, provides an image pre-processing module. keras.preprocessing.image.ImageDataGenerator() lets you define the parameters that control how your data is augmented. It performs real-time data augmentation. Every deep learning problem needs data, lots and lots of data. You might ask how the system knows what kind of augmentation to perform.

There are two ways. The first is fit(x, augment=False, rounds=1, seed=None), where x is some sample data. This computes the internal data statistics needed for data-dependent transformations, based on an array of sample data. The second is to specify the augmentation parameters yourself.

Learn by doing: Perform data augmentation in 5 minutes

Input image for augmentation
A set of 12 augmented images generated from the above input image.

The above code does the same work. It helps you see how your image data must be reshaped in order to produce, save and visualize the augmented images. I recommend trying the code. I have commented each line so that you can easily learn the two ways to transform your data.

The above code finally generates 20 augmented images from a dog image, as you can see in the above images. It is an extension of the above two code samples. Please follow this blog for a detailed study of image augmentation. To understand the input shape of the image, I recommend reading the answer on image input shapes.

The above code helps you generate and save images, which is fine if you are curious to see how augmentation works. But in a real project you might have a huge dataset, and saving all the images might cost you a good amount of storage space. The flow_from_directory() function in Keras takes all the images from a folder and creates augmented images according to the parameters defined in ImageDataGenerator.

Parameters defined in the above code for data augmentation: rotation_range=40,

Always follow the standard Keras folder structure, shown in the image below, when creating a dataset whose images come from multiple directories. The Keras flow_from_directory() function takes images from multiple folders and uses them for training the CNN.

In a folder named Data, we make three sub-folders: Train, Valid and Test. Inside Train and Valid we make two sub-folders, class_a and class_b. Inside Test, we create a single test folder that holds a mixture of images from both classes, class_a and class_b. In each class folder, we put the images belonging to that particular class.

The flow_from_directory() function is explained in detail in part 3 of this series. For now, have a look at the code below. flow_from_directory() goes into the “/DATA/TRAIN” folder, looks for subfolders such as dog and cat, and treats dog and cat as classes. It also resizes the images to height = 150 and width = 150. The class mode is set to binary because there are only two classes to train on, dog and cat. The class mode specifies whether this is a binary or multi-class classification problem. It is one of “categorical”, “binary”, “sparse”, “input”, or None. Default: “categorical”.

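train_generator = train_datagen.flow_from_directory(
    'Data/Train',              # placeholder path: one subfolder per class
    target_size=(150, 150),    # resize every image to 150x150
    class_mode='binary')       # two classes, so 1D binary labels
validation_generator = test_datagen.flow_from_directory(
    'Data/Valid',              # placeholder path, same layout as training
    target_size=(150, 150),
    class_mode='binary')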

The class mode determines the type of label arrays that are returned:

  • “categorical” will be 2D one-hot encoded labels,
  • “binary” will be 1D binary labels,
  • “sparse” will be 1D integer labels,
  • “input” will be images identical to the input images (mainly used to work with autoencoders).
  • If None, no labels are returned (the generator will only yield batches of image data).

In the case of cat and dog, the labels will be 0 and 1.
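
The label formats listed above can be illustrated with a small plain-Python sketch. It is not the Keras implementation; `encode_labels` is a hypothetical helper. It assumes, as Keras does when it infers classes from subfolder names, that class indices follow the alphanumeric order of the class names, so cat becomes 0 and dog becomes 1.

```python
def encode_labels(class_names, sample_classes, class_mode="categorical"):
    """Encode the class name of each sample in the requested label format."""
    # Alphanumeric order of the folder names decides the class index.
    index = {name: i for i, name in enumerate(sorted(class_names))}
    ids = [index[name] for name in sample_classes]
    if class_mode == "categorical":   # 2D one-hot labels
        return [[1.0 if i == k else 0.0 for k in range(len(index))] for i in ids]
    if class_mode == "binary":        # 1D labels, 0.0 or 1.0
        return [float(i) for i in ids]
    if class_mode == "sparse":        # 1D integer labels
        return ids
    raise ValueError("unsupported class_mode: %r" % (class_mode,))


samples = ["cat", "dog", "dog"]
binary = encode_labels(["dog", "cat"], samples, "binary")        # [0.0, 1.0, 1.0]
one_hot = encode_labels(["dog", "cat"], samples, "categorical")  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sparse = encode_labels(["dog", "cat"], samples, "sparse")        # [0, 1, 1]
```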

Step-3: Create the model

The model is made up of two parts: forward propagation and backward propagation.

In supervised learning, the image and its true value are the inputs to a network.

In forward propagation, the input image is propagated through the network and features are extracted from it. (The values of the image feature extractors are called learned parameters; they can be randomly initialized at the beginning of forward propagation.) After the image has traveled through all the layers of the network, we predict the result at the output node, which is expected to be close to or exactly the true value. For an input image of a dog, the predicted result should be a dog.

The prediction will not always be a dog. The result might be a cat, or it might be a dog but with very low confidence, because we randomly initialized the learned parameters in forward propagation. To quantify this, we find the difference between the true value (ground truth) and the predicted value. This is called the loss. There are multiple ways of calculating the loss. Once the loss is calculated we perform backpropagation.

In backward propagation, learning happens when we take the calculated loss and propagate it back into the network. While propagating the loss, we find the rate of change of the loss with respect to each learned parameter (the values of the image feature extractors). This rate of change tells us by how much we should increase or decrease (update) each learned parameter so that we reduce the loss and improve the overall performance of the network.
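
The forward and backward pass can be shown on the smallest possible "network": a single weight. The sketch below is a toy under stated assumptions (one input, one weight, squared error as the loss); `training_step` is a name chosen for this illustration.

```python
def training_step(weight, x, y_true, learning_rate):
    """Run one forward pass, one backward pass and one weight update."""
    y_pred = weight * x                      # forward propagation
    loss = (y_pred - y_true) ** 2            # how far off the prediction is
    gradient = 2 * (y_pred - y_true) * x     # rate of change of loss w.r.t. weight
    new_weight = weight - learning_rate * gradient   # move against the gradient
    return new_weight, loss


# The true relation is y = 2 * x; we start from a poor initial weight.
weight = 0.5
weight, first_loss = training_step(weight, x=2.0, y_true=4.0, learning_rate=0.1)
weight, second_loss = training_step(weight, x=2.0, y_true=4.0, learning_rate=0.1)
```

After the first update the weight moves from 0.5 towards 2, and the loss on the second step is much smaller than on the first.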

This was a brief definition of both parts with minimal technical terms. As I said, we will look at each of these one step at a time.

This is a machine learning model representation.

Do not get scared by the above animation. In this blog, we are going to understand everything in a very easy and fun way, one step at a time. We will also build multiple models in the different parts of this series.

What is a machine learning model?

A machine learning model is a mathematical representation of a real-world process, built to give correct results on unseen data for problems like binary classification, multi-class classification, regression and detection. There are different ways an algorithm can learn. In all four parts we will use a learning method called supervised learning.

Supervised learning

Supervised Learning means we have a set of training data along with its outcomes (labels, classes, real values, ground truth). We train a deep learning model with the training data so that the model will be in a position to predict the outcome (predicted label, predicted class) of future unseen data (or test data).

For example, an image of a cat is training data and “cat” is its class, real value or ground truth.

Later, once the model is trained on this data, given a new test image it can predict the class (label, real value).

Now the question is what this model looks like. The definition says a model is a mathematical representation. I know you must have a lot of questions now, like:

Is the model an equation or series of equations?

How do I create a model?

Should I code?

Is it easy or tough?

Let me assure you that with Keras it is extremely easy and fast and yes, of course, you must code.

Look at the shiny image below. It is a model called AlexNet, and it is a classification model. I highly recommend taking Andrew Ng’s Machine Learning course to get your basics clear. As you can see in the image, there are colorful rectangular blocks, and inside each block runs a mathematical operation.


Forward propagation

What does a Convolution block do?

The gray input block is fed with input images and their classes (real value or ground truth), as this is supervised learning. The red block is a Conv block, a block where convolution happens: the convolution between the image and filters (image processing filters). The only difference is that the filter values are not fixed like the ones you see in image processing filters such as Sobel, Canny, Gaussian, Mean, Median, etc. The values in these filters are dynamic: they keep changing as the filters convolve with training images, updating as the network learns features from images of the different classes.
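
To see what convolving an image with a filter means, here is a plain-Python sketch that slides a fixed Sobel filter over a tiny image. It assumes no padding and a stride of 1, and, as is usual in CNN layers, it does not flip the kernel. `convolve2d` is a name invented for this example; in a Conv block the kernel values would be learned rather than fixed.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the sum of products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Fixed Sobel filter that responds to vertical edges.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Dark left half, bright right half: a vertical edge down the middle.
edge_image = [[0, 0, 10, 10]] * 4
flat_image = [[5, 5, 5, 5]] * 4

edge_response = convolve2d(edge_image, SOBEL_X)   # [[40, 40], [40, 40]]
flat_response = convolve2d(flat_image, SOBEL_X)   # [[0, 0], [0, 0]]
```

The filter responds strongly where the edge is and gives zero on the flat patch, which is why edges make better features than plain areas.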

Image features
For this task, we first need to understand what an image feature is and how we can use it.
An image feature is a simple image pattern, based on which we can describe what we see in the image. For example, the eyes, nose and ears are features in an image of a dog. In image processing, such features are described as edges, corners and blobs. They are much better features than a flat area like the forehead of the dog, which has no significant features to learn, only a plain brown color patch. The main role of features in computer vision (and not only there) is to transform visual information into a vector space. This makes it possible to perform mathematical operations on them, for example finding a similar vector (which leads us to a similar image or object in the image).

Features of a dog are its mustache, ear, nose, eyes.

Feature detection and Feature Description

But the next question arises: how do we find the features? Finding these image features is called Feature Detection.

So we found the features in the image (assume you did it). Once you have found them, you should find the same features in other images. What do we do?

We take a region around the feature and explain it in our own words, like “the upper part is blue sky, the lower part is a building, on that building there is some glass, etc.”, and we search for the same area in other images. Basically, we are describing the feature. In a similar way, a computer should also describe the region around the feature so that it can find it in other images. This description is called Feature Description. Once you have the features and their descriptions, you can find the same features in all images and align them, stitch them or do whatever you want.

Learn by doing: Understand convolution, filters, feature detector, and descriptor in 5 minutes visually.

This is an amazing link to visually understand all of the above without coding everything on your own.

Learn by doing: Extract corners in an image in 5 minutes

The first image is the input image and the second image displays the corners in the image as red dots. Corners are features, and an algorithm called Harris corner detection is a feature detector.

You can run the above code to detect corners in any image.

How do you represent the first convolution layer in Keras?


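from keras.models import Sequential
from keras.layers import Conv2D
model = Sequential()  # the model that the layers are added to
model.add(Conv2D(32, (3, 3), input_shape=(150, 150, 3)))  # 32 filters of size 3x3
The above lines of code are the way to represent a convolution layer in Keras. It has 32 filters, each of size 3*3. The shape of the input image is height = 150, width = 150 and number of channels = 3.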
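
It helps to work out what such a layer produces. The sketch below does the arithmetic in plain Python, assuming the Keras defaults for this layer (no padding, stride 1, one bias per filter); `conv2d_summary` is a hypothetical helper, not a Keras function.

```python
def conv2d_summary(height, width, channels, filters, kernel_size):
    """Return the output shape and the number of learned parameters."""
    out_height = height - kernel_size + 1    # the window cannot hang over the border
    out_width = width - kernel_size + 1
    weights_per_filter = kernel_size * kernel_size * channels
    parameters = (weights_per_filter + 1) * filters   # +1 for each filter's bias
    return (out_height, out_width, filters), parameters


output_shape, parameters = conv2d_summary(150, 150, 3, filters=32, kernel_size=3)
# output_shape is (148, 148, 32) and parameters is 896
```

So 32 filters of size 3*3 over a 3-channel image hold (3*3*3 + 1) * 32 = 896 learned values, and each filter produces one 148*148 output map.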

What does a Relu block do?

The Relu block is also called an activation block. Like every other block, it performs a mathematical operation. Different kinds of activation functions can be used here; in this case, Relu is the activation function. An activation function transforms the output of a layer of the neural net. The output from the convolution layer can have arbitrary upper and lower limits, including negative values.

It’s just R(x) = max(0, x), i.e. if x < 0, R(x) = 0 and if x >= 0, R(x) = x. So if an output from the previous layer is a negative number, Relu converts it to 0, and if it is a positive number, Relu keeps it as it is. It is a simple thresholding operation. In our case the input to the activation function is an image after a convolution operation has been performed on it.

from keras.layers import Activation
model.add(Activation('relu'))  # model is the Sequential model being built
The above lines of code are the way to represent an activation function in Keras. In this case the activation function is relu.
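
As a quick check of the definition R(x) = max(0, x), here is a plain-Python version applied to a list of values. It is only a sketch of the arithmetic, not how Keras computes it internally.

```python
def relu(values):
    """Replace every negative value with 0 and keep the rest unchanged."""
    return [max(0.0, value) for value in values]


activated = relu([-2.0, 0.0, 3.5])   # [0.0, 0.0, 3.5]
```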

What is max-pooling?

A max-pooling layer is a layer where no learning happens; the only thing that happens is a reduction of dimension. It helps reduce the number of learned parameters in the layers that follow, thus reducing the computation and memory load.

The above image explains how max-pooling works. The kernel size here is 2*2. As nothing is learned here, the kernel is just an empty window of a particular size (the blue circle on the left matrix). To perform max pooling, slide the 2*2 window over the left image, pick the maximum number from each 2*2 patch, and write it into a smaller image like the one on the right. The right image is the output of the max pooling layer. Max pooling also helps reduce overfitting.
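
The same procedure can be written out in plain Python. This sketch assumes non-overlapping windows (the stride equals the pool size) and drops any trailing rows or columns that do not fill a whole window; the matrix values are made up for the example.

```python
def max_pool2d(image, pool_size=2):
    """Keep only the maximum of every pool_size x pool_size patch."""
    out_h = len(image) // pool_size
    out_w = len(image[0]) // pool_size
    return [[max(image[r * pool_size + i][c * pool_size + j]
                 for i in range(pool_size) for j in range(pool_size))
             for c in range(out_w)]
            for r in range(out_h)]


feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]
pooled = max_pool2d(feature_map)   # [[6, 8], [3, 4]]
```

A 4*4 input becomes a 2*2 output, so the data passed to the next layer shrinks to a quarter of its size without any learned values being involved.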

In the same way, you can see 4 consecutive sets of {Conv->Relu->Maxpool} in AlexNet. Each set is also referred to as a convolution layer. In the initial sets, minor features such as lines and curves are learned; in the later sets, major features like an eye, ear, nose, or tail of the dog are learned.

How do you describe a max-pooling layer in Keras?

from keras.layers import MaxPooling2D
model.add(MaxPooling2D(pool_size=(2, 2)))  # model is the Sequential model being built
The above lines of code are the way to represent a max-pooling layer in Keras. In this case the pool size is 2*2.

What is FC block or fully connected block?

Finally, after several convolutional and max-pooling layers, all the features are passed through fully connected layers. The high-level reasoning and decision-making in the neural network is done by the fully connected layers. Each neuron in a fully connected layer is connected to all neurons in the previous layer, as in a regular non-convolutional model.

Convolutions capture a better representation of the data, and hence we don’t need to do feature engineering (selecting and creating features manually as part of pre-processing).

After feature extraction we need to classify the data into various classes, this can be done using a fully connected (FC) neural network. The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected to the output layer, adding a fully-connected layer is a way of learning non-linear combinations of these features.

If two fully connected (Dense) layers are stacked one after another, every node of the first FC layer is connected to every node in the following FC layer. Features extracted in the previous layers are propagated through these layers. Yes, this layer also performs a mathematical operation. A huge amount of data processing goes on here. Features might become repetitive over the course of training, which can cause overfitting (over-learning the same feature). We place a layer called a dropout layer between two FC layers to decrease overfitting.

To understand the difference between a fully connected layer and convolution layer I would recommend you to read part 2 and part 3 of the AI Starter series.

Input, hidden, output layers are fully connected layers where every node in the current layer is connected to every node in the next layer.
from keras.layers import Dense
model.add(Dense(1000))  # model is the Sequential model being built
The above lines of code are the way to represent a fully connected layer, which is called a Dense layer in Keras. Here 1000 is the number of nodes in the fully connected layer.
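
What "every node is connected to every node" means in arithmetic can be sketched with plain Python lists. The weights and biases below are made-up numbers, not learned values, and `dense_forward` is a name used only for this illustration; the activation function is left out to keep the example short.

```python
def dense_forward(inputs, weights, biases):
    """Compute one output per node: weighted sum of all inputs plus a bias.

    weights[j][i] is the connection from input i to output node j.
    """
    return [sum(w * x for w, x in zip(node_weights, inputs)) + bias
            for node_weights, bias in zip(weights, biases)]


inputs = [1.0, 2.0]                      # two features from the previous layer
weights = [[0.5, -1.0],                  # three output nodes, two weights each
           [2.0, 0.0],
           [1.0, 1.0]]
biases = [0.0, 1.0, -3.0]
outputs = dense_forward(inputs, weights, biases)   # [-1.5, 3.0, 0.0]
```

With 2 inputs and 3 nodes this layer holds 2 * 3 + 3 = 9 learned values, which is why fully connected layers with many nodes become large quickly.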

What is a dropout layer?

Dropout is a technique for addressing the problem of overfitting. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting and overfitting too much. Generally, use a small dropout value of 20%-50% of neurons with 20% providing a good starting point.

You are likely to get better performance when dropout is used on a larger network, because it gives the model more of an opportunity to learn independent representations.

from keras.layers import Dropout
model.add(Dropout(0.5))  # model is the Sequential model being built
The above lines of code are the way to represent a dropout layer in Keras. Here 0.5 stands for dropping 50 percent of the neurons.
Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.
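
Here is a plain-Python sketch of what happens to a layer's outputs during training when dropout is applied. It assumes the common "inverted dropout" variant, in which the surviving values are scaled up by 1 / (1 - rate) so that their expected sum stays the same; the function name `dropout` and the seed are choices made for this illustration.

```python
import random


def dropout(activations, rate, seed=None):
    """Zero each activation with probability `rate` (0 <= rate < 1), rescale the rest."""
    rng = random.Random(seed)
    keep_probability = 1.0 - rate
    return [value / keep_probability if rng.random() >= rate else 0.0
            for value in activations]


thinned = dropout([1.0] * 100, rate=0.5, seed=1)
# Every entry is now either 0.0 (dropped) or 2.0 (kept and rescaled).
```

At prediction time dropout is switched off and all units are used.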

Now, the last fully connected layer has a number written on it: 1000. It means there are 1000 output nodes, as the network is trained for 1000 classes. As the rule says, we calculate the loss using a loss function to find out how far the prediction is from the true value. The loss function can be any of those available in Keras. Every one of these functions takes two input arguments:

  • y_true: True labels.
  • y_pred: Predictions.

We will dig deep into loss function when we actually build a model in part 2.

After the 1000 nodes, we have a layer called a softmax layer. It is a function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat 0.08, and a horse 0.02. For binary classification, we use a sigmoid function, which gives the probability of one class.
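
Both functions are short enough to write out. The sketch below uses only the standard library; the scores passed in are made-up numbers standing in for the raw outputs of the last fully connected layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score avoids overflow and does not change the result.
    exponentials = [math.exp(score - largest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def sigmoid(score):
    """Turn one raw score into the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


probabilities = softmax([2.0, 1.0, 0.1])   # highest score gets highest probability
coin_flip = sigmoid(0.0)                   # 0.5: the model is undecided
```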


How can we reduce the loss and increase the model accuracy?

The answer is optimizers. The optimizer reduces the loss by finding the rate of change of the loss with respect to the learned parameters in each layer. The learned parameters of each layer are then updated according to the calculated rate. These learned parameters are also called weights.

The job of every optimizer is to minimize the loss by updating the weights so that the loss becomes as small as possible, ideally close to zero.

List of the optimizers and how fast they reduce the loss. As you can see, Adadelta is the fastest here. But driving the loss to zero quickly is not all that matters; the result should also be reliable.

As you are just starting with these concepts, I recommend beginning with gradient descent, gradient descent with momentum, and RMSprop. We will see how to use them in parts 2 and 3. Among all these concepts, the learning rate is an important term to understand.

The learning rate controls how fast the weights are updated, that is, how fast the model learns. It is a small real value such as 0.1, 0.001 or 0.0001. The right value is found by experimentation. A naive method is to try out a bunch of numbers and use the one that seems to work best, manually decreasing it over time when training no longer improves the loss.
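
The effect of the learning rate is easy to see on a toy problem. The sketch below runs plain gradient descent on the made-up loss (w - 3)^2, whose minimum is at w = 3, and compares three learning rates; the function name and the numbers are choices made for this illustration only.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimize the toy loss (w - 3)^2 and return the final weight."""
    weight = start
    for _ in range(steps):
        gradient = 2 * (weight - 3.0)          # derivative of (w - 3)^2
        weight -= learning_rate * gradient     # step against the gradient
    return weight


good = gradient_descent(0.1)       # ends very close to 3
too_small = gradient_descent(0.001)   # still far from 3 after 50 steps
too_large = gradient_descent(1.1)     # overshoots more on every step and diverges
```

A value that is too small learns slowly, and a value that is too large never settles, which is why the learning rate is usually tuned by experiment.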

Forward and backward propagation are repeated until the loss stops decreasing, ideally approaching zero (predicted result == true value).

The above image is the complete summary of what happens inside a machine learning model.

Step-4: Compile the model

Now let us combine the things we have learned until now.

model.compile(loss='binary_crossentropy',  # model is the Sequential model built above
              optimizer='rmsprop',
              metrics=['accuracy'])
The above code is the way you combine the concepts of loss, optimizer and metrics in Keras.

You can compile a network (model) as many times as you want. You need to compile the model again if you wish to change the loss function, optimizer or metrics.

You need a compiled model to train (because training uses the loss function and the optimizer). But it is not necessary to compile the model when testing it or predicting on new data.

I think this blog gives the best detailed explanation of the basic loss (mean squared error) and optimization (gradient descent); please read it.

I would always suggest brushing up on your basics of mathematics. Here are links to brush up on your calculus in five minutes.

“Binary cross entropy” is the loss function used in the above code block. The job of the loss function is to calculate the difference between the values predicted by the model being trained and the true values. This difference is called the loss; the smaller it is, the better. The behavior of the loss tells the model what must change so that the loss can be reduced.

Cross-entropy is commonly used to quantify the difference between two probability distributions. Usually, the “true” distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

For example, suppose for a specific training instance, the label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is, therefore:

Pr(Class A)  Pr(Class B)  Pr(Class C)
0.0          1.0          0.0

You can interpret the above “true” distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.

Now, suppose your machine learning algorithm predicts the following probability distribution:

Pr(Class A)  Pr(Class B)  Pr(Class C)
0.228        0.619        0.153

How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines. Use this formula:

$$H(p, q) = -\sum_{x} p(x) \ln q(x)$$

Where p(x) is the true (wanted) probability and q(x) is the predicted probability. The sum is over the three classes A, B, and C. In this case, the loss is about 0.480:

H = - (0.0*ln(0.228) + 1.0*ln(0.619) + 0.0*ln(0.153)) ≈ 0.480

So that is how “wrong” or “far away” your prediction is from the true distribution.
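The same calculation can be sketched in a few lines of Python. The function names `cross_entropy` and `binary_cross_entropy` are assumptions made for this example and are not Keras functions; the binary version is the two-class special case that the 'binary_crossentropy' loss is named after.

```python
import math

def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum over classes of p(x) * ln(q(x)); terms with p(x) = 0 add nothing."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

def binary_cross_entropy(y_true, p_pred):
    """Two-class case: y_true is 0 or 1, p_pred is the predicted probability of class 1."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

p_true = [0.0, 1.0, 0.0]          # one-hot label: class B
q_pred = [0.228, 0.619, 0.153]    # predicted distribution from the example

loss = cross_entropy(p_true, q_pred)
print(f"cross-entropy: {loss:.3f}")   # about 0.48

# A binary problem with true label 1 and predicted probability 0.619 gives the same value
binary_loss = binary_cross_entropy(1, 0.619)
print(f"binary cross-entropy: {binary_loss:.3f}")
```

Only the predicted probability of the true class contributes, so a more confident correct prediction gives a smaller loss.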

Before trying to understand any other optimizer, such as the “rmsprop” used here, I would ask you to understand gradient descent first. I highly recommend this video and Andrew Ng's Machine Learning course to understand the relationship between loss functions and optimizers. Anything I write here would fall short of what these topics deserve.

In brief, RMSprop is often preferred over plain gradient descent or gradient descent with momentum because it scales each weight's step by a running average of recent squared gradients, which tends to speed up mini-batch learning.
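For intuition only, here is a minimal sketch of the RMSprop update rule on a single weight. The hyperparameter values and the toy loss (w - 3)^2 are assumptions for illustration; this is not how Keras implements its optimizer internally.

```python
import math

def gradient(w):
    """Derivative of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

def rmsprop_minimize(w, learning_rate=0.01, rho=0.9, epsilon=1e-7, steps=1000):
    avg_sq_grad = 0.0   # running average of squared gradients
    for _ in range(steps):
        g = gradient(w)
        # Blend the newest squared gradient into the running average
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # Divide the step by the root of that average, so step size adapts per weight
        w -= learning_rate * g / (math.sqrt(avg_sq_grad) + epsilon)
    return w

w_final = rmsprop_minimize(0.0)
print(f"weight after RMSprop: {w_final:.3f}")
```

Because the gradient is divided by its own recent magnitude, the step size stays roughly constant whether the gradient is large or small, which is what helps on noisy mini-batches.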


We use accuracy as the metric here; the higher it is, the better. There are good reasons to track a metric alongside the loss function. The loss function works hand in hand with the optimizer to update the model: without it the model could never be optimized, because it would not know by how much (the derivative, or rate) to adjust its weights.

Metrics, on the other hand, play no role in optimization. Please read this. It clearly states that metrics only measure the fitness of your model against other measures (accuracy, recall, F1 score, cosine distance) so that you can observe how they change across epochs.

An epoch is a unit of training. If you have 1000 images in your training dataset and every image has undergone forward and backward propagation once, we say one epoch is complete. If epochs = 50, all the training images undergo forward and backward propagation 50 times.

Recall, precision and F1 score are often used instead of accuracy, because accuracy can be misleading. For example, when the classes are heavily imbalanced, a model can predict the class with the most training data for every input and still achieve a high classification accuracy. It would be wrong to expect that accuracy to carry over to test data, because we know the training data was imbalanced across classes.
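A tiny made-up example makes the problem visible. The 95/5 split and the "always predict the majority class" model below are hypothetical, chosen only to show how accuracy and recall can disagree.

```python
# Hypothetical imbalanced test set: 95 "not cat" images (0) and 5 "cat" images (1)
y_true = [0] * 95 + [1] * 5

# A lazy model that always predicts the majority class "not cat"
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the cat class: of all real cats, how many did the model find?
cats_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = cats_found / sum(t == 1 for t in y_true)

print(accuracy, recall)  # 0.95 0.0
```

The model scores 95% accuracy without ever recognizing a cat, which is exactly the situation the extra metrics are meant to expose.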

The above image is from my notes. It is an image to understand the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. A confusion matrix is a summary of prediction results on a classification problem.

There are four things that you must learn:

I made an effort to create this image. It shows how the system performs when we want to understand how well it has been trained for the class cat, given any test image.

Suppose we want to analyze how the model performs for the class cat, where the input test image is a cat image and the expected prediction is also cat.

True positive (TP): The condition was present and the test/prediction says it was present. The test image was a cat and the prediction also says it is a cat.

False positive (FP): The condition was absent and the test/prediction says it was present. The test image was not a cat but the prediction says it is a cat.

False negative (FN): The condition was present and the test/prediction says it was absent. The test image was a cat but the prediction says it is not a cat.

True negative (TN): The condition was absent and the test/prediction says it was absent. The test image was not a cat and the prediction says it is not a cat.

I know this might be a bit confusing if you are reading it for the first time. But I would recommend that you understand it properly, because it becomes very important when you tune your hyperparameters: it gives you a precise summary of your model's performance.

Now, let us discuss the definition of accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
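The four counts and the metrics built from them can be computed by hand. This is a sketch with made-up labels; the helper name `confusion_counts` is an assumption for this example and not part of Keras.

```python
def confusion_counts(y_true, y_pred, positive="cat"):
    """Count TP, FP, FN, TN, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1      # said cat, was cat
        elif pred == positive and truth != positive:
            fp += 1      # said cat, was not cat
        elif pred != positive and truth == positive:
            fn += 1      # said not cat, was cat
        else:
            tn += 1      # said not cat, was not cat
    return tp, fp, fn, tn

# Made-up test labels and predictions
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "cat", "dog", "dog", "cat", "dog", "dog", "dog"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything called cat, how much really was cat
recall = tp / (tp + fn)      # of all real cats, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)        # 2 1 2 3
print(accuracy)              # 0.625
```

Precision, recall and F1 are all built from the same four counts, so once the confusion matrix is understood the other metrics follow directly.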

Step -5 Fit the model

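batch_size = 16
epochs = 50
no_train_img = 1001
no_validation_img = 800
# train_dir is a placeholder: the training folder path is not shown here
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=batch_size,
    class_mode='binary')  # assumed, to match binary_crossentropy
# validation_generator is assumed to be built the same way from the validation folder
model.fit_generator(
    train_generator,
    steps_per_epoch=no_train_img // batch_size,  # step counts must be whole numbers
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=no_validation_img // batch_size)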

This is the switch to start the engine of all that we have read from the beginning. The first argument accepts the training data which is generated (data is augmented) on the fly while training.

Performing data augmentation is a form of regularization, enabling our model to generalize better. However, applying data augmentation implies that our training data is no longer “static” — the data is constantly changing. Each new batch of data is randomly adjusted according to the parameters supplied to ImageDataGenerator.

Internally, Keras is using the following process when training a model with .fit_generator :

  1. Keras calls the generator function supplied to .fit_generator (in this case, flow_from_directory).
  2. The generator function yields a batch of size batch_size to the .fit_generator function.
  3. The .fit_generator function accepts the batch of data, performs backpropagation (training), and updates the weights in our model.
  4. This process is repeated until we have reached the desired number of epochs.

We must keep in mind that a Keras data generator is meant to loop infinitely — it should never return or exit.

Since the function is intended to loop infinitely, Keras has no ability to determine when one epoch stops and a new epoch begins.

Therefore, we compute the steps_per_epoch value as the total number of training data points divided by the batch size. Once Keras hits this step count it knows that it’s a new epoch.
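The idea of an endless generator paired with a step count can be sketched without Keras. Everything here is a stand-in: `batch_generator` is a hypothetical generator, the "images" are just integers, and the inner loop only marks where a real fit loop would do forward and backward propagation.

```python
def batch_generator(samples, batch_size):
    """Yield batches forever, wrapping around the dataset when it runs out."""
    start = 0
    while True:
        batch = [samples[(start + i) % len(samples)] for i in range(batch_size)]
        start = (start + batch_size) % len(samples)
        yield batch

samples = list(range(64))     # stand-in for 64 training images
batch_size = 16
epochs = 3
steps_per_epoch = len(samples) // batch_size   # 4 batches make one epoch

gen = batch_generator(samples, batch_size)
batches_seen = 0
covered_all = []              # per epoch: did we see every sample exactly once?
for epoch in range(epochs):
    seen_this_epoch = []
    for step in range(steps_per_epoch):
        batch = next(gen)     # a real fit loop would train on this batch here
        seen_this_epoch.extend(batch)
        batches_seen += 1
    covered_all.append(sorted(seen_this_epoch) == samples)

print(batches_seen, covered_all)
```

The generator never signals the end of the data; the step count alone tells the loop where one epoch ends and the next begins.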

It should typically be equal to the number of unique samples in your dataset divided by the batch size. You may also want to refer to the link.

steps_per_epoch = TotalTrainingSamples / TrainingBatchSize
validation_steps = TotalValidationImages / ValidationBatchSize

Basically, both variables say how many batches you will yield per epoch.
This makes sure that at each epoch:

  • You train on exactly your entire training set
  • You validate on exactly your entire validation set
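With the numbers used in this step, the division does not come out even, so it is worth seeing what rounding does. This sketch only does the arithmetic; whether rounding up covers the leftover images depends on the generator yielding a smaller final batch, which I believe the Keras directory iterator does, but treat that as an assumption to verify.

```python
import math

batch_size = 16
no_train_img = 1001
no_validation_img = 800

# Rounding down gives whole batches only and leaves a few images unused each epoch
steps_floor = no_train_img // batch_size
images_skipped = no_train_img - steps_floor * batch_size

# Rounding up adds one more (partial) batch so every image is visited
steps_ceil = math.ceil(no_train_img / batch_size)

# 800 divides evenly by 16, so there is nothing to round
validation_steps = no_validation_img // batch_size

print(steps_floor, images_skipped, steps_ceil, validation_steps)  # 62 9 63 50
```

Either choice gives an integer step count; which one to use depends on whether skipping a handful of images per epoch matters for your dataset.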

Step -6 Evaluate the model

After fitting the model to the dataset, the model needs to be evaluated. It is important to evaluate on an unseen test dataset so that we obtain an unbiased (or as close to unbiased as possible) picture of how well the model performs on data it has never been trained on. The evaluate() function in Keras expects two arguments, the test inputs and the test labels, and returns the loss followed by the metrics given to compile().

model.evaluate(X_test, Y_test)

Step -7 Summary


Learn by doing: Print the summary of the AlexNet model

Run the code below to print the model on your console.

Summary is an extremely developer-friendly feature. It lets you see what your model looks like in the command window. As the name suggests, it gives you a summary of your model with detailed information about what is happening in each layer. To understand it fully you need to dig deeper and understand something called learnable parameters. But for now, you can enjoy looking at what gets printed in your command window. You are good to go.
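As a first taste of learnable parameters, the numbers in the "Param #" column of a Keras summary can be checked by hand for simple layers. The sketch below uses the standard formulas for convolution and dense layers with a bias term; the layer sizes are those of the commonly described AlexNet (an 11x11 first convolution with 96 filters on RGB input, and a final 4096-to-1000 dense layer) and are an assumption, since the model you print may differ.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units

first_conv = conv2d_params(11, 11, 3, 96)   # assumed AlexNet-style first layer
last_dense = dense_params(4096, 1000)       # assumed AlexNet-style output layer

print(first_conv)   # 34944
print(last_dense)   # 4097000
```

If the numbers printed by your own summary differ, the layer sizes in your model differ from the ones assumed here.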

Closing note:

In this article, we learned how to analyze and prepare a dataset, create a model, understand the basics of a machine learning model, and compile, fit and predict results. I hope I could give you an idea of how to build your first convolutional neural network in Keras. In parts 2 and 3 you will learn to train and test your CNN model for multiclass and binary classification. I assure you it will be really fun and easy. Above all, I highly recommend Andrew Ng's Machine Learning course; there is no substitute for it. See you all soon. Please share your feedback on this article; it will encourage me and help me improve my work.

I am extremely happy to collaborate with Shubham Shrey, a dexterous graphic designer. Special thanks to him for creating this thoughtful logo for my AI Starter series.





Computer Vision contributor. Lead Data Scientist @ Love Data Science.