Classifying images of Cats and Dogs using Convolutional Neural Networks (CNNs)
Is it a dog? Is it a cat? The Cats vs. Dogs problem is a computer vision task that seems trivial when a human distinguishes between the two pets. Computationally, however, it was only solved with satisfying accuracy a few decades ago, following Yann LeCun's invention of Convolutional Neural Networks. In this walkthrough, we will tackle the cat and dog classification problem by building, training, and testing a CNN model from scratch.
What are Convolutional Neural Networks?
Convolutional networks are a special form of multi-layered neural networks whose architecture is inspired by that of the visual cortex of mammals.
Their design follows the discovery of visual mechanisms in living organisms. These artificial neural networks (also called convolutional neural networks, or CNNs) are able to categorize information from the simplest to the most complex. Convolutional neural networks have many applications in image recognition, video recognition, and natural language processing. They consist of a multilayer stack of neurons (mathematical functions with several adjustable parameters) that each preprocess small amounts of information. Convolutional networks are characterized by their first convolutional layers (usually one to three). A convolutional layer is based, as its name suggests, on the mathematical principle of convolution, and seeks to identify the presence of a pattern (in a signal or an image, for example).
For an image, the first convolutional layer can detect the outlines of objects (e.g. a circle), the second convolutional layer can combine those outlines into objects (e.g. a wheel), and subsequent layers (not necessarily convolutional) can use this information to distinguish a car from a motorcycle. A learning phase on known objects makes it possible to find the best parameters by showing the machine, for example, thousands of images of a house, a car, or a train. As with ordinary neural networks, the layer parameters are determined by backpropagation of the gradient: cross-entropy is minimized during the training phase. In the case of CNNs, however, these parameters correspond in particular to the features of the images.
Preparation: Connect to Kaggle and download data
We use the Cats vs. Dogs image dataset from Kaggle, a zip file of more than 800 MB. There are two ways to get the dataset into Colab: 1. download it locally, unzip the file, and re-upload it to Google Colab, or 2. connect Kaggle directly to Colab. The first way is obviously more time-consuming than the second, so we will quickly walk through how to connect Kaggle with Google Colab before getting into the modeling. Otherwise, the later steps will not work properly.
1. Download API token from Kaggle
Sign in to your Kaggle account and, from the Account settings page, create a new API token as shown in the screenshot. This downloads the Kaggle API token (a kaggle.json file) to your local computer.
2. Upload Kaggle API token to Google Drive
After downloading the API token, we upload it to Google Drive in a folder named "Kaggle".
3. Mount Google Drive
In this step, we mount Google Drive and change the working directory to the folder that contains the Kaggle JSON file.
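A minimal sketch of this step in Colab, assuming the token was uploaded to a folder named "Kaggle" at the top level of My Drive:

```python
import os
from google.colab import drive

# Mount Google Drive into the Colab filesystem
drive.mount('/content/gdrive')

# Change into the folder that holds kaggle.json
os.chdir('/content/gdrive/My Drive/Kaggle')

# Tell the Kaggle CLI where to look for the API token
os.environ['KAGGLE_CONFIG_DIR'] = '/content/gdrive/My Drive/Kaggle'
```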
4. Download and unzip dataset
Next, we find the Dogs vs. Cats dataset on Kaggle and copy the API command, as shown in the screenshot.
Following that, we paste the API command into Colab to download the zip file to the working directory in Google Drive, which we selected in the previous step. "!ls" helps to check the contents of the working directory after the zipped files have been removed.
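A sketch of these commands (assuming the standard "Dogs vs. Cats" competition slug; copy the exact command from the dataset page if yours differs):

```python
# Download the competition data with the Kaggle CLI
!kaggle competitions download -c dogs-vs-cats

# Unzip the archives, then remove the zip files to save space
!unzip -q dogs-vs-cats.zip
!unzip -q train.zip
!unzip -q test1.zip
!rm *.zip

# Check the contents of the working directory
!ls
```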
Environment Setup
Before we begin, we organize and import all the packages needed later in one place.
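One possible set of imports covering everything used in this walkthrough (a sketch; the original import cell is not shown):

```python
import os
import random

import numpy as np
import matplotlib.pyplot as plt
import cv2

from sklearn.model_selection import train_test_split

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
```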
Exploring Images
We first set up the directories for the training and testing data and count the images in each. In our case, the training dataset contains 25000 labeled images, and the testing dataset contains 12500 unlabeled photos.
The idea is to use 80% of the training set to train the model and the remaining 20% to validate the model's accuracy. We then choose some unlabeled testing images at random to showcase the trained model and see whether it correctly classifies them.
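Counting the images might look like this (a sketch; the directory names match what train.zip and test1.zip unpack to):

```python
# Directories created by unzipping train.zip and test1.zip
TRAIN_DIR = 'train'
TEST_DIR = 'test1'

train_filenames = os.listdir(TRAIN_DIR)
test_filenames = os.listdir(TEST_DIR)

print('Number of training images —', len(train_filenames))
print('Number of testing images —', len(test_filenames))
```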
Number of training images — 25000
Number of testing images — 12500
After that, we plot a sample of the data for both cats and dogs.
We first explore the training samples by displaying some cat images. The photos come in different sizes and image qualities, and some contain more than one cat or include humans. However, cats look really cute! What about dogs?
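A sketch of how such a grid of cats could be plotted (the training files are named like "cat.0.jpg" and "dog.0.jpg"):

```python
# Select the cat images by their filename prefix
cat_files = [f for f in train_filenames if f.startswith('cat')]

# Plot a 3x3 grid of randomly chosen cats
plt.figure(figsize=(8, 8))
for i, fname in enumerate(random.sample(cat_files, 9)):
    img = cv2.imread(os.path.join(TRAIN_DIR, fname))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
    plt.subplot(3, 3, i + 1)
    plt.imshow(img)
    plt.axis('off')
plt.show()
```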
We can also display dog images with a minor change in the filename, and dogs also look adorable.
Image Processing
We start by standardizing the image size and specifying the category index for cats and dogs: cats are 0 and dogs are 1, according to the index.
We use a helper function to return the training images and labels, resizing each image to 60 by 60 in grayscale.
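The original helper is not shown here, but a minimal sketch consistent with the description (60 by 60, grayscale, label parsed from the filename prefix) could be:

```python
IMG_SIZE = 60
CATEGORIES = {'cat': 0, 'dog': 1}

def load_training_data(directory):
    """Load images as 60x60 grayscale arrays with 0/1 labels."""
    images, labels = [], []
    for fname in os.listdir(directory):
        img = cv2.imread(os.path.join(directory, fname),
                         cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        images.append(img)
        labels.append(CATEGORIES[fname.split('.')[0]])  # "cat" or "dog"
    return np.array(images), np.array(labels)

X, y = load_training_data(TRAIN_DIR)
```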
We can follow the same procedure for the testing images, except for the label extraction step, since the testing images are unlabeled.
The last step before building our model is to standardize the images and split the training dataset into training and validation sets, using an 80/20 split. As a result, there will be 20000 training images and 5000 validation images.
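A sketch of the normalization and split (pixel values scaled to [0, 1], a channel dimension added for Keras, and a fixed random_state assumed for reproducibility):

```python
# Scale pixel values to [0, 1] and add a channel dimension
X = X.reshape(-1, IMG_SIZE, IMG_SIZE, 1) / 255.0

# Hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape)  # (20000, 60, 60, 1)
print(X_val.shape)    # (5000, 60, 60, 1)
```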
Building the CNN model
We start with a simple CNN model, used as the prediction baseline, with two blocks that each compose one Conv2D layer and one MaxPooling layer. Then we move on to more complex CNN models: one contains three convolutional blocks to extract features from the photos, and the other adds regularization on top of that model. By comparing validation accuracies, we hope to demonstrate how a CNN model can evolve to improve its predictions. The modeling process is shown in the chart above.
Basic Model
For reference, we built a baseline model with simple CNN layers. As mentioned in the introduction, CNN models are made up of convolutional, padding, and pooling layers. We built the CNN in its simplest version, using two convolutional layers and two pooling layers, without any extra regularization or normalization. Across the picture, we applied convolutional layers with kernel size 3 and "same" padding, which allows high-level features to be extracted without reducing dimensionality. Next, we used max pooling to reduce the computational burden on the model, while also acting as a denoising layer that suppresses noisy pixels in the feature map. After two rounds of feature extraction, we flatten the feature maps into a layer of neurons and use backpropagation to tune the model parameters, as in other neural networks. Finally, the softmax classification activation function converts the output vector of numbers into a vector of probabilities.
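A minimal sketch of such a baseline (the 32/64 filter counts and the 128-unit hidden dense layer are assumptions; the rest follows the description above):

```python
basic_model = keras.Sequential([
    # Block 1: convolution (kernel size 3, "same" padding) + max pooling
    layers.Conv2D(32, kernel_size=3, padding='same', activation='relu',
                  input_shape=[IMG_SIZE, IMG_SIZE, 1]),
    layers.MaxPooling2D(pool_size=2),
    # Block 2: convolution + max pooling
    layers.Conv2D(64, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    # Flatten the feature maps and classify with softmax
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
```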
To train and assess the model, we must specify the loss function, the optimizer, and the metrics. Since we have a classification problem, we employ cross-entropy as the loss function, use the NAdam optimizer, and quantify model performance by prediction accuracy.
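In Keras this compilation step might look like the following (sparse categorical cross-entropy is assumed here because the labels are the integers 0 and 1):

```python
basic_model.compile(
    loss='sparse_categorical_crossentropy',  # cross-entropy on 0/1 labels
    optimizer='nadam',
    metrics=['accuracy'])
```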
After constructing the model, we are ready to fit it on the training set. In addition to the plain fit, we implemented early stopping to monitor accuracy and loss: if performance does not improve for 4 epochs, we stop the training to save compute time.
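A sketch of the fit with early stopping (monitoring validation accuracy with a patience of 4 epochs; the epoch budget and restore_best_weights are assumptions):

```python
# Stop training when validation accuracy fails to improve for 4 epochs
early_stopping = EarlyStopping(monitor='val_accuracy', patience=4,
                               restore_best_weights=True)

history = basic_model.fit(
    X_train, y_train,
    epochs=30,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping])
```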
We recorded loss and accuracy for both the training and validation sets during training, which can be seen in the line chart. Training accuracy peaked at 98 percent. The validation accuracy, on the other hand, did not increase over time, which is why training lasted only 7 epochs. The large gap between training and validation accuracy points to heavy overfitting. Furthermore, the validation and training losses do not converge in the end. All of this suggests that the model needs further improvement.
Three-block CNN model
Specifically, in our case:
1. The first two convolutional layers are identical in terms of filter count, which is 32. Additionally, the input shape is [60, 60, 1], since the images are 60 × 60 in grayscale.
2. Then we have a max-pooling layer with a pool size of 2, so it divides each spatial dimension by a factor of 2.
3. The same structure is then repeated twice more: two convolutional layers (with 64 and then 128 filters) followed by a max-pooling layer.
4. Notice how the number of filters increases as we go up the CNN toward the output layer (first 32, then 64, then 128): this is because the number of low-level features is often rather small, while there are many ways to combine them into higher-level features.
5. It is a common practice to double the number of filters after each pooling layer: since the pooling layer divides each spatial dimension by a factor of 2, we can afford to double the number of feature maps in the following layer without exploding the number of parameters, memory usage, or computational load.
6. We must flatten the inputs to the dense network, since it requires a 1D array of features for each instance.
7. Finally, a dense output layer is attached at the end to generate the prediction, as shown in the sketch after this list.
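A sketch of the architecture described in the list (filter counts as listed; the 128-unit hidden dense layer is an assumption):

```python
three_block_model = keras.Sequential([
    # Block 1: two 32-filter convolutions + max pooling
    layers.Conv2D(32, kernel_size=3, padding='same', activation='relu',
                  input_shape=[IMG_SIZE, IMG_SIZE, 1]),
    layers.Conv2D(32, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    # Block 2: two 64-filter convolutions + max pooling
    layers.Conv2D(64, kernel_size=3, padding='same', activation='relu'),
    layers.Conv2D(64, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    # Block 3: two 128-filter convolutions + max pooling
    layers.Conv2D(128, kernel_size=3, padding='same', activation='relu'),
    layers.Conv2D(128, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    # Flatten to a 1D feature array and classify
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
```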
Three-block CNN model with regularization
The structure is similar to the previous model. We further adjust it by introducing additional layers to reduce the overfitting and vanishing-gradient problems observed in the previous models.
Firstly, we stack a batch normalization layer (BN layer) between the two 2D convolutional layers of each block. The BN layer zero-centers and normalizes each input, then scales and shifts the result using one vector for scaling and another for shifting. Meanwhile, the algorithm estimates the mean and standard deviation of the input over the current batch. However, the added complexity can increase the runtime. (Géron, 2019)
The other adjustment to the model is a dropout layer in each block. Dropout layers temporarily ignore some neurons in the current training step; an ignored neuron may become active again in the following step. One of the most significant benefits of dropout is reducing overfitting and improving model prediction. As suggested in "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow", we use two dropout layers with a dropout rate of 50% (40%-50% is suggested in the book) in the beginning and ending blocks, and we add another dropout layer with a 20% rate in the second block of the model.
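A sketch consistent with this description (interpreting the "ending block" as the dense classifier head; a BN layer sits between the two convolutions of each block, and the filter counts carry over from the previous model):

```python
reg_model = keras.Sequential([
    # Block 1: BN between the two convolutions, then 50% dropout
    layers.Conv2D(32, kernel_size=3, padding='same', activation='relu',
                  input_shape=[IMG_SIZE, IMG_SIZE, 1]),
    layers.BatchNormalization(),
    layers.Conv2D(32, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    layers.Dropout(0.5),
    # Block 2: same pattern with 64 filters and a 20% dropout rate
    layers.Conv2D(64, kernel_size=3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.Conv2D(64, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    layers.Dropout(0.2),
    # Block 3: same pattern with 128 filters
    layers.Conv2D(128, kernel_size=3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.Conv2D(128, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),
    # Classifier head with 50% dropout
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(2, activation='softmax'),
])
```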
Model Comparison
After building all three models above, we also create an additional model, "Model 4", which uses parameters available on Kaggle. Model 1 is the base model we built initially; it has the lowest validation accuracy and the highest loss among all models. The validation accuracy rises to its high point with the regularized model, reaching 89 percent, and the loss likewise falls to its low end. Model 4 has dropout layers just like Model 3, but one of the most significant differences is in the batch normalization layers: the BN layers in our model are stacked between two convolutional layers, whereas Model 4's architecture lacks a BN layer.
Result Demonstration
This block of code demonstrates the prediction power of model 3. The first step feeds all of the testing data into model 3 to make predictions, or classifications. The second step selects 25 photos at random from the testing dataset. The remaining code plots the selected photos and attaches the predicted label to the x-axis. To our eyes, photos 7 and 13 in the plot are too hazy to tell whether they show a dog or a cat, so we cannot judge whether the model's predictions there are correct or incorrect. However, the rest of the model's predictions look accurate to us.
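The original code is not reproduced here; a sketch of such a demonstration, assuming reg_model is the regularized model (model 3) and reusing the constants from earlier, might look like:

```python
# Preprocess the test images with the same pipeline as training
test_images = []
for fname in test_filenames:
    img = cv2.imread(os.path.join(TEST_DIR, fname), cv2.IMREAD_GRAYSCALE)
    test_images.append(cv2.resize(img, (IMG_SIZE, IMG_SIZE)))
X_test = np.array(test_images).reshape(-1, IMG_SIZE, IMG_SIZE, 1) / 255.0

# (1) Feed all of the testing data into model 3 for classification
predictions = reg_model.predict(X_test).argmax(axis=1)

# (2) Select 25 photos at random and plot them with predicted labels
LABELS = {0: 'cat', 1: 'dog'}
plt.figure(figsize=(10, 10))
for i, j in enumerate(random.sample(range(len(X_test)), 25)):
    plt.subplot(5, 5, i + 1)
    plt.imshow(X_test[j].squeeze(), cmap='gray')
    plt.xlabel(LABELS[predictions[j]])
    plt.xticks([]); plt.yticks([])
plt.show()
```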