Object recognition is a popular computer-vision research field focused on imaging applications. Think of having an image that you want a computer to recognize automatically: what does it depict? Flowers, cats, dogs, cars, or something else? This technology has many real-world applications, such as traffic-sign detectors that guide drivers and handwriting-recognition systems.

In this article, a brief literature survey of object recognition and some benchmark datasets will be discussed. In short, you will see how computers recognize objects, how scientists have approached the problem, and which datasets you can use to test your own techniques.

If you are a data analyst or researcher, you will need object recognition. Even if you are simply a technophile, you will pick up some industry insights from this article. This article does not cover the state of the art; it is a starter that presents robust techniques used to recognize objects in images.

Key Concepts



By reading this article you will learn:

  • What is Object Recognition?
  • What are Benchmark datasets?

What is Object Recognition? 

Recognizing objects in images is a mature research area, with in-demand industrial applications and competitive research funding devoted to building the most accurate systems. Do you remember the old security CAPTCHAs? Aren't they similar to the images in Figure 1?

Figure 1- SVHN dataset

Yes, but the ones depicted here are more likely to be recognized by software. The images in figure 1 come from "The Street View House Numbers (SVHN)" dataset. SVHN consists of more than 600,000 digit images with a common structure. Hence, it is widely used in research and is considered a benchmark dataset: scientists use it to compare their results against those of other methods and to tune the parameters of their algorithms, so they can decide whether it is worth proceeding or developing other ideas.

To understand how objects are actually recognized today, we first need to understand what researchers considered possible approaches in the early days of this technology.

Edge detection is one of the earliest computer-vision techniques used to approach object recognition.


Figure 2- Edge Detection

Figure 2 illustrates an image along with its edge-detected version. Don't think that this is a trivial task: brightness, contrast, and smoothing all affect the mathematical filters used to detect the edges of objects in an image.
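To make this concrete, here is a minimal sketch of gradient-based edge detection in pure Python, using the well-known Sobel kernels. The tiny 5-by-5 image is made up for illustration; real systems use optimized libraries such as OpenCV.

```python
# 3x3 Sobel kernels for horizontal and vertical intensity gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to every interior pixel of a 2-D list."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(kernel[i][j] * image[y - 1 + i][x - 1 + j]
                            for i in range(3) for j in range(3))
    return out

def edge_magnitude(image):
    """Approximate gradient magnitude sqrt(Gx^2 + Gy^2) per pixel."""
    gx = convolve3x3(image, SOBEL_X)
    gy = convolve3x3(image, SOBEL_Y)
    return [[(gx[y][x] ** 2 + gy[y][x] ** 2) ** 0.5
             for x in range(len(image[0]))] for y in range(len(image))]

# Toy 5x5 image: dark left half, bright right half -> a vertical edge.
img = [[0, 0, 255, 255, 255] for _ in range(5)]
edges = edge_magnitude(img)
# The strongest responses sit on the dark/bright boundary; flat regions give 0.
```

Notice how the response vanishes where brightness is constant: this is exactly why contrast and smoothing matter so much to these filters.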

Other early recognition techniques were based on brute-force matching between the image under consideration and a knowledge base of stored images: the software compares the observed image against a huge number of previously stored images, trying all transformations and rotations of the observation until it matches one of them.

With the introduction of real-time processing in industry, the number and size of image datasets increased dramatically over the last decade, along with the demand for processing speed. Traditional computer-vision methods that depend on hand-crafted mathematics are no longer a good fit. So machine learning found its way into computer-vision research, where the challenge became developing innovative machine-learning-based recognizers. Classification is one of the core tools of machine learning and among the most commonly used techniques for better recognition; k-nearest neighbors, support-vector machines, boosted decision trees, and many others served as cornerstones of many research efforts. These methods vary in accuracy and performance, but they have one thing in common: hand-crafted mathematics.

This limitation was overcome by a breakthrough from two scientists, Yann LeCun and Yoshua Bengio, who resorted to the power of nature, taking a step similar to that of Geoffrey Hinton, often called the father of neural networks. LeCun and Bengio drew on the neurobiological visual system, which inspired them to propose a new computational structure called the CNN.

CNN stands for Convolutional Neural Network, a model of neural networks. These techniques are neuro-biologically inspired: they imitate nature, especially the human brain; how its cortex processes data such as images, scents, and sounds, and how memory works. A CNN specifically emulates the animal visual cortex, one of the most powerful perception systems known. Since a CNN is basically a neural network with sophisticated pre-processing layers, we first need the big picture of how neural networks work in order to understand what a CNN actually is.

A neural network, as a brain emulator, consists of multiple layers: an input layer, some hidden layers, and an output layer, where each layer has many sub-units called neurons. Neurons, where all the computations occur, are the basic units that undergo stimulation, in analogy with your physical neurons. Think back to when you touched a hot pot by mistake: you instantly jerked your hand away from the hot object, didn't you? It was your brain's neurons that were stimulated to make your hand muscles pull away as quickly as possible. That stimulation propagates through your brain's neural network via synapses.


Figure 3- Biological Neural Network

See figures 3 and 4 to compare a biological neural network with a computerized one.


Figure 4- Neural Network

Now, let's get back to our computerized neural network in figure 4. From left to right, the yellow nodes form an input layer of 4 neurons, followed by 5 hidden layers of 4, 5, 6, 4, and 3 neurons respectively. To the extreme right, there is an output layer in pink.
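That layer arrangement can be sketched as a forward pass in plain Python. The hidden-layer widths follow figure 4; the output width of 2, the random weights, and the sigmoid activation are illustrative assumptions, not values from the figure.

```python
import math
import random

random.seed(0)

# Layer widths loosely following figure 4: a 4-neuron input layer and
# five hidden layers of 4, 5, 6, 4 and 3 neurons. The final width (2)
# is an assumed output-layer size for illustration.
LAYER_SIZES = [4, 4, 5, 6, 4, 3, 2]

# Random weights and biases; a real network would learn these by training.
weights = [[[random.uniform(-1, 1) for _ in range(n_in)]
            for _ in range(n_out)]
           for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
biases = [[random.uniform(-1, 1) for _ in range(n_out)]
          for n_out in LAYER_SIZES[1:]]

def sigmoid(z):
    """Squash a neuron's weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    """Propagate an input vector layer by layer to the output layer."""
    activation = x
    for w_layer, b_layer in zip(weights, biases):
        activation = [sigmoid(sum(w * a for w, a in zip(w_row, activation)) + b)
                      for w_row, b in zip(w_layer, b_layer)]
    return activation

output = forward([0.1, 0.5, 0.9, 0.3])  # one activation per output neuron
```

Each layer's activations are just weighted sums of the previous layer's activations passed through a squashing function, which is the computerized analogue of the stimulation described above.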

This is what happens during processing. The computer feeds input to the neurons in the input layer; the signal propagates to the next layer, and the process continues until it reaches the output layer. Many calculations are carried out during this propagation. However, the input first has to be pre-processed into a special format. Hence, a CNN consists of convolutional layers and pooling layers (subsampling layers), followed by a fully connected neural network; this pre-processing stage is similar to what happens in the retina of the eye. Figure 5 illustrates that complex structure. Note: in a fully connected neural network, every neuron in a layer is connected to every neuron in the preceding layer, with no exception.


Figure 5- Typical CNN
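The convolution, pooling, and fully connected stages of figure 5 can be sketched in a few lines of pure Python. The toy 6-by-6 input, the single 3-by-3 kernel, and the fixed fully-connected weights are all made up for illustration; a real CNN learns many filters per layer.

```python
# A minimal sketch of the CNN pipeline in figure 5:
# convolution -> pooling (subsampling) -> fully connected layer.

INPUT = [[(x + y) % 4 for x in range(6)] for y in range(6)]  # toy 6x6 "image"
KERNEL = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]                  # one 3x3 filter

def convolve(image, kernel):
    """Valid convolution: slide the kernel over every 3x3 window."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(k) for j in range(k))
             for x in range(size)] for y in range(size)]

def max_pool2(fmap):
    """2x2 max pooling: keep only the strongest response per window."""
    return [[max(fmap[y][x], fmap[y][x + 1],
                 fmap[y + 1][x], fmap[y + 1][x + 1])
             for x in range(0, len(fmap[0]) - 1, 2)]
            for y in range(0, len(fmap) - 1, 2)]

feature_map = convolve(INPUT, KERNEL)   # 6x6 input -> 4x4 feature map
pooled = max_pool2(feature_map)         # 4x4 -> 2x2 after subsampling
flat = [v for row in pooled for v in row]

# Fully connected stage: every flattened value feeds every output neuron.
fc_weights = [[0.1] * len(flat), [-0.1] * len(flat)]  # 2 output neurons
scores = [sum(w * v for w, v in zip(row, flat)) for row in fc_weights]
```

Note how each stage shrinks the representation: the retina-like convolution and subsampling compress the image before the fully connected part makes the final decision.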

Now, let's return to our main topic of discussion, recognizing objects. To recognize an object in an image, the observed image is first stored as pixels, i.e., an array of integers. It is then fed to the CNN to undergo retina-like pre-processing followed by visual-cortex-like computerized stimulation, producing an output that indicates the category of the object depicted in the observed image. These output categories must be pre-configured for the CNN in use. This configuration is done automatically by training the network on a huge collection of pre-labeled images that form a dataset. At this point, the main procedure is almost complete, but what is a dataset, and which datasets can be used?
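As a small illustration of the final step, the network's output vector is scored against the pre-configured classes by picking the strongest activation. The class names and output values below are hypothetical.

```python
# Hypothetical pre-configured categories for a trained recognizer.
CLASSES = ["cat", "dog", "car", "flower"]

def recognize(output_vector):
    """Pick the class whose output neuron fired the strongest."""
    best = max(range(len(output_vector)), key=output_vector.__getitem__)
    return CLASSES[best]

print(recognize([0.05, 0.10, 0.80, 0.05]))  # -> car
```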

What are Benchmark datasets?

As discussed before, you have to provide a huge collection of pre-labeled images that form a dataset. To be called a benchmark, a dataset has to be publicly recognized as well structured, a good fit, and commonly used ("publicly", here, refers to the research community across the globe in academia). The object recognition datasets we are going to explore are subsets of such benchmarks and are available for public use by industry developers:

  1. MNIST Dataset
  2. SVHN Dataset
  3. CIFAR


These datasets are essentially subsets of much larger datasets dedicated to academic institutes. Scientists in academia found that this technology had become mature enough for industry, so they created publicly available subsets that development communities can use to build such skills. Let's start exploring the datasets mentioned above.

First up is the MNIST database.

 MNIST Database

The MNIST (Modified NIST) dataset is the outcome of extensive research done by three famous, highly ranked researchers [1][2][3].

The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.


It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.


MNIST's official website shows that the error rate of recognizing its digits has gradually decreased from 12% in 1998 to 0.23% in 2012, the latter achieved using a specifically structured CNN.
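For readers who want to inspect the raw files, the MNIST site distributes images in the simple IDX binary format; a minimal stdlib-only parser is sketched below. The magic number 2051 (0x00000803) is the documented marker for IDX3 image files; the sample bytes here are synthetic, standing in for a downloaded file.

```python
import struct

def parse_idx_images(data):
    """Decode an IDX3 image file into a list of 2-D pixel grids.

    Layout: big-endian magic number, image count, rows, cols,
    then one unsigned byte per pixel, row by row.
    """
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 0x00000803, "not an IDX3 image file"
    pixels = data[16:]
    images = []
    for n in range(count):
        base = n * rows * cols
        images.append([[pixels[base + r * cols + c] for c in range(cols)]
                       for r in range(rows)])
    return images

# Synthetic stand-in: one 2x2 "image" with pixel values 0, 64, 128, 255.
sample = struct.pack(">IIII", 0x00000803, 1, 2, 2) + bytes([0, 64, 128, 255])
imgs = parse_idx_images(sample)
```

The same header-then-pixels layout applies to the real train-images-idx3-ubyte file, just with 60,000 images of 28 by 28 pixels.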

Figure 6- MNIST samples

See a sample of those images in figure 6. For more details, point your browser to http://yann.lecun.com/exdb/mnist/ [4].

Next is one of the famous benchmark datasets, created by Stanford.

 The Street View House Numbers (SVHN) Dataset

Research conducted at Stanford showed that a new dataset would help meet the growing demand for accurate object recognition.

“SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.”

You can find a sample of SVHN in figure 7.

Figure 7- SVHN samples

SVHN's structure can be briefly described as follows:

  1. SVHN’s structure has 10 classes, one for each digit.
  2. Digit ‘1’ has label 1, ‘9’ has label 9 and ‘0’ has label 10.
  3. 73,257 digits are dedicated to training the model.
  4. 26,032 digits are for testing one's trained model (a CNN in our case).
  5. Besides these, there are 531,131 extra, somewhat less difficult samples that can be added to the training data.
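The label convention in point 2 above trips up many first-time users: label 10 means digit 0. A tiny helper makes the convention explicit.

```python
def svhn_label_to_digit(label):
    """Convert an SVHN class label (1..10) to the digit it depicts.

    SVHN labels digits 1-9 as 1-9, but digit '0' gets label 10.
    """
    if not 1 <= label <= 10:
        raise ValueError("SVHN labels run from 1 to 10")
    return 0 if label == 10 else label

print([svhn_label_to_digit(l) for l in (1, 9, 10)])  # -> [1, 9, 0]
```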

SVHN images come in two formats:

  • Original images with character-level bounding boxes.
  • MNIST-like 32-by-32 images centered on a single character (many of the images do contain some distractors at the sides).

For more details, point your browser to http://ufldl.stanford.edu/housenumbers/ [5].

Finally, let's explore one more interesting benchmark.


 CIFAR Dataset

The CIFAR dataset comes in two versions: CIFAR-10 and CIFAR-100.

“The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.”

The CIFAR-10 dataset can be described as follows:

  1. 60,000 32-by-32 color images.
  2. 10 classes, with 6,000 images per class.
  3. 50,000 training images and 10,000 test images.
  4. Five training batches and one test batch, each with 10,000 images.
    1. The test batch contains exactly 1,000 randomly-selected images from each class.
    2. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.
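The Python version of CIFAR-10 ships each batch as a pickled dict whose 'data' rows hold 1024 red, then 1024 green, then 1024 blue values per image. Below is a sketch of reading one batch; the batch here is a synthetic in-memory stand-in for a real file such as data_batch_1 (real files should be unpickled with encoding='bytes').

```python
import pickle

def load_batch(raw):
    """Unpickle a batch and split each 3072-value row into R, G, B planes."""
    batch = pickle.loads(raw)
    images, labels = batch[b"data"], batch[b"labels"]
    planes = [(row[:1024], row[1024:2048], row[2048:]) for row in images]
    return planes, labels

# Synthetic stand-in for a batch file: one flat 3072-value "image" of class 3.
fake = pickle.dumps({b"data": [list(range(3072))], b"labels": [3]})
planes, labels = load_batch(fake)
r, g, b = planes[0]  # each plane is a 32x32 image stored row-major
```

Reshaping each 1024-value plane into 32 rows of 32 pixels recovers the image shown in figure 8's style.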

For more details, point your browser to https://www.cs.toronto.edu/~kriz/cifar.html [6].

Figure 8- CIFAR sample images

In the figure above, you can see images of a variety of objects: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The images vary in color, object position, and type: different car models, different species of cats, and so on. You can also see notable differences in backgrounds: some birds appear in a forest against green leaves and trees, others against a blue sky, and others are photographed close up, showing only the face and neck.


I hope you now understand what object recognition is and how it is achieved nowadays via software. Scientists are inspired by the brain's mechanisms to structure neural networks, and you have seen how a CNN uses retina-like pre-processing to enable object recognition. Besides this, you now understand how scientists use benchmark datasets to develop their solutions and what these datasets have in common.


[1] "Yann LeCun," November 2017. [Online]. Available: http://yann.lecun.com/. [Accessed 2017].
[2] "Corinna Cortes," November 2017. [Online]. Available: https://research.google.com/pubs/author121.html. [Accessed 2017].
[3] "Christopher J. C. Burges," November 2017. [Online]. Available: http://chrisburges.net/. [Accessed 2017].
[4] "MNIST Database," [Online]. Available: http://yann.lecun.com/exdb/mnist/. [Accessed 2017].
[5] "SVHN," Stanford University, [Online]. Available: http://ufldl.stanford.edu/housenumbers/. [Accessed November 2017].
[6] "CIFAR," University of Toronto, [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html. [Accessed 2017].


Author: rofaelemil

A junior R&D Machine Learning Engineer with a BSc in computer and systems engineering and a solid algorithmic, machine learning, and data science background; an enthusiastic reader with curiosity to learn new ideas and techniques; a creative, innovative, and self-motivated researcher with experience in C/C++, Java, C#, Python, R, Big Data, Tableau, and Scala.