How Quality Data Sets Help Machines Learn

Just like the fuel that runs the engine of a vehicle, a data set is a key ingredient for solving a given problem using machine learning or deep learning algorithms. This article takes a quick look at open source data sets for ML and DL, with a focus on TensorFlow and PyTorch.

A great deal of research is under way in machine learning, and many applications are being built with it. Machine learning (ML) algorithms automatically transform data into useful representations for the task at hand. These operations could be co-ordinate changes, linear projections (linear regression), translations (SVM), transformations (PCA), and so on. These algorithms are not usually creative; they simply search a hypothesis space for a useful representation.

Deep learning (DL) is a specific field of machine learning which emphasises learning in successive layers to get more meaningful representations of data. The ‘deep’ in deep learning refers to the number of successive layers used in a model, that is, its depth.

Supervised machine learning problems are broadly of two kinds: regression and classification. Suppose we are required to predict the cost of a house. Here, we need a numeric value, so this is a regression problem. However, sometimes we need to take a “Yes” or “No” decision. This is a binary classification problem. At other times, we need to predict one of several classes. Consider a problem where we are required to identify an animal as a cat, dog, horse, etc. This is a multi-class classification problem.

Whenever we get the data for a problem for supervised learning, we get attributes and labels. Attributes are independent variables with which we must predict the labels or dependent variables. Attributes are called inputs, predictors, features, or independent variables. Labels are called outputs, targets, outcomes, or dependent variables. If the labels are categorical, the problem becomes a classification problem; if these are numerical, it becomes a regression problem.
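The rule in the last sentence can be sketched in a few lines of plain Python. This is only an illustration: the function name problem_type and the sample labels are made up for this example. It also assumes that categorical labels are stored as strings, so integer class IDs (which many data sets use) would wrongly be treated as numerical here.

```python
from numbers import Number


def problem_type(labels):
    """Guess the supervised task from the label values (a simplification)."""
    distinct = set(labels)
    # bool is a Number in Python, so exclude it explicitly
    numeric = all(
        isinstance(value, Number) and not isinstance(value, bool)
        for value in distinct
    )
    if numeric:
        return "regression"
    if len(distinct) == 2:
        return "binary classification"
    return "multi-class classification"


# Hypothetical labels for the three examples discussed above
house_prices = [250000.0, 310500.0, 189900.0]
loan_approved = ["Yes", "No", "Yes", "Yes"]
animals = ["cat", "dog", "horse", "cat"]

for labels in (house_prices, loan_approved, animals):
    print(problem_type(labels))
```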

Importance of data sets for ML and DL
Machine learning or deep learning tries to learn patterns from the available data using algorithms. Data is the core of any ML/AI algorithm. Training data is the key input from which the algorithm learns the patterns that it later uses for prediction. It is the backbone of the entire machine learning project; without it, a model cannot be trained. In the case of supervised machine learning, raw data collected from various sources needs to be labelled or annotated, usually by humans, before it can be used for training.

Availability of such annotated data is very important to train and validate the algorithms. The quality and size of the data set are both very important for training the learning method in order to obtain good results and high performing models. Raw data cannot be directly used for training; it needs to be cleaned. If the data is not good, a considerable amount of time and effort needs to be spent on preprocessing it into a quality data set. Hence data sets play a predominant role in the training of machine learning and deep learning models.
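As a rough illustration of what cleaning raw data can involve, the sketch below drops samples with missing values and exact duplicates. The field names (area, price), the records and the function clean_records are all hypothetical; real preprocessing usually involves many more steps.

```python
def clean_records(records, required=("area", "price")):
    """Drop rows with missing required fields and exact duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        if any(row.get(key) is None for key in required):
            continue  # incomplete sample
        fingerprint = tuple(row[key] for key in required)
        if fingerprint in seen:
            continue  # duplicate of an earlier sample
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Hypothetical raw data: one incomplete row and one duplicate
raw = [
    {"area": 120, "price": 250000},
    {"area": None, "price": 310500},
    {"area": 120, "price": 250000},
    {"area": 95, "price": 189900},
]
cleaned = clean_records(raw)
print(len(raw), "raw rows,", len(cleaned), "clean rows")
```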

A quality data set saves the time and effort needed for machine learning and deep learning researchers and data scientists. A few years back everyone would generate data from different sources required for the learning algorithms. There were no standard data sets available and these were stored in private and public locations of the user’s choice. It was difficult to generalise the insights, make developments to learning algorithms, and enable data sets to be explored by wider audiences. As machine learning and deep learning technologies advanced, their frameworks started providing built-in data sets as part of their packages. This would enable users to access similar data sets, save the effort and time required for data preprocessing, and direct their knowledge to building high performing and complex models that provided advanced insights. Today, machine learning and deep learning frameworks like Scikit-learn, TensorFlow, and PyTorch (among many others) provide built-in data sets for almost all types of data like text, image, audio and video, with sufficient samples for training, validation and testing of models.

Libraries with built-in data sets
Many libraries nowadays come with built-in data sets that can be used to train machine learning algorithms. Built-in data sets prove to be very useful when it comes to practising ML algorithms, when you need some small yet sensible data to apply the techniques and get your hands dirty. Many modules in Python contain common data sets such as the popular ‘Iris’ data, the MNIST digits data and the Boston housing price data set.

Scikit-learn comes with a few small standard data sets that do not require one to download any file from an external website. Similarly, TensorFlow provides the TensorFlow Datasets (TFDS) collection. PyTorch also provides many data sets for use in machine learning and deep learning. There are many data sets available but in this article, we will only discuss the usage of TensorFlow and PyTorch data sets.

Types of data sets available
There are different types of data sets available, as listed below. The type of data set used depends on the problem at hand.

Table 1: A few more large data sets

Data set name | Brief description | Samples | Format | Usage
Omniglot | Different (around 1623) handwritten characters from 50 different alphabets | 38300 | Text, images | Classification, one-shot learning
Places365 | There are 1.8 million training images from 365 scene categories in the Places365-Standard data set | 2 million | Images | Object classification, scene recognition tasks
SBD (Semantic Boundaries Data set) | It currently contains annotations for each image, and provides both category-level and instance-level segmentations and boundaries. The segmentations and boundaries provided are for the 20 object categories in the VOC 2011 data set. | 11355 images taken from the PASCAL VOC 2011 data set | Images | Models for semantic contours prediction and semantic segmentation
STL10 | This is an image recognition data set for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. | 100,000 unlabelled images for unsupervised learning, with 10 classes; images are 96×96 pixels in colour | Images | Unsupervised learning
SVHN (Street View House Numbers) | This is a real-world image data set for developing machine learning and object recognition algorithms with minimal requirements of data preprocessing and formatting. SVHN is obtained from house numbers in Google Street View images. | Over 600,000 images | Images | Image classification
UCF101 | This is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. It is an extension of the UCF50 data set, which has 50 action categories. | 13320 videos from 101 action categories | Videos | Action recognition
VOC (the PASCAL Visual Object Classes) | This data set consists of real images of complex scenes, including scale, pose, lighting and occlusion, for 20 classes, with complete annotation for all objects. | Around 10,000 images | Images | Object recognition and detection

Audio data: This is useful in machine learning and deep learning tasks such as speech recognition and emotion recognition. Audio data is also used to help identify diseases: recordings of people coughing can be used to detect Covid-19, and recordings of people talking can be used to classify those with Alzheimer’s disease.

Image data: This is used for classification and object detection. Image data available with TensorFlow such as COCO and Wider-face can be used for object detection. Wider-face is a data set of images with pictures of people, and it can be used in deep learning to identify faces. Image data can also be used for deep learning algorithms such as segmentation, which can be used for self-driving cars.

Text data: Text data can be used for various natural language processing tasks such as movie review classification, fake news detection, summarisation, question answering, topic detection, transcript summarisation, and detection of action items in emails. An example of text data is the collection of email messages sent by employees of Enron Corporation.

Video data: Video data can be used for object detection in videos and for video segmentation. An example is the DAVIS (Densely Annotated VIdeo Segmentation) data set from TFDS (TensorFlow Datasets).

Translation data sets: Labelled data is also available for translation. These data sets can be used for machine translation tasks for different languages.

Given below are a few of the standard built-in data sets that are provided as part of the deep learning and machine learning frameworks. Each framework provides a superset of these data sets. The data sets listed below showcase the diversity provided by these frameworks so that users can exploit them to their advantage for building models catering to their domain.

CelebA (CelebFaces Attributes) is a large-scale face attributes data set with more than 200,000 celebrity images, each with 40 attribute annotations. It can be used for face attribute recognition, face detection and landmark (or facial part) localisation.

CIFAR-10 (Canadian Institute for Advanced Research) data set consists of 60,000 32×32 colour images, categorised into 10 classes with 6000 images per class. It can be used for image classification and computer vision tasks.

Cityscapes data set consists of 25,000 samples of video sequences recorded in street scenes, with pixel-level annotations. It can be used for classification and object detection.

COCO (Common Objects in Context) data set consists of images of daily scenes of common objects in their natural context. It has around 2.5 million labelled object instances across about 328,000 images, which can be used for object recognition, classification and caption generation.

Fashion-MNIST (Modified National Institute of Standards and Technology) data set consists of images similar to the MNIST data set but from fashion product databases. It has 60,000 training images and 10,000 test images, and can be used for image classification.

HMDB (Human Motion Database) is collected from a variety of sources, most of which are movies. A small proportion has also been obtained from public databases such as the Prelinger archive, YouTube, and Google videos. It has around 7000 clips divided into 51 action categories, each containing at least 101 clips, and can be used for object recognition and action detection.

Kinetics is a collection of large-scale data sets of URL links of up to 650,000 video clips that cover various human action classes, depending on the data set version. The videos include human-object interactions and human-human interactions. Each action class has at least 400/600/700 video clips. Each clip is annotated, and can be used for object recognition and action detection.

LSUN data set contains around one million labelled images for each of the 10 scene categories and 20 object categories. It can be used for understanding scenes with many ancillary tasks like room layout estimation, saliency prediction, etc.

Table 1 lists a few more large data sets that are available today.

Using the TensorFlow data sets (TFDS)
TFDS provides a collection of ready-to-use data sets for TensorFlow, JAX and other Python machine learning frameworks.

The following categories of data sets are available in TFDS: audio, image, image classification, object detection, questions and answers, structured, summarisation, text, translate, video and vision language. Each of these data sets can be used for a variety of machine learning and deep learning tasks.

TFDS can be installed from either of two packages:

pip install tensorflow-datasets: This is the stable version, released every few months.

pip install tfds-nightly: Released every day, it contains the latest versions of the data sets.

We can then import TensorFlow and TFDS using the following commands:

import tensorflow as tf
import tensorflow_datasets as tfds

All data set builders are a subclass of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders().

The following example, taken from the TensorFlow Datasets documentation, explains the usage of the MNIST data set from TFDS.

1) Load MNIST
Load with the following arguments.

  • shuffle_files: The MNIST data is only stored in a single file, but for larger data sets with multiple files on disk, it’s a good practice to shuffle them when training.
  • as_supervised: Returns a tuple (img, label) instead of a dict {'image': img, 'label': label}.

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

2) Build training pipeline
Apply the following transformations.

  • ds.map: TFDS provides the images as tf.uint8, while the model expects tf.float32; so normalise the images.
  • ds.cache: As the data set fits in memory, cache before shuffling for better performance.
  • ds.shuffle: For true randomness, set the shuffle buffer to the full data set size.
    Note: For bigger data sets that do not fit in memory, a standard value is 1000 if your system allows it.
  • ds.batch: Batch after shuffling to get unique batches at each epoch.
  • ds.prefetch: It’s a good practice to end the pipeline by prefetching, for performance.

def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label

# Normalise, cache, shuffle over the full training set, batch and prefetch
ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
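To see why the shuffle buffer size matters, the shuffle and batch steps can be imitated in plain Python. This is a simplified sketch of the idea, not TensorFlow's implementation; the function names normalize, buffered_shuffle and make_batches are made up for this illustration.

```python
import random


def normalize(pixels):
    """Scale 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in random order, choosing only among a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and put the new element in its place
    rng.shuffle(buffer)          # drain whatever is left
    yield from buffer


def make_batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


samples = list(range(20))
# Small buffer: an element can only move a few positions forward
print(list(buffered_shuffle(samples, buffer_size=3, seed=0)))
# Buffer as large as the data set: any ordering is possible
print(list(buffered_shuffle(samples, buffer_size=len(samples), seed=0)))
# Batch after shuffling
print([len(b) for b in make_batches(buffered_shuffle(samples, 20, seed=0), 8)])
```

With a buffer of size 3, the element emitted at position k always comes from the first k + 3 inputs, so the data is only shuffled locally. A buffer as large as the data set removes that restriction.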

3) Build evaluation pipeline
The testing pipeline is similar to the training pipeline, with one small difference: there is no ds.shuffle() call.

Caching is done after batching (as batches can be the same between epochs):

ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

4) Create and train the model
Plug the input pipeline into Keras:

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    # Dense(10) outputs raw scores (logits), so the loss is told to expect them
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=5,
    validation_data=ds_test,
)
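The final Dense(10) layer outputs one score per digit class, and the SparseCategoricalAccuracy metric checks how often the highest-scoring class matches the integer label. A plain-Python sketch of that calculation is given below; the function name and the sample scores are made up for this illustration, and this is not the Keras implementation.

```python
def sparse_categorical_accuracy(scores_per_sample, labels):
    """Fraction of samples whose highest-scoring class equals the integer label."""
    correct = 0
    for scores, label in zip(scores_per_sample, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical scores for three samples and three classes
scores = [
    [2.0, 0.5, -1.0],   # predicts class 0
    [0.1, 0.2, 3.0],    # predicts class 2
    [1.0, 1.5, 0.0],    # predicts class 1
]
labels = [0, 2, 0]      # "sparse" labels: integers, not one-hot vectors

accuracy = sparse_categorical_accuracy(scores, labels)
print(round(accuracy, 3))
```

Here two of the three predictions match their labels, so the accuracy is 2/3.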

Using the data sets of PyTorch
Like any other ML or DL framework, PyTorch also has built-in data sets that can be explored for various applications. To perform ETL (extract, transform and load) on a given data set, PyTorch provides the two main classes given below.

  • Dataset: This is an abstract class representing a data set.
  • DataLoader: This is a Python iterable over the data set; so it wraps a data set and provides access to underlying data.

The Dataset abstract class has two methods, __len__() and __getitem__(), which need to be implemented for custom data sets by extending this class. The data sets can be passed to a DataLoader object to load multiple samples in parallel, using PyTorch’s multiprocessing worker processes.

The details of how to create a new data set and use it in PyTorch are given below, using the classes discussed above.

To generate a new data set, for example “newDataset”, it needs to be extended from the built-in abstract class torch.utils.data.Dataset. After that, the initialisation __init__(), __len__() and __getitem__() functions need to be overridden and the corresponding implementation needs to be provided. A sample code to illustrate the defining of the newDataset class using the Dataset abstract class of PyTorch is given below:

import torch


class newDataset(torch.utils.data.Dataset):
    # Initialization function
    def __init__(self, listIDs, labels):
        self.labels = labels      # mapping from sample ID to label
        self.listIDs = listIDs    # list of sample IDs

    def __len__(self):
        # To get the total number of samples
        return len(self.listIDs)

    def __getitem__(self, index):
        # Get the ID of the required sample
        newID = self.listIDs[index]
        # Load the data and get the corresponding label
        X = torch.load('data/' + newID + '.pt')
        y = self.labels[newID]

        # Return the sample with its label
        return X, y

Now the DataLoader class of PyTorch provides an interface to use the data set generated above from the Dataset class.

# Check for GPU and enable GPU or CPU device accordingly
use_gpu = torch.cuda.is_available()
device = torch.device("cuda:0" if use_gpu else "cpu")
torch.backends.cudnn.benchmark = True

loaderParams = {'batch_size': 64, 'shuffle': True, 'num_workers': 6}

The above parameters are passed to the DataLoader class and their purpose is given below.

batch_size: This denotes the number of samples contained in each generated batch; it is commonly a power of two (8, 32, 64, 128…).

shuffle: When this is set to True, the samples are reshuffled at every epoch, so that the batches differ between epochs. This allows the model to be more robust during training. Generally, it is set to True. Setting it to False keeps the samples in the same order, so the same batches are used in every training epoch.

num_workers: This denotes the number of worker processes that generate batches in parallel; a suitable number of workers keeps data loading on the CPU from becoming a bottleneck during training.
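The effect of batch_size and shuffle can be imitated in plain Python, as sketched below. This is a single-process illustration only, so it does not model num_workers, and it is not how PyTorch implements DataLoader. The names SquaresDataset and simple_loader are hypothetical.

```python
import math
import random


class SquaresDataset:
    """Hypothetical data set: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return index, index * index


def simple_loader(dataset, batch_size, shuffle, seed=None):
    """Yield lists of samples, optionally visiting them in random order."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # a new order for this pass
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


dataset = SquaresDataset(10)

# Without shuffling, every pass over the data yields the same batches
batches = list(simple_loader(dataset, batch_size=4, shuffle=False))
print(len(batches), math.ceil(len(dataset) / 4))  # number of batches, both ways
print(batches[-1])                                # the last batch is smaller

# With shuffling, the batch contents depend on the random order
for batch in simple_loader(dataset, batch_size=4, shuffle=True):
    print(batch)
```

The loader only relies on len(dataset) and dataset[i], which is why a custom data set needs to provide __len__() and __getitem__().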

maxEpochs = 50
# Define or get the training and validation sample IDs separately and store
# them in the datapartition dictionary; similarly, store their corresponding
# labels in the datalabels dictionary.
datapartition = {'train': ['id1', 'id2', 'id3', 'id4', 'id5'], 'validation': ['id6', 'id7', 'id8', 'id9']}
datalabels = {'id1': 0, 'id2': 2, 'id3': 0, 'id4': 3, 'id5': 1, 'id6': 0, 'id7': 2, 'id8': 0, 'id9': 3}

# Get the data generation objects
trainingSet = newDataset(datapartition['train'], datalabels)
trainingGenerator = torch.utils.data.DataLoader(trainingSet, **loaderParams)

validationSet = newDataset(datapartition['validation'], datalabels)
validationGenerator = torch.utils.data.DataLoader(validationSet, **loaderParams)

# To get each batch of samples from the training dataset
for i, batch in enumerate(trainingGenerator):
    print(i, batch)

# To get each batch of samples from the validation dataset
for i, batch in enumerate(validationGenerator):
    print(i, batch)

# To get each batch of samples and their labels from the training dataset
for batch, labels in trainingGenerator:
    print(batch, labels)

# To get each batch of samples and their labels from the validation dataset
for batch, labels in validationGenerator:
    print(batch, labels)
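In the listing above, the datapartition dictionary is written out by hand. For a larger collection of sample IDs, it could be built with a random split, as in the sketch below. The function split_ids and the 40 per cent validation share are assumptions made for this illustration.

```python
import random


def split_ids(sample_ids, validation_fraction=0.2, seed=0):
    """Shuffle the IDs and split them into training and validation lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed, so the split is repeatable
    n_validation = max(1, round(len(ids) * validation_fraction))
    return {'train': ids[n_validation:], 'validation': ids[:n_validation]}


# Nine hypothetical sample IDs: id1 ... id9
all_ids = ['id%d' % i for i in range(1, 10)]
partition = split_ids(all_ids, validation_fraction=0.4)

print(len(partition['train']), 'training samples')
print(len(partition['validation']), 'validation samples')
```

Each ID lands in exactly one of the two lists, so no validation sample is seen during training.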

Data is important for machine learning and deep learning algorithms to perform better. Quality data is what is needed to get a good performing model. Acquiring data, cleaning the data for quality and annotating it is a tedious task, and consumes a considerable amount of time if started from scratch. Hence, pre-built data sets save time and allow one to think about improving the learning methods rather than concentrating on acquiring data. Most of the machine learning and deep learning frameworks or platforms, therefore, provide built-in data sets as part of the package. These data sets generally come in all flavours to be used for various applications like classification, object recognition, detection, segmentation, caption generation, sentiment analysis, emotion detection and action detection. Moving a step ahead, a few frameworks like PyTorch have improved the way the data set is handled for building models. This has significantly helped data scientists to save time, build high performing models, develop insights, and make considerable progress in deep learning and machine learning technologies.

