DevOps: A Few Best Practices for Data Science Projects


DevOps practices play a vital role in deploying data models efficiently in production environments. This article outlines four excellent DevOps practices.

DevOps comprises the best practices for integrating software development and IT operations. The main aim is to shorten the development life cycle and deliver high-quality products. People who practise DevOps use different sets of tools, known as toolchains, which are expected to fit into one or more categories of software development and delivery such as coding, building, testing, packaging, releasing, configuring and monitoring.

The aim of data science is to build data models and deploy them in production environments. DevOps plays a vital role in the data science process in order to build flexible and agile teams for efficient deployment of models. Let’s briefly look at four DevOps best practices.

Use version control
In data science projects, keeping track of code changes is essential, as these projects involve many files and complex code. DevOps tools like Git play an important role in source code management. Git is a free and open source tool capable of handling any large data science project. It is a great tool for tracking source code changes, enabling multiple developers to work together in non-linear development. Linus Torvalds developed it in 2005 for the Linux operating system. It contains powerful features for version control. Some of its important features are:

  • History tracking
  • Supports non-linear development
  • Distributed development
  • Supports collaboration
  • Scalable in nature

Git is very well suited to data science projects as it supports distributed development and collaboration. Scalability is also required for these projects. The use of such a tool in a data science project ensures faster delivery and improved quality. The workflow of Git is divided into three main states:

  • Working directory: This is where files are created and modified.
  • Index: This is responsible for staging the files and adding a snapshot of them to a staging area.
  • Repository: This performs a commit and saves the snapshot permanently in the Git directory. It is also responsible for checking out an existing version to work on.
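The three states above can be sketched as a small script that drives the `git` command line from Python. This is a minimal illustration, assuming the `git` CLI is installed; the file and commit names are hypothetical.

```python
# Walk a file through Git's three states: working directory -> index -> repository.
import subprocess
import tempfile
from pathlib import Path

def run(args, cwd):
    """Run a command in the repo and return its stdout."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = Path(tempfile.mkdtemp())
run(["git", "init"], repo)
run(["git", "config", "user.email", "ds@example.com"], repo)
run(["git", "config", "user.name", "Data Scientist"], repo)

# 1. Working directory: create or modify a file.
(repo / "model.py").write_text("print('training model')\n")

# 2. Index: stage the file, adding a snapshot to the staging area.
run(["git", "add", "model.py"], repo)

# 3. Repository: commit, saving the snapshot permanently in the Git directory.
run(["git", "commit", "-m", "Add training script"], repo)

log = run(["git", "log", "--oneline"], repo)
print(log.strip())
```

The same three steps scale from one file to an entire project; each commit becomes a checkpoint that any team member can return to.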

Predictive model markup language (PMML)
PMML is a standard for representing pre-processing and post-processing data transformations as well as predictive models. It is a great asset for data science, as it eliminates the need for custom coding when moving models among different tools and applications. Since it is a standard, it fosters transparency and best practices. PMML is vital for data science as it supports a majority of data science algorithms and techniques. It builds predictive solutions by following the steps given below.

  • Data dictionary: It identifies and defines which input fields are useful for the problem at hand.
  • Mining schema: It is vital to generate consistent data. It deals with missing values, special values, obvious errors and outliers.
  • Data transformation: This is the most vital step of PMML. It is useful for deriving features and generating relevant information from input data.
  • Model definition: This decides the parameters of the model.
  • Output definition: This defines expected outputs of the model.
  • Model explanation: This is responsible for defining performance metrics of a model.
  • Model verification: This comprises a set of input data with expected model outputs. It is responsible for testing the accuracy of the model.
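The building blocks listed above map directly onto elements of a PMML document. The following is a minimal sketch of that structure, built with Python's standard `xml.etree` module; the field and model names are hypothetical, and in practice PMML files are exported by tools rather than written by hand.

```python
# Sketch the skeleton of a PMML document: data dictionary, mining schema,
# model definition and output definition.
import xml.etree.ElementTree as ET

pmml = ET.Element("PMML", version="4.4")

# Data dictionary: identify and define the input fields for the problem.
dd = ET.SubElement(pmml, "DataDictionary")
ET.SubElement(dd, "DataField", name="age", optype="continuous", dataType="double")
ET.SubElement(dd, "DataField", name="risk", optype="categorical", dataType="string")

# Model definition with its mining schema (how each field is used,
# including missing-value handling) and output definition.
model = ET.SubElement(pmml, "TreeModel", modelName="risk_model",
                      functionName="classification")
ms = ET.SubElement(model, "MiningSchema")
ET.SubElement(ms, "MiningField", name="age", missingValueTreatment="asMean")
ET.SubElement(ms, "MiningField", name="risk", usageType="target")
out = ET.SubElement(model, "Output")
ET.SubElement(out, "OutputField", name="predicted_risk", feature="predictedValue")

xml_text = ET.tostring(pmml, encoding="unicode")
print(xml_text)
```

Because the format is plain XML following a published schema, any PMML-aware scoring engine can load such a file without knowing which tool produced it.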

Continuous integration
Continuous integration is vital for any data science project and should be done using automation tools for greater efficiency. Jenkins is one of the most widely used open source tools in this category. It is a self-contained, open source automation server used to automate tasks such as building, testing and deploying software. It has the following features.

  • Continuous integration and delivery: It acts as a continuous delivery hub for any data science project.
  • Easy installation: It is a self-contained Java based program, ready to run out of the box, with packages for multiple operating systems such as Windows, Linux, macOS and other UNIX-like systems.
  • Easy configuration: It has a Web interface for setup and configuration. This interface includes the facilities of error-checking and built-in help.
  • Plugins: It has a rich set of plugins that help to integrate with any tool for continuous delivery.
  • Distributed nature: It easily distributes work to multiple machines for faster integration and delivery.
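A Jenkins pipeline is typically defined in a declarative Jenkinsfile. As a rough sketch of what such a pipeline for a data science project might look like, the snippet below generates one as a string in Python; the stage names and shell commands are hypothetical placeholders.

```python
# Render a minimal declarative Jenkinsfile with one shell step per stage.
stages = {
    "Build": "pip install -r requirements.txt",
    "Test": "pytest tests/",
    "Deploy": "python deploy_model.py",
}

def jenkinsfile(stage_cmds):
    """Return a declarative pipeline covering the given stages in order."""
    blocks = "\n".join(
        f"        stage('{name}') {{\n"
        f"            steps {{ sh '{cmd}' }}\n"
        f"        }}"
        for name, cmd in stage_cmds.items()
    )
    return f"pipeline {{\n    agent any\n    stages {{\n{blocks}\n    }}\n}}"

print(jenkinsfile(stages))
```

Checking a file like this into the repository lets Jenkins rebuild, retest and redeploy the model automatically on every commit.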

Use containers
Containers are very useful for data science projects for numerous reasons, the most important being standardisation. A Docker container is vital for data science for two main reasons.

  • Reproducibility: Everyone runs the same OS and the same versions of the tools. So if it works on your machine, it works everywhere.
  • Portability: It eliminates setup hassles, making the move from a local computer to a supercomputing cluster easy.
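Both properties come from pinning the environment in a Dockerfile. As a minimal sketch, the snippet below assembles one in Python; the base image tag and file names are assumptions for illustration.

```python
# Assemble a minimal Dockerfile for a data science application.
dockerfile = "\n".join([
    "FROM python:3.11-slim",                # same OS and Python version everywhere
    "WORKDIR /app",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",  # pinned dependencies for reproducibility
    "COPY . .",
    'CMD ["python", "serve_model.py"]',     # hypothetical entry point
])
print(dockerfile)
```

Building an image from this file produces the same environment on a laptop, a CI server or a cluster node, which is exactly the reproducibility and portability described above.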

Docker is a great tool for creating and deploying isolated environments for running applications with their dependencies. It acts as a lightweight virtual machine. Kubernetes is another open source tool for handling clusters of containers. It provides a framework to run distributed systems efficiently and therefore is very helpful to the data science community. It offers the following features to deploy data science projects efficiently.

  • Service discovery and load balancing: Containers can be exposed using a DNS name or their IP address, and Kubernetes balances the load when traffic is high.
  • Flexibility in storage mounting: It allows automatic mounting of the storage system of your choice.
  • Rollback facility: You can automate Kubernetes to create new containers for your deployment, remove existing containers and adopt all their resources into the new container(s); if something goes wrong, the change can be rolled back.
  • Automatic bin packing: You provide Kubernetes with a cluster of nodes and specify how much CPU and memory (RAM) each container needs; Kubernetes then fits containers onto the nodes to make the best use of resources.
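The features above are declared in a Kubernetes manifest. The snippet below builds a Deployment manifest as a Python dict to show where the replica count (for load balancing) and the CPU/memory requests (for bin packing) live; the names and image are hypothetical, and in practice this would be written as YAML and applied with kubectl.

```python
# Build a minimal Kubernetes Deployment manifest for a model-serving container.
import json

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "model-server"},
    "spec": {
        "replicas": 3,  # Kubernetes spreads traffic across these pods
        "selector": {"matchLabels": {"app": "model-server"}},
        "template": {
            "metadata": {"labels": {"app": "model-server"}},
            "spec": {
                "containers": [{
                    "name": "model",
                    "image": "registry.example.com/model:1.0",
                    # Resource requests tell the scheduler how to bin-pack
                    # this container onto the cluster's nodes.
                    "resources": {
                        "requests": {"cpu": "500m", "memory": "512Mi"},
                        "limits": {"cpu": "1", "memory": "1Gi"},
                    },
                }],
            },
        },
    },
}
print(json.dumps(deployment, indent=2))
```

Pairing this Deployment with a Service object would give the DNS-based discovery and load balancing described in the first bullet.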

Deployment of data science projects is cumbersome and requires tremendous effort. DevOps based best practices help a lot in this process. In this article, we have focused on four main best practices of data science using DevOps: the use of version control, PMML, continuous integration and containers. We have described the main features of all these practices and explored open source tools and standards such as Git, PMML, Jenkins, Docker and Kubernetes. Knowledge of DevOps is essential for the efficient deployment of any data science project.
