RCloud is DevOps for Data Science

DevOps is the collaboration between the development and deployment streams of a software system. It increases productivity by reducing the time between developing software and deploying it. RCloud is a platform that speeds up data analysis insights in the same way, by reducing the time between coding and deployment.

DevOps for data science is gaining popularity because it covers infrastructure, configuration management, integration, testing and monitoring, and so shortens the path from data analysis to insight. It supports data scientists by creating integrated environments for vital tasks such as data exploration and visualisation. Data scientists need varied types of infrastructure to handle a complex project, and DevOps takes care of provisioning and configuring that infrastructure for a variety of environments.

A data science model is iterative in nature: new data arrives, the model is retrained on it, and the resulting new model has to be made available to users. To manage this cycle, the data scientist applies continuous integration and deployment practices. DevOps bridges the gap between the training environment and the model deployment environment through continuous integration and continuous deployment pipelines.
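
The retrain-and-release cycle described above can be sketched as a small promotion gate. This is only an illustration and is not RCloud code: the toy linear model, the function names (train, mean_abs_error, retrain_and_maybe_deploy) and the rule "promote only if the error on a held-out set improves" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # one (input, target) observation


@dataclass(frozen=True)
class Model:
    """A toy model y = slope * x, tagged with a version number."""
    version: int
    slope: float

    def predict(self, x: float) -> float:
        return self.slope * x


def train(data: List[Sample], version: int) -> Model:
    """Fit the slope by least squares through the origin."""
    numerator = sum(x * y for x, y in data)
    denominator = sum(x * x for x, _ in data)
    return Model(version=version, slope=numerator / denominator)


def mean_abs_error(model: Model, data: List[Sample]) -> float:
    """Average absolute prediction error of the model on the given data."""
    return sum(abs(model.predict(x) - y) for x, y in data) / len(data)


def retrain_and_maybe_deploy(
    deployed: Optional[Model],
    new_data: List[Sample],
    holdout: List[Sample],
) -> Model:
    """Train a candidate on new data; return whichever model should be live."""
    next_version = 1 if deployed is None else deployed.version + 1
    candidate = train(new_data, next_version)
    if deployed is None:
        return candidate  # nothing is live yet, so the first model ships
    # Hypothetical gate: promote only when the candidate beats the live model.
    if mean_abs_error(candidate, holdout) < mean_abs_error(deployed, holdout):
        return candidate
    return deployed
```

In a real pipeline, a check of this kind would run automatically whenever new data or code arrives. Scoring both models on the same held-out set keeps the comparison fair between iterations.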

RCloud is considered DevOps for data science because it resolves the collaboration, sharing, scalability and reproducibility issues that arise between the development and deployment of a data analysis. It is open source software, created at AT&T Labs by Simon Urbanek, Gordon Woodhull and Carlos Scheidegger. It is a Web based platform for all aspects of data science, such as analytics, visualisation and collaboration, and it uses the R language for all tasks.

Features of RCloud

  • Can be used from anywhere: It is browser based software, so if you have Internet connectivity, you can use it from anywhere.
  • Project container facility: An RCloud notebook contains all the components of a data analysis, including code, comments, equations and visualisations, along with the associated data dependencies.
  • Association feature: RCloud provides an excellent capability for association. It is browser based software that is installed on a server or on a distributed environment such as Hadoop. It gives access to all notebooks, so new users can view, copy, edit and update the analyses and visualisations with new data sets. Since users only need a browser, it is efficient and easy to use.
  • Distribution capability: RCloud is based on the concept of URL sharing. A notebook's URL can be shared at any stage of the analysis, so results reach others faster.
  • Scalability: RCloud can open parallel connections to multi-server systems. This gives the data scientist the flexibility to run Big Data packages without writing complex code.
  • User directory based access: It contains a user directory that provides access to the notebooks of every user registered with RCloud.
  • Reproducibility: A data analysis on RCloud can be verified and executed by anyone with access to the notebooks, without any concern about differences in environment.
  • Live code execution: RCloud notebooks are not static Web pages but code that is executed live.
  • Unique RCloud Web service interface: RCloud provides a unique Web service interface through which any notebook asset can be integrated with other technologies by simple means.
  • Promotes user engagement: RCloud is platform-independent. Access and control remain the same on every platform, which increases user confidence and engagement.
Figure 1: RCloud on GitHub
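
The URL sharing idea in the feature list can be illustrated with a short sketch that builds a shareable link to a notebook, optionally pinned to one saved version. The host name, the view.html page name and the notebook and version query parameters are placeholders chosen for this example and should not be read as a documented RCloud API.

```python
from typing import Optional
from urllib.parse import urlencode


def notebook_url(
    base: str,
    notebook_id: str,
    version: Optional[str] = None,
    page: str = "view.html",  # hypothetical page name
) -> str:
    """Build a shareable link to a notebook, optionally pinned to a version."""
    params = {"notebook": notebook_id}
    if version is not None:
        # Pinning a version lets the recipient see the analysis exactly
        # as it stood when the link was shared.
        params["version"] = version
    return f"{base.rstrip('/')}/{page}?{urlencode(params)}"


# Example with made-up identifiers:
latest = notebook_url("https://rcloud.example.org/", "abc123")
pinned = notebook_url("https://rcloud.example.org/", "abc123", version="f00d")
```

A link without a version would follow the notebook as it evolves, while a pinned link would keep pointing at one stage of the analysis.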

Why RCloud?
For a data science project to deliver effective results, certain information must be shared among teammates. The team must be agile enough to address new features and functionalities, and must be able to move results between the stages of the project, such as data pre-processing, exploratory data analysis, predictive modelling and visualisation. RCloud addresses these needs well because it offers association, distribution, scaling and reproducibility (ADSR) characteristics. RCloud is valuable for data science for many reasons:

  • RCloud is an open source Web based platform for data science. It makes it easy to share your ideas or work with your teammates. Figure 1 shows the RCloud components on GitHub.
  • RCloud provides a platform where the data scientist can search for relevant existing work without reinventing the wheel.
  • It provides fast interaction with data in the Hadoop Distributed File System (HDFS) or similar distributed file systems. This feature is very well suited to Big Data analytics.
  • RCloud differs from other DevOps platforms for data science in that it provides browser based access.
  • It gives a lot of flexibility to users to create any type of complex widgets, notebooks or dashboards.
  • Both registered and non-registered users can view or interact with live notebooks of the RCloud environment.
  • Communication in RCloud is done with standard communication protocols such as HTTP.
  • RCloud provides data scientists with the capability to run Big Data packages without writing complex code.
  • RCloud is well suited for Big Data applications due to its scalability feature.
  • It provides strong security features so that unauthorised clients cannot make calls to the RCloud runtime environment. RCloud notebooks can also be encrypted for additional security. Authenticated client-server channelling is another RCloud feature.
  • RCloud maintains automatic Git based trails of code modifications.

DevOps involves infrastructure, configuration management, integration, testing and monitoring. RCloud is considered DevOps for data science because it resolves the ADSR issues that arise between the development and deployment of a data analysis. In this article, we have described the main features of RCloud, with an emphasis on those that are essential for data scientists.
