This article touches upon the building blocks needed to enable machine learning on the data received from IoT devices, and how cloud infrastructure can help if we use the power of open source tools effectively.
MQTT (http://mqtt.org/) or Message Queuing Telemetry Transport is an open OASIS and ISO standard (ISO/IEC 20922), lightweight, publish-subscribe network protocol that transports messages between devices. The protocol usually runs over TCP/IP; however, any network protocol that provides ordered, lossless, bi-directional connections can support MQTT. It is designed for connections with remote locations where a ‘small code footprint’ is required or the network bandwidth is limited (Source: https://en.wikipedia.org/wiki/MQTT).
Figure 1 explains how to connect the device to the cloud. Let’s discuss the components in detail in the sections below.
The MQTT broker acts as a common point that receives messages from clients and delivers them to the clients subscribed to the relevant topics. Clients connect to the broker, publish messages to topics, and receive messages from the topics they have subscribed to. In our case, the clients are on the IoT device side and the broker resides on a cloud virtual machine. So, the cloud MQTT broker receives data from IoT devices through the topics to which the devices publish messages. The cloud can also communicate with the devices by publishing messages to topics that the devices have subscribed to.
We have a number of options here, and one popular choice is Mosquitto; you can download it from https://mosquitto.org/download/.
Eclipse Mosquitto is an open source (EPL/EDL licensed) message broker that implements the MQTT protocol versions 5.0, 3.1.1 and 3.1. It is lightweight, and is suitable for use on devices ranging from low power, single-board computers to full servers. The Mosquitto project also provides a C library for implementing MQTT clients, as well as the very popular mosquitto_pub and mosquitto_sub command line MQTT clients (Source: https://mosquitto.org/).
Eclipse Paho is a related Eclipse project, but note that it provides MQTT client libraries rather than a broker. You can get more details at https://www.eclipse.org/paho/.
An MQTT client is any program or device that connects to the broker. It can publish messages to topics and receive messages from the topics it has subscribed to; in our case, the clients run on the IoT devices. Several client libraries are available. We have already mentioned the Mosquitto C library in the earlier section (https://github.com/eclipse/mosquitto). The Eclipse Paho client for C is another option (https://www.eclipse.org/paho/clients/c/).
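The broker and client roles described above can be illustrated with a small in-memory sketch. This is not a real MQTT implementation: ToyBroker, the topic names and the payloads are all hypothetical, only exact topic names are matched, and features of the real protocol such as QoS levels or topic wildcards are left out. It only shows how publishing to a topic reaches whoever subscribed to it, in both directions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler receives the topic name and the raw payload.
Handler = Callable[[str, bytes], None]


class ToyBroker:
    """Hypothetical in-memory stand-in for an MQTT broker (exact-match topics only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler that is called for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver the payload to all subscribers; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


broker = ToyBroker()
cloud_inbox = []   # messages received on the cloud side
device_inbox = []  # messages received on the device side

# The cloud subscribes to telemetry; the device subscribes to commands.
broker.subscribe("devices/sensor-1/telemetry", lambda t, p: cloud_inbox.append((t, p)))
broker.subscribe("devices/sensor-1/commands", lambda t, p: device_inbox.append((t, p)))

# Device-to-cloud and cloud-to-device messages both go through the broker.
broker.publish("devices/sensor-1/telemetry", b'{"temp_c": 21.5}')
broker.publish("devices/sensor-1/commands", b"reboot")
```

With a real broker such as Mosquitto, the two sides would be separate processes connected over the network, but the flow of messages through topics is the same.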
Data transfer from device to cloud
The data format and the frequency of transfer need to be planned around the business use cases. You may have to consider the points given below.
- How does the data help in day-to-day activities, and also for future planning and forecasting?
- How frequently do you need the data? This depends on how relevant the data is and how often it changes. In some cases we may need data from devices every second, and in others every hour or day.
- Data from devices may need some transformation before it is stored in a cloud database. Based on the business use cases, you may also need to display the data in your Web/mobile application.
- There may be challenges in the processing of data. Some cases demand sequential processing (one message at a time, in the order the data originated), some require parallel processing, and in some there may be a pre-condition to meet before a new set of data is processed.
- Processing the data for ML could be another area you may have to plan. Do we need to process hourly, daily, weekly or once a month?
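To make the points on data format, transformation and sequential processing concrete, here is a minimal sketch in Python. The field names (device_id, seq, ts, temp_c) and the function names are assumptions made for illustration only; your payload will depend on the device and the business use case. A per-device sequence number is one simple way to restore the order in which readings originated.

```python
import json
from datetime import datetime, timezone


def encode_reading(device_id, seq, temp_c, ts):
    """Device side: serialize one reading as a compact JSON payload (hypothetical format)."""
    message = {"device_id": device_id, "seq": seq, "ts": ts, "temp_c": temp_c}
    return json.dumps(message, separators=(",", ":")).encode("utf-8")


def to_row(payload):
    """Cloud side: parse a payload and transform it into a storage-ready row."""
    msg = json.loads(payload.decode("utf-8"))
    return {
        "device_id": msg["device_id"],
        "seq": int(msg["seq"]),
        # Convert the Unix timestamp into an ISO 8601 string in UTC.
        "recorded_at": datetime.fromtimestamp(msg["ts"], tz=timezone.utc).isoformat(),
        "temp_c": float(msg["temp_c"]),
    }


def in_origin_order(rows):
    """Sequential processing: sort rows by device and sequence number."""
    return sorted(rows, key=lambda r: (r["device_id"], r["seq"]))


# Two readings that arrive out of order are put back in sequence.
late = to_row(encode_reading("sensor-1", 2, 21.5, 60))
early = to_row(encode_reading("sensor-1", 1, 21.0, 0))
ordered = in_origin_order([late, early])
```

Whether the transformation happens on the device, in the broker-side consumer or in a later batch job is a design choice that depends on the volume and frequency questions listed above.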
Preparing the ML infrastructure
Since we are choosing the cloud here, we have lots of options, and at the top of the list are AWS, Azure and Google Cloud. All these cloud providers have IoT and ML specific infrastructure and tools, but these can be costly and you may not need them in the initial stages. Instead, we can create a normal virtual machine (VM) and choose the memory, CPU, disk, etc, based on the data and transaction volume.
Given below are the tools and frameworks needed for an Apache Spark based ML infrastructure.
- Spark 2: Comes with the necessary tools from the Spark ecosystem (Hadoop integration, MLlib, etc). You can get more details from https://spark.apache.org/docs/latest/.
- Hadoop: You can either install Spark 2 bundled with Hadoop, or install Hadoop separately.
- Python3/Scala/Java: This depends on the language in which you prefer to write ML programs.
- PostgreSQL/MongoDB: Install one of these if you have to store data in a traditional database, rather than in Hadoop HDFS, for future use and reference.
- MLlib/TensorFlow/Keras/scikit-learn: Choose from these ML libraries based on your needs.
- Data analytics tools: Choose these based on your needs.
The above list is based on the Spark ecosystem; you may have to pick and choose based on the tools and frameworks that you are familiar with or that are relevant to your business and technology choices.
Common ML use cases in IoT
Given below are a few common use cases based on the data received from devices (strictly based on my experience and may differ in your business case).
- Analysis of data patterns for a specific period: As an example, if data comes from a temperature sensor, the temperature pattern at the location where the device is installed can be analysed over a specific period, such as a day.
- Missing data, changes in frequency, changes in pattern, etc: It is important to detect missing data or changes in frequency, because they may call for immediate action and can otherwise lead to errors in our analytics/forecasts.
- Inactivity or other anomalies in data flow: These need to be detected so that they do not cause errors in data processing.
- Difference between the forecasted and real data: This may lead to a correction in data models and training.
- User and location behaviour from device to device: Data may differ from device to device when user and location behaviour influence the readings.
- Frequency of maintenance and its root causes: This may be specific to location, usage, transaction volume, etc.
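Two of the use cases above, detecting missing data and comparing forecasted with real data, can be sketched in a few lines of plain Python. The function names, the 1.5x tolerance and the sample values are assumptions chosen for illustration; a production pipeline would more likely run such checks in Spark or a similar framework.

```python
from statistics import mean


def find_gaps(timestamps, expected_interval_s, tolerance=1.5):
    """Return (start, end) pairs where consecutive readings are too far apart.

    A gap is reported when the spacing exceeds expected_interval_s * tolerance.
    """
    ordered = sorted(timestamps)
    limit = expected_interval_s * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def mean_absolute_error(actual, forecast):
    """Average absolute difference between real and forecasted values."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return mean(abs(a - f) for a, f in zip(actual, forecast))


# A device expected to report every 60 seconds skipped two readings.
gaps = find_gaps([0, 60, 120, 300, 360], expected_interval_s=60)

# Forecasted versus real temperatures for three intervals.
error = mean_absolute_error([20.0, 22.0, 21.0], [19.0, 22.0, 23.0])
```

A growing forecast error or a rising number of gaps for one device can then be used as a trigger for retraining the model or for checking the device itself.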
Importance of security
Security is critical if you are handling data. Here are a few things to take care of:
- Enable SSL/TLS while transferring data from the device to the cloud, to make sure the data is encrypted and secured.
- Enable proper security in the cloud (for example, restricted network access and strong authentication) to avoid potential data breaches.
- Secure the Big Data store with proper user and group roles, and restrict access to the data per customer and client.
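As a small illustration of the SSL/TLS point, the sketch below builds a client-side TLS context with Python's standard ssl module. The function name and the optional CA file parameter are assumptions; how the context is handed to an MQTT client depends on the client library you choose, so that step is not shown. Port 8883 is the port conventionally used for MQTT over TLS.

```python
import ssl

# Port conventionally used for MQTT over TLS (plain MQTT uses 1883).
MQTT_TLS_PORT = 8883


def make_tls_context(ca_file=None):
    """Build a client-side TLS context that verifies the broker's certificate.

    ca_file is an optional path to a CA bundle (hypothetical parameter name);
    when it is None, the system's default trusted CAs are used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
```

The default context already requires a valid certificate and checks the host name, which is what protects the device from connecting to an impostor broker.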
This article is an attempt to give you an understanding of the integration between IoT devices and the cloud. Apache Kafka is an alternative to MQTT, but the advantage of MQTT is its lightweight, hassle-free architecture.