The recent pandemic has promoted remote work in a big way, giving a huge boost to DevOps and AIOps. This rapid, wide-scale change is creating real concerns in AIOps, DevOps, and IT service management, as organisations seek the best monitoring and incident response solutions for their now-distributed enterprises. This article discusses, among other things, the top AIOps tools on GitHub.
Today’s digital era of hybrid infrastructure, with heterogeneous commercial off-the-shelf and bespoke apps (both service-based and monolithic) running on multiple middleware and integration platforms with different technology stacks, requires IT Ops teams to provide uninterrupted IT service to their end users. This has pushed IT teams to become proactive, capturing and resolving issues before they affect users. These teams must deal with constant change while ensuring zero downtime. Seamless operation is the key to growing, competing and thriving.
Earlier, IT teams grew along with technology, from standalone to vertical scaling through distributed computing. The emergence of virtual computing led to a world of microservices and ephemeral logic with containerised scaling. Nowadays, organisations simply generate too much data for humans to monitor and understand manually or by using legacy tools.
Digital transformation is much more than digitisation of business processes. It brings in all the benefits of digital technologies to make business processes automated, agile and fast with fewer steps and activities, ensuring that the required information is available at a click for the business decision maker. This brings in the additional responsibility of ensuring that IT operations proactively identify and resolve issues. The key building blocks (scalability, availability, fault tolerance, security, cost-effectiveness and operational excellence) lead us towards IT operations with a DevOps and AIOps strategy.
AIOps is the application of artificial intelligence in IT operations
Artificial intelligence for IT operations (AIOps) brings together artificial intelligence (AI) with analytics and machine learning (ML) to automate the identification and resolution of IT operations issues.
AIOps combines Big Data and machine learning to automate or replace all primary IT operations processes, including availability and performance monitoring, event correlation and analysis, anomaly detection and causality determination, as well as IT service management and automation.
As per Gartner, AIOps refers to ‘technology platforms that use machine learning (ML) and data science to solve IT operation problems’. Gartner predicts that the share of large enterprises exclusively using AIOps and digital experience monitoring tools to monitor applications and infrastructure will rise from 5 per cent in 2018 to 30 per cent in 2023.
AIOps helps IT operations and DevOps teams to work smarter and faster with the proactive algorithmic analysis of IT data to identify digital service issues and resolve them quickly, before business operations and customers are impacted. With AIOps, Ops teams can understand the immense complexity and quantity of data generated by modern IT environments, and prevent outages, maintain uptime and attain continuous service assurance.
With IT at the heart of digital transformation efforts, AIOps lets organisations operate at the speed they need to.
Union of DevOps with AIOps
DevOps is the set of practices that combines development (Dev) and IT operations (Ops) with the union of people, processes and technology to continually provide value to customers. DevOps helps teams perform at a high level and build better products faster, improving customer satisfaction.
DevOps gives the ownership, support and success of services to the developers that write the code. Small teams of DevOps engineers, with site reliability engineering (SRE) as their primary focus, give insights to correct and strengthen the reliability and scalability of services. The challenge is to get the right people across geographically dispersed teams to understand and resolve issues while monitoring vast amounts of data.
AIOps addresses the operational challenges and covers every aspect of an organisation’s service strategy. It helps in releasing people from operations to focus on mission-critical tasks, empowering them to build improved services for better customer experiences.
AIOps in open source
Most open source AIOps projects use Python, as it is the leading programming language for machine learning. Depending on an organisation’s emphasis on operational efficiency, various AIOps and open source tools can be combined and used on AIOps platforms.
Top 5 open source AIOps tools on GitHub (based on stars)
1. SeldonIO/Seldon-core (stars: 2.2k)
This is an open source platform to deploy an organisation’s machine learning models on Kubernetes at a massive scale; it has over 2 million installs.
An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models, Seldon Core converts your ML models (TensorFlow, PyTorch, H2O, etc) or language wrappers (Python, Java, etc) into production REST/gRPC microservices. It provides advanced machine learning capabilities out-of-the-box, including advanced metrics, request logging, explainers, outlier detectors, A/B tests, canaries, and more.
2. Logpai/Loglizer (stars: 781)
Loglizer is a machine learning based log analysis toolkit for automated anomaly detection. Logs are essential in the development and maintenance process, as they allow developers and support engineers to monitor systems and track abnormal behaviours and errors. Loglizer implements a number of machine learning based log analysis techniques, with multiple supervised and unsupervised models, and covers the following steps:
- Log collection
- Log parsing
- Feature extraction
- Anomaly detection
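The four steps above can be illustrated with a minimal sketch in plain Python. It does not use Loglizer's API. The sample log lines, the masking regular expression, the helper names (`parse_event`, `build_count_vectors`, `find_anomalous_sessions`) and the naive "rare event" rule are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical log lines in the form "<session_id> <message>" (log collection).
RAW_LOGS = [
    "s1 Connected to 10.0.0.1",
    "s1 Read 512 bytes",
    "s1 Closed connection",
    "s2 Connected to 10.0.0.2",
    "s2 Read 128 bytes",
    "s2 Closed connection",
    "s3 Connected to 10.0.0.3",
    "s3 Read 64 bytes",
    "s3 Closed connection",
    "s4 Connected to 10.0.0.4",
    "s4 Timeout after 30 seconds",
    "s4 Timeout after 30 seconds",
]


def parse_event(message):
    """Log parsing: mask numbers and IP addresses to get an event template."""
    return re.sub(r"\d+(\.\d+)*", "<*>", message)


def build_count_vectors(lines):
    """Feature extraction: count each event template per session."""
    vectors = defaultdict(Counter)
    for line in lines:
        session_id, message = line.split(" ", 1)
        vectors[session_id][parse_event(message)] += 1
    return vectors


def find_anomalous_sessions(vectors, min_support=0.5):
    """Anomaly detection: flag sessions containing an event template that
    appears in fewer than `min_support` of all sessions (a naive rule)."""
    n_sessions = len(vectors)
    sessions_per_event = Counter()
    for counts in vectors.values():
        sessions_per_event.update(counts.keys())
    rare = {e for e, n in sessions_per_event.items() if n / n_sessions < min_support}
    return sorted(s for s, counts in vectors.items() if rare & counts.keys())


vectors = build_count_vectors(RAW_LOGS)
anomalies = find_anomalous_sessions(vectors)
print(anomalies)
```

Real toolkits replace the last step with trained supervised or unsupervised models; the rare-event rule here only stands in for that stage so the pipeline can be read end to end.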
3. Whylabs/Whylogs (stars: 326)
This tool profiles and monitors the ML data pipeline end-to-end, and is available in Python and Java.
Whylogs is an open source statistical logging library that allows data science and ML teams to profile ML/AI pipelines and applications with little effort, producing log files that can be used for monitoring, alerts, analytics, and error analysis. Whylogs is well suited to profiling production ML/AI pipelines that operate on TB-scale data and with enterprise SLAs. Key capabilities include:
- Data insight
- Unified data instrumentation
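To make the idea of statistical profiling concrete, here is a minimal sketch in plain Python. It is not the Whylogs API: the `profile_column` and `profile_batch` helpers, the chosen statistics and the sample batch are assumptions for illustration only.

```python
import math


def profile_column(values):
    """Summarise one column as a small statistical profile.

    None is treated as a missing value; all other values must be numeric.
    """
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
    }
    if present:
        mean = sum(present) / len(present)
        variance = sum((v - mean) ** 2 for v in present) / len(present)
        profile.update(
            min=min(present),
            max=max(present),
            mean=mean,
            stddev=math.sqrt(variance),
        )
    return profile


def profile_batch(rows):
    """Profile every column of a batch given as a list of dicts."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: profile_column(vals) for name, vals in columns.items()}


# Hypothetical batch of model inputs.
batch = [
    {"latency_ms": 120, "score": 0.9},
    {"latency_ms": 80, "score": None},
    {"latency_ms": 100, "score": 0.5},
]
profiles = profile_batch(batch)
print(profiles["latency_ms"])
```

Storing compact summaries like these, rather than the raw rows, is what lets profiles from successive batches be compared for monitoring and alerting.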
4. Jixinpu/Aiopstools (stars: 224)
This is a fundamental Python package that provides core AIOps capabilities. Features include:
- Anomaly detection
- Alarm convergence
- Time series forecasting
- Association analysis for alarms
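Alarm convergence, one of the features listed above, can be sketched as follows. This is not code from Aiopstools: the `ConvergedAlarm` class, the `converge_alarms` function, the 60-second window and the sample alarms are assumptions chosen to show the general idea of collapsing repeated alarms into one.

```python
from dataclasses import dataclass


@dataclass
class ConvergedAlarm:
    host: str
    kind: str
    first_seen: int
    last_seen: int
    count: int = 1


def converge_alarms(alarms, window=60):
    """Merge alarms that share (host, kind) and arrive within `window`
    seconds of the previous alarm in the same group.

    `alarms` is an iterable of (timestamp_seconds, host, kind) tuples.
    """
    converged = []
    open_groups = {}  # (host, kind) -> most recent ConvergedAlarm
    for timestamp, host, kind in sorted(alarms):
        key = (host, kind)
        current = open_groups.get(key)
        if current is not None and timestamp - current.last_seen <= window:
            # Same problem still firing: extend the existing alarm.
            current.last_seen = timestamp
            current.count += 1
        else:
            # New problem, or the old one went quiet for too long.
            current = ConvergedAlarm(host, kind, timestamp, timestamp)
            open_groups[key] = current
            converged.append(current)
    return converged


# Hypothetical raw alarms: (timestamp in seconds, host, alarm type).
raw_alarms = [
    (0, "web-1", "cpu_high"),
    (20, "web-1", "cpu_high"),
    (45, "web-1", "cpu_high"),
    (50, "db-1", "disk_full"),
    (300, "web-1", "cpu_high"),
]
result = converge_alarms(raw_alarms)
for alarm in result:
    print(alarm)
```

In this sample, five raw alarms collapse into three, so an on-call engineer sees one entry per incident instead of one per firing.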
5. AICoE/Log-anomaly-detector (stars: 168)
This is used for log anomaly detection, applying machine learning to detect abnormal event logs. Log anomaly detection (LAD) can connect to streaming sources and predict abnormal log lines. It uses unsupervised machine learning models to achieve this result. LAD-Core contains the ML code used to infer whether a log line is an anomaly. It uses W2V (word2vec) and SOM (self-organising map) with unsupervised machine learning. Grafana and Prometheus are used to visualise the health of the machine learning system, and can help track and prevent false positives in ML jobs.
Open source AIOps learning platforms
1. Tencent/Metis (stars: 1.1k)
Metis is a learnware platform in the field of AIOps. The current version of this open source learnware solves the anomaly detection problem of time series data from the perspective of machine learning.
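As a rough illustration of time series anomaly detection, the sketch below flags points that deviate strongly from a rolling window of recent values. It is a simple statistical baseline, not Metis's own method or code; the function name, window size, threshold and sample data are assumptions.

```python
import statistics


def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # Flat history: any change at all counts as a deviation.
            if series[i] != mean:
                anomalies.append(i)
            continue
        if abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response-time samples (ms) with one spike at index 8.
samples = [100, 102, 99, 101, 100, 98, 101, 100, 180, 101, 99]
found = rolling_zscore_anomalies(samples)
print(found)
```

A fixed threshold like this is easy to reason about but needs tuning per metric, which is one reason learned models are attractive for this problem.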
2. Linjinjin123/Awesome-AIOps (stars: 930)
This repository provides a summary of AIOps learning materials in one place.
3. Chenryn/Aiops-handbook (stars: 506)
This is a collection of slides, repositories and papers about AIOps.
4. Logpai/Awesome-log-analysis (stars: 287)
This platform offers a curated list of awesome publications and researchers on log analysis, anomaly detection, fault localisation and AIOps.
Open source contributions to AIOps
Prometheus: This is an open source monitoring solution. It is a graduated Cloud Native Computing Foundation (CNCF) project that focuses on monitoring for site reliability engineering (SRE). It simplifies pulling numerical metrics from a metrics endpoint.
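To show what numerical metrics pulled from a metrics endpoint look like, the sketch below parses a small payload in the Prometheus text exposition format. It is a simplified illustration, not the official client library: the `parse_metrics` function and the sample payload are assumptions, and the parser ignores details such as escaped characters in label values.

```python
import re

# Simplified pattern for one sample line: name, optional {labels}, value.
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)"
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse a simple subset of the Prometheus text exposition format.

    Returns a list of (name, labels_dict, float_value) tuples. Comment
    lines (# HELP, # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore lines this simplified parser cannot read.
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical payload, as a /metrics endpoint might return it.
PAYLOAD = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_cpu_seconds_total 12.5
"""

metrics = parse_metrics(PAYLOAD)
for name, labels, value in metrics:
    print(name, labels, value)
```

In practice the Prometheus server scrapes and stores these samples itself; a parser like this is only useful for understanding the format or for quick checks.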
Grafana: This is an open source metric analytics and visualisation suite popular among Prometheus users to visualise the metrics.
Elastic Stack: This is a suite of open source products from Elastic designed to help users search, analyse, and visualise data from any type of source, in any format, in real-time. Built around Elasticsearch, the Elastic Stack provides monitoring and logging solutions.
AI is the key to helping DevOps teams scale the technology created today and in the future. AIOps simplifies the management of IT operations and speeds up the resolution of IT Ops problems by automating it. It frees people to focus on innovating for a better customer experience, which in turn supports the profitability of the business.