Motadata Docs

What is AIOps?

What can a few minutes of unplanned downtime cost a business? Lost revenue? A damaged brand? Regulatory action? Quite possibly all of the above. So, let us talk about how artificial intelligence for IT operations, also known as AIOps, can help a business prevent this.

Before we go into the definition of AIOps, let us consider a scenario. Imagine you are an IT operations professional at a successful FMCG company that sells deodorants. At this company, you support a portfolio of applications, one of which is an invoicing application used by tens of thousands of partners every day.

For this specific application, your focus is to make sure it’s up and that partners can consistently deliver invoices through it. One day you settle in at your desk, get a cup of coffee, and then the phone rings. Out of nowhere, a sales representative is calling to complain that a partner cannot upload an invoice because the application is down. What would you do in that scenario to get the application back up and running?

Now, before we continue with that scenario, let’s talk about AIOps and its textbook definition. AIOps is the application of artificial intelligence, machine learning models, and advanced analytics to IT operational data. The objective is to give IT and operations professionals the data they need to make decisions and, ultimately, to resolve incidents and restore service to an application faster. So, with that definition in mind, let’s talk about how we can get this invoicing application back up and running.

"AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination."

Gartner

Now, let’s think about the key data sources for our invoicing application. They include events, metrics, logs, and alerts. There are a few others, but for now let’s focus on these. In the real world, these data sources might look different depending on your application architecture, the types of data your application produces, and the regulatory requirements that apply to your data.

Now, how do these key data sources fit into our model for AIOps?

Let’s think about three key steps:

  1. Discover and Monitor: We are going to call the first step ‘Discover and Monitor’. In this step, the AIOps platform ingests the data, which allows it to create baselines and thresholds for your specific application.

    Let’s think about it this way. What is normal for my invoicing application? What is the log ingestion rate? How many errors are acceptable based on our service level objective?

  2. Creating Context: The next step of the process is ‘Creating Context’. This is where AIOps really comes into the picture. It takes all the ingested data and surfaces it to an IT operations professional or site reliability engineer (SRE) through a collaboration solution.

    Up until now, everything has happened in the background. As soon as an incident occurs in our invoicing application, it becomes visible to the SRE.

    Now they have context on where the incident is located in the application, what specific actions are recommended to resolve it, and, most importantly, how those recommendations are based on similar incidents that have come up in the past.

  3. Act and Automate: Now that the SRE is armed with the relevant information, we come to the last phase, which is ‘Act and Automate’. The suggested options for resolution are surfaced to the SRE or ITOps professional, who can then act and automate to resolve the issue.

    The suggested options available to the SRE enable them to select what has worked in the past. With just a single click, they can activate a script or runbook to resolve this issue as soon as it’s detected.

This gets the invoicing application back up and running faster and keeps our partners happier with the experience. We have now covered the overview, the three key steps, and the types of data that AIOps ingests.
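The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular AIOps platform works: the baseline is simply the mean of recent errors per minute plus three standard deviations, similar incidents are ranked by Jaccard overlap of symptom labels, and the `PastIncident` record, the symptom names, and the runbook names are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional


@dataclass
class PastIncident:
    """A previously resolved incident (hypothetical record shape)."""
    symptoms: frozenset
    runbook: str


def build_baseline(samples: list, k: float = 3.0) -> tuple:
    """Step 1 (Discover and Monitor): learn 'normal' and derive an upper threshold."""
    mu = mean(samples)
    return mu, mu + k * stdev(samples)


def is_anomalous(value: float, threshold: float) -> bool:
    """Flag a reading that exceeds the learned threshold."""
    return value > threshold


def rank_similar(symptoms: set, history: list) -> list:
    """Step 2 (Creating Context): rank past incidents by Jaccard similarity."""
    scored = []
    for incident in history:
        union = symptoms | incident.symptoms
        score = len(symptoms & incident.symptoms) / len(union) if union else 0.0
        scored.append((score, incident))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def suggest_runbook(symptoms: set, history: list, min_score: float = 0.5) -> Optional[str]:
    """Step 3 (Act and Automate): propose the runbook of the closest past incident."""
    ranked = rank_similar(symptoms, history)
    if ranked and ranked[0][0] >= min_score:
        return ranked[0][1].runbook
    return None  # nothing similar enough; leave the decision to the SRE


# Illustrative data only: errors per minute observed for the invoicing application.
errors_per_minute = [2, 3, 1, 2, 4, 3, 2, 3]
baseline, threshold = build_baseline(errors_per_minute)

history = [
    PastIncident(frozenset({"http_500", "db_connection_timeout"}), "restart_db_connection_pool"),
    PastIncident(frozenset({"disk_full"}), "rotate_logs"),
]

current_errors = 40
if is_anomalous(current_errors, threshold):
    current_symptoms = {"http_500", "db_connection_timeout", "upload_failed"}
    print("Suggested runbook:", suggest_runbook(current_symptoms, history))
```

The sketch only suggests a runbook; whether to run it stays with the SRE, which matches the single-click flow described above.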

In summary, AIOps allows IT professionals to solve problems faster, predict problems before they occur, keep applications up and running, and help protect the business in the long run.

The ‘AI’ in AIOps does not mean that human operators will be replaced by automated systems. Instead, humans and the AIOps platform operate together, with the AI and ML algorithms augmenting human capabilities and enabling DevOps, SRE, and IT Ops teams to focus on what is meaningful.

AIOps is an emerging space and its definition is still fluid, but some of its core elements are as follows:

  • Machine Learning
  • Performance Baselining
  • Anomaly Detection
  • Automated Root Cause Analysis
  • Predictive Insights
  • Intelligent Alerting
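
As one small illustration of the alerting element, related alerts can be grouped so that an operator sees one incident instead of a burst of notifications. The sketch below is a deliberately simple stand-in: it groups alerts for the same service that arrive within a fixed time window, whereas a real AIOps platform would typically use learned correlations. The `Alert` shape, the service names, and the 120-second window are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str      # hypothetical service label
    message: str


def correlate(alerts: list, window_s: float = 120.0) -> list:
    """Group alerts per service when each arrives within window_s of the previous one."""
    groups = []
    open_group = {}  # service -> the most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_s:
            group.append(alert)  # same burst: fold into the existing group
        else:
            group = [alert]      # new burst: start a new group
            groups.append(group)
            open_group[alert.service] = group
    return groups


# Illustrative alerts only.
alerts = [
    Alert(0, "invoicing", "HTTP 500 rate high"),
    Alert(30, "invoicing", "upload latency high"),
    Alert(45, "billing-db", "connection pool exhausted"),
    Alert(600, "invoicing", "HTTP 500 rate high"),
]

for group in correlate(alerts):
    print(group[0].service, [a.message for a in group])
```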
