Fast and Efficient Process Mining with SberPM python library

Aleksandr Korekov
5 min read · Apr 28, 2021

Process mining has become popular over the past decade. It is an approach to discovering, monitoring, and optimizing business processes.

As a link between Data Mining and Business Process Management, it takes process analysis to a whole new level.

Many companies use process mining to stay competitive, and Sber is no exception. In December 2020, it open-sourced SberPM, the first Python library in Russia for comprehensive analysis of business processes.

Now let’s see how to get started with SberPM.

DataHolder

Process mining in SberPM starts with the DataHolder that loads, preprocesses and stores an event log.

Each event is marked by a case ID, an activity, and one or more timestamps. Other event attributes, such as user, resource, location, or free text, can also be listed. Although they are optional, they can help you get deeper insights into the process. All key attributes are stored as metadata in the DataHolder and can be accessed easily.

DataHolder is the basic object passed into most of SberPM’s algorithms. So, let’s load some data:

Here we loaded the ‘example.csv’ log and set the key attributes: id_column (case ID), activity_column (activity), and start_timestamp_column (timestamp).
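To make the three key attributes concrete, here is a minimal plain-Python sketch of what an event log with a case ID, an activity, and a start timestamp looks like once it is grouped into cases. It does not use SberPM and is not how DataHolder is implemented. The log contents, the column names (case_id, activity, start_time), and the helper load_traces are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical log contents; the column names are illustrative only.
RAW_LOG = """case_id,activity,start_time
1,Create order,2021-01-01 09:00:00
1,Approve order,2021-01-01 09:30:00
2,Create order,2021-01-01 10:00:00
1,Ship order,2021-01-02 12:00:00
2,Reject order,2021-01-01 10:45:00
"""


def load_traces(text, id_column, activity_column, timestamp_column):
    """Group events by case ID and order each case by its timestamps."""
    events_by_case = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        ts = datetime.strptime(row[timestamp_column], "%Y-%m-%d %H:%M:%S")
        events_by_case[row[id_column]].append((ts, row[activity_column]))
    # Sorting the (timestamp, activity) pairs restores the order of execution.
    return {
        case: [activity for _, activity in sorted(events)]
        for case, events in events_by_case.items()
    }


traces = load_traces(RAW_LOG, "case_id", "activity", "start_time")
for case, activities in traces.items():
    print(case, activities)
```

Each value in traces is the ordered list of activities for one case, which is the shape most process mining algorithms work from.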

DataHolder is also equipped with methods for basic operations on the event log. For instance, you can calculate the execution time of each activity or group the data by case ID and specified columns.
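As a rough illustration of the first operation, the sketch below estimates activity durations in plain Python. It assumes that only start timestamps are available, so an activity is taken to last until the next event of the same case starts, and the final event of each case gets no duration. That rule, the sample events, and the function name activity_durations are assumptions made for this example, not SberPM's behavior.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (case ID, activity, start timestamp).
EVENTS = [
    ("1", "Create order", "2021-01-01 09:00:00"),
    ("1", "Approve order", "2021-01-01 09:30:00"),
    ("1", "Ship order", "2021-01-01 11:30:00"),
    ("2", "Create order", "2021-01-01 10:00:00"),
    ("2", "Approve order", "2021-01-01 11:30:00"),
]


def activity_durations(events):
    """Collect one duration per event, measured up to the next event of the case."""
    by_case = defaultdict(list)
    for case, activity, ts in events:
        start = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        by_case[case].append((start, activity))

    durations = defaultdict(list)
    for case_events in by_case.values():
        case_events.sort()
        # Pair every event with its successor; the last event has none.
        for (start, activity), (next_start, _) in zip(case_events, case_events[1:]):
            durations[activity].append(next_start - start)
    return durations


durations = activity_durations(EVENTS)
mean_minutes = {
    activity: sum(d.total_seconds() for d in values) / len(values) / 60
    for activity, values in durations.items()
}
print(mean_minutes)
```

If the log also carries end timestamps, the duration is simply end minus start and no assumption about the next event is needed.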

Most of the library’s algorithms require DataHolder as a data provider.

Miners and Graphs

Event logs available in corporate information systems contain data for each performed process step. By extracting knowledge from them, it is possible to create a model of the actual process flow.

SberPM offers several algorithms — referred to as miners — that discover a process graph from event data (with some assumptions):

  • SimpleMiner — draws everything found in the log (all nodes and edges)
  • CausalMiner — draws only direct links (no return links are displayed)
  • HeuMiner — removes edges with probability of occurrence less than a given threshold (the bigger the threshold, the fewer edges are depicted)
  • AlphaMiner — draws a graph as a workflow net taking into account causal, parallel, and independent relations
  • AlphaPlusMiner — Alpha Miner upgrade, handles one-loop cases (in contrast to Alpha Miner)
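The first and third ideas in the list can be sketched in a few lines of plain Python: count every directly-follows pair found in the log, then drop the rare ones. This is only an illustration of the concept, not SberPM's code. In particular, the edge probability used here (the share of a node's outgoing transitions) is an assumption, and the library may define it differently. The traces and function names are hypothetical.

```python
from collections import Counter

# Hypothetical traces: each list is the ordered activities of one case.
TRACES = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
]


def directly_follows(traces):
    """Count how often each activity is immediately followed by another."""
    edges = Counter()
    for trace in traces:
        edges.update(zip(trace, trace[1:]))
    return edges


def filter_edges(edges, threshold):
    """Keep edges whose share of the source's outgoing transitions >= threshold."""
    outgoing = Counter()
    for (source, _), count in edges.items():
        outgoing[source] += count
    return {
        edge: count
        for edge, count in edges.items()
        if count / outgoing[edge[0]] >= threshold
    }


all_edges = directly_follows(TRACES)          # everything found in the log
frequent_edges = filter_edges(all_edges, 0.5)  # only the common transitions
print(sorted(frequent_edges))
```

Raising the threshold removes more edges, which is the trade-off between a complete but cluttered graph and a readable one.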

Each algorithm results in a process model that you can turn into a visual workflow with the help of graphviz:

In the part marked #Miner, the graph is created with the Heuristic Miner; it is then visualized with graphviz.
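For readers unfamiliar with graphviz, the sketch below shows what "visualizing with graphviz" amounts to at the lowest level: describing nodes and edges in the DOT language. It uses only the standard library and is not SberPM's painter. The edge data and the helper names quote and to_dot are hypothetical.

```python
# Hypothetical process graph: (source, target) -> number of transitions.
EDGES = {
    ("Create order", "Approve order"): 3,
    ("Approve order", "Ship order"): 2,
    ("Approve order", "Reject order"): 1,
}


def quote(name):
    """Wrap a node name in double quotes, escaping embedded quotes."""
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(edges):
    """Render the edges as a left-to-right directed graph in DOT syntax."""
    lines = ["digraph process {", "    rankdir=LR;"]
    for (source, target), count in sorted(edges.items()):
        lines.append(f'    {quote(source)} -> {quote(target)} [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(EDGES)
print(dot_source)
```

Saving the printed text to a file such as process.dot and running `dot -Tpng process.dot -o process.png` with Graphviz installed produces the picture.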

A workflow net produced by the Alpha or Alpha+ Miner can be exported as a BPMN (Business Process Model and Notation) diagram:

By analyzing the data-driven process graph, you can spot the difference between ‘as-is’ and ‘should-be’ process flows, i.e. detect deviations, bottlenecks, and inefficiencies. Let’s compare graphs constructed by different miners for the same process:

So, CausalMiner allows you to display the process in a linear way, HeuMiner shows the most common chains, and the AlphaMiner clearly demonstrates the parallel sections of the process.
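The parallel sections that AlphaMiner highlights come from the causal, parallel, and independent relations mentioned earlier. The sketch below derives those relations from directly-follows pairs, following the textbook Alpha algorithm footprint. It is a simplified illustration in plain Python and stops short of building the workflow net. The traces and the function name footprint are hypothetical, and nothing here is taken from SberPM's implementation.

```python
from itertools import product

# Hypothetical traces in which B and C can occur in either order.
TRACES = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
]


def footprint(traces):
    """Classify every ordered pair of activities by its ordering relation."""
    follows = {pair for trace in traces for pair in zip(trace, trace[1:])}
    activities = sorted({activity for trace in traces for activity in trace})

    relations = {}
    for a, b in product(activities, repeat=2):
        forward = (a, b) in follows
        backward = (b, a) in follows
        if forward and backward:
            relations[a, b] = "||"   # parallel: seen in both orders
        elif forward:
            relations[a, b] = "->"   # causal: a is followed by b, never the reverse
        elif backward:
            relations[a, b] = "<-"   # reverse causal
        else:
            relations[a, b] = "#"    # independent: never adjacent
    return relations


relations = footprint(TRACES)
print(relations["A", "B"], relations["B", "C"], relations["A", "D"])
```

Here B and C are classified as parallel because the log contains them in both orders, which is the pattern a workflow net draws as concurrent branches.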

Metrics

Process mining is not limited to process discovery. An important part of the analysis is performance mining, that is, assessing and monitoring key process performance indicators, or simply metrics.

The metrics module is the basis for this, and it offers the following classes:

  • ActivityMetric — metrics aggregation over unique activities
  • TransitionMetric — metrics aggregation over unique transitions (e.g. transition from activity A to B)
  • IdMetric — metrics grouped by cases
  • TraceMetric — metrics aggregation for unique traces (trace is a ‘complete’ process, while ‘transition’ is any subprocess)
  • UserMetric — metrics aggregation by users
  • TokenReplay (fitness) — shows how well the process is reconstructed

All of these classes except the last one calculate the number of occurrences, the number of unique IDs / activities / users, loop percentage, time metrics (min, max, mean, median, etc.), and other statistics.

Example of how the UserMetric class works:
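As an illustration of the kind of aggregation involved, the plain-Python sketch below groups events by user and reports the number of events, unique case IDs, and unique activities for each. It is not the UserMetric API and covers only a small part of what the text says the class computes. The events, the user names, and the function user_metrics are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (case ID, activity, user).
EVENTS = [
    ("1", "Create order", "alice"),
    ("1", "Approve order", "bob"),
    ("1", "Ship order", "carol"),
    ("2", "Create order", "alice"),
    ("2", "Approve order", "bob"),
    ("2", "Approve order", "bob"),   # a repeated approval in the same case
    ("3", "Create order", "carol"),
]


def user_metrics(events):
    """Aggregate simple counts per user."""
    raw = defaultdict(lambda: {"count": 0, "ids": set(), "activities": set()})
    for case, activity, user in events:
        raw[user]["count"] += 1
        raw[user]["ids"].add(case)
        raw[user]["activities"].add(activity)
    return {
        user: {
            "count": stats["count"],
            "unique_ids": len(stats["ids"]),
            "unique_activities": len(stats["activities"]),
        }
        for user, stats in raw.items()
    }


metrics = user_metrics(EVENTS)
for user, stats in sorted(metrics.items()):
    print(user, stats)
```

Grouping by activity, transition, or trace instead of user follows the same pattern with a different key.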

One of the advantages of this module is the speed of calculations.

Let’s say we want to find the most frequent traces and their average duration. The pandas solution takes 5 minutes and more than 10 lines of code, while the SberPM solution takes 1 minute and 3 lines of code.
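For reference, the task itself can be sketched with the standard library alone. The example below is neither the pandas nor the SberPM solution from the comparison and makes no claim about speed. It assumes that a trace's duration is the time from the first to the last event start of a case. The events and the function name trace_stats are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (case ID, activity, start timestamp).
EVENTS = [
    ("1", "A", "2021-01-01 09:00:00"),
    ("1", "B", "2021-01-01 10:00:00"),
    ("2", "A", "2021-01-01 09:00:00"),
    ("2", "B", "2021-01-01 11:00:00"),
    ("3", "A", "2021-01-01 09:00:00"),
    ("3", "C", "2021-01-01 09:30:00"),
]


def trace_stats(events):
    """Return (trace, count, mean duration in seconds), most frequent first."""
    cases = defaultdict(list)
    for case, activity, ts in events:
        cases[case].append((datetime.fromisoformat(ts), activity))

    totals = defaultdict(lambda: [0, 0.0])  # trace -> [count, total seconds]
    for case_events in cases.values():
        case_events.sort()
        trace = tuple(activity for _, activity in case_events)
        duration = (case_events[-1][0] - case_events[0][0]).total_seconds()
        totals[trace][0] += 1
        totals[trace][1] += duration

    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
    return [(trace, count, total / count) for trace, (count, total) in ranked]


ranked_traces = trace_stats(EVENTS)
for trace, count, mean_seconds in ranked_traces:
    print(" -> ".join(trace), count, mean_seconds)
```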

In addition, the library provides an interface for placing metrics on the process graph. You can do this as follows:

Thus, the visualization capabilities help to derive fact-based insights into the business process on all levels. In particular, it is possible to identify time delays, loops, inefficient performers, as well as workarounds and exceptions in the processes that can significantly affect business performance.

Next steps: Machine Learning in Process Mining

In addition to the classic process mining techniques, SberPM provides machine learning algorithms for process mining tasks. For now, you can use process vectorization and clustering, as well as a module for automated search of insights. We will cover this functionality in the next article.

We hope that this post was helpful for you and made your process mining side of life a bit easier :)

Contacts

Feel free to reach us by email: Aleksandr Korekov (av.korekov@gmail.com) and Danil Smetanev (smetanev.danil@gmail.com), or directly on GitHub.
