Data in Process (Part 1)

Decisions on process data

Simon Zambrovski
Holisticon Consultants

--

Introduction

Control flow and data flow form the essence of business processes expressed in BPMN 2.x. Data is loaded during the execution of service tasks, provided as user input in user tasks, and possibly changed during event correlation. In general, process automation using a process engine relies on data accessible to the process instance to drive the execution. Modern process engines offer an API for accessing instance-specific data.

But data is not data… I would like to share some thoughts on data modeling, scoping, naming, loading strategies, serialization and implementation patterns on and around data in the Camunda BPM engine. In my opinion, it is advisable to think about these questions and make explicit decisions for every process application (or even standardize them across the organization).

For example, Camunda BPM stores all instance-specific data in so-called Process Variables (key-value pairs with the variable name as the key). Process Variables can be accessed via the Java API (provided by the RuntimeService) or using expressions from the BPMN model. The variables in Camunda are accessible in different scopes. The top-most scope is global, meaning that the variable is accessible everywhere in the process instance and has the same value everywhere. An execution (e.g. a branch) defines its own scope, and you may choose to set a variable in this local scope only (this maps to the concept of local variables in Java).
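To make the scoping idea more concrete, here is a minimal sketch in plain Java. The class VariableScope is a hypothetical stand-in invented for illustration and is not a Camunda class; it only models the lookup rule that a local scope sees its own variables plus those of the enclosing scopes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a variable scope; not part of the Camunda API.
class VariableScope {
    private final VariableScope parent; // null for the top-most (process instance) scope
    private final Map<String, Object> local = new HashMap<>();

    VariableScope(VariableScope parent) {
        this.parent = parent;
    }

    // Sets the variable in this scope only, shadowing any value from an outer scope.
    void setVariableLocal(String name, Object value) {
        local.put(name, value);
    }

    // Looks in this scope first, then walks up towards the top-most scope.
    Object getVariable(String name) {
        if (local.containsKey(name)) {
            return local.get(name);
        }
        return parent == null ? null : parent.getVariable(name);
    }
}
```

A variable set on the top-most scope is visible from a branch created below it, while a variable set locally on the branch is not visible from the top-most scope. In the engine itself, the RuntimeService methods setVariable and setVariableLocal play the corresponding roles.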

Process Data Loading Strategies

One of the first questions you should think about is “How much data do I want to load, and how often?”. This question addresses data that is stored in and available from some external storage, so we are not speaking about data generated and required by the process only. Two extremes are possible: references only and full.

If you decide to store references as process variables, you store only (immutable) foreign keys pointing to externally stored data records. The main idea of this approach is that your process engine loads data from the external source every time it needs it and always gets the current version. The footprint of your process variables is small. The disadvantage is that you need to pre-load data for every single process step, and that debugging and operation are difficult (since the data is not available in Cockpit and is not stored in the history). The advantage is that you avoid the redundancy, synchronization, privacy and other issues that result from copying data.
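The references-only strategy could be sketched as follows. This is plain Java under stated assumptions: a Map stands in for the process variables, and CustomerStore, ReferenceOnlyStep and the variable name customerId are hypothetical names chosen for illustration, not Camunda classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical external system of record; in practice a database or a remote service.
class CustomerStore {
    private final Map<String, String> emailById = new HashMap<>();

    void save(String customerId, String email) {
        emailById.put(customerId, email);
    }

    String loadEmail(String customerId) {
        return emailById.get(customerId);
    }
}

class ReferenceOnlyStep {
    // Each step resolves the reference again and therefore sees the current record.
    static String resolveRecipient(Map<String, Object> processVariables, CustomerStore store) {
        String customerId = (String) processVariables.get("customerId");
        return store.loadEmail(customerId);
    }
}
```

The process payload holds a single key, and a change in the external store is visible to the next step without any synchronization. The price is one lookup per step.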

The opposite approach is to load all data from external sources and store it in process variables. The main advantage is that the data is available for service delegates, user tasks, operation and debugging. The footprint is large, and the size of the history may become a problem. Synchronization and privacy can be challenging, since you are creating snapshots of the external source and storing them.
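For contrast, a minimal sketch of the full-loading strategy, again in plain Java with a Map standing in for the process variables. SnapshotLoading and the customer_ variable prefix are hypothetical names used only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class SnapshotLoading {
    // Copies every field of the external record into the process variables, once.
    static void loadSnapshot(Map<String, String> externalRecord, Map<String, Object> processVariables) {
        for (Map.Entry<String, String> field : externalRecord.entrySet()) {
            processVariables.put("customer_" + field.getKey(), field.getValue());
        }
    }
}
```

After loadSnapshot has run, a later change to the external record is not reflected in the process variables. That is exactly the synchronization issue described above: the process keeps working with the copy it took.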

Finally, you may decide to mix these two approaches. Data that does not change frequently may be loaded once, and other data may be loaded every time it is needed. This is perfectly fine; just make sure you have a clear understanding of when to do what.

Process Data Storage Strategies

Independently of how data is loaded from external sources, your application needs to store data in Process Variables. This can be done using either flat or rich variables.

Using flat variables means that you store primitive values only. If the data you want to store consists of multiple values, you store it in multiple process variables. Every value is accessible independently of the others, but you end up with many process variables.

Using rich variables allows you to store complex object graphs as process variables. You end up with fewer variables and more semantics in the process payload, but serialization may become an issue.
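The difference between the two storage styles can be sketched in plain Java. The Address type, the VariableShapes helper and all variable names are hypothetical and exist only for this illustration; a Map again stands in for the process variables.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical domain type used as a rich variable.
class Address {
    final String street;
    final String city;

    Address(String street, String city) {
        this.street = street;
        this.city = city;
    }
}

class VariableShapes {
    // Flat: one primitive-valued variable per field.
    static Map<String, Object> flat(Address address) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("addressStreet", address.street);
        variables.put("addressCity", address.city);
        return variables;
    }

    // Rich: one variable holding the whole object, which then has to be serialized.
    static Map<String, Object> rich(Address address) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("address", address);
        return variables;
    }
}
```

With the flat shape, each field can be read on its own and the variable count grows with the number of fields. With the rich shape, the payload stays compact and keeps its structure, but the stored value is no longer a primitive.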

Process Data Serialization

Camunda BPM provides support for a series of primitive types which can be stored directly as process variables. These are:

  • number (short, integer, long, double)
  • string
  • boolean
  • date
  • bytes
  • null

In addition, serialization of files and custom Java objects is supported (if those are Serializable). A better approach for complex objects is to use the Camunda library called SPIN, which allows you to use XML or JSON as the serialization format. In general, default serialization using the Java Serializable mechanism leads to tight coupling to the implementation and should not be the preferred solution.

The advantage of variables using primitive types is that no serialization issues can occur, at the cost of low information density. On the other hand, SPIN offers a pragmatic solution for the serialization of complex objects and object graphs.
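The coupling to the implementation can be made visible with a small stdlib-only sketch. It does not use SPIN; the Invoice type and the hand-written toJson method are hypothetical and serve only to contrast the two formats. A real application would use a JSON library instead of string concatenation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical payload type.
class Invoice implements Serializable {
    private static final long serialVersionUID = 1L;
    final String number;
    final long amountInCents;

    Invoice(String number, long amountInCents) {
        this.number = number;
        this.amountInCents = amountInCents;
    }
}

class SerializationFormats {
    // Java serialization: the resulting byte stream embeds the class name of the payload.
    static byte[] toJavaSerialized(Invoice invoice) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(invoice);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hand-written JSON for illustration only; special characters are not escaped.
    static String toJson(Invoice invoice) {
        return "{\"number\":\"" + invoice.number + "\",\"amountInCents\":" + invoice.amountInCents + "}";
    }
}
```

The Java-serialized bytes carry the class name, so renaming or moving the class affects the readability of stored values. The JSON text contains only field names and values and can be read without the original class.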

Summary

In this part, I introduced three independent questions around process data in Camunda BPM that you should keep in mind during application design. These are:

  • Loading strategy
  • Storage strategy
  • Serialization

To me, there is no silver bullet here; you have to match your requirements with the possible alternatives.

In the next part of this series, I will focus on Process Variable Access.

--

Simon Zambrovski
Holisticon Consultants

Senior IT-Consultant, BPM-Craftsman, Architect, Developer, Scrum Master, Writer, Coach