How I keep a model of 400 elements and 1200 relations correct while it constantly mutates

maksim aniskov

Hi all!

I want to share my story about how I deal with a model representing a real, complex data processing setup that constantly mutates. I have spent the past two months, as a part-time assignment, building the model and the automation around it. Now I have solid proof that my setup works flawlessly and saves me and the team a ton of time and effort every day. I'm very proud of it!

My situation is that there is a team of approximately ten data engineers and devops engineers who are busy full-time improving or modifying the setup. The setup consists of a whole bunch of Apache NiFi dataflows running on several AWS EC2 instances. They pull data from several dozen external APIs, then fan processed data out to downstream consumers as well as data storage services. The whole setup is in the middle of migrating from one AWS region to another. The APIs live their own lives, sometimes changing their endpoint location, or getting decommissioned. Or a new API appears.

How do I make sure that my architecture diagrams always correctly represent all that pretty complex "reality"? And how do I manage to sleep well, spending just 10 minutes a day fixing drift when it gets detected?

The keyword here is "drift detection". Let me describe my toolchain.

Data harvesting. I use two sources to fetch all necessary data about "the real world": AWS CLI output and NiFi flow definition JSON files.
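
To illustrate the AWS side, here is a hedged sketch of the kind of call involved; the region, the Name tag and the file name are my assumptions for this example, not necessarily what runs in production.

    # Sketch: dump EC2 instances as tab-separated text.
    # Region, tag and output file name are assumptions for illustration.
    aws ec2 describe-instances \
      --region eu-west-1 \
      --query 'Reservations[].Instances[].[InstanceId, PrivateIpAddress, Tags[?Key==`Name`]|[0].Value]' \
      --output text > ec2-instances.tsv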

Data preparation. Before feeding the data to the model validation scripts, I prefer to transform it into comma- or tab-separated text files. Making the AWS CLI produce such output is not a problem at all. Retrieving the necessary data from the NiFi flow definitions is the trickiest part of the challenge. The JSON files that the data harvesting script retrieves from the servers are around two million lines in total, if pretty-printed. I chose jq to do all the heavy lifting: looping through objects, looking up NiFi process group and processor descriptions, and matching parameter substitutions in expressions against values in context definitions. What jq outputs are two comma-separated files. One maps NiFi root process group names to the IDs of the AWS EC2 instances the groups run on. The second lists all the data sources and data sinks the process groups have in the configuration of their processors. First of all, I'm interested in knowing what kind of database, data streaming service or API each NiFi processor's peer is, and what its URL, address or identifier is. I want to know all the ins and outs of the data processing setup under analysis.
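
To give an idea of what that looks like, here is a hedged sketch of such a jq filter. The field names (name, type, properties) reflect my understanding of the NiFi flow definition format, and the file names are made up; the real filters are considerably more involved.

    # Sketch: emit every processor's URL-looking properties as CSV rows of
    # processor name, processor type, property value. Field names assumed.
    jq -r '
      [.. | objects | select(.type? and .properties?)][]
      | . as $p
      | ($p.properties // {}) | to_entries[]
      | select(.key | test("url|endpoint|host|topic"; "i"))
      | [$p.name, $p.type, .value] | @csv
    ' flow-definition.json > sources-and-sinks.csv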

Validating model correctness. For matching my ArchiMate model against the "real world" data, I employ archi-powertools-verifier. The tool leverages Docker Compose to orchestrate Archi, Neo4j and SQLite containers.
The basic idea is pretty straightforward. A model validation batch does the following.
  • Convert the content of the ArchiMate model to a set of comma-separated files;
  • Run an SQL database engine (SQLite) or a graph database engine (Neo4j);
  • Make it import the model files plus any additional data we produced in the data preparation step;
  • Run the set of SQL or CQL (Neo4j) queries I wrote. The queries compare what is in the tables (or nodes and relationships, in the Neo4j case) containing the model against what is in those containing the "real world" data. Read this for more details and examples; a sketch of one such query follows right after this list.
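
To make that concrete, here is a hedged sketch of one such CQL check, run against the Neo4j container. The labels and properties (RealSource for a harvested row, Element for a model element), the connection details and the credentials are all placeholders I invented for this post, not the verifier's actual schema.

    # Hypothetical invocation; connection details depend on the compose setup.
    cypher-shell -a bolt://localhost:7687 -u neo4j -p password <<'CQL'
    // Drift check: every data source harvested from NiFi must have a
    // same-named element in the ArchiMate model.
    MATCH (r:RealSource)
    OPTIONAL MATCH (e:Element {name: r.name})
    WITH r, e WHERE e IS NULL
    RETURN 'Drift detected! Data source missing from model: ' + r.name AS warning;
    CQL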

So, once a day my automation runs the data harvesting, data preparation and model validation scripts. If my SQL and CQL scripts detect any discrepancy between the architecture model and the AWS infrastructure or the NiFi flow definitions, I get Drift detected! lines in the output, accompanied by problem descriptions and resource names or IDs. To bring the architecture model back to a state where it correctly represents the NiFi flows, I edit the model manually in Archi. Before committing my changes to git, I validate the model one more time. When there are no more drift warnings in the output, I'm ready to push the updated model to the repository to make it available to the other collaborators.

Repeating this process routinely, I have already collected statistics: our NiFi setup gets 2 to 4 changes each week that need mirroring in the model. When drift is detected, it normally takes 10 or 15 minutes to edit, re-validate, commit and push the changes. And I'd say those few minutes are nothing, considering that they are the entire cost of guaranteeing that the system architecture model is fully correct 100% of the time.
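
In case it helps anybody, the daily automation itself is nothing fancy. A minimal sketch, assuming the three stage scripts exist under these made-up names and that the validation script always exits 0, reporting drift only in its output:

    #!/usr/bin/env bash
    # Hypothetical daily driver: harvest, prepare, validate, shout on drift.
    set -euo pipefail

    ./harvest.sh      # AWS CLI calls + downloading NiFi flow definitions
    ./prepare.sh      # jq transforms into comma/tab-separated files
    ./validate.sh | tee validation.log   # assumed to exit 0 even on drift

    if grep -q 'Drift detected!' validation.log; then
      echo 'Model drift found; see validation.log' >&2
      exit 1   # non-zero exit lets cron/CI surface the failure
    fi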