In data analytics, there are a lot of nice and shiny buzzwords, products and concepts. Before you decide on anything, you should be clear about your current and future needs. Your analytics infrastructure should enable you to analyze data. But different architectures support different aspects to different degrees, and you have to make trade-offs. There is no free lunch.

Here are some explanations that should help you figure out what you need and enable you to compare different approaches.

Data Location

It’s important where the data is located. You can leave the data on the source system and read it for each analysis, or you can copy the data to an analytics system.

Pros of leaving and reading are:

  • You will save space.
  • You will always have the most recent data.

Pros of copy are:

  • You may get fast analyses, since you can optimize the storage for analytics.
  • You maintain fast operations, since you don’t read data that the source system wants to use at the same time.
  • You get consistent results, since you control the updates.
  • You can implement historization and avoid losing information.

Usually the data is copied, but if the volume is large enough or you don’t have the resources, you may want to leave it on the source.

Data Structure

How your data is structured, or whether it is structured at all, has a significant impact on your analytics infrastructure. There is structured data, semi-structured data and unstructured data. Structured data lives, for example, in relational databases. Semi-structured data contains information on how to separate values and identify its structure. Examples of tabular semi-structured data are simple Excel sheets or CSV files; examples of hierarchical semi-structured data are XML or JSON, which are often used by web services and APIs. Examples of unstructured data are images, PDFs or plain text.

Analytics is all about reducing information so that it is consumable by humans or by processes that humans create and understand. Reduction requires structured data. Semi-structured data can be transformed into structured data. Unstructured data may be transformable into structured data, but not always, not easily and not error-free. Avoid Excel, PDF or plain text as data sources whenever possible.
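
To make this concrete, here is a minimal PowerShell sketch that flattens hierarchical, semi-structured JSON into a structured, tabular form. The file name orders.json and its fields are made up for illustration.

    # Flatten semi-structured JSON (e.g. from a web service) into a tabular CSV.
    # 'orders.json' and its fields are hypothetical.
    $orders = Get-Content -Path 'orders.json' -Raw | ConvertFrom-Json

    $orders |
        Select-Object -Property Id, CustomerId,
            @{ Name = 'OrderDate'; Expression = { [datetime]$_.OrderDate } },
            @{ Name = 'Total';     Expression = { [decimal]$_.Total } } |
        Export-Csv -Path 'orders.csv' -NoTypeInformation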

Data Transformation

Data from different sources may be hard to combine, since there may be no common identifier or the formats may differ. There are again two approaches: "Schema on read" and "Schema on write". "Schema on read" means you leave the data in its raw form and transform it when you analyze it. "Schema on write" means you transform the data into a common format when you write it: you may change the format and the data types, normalize it and deduplicate it (a small sketch follows the lists below).

Pros of "Schema on read" are:

  • You may save effort on integrating new data.
  • The raw data may contain more information than the integrated data.

Pros of "Schema on write" are:

  • Analyses using integrated data require less effort, both in development and in computation.
  • Investment in quality pays off more, due to the reuse of transformed data.
  • Preaggregated data will speed up analyses.
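
Here is a minimal "Schema on write" sketch in PowerShell, assuming made-up file and column names: on write, the source rows are converted to common data types and formats and deduplicated.

    # Hypothetical example: transform source rows into a common schema on write.
    Import-Csv -Path 'source_system.csv' |
        ForEach-Object {
            [pscustomobject]@{
                Id       = [int]$_.Id
                Customer = $_.Customer.Trim().ToUpper()               # normalize the identifier
                Date     = ([datetime]$_.Date).ToString('yyyy-MM-dd') # common date format
                Amount   = [decimal]$_.Amount                         # common data type
            }
        } |
        Sort-Object -Property Id -Unique |                            # deduplicate
        Export-Csv -Path 'analytics_store.csv' -NoTypeInformation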

Data Volume

It’s hard to say what counts as a big data volume, but it should have a big influence on your individual solution. In analytics systems, performance is often bought with redundancy: results that are used several times or needed with low latency are precalculated and stored.
So one piece of information is stored many times in different ways. That is efficient for performance but not for storage.
Obviously this may become a problem with a lot of data.

As a rule of thumb, automatically generated data, e.g. from sensors or logging functions, may come in high volumes; manually generated data, like orders in your ERP system or master data, usually does not.

Data Velocity

Data velocity means how much time elapses from data generation to analysis. High velocity may come with other restrictions or increased effort.

Most common are scenarios with updates on a daily basis. For regulatory supervision it may be enough to update your data once a quarter.

An often misused term in this context is real-time. If a system is real-time capable, it guarantees a result within a specified time. That’s important, for example, in embedded systems in automotive or industrial environments. In a business context, real-time usually means best-effort latency and is only necessary in special cases. Imagine a manager who makes decisions based on reports that are updated and changed every 5 minutes: that carries the risk of reacting to random events instead of pursuing a strategy. It may be different in a cloud-based application scenario, where you want to scale the system up or down based on usage, or in an e-commerce scenario, where a process changes prices based on recent sales.

Conclusion

You see: if someone tries to sell you something without listening to your requirements, there is a good chance you end up with something that does not deliver what you need or that could have been accomplished with less effort.

As IT consultants, we try to solve problems on a daily basis. This is our normal workload, our daily business. But it is not our only duty: we need to keep up with technical evolution and learn continuously to satisfy our customers. This is why we read about new things in blogs, visit meetups in our free time and go to conferences (like the one that inspired this post: "down to earth architecture" by Uwe Friedrichsen @SAS 2019 in Munich). We are influenced by all these channels and need to be careful how we use that knowledge in our working environments, or we end up with one of these stereotypical types of bad software architecture:

Stackoverflow architecture (or google-driven architecture)

We have a problem to overcome in our software system and are not familiar with the topic. Therefore, we search the internet for books, blogs or tutorials. We find a slightly related solved problem on Stackoverflow and copy the solution without much thinking.
We are not talking about copy-pasting code here, but rather about abstract solutions like "where should IDs be generated in CQRS". We do not want to downplay the knowledge found on Stackoverflow, but we should make sure the solution we found actually fits the problem and/or adapt it accordingly.

Conference-driven architecture

Whatever conferences you visit, you always feel attached to your track or topic. These could be things like micro-services, domain-driven design or EventSourcing. While these are very good solutions to their respective problems, they might solve problems you aren’t even encountering in your domain, or there might be other good solutions.
Additionally, most of the time we are not starting an application from scratch. If we visit conferences regularly and always incorporate the hot topics, we end up with a mess after some time.

Hype-driven architecture

Similar to the conference-driven architecture, we find the hype-driven architecture. Every (new) application needs to be distributed into micro-services. Of course, that’s not true. There are huge benefits in following a micro-service (or SCS) approach, but there are also challenges, constraints and problems! Learning and especially applying a framework is often useful. However, you should not force a framework onto your system if there is no need for it! Most of the time, learning how to solve your domain’s problems (e.g. how to handle consistency in distributed systems or mastering personal data and GDPR) is more beneficial than being a master of a framework.

Strategic architecture (aka PowerPoint architecture)

Usually, when you join a project, there are some PowerPoint slides describing the architecture of the system or application. You go through these, but your colleagues advise against doing so: "these are for compliance" or "we made this for the latest steering committee". When the slides diverge too much from the actual structure or code, misunderstandings are bound to happen! While there are reasons to display different aspects of your software to different stakeholders, try to minimize this.

Tunnel-vision architecture

As a software engineer or architect, you need to work on some topics in excessive detail. We need to build walls around us and analyze problems in-depth! Occasionally, we need to look around, too. With more experience, we learn to balance the extremes. Especially for younger developers, there is a risk of over-engineering one detail or creating problems on other ends of the system.

Blast-from-the-past architecture

Technology advances, business models evolve and the underlying software architecture needs to do so, too. There are challenges that a lot of software components face; an example is the versioning of web APIs. A versioning concept for system-to-system APIs with /v1/, /v2/, /v3/ might work for applications that had a release once a month and a breaking change once a year, but it probably won’t work for a fast-paced API in an API economy where time-to-market is a driving factor.

Big design up front

In a world with perfect information, where all user needs and every aspect of your system are clear, Big Design Up Front (BDUF) could work. BDUF is closely related to the waterfall approach of developing software. This clashes with the agile world. Similar to communism and capitalism, BDUF and agile development are two paradigms where neither is inherently bad or good – it’s just that one is more practical in real life. Especially in a fast-moving world where innovation is key, agile development won the battle and there is no place for BDUF architecture.

One-size-fits-all architecture

Develop your application as a polyglot, domain-driven micro-service architecture with CQRS and EventSourcing. Use Kubernetes as container orchestrator with Helm for deployment, Prometheus and Grafana for monitoring and Git as source control system. The frontend is Angular, machine learning is done in Python, and we use Mongo and Cassandra for persistence. Caching is done through Redis, and the whole application needs to be cloud agnostic and conform to all cloud-native principles. While this is a noble approach and a turn-on for software engineers, it might not suit our business needs in any way. We could solve many problems with this technology selection, but we are likely over-engineering and not optimizing our efforts.

Accidental architecture

Remember the cone of uncertainty? When you start developing a product, almost everything is blurry. You don’t know the user needs, you don’t know the scale of your application, and so on. At this stage, you might not be able to find solutions to some problems because you cannot answer essential questions, and you need to act accordingly! Work with interfaces, adapters and libraries that can easily be switched later, or don’t put too much effort into some components, as you will either replace them later or implement a more sophisticated version anyway.
Don’t just "do it" or you will end up with a mess of decisions that nobody wanted to make. Another way accidental architecture happens is when the development team is unaware of key issues or too inexperienced to identify them.

How do we make sure we don’t end up with one of these? I’ll look for a more detailed answer in another article, but it boils down to this: we should ask why we need architecture in the first place.
We have requirements, constraints, problems etc. We figure out solutions (for example with an approach like "orient – explore – evaluate – support" from Uwe Friedrichsen). When we follow this path, we protect our systems from the types of bad architecture above. And if we still end up with an architecture that is similar to one of the ones above, that’s fine: we engineered it with the right intentions. Additionally, learn when not to use certain solutions and follow Uwe Friedrichsen’s advice:

  • Think holistically
  • Resist hyper-specialization
  • Get a T-shape profile
  • Leave your comfort zone once in a while
  • Understand your domain
  • Don’t fall for hypes
  • Cope with technology explosion
  • Master the foundation design
  • Don’t overact

I recently had the task of automating a program with a COM interface and integrating it into a database application.
I had already used PowerShell to automate Docker, SqlPackage and other tools.
So my first thought was to use PowerShell in this case too, but due to the complexity of the task I decided against it.
I ended up with a C# solution of roughly 300 classes, and I’m happy with it, but it brought me to the question of what good criteria are for choosing between PowerShell and something else.

Basically, PowerShell is a nice hammer, but not every problem is a nail.
And since almost every programming language is Turing complete, you can solve every problem in every language, but each has its pros and cons.
This is especially true for .NET-based languages like C#, F# and PowerShell, since they share the same libraries.
So you can develop a nice graphical user interface using Windows Presentation Foundation in PowerShell, even though it was originally designed for C# applications.

There are many problems out there that you can solve in PowerShell with less code than in other languages, which makes it faster to write and easier to maintain.
But now I will show you some cases where PowerShell is a little painful and other languages are a better choice.

Inheritance and Polymorphism

PowerShell is object-oriented, and where there is object orientation, polymorphism is not far away. So you define interfaces and maybe several implementations of them.
But since PowerShell is an interpreted language, there is no type or interface checking before runtime.
By default, everything is of type Object, and you only find out whether a method is available when you execute the code.
You can assert types, but you don’t have to.
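
A small sketch of what that means in practice: without a type assertion, a missing member only fails at runtime; with an assertion, PowerShell validates the type immediately.

    # No type assertion: everything is Object until the code runs.
    $thing = Get-Process -Id $PID
    $thing.DoesNotExist()             # only fails here, at runtime

    # Optional type assertion: the cast fails immediately if the type does not match.
    [System.Diagnostics.Process]$proc = Get-Process -Id $PID
    [datetime]$start = $proc.StartTime

    # Types can also be asserted on function parameters.
    function Get-UptimeMinutes([System.Diagnostics.Process]$Process) {
        return ((Get-Date) - $Process.StartTime).TotalMinutes
    }
    Get-UptimeMinutes -Process $proc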

There are different ways to get new objects in PowerShell.
Often they are created by Commandlets that are written in C#, like Get-Process or New-Item.
Another common option is to create a custom object using New-Object -TypeName PsCustomObject -Property @{ 'Foo' = 'Bar' }.
That creates a generic object that can be extended with any property or method at runtime.
Another option is to create the object in PowerShell but write the class definition in C#.
You can do that from existing .NET libraries or even at runtime: store the C# code in a variable and add the classes with Add-Type.
Those were the options up to PowerShell version 4; in version 5, native classes were introduced.
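
To make the options concrete, here is a small sketch with made-up type and property names, showing a hashtable, a PsCustomObject, a class compiled from C# code via Add-Type, and a native PowerShell 5 class.

    # 1. Plain hashtable
    $hash = @{ Name = 'Foo'; Size = 1 }

    # 2. Generic custom object that can be extended at runtime
    $custom = New-Object -TypeName PsCustomObject -Property @{ Name = 'Foo'; Size = 1 }
    Add-Member -InputObject $custom -MemberType NoteProperty -Name Owner -Value 'Bar'

    # 3. Class defined as C# code and compiled at runtime with Add-Type
    Add-Type -TypeDefinition 'public class CsItem { public string Name; public int Size; }'
    $csItem = New-Object -TypeName CsItem
    $csItem.Name = 'Foo'

    # 4. Native PowerShell class (available since version 5)
    class PsItem {
        [string]$Name
        [int]$Size
    }
    $psItem = [PsItem]::new()
    $psItem.Name = 'Foo'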

All of these methods have their reasons.
Let me explain using some questions:

  • Why would you want to create a PowerShell class if you can use a PsCustomObject?
  • Why would you want to create a PsCustomObject if you can use a hashtable?

Commandlets are the default if you want to use existing PowerShell modules.
Hashtables are the default if you need custom attributes in an object.
But if you want to pass data to existing Commandlets, for example to write it to CSV files, then the easiest way is to use a PsCustomObject.
If you write your own functions that require parameters with specific properties and methods, then it’s better to define a class that can be easily validated.
The next level of complexity is reached when you write a function whose parameters may be of one type or another that share the same interface.
Then you start thinking about abstract methods and the reuse of code between these classes.
Here C# supports more expressions to simplify the code, and compile-time validation improves the quality.
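
A sketch of those two defaults, with made-up data: a PsCustomObject integrates smoothly with existing Commandlets like Export-Csv, while a class-typed parameter lets PowerShell validate the input for you.

    # PsCustomObject: the easiest way to feed data into existing Commandlets.
    $rows = 1..3 | ForEach-Object {
        [pscustomobject]@{ Id = $_; Name = "Item$_" }
    }
    $rows | Export-Csv -Path 'items.csv' -NoTypeInformation

    # Class: custom data with a shape that function parameters can validate.
    class Item {
        [int]$Id
        [string]$Name
    }

    function Add-Item([Item]$Item) {
        # Only objects convertible to [Item] are accepted here.
        Write-Host "Adding $($Item.Id): $($Item.Name)"
    }

    Add-Item ([Item]@{ Id = 4; Name = 'Item4' })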

So if you start to create inherited classes in PowerShell, you have probably gone too far.
Maybe it’s better to create a PowerShell module in C#, or a DLL in C#, and include it in your PowerShell code.
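
For example, a compiled C# library can be loaded with Add-Type and then used like any other .NET type; the DLL path and type names here are hypothetical.

    # Load a compiled C# assembly (hypothetical path) and use its classes.
    Add-Type -Path 'C:\Tools\MyCompany.Inventory.dll'

    $repository = New-Object -TypeName MyCompany.Inventory.ItemRepository
    $items = $repository.GetAll()

    # Alternatively, package the C# project as a binary PowerShell module
    # and load it with Import-Module to expose its Commandlets directly.
    Import-Module 'C:\Tools\MyCompany.Inventory.psd1'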

Concurrency and Parallel Computing

Since PowerShell supports using the System.Threading library of .NET, you can do multicore computation in PowerShell.
In some cases this is not even a bad idea.
A common use case for PowerShell is the automation and integration of other tools, for example running a compiler or calling a web service. These tools may produce output that you want to process while the tool is still running.
Sometimes you even have to, because otherwise the output buffer overflows and you don’t get the entire output.
In that case you can define a PowerShell script block as a variable and register it as an event handler.
But there are other cases where you have several parallel processes that need to be synchronized and that may communicate with each other. Then C# or F# offers better expressions to manage asynchronous calls.
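
A minimal sketch of that event-handler pattern, using ping as a stand-in for any long-running tool: the script block stored in $handler processes each output line while the process is still running.

    # Start an external tool and process its output while it is still running,
    # so the standard output buffer cannot overflow.
    $psi = New-Object System.Diagnostics.ProcessStartInfo
    $psi.FileName = 'ping'
    $psi.Arguments = 'localhost -n 5'      # stands in for any long-running tool
    $psi.RedirectStandardOutput = $true
    $psi.UseShellExecute = $false

    $process = New-Object System.Diagnostics.Process
    $process.StartInfo = $psi

    # Define a script block as a variable and register it as the event handler.
    $handler = { if ($EventArgs.Data) { Write-Host "OUT: $($EventArgs.Data)" } }
    Register-ObjectEvent -InputObject $process -EventName OutputDataReceived -Action $handler | Out-Null

    $process.Start() | Out-Null
    $process.BeginOutputReadLine()         # asynchronous reading of standard output
    $process.WaitForExit()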