Robbe Sneyders
Principal ML Engineer
In June 2020, McKinsey published a report called “How to build a data architecture to drive innovation — today and tomorrow”. The report offers an excellent view of a modern, future-proof data architecture. In January 2021, McKinsey followed up with “Breaking through data-architecture gridlock to scale AI”, a new iteration that highlights the importance of data architecture for AI.
The article highlights the advances in data technology, along with the recommended building blocks and methodologies. Based on our experience, we offer our view on the elements it puts forward.
As an AI company, we like to keep infrastructure as simple and as managed as possible, so we prefer serverless cloud data lake/data warehouse services, as recommended in the report. This enables affordable scalability and flexibility with no or minimal infrastructure maintenance. Schema-on-read is a game-changer for semi-structured data.
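To make the schema-on-read idea concrete, here is a minimal sketch, assuming hypothetical JSON event files landed as-is in a bucket and PySpark as the processing engine: the schema is inferred when the data is read, not enforced when it is written.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Hypothetical raw-layer path: semi-structured JSON events stored exactly as they arrived.
events = spark.read.json("gs://my-bucket/raw/events/*.json")

# The schema is inferred at read time, so new or optional fields
# don't break ingestion; they simply show up as extra (nullable) columns.
events.printSchema()
events.where(events.event_type == "purchase").select("user_id", "event_ts").show()
```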
The focus on a domain-driven approach highlights that the end-to-end responsibility for the data belongs within the domain that generates it. Check out my blog post with a summary of the data mesh principles to learn more about shifting responsibilities to domains.
Organizing a data lake into multiple layers, with a raw layer (preferably immutable) and curated layers managed by the domain, is an excellent approach.
Starting from the curated layer, it’s easy to create new fit-for-purpose “data products”, in data mesh terms (or data marts for the Kimball generation). A good example: a data set for an analytics tool often has a different data model, or even file format, than the same data used for ML model training.
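As an illustrative sketch (the bucket paths and column names are made up, and the gs:// paths assume the environment can read them, e.g. via gcsfs), the same curated orders table can feed two different data products: a daily aggregate for a BI tool and a partitioned feature table for ML training.

```python
import pandas as pd

# Hypothetical curated-layer table, already cleaned and owned by the sales domain.
orders = pd.read_parquet("gs://my-bucket/curated/orders/")

# Data product 1: a daily revenue aggregate with the data model a BI tool expects.
daily_revenue = (
    orders.assign(order_date=orders["order_ts"].dt.date)
          .groupby(["order_date", "country"], as_index=False)["amount"]
          .sum()
)
daily_revenue.to_parquet("gs://my-bucket/products/analytics/daily_revenue.parquet")

# Data product 2: the same source data, reshaped and partitioned for ML training.
features = orders[["customer_id", "channel", "amount", "order_ts"]]
features.to_parquet("gs://my-bucket/products/ml/order_features/", partition_cols=["channel"])
```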
The move away from pre-integrated commercial solutions in favour of a well-picked tech stack is exactly what we do for customers. It’s essential to only invest in technologies that are crucial and unique to the growth strategy of your business.
A mix of SaaS, commercial and plenty of open-source tooling, backed by a large community, is more efficient and cost-effective than hitting the limits of traditional ETL software or (over)engineering your own data processing frameworks. We’re glad to see this data architecture reflected in several requests for proposals.
It is a great blueprint. Yet, in our opinion, two shifts require a bit more consideration.
It makes absolute sense to ingest the data in real time if:
In the majority of cases, we have seen that the business value of real-time processing in the data warehouse context is limited.
There are also alternatives to get to real-time insights:
In our data pipeline designs, we aim to process the data at the right time.
As data volumes increase, we increase the data ingestion rate to avoid long-running processes. In a modern job orchestrator, it’s easy to change the refresh rate, and data processing frameworks such as Apache Beam require minimal changes to switch to streaming or micro-batches if needed, as sketched below.
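A minimal Apache Beam sketch, with hypothetical paths and field names: the transforms stay identical, and moving to streaming mostly means flipping the streaming option and swapping the bounded source for an unbounded one such as Pub/Sub.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
# Flip to True (and swap the file source for e.g. Pub/Sub) to run the same transforms as a streaming job.
options.view_as(StandardOptions).streaming = False

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events/*.jsonl")  # hypothetical path
        | "Parse" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda event: (event["user_id"], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/products/user_counts")  # hypothetical path
    )
```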
And don’t forget that AI, for example advanced anomaly detection, offers plenty of opportunities to trigger only the relevant alerts.
McKinsey recommends decoupling using APIs managed by an API gateway.
This is a proven approach to decouple microservices or to provide data, with a well-defined application interface and schema, to one or more internal or external applications. A modern API gateway, such as Kong or Apigee, adds a security layer, observability and a documented, user-friendly, interactive developer portal.
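For illustration only, a minimal FastAPI sketch (hypothetical resource and fields) of the kind of well-defined interface and schema that sits behind such a gateway; the gateway then adds authentication, rate limiting, observability and the developer portal on top.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Customer API")  # typically exposed to consumers through the API gateway

class Customer(BaseModel):
    """Explicit schema: the contract consumers rely on and the gateway can document."""
    customer_id: str
    name: str
    country: str

@app.get("/customers/{customer_id}", response_model=Customer)
def get_customer(customer_id: str) -> Customer:
    # Hypothetical lookup; in practice this queries the domain's own data store.
    return Customer(customer_id=customer_id, name="Jane Doe", country="BE")
```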
APIs are, however, not recommended for large-scale data exchange.
Large-scale data processing frameworks and data science tooling are more efficient with a direct connection to the cloud data warehouse or to the data files in distributed storage.
Modern drivers and file formats, for example Apache Parquet or the BigQuery Storage API, support high-speed, zero-copy data access using Apache Arrow. This is far more efficient than a REST API with limited filter capabilities, a capped number of rows per request and slow JSON deserialization.
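A small sketch of that direct access path, assuming a hypothetical Parquet dataset in a bucket that pyarrow can reach: column pruning and predicate pushdown happen at the storage layer, and the result arrives as Arrow record batches rather than paginated JSON.

```python
import pyarrow.dataset as ds

# Hypothetical curated dataset in distributed storage.
dataset = ds.dataset("gs://my-bucket/curated/events/", format="parquet")

# Only the requested columns and matching rows are read,
# and the result is materialized as an Arrow table.
table = dataset.to_table(
    columns=["user_id", "event_type", "event_ts"],
    filter=ds.field("event_type") == "purchase",
)
df = table.to_pandas()  # hand over to data science tooling without a REST detour
```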
Furthermore, the majority of data visualization tools prefer, or only support, SQL-based drivers.
We advise spending enough time to analyse the required integrations and to make pragmatic design decisions.
It’s perfectly acceptable to process data in the data lake and export a subset of the data to Elasticsearch or an in-memory data store. Alerts can be published on a message or task queue, as in the sketch below. Integration with an internal or external REST API can be tricky if the API is not scalable enough to handle the number of requests modern data processing frameworks generate.
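As a sketch of that alerting pattern, assuming Google Cloud Pub/Sub and made-up project and topic names, a pipeline can publish only the relevant alerts to a topic that downstream consumers subscribe to at their own pace.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "anomaly-alerts")

def publish_alert(alert: dict) -> None:
    """Publish a single alert message; consumers pick it up asynchronously."""
    future = publisher.publish(topic_path, json.dumps(alert).encode("utf-8"))
    future.result()  # wait for the broker to acknowledge the message

publish_alert({"metric": "orders_per_minute", "value": 3, "severity": "high"})
```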
We would suggest moving the data catalog, mentioned in shift 5 of the report, to this section.
The data catalog should be the main entry point for business users, data scientists and application developers to discover the available data, check its lineage and see the options to access it. If you’d like to learn more about the pitfalls of microservices, check out this point of view.
We are looking forward to an update of the report this year, because the landscape for data and ML keeps evolving at a fast pace.
We’d love to hear your point of view. Do not hesitate to reach out in case you have any questions or suggestions.