Robbe Sneyders
Principal ML Engineer
In June 2020, McKinsey published a report called “How to build a data architecture to drive innovation — today and tomorrow”. The report offers an excellent view of a modern, future-proof data architecture. In January 2021, McKinsey followed up with “Breaking through data-architecture gridlock to scale AI”, a new iteration that highlights the importance of data architecture for AI.
The article highlights the advances in data technology, along with the recommended building blocks and methodologies. Based on our experience, we offer our view on the elements it puts forward.
As an AI company, we like to keep infrastructure as simple and as managed as possible, so we prefer serverless cloud data lake/data warehouse services, as recommended in the report. This enables affordable scalability and flexibility with no or minimal infrastructure maintenance. Schema-on-read is a game-changer for semi-structured data.
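To make the schema-on-read idea concrete, here is a minimal sketch, assuming hypothetical JSON event files landed as-is in a bucket and PySpark as the processing engine: the schema is inferred when the data is read, not enforced when it is written.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Hypothetical raw-layer path: semi-structured JSON events stored exactly as they arrived.
events = spark.read.json("gs://my-bucket/raw/events/*.json")

# The schema is inferred at read time, so new or optional fields
# don't break ingestion; they simply show up as extra (nullable) columns.
events.printSchema()
events.where(events.event_type == "purchase").select("user_id", "event_ts").show()
```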
The focus on a domain-driven approach highlights that the end-to-end responsibility for the data belongs within the domain that generates it. Check out my blog post with a summary of the data mesh principles to learn more about shifting responsibilities to domains.
Organizing a data lake into multiple layers, with a raw layer (preferably immutable) and curated layers managed by the domain, is an excellent approach.
Starting from the curated layer, it’s easy to create new fit-for-purpose “data products”, in data mesh terms (or data marts for the Kimball generation). A good example: a data set for an analytics tool often has a different data model, or even file format, than the same data used for ML model training.
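As an illustrative sketch (the bucket paths and column names are made up, and the gs:// paths assume the environment can read them, e.g. via gcsfs), the same curated orders table can feed two different data products: a daily aggregate for a BI tool and a partitioned feature table for ML training.

```python
import pandas as pd

# Hypothetical curated-layer table, already cleaned and owned by the sales domain.
orders = pd.read_parquet("gs://my-bucket/curated/orders/")

# Data product 1: a daily revenue aggregate with the data model a BI tool expects.
daily_revenue = (
    orders.assign(order_date=orders["order_ts"].dt.date)
          .groupby(["order_date", "country"], as_index=False)["amount"]
          .sum()
)
daily_revenue.to_parquet("gs://my-bucket/products/analytics/daily_revenue.parquet")

# Data product 2: the same source data, reshaped and partitioned for ML training.
features = orders[["customer_id", "channel", "amount", "order_ts"]]
features.to_parquet("gs://my-bucket/products/ml/order_features/", partition_cols=["channel"])
```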
The move away from pre-integrated commercial solutions in favour of a well-picked tech stack is exactly what we do for customers. It’s essential to only invest in technologies that are crucial and unique to the growth strategy of your business.
A mix of SaaS, commercial and plenty of open-source tooling, backed by a large community, is more efficient and cost-effective than hitting the limits of traditional ETL software or (over)engineering your own data processing frameworks. We’re glad to see this data architecture reflected in several requests for proposals.
It is a great blueprint. Yet, in our opinion, two shifts require a bit more consideration.
It makes absolute sense to ingest the data in real time if:
In the majority of cases, we have seen that the business value of real-time processing in the data warehouse context is limited.
There are also alternatives to get to real-time insights:
In our data pipeline designs, we aim to process the data at the right time.
As data volumes increase, we increase the data ingestion rate to avoid long-running processes. In a modern job orchestrator, it’s easy to change the refresh rate, and data processing frameworks such as Apache Beam require minimal changes to switch to streaming or micro-batches if needed, as sketched below.
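A minimal Apache Beam sketch, with hypothetical paths and field names: the transforms stay identical, and moving to streaming mostly means flipping the streaming option and swapping the bounded source for an unbounded one such as Pub/Sub.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
# Flip to True (and swap the file source for e.g. Pub/Sub) to run the same transforms as a streaming job.
options.view_as(StandardOptions).streaming = False

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events/*.jsonl")  # hypothetical path
        | "Parse" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda event: (event["user_id"], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/products/user_counts")  # hypothetical path
    )
```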
And don’t forget that AI, for example advanced anomaly detection, offers plenty of opportunities to trigger only the relevant alerts.
McKinsey recommends decoupling using APIs managed by an API gateway.
This is a proven approach to decouple microservices or to provide data, with a well-defined application interface and schema, to one or more internal or external applications. A modern API gateway, such as Kong or Apigee, adds a security layer, observability and a documented, user-friendly, interactive developer portal.
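For illustration only, a minimal FastAPI sketch (hypothetical resource and fields) of the kind of well-defined interface and schema that sits behind such a gateway; the gateway then adds authentication, rate limiting, observability and the developer portal on top.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Customer API")  # typically exposed to consumers through the API gateway

class Customer(BaseModel):
    """Explicit schema: the contract consumers rely on and the gateway can document."""
    customer_id: str
    name: str
    country: str

@app.get("/customers/{customer_id}", response_model=Customer)
def get_customer(customer_id: str) -> Customer:
    # Hypothetical lookup; in practice this queries the domain's own data store.
    return Customer(customer_id=customer_id, name="Jane Doe", country="BE")
```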
APIs are, however, not recommended for large-scale data exchange.
Large-scale data processing frameworks and data science tooling are more efficient with a direct connection to the cloud data warehouse or to the data files in distributed storage.
Modern drivers and file formats, for example Apache Parquet or the BigQuery Storage API, support high-speed, zero-copy data access using Apache Arrow. This is far more efficient than a REST API with limited filter capabilities, a capped number of rows per request and slow JSON deserialization.
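A small sketch of that direct access path, assuming a hypothetical Parquet dataset in a bucket that pyarrow can reach: column pruning and predicate pushdown happen at the storage layer, and the result arrives as Arrow record batches rather than paginated JSON.

```python
import pyarrow.dataset as ds

# Hypothetical curated dataset in distributed storage.
dataset = ds.dataset("gs://my-bucket/curated/events/", format="parquet")

# Only the requested columns and matching rows are read,
# and the result is materialized as an Arrow table.
table = dataset.to_table(
    columns=["user_id", "event_type", "event_ts"],
    filter=ds.field("event_type") == "purchase",
)
df = table.to_pandas()  # hand over to data science tooling without a REST detour
```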
Furthermore, the majority of data visualization tools prefer, or only support, SQL-based drivers.
We advise spending enough time to analyse the required integrations and to make pragmatic design decisions.
It’s perfectly acceptable to process data in the data lake and export a subset of the data to Elasticsearch or an in-memory data store. Alerts can be published on a message or task queue, as in the sketch below. Integration with an internal or external REST API can be tricky if the API is not scalable enough to handle the number of requests modern data processing frameworks generate.
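As a sketch of that alerting pattern, assuming Google Cloud Pub/Sub and made-up project and topic names, a pipeline can publish only the relevant alerts to a topic that downstream consumers subscribe to at their own pace.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "anomaly-alerts")

def publish_alert(alert: dict) -> None:
    """Publish a single alert message; consumers pick it up asynchronously."""
    future = publisher.publish(topic_path, json.dumps(alert).encode("utf-8"))
    future.result()  # wait for the broker to acknowledge the message

publish_alert({"metric": "orders_per_minute", "value": 3, "severity": "high"})
```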
We would suggest moving the data catalog, mentioned in shift 5 of the report, to this section.
The data catalog should be the main entry point for business users, data scientists and application developers to discover the available data, check its lineage and see the options to access it. If you’d like to learn more about the pitfalls of microservices, check out this point of view.
We are looking forward to an update of the report this year, because the landscape for data and ML keeps evolving at a fast pace.
We’d love to hear your point of view. Do not hesitate to reach out in case you have any questions or suggestions.