Google Cloud Data Fusion — How ML6 can bridge the data gap.

11 April 2019, 14:35

Last year Google acquired Cask, the company behind the Cask Data Application Platform (CDAP). CDAP offers high-level APIs and a visual UI to build streaming and batch data pipelines and applications on top of Hadoop.
Google announced today that CDAP will be offered as a managed service in Google Cloud Platform: Google Cloud Data Fusion.
The open-source version remains the core of Google Cloud Data Fusion.

. . .

Cloud Data Fusion is a great addition to the growing number of data tools available on Google Cloud Platform. Using Google Cloud Data Fusion, we at ML6 can bridge the gap between code based data transformation tools such as Google Cloud Dataflow and more traditional UI based ETL and data integration tools.
Google Cloud Data Fusion batch and streaming pipelines are executed on Google Cloud Dataproc. Since CDAP has been in production use for more than five years, it runs on various Hadoop distributions, on premises as well as on Azure, AWS and of course Google Cloud. This will be valuable for hybrid-cloud scenarios.

On the 28th and 29th of March 2019, I attended the Cloud Data Fusion training by the CDAP team for EMEA partners.

These are my main highlights.

The concepts (source, sink, transform, …) are familiar building blocks for users of ETL tools, so the learning curve to build a scalable, serverless data pipeline is low.

The “Wrangler” transformation is the Swiss Army knife of Google Cloud Data Fusion. It offers an efficient, familiar visual interface to transform the data column by column. Extra functionality can be added using custom-developed “directives”.
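Under the hood, the point-and-click steps in Wrangler are recorded as a text recipe of directives. As an illustrative sketch (the column names are hypothetical; the directives themselves come from the Wrangler directive set), a recipe to clean a CSV of bike-share availability could look like:

```
parse-as-csv body , true
drop body
rename body_1 station_id
rename body_2 bikes_available
fill-null-or-empty bikes_available 0
```

Because the recipe is plain text, it can be versioned and reused across pipelines alongside the visual interface.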

Scala, JavaScript and Python transformation steps are available, each of which can emit multiple output records per input record.
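To give a feel for these scripted steps, here is a sketch of a Python transform that fans one input record out into several output records. The `transform(record, emitter, context)` signature follows CDAP's Python evaluator convention; the `Emitter` class and the field names are only stand-ins so the function can be exercised locally.

```python
def transform(record, emitter, context):
    # Emit one output record per comma-separated tag on the input record.
    for tag in record.get("tags", "").split(","):
        tag = tag.strip()
        if tag:
            emitter.emit({"id": record["id"], "tag": tag})


class Emitter:
    """Minimal local stand-in for the emitter object a CDAP transform receives."""

    def __init__(self):
        self.records = []

    def emit(self, record):
        self.records.append(record)


emitter = Emitter()
transform({"id": 1, "tags": "bike, share , availability"}, emitter, context=None)
print(emitter.records)
```

Inside Data Fusion only the `transform` function body would be pasted into the plugin; the harness around it is for local experimentation.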

Excellent support for flat and nested data. Schemas are automatically detected or can be manually defined using Avro schemas.
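For example, a nested station record could be described with an Avro schema like the following (the record and field names are hypothetical; the structure follows the standard Avro schema format):

```json
{
  "type": "record",
  "name": "station",
  "fields": [
    {"name": "station_id", "type": "string"},
    {"name": "location", "type": {
      "type": "record",
      "name": "location",
      "fields": [
        {"name": "lat", "type": "double"},
        {"name": "lon", "type": "double"}
      ]
    }}
  ]
}
```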

Google Cloud Data Fusion pipelines are run as Google Cloud Dataproc jobs so you don’t need to manage any infrastructure.

One of the main benefits of the data pipelines is automatic data lineage and data preview. That’s extremely important in regulated or complex data integration environments.

The data pipelines can be parameterized using macros.
Since all configuration is JSON-based in the background it’s easy to import/export schemas, directives etc.
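To sketch what that looks like, a plugin property in the exported pipeline JSON can reference macros with the `${...}` syntax; `logicalStartTime` is a macro function provided by CDAP, while `bucket` here is a hypothetical runtime argument:

```json
{
  "name": "GCSSource",
  "properties": {
    "path": "gs://${bucket}/input/${logicalStartTime(yyyy-MM-dd)}/*.json"
  }
}
```

The macro values are resolved at runtime, so the same exported pipeline can be promoted across environments by supplying different runtime arguments.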

A wide range of plug-ins, directives and predefined data pipelines are already available in the marketplace, called the “Hub”.

I would like to get a better view of Google Cloud Dataproc sizing and auto-scaling.

The support for windows and late arriving data is more advanced in Google Cloud Dataflow/Apache Beam.

Since most data pipelines are very specific, make sure to pick the right tool for the job. Combining tools will become easier with Google Cloud Composer, since a Google Cloud Data Fusion hook/operator is on the roadmap.

If you want to experiment with Google Cloud Data Fusion, you can launch it in GCP. The first 120 hours of the basic edition are free each month.
It’s also easy to set up the CDAP sandbox on your local machine (make sure you use a 64-bit Java JRE), or use the prebuilt VM image or Docker container.

In an upcoming blog post, we will build a Google Cloud Data Fusion pipeline that combines bike-share availability data in JSON Lines format with master data in JSON.