Enabling the Data Analyst: Google Cloud Welcomes Dataform

Google Cloud recently announced its acquisition of Dataform, a platform that empowers analysts to manage the entire flow of data within the warehouse using a single, unified workflow. Dataform is a powerful tool for transforming data: reshaping, cleaning, and aggregating it. Now that Dataform is part of Google Cloud, it is available for free. Google is offering a product that can truly change the way companies, large and small, work with data, making the process efficient and painless.


Where Data Teams Come Together

Traditional data ecosystems used an ETL approach – Extract, Transform, Load. In that approach, data cleansing and transformation – the ‘T’ – happened as the middle step, where the compute power was available. Only the reduced dataset made it to the final warehouse for analysis.

Dataform helps data analysts work independently of data engineers. It flips the model around, promoting an ELT approach: extraction and load remain with the data engineers, while the post-processing SQL transformation moves to the analysts – all with the rigor of enterprise-level data and pipeline code management.

Modern cloud technologies and tooling, with low cost storage and MPP architectures, enable a different approach. The data engineers focus on the extraction and load into the data warehouse, while the analysts focus on the transformation afterwards. ETL becomes ELT, and the analysts’ influence and expertise is unleashed.

This is what Google Cloud, with Dataform and BigQuery, is offering for your enterprise. Dataform enables your entire data team to collaboratively develop, test, and share the data that the business needs to make decisions. Within your BigQuery data warehouse, Dataform enables analysts to manage all data processes, turning raw data into the clean datasets you need for analytics and business intelligence. And it does this with the same kind of rigor you expect of any production infrastructure development and coding process – version control, code reviews, testing, reusability, documentation, scheduling, and alerting.

Dataform – Under the Hood


Dataform has several features that help us manage our data and make our work more efficient:

  1. Define tables and views to be created in the target warehouse
  2. Document tables and views alongside their definitions
  3. Test data quality by defining assertions
  4. Reuse code across multiple scripts
  5. Run other arbitrary SQL queries and SQL operations
  6. Take snapshots of data
  7. Model data using ready-made SQL packages
  8. Version-control code within the platform
  9. Schedule, alert on, and log pipeline runs
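Several of these features come together in a single SQLX file. The sketch below is illustrative (the table, source, and column names are hypothetical): it defines a table, documents it, and attaches data-quality assertions in the `config` block, with the transformation itself written as plain SQL.

```sqlx
config {
  type: "table",
  description: "Daily order totals per customer",
  assertions: {
    nonNull: ["customer_id"],
    uniqueKey: ["customer_id", "order_date"]
  }
}

SELECT
  customer_id,
  DATE(order_timestamp) AS order_date,
  SUM(order_total) AS daily_total
FROM
  ${ref("raw_orders")}
GROUP BY 1, 2
```

The `${ref("raw_orders")}` reference both resolves to the fully qualified table name at compile time and registers a dependency on that source.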

SQLX in Dataform

SQLX is an open-source extension of SQL used throughout Dataform, providing multiple advantages:

  1. Plain SQL offers no code reuse, whereas SQLX code can be reused across different scripts, operations, and projects.
  2. Managing dependencies is hard in plain SQL, and many teams write thousands of lines of queries just to ensure data is processed in a specific order. SQLX lets you build a dependency tree of all actions to be run and ensures tables and views are created in the right order.
  3. You can also embed JavaScript in a .sqlx file.
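As a sketch of the last two points, the hypothetical file below uses `${ref(...)}` to declare a dependency on an upstream table and a `js` block to define a reusable constant that is interpolated into the SQL:

```sqlx
js {
  // Reusable constant, defined once and interpolated into the query below.
  const lookbackDays = 7;
}

config { type: "view" }

SELECT *
FROM ${ref("daily_totals")}
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL ${lookbackDays} DAY)
```

Because of the `ref()` call, Dataform knows this view must be created after `daily_totals`, with no manual ordering required.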

The Code

A Dataform project is a repository containing configuration files, SQLX files and, sometimes, JavaScript files. The components are:

Config files: These configure the Dataform project, including the warehouse type, schema, scheduling, packaging, and environments.

Definitions: SQLX files that define new tables, views, assertions, and other SQL operations.

Includes: JavaScript files, if required by the project, where variables and functions are defined.
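For example, the project-level configuration lives in a dataform.json file at the repository root. A minimal sketch for a BigQuery project (the project and schema names are placeholders) might look like:

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "my-gcp-project",
  "defaultSchema": "dataform",
  "assertionSchema": "dataform_assertions"
}
```

Here `defaultDatabase` is the GCP project ID, `defaultSchema` is the BigQuery dataset where tables are created, and `assertionSchema` is where assertion results land.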

The Process:

  1. Develop in SQLX
  2. Dataform compiles the project in real time
  3. Dataform connects to the data warehouse to run the dependency tree, or a subset of it
  4. Tables are created or updated in the data warehouse
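With the open-source Dataform CLI installed, this compile-and-run cycle maps onto a few commands. This is a sketch; the action name is hypothetical:

```shell
# Compile the project and report any errors in the dependency tree.
dataform compile

# Run the full dependency tree against the warehouse.
dataform run

# Run a single action plus everything it depends on.
dataform run --actions daily_totals --include-deps
```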

Dataform Integration with BigQuery

Getting Started

Dataform connects with BigQuery using a service account with appropriate permissions. The JSON key for that service account must be uploaded to Dataform.

The Dataform site offers several login options, including Google, Microsoft, GitHub, Okta, or an email address. Once logged in, you will be prompted to upload the JSON key mentioned above to grant the appropriate permissions. Dataform's documentation provides a detailed, step-by-step guide to getting started.

If you need help or guidance on using Dataform and BigQuery as part of your larger data strategy, contact our experts to set up a time to talk.

May 5th, 2021
DATA ANALYTICS

