The Five W’s of Spatial Data Science and Spatial Data Analytics

Spatial Data Science is a term we use frequently at Maven Wave to describe the use of spatial data, algorithms, and analytical methods applied in concert with more traditional machine learning and deep learning techniques. While GIS software packages (e.g., ArcGIS, QGIS, Google Earth Pro, ERDAS, etc.) provide the foundational frameworks for explaining the who, what, when, and where via geospatial analysis and visualization, they do not adequately investigate or explain why and how on their own.

These basic questions of data analytics are often referred to as the Five W’s (the fact that “how” begins with an H notwithstanding). In many of Maven Wave’s Data Science, Data Analytics, Custom AppDev, and IoT initiatives, exploration of these questions is critical. By combining Data Science and GIS as information science disciplines, we have a complete analytical representation of, and a framework for, the exploration of the Five W’s (and the one H).

In a previous post, we examined the use of Isolines as a geospatial concept as they relate to Spatial Data Science. In this post, we will explore a few more important geospatial and cloud concepts that are also extremely valuable tools in our Spatial Data Science toolbelt. 

Geospatial Data Ingestion and Enrichment Pipelines

When talking about Geospatial Data Enrichment Pipelines, the conversation is more about cloud scale than about any particular GIS or Data Science concept. But these pipelines are the foundational entry point, so they are a sensible place to start.

As you begin your Spatial Data Science journey, you will first need a firm understanding of your business use case and exploratory objective, as well as knowledge of the available source data (i.e., the public and private data that you will need to support your efforts).

Unless you are building a PoC of some sort, chances are you will need to wrangle, orchestrate, and persist large volumes of disparate data: data that often has no georeference, or data that arrives in varied projections.

To facilitate this, we require cloud scale. Google Cloud excels as a cloud provider in its data stack tooling, as well as its broad embrace of open-source, data engineering-focused projects. This, coupled with Google’s feature-rich Location Based Services (LBS) APIs and Earth Engine, is why Google Cloud is the go-to cloud provider for any geospatial data project.

A typical project will make wide use of the following:

  • Google Cloud Storage (GCS)
  • Cloud Composer (Airflow)
  • BigQuery or Snowflake
  • AI Platform
  • Google’s rich ecosystem of LBS APIs
  • BI tooling such as Looker or Data Studio

Depending on your use case, other relevant tooling may include:

  • Google Wear
  • Cloud IoT Core
  • Cloud Pub/Sub
  • Dataflow
  • Kubeflow
  • AutoML

For reference, here is a simple representation of a minimal Spatial Data Science project stack:

[Figure: a minimal Spatial Data Science project stack]

Much of that stack is fungible, but at the very least, you need data storage and a proper data orchestration layer before you can do anything.

For source data, your use case should make fairly clear which means of persistence will be required. If you are working with large volumes of orthoimagery, for instance, you should carefully consider your storage requirements.

For orchestration, Cloud Composer (Airflow) is an ideal solution. With Spatial Data Science efforts, we typically ingest data from a broad range of sources and formats, each with its own workflow and scheduling requirements. Airflow’s Directed Acyclic Graphs (DAGs) are an ideal means of performing tasks such as georeferencing and coordinate system projection/reprojection, and of executing geospatial algorithms such as getIntersecting() queries, all within the data ingestion pipelines themselves.

Put simply, geospatial operations that are first developed in a GIS platform such as QGIS or within exploratory Jupyter Notebooks can be easily operationalized as Python in Airflow DAGs.
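As a rough illustration of the kind of logic that might live inside such a pipeline task, here is a minimal, dependency-free sketch. It reprojects longitude/latitude pairs (EPSG:4326) to Web Mercator (EPSG:3857) using the standard spherical formula and then flags each record against a service area with a simple bounding-box intersection test. The function names, record fields, and the service area are all hypothetical, the bounding-box check is only a stand-in for a true geometry intersection query, and the Airflow wiring itself is omitted; in practice you would more likely reach for a library such as pyproj or GeoPandas.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, as used by Web Mercator


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Reproject a lon/lat pair (EPSG:4326) to Web Mercator metres (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def bbox_intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def enrich_records(records, service_area):
    """Hypothetical task body: reproject each record and flag it against an area.

    `records` is a list of dicts with "lon" and "lat" keys (assumed schema);
    `service_area` is a bounding box already expressed in Web Mercator metres.
    """
    enriched = []
    for rec in records:
        x, y = wgs84_to_web_mercator(rec["lon"], rec["lat"])
        point_box = (x, y, x, y)  # treat the point as a zero-size bounding box
        enriched.append({
            **rec,
            "x": x,
            "y": y,
            "in_service_area": bbox_intersects(point_box, service_area),
        })
    return enriched
```

A function like enrich_records could then be handed to an Airflow PythonOperator as its python_callable, so the same code that was prototyped in a notebook runs on a schedule inside the ingestion DAG.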

The orchestration layer can additionally enforce the Terms of Service around data enrichment and persistence that exist for any third-party data source or API (such as Google’s Maps API Terms of Service). Airflow DAGs are quite possibly the perfect solution for such tasks. However, if your use case centers on telemetry and/or IoT workloads, Dataflow can be easily coupled with Airflow as the primary orchestration layer. The point here is that cloud-scale data persistence and rich data/model orchestration are essential.

Geospatial Clustering

Now that we have the data in our hands, we can begin our Spatial Data Science in earnest. Let’s now take stock of a few Geospatial concepts that are broadly applicable.

Clustering is an unsupervised machine learning task that involves automatically discovering natural groupings in data. With clustering, we reason that data points within the same group should have similar properties and/or features. Conversely, data points in different groups should have highly dissimilar properties and/or features.

Understanding spatial dependencies between observation points or polygons/polylines is a perfect use case for clustering algorithms. DBSCAN and K-Means are examples of algorithms that are well suited to Geospatial Clustering.
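To make the idea concrete, here is a minimal, dependency-free sketch of DBSCAN applied to latitude/longitude points, using the haversine (great-circle) distance as the neighbourhood metric. The sample coordinates, the 1 km radius, and the minimum of three points per cluster are assumptions chosen purely for illustration; a production workload would more likely use a library implementation with a spatial index rather than this brute-force neighbour search.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # 6371 km: mean Earth radius


def dbscan(points, eps_km, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force search; the point itself is included in its neighbourhood.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Illustrative (made-up) observations: two tight groups and one isolated point.
sample_points = [
    (41.880, -87.630), (41.881, -87.631), (41.879, -87.629),
    (40.710, -74.000), (40.711, -74.001), (40.709, -74.002),
    (35.000, -100.000),
]
sample_labels = dbscan(sample_points, eps_km=1.0, min_samples=3)
# sample_labels -> [0, 0, 0, 1, 1, 1, -1]
```

Because DBSCAN needs no cluster count up front and marks outliers as noise, it tends to fit point observations of uneven density; K-Means would instead require choosing the number of clusters in advance.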

Taking things a step further, we can add temporal dependencies such as seasonality and regional holidays. Cloud scale is essential for time series analysis of this nature. Additionally, density-based clustering and heatmap analysis are two ways in which Geospatial Clustering can be readily applied to common Data Science exploratory efforts.

Geospatial Regression

Regression analysis in machine learning allows us to predict the output values based on input features of the source datasets. Multiple input variables are used to estimate a value of the target output (with the target variable typically being a continuous variable). Common types of regression algorithms are linear regression, regression trees, multivariate regression, and lasso regression. Regression models have broad and impactful applicability, particularly in financial forecasting, trend analysis, time-series prediction, retail/marketing, population health, and so on. As discussed in our previous post, Isolines can also be an incredibly useful feature set for this sort of prediction.

For example, in automotive retail regression tasks, we might look to determine the target price of a vehicle make and model given predictors such as vehicle type, available accessories/features, historical sales volume, buyer demographic history, etc.

Made spatially aware, geospatial regression algorithms such as Geographically Weighted Regression (GWR) add spatial context to otherwise spatially independent variables (i.e., independent input variables can have spatial influences on other variables). For example, vehicle type variability (or lack thereof) at one location can affect a feature such as total available inventory at a nearby location. Needless to say, this concept has broad cross-industry applicability.
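The core mechanic behind GWR can be sketched in a few lines: instead of fitting one global model, we fit a separate weighted least-squares model at each location of interest, with nearby observations weighted more heavily than distant ones. The sketch below is a deliberately simplified, single-predictor version with a Gaussian distance kernel. The field names, the planar coordinates (assumed to be in kilometres), the bandwidth, and the data are all made up for illustration, and real GWR implementations add bandwidth selection and diagnostics that are not shown here.

```python
import math


def local_fit(observations, location, bandwidth):
    """Fit target = intercept + slope * predictor, weighted towards `location`.

    Each observation is a dict with "xy" (planar coordinates), "predictor",
    and "target" keys. Returns (intercept, slope) for the local model.
    """
    # Gaussian kernel: weight decays smoothly with distance from `location`.
    weights = [
        math.exp(-0.5 * (math.dist(obs["xy"], location) / bandwidth) ** 2)
        for obs in observations
    ]
    total = sum(weights)
    pairs = list(zip(weights, observations))
    x_mean = sum(w * o["predictor"] for w, o in pairs) / total
    y_mean = sum(w * o["target"] for w, o in pairs) / total
    # Closed-form weighted least squares for a single predictor.
    cov = sum(w * (o["predictor"] - x_mean) * (o["target"] - y_mean) for w, o in pairs)
    var = sum(w * (o["predictor"] - x_mean) ** 2 for w, o in pairs)
    slope = cov / var
    return y_mean - slope * x_mean, slope


# Made-up data: the predictor/target relationship differs between two regions.
observations = (
    [{"xy": xy, "predictor": p, "target": 10 + 2 * p}   # western sites, slope 2
     for xy, p in zip([(0, 0), (1, 0), (0, 1), (1, 1)], [1, 2, 3, 4])]
    + [{"xy": xy, "predictor": p, "target": 10 + 5 * p}  # eastern sites, slope 5
       for xy, p in zip([(100, 0), (101, 0), (100, 1), (101, 1)], [1, 2, 3, 4])]
)

west_intercept, west_slope = local_fit(observations, (0.5, 0.5), bandwidth=5.0)
east_intercept, east_slope = local_fit(observations, (100.5, 0.5), bandwidth=5.0)
```

With a small bandwidth, the western and eastern fits recover their own regional slopes (roughly 2 and 5 here), whereas a very large bandwidth weights every observation almost equally and approaches a single global regression that blends the two.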

Geospatial Classification

Classification is a supervised learning concept that categorizes a set of data into classes; assigning a class label such as “spam/not spam” is a typical go-to use case. For accurate modeling, classification requires training datasets with many examples of inputs and outputs from which to learn.

Geospatial Classification is especially useful when classifying at the pixel level or when applied to the entire raster dataset. Orthoimagery data, such as Sentinel-2 or Landsat 8 data, are perfect candidates for Geospatial Classification (e.g., swimming pool/no swimming pool, solar panel/no solar panel) or, more expansively, a semantic segmentation task to determine land usage (e.g., forest, farmlands, urban areas, dried lands, water bodies, etc.). LiDAR point cloud data workloads are another example where Geospatial Classification is practical.
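As a toy illustration of pixel-level classification, the sketch below trains a nearest-centroid classifier on a handful of labelled pixels and then labels every pixel in a small raster. Nearest-centroid is swapped in here only because it fits in a few lines; real land-use work would typically rely on far richer models and many more bands. The two-band (red, near-infrared) reflectance values, the class names, and the function names are all made-up assumptions for illustration.

```python
import math
from collections import defaultdict


def train_centroids(samples):
    """Compute the mean band vector per class from (band_values, label) pairs."""
    grouped = defaultdict(list)
    for bands, label in samples:
        grouped[label].append(bands)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify_pixel(bands, centroids):
    """Return the label whose centroid is closest to this pixel's band values."""
    return min(centroids, key=lambda label: math.dist(bands, centroids[label]))


def classify_raster(raster, centroids):
    """Classify a raster given as rows of per-pixel band tuples."""
    return [[classify_pixel(pixel, centroids) for pixel in row] for row in raster]


# Made-up training pixels: (red, near-infrared) reflectance with a class label.
training = [
    ((0.05, 0.02), "water"), ((0.06, 0.03), "water"),
    ((0.04, 0.45), "forest"), ((0.05, 0.50), "forest"),
    ((0.25, 0.28), "urban"), ((0.30, 0.30), "urban"),
]
centroids = train_centroids(training)

# A made-up 2x2 raster to classify.
raster = [
    [(0.05, 0.03), (0.045, 0.48)],
    [(0.28, 0.29), (0.06, 0.02)],
]
land_cover = classify_raster(raster, centroids)
# land_cover -> [["water", "forest"], ["urban", "water"]]
```

The output is itself a raster of class labels, which is the same shape of result a semantic segmentation model would produce for a land-usage task.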

How Maven Wave Can Help

Many of these concepts have broad cross-industry applicability — and they all require cloud scale to take them to production. That’s where Maven Wave comes in: whether it’s architectural guidance or white-glove implementation, we’re here to help you. Through our Data Science, Data Engineering, IoT, and Application Development practices, Maven Wave can further help your teams take these concepts to production. All you need to bring is your vision. 

Contact us to discuss your use case with our team today.


About the Author

Shannon Thompson
Shannon Thompson is an Application and Cloud Architect with more than 20 years of software engineering experience, designing innovative solutions for the public and private sectors. Shannon's deep background in GIS, Cloud, edge/IoT, and mobile "hybrid" solution architectures has created a diverse portfolio of innovative cross-industry solutions. When Shannon is not geeking-out about LiDAR, he can be found playing music, cooking, kayaking, or playing with his kids. Shannon lives in Belmar, NJ with his wife Kristine, two sons Finn and Luke, and his dog Levi.
November 18th, 2021
