How Databricks Powers Stantec’s Flood Predictor Engine


This is a collaborative post between Stantec and Databricks. We thank ML Operations Lead Assaad Mrad, Ph.D., and Data Scientist Jared Van Blitterswyk, Ph.D., of Stantec for their contributions.

 

At Stantec, we set out to develop the first-ever rapid flood estimation product. Flood Predictor is built upon an ML algorithm trained on high-quality features. These features are the product of a feature engineering process that ingests data from a variety of open datasets, performs a series of geospatial computations, and publishes the resulting features to a feature store (Figure 1). Maintaining the reliability of the feature engineering pipeline and the observability of source, intermediate, and resultant datasets ensures our product can bring value to communities, disaster management professionals, and governments. In this blog post, we explain some of the challenges we have encountered in implementing production-grade geospatial feature engineering pipelines and how the Databricks suite of features has enabled our small team to deploy production workloads.

Figure 1: Data flow and machine learning overview for flood predictor from raw data retrieval to flood inundation maps. Re-used from [2211.00636] Pi theorem formulation of flood mapping (arxiv.org).

The abundance of remote sensing data offers the potential for rapid, accurate, and data-driven flood prediction and mapping. Flood prediction is not easy, however, as every landscape has unique topography (e.g., slope), land use (e.g., paved residential), and land cover (e.g., vegetation cover and soil type). A successful model should be explainable (an engineering requirement) and general: able to perform well over a wide range of geographical regions. Our initial approach was to use direct derivatives of the raw data (Figure 2) with minimal processing such as normalization and encoding. However, we found that the predictions did not generalize sufficiently, model training was computationally expensive, and the model was not explainable. To address these issues, we leveraged the Buckingham Pi theorem to derive a set of dimensionless features (Figure 3); i.e., features that do not depend on absolute quantities but on ratios of combinations of hydrologic and topographic variables. Not only did this reduce the dimensionality of the feature space, but the new features capture the similarity of the flooding process across a wide range of topographies and climate regions.
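To make the idea concrete, here is a minimal sketch of deriving ratio-based (dimensionless) features in PySpark. The table, column names, and specific ratios are illustrative placeholders, not the actual Pi-theorem features used in Flood Predictor.

```python
# Hypothetical sketch: dimensionless features as ratios of hydrologic and
# topographic variables. Names are assumptions, not Flood Predictor's schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.table("raw_terrain_hydrology")  # hypothetical source table

dimensionless = (
    raw
    # ratio of local relief to a characteristic watershed length scale
    .withColumn("relief_ratio", F.col("local_relief") / F.col("watershed_length"))
    # ratio of storm precipitation depth to a characteristic storage depth
    .withColumn("precip_ratio", F.col("precip_depth") / F.col("storage_depth"))
    .select("pixel_id", "huc12", "relief_ratio", "precip_ratio")
)
```

Because each feature is a ratio, it stays comparable across watersheds with very different absolute elevations or rainfall totals, which is what gives the model its generality.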

By combining these features with logistic regression or tree-based machine learning models, we are able to produce flood risk probability maps comparable to those from more robust and complex models. The combination of feature engineering and ML allows us to extend flood prediction to new areas where flood modeling is scarce or unavailable, and provides the basis for rapid estimation of flood risk at a large scale. However, modeling and feature engineering with large geospatial datasets is a complicated problem that can be computationally expensive. Many of these challenges have been simplified and made more cost-effective by leveraging capabilities and compute resources within Databricks.

Figure 2: Illustrative maps of the dimensional features first used to build the initial model for the 071200040506 HUC12 (12-digit hydrologic unit code) from the categorization used by the United States Geological Survey (USGS Water Resources: About USGS Water Resources).

 

Figure 3: Maps of dimensionless indices for a single sub-watershed with a 12-digit hydrologic unit code of the 071200040506 HUC12. Compare to figure 6 for visual confirmation of the predictive power of these dimensionless features.

Every geographical region consists of tens of millions of data points, with data compiled from several different data sources. Computing the dimensionless features requires a diverse set of capabilities (e.g., geospatial), substantial compute power, and a modular design where each "module" or job applies a constrained set of operations to an input with a specific schema. The substantial compute power required meant that we gravitated toward solutions that were cost-effective yet suitable for large amounts of data, and Databricks Delta Live Tables (DLT) was the answer. DLT brings highly configurable compute with advanced capabilities such as autoscaling and auto-shutdown to reduce costs.
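As a rough illustration of the modular design, each "module" can be expressed as a DLT table that reads one upstream dataset and applies a constrained transformation. The table and column names below are assumptions for illustration, not our production code.

```python
# Hypothetical DLT module: one table, one constrained transformation.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="One modular step: terrain metrics on the common pixel grid")
def terrain_metrics():
    # Read an upstream dataset managed by the pipeline and apply a single
    # transformation against a known input schema.
    return (
        dlt.read("raw_terrain")  # hypothetical upstream dataset
        .withColumn("slope_pct", F.col("slope") * 100.0)
    )
```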

During the conceptual development of Flood Predictor, we placed emphasis on the ability to quickly iterate on data processing and feature creation, and less priority on maintainability and scalability. The result was a monolithic feature engineering codebase, where dozens of table transformations were performed within a few Python scripts and Jupyter notebooks; it was hard to debug the pipeline and monitor the computation. The push to productionize Flood Predictor made it necessary to address these limitations. The automation of our feature engineering pipeline with DLT enabled us to implement and monitor data quality expectations in real time. An additional benefit is that DLT breaks the pipeline apart into views and tables on a visually pleasing diagram (Figure 4). Furthermore, we set up data quality expectations to catch bugs in our feature processing.
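DLT expectations are declared as decorators on the table definition. A minimal sketch, with hypothetical table, column, and expectation names, looks like this:

```python
# Hypothetical DLT expectations on a feature table.
import dlt

@dlt.table(comment="Dimensionless features published to the feature store")
@dlt.expect_or_fail("non_null_huc12", "huc12 IS NOT NULL")       # stop the update on bad keys
@dlt.expect("relief_ratio_is_positive", "relief_ratio >= 0")     # record violations, keep running
def dimensionless_features():
    return dlt.read("terrain_metrics").select("huc12", "pixel_id", "relief_ratio")
```

Violation counts are surfaced in the pipeline event log, which is what makes it possible to monitor data quality at every run rather than discovering problems downstream.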

Figure 4: Part of our Delta live table pipeline implementation for feature engineering.

The high-level pipeline visualization and pipeline expectations make maintenance and diagnostics easier: we are able to pinpoint failure points, go straight to the offending code and fix it, and load and validate intermediate data frames. For example, we were able to discover that our transformations were leading to pixels, or data points, being duplicated up to four times in the final feature set (Figure 5). This was easily detected after setting an "expect or fail" condition on row duplication. The chart in Figure 5 automatically updates at every pipeline run as a dashboard in the Databricks SQL workspace. Within a few hours we were able to identify that the culprit was an edge case where data points at the edge of the maps were not properly geolocated.
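The kind of check behind the Figure 5 dashboard can be sketched as a small aggregation over the final feature set. Table and column names (features, huc12, pixel_id) are assumptions for illustration.

```python
# Hypothetical duplicate-pixel audit: count duplicated rows per subwatershed
# and bucket them by how many copies appear (e.g., 2 or 4).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
features = spark.read.table("features")  # hypothetical final feature table

duplicates = (
    features.groupBy("huc12", "pixel_id")
    .agg(F.count("*").alias("n_copies"))
    .filter("n_copies > 1")
    .groupBy("huc12", "n_copies")
    .agg(F.count("*").alias("duplicated_pixels"))
    .orderBy("huc12", "n_copies")
)
duplicates.show()
```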

Figure 5: Databricks dashboard based on a SQL query to count the number of duplicated rows and categorize (color) by the frequency of duplication (2 or 4). huc and HUC12 are the 12-digit hydrologic unit code delineating subwatersheds.

We now return to an earlier point: we need our flood prediction system to be as widely applicable (general) as possible. How do we quantify the ability of a trained machine learning model to generalize beyond the regions in the training datasets? The geospatial dependency of points in our datasets requires care when partitioning data into training and test sets. For this we use a variant of cross-validation called spatial cross-validation (e.g., https://www.nature.com/articles/s41467-020-18321-y) to compute evaluation metrics. The overarching idea of spatial cross-validation is to split the features and labels into a number of folds based on geographical location. Then, the model is trained successively on all but one fold, leaving one fold out at each step. The evaluation metric (e.g., root mean squared error) is computed at each step, yielding a distribution of scores. We had 10 subwatersheds with labels to train on, so we applied a 10-fold spatial cross-validation where each fold is a subwatershed. Spatial cross-validation is a pessimistic metric because a held-out fold may contain features that are not representative of the training folds, but that is exactly the high standard we want our dimensionless features and model to achieve.
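A minimal sketch of leave-one-subwatershed-out spatial cross-validation in PySpark is shown below; the table, column names, and the choice of F1 on the flood class as the metric are illustrative assumptions rather than our exact training code.

```python
# Hypothetical spatial cross-validation: hold out one subwatershed (HUC12) per fold.
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
data = spark.read.table("training_features")  # hypothetical assembled feature table

hucs = [row.huc12 for row in data.select("huc12").distinct().collect()]
evaluator = MulticlassClassificationEvaluator(
    metricName="fMeasureByLabel", metricLabel=1.0, labelCol="label"  # F1 of the flooded class
)
scores = []

for held_out in hucs:
    train = data.filter(data.huc12 != held_out)
    test = data.filter(data.huc12 == held_out)
    model = GBTClassifier(labelCol="label", featuresCol="features").fit(train)
    scores.append((held_out, evaluator.evaluate(model.transform(test))))

# `scores` now holds one metric value per held-out subwatershed, i.e., the
# distribution of the evaluation metric across spatial folds.
```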

With our features and evaluation process in place, the next step is training the model. Fortunately, training a statistical model on a large data set is straightforward in Databricks. Ten subwatersheds contain on the order of 100 million pixels, so the full training set does not fit into the memory of most compute nodes. We experimented with different types of models and hyperparameters on a subset of the full data on an interactive Databricks cluster. Once we settled on a given algorithm and parameter grid, we defined a training job to take advantage of the lower cost of job clusters. We use PySpark's MLlib library to train a scalable Gradient-Boosted Tree classifier for Flood Predictor and let MLflow track and monitor the training job. Choosing the right metric to evaluate a model for flooding is important; for most events, the frequency of 'dry' pixels is much higher than that of flooded pixels (on the order of 10 to 1). We chose to compute the harmonic mean of precision and recall (F1 score) on the flood pixels as a measure of model performance. This choice was made because of the large imbalance in our target labels and because classification threshold invariance is not desirable for our problem.
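A training job of this kind can be sketched as follows; the tables, column names, and hyperparameter values are assumptions for illustration, not our actual job configuration.

```python
# Hypothetical training job: MLlib Gradient-Boosted Trees tracked with MLflow.
import mlflow
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
train = spark.read.table("training_features")  # hypothetical
test = spark.read.table("holdout_features")    # hypothetical

with mlflow.start_run(run_name="flood_predictor_gbt"):
    params = {"maxDepth": 5, "maxIter": 50}
    mlflow.log_params(params)

    gbt = GBTClassifier(labelCol="label", featuresCol="features", **params)
    model = gbt.fit(train)

    # F1 score of the flooded class (label = 1), chosen because of class imbalance
    evaluator = MulticlassClassificationEvaluator(
        metricName="fMeasureByLabel", metricLabel=1.0, labelCol="label"
    )
    mlflow.log_metric("f1_flood", evaluator.evaluate(model.transform(test)))
    mlflow.spark.log_model(model, "model")  # persist the Spark ML model artifact
```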

Unlike other kinds of data, a user requests geospatial data by delineating a bounding box, a shapefile, or by referencing geographical entities such as cities. A typical request for Flood Predictor output specifies a bounding box in decimal degree coordinates and a precipitation depth, such as 3.4 inches. The server-side application ingests these inputs, in JSON format, and queries the feature store for the pixel features within the requested bounding box. As is typical for geospatial data services, the returned output is not at the level of a single pixel, but pre-defined tiles containing a given number of pixels. In Flood Predictor's case, each tile contains 256×256 pixels. The advantage of this approach is proactively limiting the data volume served by the API, maintaining satisfactory response times without overloading the database nodes. To achieve this design, we tag every pixel in the database with a tile ID that specifies the tile it belongs to. So, after the user query is ingested, the server-side application finds the tiles that intersect the requested bounding box and predicts flooding for those areas. With this design, Flood Predictor can return high-quality flood predictions for dozens of square miles at 3-10 meters of resolution within only minutes.
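The tile lookup can be sketched with a simple bounding-box intersection; the request payload, tile index table, and column names below are hypothetical, and a production service would likely use a spatial index rather than a plain filter.

```python
# Hypothetical resolution of 256x256-pixel tiles intersecting a requested bounding box.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example JSON payload: bounding box in decimal degrees plus precipitation depth (inches)
request = {"min_lon": -88.4, "min_lat": 41.5, "max_lon": -88.2, "max_lat": 41.7,
           "precip_in": 3.4}

tiles = spark.read.table("tile_index")  # hypothetical table: tile_id and tile bounds

intersecting = tiles.filter(
    (tiles.max_lon >= request["min_lon"]) & (tiles.min_lon <= request["max_lon"]) &
    (tiles.max_lat >= request["min_lat"]) & (tiles.min_lat <= request["max_lat"])
)

# Pull pixel features for just those tiles, then score them with the trained model
pixel_features = spark.read.table("feature_store").join(
    intersecting.select("tile_id"), on="tile_id", how="inner"
)
```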

Figure 6: Flood predictor output compared to the label from 2D hydraulic and hydrologic modeling for the 071200040506 HUC12 and a 10-year storm event. These are categorical maps where “1” denotes flooded areas. The resolution of the map is 3 meters and downsampled by a factor of 100 for visualization purposes.

Flooding is fundamentally a geospatial random process that is constrained by spatial patterns of topography and land cover and affects a significantly large number of communities and assets. That is why flood prediction has been the subject of numerous physical and statistical models of varying quality, yet there exists, to our knowledge, no product comparable to Flood Predictor. Databricks has been a decisive factor in Flood Predictor's success: it was by far the most cost-effective way for a small team to quickly develop a proof-of-concept prediction tool with available datasets, as well as to implement production-grade jobs and pipelines.

Backed by Databricks' end-to-end machine learning operations, Stantec's enterprise-grade Flood Predictor helps you get ahead of the next flooding events and save lives through well-designed predictive analytics. Check it out on the Stantec website: Flood Predictor (stantec.com) and on the Microsoft Azure marketplace.
