Book DP-500 Training

DP-500 Table of contents

In the previous articles, you’ve learned why we can refer to Azure Synapse Analytics as a Swiss-army knife for all analytic workloads. Why Swiss-army knife? Because you can leverage Synapse Analytics to implement end-to-end analytic solutions, by taking advantage of different platform components for ingesting, storing, querying, and visualizing the data. However, there is more to it – you can also use Synapse for scoring machine learning models.

But, before I explain how to take advantage of the T-SQL PREDICT function for scoring machine learning models, let’s first examine the prerequisites for handling machine learning tasks in Azure Synapse Analytics.

Prerequisites

First, and most important, a machine learning model must be trained outside Synapse Analytics! Not only that, the model has to be registered in either Azure Machine Learning or Azure Data Lake Storage Gen2. Additionally, the model must be in an Open Neural Network Exchange, or ONNX format.

What’s the benefit of scoring machine learning models within Azure Synapse Analytics, you might ask? Well, the main advantage is the fact that you don’t have to move the data for scoring outside of the secure data warehouse. Additionally, it provides an interface for using T-SQL language and its PREDICT function, instead of writing Python code.

High-level overview of ML model scoring

Let’s take a high-level overview of the workflow for scoring a machine learning model.

The first step, as already mentioned, Dedicated SQL pool needs a model that’s already trained. Since the Dedicated SQL pool supports only ONNX format, in case your models are stored in other formats, you need to convert them prior to using a Dedicated SQL pool.

For the input data, you have to choose between the following supported data types:

  • int, bigint, real, float for numeric data types
  • char, varchar, and nvarchar for text values.

Scoring data must be in the same format as training data. Please keep in mind that complex data types, like multi-dimensional arrays, are not supported.

Finally, make sure that input data types and names are corresponding to the data types and column names of the new prediction data.

Once you load the model into the Dedicated SQL pool, it’s stored as a hexadecimal string value. You may expand the model data with additional columns, for example, ID or description, but you should use varbinary(max) as the data type of the model column.

Here is a simple example of the table to store the model data:

CREATE TABLE [dbo].[MLmodel]
(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Model] [varbinary](max) NULL,
    [Description] [varchar](200) NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)

Once the table is created, you can use the COPY command to load the model in the Dedicated SQL pool.

The last step is to score the machine learning model. For that task, you can use the T-SQL PREDICT function. This function takes two arguments as input:

  • Model – can be defined as a variable, literal value, or value produced by a scalar subquery
  • Data – specifies the data used for scoring or prediction

In the following example, I’m using a WITH clause to define the name of the result set for the data parameter:

SELECT d.*
, s.Score
FROM PREDICT(MODEL = (SELECT Model FROM MLmodel WHERE Id = 1),
DATA = dbo.table AS d, RUNTIME = ONNX) 
WITH (Score float) AS s;

Expanding with INSERT INTO

One of the common use cases when using PREDICT function, is to generate a score for input data, and then insert these predicted values into another table. For that purpose, you can use a well-known T-SQL syntax that combines the INSERT INTO operation with the SELECT statement:

DECLARE @model VARBINARY(max) = (SELECT model FROM MLmodel WHERE model_name = 'DemoModelCustomer');

INSERT INTO customerScoring (c1, c2, c3, c4, score)
SELECT d.c1, d.c2, d.c3, d.c4, s.score
FROM PREDICT(MODEL = @model, DATA = dbo.table AS d, RUNTIME = ONNX) WITH(Score FLOAT) AS s;

Conclusion

Although Azure Synapse Analytics is not a primary, go-to platform for machine learning tasks, it still enables managing some portions of the workload in a very convenient way. Not just that you can leverage T-SQL language for scoring machine learning models, but you can also quickly and easily scale out the resources in Azure Synapse Analytics for complex business requirements related to predictive analytics.

Thanks for reading!

Last Updated on October 5, 2022 by Nikola

Spread the music: