Azure Databricks or Azure ML: Which ML Solution is Right for You?

Webinar

AZURE DATABRICKS OR AZURE ML: WHICH ML SOLUTION IS RIGHT FOR YOU?

TRANSCRIPT

 

Welcome, everyone, to "Azure Databricks or Azure ML: Which ML Solution is Right for You." Before we dive in, I want to quickly introduce you to Eastern Analytics, a boutique consulting service specializing in Microsoft Analytics, Power BI, and Azure. For more than 25 years, we've been helping companies maximize the value of their data through technology. As seasoned experts, we have the functional knowledge that allows us to understand our customers' sources and requirements, and the technical expertise to design and build systems that are robust, flexible, and secure. We assist customers at any stage of their analytics project and can take them through the 'four Ds' - Design, Development, Deployment, and DevOps. We cover everything from technology advice and solution architecture to support and staff augmentation. We are analytics people with the knowledge and skills to ensure our customers are successful in meeting their analytics needs.

 

Now, Scott Pietroski is Eastern Analytics' Managing Partner and Senior Solution Architect. For more than 25 years, he's been building out analytics platforms for corporations such as Bose, Adidas, Estée Lauder, and many more. He specializes in Microsoft Power BI and Azure and is excited to share today's presentation with you.

 

Hi, everybody. Welcome to today's presentation. Today, we're going to do an overview of Azure ML and Databricks. We're going to talk about the strengths of Azure ML and the strengths of Databricks ML. We'll do a little bit on cost considerations between the two, talk about data prep and modeling, and then talk about ongoing learning - moving your models into production. After that, we'll have a Q&A. We hold the Q&A to the end because there's so much information packed into this and it's only half an hour long, so we do the questions at the end.

 

The first topic is an overview of Azure ML and Databricks. Azure ML itself is a standalone platform for machine learning. It's available on the Azure platform only - you can't get it on Google Cloud or AWS. It has seamless integration with other Azure tools: it integrates with cloud storage, with Active Directory, and with Data Factory. It includes AutoML functionality similar to what's in Databricks, but it's presented in a different way.

 

For those of you who are not familiar with AutoML: AutoML is basically a wizard-based tool where the system helps you select which algorithms are right to use - that is, which algorithms are the most accurate. It automatically steps through cleaning your data and doing feature selection, then marches through and runs multiple jobs across different algorithms and different hyperparameters, saving the artifacts for each one of those jobs. So it's really a great place to start.
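
To make that concrete, here's a minimal sketch of what kicking off an AutoML run looks like when driven from code with the Azure ML Python SDK v2, rather than the wizard. The workspace IDs, the registered "churn-training" MLTable, the "churned" target column, and the "cpu-cluster" compute are all placeholder assumptions.

```python
# Minimal sketch: submit an AutoML classification job with the Azure ML SDK v2.
# Workspace IDs, the MLTable, target column, and compute name are placeholders.
from azure.ai.ml import MLClient, Input, automl
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

classification_job = automl.classification(
    compute="cpu-cluster",
    experiment_name="churn-automl",
    training_data=Input(type="mltable", path="azureml:churn-training:1"),
    target_column_name="churned",
    primary_metric="accuracy",
    n_cross_validations=5,
)
# Cap how long AutoML loops through algorithms and hyperparameter combinations
classification_job.set_limits(timeout_minutes=60, max_trials=20)

returned_job = ml_client.jobs.create_or_update(classification_job)
print(returned_job.studio_url)  # follow the child jobs in the studio UI
```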

 

Azure ML is an ML platform. Databricks has a bunch of other functionality, and ML is one component of it; for Azure ML, it's not a component - it's the whole purpose of the platform. It can be GUI driven - I'll show you in a moment that it has a designer, so you can do no code or low code - or it can be notebook driven, where you consume everything inside of a notebook and do all of the coding. It's based in Python, and it provides a framework for your common data science tools.

 

If we take a look at Azure ML here, this is inside of their designer studio. You can see over on the left it has notebooks: you can code in your Jupyter Notebooks and, from those notebooks, consume any of the elements and components that you create here. You can also use them in the designer.

 

This happens to be a picture of the designer. Inside of the designer, you're basically creating pipelines: you start with your data ingestion - whatever data set you're going to be consuming - and then you walk through and do your data manipulation. Behind the scenes, it's a data frame; the data frame flows through all of these different steps. This particular pipeline happens to be training two different models with two different algorithms, and then it compares the results at the end to see which one is the most accurate.

 

Everything over here you can do in the designer, and you can also do everything the designer does in notebooks. You set up your data components - think of them as connections to other systems. Azure ML does not store the data. Azure ML is a platform that ingests data, performs whatever tasks you ask it to, and then exports that data or that model and stores it inside of an Azure storage account.

 

You've also got jobs - think of the jobs that you run. There is no scheduler; I'll discuss that in a little while, but you do run things as jobs. It comes with a whole bunch of predefined environments - think of them as software and OS bundles for creating your computes. And because you can do anything you can do in the designer, plus a lot more, inside of your notebooks in Python, you can import and load any libraries that you can get onto that compute and reference them in your Python the way that a typical data scientist would.

 

If we talk about Databricks, Databricks is a standalone cloud environment; Azure ML is an environment built specifically for ML. A lot of you are probably familiar with Databricks for the basics. Databricks is in two Gartner Magic Quadrants: it qualifies as a database management system, and it's also a leader in the data science and machine learning platforms quadrant.

 

Think of the DBMS side as a competitor of Snowflake, and the data science and ML side also competes with Snowflake. Databricks is available on AWS, and it's also available on Azure and on Google Cloud. It uses something called Delta Lake, an open-source library that it uses to access blob storage and file storage, but it accesses that storage similar to a table - those are your Delta Lake tables. That open-source layer is what they happen to use for their reads and writes, their access control, and the rest of it.
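
As a small sketch of what that looks like in practice, here's how a Delta table gets written and then queried like any other table from a Databricks notebook (where `spark` is already defined); the table and column names are made up for illustration.

```python
# Write a small DataFrame as a Delta table, then query it with SQL.
# Runs in a Databricks notebook, where `spark` is predefined; names are illustrative.
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(region="NE", amount=120.0),
    Row(region="SW", amount=80.0),
    Row(region="NE", amount=45.0),
])
df.write.format("delta").mode("overwrite").saveAsTable("sales_bronze")

# Delta handles the underlying blob/file storage, but you work with it as a table
spark.sql("SELECT region, SUM(amount) AS total FROM sales_bronze GROUP BY region").show()
```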

 

Databricks is designed for parallel processing. Not only does it support Python, which is what Azure ML does, but it also supports R, Scala, Spark, and SQL. Inside of their notebooks, you can actually switch to SQL and go after the Delta Lake tables. Its overall architecture is what they call the Lakehouse architecture.

 

A Lakehouse basically deals with your blob storage, so you can have structured data, unstructured data, and semi-structured data. And inside of that data lake you also have the 'house' part of the lakehouse. The house part is basically something similar to an enterprise data warehouse: it's different than a relational database, but it's designed around tables and SQL.

 

Databricks also includes AutoML functionality. In its ML area, just like in Azure ML, you can go in and pull in your data set, and it will loop through and try to find the most accurate algorithm so that you can use that as a starting point.

 

Now, both environments, Azure ML and Databricks, when they run through and create all of those AutoML jobs, save off notebooks in Python. So you can take those notebooks in either environment and use them as a starting point - it's a great way to start off. It also does hyperparameter tuning, data cleansing, harmonization, and the other things you would need to prep your data for ML. It, too, provides an ML framework.

 

Now, if we look at this - this is inside of Databricks. It's much more code driven; it's almost all code driven. In this case, I happen to have a picture of just the machine learning component of it. If we look through here, we can see the workspaces and repositories - these are the places where you actually store your notebooks and do your development work.

 

You do have a Data area to actually connect to data sources; it tends to pull up wizards that generate code you can cut and paste into your notebooks. They have two different kinds of computes that you can use, whether general purpose or job based, and it has Workflows, which is the job scheduler.

 

It also works with something called MLflow. Just like in Azure ML, you create your experiments and it automatically uses MLflow - Azure ML uses it automatically, and inside of the Databricks notebooks you actually toggle it on right in your development. It will track all of the different outputs of your ML experiment or code. MLflow is very useful, and it's an industry standard at this point.
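
As a hedged illustration of what that tracking looks like in code (the same pattern runs in a Databricks or Azure ML notebook), here's a small, self-contained example using scikit-learn's built-in breast cancer dataset; the experiment name and parameters are just for illustration.

```python
# Minimal MLflow tracking example: log parameters, a metric, and the model itself.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

mlflow.set_experiment("rf-baseline")  # experiment name is illustrative
with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    mlflow.log_metric("val_accuracy", acc)
    mlflow.sklearn.log_model(model, "model")
```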

 

If we look at the strengths of Azure ML - this is just Azure ML here - one of the biggest strengths is that it's GUI or code based. The designer is great because you can quickly prototype things: you can create pipelines that do certain kinds of experiments, then clone them and swap out data sets and tweak the columns and the features you want in there. But it's GUI or code based. Not only does it do no code, which is the designer, but inside of the designer, if you have a step that one of their standard activities does not support, you can drag in a Python activity and write additional Python code right in that activity. It also has an activity for R - you can drag R code into the middle of a pipeline and then continue on.

 

It comes with the standard libraries on its computes - things like scikit-learn. It uses MLflow. You can use TensorFlow. It comes automatically with pandas and NumPy and whatever statistical packages you consider standard. If something's not in there, you can always pip install it.

 

Connectivity is GUI driven, so that's a strength. Why I say that's a strength is that you can go in and create a data set inside of Azure ML, and from there it's always in that workspace - the connection is already defined. You can then reference that data set from within a notebook, and when you do, it automatically takes care of connecting and the rest of it; you just reference the object. Whereas inside of a Databricks environment, you're really just given skeleton code that you need to reuse.
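
Here's a rough sketch of what consuming one of those pre-defined data assets looks like from a notebook. The asset name "sales-data", the workspace IDs, and the CSV format are assumptions, and reading the resolved URI with pandas assumes the azureml-fsspec package is installed on the compute.

```python
# Sketch: consume a registered Azure ML data asset from a notebook.
# "sales-data", the workspace IDs, and the CSV format are placeholder assumptions;
# reading azureml:// URIs with pandas requires the azureml-fsspec package.
import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

data_asset = ml_client.data.get(name="sales-data", version="1")
df = pd.read_csv(data_asset.path)  # the connection details are resolved for you
print(df.head())
```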

 

AutoML is very robust inside of Azure ML. It generates the Jupyter Notebooks, as I talked about earlier. If we look at this screenshot, this happens to be AutoML inside of the Azure ML environment. This was the name of the experiment; the experiment was run, and I happen to be focused on the child jobs. This was for one data set that we passed in. It automatically went through and did data cleansing and checked for class imbalances. Then it looped through, and each one of these jobs was a different algorithm, or a different set of hyperparameters for an algorithm.

 

At the end, it suggests on your Overview tab which one it thought was the most accurate and why. Along with that, it reports data guardrails. Data guardrails are the actual data preparations and checks that it did on your data before it started running it through algorithms: replacement of nulls, making sure you don't have features with high cardinality, things like that.

 

You also store your models here - you can export your models to a model repository - and it keeps all of the components of every single one of these jobs, components that you can retrieve and use as a starting point or publish.

 

Strong SDK support. This is very important because anything you can build in the GUI, and even more, you can do inside of the Python notebooks, and the notebooks are where you use the SDK-supported functionality that interacts with the environment. Because of that, they've really built an object-based environment to do your development and your experiments against.

 

It has tight integration with Azure Data Factory. When you go in and run an experiment, you can execute it from within the GUI inside of Azure ML, but when it comes to automation or orchestration, you need to do that from Data Factory. Because of that, it has very tight integration, and it's easy to trigger something inside of a workspace - whether it's a retraining run or a batch inference run - from Data Factory.

 

Importantly, it has Power BI integration - Databricks does not have this Power BI integration. Any model that you publish inside of Azure ML, if you have the authorizations to get to the ML workspace, you can then consume inside of Power BI. This is a screenshot of the Power BI service; it works basically the same way whether you're in the desktop or the service. As long as you're logged into Azure, it will show the models that you can consume.

 

In this case, I was out on the service, inside Power Query on one of the dataflows, and I selected that I wanted to add an insight, which pulls up the AI interface. When I did that, I could see the two Azure ML models that I've published - I published these two as REST (web) endpoints. Now I can consume them, add them into any table, map my columns to its features, and it will automatically return the inference columns, plus probabilities and a couple of other columns, depending on the model.

 

Last but not least, it's designed for ML. It's state of the art and it continues to evolve - every month, more buttons and more capabilities buried down deep will show up. It really is a nice platform.

 

If we look at the strengths of Databricks ML, well, the biggest strength of Databricks ML is that it's part of Databricks. If you're already up on Databricks, you just use the ML functionality within Databricks without having to spin up another platform that you have to support and have people familiar with, whether that's the authorization assignments, access control, and the rest of it. It sits on top of the Lakehouse, and if you're already using Databricks, it's a great place to just continue on with.

 

It's code based. Now, to some people that's not necessarily a positive, but a lot of data scientists - anybody who goes back a ways - are already used to that; most of their experiments have already been done in Python or Java. It supports Python, Python for Spark, Scala, R, and SQL. It very easily integrates MLflow: simply referencing a library will collect all your information for you, which removes a lot of what used to have to be done manually, logging off your models' results as you looped through your different algorithms.

 

Now, the feature catalog. The feature catalog is something unique, which is pretty clever, and it's expanding. Think of it as an area set aside for what you would consider reference data or master data. For example, say you took the US Census data, maybe by zip code, and then built or engineered a bunch of features on it that you wanted to use - groupings by population or whatever it happened to be. Those additional features that you engineer, you store off in the feature catalog, and you can schedule runs to populate it and keep it up to date.

 

Now, that feature catalog can be consumed inside of the notebooks or inside of AutoML - you can say 'reference this feature catalog, and here's my link.' What it does is give you an area where you can maintain your feature set outside of your experiments and then make those features available to all your other experiments, consistently. It's nice.
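
Here's a hedged sketch of that workflow using the older `databricks.feature_store` client (newer Unity Catalog workspaces use `databricks.feature_engineering` instead); the table name and the engineered Spark DataFrame `zip_features_df` are illustrative.

```python
# Publish engineered features to the Databricks Feature Store so other
# experiments (and AutoML) can consume them consistently. Names are illustrative,
# and zip_features_df is assumed to be a Spark DataFrame built upstream.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
fs.create_table(
    name="census.zip_features",          # hypothetical database.table
    primary_keys=["zip_code"],
    df=zip_features_df,
    description="Census-derived features by zip code",
)

# Any notebook in the workspace can now read the same feature set
features = fs.read_table("census.zip_features")
```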

 

For a model repository: not only do you store off and collect everything with MLflow, but you can register your models in the model repository. One of the nice things it has is a tag you can put on a model as to whether it's a production model or a test model. And where that comes into play, it's not really just a tag: there are functions you call within your notebooks to retrieve models, and rather than telling it your model ID and then a version number, you can tell it the model ID and that you want the 'Production' version. It will then look up whatever version number is tagged as production and return that to you.

 

You also have to approve moving something into a production state. It's like lifecycle management for a model, where you can control consumption and make sure that you only consume the correct version of a model when you're actually running something.
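
A minimal sketch of that retrieve-by-stage pattern with MLflow's model registry follows; the model name and version are hypothetical, and newer MLflow releases steer you toward aliases rather than stages, but the idea is the same.

```python
# Promote a registered model version to "Production", then let consumers load
# it by stage instead of by version number. Name and version are hypothetical.
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(name="churn_model", version="3", stage="Production")

# Consumers ask for whatever version is currently tagged Production
model = mlflow.pyfunc.load_model("models:/churn_model/Production")
```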

 

Model consumption is easy. When you're on a model, it'll suggest and generate code for you to easily cut and paste into your notebooks. It also supports different consumption patterns, meaning you can reference it externally - using cURL, for example - or reference it with API calls against an endpoint. It also supports batch and streaming scenarios.
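
As an illustration of the API-call pattern, here's roughly what invoking a Databricks model-serving endpoint could look like from Python; the workspace URL, endpoint name, token variable, and feature columns are all placeholders.

```python
# Illustrative call to a Databricks model-serving REST endpoint.
# URL, endpoint name, token, and feature columns are placeholders.
import json
import os

import requests

url = "https://<workspace-url>/serving-endpoints/churn-model/invocations"
headers = {
    "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
    "Content-Type": "application/json",
}
payload = {"dataframe_records": [{"tenure": 12, "monthly_charges": 70.5}]}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())  # the model's predictions come back as JSON
```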

 

Another strength is its integrated scheduler - there is no scheduler in Azure ML. With the scheduler inside of Databricks, you're scheduling notebooks: you can use the notebooks you already have, and you can also incorporate them into your Delta Live Tables flows.

 

And last but not least, it has the Delta Lake integration. When you pull data up for your AutoML, you're pulling it out of tables - your Delta Lake tables - and it also wants to export the data to another Delta Lake table. So it's reading from and writing to tables in a database, which is a nice feature.

 

If we look now at cost considerations, these are estimated costs for just a couple of models that are published. As I'm sure you know if you're working with cloud resources now, they can be very expensive, but you can also do all sorts of things to contain your costs. For an Azure ML workspace, the workspace itself and the storage account are not a big deal. Everything in Azure ML is stored inside of an Azure storage account that's your own account - so it's a platform as a service, but all of the data is being read from and written to your own storage account; it's not storing your data. It requires basic computes: a compute instance that's on demand for your development work. From its inception, that instance didn't auto-terminate, which was a problem, but just recently I noticed that it now auto-terminates, so you can tell it to shut off every day at five o'clock so people don't leave it running all night.

 

Your compute cluster for training is on demand; it can be set to terminate after so many minutes of inactivity. Then there's your inference cluster - this is for your REST endpoint for predictions. That tends to be the most expensive component, and that's because it's always up and running. It's a Kubernetes cluster, and you need to have it up and running to be able to respond to a Power BI request, so they don't give you the ability to auto-terminate it.

 

If we look inside of Azure, these are the components that were created for an Azure ML workspace. We've got our Azure Machine Learning workspace; the Kubernetes service, which is our inference cluster that's up and running all the time; the Key Vault, which is just security; Application Insights; and last but not least, the storage account. This storage account is the area where everything for your ML workspace is stored.

 

Now, for Databricks. These numbers were also thrown together as an estimate for just a couple of models being in production. The cost to train them and the cost for inference runs and batch runs is all variable: it depends on your use case, the size of your data set, the size of your models, and also whether or not you're already running these computes - if you're already running them for Databricks, ML may not add as much. If you go through the required computes, we've got an all-purpose compute for your development work, an all-purpose compute for on-demand training as you're running through and training models, and then job computes for future retraining or batch inferencing. Then there are the SQL endpoints - a whole bunch of different kinds of endpoints you can use, whether dedicated or not. Now they have serverless, which I think is in preview, but it's coming soon. You can also set them to auto-terminate.

 

This is just a screenshot of what it looks like inside of Databricks. I happen to be on a job compute, and this job compute was there because I ended up publishing something as a REST endpoint - it basically runs permanently until you stop it. The all-purpose computes are the ones you use for your notebooks and interactive work.

 

Data prep and modeling. The good thing about the Azure ML world is that inside of its designer there's a whole bunch of activities to do data prep. You can also do it inside of your notebooks, which is what people are typically used to, but you have a lot of functionality in there. The designer includes the ability to combine your data frames, do encoding and vectorization, and handle class balances. The notebooks support all of that and more.

 

If we look inside of Azure ML, this is the design mode of the designer, and we can see just one activity here. This one happens to be for text encoding, and it was used for an NLP pipeline. You can see what you can do here: you can set it to do TF-IDF and configure your n-gram settings. All of the settings are in each activity's properties, and each activity serves a purpose - this one here is splitting your data set, and you can set up the percentages behind it in its properties.

 

If we go over to the left here, we have data transformation: there are 19 different kinds of activities that you can drag in to manipulate your data, do normalization, and all sorts of things. We can also see that it has 19 different algorithms you can drag in. They're not all applicable to every use case - some are for regression, some for classification - but it really has all the activities you could need to manipulate your data.
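
For readers more comfortable in code, the same kinds of steps those designer activities perform (TF-IDF with n-gram settings, then a percentage-based split) look roughly like this in scikit-learn; the tiny text sample is just for illustration.

```python
# Rough notebook equivalent of the designer activities described above:
# TF-IDF featurization with n-gram settings, then a percentage-based split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["great product", "terrible support", "works as expected", "would not buy again"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(texts)

# 70/30 split, like setting percentages in the split activity's properties
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
```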

 

If we look at Databricks and ML, data prep can be done as part of any notebook. Data scientists are used to going in and, inside of their notebook, having different sets of activities that do different kinds of data prep.

 

This is a picture of just one of the notebooks inside of a workspace. In this case, it happens to instantiate MLflow, and then it flows through your usual Python notebook development. The feature store does allow you to engineer features outside of the ML work and consume them, which is nice.

 

Now, the notebook development, once again, supports Python, Spark, Scala, SQL, and R. Any library that you can get installed on your compute, you can reference inside of your Python and use how you usually would for your data science tasks.
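
As a small illustration of that "use whatever library is on the compute" point, here's a sketch that pulls a (small) Delta table into pandas and applies a scikit-learn transformer, as you might in a Databricks notebook; the table name is made up.

```python
# Pull a small Delta table into pandas and use any Python library installed on
# the cluster - here scikit-learn. Runs in a Databricks notebook where `spark`
# is predefined; the table name is illustrative.
from sklearn.preprocessing import StandardScaler

pdf = spark.read.table("orders_silver").limit(10_000).toPandas()
numeric_cols = pdf.select_dtypes("number").columns
pdf[numeric_cols] = StandardScaler().fit_transform(pdf[numeric_cols])
pdf.head()
```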

 

Now, last but not least, moving into production. Azure ML supports both a single-tier and a multi-tier landscape. What that means is you can have just one environment, or a Dev box and a Prod box. They tend to recommend that you create workspaces by functional area - so if financial forecasting is what you're going to use your environment for, you create a workspace for your financial forecasting. That's so the workspace doesn't get too cluttered, people know what the objects inside the environment are, and you can use the roles. With the roles, you can restrict who can do exactly what: you can restrict which users can create computes, to control cost, and you can control who can publish a model, which means nobody can overwrite something that's live with an inferior version.

 

It's Git enabled. It also connects to Azure DevOps, using DevOps as your Git provider.

 

It has endpoints that can be consumed easily in Data Factory. You do have to think about your Data Factory, your flow from Dev to Prod in Data Factory, and how you're going to create your linked services against however many Azure ML workspaces you have.

 

Now, one of the unique things that's just evolving is data set monitors. What a data set monitor is, basically, is that the system creates a profile of your data. Then later on, when you tell it to, it creates a profile of your data again and compares it to the original profile. If it's changed too much statistically, you can tell it to alert you or run a certain activity, so that you know there's been data drift and you might need to retrain your models.

 

Inside of Databricks ML, Databricks also supports a single-tier and a multi-tier landscape. Whatever you're doing now, you're going to want to continue to do. It's not uncommon for your experiments and models to be created in a Dev environment; instead of trying to move a model, or its different files, as an artifact, you tend to move the notebooks and things that created that model, point them at the same data, and retrain the model in your target environment.

 

The model lifecycle management is basic, but it's present, which is better than most places: you can tag the model as production and then consume it as a production model. It's Git enabled for notebook development, so whatever you're doing in Databricks now, you continue to do. Your jobs can be created within Databricks for model retraining, which is nice - just think about where you schedule jobs, where you allow people to create jobs, and the rest of it.

 

Then there's data drift monitoring: presently in Databricks, you need to custom design it. You need to come up with your own method of detecting data drift and then do whatever you need to do programmatically to alert and to retrain.
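
A minimal sketch of what that custom approach might look like, assuming you keep a baseline snapshot and a current snapshot of your scoring data as pandas DataFrames; the threshold and the Kolmogorov-Smirnov test are just one reasonable choice.

```python
# One simple way to roll your own drift check: compare each numeric column's
# distribution in a baseline snapshot vs. the current data with a K-S test.
# baseline_df / current_df are assumed pandas DataFrames you maintain yourself.
from scipy.stats import ks_2samp


def detect_drift(baseline_df, current_df, p_threshold=0.05):
    """Return the numeric columns whose distribution has shifted significantly."""
    drifted = []
    for col in baseline_df.select_dtypes("number").columns:
        _, p_value = ks_2samp(baseline_df[col].dropna(), current_df[col].dropna())
        if p_value < p_threshold:
            drifted.append(col)
    return drifted


# If anything drifted, alert and/or kick off a retraining notebook via a job:
# drifted_cols = detect_drift(baseline_df, current_df)
```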

 

Now we've reached the question and answer period. Let me see if we have any questions.

 

The first question we have is: if you're doing your first ML project, which platform would you choose? For me, I would think about who my audience is. If my audience is purely developers who have worked in Python, that's what they're used to, they don't necessarily want a GUI, and they've already got libraries of code they're used to linking together for their experiments, then you can go with Databricks.

 

If you're not used to that and you're maybe a data analyst who's working with some data science activities, then I'd go with Azure ML. Azure ML is a standalone environment. That's a plus, but it's also a negative, because it doesn't automatically have access to the data that you have inside your Databricks environment.

 

The next question is: which platform works best with external databases? They both will connect to your databases. I tend to like Azure ML because it's wizard based - think of it as if you were creating a linked service inside of Data Factory; it's the same thing. Once you've created that, it's just there, and you consume it from within your notebooks or within your data flows.

 

We have time for one more question, I guess, which is: could you talk a little bit more about deployment landscapes? With deployment landscapes, you're really going to want to try to fit your ML into whatever your current landscape is. If you're used to dealing with a three-tier landscape for your cloud platforms, you're going to want to create similar kinds of environments to connect to for your ML world. If you don't do that - and I've seen people not do that - then you run into some consistency issues, and that can be a problem you have to work around. I know people who have a three-tier landscape for Databricks, Data Factory, their storage, and everything else, and then only a two-tier landscape for their Azure ML environment. It's something you really have to sit down and look at: how you can fit it in, and how your linked services can point at the different environments, so that you can get accurate inferencing out of whatever environment you need and give people the ability to develop somewhere.

 

Anyways, it looks like we are out of time. Sorry, we didn't get to all the questions. We'll follow up with everybody after this just to see if you have any questions. I'm always here to answer them, and help in any way that I can.

 

Thank you very much for joining. I hope I was able to add some insight into some of this.