-
Overview:
The following describes a customer's machine learning (ML) classification problem and the solution that was implemented. It is not intended to be a technical guide; rather, it is a high-level overview of the problem, the solution, and the components used.
-
Customer and Data Background:
A global consumer products leader purchases marketing data from 5 sources/regions around the world. Each source provides data for all products sold within its region (both the customer's products and competitors' products). The sources cover overlapping sets of products, but each source assigns its own region-specific codes and descriptions.
Below are records for the same product as delivered by each source/region. It is one of 400,000 products in total.
-
Current Process:
Data is periodically downloaded and enriched by teams of specialists. The goal is to assign common categories and attributes to each product. The process is a combination of manual and automated assignments, both of which require specialist knowledge and periodic adjustment as the products evolve.
-
Problem to Solve:
The volume of new products is very high, averaging 20,000 to 40,000 new products every quarter. The manual process has become cumbersome, time-consuming, and often inconsistent.
Most importantly, the value of the data diminishes over time. The insight it provides is point-in-time relevant, meaning that as time goes by, the insight gained from the data becomes less and less valuable. Accuracy and turnaround time are crucial, yet it is not uncommon for it to take 4-6 weeks to fully process the data. The process has gained high visibility within the business, and data drift (new data that does not align with historical data) is a persistent problem.
-
Manual Process:
Image 1 – Manual Process of enhancing and assigning master data attributes.
-
Solution Implemented:
Image 2 – New process of enhancing SAP HANA master data using Azure ML and Microsoft MDS.
-
Solution Description:
The solution was designed to automate the data enrichment process. Users are not engaged until the data is enriched and available for review.
-
Solution Components:
- SAP BW on HANA – SAP BW is SAP's data warehousing solution. Data is stored in an SAP HANA database.
- Azure Data Factory (ADF) – A data integration service used to copy data and orchestrate activities via pipelines
- Microsoft SQL Server Master Data Services (MDS) – A master data management platform that is included with a SQL Server license.
- Azure Data Lake Storage Gen2 (ADLS2) – An Azure storage account providing blob, file share, and table storage.
- Azure Machine Learning (ML) – Azure's machine learning platform (PaaS) designed for data science. It supports experimentation, publishing, and consumption of machine learning models.
-
Data flow description:
Step 1: Data is pulled from SAP HANA Calculation Views by Azure Data Factory (ADF) and written to an MDS staging table.
Step 2: ADF triggers MDS/SQL stored procedures that enrich the data and assign some attribute values. Additional steps then perform data cleansing and feature engineering so the data is prepared for Azure ML and the required NLP steps.
Step 3: ADF extracts the prepared data from MDS and sends it to Azure ML for classification/prediction (see the sketch after Step 7).
Step 4: ADF moves the predicted classifications & probability scores from Azure ML back into MDS for review and approval.
Step 5: The business corrects and/or approves the predictions and assignments within SQL MDS.
Step 6: BW pulls the approved data from MDS back into BW and the HANA database via a BW process chain.
Step 7: The ML models are periodically retrained using the cleaned and approved classification data in MDS, to account for data drift and for manual adjustments made by the users.
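As a concrete illustration of Steps 3 and 4, below is a minimal sketch of invoking an Azure ML batch endpoint from Python. The endpoint name, workspace details, and datastore path are hypothetical, the exact call surface varies by SDK version, and in the actual solution this hand-off is performed by ADF activities rather than standalone code.

```python
# Minimal sketch, assuming the azure-ai-ml (v2) SDK and a batch endpoint
# named "product-classifier"; all names and paths are placeholders.
from azure.ai.ml import MLClient, Input
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Step 3: point the endpoint at the prepared data extracted from MDS.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="product-classifier",
    input=Input(type="uri_folder", path="azureml://datastores/adls/paths/prepared/"),
)

# Step 4: once the job completes, the predictions and probability scores in
# the endpoint's output location are copied back into MDS by ADF.
ml_client.jobs.stream(job.name)
```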
-
Solution Advantages:
- Data review & approval is via an easy-to-use MDS interface (MS Excel)
- ML probability scores are available for each prediction, allowing users to focus on records with the lowest scores
- ML models are always up to date, ensuring accurate predictions during the next prediction run
- ML data drift is covered by re-training the models via ADF and Azure ML integration
- No manual processing steps. All steps are automated
- No stand-alone code to be maintained. All logic is now within SQL MDS and Azure ML
- The time from extraction to review is now a matter of minutes rather than days
- The process is reliable, repeatable, and secure
-
Implementation Details:
Azure Data Factory (ADF)
- Built-in ADF connectors used to link each system
- ADF used to move data between all three environments (HANA, MDS, Azure ML)
- ADF used to trigger ML Training and Predictions in Azure ML
- 50+ pipelines created (10 per source): one per processing step, with a parent pipeline for scheduling and monitoring (see the sketch below)
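For illustration, here is a minimal sketch of starting and polling a parent pipeline from Python using the azure-mgmt-datafactory package. In production the parent pipeline is scheduled within ADF itself; the resource names and parameter below are hypothetical.

```python
# Minimal sketch: trigger a parent ADF pipeline run and poll its status.
# All resource names and parameters are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory>",
    pipeline_name="PL_Parent_Source1",  # hypothetical parent pipeline
    parameters={"source": "SOURCE1"},
)

# Poll until the run leaves the in-progress states.
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = adf_client.pipeline_runs.get(
        "<resource-group>", "<data-factory>", run.run_id
    ).status
print(f"Pipeline finished with status: {status}")
```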
-
SQL Master Data Services (MDS)
- Separate product masters/entities created for each source, populated via MDS staging tables (see the sketch after this list)
- Domains used to restrict assignments to permissible values
- All products stored in MDS (new & prior approvals)
- SQL Code used for attribute assignments
- SQL Code used for ML feature engineering
- MDS used as the source for all ML training data
- ML classifications & probabilities available to users during the approval process
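The staging hand-off in Step 1 follows standard MDS conventions: rows are inserted into the entity's stg.<name>_Leaf table and swept in by the generated stg.udp_<name>_Leaf procedure. The sketch below shows the idea from Python via pyodbc; the entity name, server, columns, and batch tag are hypothetical.

```python
# Minimal sketch of loading one record into an MDS leaf staging table and
# starting the MDS import; names are placeholders for illustration.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>;DATABASE=MDS;Trusted_Connection=yes;"
)
cur = conn.cursor()

# ImportType 0 creates new members and updates existing ones;
# ImportStatus_ID 0 marks the row as ready for staging.
cur.execute(
    """
    INSERT INTO stg.Product_Source1_Leaf
        (ImportType, ImportStatus_ID, BatchTag, Code, Name)
    VALUES (0, 0, ?, ?, ?)
    """,
    ("Q1_LOAD", "PRD-000123", "Sample product description"),
)

# Sweep the staged rows into the entity.
cur.execute(
    "EXEC stg.udp_Product_Source1_Leaf "
    "@VersionName = N'VERSION_1', @LogFlag = 1, @BatchTag = N'Q1_LOAD'"
)
conn.commit()
```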
Azure Machine Learning (ML)
- Training Problem: Supervised, Multi-Class Classification w/ NLP
- The multi-class target consists of 420 possible classification values
- Algorithms/libraries used: Scikit-learn, Support Vector Machine (SVM), LinearSVC with CalibratedClassifierCV (see the sketch after this list)
- TF-IDF used for the NLP feature extraction; n-gram range of 2-3, depending upon the source and the size of its corpus
- 5 models created (one per source) – each source has its own set of features and assignments
- Automated ML used to confirm the correct algorithm was selected
- Azure ML batch endpoints used for the training and inference pipelines
- Average of 96.2% accuracy across the 5 models
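The training setup described above maps onto a scikit-learn pipeline roughly as follows. This is a toy sketch: the data, n-gram range, and fold count are illustrative, and the real models each train on a source-specific corpus with ~420 classes.

```python
# Rough sketch of a per-source model: TF-IDF features feeding a LinearSVC
# wrapped in CalibratedClassifierCV so predict_proba() can supply the
# probability scores shown to reviewers. All data here is a toy stand-in.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X_train = [
    "widget blue 12oz bottle", "widget red 12oz bottle",
    "widget green 16oz bottle", "widget black 16oz bottle",
    "gadget mini travel case", "gadget deluxe travel case",
    "gadget micro travel case", "gadget jumbo travel case",
]
y_train = ["BOTTLES"] * 4 + ["CASES"] * 4

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(2, 3))),      # word 2- and 3-grams
    ("clf", CalibratedClassifierCV(LinearSVC(), cv=2)),  # calibrated probabilities
])
model.fit(X_train, y_train)

X_new = ["widget orange 12oz bottle"]
pred = model.predict(X_new)                      # predicted classification
score = model.predict_proba(X_new).max(axis=1)   # its probability score
print(pred[0], score[0])
```

LinearSVC alone has no predict_proba(); wrapping it in CalibratedClassifierCV is what makes the per-record probability scores available during the approval process.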
Developer Notes:
Microsoft SQL MDS –
- Keep business rules simple: Business rules work well for less complex scenarios. The more complex a validation rule becomes, the more inconsistent the output appears to be.
- Use “Change Sets” to overcome user concurrency issues: Multiple users working on the same data set will conflict and cause validation errors if change set functionality is not used.
- Test all functionality as part of your proof of concept. Do not assume the functionality will work as anticipated.
Azure Data Factory (ADF) –
- ADF worked as expected. The ADF learning curve can be steep, so prior experience is required.
Azure Machine Learning (ML) –
Azure ML experiments were conducted in two ways, via the no-code designer and via the SDK using Python.
ML Designer – No-code
- Easy to use, but you must have prior experience with ML and Python to figure it out.
- It’s designed so Data Scientists can perform quick prototyping and experimentation.
- Development work in the designer CAN NOT BE EXPORTED. Meaning there is no GIT integration with the designer, no way to export pipelines, and no way to copy a workspace.
- If you plan on going live with a designer-trained model, plan on publishing it from the environment in which it was built.
- MLOps is limited with the Designer. Hopefully, this will change in future releases.
- Standard ML libraries (Scikit-learn etc.) are included with pre-configured compute clusters.
- A limited number of algorithms are available in the GUI. All of the basic types are covered, but if you want to use one that's not available you need to include code in the pipeline.
- With the designer, there is no standard way of including probabilities in your output. It has to be coded (see the sketch after this list).
- The designer comes with data engineering utilities (data cleansing, normalization, etc.).
- Our feature engineering needs were too complex for the designer.
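As an example of the probability point above, an "Execute Python Script" module can append the winning class and its score to the scored dataset. The column naming convention here (per-class "Prob_" columns produced upstream) is an assumption for illustration.

```python
# Hypothetical "Execute Python Script" module body: pick the winning class
# and its probability from per-class "Prob_" columns (an assumed naming).
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    prob_cols = [c for c in dataframe1.columns if c.startswith("Prob_")]
    dataframe1["PredictedClass"] = (
        dataframe1[prob_cols].idxmax(axis=1).str.replace("Prob_", "", regex=False)
    )
    dataframe1["Probability"] = dataframe1[prob_cols].max(axis=1)
    return dataframe1,

if __name__ == "__main__":  # local smoke test with toy data
    df = pd.DataFrame({"Code": ["P1"], "Prob_BOTTLES": [0.9], "Prob_CASES": [0.1]})
    print(azureml_main(df)[0])
```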
Azure ML SDK
- It's well documented. There are clear examples of each API call.
- GIT integration is via a command-line interface. It would be nice if it worked the way it does in ADF.
- It supports solid MLOps using a combination of Azure ML, DevOps and Power Apps.
- Everything is done in code, so you can reference the Azure ML architecture and do almost anything.
- A Databricks compute cannot be used to train/predict. It can only be used for data engineering steps.
- Feature engineering and class imbalances could be addressed using the SDK (one option is sketched below).
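For the class-imbalance point, one common option (an illustration, not necessarily the project's exact approach) is to weight classes inversely to their frequency:

```python
# Weight each class inversely to its frequency so rare categories still
# influence the SVM's loss; calibration again provides probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

clf = CalibratedClassifierCV(LinearSVC(class_weight="balanced"), cv=5)
```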
Note: The above notes are as of the original date of this post (March 6, 2022). The functionality will continue to evolve as the products mature.
The above solution was designed and implemented by Scott Pietroski. Please direct all questions to Scott.Pietroski@eastern-analytics.us
Partner at Eastern Analytics, Inc. Solution Architect from design through delivery, with 25+ years of experience in data warehousing, BI, analytics, and ML. Specialties: MS Azure, HANA, data engineering, and machine learning.