After creating a machine learning model, you need a place to run the model and serve predictions. If your organization is in the early stages of its AI journey or has budget constraints, you may struggle to find a deployment platform for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with knowledge of SQL and access to a Db2 instance where the in-database ML feature is enabled can easily learn to build and use a machine learning model in the database.
In this post, I'll show how to develop, deploy, and use a decision tree model in a Db2 database.
These are the main steps in this tutorial:
Set up Db2 tables
Explore the ML dataset
Preprocess the dataset
Train a decision tree model
Generate predictions using the model
Evaluate the model
I performed these steps in a Db2 Warehouse on-premises database. Db2 Warehouse on Cloud also supports these ML features.
The machine learning use case
I'll use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight's origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we'll build a classification model with the decision tree algorithm. Once trained, the model can receive unseen flight data as input and predict whether the flight will arrive on time or late at its destination.
1. Set up Db2 tables
The dataset I use in this tutorial is available here as a CSV file.
Creating a Db2 table
I use the following SQL to create a table for storing the dataset.
db2start
db2 connect to <database_name>
db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3 (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,
DAYOFWEEK INTEGER ,
UNIQUECARRIER VARCHAR(50 OCTETS) ,
ORIGIN VARCHAR(50 OCTETS) ,
DEST VARCHAR(50 OCTETS) ,
CRSDEPTIME INTEGER ,
DEPTIME INTEGER ,
DEPDELAY REAL ,
DEPDEL15 REAL ,
TAXIOUT INTEGER ,
WHEELSOFF INTEGER ,
CRSARRTIME INTEGER ,
CRSELAPSEDTIME INTEGER ,
AIRTIME INTEGER ,
DISTANCEGROUP INTEGER ,
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";
After creating the table, I use the following SQL to load the data from the CSV file into the table:
db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"
I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I'll copy a subset of the records from this table to a separate table for ML model development and evaluation, leaving the original copy of the data intact.
SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
-- 1000000
Creating a separate table with sample records
Create a table with a 10% random sample of rows from the above table. Use Db2's RAND function for random sampling.
CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA
Count the number of rows in the sample table.
SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
-- 99879
Look into the schema definition of the table.
SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO
FLIGHTSTATUS is the response, or target, column. The others are feature columns.
Find the DISTINCT values in the target column.
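A simple DISTINCT query does this; for example:
SELECT DISTINCT FLIGHTSTATUS FROM FLIGHT.FLIGHTS_DATA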
From these values, I can see that this is a binary classification task, where each flight arrived either on time or late.
Find the frequencies of the distinct values in the FLIGHTSTATUS column.
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DECIMAL(12,4)) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS
From the above, I see the classes are imbalanced. I will not draw any further insights from the full dataset, as this could leak information into the modeling phase.
Creating train/test partitions of the dataset
Before collecting deeper insights into the data, I'll divide the dataset into train and test partitions using Db2's RANDOM_SAMPLE SP. I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.
Create a TRAIN partition.
CALL IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')
Copy the remaining records to a TEST partition.
CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
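To verify the split, the row counts of the two partitions can be compared; for example:
SELECT count(*) FROM FLIGHT.FLIGHTS_TRAIN
SELECT count(*) FROM FLIGHT.FLIGHTS_TEST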
2. Explore data
In this step, I'll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.
Look into some sample records.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY
Some columns encode the time as numbers:
- CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)
- DEPTIME: Departure Time (hhmm)
- CRSARRTIME: Computer Reservation System (scheduled) Arrival Time (hhmm)
Now, I collect summary statistics from FLIGHTS_TRAIN using the SUMMARY1000 SP to get a global view of the characteristics of the dataset.
CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')
Here, intable is the name of the input table from which I want the SUMMARY1000 SP to collect statistics. outtable is the name of the table where SUMMARY1000 will store the gathered statistics for the full dataset. Besides the outtable, the SUMMARY1000 SP creates a few additional output tables: one table with statistics for each column type. Our dataset has two types of columns, numeric and nominal, so SUMMARY1000 will generate two additional tables. These additional tables follow the naming convention of the outtable name plus the column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:
FLIGHTS_TRAIN_SUM1000_NUM
FLIGHTS_TRAIN_SUM1000_CHAR
Having the statistics available in separate tables for specific datatypes makes it easier to view the statistics that apply to a specific datatype and reduces the number of columns whose statistics are viewed together. This simplifies the analysis process.
Check the summary statistics of the numeric columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM
For the numeric columns, SUMMARY1000 gathers the following statistics:
Missing value count
Non-missing value count
Average
Variance
Standard deviation
Skewness
Excess kurtosis
Minimum
Maximum
Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I'll create a decision tree model, I don't need to deal with the large values or the missing values. Db2 will handle both issues natively.
Next, I check the summary statistics of the nominal columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR
For the nominal columns, SUMMARY1000 gathers the following statistics:
Number of missing values
Number of non-missing values
Number of distinct values
Frequency of the most frequent value
3. Preprocess data
From the above data exploration, I can see that the dataset has hardly any missing values. These four TIME columns have large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I'll leave the nominal values in all columns as-is, as the decision tree implementation in Db2 can handle them natively.
Extract the hour part from the TIME columns: CRSARRTIME, DEPTIME, WHEELSOFF.
From looking up the description of the dataset, I see that the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns are hhmm encodings of time values (for example, 1435 means 2:35 PM). I extract the hour part of these values to create, hopefully, better features for the learning algorithm.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
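As an optional sanity check, the maximum values of the rescaled columns should now fall in the hour range (0 to 24) rather than the hhmm range; for example:
SELECT MAX(CRSARRTIME), MAX(DEPTIME), MAX(WHEELSOFF) FROM FLIGHT.FLIGHTS_TRAIN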
4. Train a decision tree model
Now the training dataset is ready for the decision tree algorithm.
I train a decision tree model using the GROW_DECTREE SP.
CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')
I called this SP with the following parameters:
model: the name I want to give to the decision tree model, FLIGHT_DECTREE
intable: the name of the table where the training dataset is stored
id: the name of the ID column
target: the name of the target column
After completing the model training, the GROW_DECTREE SP generated several tables with metadata from the model and the training dataset. Here are some of the key tables:
FLIGHT_DECTREE_MODEL: this table contains metadata about the model. Examples include the depth of the tree, the strategy for handling missing values, and the number of leaf nodes in the tree.
FLIGHT_DECTREE_NODES: this table provides details about each node in the decision tree.
FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and its role in the trained model. The information includes the importance of a column in producing a prediction from the model.
This link has the complete list of model tables and their details.
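These model tables are regular Db2 tables, so they can be inspected with ordinary queries; a small sketch, assuming they are created under the same FLIGHT schema as the model:
SELECT * FROM FLIGHT.FLIGHT_DECTREE_MODEL
SELECT * FROM FLIGHT.FLIGHT_DECTREE_NODES FETCH FIRST 10 ROWS ONLY
SELECT * FROM FLIGHT.FLIGHT_DECTREE_COLUMNS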
5. Generate predictions from the model
Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it to generate predictions on the test records from the FLIGHTS_TEST table.
First, I preprocess the test dataset using the same preprocessing logic that I applied to the TRAINING dataset.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100
Generating predictions
I use the PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:
CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')
Here is the list of parameters I passed when calling this SP:
model: the name of the decision tree model, FLIGHT_DECTREE
intable: the name of the input table to generate predictions from
outtable: the name of the table that the SP will create and store the predictions in
prob: a boolean flag indicating whether to include the probability of each prediction in the output
outtableprob: the name of the output table where the probability of each prediction will be stored
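Before evaluating the model, it can be useful to peek at the generated predictions; the exact column layout can vary between Db2 versions, so a generic SELECT is a safe first look:
SELECT * FROM FLIGHT.FLIGHTS_TEST_PRED FETCH FIRST 10 ROWS ONLY
SELECT * FROM FLIGHT.FLIGHTS_TEST_PRED_DIST FETCH FIRST 10 ROWS ONLY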
6. Evaluate the model
Using the generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model's predictions.
Creating a confusion matrix
I use the CONFUSION_MATRIX SP to create a confusion matrix based on the model's predictions on the TEST dataset.
CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
In calling this SP, here are some of the key parameters that I passed:
intable: the name of the table that contains the dataset and the actual values of the target column
resulttable: the name of the table that contains the column with predicted values from the model
target: the name of the target column
matrixTable: the output table where the SP will store the confusion matrix
After the SP completes its run, we have the following output table with statistics for the confusion matrix.
FLIGHTS_TEST_CMATRIX:
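Its contents can be viewed with a simple query:
SELECT * FROM FLIGHT.FLIGHTS_TEST_CMATRIX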
This table has three columns. The REAL column has the actual flight status. The PREDICTION column has the predicted flight status. Since flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of values in the REAL and PREDICTION columns:
TRUE NEGATIVE: REAL: 0, PREDICTION: 0. The model has accurately predicted the status of the flights that arrived on schedule. From the CNT column, we see that 11795 rows from the TEST table belong to this combination.
FALSE POSITIVE: REAL: 0, PREDICTION: 1. These are flights that actually arrived on time but that the model predicted to be delayed. 671 is the count of such flights.
FALSE NEGATIVE: REAL: 1, PREDICTION: 0. These flights arrived late, but the model predicted them to be on time. From the CNT column, we find their count to be 2528.
TRUE POSITIVE: REAL: 1, PREDICTION: 1. The model has accurately identified the flights that were late. The count is 4981.
I use these counts to compute a few evaluation metrics for the model. To do so, I use the CMATRIX_STATS SP as follows:
CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. The CMATRIX_STATS SP generates two sets of output. The first shows overall quality metrics of the model. The second includes the model's predictive performance for each class.
First output: the overall model metrics include correct predictions, incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
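As a sanity check, the overall accuracy can also be recomputed directly from the confusion matrix counts; a minimal sketch, assuming the REAL, PREDICTION, and CNT columns described above:
SELECT CAST(SUM(CASE WHEN "REAL" = PREDICTION THEN CNT ELSE 0 END) AS DECIMAL(12,4)) / SUM(CNT) AS OVERALL_ACCURACY
FROM FLIGHT.FLIGHTS_TEST_CMATRIX
-- (11795 + 4981) / 19975, which is approximately 0.8398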
With classification tasks, it's usually helpful to view the model's quality factors for each individual class. The second output from the CMATRIX_STATS SP includes these class-level quality metrics.
For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV) or Precision, and the F-measure (F1 score).
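For the delayed class (1), these values can also be derived by hand from the matrix counts; a sketch, again assuming the REAL/PREDICTION/CNT layout and the '0'/'1' status values shown above:
SELECT
  CAST(SUM(CASE WHEN "REAL" = '1' AND PREDICTION = '1' THEN CNT ELSE 0 END) AS DECIMAL(12,4))
    / SUM(CASE WHEN "REAL" = '1' THEN CNT ELSE 0 END) AS TPR_DELAYED,    -- 4981 / (4981 + 2528)
  CAST(SUM(CASE WHEN "REAL" = '1' AND PREDICTION = '1' THEN CNT ELSE 0 END) AS DECIMAL(12,4))
    / SUM(CASE WHEN PREDICTION = '1' THEN CNT ELSE 0 END) AS PPV_DELAYED -- 4981 / (4981 + 671)
FROM FLIGHT.FLIGHTS_TEST_CMATRIX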
Conclusions and key takeaways
If you want to build and deploy an ML model in a Db2 database using Db2's built-in stored procedures, I hope you'll find this tutorial useful. Here are the main takeaways of this tutorial:
Demonstrated a complete workflow of creating and using a decision tree model in a Db2 database with the in-database ML stored procedures.
For each step in the workflow, I provided concrete and functional SQL statements and stored procedure calls. For each code example, where relevant, I explained intuitively what it does and described its inputs and outputs.
Included references to IBM Db2's documentation for the ML stored procedures used in this tutorial.
O'Reilly's 2022 AI Adoption survey [3] underscored the challenges of building technical infrastructure and the skills gap as two top bottlenecks to AI adoption in the enterprise. Db2 addresses the first by supplying end-to-end ML infrastructure in the database. It also lessens the latter, the skills gap, by providing a simple SQL API for creating and using ML models in the database. In the enterprise, SQL is a far more common skill than ML.
Check out the following resources to learn more about the ML features in IBM Db2 and to see additional examples of ML use cases implemented with these features.
Explore the Db2 ML product documentation
Explore the Db2 ML samples on GitHub
References
1. Paleyes, A., Urma, R.G. and Lawrence, N.D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys, 55(6), pp. 1–29.
2. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291–300). IEEE.
3. Loukides, Mike, 2022. AI Adoption in the Enterprise 2022. https://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2022/