Categorisation of Machine Learning algorithms for business applications


When practising a scientific approach to data exploration, one should know to what extent a given method can be applied. Neural nets are futile for stock market predictions. Monte Carlo algorithms cannot offer much help there either, and a poorly implemented Random Forest algorithm can literally ruin your vacation in South-East Asia, especially if it was implemented by the NSA. In this article we will briefly introduce a classification of machine learning methods and see how they are relevant to different lines of business.

Preface

From the cradle to the grave, we make decisions, from our first decision to attract our mother’s attention to one of our last, when asking the doctor for pain treatment. Decision making is an essential part of our life, with motivation (conscious or somewhat fuzzy) in the background of this process. Business is no different in this sense: as our computerised “decision maker” we can use a Business Rule Engine (BRE) and some preceding logic.
Not long ago, BREs were perceived as a paramount part of business intelligence (BI). The truth is that implementing an effective BRE can be exceedingly simple, as demonstrated in [2]. Using IoC (Java) or NDS (Oracle DB), a lightweight BRE can be implemented right where it is needed (close to where the objects change, i.e. at the event source); code samples and supporting DB structures are available for download. Quite often, however, an external BRE just complicates the situation: it adds complexity to the technical infrastructure, requires object serialisation/deserialisation and imposes additional costs (sometimes quite high). Most importantly, a classic BRE only answers clearly formulated questions in a “Yes/No” fashion and contributes little to formulating the question itself. It gets even worse if the logic is fuzzy and/or based on random data. Thus, building the motivation for decision making is not only the most important part of BI but also the hardest one, as it requires statistical data analysis with elements of prediction and simulation using adaptive algorithms.


"There are no right answers to wrong questions." - Ursula K. Le Guin


Prediction, simulation and adaptation imply the presence of learning capabilities; see Arthur Samuel’s definition in the table below. Gnosiology (the philosophical theory of knowledge) identifies three distinct approaches to gaining knowledge: supervised, unsupervised and reinforced. Each approaches the problem with its own set of methods and algorithms, with different levels of applicability depending on the problem at hand. There are no strict borders between the sets of methods, and since the total number of statistical and learning algorithms is more than 700, it is simply not possible to mention even half of them in a short blog post. Here I will just try to group the learning methods and the most common algorithms and associate them with business areas of applicability, starting from the ML approaches to gaining knowledge.

For simplicity I will use one example from a classic AI book [3]: the fairy tale in which Prince Charming searches for Sleeping Beauty in a Kingdom Far-Far-Away with some help from a know-it-all Owl that can only say “Yes” and “No” (see BRE above).

Supervised Learning

Using the kingdom’s map and the wise Owl with its speech disorder, Prince Charming could use the bisection method: divide the map in half and ask the Owl “Is the Princess in this half?” repeatedly, until the remaining half is the size of an average cave. So here Prince Charming gets help from a supervisor, using a bisection algorithm from the Regression group of algorithms. This is quite a broad group, including Linear Regression with a Single Variable or with Multiple Variables. Another common group is based on Classification, linear and non-linear, and on Support Vector Machines; the latter can be used for both classification and regression, and focus on building maximum-margin classifiers that help to create the most optimal model for finding the Princess.
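The Prince's search can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the map is a rectangle, the Owl is modelled by a hypothetical `make_owl` function that answers yes or no for a given rectangle, and names such as `bisect_map` and `cave_size` are invented for this example.

```python
from typing import Callable, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def make_owl(princess: Tuple[float, float]) -> Callable[[Rect], bool]:
    """Hypothetical oracle: answers only yes/no to 'is she inside this rectangle?'."""
    px, py = princess

    def owl(rect: Rect) -> bool:
        x0, y0, x1, y1 = rect
        return x0 <= px <= x1 and y0 <= py <= y1

    return owl


def bisect_map(region: Rect, owl: Callable[[Rect], bool], cave_size: float):
    """Halve the region along its longer side until it is no larger than a cave."""
    x0, y0, x1, y1 = region
    questions = 0
    while max(x1 - x0, y1 - y0) > cave_size:
        if (x1 - x0) >= (y1 - y0):
            mid = (x0 + x1) / 2
            # Ask about the left half; keep it on "yes", else keep the right half.
            if owl((x0, y0, mid, y1)):
                x1 = mid
            else:
                x0 = mid
        else:
            mid = (y0 + y1) / 2
            # Ask about the lower half; keep it on "yes", else keep the upper half.
            if owl((x0, y0, x1, mid)):
                y1 = mid
            else:
                y0 = mid
        questions += 1
    return (x0, y0, x1, y1), questions


owl = make_owl(princess=(700.5, 123.25))
cave, asked = bisect_map((0.0, 0.0, 1024.0, 1024.0), owl, cave_size=1.0)
```

The number of questions grows only logarithmically with the size of the map, which is why even an Owl limited to “Yes” and “No” is a useful supervisor.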

One of the most fascinating tools of supervised learning is the neural network (NN), and we should mention it here, as NNs are becoming more and more popular now that hardware is capable of supporting the concept. It is rather hard to fit this aspect and its practical implementation, self-organising maps (SOM), into the fairy-tale example, but let’s imagine that after locating the occupied cave, Prince Charming needs to validate that the lady inside is the true Beauty. Naturally, Prince Charming has developed his own validation criteria (supported by weights, W) on that matter and will apply them (i.e. test updated models) by doing optical recognition of hair, cheeks, chin, lips and other features. In other words, we have a tree-like structure of chained neurons, collectively responsible for deriving certain patterns from observed objects and performing feature extraction in the form of a dimensionality reduction algorithm. The more neurons in the net, the more precise (the cumulative effect of the W multipliers across the neurons’ dendritic trees) and the faster the features are recognised and assembled into one complete image (the neuron summation box).

Neural networks in supervised learning

Some may ask where these weight (W) criteria came from. The Prince learned them, of course: initially from his father, his mother and movies like “Gentlemen Prefer Blondes” (the supervisors, who provided the initial models). These are partial weight coefficients because, taken for granted, they could produce only a partially satisfying result (early marriages usually do not last). So throughout his entire life Prince Charming has been adapting these W-ratios, assessing partial derivatives against satisfaction thresholds, like “blondes do not really drive well” or “redheads are too hysterical”. In mathematics, some call it “mathematical convenience”: adapting the function of our processor depending on how satisfying the output is. One more thing should be mentioned here: the backpropagation phenomenon. Electronically, a neural net can be represented by a chain of operational amplifiers, each with adaptive feedback. When your complex audio system with all possible signal processors gets too loud, you turn the master volume control on the final amplifier. The same is true for an NN, where the adjustments are made backwards, which fits the amplifier chain quite well: vigorous amplification at the preamp is a risky business due to severe signal distortion. As you understand, the feedback in amplifiers should be negative, otherwise adaptive recognition will fail spectacularly. Self-tuning is the essential feature of the next learning method we are about to discuss.
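How weights get adapted through negative feedback can be illustrated with a single artificial neuron. The sketch below is an assumption-laden toy, not the Prince's actual model: the two input features, the labels, the learning rate and the function names (`train_neuron`, `predict`) are all invented for this example. With only one neuron, the backward step is a single application of the chain rule rather than full backpropagation through several layers.

```python
import math


def sigmoid(z: float) -> float:
    """Squash the weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Output of one neuron: weighted sum of inputs passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_neuron(samples, labels, learning_rate=0.5, epochs=2000):
    """Adapt the weights by gradient descent on the cross-entropy loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            # Partial derivative of the loss with respect to the weighted sum.
            error = predict(weights, bias, features) - label
            # Negative feedback: move each weight against its partial derivative.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training set: the label depends on the first feature only.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [0, 0, 1, 1]
weights, bias = train_neuron(samples, labels)
```

After training, the weight of the first feature dominates, i.e. the neuron has learned which criterion actually matters. Flipping the sign of the update (positive feedback) would drive the weights away from the labels instead.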

Unsupervised Learning

Let’s imagine that the Owl’s speech disorder has worsened and Prince Charming cannot rely on the Supervisor anymore. But he still has the map and his determination. He knows that the Beauty is in a cave and that caves are most common in mountains, not in bogs or deserts. Some more parameters from the data selection could probably help, like the knowledge that the Princess was put to sleep by a poisoned apple, so apple groves must not be far from the mountain slopes, and so on (presuming that a rotten apple would not attract the Princess’s attention). Clustering, Dimensionality Reduction, Principal Component Analysis and Bayesian statistical methods will help our Prince Charming to find the most probable area on the map for detailed investigation (more model training, in ML terms). In many ways our Prince Charming will try to implement the most prominent radiolocation and target recognition methods (passive location, this time): detecting the distribution law of the data, finding the mean and standard deviation of the detected distribution, and applying the Neyman-Pearson criterion to the selection, i.e. minimising the probability of missing the target at a fixed probability of positive Princess detection (0.95, for instance, under a Normal distribution). You have probably noticed that I formulated the Neyman-Pearson criterion not entirely correctly (the parameter usually held fixed is the false alarm probability, while the miss probability is minimised), but for simplicity I omit the situation where we could have several Sleeping Beauties, some of them decoys (not a Princess, or not a Beauty, or not sleeping, or a combination of these; let’s hope that the Prince is truly determined and will not deviate). All this knowledge will help our Prince later, when he becomes King, to collect taxes fairly and to detect tax fraud and evasion.
Bayesian statistical theory is highly versatile, so later our good King (the former Prince) will be able to assess the distribution law of declared taxes in a certain line of business (apple cider breweries), find abnormal deviations, link them to the fact that these particular breweries have commercial offices in the Cayman Islands, and treat this as a tax evasion pattern.
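The Neyman-Pearson idea mentioned above can be made concrete for the simplest case. The sketch below assumes two Normal distributions with equal variance, one for “no Princess” (noise) and one for “Princess present” (signal); in that case the test reduces to comparing a single observation with a threshold. The means, the 0.05 false alarm level and the function names are illustrative assumptions, not values from the article.

```python
from statistics import NormalDist


def neyman_pearson_threshold(noise: NormalDist, false_alarm: float) -> float:
    """Threshold that the noise alone exceeds with probability `false_alarm`."""
    return noise.inv_cdf(1.0 - false_alarm)


def detection_probability(signal: NormalDist, threshold: float) -> float:
    """Probability that an observation containing the signal exceeds the threshold."""
    return 1.0 - signal.cdf(threshold)


# Assumed distributions: same standard deviation, signal mean shifted by 3 sigma.
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=3.0, sigma=1.0)

threshold = neyman_pearson_threshold(noise, false_alarm=0.05)
p_detect = detection_probability(signal, threshold)
p_miss = 1.0 - p_detect
```

The false alarm probability is fixed first, and the threshold then determines the miss probability. The same pattern applies to the tax example: once the distribution of declared taxes in a line of business is estimated, declarations beyond the chosen threshold are flagged for inspection.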

Generally, unsupervised learning methods can (and in most cases should) support and reinforce supervised ones, since the bisection method demonstrated in the preceding section is far from optimal.

Reinforcement learning

The situation is getting worse: not only is the Owl gone, but the Prince has also lost the Kingdom’s map. A terrible blow; our Prince is now no more than a robotic vacuum cleaner without any supervisor who could give directions or set virtual walls. What remains is a rather painful trial-and-error method, based on the feedback, positive or negative, that we get after each step. Clearly, it is statistically quite close to Markov chains and discrete decision processes. Probabilistically, it can be described by Monte Carlo methods.
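Trial and error with feedback after each step can be sketched with tabular Q-learning, one of the value function approaches. This is an illustration chosen for brevity, not the method the article prescribes: the corridor world, the rewards and all parameter values are assumptions made for this example.

```python
import random


def train_corridor(n_states=6, episodes=1000, alpha=0.5, gamma=0.9,
                   epsilon=0.2, seed=0):
    """Learn action values in a one-dimensional corridor; the goal is the last cell."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[state][action], where action 0 = step left and action 1 = step right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore: random step
            else:
                action = 0 if q[state][0] > q[state][1] else 1  # exploit
            nxt = max(0, state - 1) if action == 0 else state + 1
            # Feedback after each step: positive at the goal, slightly negative otherwise.
            reward = 1.0 if nxt == goal else -0.01
            target = reward if nxt == goal else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if state == goal:
                break
    return q


def greedy_policy(q):
    """Best known action for every non-terminal cell."""
    return ["left" if left > right else "right" for left, right in q[:-1]]


policy = greedy_policy(train_corridor())
```

The agent starts with no map and no supervisor; after enough episodes the learned values point towards the goal from every cell.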

As in the preceding section, it is clear that this learning method will provide the best results when supported by others, especially supervised ML. Universal approximation with neural nets is the most common combination when implementing this learning approach.

Linking problem areas with algorithms and tools supporting algorithmic methods

| Definitions | Synonyms |
|---|---|
| “Field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel) | Data Mining/Prediction; Classification; Anomaly detection |

| Addressing the following problems | Algorithms | Methods | Tool |
|---|---|---|---|
| Supervised learning. Supervised learning is learning from examples provided by a knowledgeable external supervisor. The machine performs classification based on classification input from an external classifier. | Parametric/Non-parametric algorithms | Linear Regression with One Variable; Linear Regression with Multiple Variables | 2, 3, 4 |
| | Support vector machines (Vapnik–Chervonenkis theory) | Linear classification: hard-margin, soft-margin; Non-linear classification | 3, 4 |
| | Kernels | The kernel trick: Kernel Principal Component Analysis (schematic) | 3, 4 |
| | Neural networks | Cost function and backpropagation | 1, 3, 4 |
| Unsupervised learning. The machine has no classification and no supervisor who could validate an existing classification, but it has enough data for analysis and the algorithms to perform it. | Clustering | Multiclass Classification; K-Means Clustering and PCA | 3, 4 |
| | Dimensionality reduction | Independent Component Analysis; Principal Component Analysis (PCA); Probabilistic PCA (PPCA); Neural networks | 3, 4 |
| | Recommender systems | Utility matrix composing | 3, 4 |
| | Deep learning | Universal Approximation; Probabilistic Interpretation | 3, 4 |
| Reinforcement learning (RLA). The machine has no classification and no algorithm, only data. So the machine starts by building algorithms based on the input in order to master classification. | Markov Decision Process (MDP); Preference-based reinforcement learning (PBRL) | Value function approaches | 3, 4 |
| | | Gradient temporal difference | 3, 4 |
| | | Monte-Carlo every visit (all combinations of Monte-Carlo methods) | 3, 4 |
| | | Hidden Markov | 4 |

The preceding ML categorisation would be incomplete without a mapping to the existing tools and the algorithms they support. Further on, this table will be the source for linking existing algorithms to the relevant business cases.

Note: Flink and Kafka are technically supporting tools, providing event distribution and map-reduce for the other two.

Tools and algorithms supported

| No | 1: Apache Flink (FlinkML) | 2: Apache Kafka | 3: Apache Spark MLlib | 4: Apache Mahout |
|---|---|---|---|---|
| 1 | Streaming dataflow engine for reliable communication and data propagation with plugged-in elements of machine learning (FlinkML), covering: Alternating Least Squares (ALS, supervised learning), multiple linear regression and SVM. Preprocessor with support for supervised ML and recommendations (in general). | Kafka is a high-performing EDN with strong clustering support. | Logistic regression and linear support vector machine (SVM) | Logistic Regression, trained via SGD |
| 2 | | | Classification and regression tree | Stochastic Principal Component Analysis (SPCA, DSPCA) |
| 3 | | | Random forest and gradient-boosted trees | Random Forest (Unsupervised learning) |
| 4 | | | Recommendation via alternating least squares (ALS) | Distributed regularized Alternating Least Squares (DALS); Matrix Factorization with ALS on Implicit Feedback |
| 5 | | | Clustering via k-means, bisecting k-means, Gaussian mixtures (GMM) | k-Means Clustering; Canopy Clustering; Fuzzy k-Means; Streaming k-Means; Spectral Clustering |
| 6 | | | Topic modeling via Latent Dirichlet allocation (LDA) | Topic Models: Latent Dirichlet Allocation |
| 7 | | | Survival analysis via accelerated failure time model | Logistic Regression, trained via SGD |
| 8 | | | Singular value decomposition (SVD) and QR decomposition | Stochastic Singular Value Decomposition (SSVD, DSSVD); Stochastic SVD; Distributed Cholesky QR (thinQR) |
| 9 | | | Principal Component Analysis (PCA) | PCA (via Stochastic SVD) |
| 10 | | | Linear regression with L1, L2, and elastic-net regularization | Hidden Markov Models |
| 11 | | | Isotonic regression | |
| 12 | | | Multinomial/binomial naive Bayes | Naive Bayes / Complementary Naive Bayes |
| 13 | | | Frequent item set mining via FP-growth and association rules | Collaborative Filtering: Item and Row Similarity, distributed and in-core |
| 14 | | | Sequential pattern mining via PrefixSpan | |
| 15 | | | Summary statistics and hypothesis testing | |
| 16 | | | Feature transformations | |
| 17 | | | Power iteration | Lanczos Algorithm |
| 18 | | | Model evaluation and hyper-parameter tuning | Multilayer Perceptron |

ML is not a magic pill. Borrowing a lot from mathematical statistics, ML can be applied where we need extensive data preparation/simulation for predictive analysis and final decision making. You do not need any prediction to foresee that Prince Charming and the Woken Beauty will “live happily ever after and die together on the same day”. Radiolocation and target designation using statistical methods [5,6] have been used with notable success for decades. The key word here is “learning”, and the table below highlights selected business areas with a high demand for machine learning capabilities.

Mapping business areas to the applied algorithms from above (references are given as tool.algorithm)

Business Area

Opportunity description

Algorithm

Predictive Modeling Factories

Business what-if analysis requires reliable models for predicting clients’ loyalty and purchase behaviour. Before launching marketing campaigns, new products or lines of business, large amounts of data must be analysed (if it is not possible to gather such amounts, then the data should be simulated using an adequate stochastic distribution). Predictive modelling factories shall be trained using multi-criteria statistical frameworks for forecasting the possible outcomes of planned actions.

4.5, 4.18, 3.6, 3.14, 4.6

Advertising Technology

Cablecom Enterprise strives to improve the overall customer experience (not only for VOD). It does so by gathering and aggregating information about user preferences through the purchasing history, watch lists, channel switching, activity in social networks, search history and meta tags used in search (semantic web), other user experiences from the same target group, upcoming related public events (shows, performances, or premieres), and even the duration of the cursor's position over certain elements of corporate web portals. The task is complex and comprises many activities, including meta tag updates in metadata storage that depend on new findings for predicting trends and so on; however, here we can tolerate (to some extent) events that are not processed or not received. As in the previous case, we can model them.

3.4, 4.4, 3.2

Risk and Fraud Detection

For bank transaction monitoring, we do not have the luxury of missed events mentioned above. All online events must be accounted for and processed with the maximum speed possible. If the last transaction with your credit card was an ATM cash withdrawal at Bond Street in London and, 5 minutes later, the same card is used to purchase expensive jewellery online with a peculiar delivery address, then someone should flag the card as a possible fraud case and contact the cardholder. This is the simplest example that we can provide. When it comes to tracking money laundering in our borderless world, the decision-parsing tree (from the very first figure in this chapter), based on all possible correlated events, will require all the pages of a large book, and you will need a strong magnifying glass to read it; the stratagem of the web nodes and links would drive even the most worldly-wise spider crazy. To stay sane, real-time scoring should be reinforced by ML algorithms that keep responsive decision-making processes reliable and effective.

3.4, 4.4, 4.13

Insurance Analytics

In fact, the modern insurance industry was born in the early 18th century as a result of thorough statistical analysis of personal shares collected, the number of deceased insured members and the total number of insured members (see the Amicable Society and, later, the National Insurance Act 1911). Similar statistical methods are used nowadays for building optimal insurance pricing models, minimising losses and increasing operational profits. Neural networks and abnormal pattern detection help to prevent insurance fraud (both hard fraud, i.e. a planned or invented loss such as a staged car theft, and soft fraud, i.e. exaggerated but otherwise legitimate claims).

3.10, 4.10, 3.7, 3.12, 4.12

Healthcare

Similar to the tasks described above, a properly modelled individual healthcare plan will help insurance companies to predict insurance costs and let the employer assess its potential losses. The employee (as an individual) will be able to choose the optimal insurance provider and insurance plan.

The doctor will be supported in his diagnoses, and the patient will have a second opinion based on the best prediction model and optimal treatment.

3.10, 4.10

Customer Intelligence

Reducing churn and increasing up-sales for the targeted audience requires reliable modelling of that audience. It leads to constructing thousands of models for predictive analysis and generating the most attractive personalised offers. The ML application in this business area is closely related to the first two areas discussed above.

3.5, 4.5

Crime prevention

One obvious application of ML has already been discussed above: the optical character recognition widely used on the streets and on the net, thanks to neural nets.

Sorry, I cannot help quoting another famous line:

“You should never underestimate the predictability of stupidity.” (c) Bullet Tooth Tony

The amount of information about individuals in social networks is delightful: a photo of a prominent politician’s daughter showing off a new Bentley her father could never afford; a businessman who declared an insignificant income in his last tax return posing on a 60-foot yacht with a clearly recognisable registration number; and so on.

Tax evasion, money laundering and corruption detection will all benefit from pattern detection, classification and regression tree algorithms.

Terrorist activity detection.

The NSA/CIA’s favourite Random Forest (RF) algorithm is based on gathering and categorising 80 parameters [4], including mobile phone usage, pattern of life, social network activity and travel behaviour. A word of caution: the RF algorithm requires properly random sampling (bootstrapping) to avoid overfitting during the bagging phase. Otherwise the results will be devastating [4]. Here is another quote from Snatch:

 

“Boris, do not use idiots for this job”
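The card example from the Risk and Fraud Detection area above can be written down as a single static rule. The sketch below is a hypothetical illustration: the `Transaction` fields, the ten-minute window and the list of known addresses are assumptions for this example, and a real scoring system would combine many such signals with learned models rather than rely on one hand-written check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set


@dataclass
class Transaction:
    card_id: str
    timestamp: datetime
    channel: str                      # assumed values: "atm" or "online"
    location: str
    delivery_address: Optional[str] = None


def is_suspicious(previous: Transaction, current: Transaction,
                  known_addresses: Set[str],
                  window: timedelta = timedelta(minutes=10)) -> bool:
    """Flag an online purchase to an unknown address shortly after an ATM withdrawal."""
    same_card = previous.card_id == current.card_id
    elapsed = current.timestamp - previous.timestamp
    close_in_time = timedelta(0) <= elapsed <= window
    channel_switch = previous.channel == "atm" and current.channel == "online"
    unusual_address = current.delivery_address not in known_addresses
    return same_card and close_in_time and channel_switch and unusual_address


withdrawal = Transaction("card-1", datetime(2016, 3, 1, 12, 0), "atm",
                         "Bond Street, London")
purchase = Transaction("card-1", datetime(2016, 3, 1, 12, 5), "online",
                       "web shop", delivery_address="an unfamiliar address")
flagged = is_suspicious(withdrawal, purchase, known_addresses={"home address"})
```

A rule like this answers only “Yes” or “No”, exactly like the Owl; the learning algorithms are what decide which rules and thresholds are worth asking about.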

 

Conclusion

As stated earlier, this classification did not aim at complete coverage of all possible algorithms in the ML domain, nor at positioning ML as the universal approach to data analysis. Every algorithm has its own benefits and weaknesses, as the fairy-tale example strives to demonstrate, and when poorly implemented it can lead to devastating results [4]. For instance, neural networks tend to be slow to train and hard to realise physically, and in large networks the weights derived during training can be hard to comprehend. Random Forest is faster to train and easier to implement, but without proper estimation of generalisation errors during the forest building process it can fail spectacularly. As a result, an Al-Jazeera correspondent can get the highest scores as an Al-Qa’ida terrorist. Taking all 80 declared criteria from one functional set (practically all criteria are from GSM usage patterns) creates an unbalanced decision tree. In general, Random Forest is not good at predicting outcomes beyond the range of the training data. Based on that, we can generalise that learning algorithms are only as good, complete and representative as the data sets they are based upon (that is why Flink and Kafka are included as essential elements of data gathering and classification). The need for a Data Scientist thus becomes crucial for a successful outcome. Understanding the statistical distribution law of the dataset is the first step in adopting any algorithm. Decision making should be based on a restricted Neyman-Pearson approach, adapted to the detected distribution law, which is not always Gaussian [5], [6]. We hope that you can now see that BI is far more complex than implementing an obvious BRE (like Oracle’s) in your technical infrastructure.
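The bootstrapping step that Random Forest depends on, and the out-of-bag rows that make an honest error estimate possible, can be sketched without any ML library. The data set size and the seed below are arbitrary assumptions; the code only shows the sampling, not the tree building.

```python
import random


def bootstrap_indices(n_rows: int, rng: random.Random) -> list:
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows: int, sample: list) -> list:
    """Rows never drawn into the sample; usable as a held-out set for that tree."""
    drawn = set(sample)
    return [i for i in range(n_rows) if i not in drawn]


rng = random.Random(42)
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, sample)
oob_fraction = len(oob) / n_rows
```

For large data sets roughly a third of the rows (about 1/e) stay out of each bootstrap sample. Evaluating every tree on its own out-of-bag rows gives an estimate of the generalisation error during forest building; if the sampling is not properly random, or all features come from one functional set, that estimate says little about new data.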

References

  1. Machine Learning, Tom Mitchell, McGraw Hill, 1997.
  2. Applied SOA Patterns on the Oracle Platform, Sergey Popov, Packt Publishing, August 12, 2014.
  3. Artificial Intelligence, A. Cachko, Eureka Publishing, 1978.
  4. The NSA’s SKYNET program may be killing thousands of innocent people
  5. Levin B.R. Theoretical Foundations of Statistical Radio Engineering, 1989, Radio and Communications.
  6. Tikhonov V.I. The optimal signal reception, 1983, Radio and Communications.
  7. Top-banner "Snow White and Prince Charming Check the Map"

 

About the blogger:
Sergey has a Master of Science degree in Engineering and is a Certified Trainer in SOA Architecture, Governance and Security. He has several publications in this field, including the comprehensive practical guide “Applied SOA Patterns on the Oracle Platform”, Packt Publishing, 2014.
