What Artificial Intelligence and Machine Learning can do – and what they can't

I have written on Artificial Intelligence (AI) before.  Back then I focused on the technology side of it: what is part of an AI system and what isn’t.  But there is another question which might be even more important.  What are we DOING with AI?

Part of my job is to help investors with their due diligence.  I discuss companies with them in which they might want to invest.  Here is a quick observation: by now, every company pitch is full of claims about how they are using AI to solve a given business problem.

Part of me loves this, since some of those companies are onto something and should get the chance.  But I also have a built-in “bullshit-meter”.  So, another part of me wants to cringe every time I listen to a founder making stuff up about how AI will help them.  I have listened to many founders who do not know a lot about AI, but they sense that they can get millions of dollars of funding just by adding those fluffy keywords to their pitch.  The bad news is that, sooner or later, it actually works.  Who am I to blame them?

I have seen situations where AI or at least machine learning (ML) has an incredible impact.  But I also have seen situations where this is not the case.  What was the difference?

In most of the cases where organizations fail with AI or ML, they used those techniques in the wrong context.  ML models are not very helpful if you have only one big decision you need to make.  Analytics can still help you in such cases by giving you easier access to the data you need to make this decision, or by presenting this data in a consumable fashion.  But at the end of the day, those single big decisions are often very strategic.  Building a machine learning model or an AI to help you make this decision is rarely worth the effort.  And often it does not yield better results than just making the decision on your own.

Here is where ML and AI can help.  Machine Learning and Artificial Intelligence deliver the most value whenever you need to make lots of similar decisions quickly.  Good examples are:

  • Defining the price of a product in markets with rapidly changing demands,
  • Making offers for cross-selling in an E-Commerce platform,
  • Approving or declining a credit application,
  • Detecting customers with a high risk of churn,
  • Stopping fraudulent transactions,
  • …among others.

You can see that a human being with access to all relevant data could make any of those decisions in a matter of seconds or minutes.  Only they can't without AI or ML, since they would need to make this type of decision millions of times, every day.  Think of sifting through your customer base of 50 million clients every day to identify those with a high churn risk.  Impossible for any human being, but no problem at all for an ML model.

So, the biggest value of artificial intelligence and machine learning is not to support us with those big strategic decisions.  Machine learning delivers most value when we operationalize models and automate millions of decisions.

The image below shows this spectrum of decisions and the time humans need to make them.  The blue boxes are situations where analytics can help, but does not provide its full value.  The orange boxes are situations where AI and ML show real value.  And the interesting observation is: the more decisions you can automate, the higher this value will be (upper right end of the spectrum).

[Image: Automating decisions with ML and AI]

One of the shortest descriptions of this phenomenon comes from Andrew Ng, who is a well-known researcher in the field of AI.  Andrew described what AI can do as follows:

“If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

I agree with him on this characterization.  And I like that he puts the emphasis on automation and operationalization of those models – because this is where the biggest value is.  The only thing I disagree with is the time unit he chose.  It is already safe to go with a minute instead of a second.

K-Nearest Neighbors – the Laziest Machine Learning Technique

  • Family: Supervised learning
  • Modeling types: Classification, Regression
  • Group: Lazy Learners / Instance-based Learners
  • Input data: Numerical, Categorical
  • Tags: Fast, Local, Global
[Video: 5 Minutes with Ingo – k-Nearest Neighbors]

One of my quirky videos from the “5 Minutes with Ingo” series.  It explains the basic concepts of k-Nearest Neighbors in 5 minutes.  And has unicorns.

Concept

k-Nearest Neighbors is one of the simplest machine learning algorithms.  As with many others, human reasoning was the inspiration for this one as well.  Whenever something significant happens in your life, you memorize the experience.  You later use this experience as a guideline for what you expect to happen next.

Consider you see somebody dropping a glass.  While the glass is falling, you already make the prediction that it will break when it hits the ground.  But how can you do this?  You have never seen THIS glass break before, right?

No, indeed not.  But you have seen similar glasses, or similar items in general, dropping to the floor before.  And while the situation might not be exactly the same, you still know that a glass dropping from about 5 feet onto a concrete floor usually breaks.  This gives you a pretty high level of confidence to expect breakage whenever you see a glass fall from that height onto a hard floor.

But what about dropping a glass from a height of one foot onto a soft carpet?  Have you experienced glasses breaking in such situations as well?  No, you have not.  We can see that the height matters, and so does the hardness of the ground.

This way of reasoning is exactly what a k-Nearest Neighbors algorithm does.  Whenever a new situation occurs, it scans through all past experiences and looks up the k closest ones.  Those experiences (or: data points) are what we call the k nearest neighbors.

If you have a classification task, for example predicting whether the glass breaks or not, you take the majority vote of all k neighbors.  If k=5 and the glass broke in 3 or more of your most similar experiences, you go with the prediction “yes, it will break”.

Let’s now assume that you want to predict the number of pieces a glass will break into.  In this case, we want to predict a number, which we call “regression”.  Now you take the average value of your k neighbors’ numbers of glass pieces as the prediction or score.  If k=5 and the numbers of pieces are 1 (did not break), 4, 8, 2, and 10, you end up with a prediction of 5.
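To make the voting and averaging concrete, here is a minimal sketch in plain Python (my own illustration, not part of the original post), using made-up labels for the k=5 neighbors from the glass example:

```python
from collections import Counter

# Hypothetical labels of the k=5 nearest neighbors (glass example).
neighbor_classes = ["breaks", "breaks", "breaks", "intact", "intact"]
neighbor_pieces = [1, 4, 8, 2, 10]  # pieces each neighboring glass broke into

# Classification: majority vote over the k neighbors.
class_prediction = Counter(neighbor_classes).most_common(1)[0][0]
print(class_prediction)   # -> "breaks" (3 out of 5 votes)

# Regression: average of the k neighbors' numerical labels.
pieces_prediction = sum(neighbor_pieces) / len(neighbor_pieces)
print(pieces_prediction)  # -> 5.0
```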

[Image: k-NN concept – determining the class of a new data point from its nearest neighbors]

We have blue and orange data points.  For a new data point (green), we can determine the most likely class by looking up the classes of the nearest neighbors.  Here, the decision would be “blue”, because that is the majority of the neighbors.

Why is this algorithm called “lazy”?  Because it does no training at all when you supply the training data.  At training time, all it does is store the complete data set; it does not perform any calculations at this point.  Neither does it try to derive a more compact model from the data which it could later use for scoring.  Therefore, we call this algorithm lazy.

Theory

We have seen that this algorithm is lazy: during training, all it does is store the data it gets.  All the computation happens during scoring, i.e. when we apply the model to unseen data points.  We need to determine which k data points out of our training set are closest to the data point we want to get a prediction for.

Let’s say that our data points look like the following:

$$
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1m} & y_1 \\
x_{21} & x_{22} & \cdots & x_{2m} & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm} & y_n
\end{pmatrix}
$$

We have a table of n rows and m+1 columns where the first m columns are the attributes we use to predict the remaining label column (also known as “target”).  For now, let’s also assume that all attribute values x are numerical while the label values for y are categorical, i.e. we have a classification problem.

We can now define a distance function which calculates the distance between data points.  In particular, it should find the closest data points in our training data for any new point.  The Euclidean distance is often a good choice for such a distance function if the data is numerical.  If our new data point has attribute values s_1 to s_m, we can calculate the distance d(s, x_j) between point s and any data point x_j by

$$
d(s, x_j) = \sqrt{\sum_{i=1}^{m} (s_i - x_{ji})^2}
$$

The k data points with the smallest values for this distance become our k neighbors.  For a classification task, we now use the most frequent of all values y from our k neighbors.  For regression tasks, where y is numerical, we use the average of all values y from our k neighbors.
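If you want to see the whole scoring step in one place, here is a small NumPy sketch (an illustration under my own assumptions, not the RapidMiner implementation): it computes the Euclidean distances, picks the k closest training points, and takes the majority vote of their labels.

```python
import numpy as np

def knn_predict(X_train, y_train, s, k=5):
    """Lazy k-NN scoring of a single new point s (classification)."""
    # Euclidean distance d(s, x_j) for every training point x_j.
    distances = np.sqrt(((X_train - s) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    neighbor_idx = np.argsort(distances)[:k]
    # Majority vote among the k neighbor labels.
    labels, counts = np.unique(y_train[neighbor_idx], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny made-up data set with two attributes.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array(["blue", "blue", "orange", "orange"])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # -> "blue"
```

For a regression task you would replace the majority vote with the average of the k neighbors' y values.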

But what if our attributes are not numerical, or are a mix of numerical and categorical attributes?  Then you can use any other distance measure which can handle this type of data.  This article discusses some frequent choices.

By the way, K-Nearest Neighbors models with k=1 are the reason why calculating training errors is completely pointless.  Can you see why?

Practical Usage

K-Nearest Neighbors, or K-NN for short, should be a standard tool in your toolbox.  It is fast, easy to understand even for non-experts, and easy to tune to different kinds of predictive problems.  But there are some things to consider, which we will discuss in the following.

Data Preparation

We have seen that the key part of the algorithm is the definition of a distance measure, and a frequent choice is the Euclidean distance.  This distance measure treats all data columns in the same way though.  It subtracts the values for each dimension before it sums up the squares of those differences.  That means that columns with a wider data range have a larger influence on the distance than columns with a smaller data range.

So, you should normalize the data set so that all columns are roughly on the same scale.  There are two common ways of normalization.  First, you could bring all values of a column into a range between 0 and 1.  Or you could change the values of each column so that the column has a mean of 0 and a standard deviation of 1 afterwards.  We call this type of normalization z-transformation or standard score.
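Both variants are easy to do by hand.  Here is a short NumPy sketch (my own illustration, with a made-up two-column data set) of range normalization and the z-transformation:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 800.0],
              [3.0, 500.0]])  # second column has a much wider range

# Range normalization: bring every column into [0, 1].
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-transformation (standard score): mean 0, standard deviation 1 per column.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
```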

Tip: Whenever you know that the machine learning algorithm is making use of a distance measure, you should normalize the data. Another famous example would be k-Means clustering.

Parameters to Tune

The most important parameter you need to tune is k, the number of neighbors used to make the class decision.  The minimum value is 1, in which case you only look at the single closest neighbor for each prediction.  In theory, you could use a value for k as large as your total training set.  This would make no sense though, since in this case you would always predict the majority class of the complete training set.

Here is a good way to interpret the meaning behind k.  Small values lead to “local” models, which can be non-linear, and the decision boundary between the classes wiggles a lot.  As k grows, the wiggling decreases until you almost end up with a linear decision boundary.

[Image: Impact of k on the decision boundary]

We see a data set in two dimensions on the left.  In general, the top right belongs to the red class and the bottom left to the blue class.  But there are also some local groups inside both areas.  Small values for k lead to more wiggly decision boundaries.  For larger values the decision boundary becomes smoother, almost linear in this case.

Good values for k depend on the data you have and on whether the problem is non-linear or not.  You should try a couple of values between 1 and about 10% of the size of the training data set.  Then you will see if there is a promising area worth further optimization of k.
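One way to run such a scan, sketched here with scikit-learn instead of RapidMiner (an assumption on my part; the data set and the list of k values are just placeholders), is to cross-validate a pipeline of normalization plus k-NN for a handful of k values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Try a handful of k values up to roughly 10% of the training set size.
for k in (1, 3, 5, 7, 11, 15):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={k:2d}  mean accuracy={scores.mean():.3f}")
```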

The second parameter you might want to consider is the type of distance function you are using.  For numerical values, Euclidean distance is a good choice.  You might also want to try Manhattan distance, which is sometimes used as well.  For text analytics, cosine distance can be another good alternative worth trying.
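In case you want to experiment with those alternatives, the distance functions themselves are one-liners.  Here is a quick sketch (again my own illustration, not tied to any specific tool):

```python
import numpy as np

def manhattan_distance(s, x):
    # Sum of absolute differences per attribute.
    return np.abs(s - x).sum()

def cosine_distance(s, x):
    # 1 minus cosine similarity; small when the vectors point the same way.
    return 1.0 - np.dot(s, x) / (np.linalg.norm(s) * np.linalg.norm(x))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
print(manhattan_distance(a, b), cosine_distance(a, b))
```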

Memory Usage & Runtimes

Please note that all this algorithm does is store the complete training data.  So, memory needs grow linearly with the number of data points you provide for training.  Smarter implementations might choose to store the data in a more compact fashion, but in the worst case you still end up with a lot of memory usage.

For training, the runtime is as good as it gets.  The algorithm does no calculations at all besides storing the data, which is fast.

The runtime for scoring, though, can be large, which is unusual in the world of machine learning.  All calculations happen during model application.  Hence, the scoring runtime scales linearly with the number of data columns m and the number of training points n.  So, if you need to score quickly and the number of training data points is large, then k-Nearest Neighbors is not a good choice.
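If you want to see this scaling for yourself, a rough sketch like the following (using scikit-learn's brute-force k-NN as an assumption; the exact timings will depend on your machine) makes the growth in n visible:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_score = rng.normal(size=(1_000, 20))  # points we want to score

for n in (10_000, 100_000):  # growing training set
    X = rng.normal(size=(n, 20))
    y = rng.integers(0, 2, size=n)
    model = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)

    start = time.perf_counter()
    model.predict(X_score)
    print(f"n={n:>7}  scoring time: {time.perf_counter() - start:.2f}s")
```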

RapidMiner Processes

You can download RapidMiner here.  Then you can download the processes below to build this machine learning model yourself in RapidMiner.

Please download the Zip-file and extract its content.  The result will be an .rmp file which can be loaded into RapidMiner via “File” -> “Import Process…”.

So, what is Data Science then?

I just finished a post explaining the relationship between Artificial Intelligence, Machine Learning, and Deep Learning.  And somebody immediately pointed out: but what about Data Science?  How does Data Science relate to all this?

Good question.  That’s what I am going to write about today then.

In case you do not want to read the whole post from yesterday (shame on you!), here is a quick summary:

  • Artificial Intelligence covers anything which enables computers to behave like a human.  Machine Learning is a part of this, as are language understanding and computer vision.
  • Machine Learning deals with the extraction of patterns from data sets.  This means that the machine can find rules for optimal behavior but can also adapt to changes in the world.  Deep Learning is part of this, but so are decision trees, k-Means clustering, or linear regression, among others.
  • Deep Learning is a specific class of Machine Learning algorithms which use complex neural networks.  Recent advances in parallel computing made those algorithms feasible.

Deep Learning is a subset of Machine Learning methods, which in turn is a subset of Artificial Intelligence.

Which finally brings us to Data Science.  The picture below gives an idea of how Data Science relates to those fields:

[Image: How Data Science relates to AI, ML, and DL]

Data Science is the practical application of all those fields (AI, ML, DL) in a business context.  “Business” here is a flexible term, since it could also cover a case where you work on scientific research.  In this case, your “business” is science.  Which is actually more accurate than you might want to admit.

But whatever the context of your application is, the goals are always the same:

  • extracting insights from data,
  • predicting developments,
  • deriving the best actions for an optimal outcome,
  • or sometimes even performing those actions in an automated fashion.

As you can also see in the diagram above, Data Science covers more than the application of only those techniques.  It also covers related fields like traditional statistics and the visualization of data or results.  Finally, Data Science includes the necessary data preparation to get the analysis done.  In fact, this is where you will spend most of your time as a data scientist.

A more traditional definition describes a data scientist as somebody with programming skills, statistical knowledge, and business understanding.  And while this is indeed a skill mix which allows you to do the job of a data scientist, the definition falls a bit short.  Others have realized this as well, which led to a battle of Venn diagrams.

The problem is that people can be good data scientists even if they do not write a single line of code.  And other data scientists can create great predictive models with the help of the right tools, but without a deeper understanding of statistics.  So the “unicorn” data scientist (who masters all these skills at the same time) is not only overpaid and hard to find.  They might also be unnecessary.

For this reason, I prefer the definition above, which focuses on the “what” and less on the “how”.  Data scientists are people who apply all those analytical techniques, plus the necessary data preparation, in the context of a business application.  The tools do not matter to me as long as the results are correct and reliable.

First Post – or: About the Insanity of Being an Entrepreneur in Data Science

[Image: Ingo giving a presentation]

Ingo doing what he does best: explaining things nobody asked for.

Oooook, I finally took the step and created a space for my thoughts on data science, startup life, photography, and all the other things I deal with.  The reason why I decided to do this is simple: there are not a lot of entrepreneurs in the data science field.  Founding a company is hard.  Data science is hard.  Doing both at the same time is insane.

It should not come as a surprise that over the years many people have reached out to me seeking advice.  I love sharing insights and ideas, so please continue to do so.  But what about those who do not have the chance to talk to me directly?  Exactly.  That’s when I realized that I should get my act together and spend some time writing this stuff down.  Which is why I created this blog.

I also enjoy teaching others about the concepts of data science.  Which is why at one point I created the crazy video series 5 Minutes with Ingo.  That series is also the origin of this blog’s name.  If you want to learn more about machine learning and data science, you should check out those videos.  But be warned: they are a bit quirky.

In the future, I will share lessons learned as an entrepreneur (where I still feel like an amateur) as well as in data science (which I know a thing or two about).  I will also mix in some photography posts – because I enjoy it, and why not?

Hope you enjoy this little place here.  And let me know if there is anything you would like to get my perspective on.