# K-Nearest Neighbors – the Laziest Machine Learning Technique

Family:
Supervised learning
Modeling types:
Classification, Regression
Group:
Lazy Learners / Instance-based Learners
Input data:
Numerical, Categorical
Tags:
Fast, Local, Global

One of my quirky videos from the “5 Minutes with Ingo” series. It explains the basic concepts of k-Nearest Neighbors in 5 minutes. And has unicorns.

## Concept

k-Nearest Neighbors is one of the simplest machine learning algorithms.  As for many others, human reasoning was the inspiration for this one as well.  Whenever something significant happens in your life, you memorize the experience.  You later use that experience as a guideline for what you expect to happen next.

Imagine you see somebody drop a glass. While the glass is falling, you already predict that it will break when it hits the ground. But how can you do this? You have never seen THIS glass break before, right?

No, indeed not. But you have seen similar glasses, or similar items in general, drop to the floor before.  And while the situation might not be exactly the same, you still know that a glass dropping from about 5 feet onto a concrete floor usually breaks.  This gives you a pretty high level of confidence to expect breakage whenever you see a glass fall from that height onto a hard floor.

But what about dropping a glass from one foot onto a soft carpet?  Did you experience breaking glasses in such situations as well?  No, you did not.  So the height matters, and so does the hardness of the ground.

This way of reasoning is what a k-Nearest Neighbors algorithm is doing as well.  Whenever a new situation occurs, it scans through all past experiences and looks up the k closest experiences.  Those experiences (or: data points) are what we call the k nearest neighbors.

If you have a classification task, for example predicting whether the glass breaks or not, you take the majority vote of all k neighbors.  If k=5 and the glass broke in 3 or more of your most similar experiences, you go with the prediction “yes, it will break”.

Let’s now assume that you want to predict the number of pieces a glass will break into.  In this case, we want to predict a number, which we call “regression”.  Now you take the average value of your k neighbors’ numbers of glass pieces as the prediction or score.  If k=5 and the numbers of pieces are 1 (did not break), 4, 8, 2, and 10, you end up with a prediction of 5.
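To make both prediction modes concrete, here is a minimal sketch in plain Python (the function and label names are mine, purely for illustration):

```python
from collections import Counter

def knn_classify(neighbor_labels):
    # Classification: majority vote among the k nearest neighbors
    return Counter(neighbor_labels).most_common(1)[0][0]

def knn_regress(neighbor_values):
    # Regression: average of the k nearest neighbors' numerical values
    return sum(neighbor_values) / len(neighbor_values)

# The glass example with k=5: the glass broke for three neighbors, not for two
print(knn_classify(["broke", "broke", "broke", "intact", "intact"]))  # broke

# Numbers of pieces from the k=5 neighbors: the average is the prediction
print(knn_regress([1, 4, 8, 2, 10]))  # 5.0
```

Note that this assumes the k neighbors have already been found; finding them is the topic of the theory section below.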

We have blue and orange data points.  For a new data point (green), we can determine the most likely class by looking up the classes of the nearest neighbors.  Here, the decision would be “blue”, because that is the majority of the neighbors.

Why is this algorithm called “lazy”?  Because it does no training at all when you supply the training data.  At training time, all it is doing is storing the complete data set but it does not do any calculations at this point.  Neither does it try to derive a more compact model from the data which it could use for scoring.   Therefore, we call this algorithm lazy.

## Theory

We have seen that this algorithm is lazy and during training time all it is doing is to store all the data it gets.  All the computation happens during scoring, i.e. when we apply the model on unseen data points.  We need to determine which k data points out of our training set are closest to the data point we want to get a prediction for.

Let’s say that our data points look like the following:

We have a table of n rows and m+1 columns where the first m columns are the attributes we use to predict the remaining label column (also known as “target”).  For now, let’s also assume that all attribute values x are numerical while the label values for y are categorical, i.e. we have a classification problem.

We can now define a distance function which calculates the distance between data points.  In particular, it should find the closest data points from our training data for any new point.  The Euclidean distance is often a good choice for such a distance function if the data is numerical.  If our new data point has attribute values s1 to sm, we can calculate the distance d(s, xj) between point s and any data point xj by

d(s, xj) = sqrt( (s1 − xj1)² + (s2 − xj2)² + … + (sm − xjm)² )

The k data points with the closest value for this distance become our k neighbors.  For a classification task, we now use the most frequent of all values y from our k neighbors.  For regression tasks, where y is numerical, we use the average of all values y from our k neighbors.
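Putting the distance function and the neighbor lookup together, a bare-bones version of the whole scoring step might look like this (a sketch with made-up toy data, not a production implementation):

```python
import math
from collections import Counter

def euclidean(s, x):
    # d(s, x): square root of the summed squared differences per attribute
    return math.sqrt(sum((si - xi) ** 2 for si, xi in zip(s, x)))

def knn_predict(train, s, k):
    # train: list of (attributes, label) pairs; s: the new point's attributes
    neighbors = sorted(train, key=lambda row: euclidean(s, row[0]))[:k]
    labels = [label for _, label in neighbors]
    # Classification: most frequent label among the k nearest neighbors
    return Counter(labels).most_common(1)[0][0]

# Toy data: (drop height in feet, floor hardness 0..1) -> outcome
train = [((5.0, 1.0), "broke"), ((4.0, 0.9), "broke"),
         ((1.0, 0.1), "intact"), ((0.5, 0.2), "intact"),
         ((6.0, 0.8), "broke")]
print(knn_predict(train, (4.5, 0.95), k=3))  # broke
```

For a regression task you would replace the majority vote with the average of the neighbors’ label values.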

But what if our attributes are not numerical or consist of numerical and categorical attributes?  Then you can use any other distance measure which can handle this type of data.  This article discusses some frequent choices.

By the way, K-Nearest Neighbors models with k=1 are the reason why calculating training errors is completely pointless.  Can you see why?

## Practical Usage

K-Nearest Neighbors, or K-NN for short, should be a standard tool in your toolbox.  It is fast, easy to understand even for non-experts, and easy to tune to different kinds of predictive problems.  But there are some things to consider, which we will discuss in the following.

### Data Preparation

We have seen that the key part of the algorithm is the definition of a distance measure, and a frequent choice is the Euclidean distance. This distance measure treats all data columns in the same way though: it subtracts the values for each dimension before it sums up the squares of those differences. And that means that columns with a wider data range have a larger influence on the distance than columns with a smaller data range.

So, you should normalize the data set so that all columns are roughly on the same scale. There are two common ways of normalization. First, you could bring all values of a column into a range between 0 and 1. Or you could change the values of each column so that the column has a mean 0 with a standard deviation of 1 afterwards. We call this type of normalization z-transformation or standard score.

Tip: Whenever you know that the machine learning algorithm is making use of a distance measure, you should normalize the data. Another famous example would be k-Means clustering.
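Both normalization variants are easy to write down. Here is a sketch of each, working on one column at a time (using the population standard deviation; function names are my own):

```python
def min_max(column):
    # Range normalization: scale all values into [0, 1]
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def z_transform(column):
    # Standard score: shift to mean 0, scale to standard deviation 1
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

heights_ft = [5.0, 4.0, 1.0, 0.5, 6.0]
print(min_max(heights_ft))      # all values between 0 and 1
print(z_transform(heights_ft))  # mean 0, standard deviation 1
```

After either transformation, the drop height and the floor hardness from our example would contribute to the Euclidean distance on roughly equal footing.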

### Parameters to Tune

The most important parameter you need to tune is k, the number of neighbors used to make the class decision.  The minimum value is 1 in which case you only look at the closest neighbor for each prediction to make your decision.  In theory, you could use a value for k which is as large as your total training set.  This would make no sense though, since in this case you would always predict the majority class of the complete training set.

Here is a good way to interpret the meaning behind k. Small numbers indicate “local” models, which can be non-linear, and the decision boundary between the classes wiggles a lot. As k grows, the wiggling lessens until you almost end up with a linear decision boundary.

We see a data set in two dimensions on the left.  In general the top right is red and the bottom left is the blue class.  But there are also some local groups inside of both areas.  Small values for k lead to more wiggly decision boundaries.  For larger values the decision boundary becomes smoother, almost linear in this case.

Good values for k depend on your data and on whether the problem is non-linear or not.  You should try a couple of values between 1 and about 10% of the size of the training data set.  Then you will see if there is a promising area worth further optimization of k.

The second parameter you might want to consider is the type of distance function you are using.  For numerical values, Euclidean distance is a good choice.  You might also want to try Manhattan distance, which is sometimes used as well.  For text analytics, cosine distance can be another good alternative worth trying.
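For reference, here are minimal sketches of those two alternatives (my own implementations, not tied to any particular library):

```python
import math

def manhattan(s, x):
    # Manhattan (city-block) distance: sum of absolute differences per dimension
    return sum(abs(si - xi) for si, xi in zip(s, x))

def cosine_distance(s, x):
    # 1 minus cosine similarity: compares the direction of the vectors,
    # not their magnitude -- handy for text represented as word counts
    dot = sum(si * xi for si, xi in zip(s, x))
    norm = math.sqrt(sum(v * v for v in s)) * math.sqrt(sum(v * v for v in x))
    return 1.0 - dot / norm

print(manhattan((1, 2), (4, 6)))                   # 7
print(cosine_distance((1, 0), (0, 1)))             # 1.0 (orthogonal vectors)
```

Swapping one of these in for the Euclidean distance in the scoring sketch above changes nothing else about the algorithm.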

### Memory Usage & Runtimes

Please note that all this algorithm is doing is storing the complete training data.  So, the memory needs grow linearly with the number of data points you provide for training.  Smarter implementations of this algorithm might choose to store the data in a more compact fashion.  But in a worst-case scenario you still end up with a lot of memory usage.

For training, the runtime is as good as it gets.  The algorithm is doing no calculations at all besides storing the data which is fast.

The runtime for scoring, though, can be large, which is unusual in the world of machine learning.  All calculations happen during model application.  Hence, the scoring runtime scales linearly with the number of data columns m and the number of training points n.  So, if you need to score fast and the number of training data points is large, then k-Nearest Neighbors is not a good choice.

### RapidMiner Processes

Please download the Zip-file and extract its content.  The result will be an .rmp file which can be loaded into RapidMiner via “File” -> “Import Process…”.

# Photo Friday: Newfoundland

It’s photo Friday.  On Fridays I will often publish some of the images I took in the past years.

Today I have some pictures which I took in 2016 in Newfoundland, Canada.  My wife Nadja and I visited both national parks: Gros Morne and Terra Nova.  We also visited the Bonavista peninsula and the northern tip around L’Anse Aux Meadows.

By the way, the image I currently use for the home page is the lighthouse of Bonavista which you can also see below.  I had a great email exchange with a reader earlier this week who identified the location right away. Impressive if you ask me.

Anyway, here are the images I have selected for today:

The one of the dark sea in the top left corner is one of my current favorites.  I just like the structure of the stones and how the colors came together.

And here is a small map of Newfoundland highlighting the areas we have visited:

# Your go-to-market strategy should drive the decision for open source

This is the second part of a series.

• Part I: Is Open Source right for your business? – Introduces some of the advantages of open source licenses but also makes clear that you should make the decision for open source based on your go-to-market strategy.
• Part II (this post): Your go-to-market strategy should drive the decision for open source – discusses a basic framework to determine if open source or at least a freemium strategy is something you should consider.
• Part III: Which is the right open source business model for you – discusses the different options for open-source-based business models as well as adjacent models.

Let’s assume that your business already has a general strategy. This means that you in general know what your desired market position is and how you want to get there.  Your product, marketing, and sales strategy should all follow this strategy of course. One key area is your go-to-market strategy which defines:

1. who you are selling to,
2. what resonates most with this group, and
3. how you are selling to it.

To figure this out, you need to look at different dimensions of your business.  The most important ones are your customers, your company itself, the ecosystem you are operating in, and your product. Every company is different, and so is your go-to-market strategy.  This post does not intend to cover all aspects of figuring out how to go to market.  Our goal is to “just” find out if open source is the right licensing strategy for you or not.  To do this, you need to answer the following 15 questions:

### Customers

• Is a high level of innovation needed for your target segment?
• What kind of company are you selling to? SME vs. enterprises?
• How much awareness do you need to create? Can a user community play a role in this?
• What is your approach for demand generation? Do you sell to users? Cold calling?

Indicators for the need of open source: Your market segment is developer-centric, or at least you deal with technical buyers. Your buyers care a lot about a high level of innovation. It is important to quickly integrate the latest technical developments.

Indicators for the need of a free offer: You mainly sell to smaller companies. Your most important lead source is the users of your product. Cold calling, events, or other forms of lead generation play a less important role. A user community, created through word of mouth, exists for your product.

If none of the above applies to you, you can stop reading.  A traditional enterprise software approach is more likely to work for you.

### Company

• What is your sales channel? Do you sell directly, or do you have a large focus on an indirect channel?  Is the indirect channel creating its own intellectual property based on your product?

Indicators for the need of open source: Your mission anchors around changing the life of many people or disrupting industries. The main path to achieve this mission is to build allies. You depend on partners who create their own intellectual property around your product.

Indicators for the need of a free offer: Your mission anchors around market leadership and owning a market. Your product is a self-starter and does not need a complex setup or lots of education.

### Ecosystem

• Do you need to make a fast land-grab or are you in a market where you replace an existing solution?
• Is your product part of a technology stack?  What about the other products?  Are they open source?

Indicators for the need of open source: You have many competitors and some offer their product already open source. Your product is part of a technology stack where all or most other products are open source.

Indicators for the need of a free offer: You follow a land-grab strategy trying to own your market quickly.

Be careful if you are in a market with only a few competitors and none is open source yet. It often does not make sense to open your own product in the early stage if there are no other reasons for doing so.

### Product

• Is your product supposed to be the best-in-class leader in your field? Or are you following a good-enough strategy replacing an existing solution?
• Can a community fill gaps in quality assurance, documentation, or education?
• Do you offer an API so that developers can augment your product through extensions?
• Is your product running on-premises or is it Software-as-a-Service?

Indicators for the need of open source: Your product is solid but not best in class. You follow a good-enough strategy for replacing current market leaders. You have a strong user community which is willing to help you on documentation. You have an API for extending your product or envision a marketplace for extensions.

Indicators for the need of a free offer: You have a SaaS product with only little time commitment necessary to get started.

It is hard to motivate a user community to provide documentation. Do not over-estimate your ability to get input from your users on those things!

Now three things can happen:

• No or only a few indicators for open source: If you only got a few positive hints that an open source license is the right thing for you: don’t do it. You will only get the disadvantages of open source but none of the advantages.
• Hints that you need a free offer but the license does not matter: Use a freemium-based business model instead of an open source license.
• Hints that you need a free offer and open source matters: Go with an open source software license. And choose one of the open-source-based paradigms for packaging & pricing (next post).

## Practical Example

To illustrate this framework with an example, I give you some of the key answers for RapidMiner:

• Our buyers are technical buyers and developers (data scientists in our case).
• Data science is a field with a high level of innovation.
• We already have a user community of more than 200,000 users attracted by word of mouth.
• We use a PQL approach for selling, i.e. customers become users first before we sell to them.
• RapidMiner’s mission is to change the way the industry of data scientists works.
• We sell directly and through partners, who often create own intellectual property.
• The data science market is still immature, and a fast land-grab is the desired strategy.
• Crowded competitive landscape with many open source technologies.
• The technology stack for data scientists is full of open source technologies.
• The evolution of data science and machine learning is so fast that an API for extensions or a marketplace is a must.

There are also some points which make RapidMiner not perfect for an open source or even a freemium approach:

• Data science is complex; the product has some learning curve.
• We sell to companies of all sizes, especially also to larger enterprises.
• The product is among the best in class.

But as you can see, there is more evidence that an open source license is beneficial for RapidMiner than not.  You can do the same exercise for your business which should tell you if open source is something you should consider. The next step is then to develop a business model around a good packaging of your product. We will discuss this in the next post.

# Is Open Source right for your business?

This is the first part of a series.

When you start a company, you need to think about thousands of things. The most important ones are product-market fit and your go-to-market strategy.  The former determines if you offer a product for which a market exists.  The latter defines what your market is, what resonates with it, and how you sell to it.

I plan to do a series of posts in which I will discuss a well-proven decision process for or against using an open source license.  I will do this in the light of your go-to-market strategy.  In this first post, we will see some of the advantages of an open source license, which lead many companies to decide for it to get access to those advantages right away. We will see, however, that this process is often somewhat flawed.  In later posts, we will introduce a framework for how you should make this decision instead and what your options for open source models are.

## Why open source?

The first thing you should define is your business strategy.  What is the market position you want to be in?  And in order to achieve this position: what are you doing and what are you not doing?  Finally, you need to think about how you can align the company behind this approach.  This is the basis for everything else, and getting it right can make or break your company.

After your business strategy is defined, you can work on your go-to-market strategy. And only then – and as part of your considerations – you should decide if an open source license is the right move for your product. Most companies do it exactly the other way round. They first decide for an open source license and then try to build a business around it. The reason is that “open source” offers a lot of great advantages as we will see below.

As users, we are always happy to get software under an open source license.  It means less vendor lock-in and lower license costs.  Let’s face it: most people care about “free” as in “free beer”, not as in “freedom of speech”.

As a software developer, the decision for open source also makes sense. Developers benefit so much from open source libraries created by other developers.  Some of the leading programming languages (e.g. Python or Java) as well as our favorite IDEs are open source as well (e.g. Eclipse).

There is also recognition and fame.  If you make your product open source, it will be more attractive to other developers and used by many.  This is a nice perk after all those nights of hard work to create the stuff in the first place.

But all those points above actually miss the point, which is that open source products have a bigger chance to create a user community around the product. This is what makes this licensing model attractive in most cases.

User communities around your software product offer a lot of value:

• Market presence (social lift, bigger user numbers, impact on analyst relations, brand awareness)
• Quality assurance
• Potential extension of your R&D team (if there is a developer community around your product as well or if your users are developers)
• Potentially, this community can add to the top of your sales funnel
• Asset for investors
• Being part of something big is generating trust and emotional “me too” triggers

RapidMiner found another great value of our community with its “Wisdom of Crowds” feature. The product analyzes the behavior of its users in the background.  It then makes recommendations for good analytical functions to other users. In a complex field like data science this form of guided analytics can be extremely useful.

There is one more huge value of open source, and it might be the biggest value of them all.  Open source also gives users the freedom to innovate.  People can extend and embed the software and are creating new innovations by doing so.

All those values sound great – so what is the catch?

## Open Source is NOT a business model!

The problem is that “open source” on its own is not a business model. It is a software license which accelerates innovation.  It allows your software to be embedded into other products, which generates new products.  And this often leads to a developer community around your original product.

As a side-effect, it also supports widespread adoption because of the free price. But this is also true for freemium offers like those used by virtually every SaaS vendor in the market. Most users of Dropbox never pay a dollar and only use the free version of the product. Dropbox is not open source, though, and it would not have made any difference if it had been. The free offer alone was doing the trick.

We can see that open source brings a lot of advantages.  However, sometimes those might not be necessary or sometimes this type of license can even be negative for reaching your business goals.  Which brings us back to the question of the go-to-market strategy.  In the next post, we will discuss this and how this strategy can relate to open source license models.

# The Perfect Drink for Data Scientists?

“Who says America doesn’t invent anything anymore?”

I have written about data scientists and unicorns before.  People somehow believe that data scientists need many conflicting skills, which is why it is close to impossible to find one in the wild. Just like with unicorns.

Starbucks has a heart for data scientists and brought us a new unicorn-themed Frappuccino.  It is only available for a few more days and is one of the biggest marketing activities from Starbucks I have seen.

Some say it “tastes like sour birthday cake and shame”.  But so what?  After all, I am one of those data science unicorns, so this should be THE drink for me!  I made my daily trip to Starbucks to get one.  But unlike all other days, I found myself surrounded by 12-year-old girls going crazy about the new drink.  And when I finally saw this pink-blue sugar bomb in the hand of one of the teens, I realized: I can’t do this.  I ended up ordering my regular coffee instead.

But what about the “data science unicorn” talk?  The least I could do was grab an uber-sized paper Frappuccino and take a picture with it.  When I accidentally ripped it off a wall, they kicked me out of the shop.  Starbucks’ heart for unicorns is not THAT big after all.*

Me holding up a giant paper version of the new Unicorn Frappuccino.  They kicked me out of the shop for accidentally ripping it off the wall.

* I was not actually kicked out of the shop but needed to fix the paper thingy back to the wall.  Understandable.  The baristas were busy enough and already suffering meltdowns because of the drink.

# So, what is Data Science then?

I just finished a post on explaining the relationship between Artificial Intelligence, Machine Learning, and Deep Learning.  And somebody immediately pointed out: But what about Data Science? How does Data Science relate to all this?

Good question.  That’s what I am going to write about today then.

In case you do not want to read the whole post from yesterday (shame on you!), here is a quick summary:

• Artificial Intelligence covers anything which enables computers to behave like a human.  Machine Learning is a part of this, as are language understanding and computer vision.
• Machine Learning deals with the extraction of patterns from data sets. This means that the machine can find rules for optimal behavior but also can adapt to changes in the world. Deep Learning is part of this, but so are decision trees, k-means clustering, or linear regression, among others.
• Deep Learning is a specific class of Machine Learning algorithms which are using complex neural networks.  Recent advances in parallel computing made those algorithms feasible.

Deep Learning is a subset of methods from Machine Learning.  Which is again a subset of Artificial Intelligence.

Which brings us now finally to Data Science.  The picture below gives an idea how Data Science relates to those fields:
Data Science is the practical application of all those fields (AI, ML, DL) in a business context.  “Business” here is a flexible term, since it could also cover a case where you work on scientific research.  In that case, your “business” is science.  Which actually is more true than you might want to admit.

But whatever the context of your application is, the goals are always the same:

• extracting insights from data,
• predicting developments,
• deriving the best actions for an optimal outcome,
• or sometimes even performing those actions in an automated fashion.

As you can also see in the diagram above, Data Science covers more than the application of only those techniques.  It also covers related fields like traditional statistics and the visualization of data or results.   Finally, Data Science also includes the necessary data preparation to get the analysis done.  In fact, this is where you will spend most of your time as a data scientist.

A more traditional definition describes a data scientist as somebody with programming skills, statistical knowledge, and business understanding. And while this indeed is a skill mix which allows you to do the job of a data scientist, this definition falls a bit short.  Others realized this as well which led to a battle of Venn diagrams.

The problem is that people can be good data scientists even if they do not write a single line of code.  And other data scientists can create great predictive models with the help of the right tools, but without a deeper understanding of statistics.  So the “unicorn” data scientist (who masters all the skills at the same time) is not only overpaid and hard to find.  It might also be unnecessary.

For this reason, I like the definition above more which focuses on the “what” and less on the “how”.  Data scientists are people who apply all those analytical techniques and the necessary data preparation in the context of a business application.  The tools do not matter to me as long as the results are correct and reliable.

# What is Artificial Intelligence, Machine Learning, and Deep Learning?

There is hardly a day where there is no news on artificial intelligence in the media.  Below is a short collection of some news headlines from the past 24 hours only:

It is interesting that most of those articles have a skeptical, if not even negative, tone.  This sentiment was also fueled by statements of Bill Gates, Elon Musk, or even Stephen Hawking.  With all due respect, I would not stand in public talking nonsense about wormholes, so we should all focus a bit more on the areas we are experts in.

This all underlines two things: artificial intelligence and machine learning finally became mainstream. And people know shockingly little about it.

There is also a high dose of hype around those topics.  We all heard about “Linear Regression” before. This should not come as a surprise since it was already invented more than 200 years ago by Legendre and Gauss.  And still this overdose of hype can lead to situations where people are a little bit carried away whenever they use this method.  Here is one of my favorite tweet exchanges which exemplifies this:

Anyway, there is high level of confusion around those terms. This post should help to understand the differences and relationships of those fields. Let’s get started with the following picture. It explains the three terms artificial intelligence, machine learning, and deep learning:

Artificial Intelligence covers anything which enables computers to behave like a human.  Think of the famous – although a bit outdated – Turing test to determine if this is the case or not.  If you talk to Siri on your phone and get an answer, this is close already.  Automatic trading systems using machine learning to be more adaptive would also fall into this category.

Machine Learning is the subset of Artificial Intelligence which deals with the extraction of patterns from data sets. This means that the machine can find rules for optimal behavior but also can adapt to changes in the world. Many of the involved algorithms have been known for decades, and sometimes even centuries. But thanks to the advances in computer science as well as parallel computing, they can now scale up to massive data volumes.

Deep Learning is a specific class of Machine Learning algorithms which use complex neural networks.  In a sense, it is a group of related techniques, like the group of “decision trees” or “support vector machines”.  But thanks to the advances in parallel computing, they got quite a bit of hype recently, which is why I broke them out here. As you can see, deep learning is a subset of methods from machine learning.  When somebody explains that deep learning is “radically different from machine learning”, they are wrong.  But if you would like to get a BS-free view on deep learning, check out this webinar I did some time ago.

But if Machine Learning is only a subset of Artificial Intelligence, what else is part of this field?  Below is a summary of the most important research areas and methods for each of the three groups:

• Artificial Intelligence: Machine Learning (duh!), planning, natural language understanding, language synthesis, computer vision, robotics, sensor analysis, optimization & simulation, among others.
• Machine Learning: Deep Learning (another duh!), support vector machines, decision trees, Bayes learning, k-means clustering, association rule learning, regression, and many more.
• Deep Learning: artificial neural networks, convolutional neural networks, recursive neural networks, long short-term memory, deep belief networks, and many more.

As you can see, there are dozens of techniques in each of those fields, and researchers generate new algorithms on a weekly basis.  Those algorithms might be complex.  The conceptual differences explained above are not.

# First Post – or: About the Insanity of Being an Entrepreneur in Data Science

Ingo doing what he is doing best: explaining things nobody asked for.

Oooook, I finally took the step and created a space for my thoughts on data science, startup life, and photography. And all other things I deal with. The reason why I decided to do this is simple: there are not a lot of entrepreneurs in the data science field.  Founding a company is hard.  Data science is hard.  Doing both at the same time is insane.

It should not surprise you that over the years many people have reached out to me seeking advice.  I love sharing insights or ideas, so please continue to do so.  But what about those who do not have the chance to talk to me directly?  Exactly.  That’s when I realized that I should get my act together and spend some time writing the stuff down.  Which is why I created this blog.

I also enjoy teaching others about the concepts of data science.  Which is why at one point I created the crazy video series 5 Minutes with Ingo.  This idea now is also the origin for the name of this blog.  If you want to learn more about machine learning and data science, you should check out those videos. But be warned: they are a bit quirky.

In the future, I will share lessons learned as an entrepreneur (where I still feel like an amateur) as well as on data science (which I know a thing or two about).  I mix in some photography posts – because I enjoy it and why not?

Hope you enjoy this little place here.  And let me know if there is anything you would like to get my perspective on.