IoT sensors, big data and advanced analytics to ensure optimal growth - Part III
Due to the feedback received from our first post, in this third installment of our blog series “IoT sensors, big data and advanced analytics to ensure optimal growth”, we will walk through a series of steps one can use to solve NASA’s Big Data problem from 1997. As we mentioned, NASA faced several challenges when trying to visualize very large datasets. The image shown in Figure 1 was generated from an interactive visualization of AIS data created in Python, with cloud infrastructure hosted on Microsoft Azure. The dataset was composed of 1.2 billion events, each consisting of a ship’s location (a lat-lon pair) among other features.
Figure 1. Visualizing Big Data. The figure was generated from an AIS dataset of 1.2 billion entries. Each bright pixel represents a location where a ship has been present during a period of time; the brighter the pixel, the higher the concentration of ships at that geographical point.
Visualizing Big Data… in a meaningful way
Let’s start with some very good news: we are not in 1997 anymore! (Not to say that Aqua’s “Barbie Girl” didn’t rock…). Nowadays we have a deluge of options for handling Big Data. However, when it comes to visualizing large datasets in a meaningful way, there is still a lot of work to do.
We will now walk through a few steps, similar to the ones we followed to create the interactive visualization mentioned above. We will make use of several Python libraries, among them Pandas (a personal favorite). There are several alternatives to Pandas; see for example Dask and Pandas on Ray. Let us start by loading the data… in memory!
Even though this is normally a very standard procedure, we need to read data coming from several CSV files. As awesome as Pandas can be, it is not designed to read data in parallel. To speed up the process of loading the data into memory, we used the “@numba.jit()” decorator. What this decorator does is compile the decorated function on the fly into efficient machine code. The result was roughly a 50% speed-up that comes basically for free.
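The original loading code is not shown here, but the idea can be sketched as follows. The column names `lat`/`lon`, the file-loading helper, and the particular numba-compiled cleaning step are our own illustration, not the post’s actual code; numba accelerates the numeric loop, not `pd.read_csv` itself.

```python
import numpy as np
import pandas as pd

try:
    from numba import jit
except ImportError:  # numba is optional in this sketch; fall back to plain Python
    def jit(**kwargs):
        return lambda f: f

@jit(nopython=True)
def clip_to_bounds(lats, lons):
    """Keep only coordinates inside valid lat/lon ranges (compiled by numba)."""
    n = lats.shape[0]
    out_lat = np.empty(n)
    out_lon = np.empty(n)
    k = 0
    for i in range(n):
        if lats[i] >= -90.0 and lats[i] <= 90.0 and \
           lons[i] >= -180.0 and lons[i] <= 180.0:
            out_lat[k] = lats[i]
            out_lon[k] = lons[i]
            k += 1
    return out_lat[:k], out_lon[:k]

def load_ais(paths):
    """Concatenate several CSV files and clean the coordinate columns."""
    df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    lat, lon = clip_to_bounds(df["lat"].to_numpy(), df["lon"].to_numpy())
    return pd.DataFrame({"lat": lat, "lon": lon})
```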
Let us now create a simple plotting function as follows:
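The plotting function itself did not survive into this post, so here is a minimal version of what it could look like, assuming a DataFrame with `lon` and `lat` columns (our naming) and matplotlib as the backend:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

def plot_points(df, size=1.0, alpha=1.0):
    """Scatter-plot a DataFrame with 'lon'/'lat' columns.

    The size and alpha parameters will become important later,
    when we need to fight saturation.
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(df["lon"], df["lat"], s=size, alpha=alpha, color="black")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    return fig
```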
Next, we will use this function to take a small sample of the data: 1,000 points presented in a scatter plot. Most respectable plotting programs can comfortably handle 1,000 data points.
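With Pandas the sampling itself is a one-liner. The sketch below uses synthetic uniform coordinates as a stand-in for the real AIS DataFrame (column names are our own):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the full AIS DataFrame
rng = np.random.default_rng(4)
df = pd.DataFrame({"lat": rng.uniform(-90, 90, 1_000_000),
                   "lon": rng.uniform(-180, 180, 1_000_000)})

# draw a uniform random sample of 1,000 rows
sample = df.sample(n=1000, random_state=42)
```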
The result of this sampling is presented in Figure 2.
Figure 2. Result of a 1,000-point sample. We sample from 1.2 billion data points.
As expected, the result of this scatter plot is meaningless: we have a dataset that consists of 1.2 billion points! We say we are “undersampling”.
We can try to lessen the undersampling problem by increasing the sample size. Let’s take a sample of 100,000 points. We present the result in Figure 3.
Figure 3. Result of a 100,000-point sample. We sample from 1.2 billion data points.
We can see from Figure 3 that even though we significantly increased the sample size, we are still sampling only 0.0083% of the dataset (undersampling), while at the same time saturating the part of the map where we have the most data points (Norway). We cannot see any meaningful structure!
We can try an easy fix: reducing the dot size and brightness. This helps alleviate the saturation problem. We see in Figure 4 that the plot indeed improves when it comes to saturation, but it is still obvious that we are undersampling.
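In matplotlib terms, this fix is just a change of the marker size and alpha parameters. A hedged illustration, with synthetic uniform points standing in for the real sample:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# synthetic stand-in for the sampled AIS points
rng = np.random.default_rng(3)
lons = rng.uniform(-180, 180, 100_000)
lats = rng.uniform(-90, 90, 100_000)

fig, ax = plt.subplots(figsize=(10, 6))
# tiny dots with a low alpha: dense areas build up brightness gradually
# instead of saturating immediately
ax.scatter(lons, lats, s=0.1, alpha=0.05, color="black")
```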
Figure 4. Result of a 100,000-point sample with reduced dot size and brightness. The saturation problem is alleviated.
When we reduce the brightness of the dots, we are effectively making only the pixels with a high density of data points appear bright. Our dataset has events distributed all over the globe, but with the highest density in Norway. Since AIS data results from tracking ships, the data points remain mostly at geographical points “within waters”.
Let’s take a closer look at the distribution of events per pixel. We do it in the following way:
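The code for this step is not shown in the post; one way to sketch it is to aggregate the points onto a pixel grid with `np.histogram2d` and then histogram the per-pixel counts. The grid size and the synthetic data below are our own assumptions:

```python
import numpy as np

# synthetic stand-in for the ship positions
rng = np.random.default_rng(1)
lons = rng.uniform(-180, 180, 100_000)
lats = rng.uniform(-90, 90, 100_000)

# aggregate events onto a fixed pixel grid (900 x 450 pixels here)
counts, xedges, yedges = np.histogram2d(
    lons, lats, bins=(900, 450), range=[[-180, 180], [-90, 90]]
)

# distribution of event counts over the non-empty pixels
nonzero = counts[counts > 0]
hist, edges = np.histogram(nonzero, bins=50)
```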
We present the result in Figure 5.
Figure 5. Histogram of event counts per pixel.
What we can see from the histogram is that most of the pixels have very low event counts (most of them below 2,000), while a few pixels have a much larger count of data points. If we map all these values linearly to a color range, nearly all of the pixels will be colored with the lowest brightness or color in the range, while the highest brightness is used for the few pixels with a lot of events. The result is that we gain very little insight.
The problem we face is a highly non-linear distribution of the data. We must use nonlinear scaling to map the full data range into a visible color range. If we use a logarithmic transformation to flatten out the histogram presented in Figure 5, we obtain the result shown in Figure 6 below.
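The logarithmic step can be sketched in a few lines. The tiny grid below is hypothetical; in the real pipeline `counts` is the per-pixel grid produced by the aggregation step:

```python
import numpy as np

# per-pixel event counts spanning several orders of magnitude
counts = np.array([[0.0, 1.0, 10.0],
                   [100.0, 1000.0, 100000.0]])

# log1p maps 0 -> 0 and compresses the huge dynamic range,
# so sparse and dense pixels both land in a visible part of the scale
scaled = np.log1p(counts)

# normalise to [0, 1] to use directly as pixel brightness
brightness = scaled / scaled.max()
```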
Figure 6. First step towards visualizing 1.2 billion data points in a meaningful way. We apply a non-linear transformation to avoid saturation and undersampling issues.
Why did we apply a logarithmic transformation to alleviate the nonlinear nature of the data? Why not some other function? The answer is that the choice was rather arbitrary; we had to start the exploration somewhere! A better option is to equalize the histogram of the data before building the image. In this way, we ensure that the structure present in the dataset is visible at every data level (and at every zoom level).
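A minimal sketch of such a histogram equalization, similar in spirit to the rank-based “eq_hist” normalization found in some plotting libraries (the function and its tie-breaking behavior are our own simplification):

```python
import numpy as np

def eq_hist(counts):
    """Histogram-equalise non-empty pixel counts onto (0, 1].

    Each non-empty pixel is brightened according to its rank among all
    non-empty pixels, so every brightness level is used equally often.
    Empty pixels stay at 0. Ties are broken arbitrarily in this sketch.
    """
    out = np.zeros(counts.shape, dtype=float)
    mask = counts > 0
    vals = counts[mask]
    ranks = vals.argsort().argsort()      # 0 .. n-1 by increasing count
    out[mask] = (ranks + 1) / mask.sum()  # map ranks to (0, 1]
    return out
```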
Figure 7. Histogram of event counts per pixel after equalizing the histogram from Figure 5.
Figure 8. Static visualization of 1.2 billion points after nonlinear data transformation.
Even though we can see more structure on the map (it really isn’t a map but a simple scatter plot!) displayed in Figure 8, we could get more insight from an interactive visualization. We need interactivity because big datasets generally have different structures at many different levels (in the case of this dataset, Norway shows a much richer structure and significantly more data points).
We have created such an interactive visualization, where one can zoom to any level in the map and the nonlinear transformation is applied to the part of the dataset the zoom level is focused on. In this way we can, for example, discover navigation routes at a global and at a national level. Visualizations like this allow industry experts to ask questions of their data in an interactive way, without over-complicating things.
When we used our interactive visualization to zoom into Norway, we obtained the result presented in Figure 9. Even from this sort of scatter plot, most of us can recognize the structure present in the figure. A few interesting patterns arise, such as navigation routes. If one zooms into the fjords, one starts discovering what are obviously ferry routes.
Figure 9. Result of zooming into Norway in the interactive visualization of 1.2 billion points, after a nonlinear data transformation.
Figure 10. Left: Map of pixels in the 90th percentile by event count. Right: Map of pixels in the 99th percentile by event count.
The creation of an interactive plot is the first step towards understanding and discovering structures within big datasets. For example, in Figure 10 we present the 90th and 99th percentiles of the data, with focus on Norway. In this way we can easily visualize:
- Maritime routes
- Ports (Origins – Destinations)
- Clustering in hot spots
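A percentile view like the one in Figure 10 can be sketched by thresholding the per-pixel counts. The grid below is synthetic and the variable names are our own:

```python
import numpy as np

# synthetic per-pixel event counts (the real grid comes from the AIS data)
rng = np.random.default_rng(2)
counts = rng.integers(0, 10_000, size=(90, 45))

# keep only the pixels above the 99th percentile of the non-empty pixels
threshold = np.percentile(counts[counts > 0], 99)
hot_spots = counts >= threshold
```

Plotting only the `hot_spots` mask strips away the low-traffic background, leaving the ports and densest routes.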
In a future blog post, we will explore the use of deep learning techniques to perform more sophisticated analysis of AIS datasets. We will also present some of the potential applications of such analysis that can benefit the maritime industry. Until then, happy commute!