
Using Elasticsearch, Kibana, and Python to easily navigate (and visualize) lots of data

Johnny Dunn

Elasticsearch is renowned as an extremely robust, fast, all-in-one solution for data storage, search, and analytics.

Elasticsearch is open-source and highly scalable, and is built on top of Apache Lucene (Java). It is document-oriented and, like MongoDB and other NoSQL databases, works with JSON. Elasticsearch also works very nicely with Kibana, an open source data visualization and analytics platform designed specifically for Elasticsearch. Kibana offers a suite of tools in a browser-based dashboard with powerful graphing and statistical capabilities.

This guide details: (1) the steps to set up Elasticsearch and Kibana locally on your machine (Windows or Mac / Unix), (2) how to move large amounts of data from a CSV source into Elastic’s tools using a scripting language like Python, and (3) how to create meaningful visualizations from that data in Kibana.

Once our data is in place in Elasticsearch, you’ll see how Kibana makes it easy to understand and visualize the patterns and trends in it.

Important prerequisite: Get Java

If you don’t have Java installed on your system, you’ll need to do so before you can use Elasticsearch. You’ll need the JDK (Java Development Kit), which you can download from here.

You should be able to install Java on your OS directly using one of the files downloaded.

1. Installing Elasticsearch

Quick Install:

Elasticsearch has an installer (MSI) for Windows with options laid out nicely in a GUI, which you can download and install here:

If you’re on a Mac, then you can also easily install Elasticsearch with Homebrew:

$ brew update
$ brew install elasticsearch

Manual Install:

Let’s also go through the steps for a manual installation of Elasticsearch, which will work on any platform, including Linux distributions.

First, you should download the corresponding package you need from here.

For Windows, this will give you a zip file. For Mac and Linux distributions (including Debian), you’ll be downloading a .tar.gz file. Copy / move this file to your preferred directory of installation (I chose C:/).

On Unix systems, you’ll run one of these commands to extract the .tar.gz file:

Linux:

$ tar -xvf elasticsearch-7.0.0-linux-x86_64.tar.gz

Mac:

$ tar -xvf elasticsearch-7.0.0-darwin-x86_64.tar.gz

On Windows, there are unfortunately no built-in tools to extract zip files from the command line. If you really want to work that way, you can download a free external tool such as 7-Zip or pkunzip. Or you can just right-click on the zip file, of course, and extract its contents to the directory!

When that’s done, your path to the Elasticsearch installation should look like this:

$ /FAKEPATH/elasticsearch-7.0.0

Adding Java and Elasticsearch to your path

Now you’ll need to set the JAVA_HOME environment variable and add it to your path, which you can do on Unix by modifying your .bash_profile file.

$ nano ~/.bash_profile

Add this to .bash_profile for Java:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_18164/Contents/Home

Add this for Elasticsearch:

export ES_HOME=~/FAKEPATH/elasticsearch-7.0.0

Finally, add this so those vars will be added to your path:

export PATH=$ES_HOME/bin:$JAVA_HOME/bin:$PATH

For Windows, we can add vars to our path using System Properties, which you can find in the Control Panel.

Click Environment Variables, and you’ll see a list of your System Variables, including Path.

Under System Variables, click Add New, and add variables for JAVA_HOME as well as ES_HOME.

Now, double-click on Path in System Variables, and click Add New.

Add both %JAVA_HOME%\bin and %ES_HOME%\bin to the Path and click OK.

Open a new terminal (or Command Prompt) window, and both of these commands should now be recognized:

$ java
$ elasticsearch

Go ahead and start Elasticsearch, which you can do with either:

$ ./bin/elasticsearch

(or .\bin\elasticsearch.bat on Windows) while you’re in Elasticsearch’s installation directory, or with:

$ elasticsearch

since you’ve added it to your path.
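If you want to double-check that the node is actually up before moving on, a request to http://localhost:9200 should return a small JSON document with the cluster name and version. Here’s a minimal sketch using only the Python standard library (your browser or any HTTP client works just as well):

import json
import urllib.request

# Elasticsearch answers on port 9200 by default
response = urllib.request.urlopen("http://localhost:9200")
info = json.loads(response.read())
print(info["version"]["number"])  # should print something like "7.0.0"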

Installing Kibana

Now that we have Elasticsearch up and running, let’s install Kibana, which will allow us to visualize our Elasticsearch data and navigate it in a highly functional, refined dashboard.

Kibana dashboard - image source: https://www.elastic.co/products/kibana

Kibana is also provided by Elastic, so fortunately its installation process is exactly like the one for Elasticsearch.

Download the corresponding package for your system here:

Again, on Unix, you can use the command line to extract the tar.gz files:

Linux:

$ tar -xvf kibana-7.0.0-linux-x86_64.tar.gz

Mac:

$ tar -xvf kibana-7.0.0-darwin-x86_64.tar.gz

On Windows, right-click and extract the zip file into your desired directory, or use one of the third-party CLI tools.

You can then open up `config/kibana.yml` in a text editor and set `elasticsearch.hosts` to point to your Elasticsearch instance. By default, it’s set to http://localhost:9200, the default address of an Elasticsearch instance running locally, so you don’t need to change anything here.

While you’re in the Kibana directory, you can run:

$ bin/kibana 

On Windows, that command would be:

$ bin\kibana.bat

to start running Kibana locally, on port 5601 (by default).

Once you do all this, if you go to http://localhost:5601 in your browser, you’ll see this:

2. Importing data

Elastic offers us some sample datasets that can be imported by Elasticsearch and navigated with Kibana. You can get started with exploring this data within minutes, and the content includes randomly generated logs, randomly generated user accounts, and the complete works of William Shakespeare!

But let’s figure out how we can get our own data in for analysis. Right now we don’t actually have any data in Elasticsearch, so our Kibana dashboard is totally empty, but we’re going to change that!

Elasticsearch can function as both your database and as a means of providing search and analytics. For this tutorial, we won’t be getting into any of the searching features that Elasticsearch offers in great detail, but you should know that things like autocomplete and fuzzy matching can be done easily with Elasticsearch.
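To give you a taste, here’s a rough sketch of what a fuzzy match query looks like through the official Python client (the articles index and its title field are made up for illustration; we install the client a little later in this tutorial):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical example: match documents whose title roughly resembles the
# (misspelled) search term, tolerating a couple of character edits.
results = es.search(
    index="articles",
    body={
        "query": {
            "match": {
                "title": {
                    "query": "elasticserch",  # note the typo
                    "fuzziness": "AUTO"
                }
            }
        }
    }
)

for hit in results["hits"]["hits"]:
    print(hit["_source"]["title"])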

CSV (comma separated value) files are a de facto standard for storing records pretty much universally. They’re easy to organize, modify, and share with other people, at least up until the point you start having gigabytes and gigabytes of data. When you do have data at that scale, you’re probably going to want to shift away from spreadsheets and move to a more robust and developer-friendly solution.

We can go to https://www.data.gov/, a source of open data provided by the United States government, to find a plethora of information in CSV format.

This one seems straightforward and interesting enough: The 2010 Census Populations by Zip Code. Note that this data contains only zip codes found in Los Angeles, California.

Downloading the CSV file and opening it in a spreadsheet looks like this:

Sure, it’s only 320 records, which isn’t really that much information to look at and traverse, but if we were to load this data into Kibana, we’d be able to explore this information much more intuitively and quickly using its tools.

So let’s get these CSV records into our Elasticsearch instance. We’ll be using Python for this, but any scripting language is suitable for this use case.

Our Python script will be a lot easier to write if we use the client provided by Elastic, which can be installed with pip.

$ pip3 install elasticsearch

Here’s what the Python script will look like:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()

with open('./2010_Census_Populations_by_Zip_Code.csv') as f:
    index_name = 'census_data_records'
    doctype = 'census_record'
    reader = csv.reader(f)
    headers = []
    index = 0

    # Recreate the index from scratch on every run
    es.indices.delete(index=index_name, ignore=[400, 404])
    es.indices.create(index=index_name, ignore=400)

    # Map every column to a float so Kibana can run numeric aggregations on it
    es.indices.put_mapping(
        index=index_name,
        doc_type=doctype,
        ignore=400,
        body={
            doctype: {
                "properties": {
                    "Zip Code": {"type": "float"},
                    "Total Population": {"type": "float"},
                    "Median Age": {"type": "float"},
                    "Total Males": {"type": "float"},
                    "Total Females": {"type": "float"},
                    "Total Households": {"type": "float"},
                    "Average Household Size": {"type": "float"}
                }
            }
        }
    )

    for row in reader:
        try:
            if index == 0:
                # The first row holds the column headers
                headers = row
            else:
                obj = {}
                for i, val in enumerate(row):
                    obj[headers[i]] = float(val)
                print(obj)
                # Put the document into Elasticsearch
                es.index(index=index_name, doc_type=doctype, body=obj)
        except Exception as e:
            print('error: ' + str(e) + ' in row ' + str(index))
        index = index + 1

The CSV file should be in the same directory as the script. If you’re using a different CSV data source, you should also change the script to match the name of that file.

Let’s explain what the script does.

In Elasticsearch, an index can be thought of as roughly analogous to a database in a traditional relational system. Documents, which are the objects holding the data, are indexed, meaning they are stored and made searchable under that index. The doc_type is not strictly required, since it defaults to _doc; however, defining it allows us to keep our schema more easily maintainable.
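To make that terminology concrete, here’s a tiny, self-contained sketch (the example_records index and its one document are made up for illustration) of indexing a single document and reading it back by ID:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Store one document under the "example_records" index (doc_type defaults to _doc)
doc = {"message": "hello", "count": 1}
es.index(index="example_records", id=1, body=doc)

# Retrieve the same document by its ID
stored = es.get(index="example_records", id=1)
print(stored["_source"])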

The import script takes all the data from the CSV file and imports it into Elasticsearch record by record. The headers of the CSV file become the field names of each document, and all the values are converted to floats (just as a general catch-all for safety and simplicity, since most of the data only needs to be integers anyway).

Before we save the CSV records, though, we have to map the Elasticsearch index (with es.indices.put_mapping) so our data attributes are stored with the variable types we want. Otherwise, the data wouldn’t be stored as numerical values within Elasticsearch, and we wouldn’t be able to perform calculations and statistical analysis on it in Kibana.
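As an aside, indexing one document per request is perfectly fine for 320 rows, but for much larger files the helpers module we imported (and never actually used above) can batch everything into a handful of bulk requests. Here’s a sketch of what that could look like, reusing the same index and mapping:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()
index_name = 'census_data_records'
doctype = 'census_record'

def generate_actions():
    with open('./2010_Census_Populations_by_Zip_Code.csv') as f:
        # DictReader uses the header row as keys, so no manual bookkeeping
        for row in csv.DictReader(f):
            yield {
                "_index": index_name,
                "_type": doctype,
                "_source": {key: float(val) for key, val in row.items()}
            }

# Send the documents in large batches instead of one request per row
helpers.bulk(es, generate_actions())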

When the script has finished running, you’ll need to create a new index pattern in Kibana that matches the index you’ve defined, so that Kibana knows where to look in Elasticsearch.

When that’s done, you’ll be able to explore your data in Kibana like this!

Now you can go to the Visualizations section and add new visualizations and dashboards!

3. Exploring the data and building meaningful visualizations

It might take you some time to explore all of Kibana’s visualization tools and how to use them, but as you’ll see, we can easily create lots of interactive graphs to display the important patterns and trends in our data.

The graph below shows the top twenty zip codes in Los Angeles with the highest median age:

And the graph below shows the top twenty zip codes with the lowest median age:

You can see that the lowest median age in a populated zip code is about 20, while the highest median age in the same city is in the early 70s.
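If you’d rather confirm those numbers programmatically than read them off a chart, a min/max aggregation over the same index does the trick. A small sketch with the Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Ask for the lowest and highest "Median Age" across every record
result = es.search(
    index="census_data_records",
    body={
        "size": 0,  # we only want the aggregation results, not the documents
        "aggs": {
            "youngest": {"min": {"field": "Median Age"}},
            "oldest": {"max": {"field": "Median Age"}}
        }
    }
)

print(result["aggregations"]["youngest"]["value"])
print(result["aggregations"]["oldest"]["value"])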

Now this graph below shows the top twenty zip codes with the highest median age graphed against the average household size of those zip codes.

Finally, this graph shows the converse of the above—the top twenty zip codes with the lowest median age, graphed against their average household sizes.

Notice anything interesting about these results, when comparing them side-by-side?

One conclusion is clear from these graphs: the older a zip code’s population is, the smaller its average household size!

If we think this through, it makes a lot of sense. Although we might initially assume that older families are more likely to settle down and have children (which would correlate with a larger household size), we also have to consider what happens when those children grow up and move out! When that happens, the median age of a household will be higher than that of new families just starting to raise kids. So the results we see actually fit well with long-standing socioeconomic trends.

With Kibana’s visualizations, oddities in the data clearly stick out. Let’s take a look at one data point that seems special: zip code 90822, which has a staggering average household size of 4.5, almost twice that of other zip codes with similar median ages in the same area.
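You can also pull that exact record back out of Elasticsearch to double-check the figure. A quick sketch (the field names are the CSV headers we indexed earlier):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Find the document whose "Zip Code" field equals 90822
result = es.search(
    index="census_data_records",
    body={"query": {"term": {"Zip Code": 90822}}}
)

for hit in result["hits"]["hits"]:
    print(hit["_source"]["Average Household Size"])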

If we simply type 90822 into Google Maps to get an initial look at the area’s location and geography, we can immediately ascertain the reason for this discrepancy.

This zip code has a pediatric medical center right in the middle of it! So naturally, families with young children will gravitate towards this area.

So you see, using Elasticsearch and Kibana in conjunction allows us to easily glean new insights from massive amounts of data, with a surprisingly small amount of code and initial configuration!

We hope you can find some useful things to do with your newfound knowledge! Have you done anything exciting or discovered any interesting conclusions with Elasticsearch, Kibana, or data analytics in general? Reach out on Twitter or let us know in the comments below!