Zeppelin: Data Analysis For the People

As data becomes more prevalent across industries, there is a greater need for tools that make data analysis more accessible and more collaborative. One such tool is Apache Zeppelin, a web-based, open-source data analytics notebook. Zeppelin supports multiple languages within the same notebook, sits on top of Apache Spark to run calculations, and includes a substantial set of built-in data visualization tools. Notebooks can be exported and shared as JSON, and multiple parties can work on them collaboratively through GitHub. Visuals can also be published as HTML to a website.

In the following example, I demonstrate some of this functionality by using Zeppelin to analyze bus trip data from Madison Metro, our local municipal bus service. The screenshots show the code and results from a simple analysis that calculates basic statistics on Madison Metro bus trips and the associated bus stop and ridership data.


Using Zeppelin to Analyze Madison Metro Bus Data

The screenshot below demonstrates how I loaded the data as a text file into Spark using the Scala language. The code maps the data to fields in a dataframe and converts the dataframe into a table construct. By converting the data into a table in Spark, standard SQL can be run against the data, even though it has never been loaded into a database.
Madison bus trip analysis code snippet on loading data into a table.
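
For readers without a Spark installation at hand, the load-then-query pattern can be sketched in plain Python. This is only a rough stand-in, not the notebook's Scala code: it uses an in-memory SQLite table where the notebook uses a Spark table, and the column names (route_id, trip_id, direction) and sample rows are hypothetical, since the real file layout is not shown here.

```python
import sqlite3

# Hypothetical sample in the shape of a comma-delimited trips file.
# The real Madison Metro columns are an assumption, not taken from the notebook.
RAW_TRIPS = """\
80,T1001,outbound
80,T1002,inbound
02,T2001,outbound
"""


def load_trips(raw_text):
    """Split each line into fields and expose the rows as a SQL table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Spark temporary table
    conn.execute(
        "CREATE TABLE trips (route_id TEXT, trip_id TEXT, direction TEXT)"
    )
    rows = [
        tuple(line.split(","))
        for line in raw_text.splitlines()
        if line.strip()  # skip blank lines
    ]
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    return conn


conn = load_trips(RAW_TRIPS)
trip_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(trip_count)
```

Once the lines are mapped to named fields, every later step is ordinary SQL, which is the same convenience the Spark table provides in the notebook.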


The image below shows the query and the resulting graph of the number of trips for each bus route. Unsurprisingly, bus route 80, a free route that serves the UW, has the highest number of trips in the analyzed result set.

SQL data query and resulting bar graph.
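
A trips-per-route count of this kind is a plain GROUP BY query. The sketch below shows one way it might look, again using Python's sqlite3 as a stand-in for Spark SQL. The table name, column names, and rows are made up for illustration and are not the notebook's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (route_id TEXT, trip_id TEXT)")

# Hypothetical rows: three trips on route 80, two on route 6, one on route 2.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("80", "a"), ("80", "b"), ("80", "c"), ("2", "d"), ("6", "e"), ("6", "f")],
)

# Count trips per route, busiest route first; route_id breaks ties.
trips_per_route = conn.execute(
    """
    SELECT route_id, COUNT(*) AS trip_count
    FROM trips
    GROUP BY route_id
    ORDER BY trip_count DESC, route_id
    """
).fetchall()

for route_id, trip_count in trips_per_route:
    print(route_id, trip_count)
```

In Zeppelin, the result of a query shaped like this is what the built-in charting tools turn into a bar graph.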



Here I load a second text file, containing bus trip stops and ridership data, following the same steps as in the first example above:

Code snippet on how to find trip distance.


Finally, this screenshot shows a SQL query that joins the first data file with the second to find the longest bus route (route 56), along with the number of stops on the route, the total distance it covers, and the average distance between its stops.

SQL data query with results shown below.
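
The join described above can be sketched roughly as follows, once more with sqlite3 standing in for Spark SQL. Everything specific here is an assumption made for illustration: the stop_times table, the cumulative dist_traveled column, the one-trip-per-route sample rows, and the choice to compute the average gap as total distance divided by the number of intervals (stops minus one). The notebook's own query may define these differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (route_id TEXT, trip_id TEXT)")
conn.execute(
    "CREATE TABLE stop_times (trip_id TEXT, stop_sequence INTEGER, dist_traveled REAL)"
)

# Hypothetical data: one trip per route, with cumulative distance at each stop.
conn.executemany("INSERT INTO trips VALUES (?, ?)", [("56", "p"), ("80", "q")])
conn.executemany(
    "INSERT INTO stop_times VALUES (?, ?, ?)",
    [
        ("p", 1, 0.0), ("p", 2, 4.0), ("p", 3, 9.0), ("p", 4, 12.0),
        ("q", 1, 0.0), ("q", 2, 1.0), ("q", 3, 2.0),
    ],
)

# Join trips to their stops, summarize each route, and keep the longest one.
longest_route = conn.execute(
    """
    SELECT t.route_id,
           COUNT(*) AS stop_count,
           MAX(s.dist_traveled) AS total_distance,
           MAX(s.dist_traveled) / (COUNT(*) - 1) AS avg_gap
    FROM trips AS t
    JOIN stop_times AS s ON s.trip_id = t.trip_id
    GROUP BY t.route_id
    ORDER BY total_distance DESC
    LIMIT 1
    """
).fetchone()

print(longest_route)
```

With real data, a route usually has many trips, so the query would first need to pick or aggregate trips per route before counting stops. The sample sidesteps that by giving each route a single trip.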


This Apache Zeppelin notebook demonstration is drawn from a presentation I gave on Apache Zeppelin. You can download the notebook and presentation here: https://github.com/Pshrub/bigdatamadison_spark

Try it for yourself! Any local data set will do, or replicate this example using Madison Metro Bus Data.