608.294.5460
634 W Main St., Ste 201
Madison, WI  53703


Zeppelin: Data Analysis For the People

As data becomes more prevalent across industries, there is a growing need for tools that democratize data analysis and make it easier to collaborate on analysis projects. One such tool is Apache Zeppelin, a web-based, open-source data analytics notebook. Zeppelin supports multiple languages within the same notebook, runs its calculations on top of Apache Spark, and includes a considerable set of built-in data visualization tools. Notebooks can be exported and shared as JSON, and multiple parties can work on them collaboratively through GitHub. Visuals can also be published as HTML to a website.

In the following example, I demonstrate some of Zeppelin's functionality by analyzing bus trip data from Madison Metro, our local municipal bus service. The screenshots show the code and results of a simple analysis that calculates basic statistics on Madison Metro bus trips and the associated bus stop and ridership data.


 

Using Zeppelin to Analyze Madison Metro Bus Data

The screenshot below shows how I loaded the data as a text file into Spark using Scala. The code maps the data to fields in a DataFrame and registers the DataFrame as a table. Once the data is a table in Spark, standard SQL can be run against it, even though it has never been loaded into a database.

[Screenshot: bustrips load]
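Since the screenshot may not be visible here, the following is a minimal sketch of what such a load step can look like in Scala. It is not the notebook's actual code: the `BusTrip` fields, the comma delimiter, and the two sample lines written to a temporary file are all assumptions made so the sketch runs on its own. It is written for a notebook or `spark-shell` style session using the Spark 2+ API.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical record layout; the real file's columns may differ.
case class BusTrip(tripId: String, routeId: String, serviceDay: String)

// In a Zeppelin %spark paragraph a session already exists and getOrCreate() reuses it.
val spark = SparkSession.builder().appName("bustrips-load").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up sample lines standing in for the real Madison Metro text file.
val path = Files.createTempFile("bustrips", ".txt")
Files.write(path, "t1,80,WEEKDAY\nt2,56,WEEKDAY\n".getBytes(StandardCharsets.UTF_8))

// Read the raw text, split each line into fields, and map the fields onto the case class.
val busTrips = spark.read.textFile(path.toString)
  .map(_.split(","))
  .filter(_.length >= 3)
  .map(f => BusTrip(f(0).trim, f(1).trim, f(2).trim))
  .toDF()

// Register the DataFrame under a table name so SQL can be run against it.
busTrips.createOrReplaceTempView("bustrips")
```

On older Spark 1.x installations the equivalent calls would go through `sqlContext` and `registerTempTable` instead.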

 

The image below shows the query and the resulting graph of the number of trips for each bus route. Unsurprisingly, bus route 80, a free route that serves the UW campus, has the highest number of trips in the analyzed result set.

[Screenshot: busroute count]
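A trips-per-route query of this kind can be sketched as follows. The table and column names (`bustrips`, `routeId`) and the three sample rows are assumptions for illustration, not the real Madison Metro data or the query from the notebook.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin, getOrCreate() returns the session the notebook already provides.
val spark = SparkSession.builder().appName("trips-per-route").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows: two trips on route 80, one on route 56.
Seq(("t1", "80"), ("t2", "80"), ("t3", "56"))
  .toDF("tripId", "routeId")
  .createOrReplaceTempView("bustrips")

// Count trips per route, busiest route first.
val tripsPerRoute = spark.sql(
  """SELECT routeId, COUNT(*) AS trips
    |FROM bustrips
    |GROUP BY routeId
    |ORDER BY trips DESC""".stripMargin)

tripsPerRoute.show()
```

In Zeppelin, the same SQL can be placed in a `%sql` paragraph, where the result can be switched between table and chart views using the built-in visualization controls.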

 

Next, I load a second text file containing bus trip stop and ridership data, using the same approach as in the first load above:

[Screenshot: tripdistance load]

 

Finally, this screenshot shows a SQL query that joins the first data file with the second to find the longest bus route (route 56), along with the number of stops on the route, the total distance the route covers, and the average distance between its stops.

[Screenshot: longestbusroute]
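One way such a join could be written is sketched below. Everything about the schema is an assumption: the `tripstops` table, its `distanceFromPrev` column (distance from the previous stop, 0.0 for the first stop), and the sample rows are invented for illustration, and "average distance between stops" is taken here to mean total distance divided by the number of gaps between stops.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("longest-route").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up trips: one trip each on routes 80 and 56.
Seq(("t1", "80"), ("t3", "56"))
  .toDF("tripId", "routeId")
  .createOrReplaceTempView("bustrips")

// Made-up stops: (tripId, stopId, distance from the previous stop).
Seq(
  ("t1", "s1", 0.0), ("t1", "s2", 0.5),
  ("t3", "s7", 0.0), ("t3", "s8", 1.5), ("t3", "s9", 2.5)
).toDF("tripId", "stopId", "distanceFromPrev")
  .createOrReplaceTempView("tripstops")

// Join trips to their stops, aggregate per trip, and keep the longest one.
val longest = spark.sql(
  """SELECT t.routeId,
    |       COUNT(*) AS stops,
    |       SUM(s.distanceFromPrev) AS totalDistance,
    |       SUM(s.distanceFromPrev) / NULLIF(COUNT(*) - 1, 0) AS avgDistanceBetweenStops
    |FROM bustrips t
    |JOIN tripstops s ON t.tripId = s.tripId
    |GROUP BY t.routeId, t.tripId
    |ORDER BY totalDistance DESC
    |LIMIT 1""".stripMargin)

longest.show()
```

The `NULLIF` guard avoids dividing by zero for a trip with a single stop; with real data, the grouping and the distance definition would need to match how the source files actually record stops.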

 

This Apache Zeppelin notebook demonstration is drawn from a presentation I gave at the BigDataMadison meetup. You can download the notebook and presentation here: https://github.com/Pshrub/bigdatamadison_spark

Try it for yourself! Any local data set will do, or replicate this example using Madison Metro Bus Data.  
 

Earthling Interactive