With this demo we pretend to show, the effective combination of using Apache Kylin, an analytical engine on top of a Hadoop Cluster, and Tableau, a tool for visualizing Big Data Sources in a very intuitive and simple way,
The used data are about the historic academic performance in an big university, with about (>100 millions rows).
With this data source, we have created a Dashboard using some graphs offered by Tableau, you can explore it in detail, below.
In this use case we have used together Apache Kylin, Tableau Desktop and Tableau Public to support interactive data analysis (OLAP) and developing a dashboard, from data source with Big Data features (Volume, Speed, Variety).
The data source contains the last 15 years of academic data from a big university. Over this data source, we have designed a multidimensional model with the aim of analyze student?s academic performance. We have stored in our Data Warehouse about 100 million rows, with metrics like credits, passed subjects, etc. The analysis of these facts is based on dimensions like gender, qualification, date, time or academic year.
However this data volume is too large to be analyzed using traditional database systems for OLAP interactive analysis. To address this issue, we decide to try Apache Kylin, a new technology that promises sub second interactive queries for data Volumes over billions and trillion of rows on the fact table.
Apache Kylin architecture is based on two Hadoop stack technologies: Apache Hive and HBase. First, we have to implement the Data Warehouse (DW) on Hive database using a star or a snow flake schemas. Once we have implemented one of these data models, we can define an OLAP cube on Kylin. To this end, we have also to define a Kylin?s cube model using Kylin?s GUI with wizard. At this moment, Kylin can generate the MOLAP cube in an automatic process. After cube creation, we can query the OLAP cube using SQL queries or connecting to a BI tool using the available J/ODBC connectors.
With aim to explore the data and generate visualizations that allows users to extract useful knowledge from data, we have chosen Tableau BI tools: Tableau Desktop and Tableau Public.
Tableau Desktop is a commercial desktop self-service BI tool that enable users to create professional dashboards easily, dragging and dropping data concepts and charts to a new dashboard. Using this tool we have developed a dashboard, similar to our use cases with Power BI or Apache Zeppelin. Moreover, in order to show you our demo developing process, we have prepared a video.
Finally, we have published our demo dashboard using Tableau Public. Tableau Public is a free cloud tool that allow to share our dashboard on the web. However this is limited to use with a data extract that we need to upload together with dashboard. If we require a direct connection to Kylin, we have to use Tableau Server, a commercial tool. Because of our demo dashboard is designed to direct connection with Tableau server, changes on parameters will not have any effect on dashboard.
Developed by eBay and later released as Apache Open Source Project, Kylin is an open source analytical middle ware that supports the support analysis OLAP of big volumes of information with Big Data charactertistics, (Volume, Speed, and Variety).
But nevertheless, until Kylin appeared in the market, OLAP technologies was limited to Relational Databases, or in some cases optimized for multidimensional storage, with serious limitations on Big Data.
Apache Kylin, builded on top of many technologies of Hadoop environment, offer an SQL interface that allows querying data set for multidimensional analysis, achieving response time of a few seconds, over 10 millios rows.
There are keys technologies for Kylin; Apache Hive and Apache HBase.
The Data Warehouse is based on a Start Model stored on Apache Hive.
Using this model and a definition of a meta-data model, Kylin builds a multidimensional MOLAP Cube in HBase.
After the cube is builded the users can query it, using an SQL based language with its JDBC driver.
When Kylin receives an SQL query, decide if it can be resolved using the MOLAP cube in HBase (in milliseconds), or not, in this case Kylin build its own query and execute it in the Apache Hive Storage, this case is rarely used.
As Kylin has a JDBC driver, we can connect it, to most popular BI tools, like Tableau, or any framework that uses JDBC.
Tableau is a leader Business Intelligence (BI) enterprise software. Tableau has connectors for almost all the existing data sources such as Big Data ones like Apache Kylin (ODBC), Spark SQL, Hive, Impala, HP Vertica, Google Big Query or Redshift, but also has connectors for more traditional data sources such as SQL Server (and Analysis Services) or PostgreSQL. In the following we describe the main tools of Tableau BI software:
Tableau Desktop: Desktop BI tool for data discovery and interactive dashboard design. Belongs to the so called self-service tools due to its simple design that makes easy to any kind of user developing and sharing data visualizations.
Tableau Server: BI server tool that requires installation on our managed infrastructure. Allows dashboards have direct connection to the data sources. This feature is an advantage for Big Data scenarios where we recommend parameterize queries because is not feasible extract all data from source and upload to the cloud. Moreover Tableau Server allows to share dashboards intra organization and control security at fine grained levels (access and data model).
Tableau Online: Similar to Tableau Server but it does not require local infrastructure because it is a cloud service.
Moreover, there are two Tableau free tools:
Tableau Desktop Public: Is a free edition of Tableau Desktop but with some important limitations at data sources. It only has connectors for sources such as Google Sheets, Microsoft Excel, CSV, JSON, statistics (SAS, SPSS o R) o geographical and spatial data (ESRI shape files, KML, and MapInfo). Therefore, we cannot use this tool to connect with data sources like Apache Kylin.
Tableau Public: Tableau Public is a free cloud tool that allow to share our dashboard on the web. However this is limited to use with a data extract that we need to upload together with dashboard.
As Big Data sources, we have generated academic data for last 15 years of an university, we more than a million students.
In the Data Warehouse we have 100 millions rows with metrics like sum of credits, approved subjects, suspended subjects or enrolled subjects.
Also there are derivative metrics, like, performance rate, success rate, calculated based on the relation between aprovved credits and enrolled credits.