With this demo we pretend to show, the effective combination of using Apache Kylin, an analytical engine on top of a Hadoop Cluster, and Superset, a tool for visualizing Big Data Sources in a very intuitive and simple way,
The used data are about the historic academic performance in an big university, with about (>100 millions rows).
With this data source, we have created a Dashboard using some graphs offered by Superset, you can explore it in detail, below.
In this use case we have used together Apache Kylin and Superset to support interactive data analysis (OLAP) and developing a dashboard, from data source with Big Data features (Volume, Speed, Variety).
The data source contains the last 15 years of academic data from a big university. Over this data source, we have designed a multidimensional model with the aim of analyze student's academic performance. We have stored in our Data Warehouse about 100 million rows, with metrics like credits, passed subjects, etc. The analysis of these facts is based on dimensions like gender, qualification, date, time or academic year.
However this data volume is too large to be analyzed using traditional database systems for OLAP interactive analysis. To address this issue, we decide to try Apache Kylin, a new technology that promises sub second interactive queries for data Volumes over billions and trillion of rows on the fact table.
Moreover, in order to make possible the exploration of the Kylin's cube data using SQL and the creation of dashboards to share with the final users, we have used the Superset tool.
Superset includes connectors to work with Sqlite and Druid but it has some additional packages available to make connections with other data sources. The standard SQLAlchemy allows us to make queries in different data sources if we have installed the pertinent connector. Kylinpy has been used as a connector to allow the usage of SQL queries over Kylin.
Superset includes a query environment (SQL lab) that allows us to develop SQL queries over a data source that can be used to make graphical representations (with the results of each query). Additionally, Superset allows us to create dashboards using the generated graphical representations. Once a dashboard is created, we must manage the access permissions to grant access to each one of the authorized users. In this case we have configured Superset to allow the public access to this dashboard.
Developed by eBay and later released as Apache Open Source Project, Kylin is an open source analytical middle ware that supports the support analysis OLAP of big volumes of information with Big Data charactertistics, (Volume, Speed, and Variety).
But nevertheless, until Kylin appeared in the market, OLAP technologies was limited to Relational Databases, or in some cases optimized for multidimensional storage, with serious limitations on Big Data.
Apache Kylin, builded on top of many technologies of Hadoop environment, offer an SQL interface that allows querying data set for multidimensional analysis, achieving response time of a few seconds, over 10 millios rows.
There are keys technologies for Kylin; Apache Hive and Apache HBase.
The Data Warehouse is based on a Start Model stored on Apache Hive.
Using this model and a definition of a meta-data model, Kylin builds a multidimensional MOLAP Cube in HBase.
After the cube is builded the users can query it, using an SQL based language with its JDBC driver.
When Kylin receives an SQL query, decide if it can be resolved using the MOLAP cube in HBase (in milliseconds), or not, in this case Kylin build its own query and execute it in the Apache Hive Storage, this case is rarely used.
As Kylin has a JDBC driver, we can connect it, to most popular BI tools, like Tableau, or any framework that uses JDBC.
Superset is an open-source visualization tool developed by AirBnb released as an Apache project. Currently, this project is on development due to its recent creation (2016). This tool provides us a wide range of graphical representations to explore and visualize the data allowing the creation of dashboards in an intuitive way. This tool has been developed to work efficiently in personal computers but also in big data clusters.
Moreover, Superset uses SQLAlchemy to facilitate the integration with different relational database managers such as MySQL, PostgeSQL or Oracle. Additionally, Superset can be integrated with some big data tools such as Kylin, Druid or Vertica. In order to make the connection between Superset and this data sources, it is required to install a package which acts as a middleware and to configure the connection using the SQLAlchemy standard.
Superset includes some additional features that we must underline. This tool includes a role-based security authorization system which can be configured using LDAP, Oauth or OpenID as sources. The users of Superset can have specific privileges in terms of access and in terms of creation and modification of data sources, graphs, dashboards, etc.
If you are interested to implement your BI company project with Superset do not hesitate to contact us on StrateBI.
As Big Data sources, we have generated academic data for last 15 years of an university, we more than a million students.
In the Data Warehouse we have 100 millions rows with metrics like sum of credits, approved subjects, suspended subjects or enrolled subjects.
Also there are derivative metrics, like, performance rate, success rate, calculated based on the relation between aprovved credits and enrolled credits.