Big Data - Demo - Stratebi

Power BI Dashboard with Kylin

Technical Details
Use Case

With this demo we aim to show the effective combination of Apache Kylin, an analytical engine on top of a Hadoop cluster, and Power BI, a tool for visualizing Big Data sources in a very intuitive and simple way.

The data describe the historical academic performance of a large university, with more than 100 million rows.

With this data source we have created a dashboard using some of the charts offered by Power BI. You can explore it in detail below.

Information

In this use case we have combined Apache Kylin and Power BI to support interactive data analysis (OLAP) and to develop a dashboard from a data source with Big Data features (Volume, Velocity, Variety).

The data source contains the last 15 years of academic data from a large university. On top of this data source we have designed a multidimensional model with the aim of analyzing students' academic performance. We have stored about 100 million rows in our Data Warehouse, with metrics such as credits, passed subjects, etc. The analysis of these facts is based on dimensions such as gender, qualification, date, time or academic year.
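To make the multidimensional model more concrete, here is a minimal sketch of a star schema and a typical OLAP query over it. It uses Python's built-in sqlite3 module instead of Hive, and every table name, column name and data value is a hypothetical placeholder, not the schema used in the demo.

```python
import sqlite3


def build_demo_warehouse():
    """Create a tiny in-memory star schema (hypothetical names and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, gender TEXT);
        CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, academic_year TEXT);
        CREATE TABLE fact_enrollment (
            student_id INTEGER REFERENCES dim_student(student_id),
            year_id INTEGER REFERENCES dim_year(year_id),
            enrolled_credits REAL,
            passed_credits REAL
        );
    """)
    conn.executemany("INSERT INTO dim_student VALUES (?, ?)",
                     [(1, "F"), (2, "M"), (3, "F")])
    conn.executemany("INSERT INTO dim_year VALUES (?, ?)",
                     [(1, "2014/15"), (2, "2015/16")])
    conn.executemany(
        "INSERT INTO fact_enrollment VALUES (?, ?, ?, ?)",
        [(1, 1, 60, 54), (2, 1, 60, 30), (3, 1, 48, 48),
         (1, 2, 60, 60), (2, 2, 54, 42)],
    )
    return conn


def credits_by_year_and_gender(conn):
    """Aggregate the fact table along two dimensions (year and gender)."""
    return conn.execute("""
        SELECT y.academic_year, s.gender,
               SUM(f.enrolled_credits) AS enrolled,
               SUM(f.passed_credits) AS passed
        FROM fact_enrollment f
        JOIN dim_student s ON s.student_id = f.student_id
        JOIN dim_year y ON y.year_id = f.year_id
        GROUP BY y.academic_year, s.gender
        ORDER BY y.academic_year, s.gender
    """).fetchall()


if __name__ == "__main__":
    for row in credits_by_year_and_gender(build_demo_warehouse()):
        print(row)
```

The fact table holds only keys and additive metrics, while the descriptive attributes live in the dimension tables. This is the shape of query that an OLAP engine has to answer quickly at a much larger scale.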

However, this data volume is too large to be analyzed interactively with traditional OLAP database systems. To address this issue, we decided to try Apache Kylin, a new technology that promises sub-second interactive queries over fact tables with billions or even trillions of rows.

The Apache Kylin architecture is based on two technologies from the Hadoop stack: Apache Hive and Apache HBase. First, we have to implement the Data Warehouse (DW) in a Hive database using a star or snowflake schema. Once this data model is in place, we can define an OLAP cube in Kylin. To this end, we also have to define a Kylin cube model using the wizard in Kylin's GUI. From that point on, Kylin can generate the MOLAP cube automatically. After the cube has been created, we can query it using SQL or connect to it from a BI tool using the available JDBC/ODBC connectors.

To explore the data and generate visualizations that allow users to extract useful knowledge from it, we have chosen the Microsoft Power BI tools: Power BI Desktop and Power BI Service (free version).

Power BI Desktop is a completely free self-service BI desktop tool that enables users to create professional dashboards easily by dragging and dropping data concepts and charts onto a new dashboard. Using this tool we have developed a dashboard similar to those in our use cases with Tableau or Apache Zeppelin.

Once the dashboard was designed, we published it on the web with the Power BI cloud service (free edition). In order to do that, we have to create an extract of the data and upload it together with the dashboard. This process is transparent to users, who can also schedule the data refresh frequency using the Pro or Premium versions of the Power BI service (commercial tools).

For more information, you can access the TodoBI blog post.

Information

Developed by eBay and later released as an Apache open source project, Kylin is an open source analytical middleware that supports OLAP analysis of large volumes of information with Big Data characteristics (Volume, Velocity and Variety).

Until Kylin appeared, however, OLAP technologies were limited to relational databases, in some cases optimized for multidimensional storage, which have serious limitations when dealing with Big Data.

Apache Kylin, built on top of several technologies from the Hadoop environment, offers an SQL interface for querying data sets for multidimensional analysis, achieving response times of a few seconds over data sets of more than 10 million rows.

There are two key technologies for Kylin: Apache Hive and Apache HBase.
The Data Warehouse is based on a star schema stored in Apache Hive.
Using this schema and a metadata model definition, Kylin builds a multidimensional MOLAP cube in HBase.
After the cube is built, users can query it using an SQL-based language through its JDBC driver.

When Kylin receives an SQL query, it decides whether the query can be resolved using the MOLAP cube in HBase (in milliseconds). If it cannot, Kylin builds its own query and executes it against the data stored in Apache Hive, although this path is rarely used.
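The idea of precomputing a cube and routing queries to it can be sketched in a few lines of Python. This is only an illustration of the general MOLAP technique, not Kylin's actual implementation: the cube stored in HBase is replaced by a dictionary, the Hive fallback by a scan over a list, and all names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; "degree" is deliberately left out of the cube below.
FACTS = [
    {"gender": "F", "academic_year": "2014/15", "degree": "Law", "passed_credits": 54},
    {"gender": "M", "academic_year": "2014/15", "degree": "Physics", "passed_credits": 30},
    {"gender": "F", "academic_year": "2015/16", "degree": "Law", "passed_credits": 60},
    {"gender": "M", "academic_year": "2015/16", "degree": "Physics", "passed_credits": 42},
]


def aggregate(facts, dims, measure):
    """Sum the measure grouped by the given dimensions."""
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def build_cube(facts, dimensions, measure):
    """Pre-aggregate the measure for every subset of dimensions (every cuboid)."""
    cube = {}
    ordered = sorted(dimensions)
    for size in range(len(ordered) + 1):
        for dims in combinations(ordered, size):
            cube[dims] = aggregate(facts, dims, measure)
    return cube


def query(cube, facts, group_by, measure):
    """Answer from the cube if a matching cuboid exists, else scan the raw facts."""
    dims = tuple(sorted(group_by))
    if dims in cube:
        return cube[dims], "cube"
    return aggregate(facts, dims, measure), "scan"


if __name__ == "__main__":
    cube = build_cube(FACTS, ["gender", "academic_year"], "passed_credits")
    print(query(cube, FACTS, ["academic_year"], "passed_credits"))
    print(query(cube, FACTS, ["degree"], "passed_credits"))
```

With n dimensions the cube holds 2^n cuboids, so build time and storage grow quickly, while a query that hits a cuboid becomes a simple lookup. That trade-off is what makes interactive response times possible on large fact tables.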

Because Kylin has a JDBC driver, we can connect it to the most popular BI tools, such as Tableau, or to any framework that uses JDBC.

For more information, you can access the TodoBI blog post.

Information

Power BI is a set of Business Intelligence (BI) tools created by Microsoft. Thanks to their simplicity and power, these emerging tools are becoming a leading BI technology alongside others such as Tableau, Pentaho or MicroStrategy. Like these technologies, Power BI is a self-service BI tool, extremely simple but with many powerful features, including: dashboard development (dashboards are called reports in Power BI), sharing on the web and within an organization, collaborative work, dozens of powerful charts (e.g. the line chart with forecasting on page 2 of our demo), connections to relational and Big Data sources, natural language Q&A, execution and visualization of R statistical programs, and data preprocessing (ETL).

These features are implemented across the different tools of the Power BI suite. Power BI Desktop is a desktop tool for data discovery, transformation and visualization. It is a completely free tool with connectors to the most widely used relational and Big Data sources. Although there are specific connectors for some data sources, with Apache Kylin we have to use the ODBC connector available on the Apache Kylin web page. In this way we connect to Kylin, and Power BI automatically generates a data extract from the data source. At this point we can create our demo visualization as follows: i) define the data model, ii) apply some data transformations if needed (e.g. date format), iii) generate calculated metrics (e.g. student success rate), and iv) create the dashboard visualization with one or more pages (e.g. our demo has two pages, which can be switched with the selector in the bottom bar).

At this point, we used Power BI Service (cloud) to publish our new dashboard on the web together with the data extract. To this end, we created a free Power BI account. There are also commercial Pro and Premium editions with additional features, such as automatic refreshing of data extracts and direct connections to some data sources such as SQL Server (and Analysis Services), Oracle or Cloudera Impala. However, none of these direct connectors supports Apache Kylin, so with Kylin we have to rely on data extracts and extract refreshing.

In addition to Power BI Desktop and Power BI Service (Free, Pro and Premium), there are other Power BI tools such as Power BI Mobile (access to dashboards from a smartphone and collaborative work) or Power BI Embedded (to use visualizations in ad hoc apps, web portals, etc.).

If you are interested in implementing your company's BI project with Power BI, do not hesitate to contact us at StrateBI.

For more information, you can access the TodoBI blog post.

Information

As the Big Data source, we have generated academic data for the last 15 years of a university with more than a million students.

In the Data Warehouse we have 100 million rows with metrics such as sum of credits, passed subjects, failed subjects or enrolled subjects.

There are also derived metrics, such as performance rate and success rate, calculated from the relation between passed credits and enrolled credits.
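As an illustration of such a derived metric, the sketch below computes a rate as passed credits divided by enrolled credits. The exact formulas used in the demo are not given here, so the function name, the formula and the sample figures are assumptions.

```python
def credit_rate(passed_credits, enrolled_credits):
    """Ratio of passed to enrolled credits; None when nothing was enrolled."""
    if enrolled_credits == 0:
        return None
    return passed_credits / enrolled_credits


def group_rate(rows):
    """Rate for a group of (enrolled, passed) rows.

    The sums are aggregated first and divided afterwards, so the result is
    weighted by credits instead of being an average of per-row ratios.
    """
    enrolled = sum(e for e, _ in rows)
    passed = sum(p for _, p in rows)
    return credit_rate(passed, enrolled)


if __name__ == "__main__":
    # Hypothetical (enrolled_credits, passed_credits) pairs for one group.
    print(group_rate([(60, 54), (60, 30)]))
```

Computing the ratio after aggregation matters in a cube: the additive sums can be precomputed for every combination of dimensions, and the rate is then derived from them at query time.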

For more information, you can access the TodoBI blog post.

Used Technologies

Apache Hadoop is the most popular Big Data environment. It enables distributed computing on clusters of low-cost commodity hardware.

The basic, default configuration of a Hadoop cluster includes distributed data storage (HDFS), a resource manager (YARN, Yet Another Resource Negotiator) and, running on top of the latter, the MapReduce framework, which performs the distributed processing of data.
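The MapReduce processing model can be sketched in plain Python as three phases: map, shuffle and reduce. This single-process example only mimics the programming model. It does not use Hadoop, and the record layout and figures are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for academic_year, passed_credits in records:
        yield academic_year, passed_credits


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    records = [("2014/15", 54), ("2014/15", 30), ("2015/16", 60), ("2015/16", 42)]
    print(run_job(records))
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle step moves data between them over the network.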

Besides these components, there is another set of higher-level tools for storing and processing data, such as Hive or Spark. They offer abstractions that simplify development for this environment.

As mentioned before, Hadoop is the most popular Big Data environment because it offers a wide range of technologies and a very high level of robustness. It is well suited to the Data Lake concept, where data is stored for later analysis with powerful BI tools.

Flume is a distributed and reliable system for the efficient collection, aggregation and processing of streaming data.

Kafka is a distributed messaging system that uses the publish-subscribe pattern. It is fault tolerant and horizontally scalable, and is ideal for stream data processing.
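The publish-subscribe pattern itself can be shown with a small in-memory sketch. This is not the Kafka API: the class, its methods and the topic name are hypothetical, and distribution, persistence and fault tolerance are left out entirely.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy broker: one append-only log per topic, one read offset per subscriber."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic log; publishers do not know the readers."""
        self._logs[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not read yet."""
        log = self._logs[topic]
        start = self._offsets[(topic, subscriber)]
        self._offsets[(topic, subscriber)] = len(log)
        return log[start:]


if __name__ == "__main__":
    broker = InMemoryBroker()
    broker.publish("enrollments", {"student_id": 1, "credits": 60})
    broker.publish("enrollments", {"student_id": 2, "credits": 54})
    print(broker.poll("enrollments", "dashboard"))
    print(broker.poll("enrollments", "dashboard"))
    print(broker.poll("enrollments", "archiver"))
```

Because each subscriber keeps its own offset, several consumers can read the same topic independently without affecting each other.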

To make the installation, management and maintenance of a Hadoop cluster easier, we work with two main Hadoop distributions.

A Hadoop distribution is a software package that includes the basic components of Hadoop plus other technologies, frameworks and tools, and that can be installed through a web application.

At Stratebi we recommend using a Hadoop distribution, with Hortonworks and Cloudera currently being the leading distributions on the market. For this reason, our demo runs on both a Cloudera distribution and a Hortonworks distribution.

Spark implements the MapReduce programming paradigm, making intensive use of RAM instead of disk.

Using Spark, we can improve the performance of MapReduce applications by implementing iterative algorithms, machine learning (MLlib), statistical analysis (R module) or real-time analytics (Spark Streaming). All of this is included in our demo.
