OLAP Analysis with Kylin and STPivot

With this demo we aim to show the effective combination of Apache Kylin, an analytical engine on top of a Hadoop cluster, and STPivot4, an OLAP viewer developed by Stratebi with interactive analysis capabilities, over a Big Data set.

The data describe the historical academic performance of a large university and comprise more than 100 million rows.

On this data source we have created a multidimensional OLAP view using STPivot4. The user can interact freely with the example and see the usefulness of these tools.

Information

In the use case presented here, we use Apache Kylin and STPivot to enable interactive OLAP analysis of a Data Warehouse that contains data with typical Big Data characteristics (Volume, Velocity, Variety).

The data cover the last 15 years of a large university. We have designed a multidimensional model to analyze academic performance. It holds about 100 million rows, with metrics such as credits, passed subjects and failed subjects. The analysis of these facts is based on dimensions such as sex, qualification, date, time or academic year.

Given such a large volume of data, traditional OLAP systems (R-OLAP and M-OLAP) do not meet the required performance. For this reason we are testing Apache Kylin, which allows response times of a few seconds in the worst case for volumes above 10 billion rows.
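As a minimal sketch of what such a multidimensional model looks like, the following Python snippet builds a tiny star-shaped model in an in-memory SQLite database and aggregates two metrics by two dimensions. The table and column names (`fact_enrollment`, `dim_student`, `dim_year`, `credits_enrolled`, `credits_passed`) and the sample rows are hypothetical; they are not taken from the real Data Warehouse.

```python
import sqlite3

# Hypothetical model: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, academic_year TEXT);
CREATE TABLE fact_enrollment (
    student_id INTEGER REFERENCES dim_student(student_id),
    year_id INTEGER REFERENCES dim_year(year_id),
    credits_enrolled REAL,
    credits_passed REAL
);
INSERT INTO dim_student VALUES (1, 'F'), (2, 'M');
INSERT INTO dim_year VALUES (1, '2015/16'), (2, '2016/17');
INSERT INTO fact_enrollment VALUES
    (1, 1, 60, 54), (2, 1, 60, 42), (1, 2, 60, 60), (2, 2, 48, 48);
""")


def credits_by_sex_and_year(connection):
    """Aggregate the fact table by two dimensions (a typical OLAP query)."""
    query = """
        SELECT s.sex, y.academic_year,
               SUM(f.credits_enrolled), SUM(f.credits_passed)
        FROM fact_enrollment AS f
        JOIN dim_student AS s ON s.student_id = f.student_id
        JOIN dim_year AS y ON y.year_id = f.year_id
        GROUP BY s.sex, y.academic_year
        ORDER BY s.sex, y.academic_year
    """
    return connection.execute(query).fetchall()


rows = credits_by_sex_and_year(conn)
for row in rows:
    print(row)
```

Each fact row stores only keys and numeric metrics, while the descriptive attributes live in the dimension tables. This is what allows the same facts to be grouped by any combination of dimensions.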

Kylin relies on two key technologies: Apache Hive and Apache HBase.
The Data Warehouse is based on a star schema stored in Apache Hive.
Using this model and a metadata model definition, Kylin builds a multidimensional MOLAP cube in HBase.
Once the cube is built, users can query it with an SQL-based language through Kylin's JDBC driver.

Kylin can also be used for OLAP analysis with the MDX language. For this we install STPivot4, an OLAP viewer developed by StrateBI as part of the Lince BI suite.
STPivot4 uses Mondrian as its OLAP engine and can be deployed in a Pentaho BA Server, both of them open source technologies. In this way STPivot lets users create and explore multidimensional views using the OLAP cube defined with Apache Kylin.


Information

Developed by eBay and later released as an Apache open source project, Kylin is an open source analytical middleware that supports OLAP analysis of large volumes of information with Big Data characteristics (Volume, Velocity and Variety).

Until Kylin appeared on the market, however, OLAP technologies were limited to relational databases, in some cases optimized for multidimensional storage, with serious limitations on Big Data.

Apache Kylin, built on top of several technologies from the Hadoop environment, offers an SQL interface for querying data sets for multidimensional analysis, achieving response times of a few seconds over more than 10 billion rows.

Kylin relies on two key technologies: Apache Hive and Apache HBase.
The Data Warehouse is based on a star schema stored in Apache Hive.
Using this model and a metadata model definition, Kylin builds a multidimensional MOLAP cube in HBase.
Once the cube is built, users can query it with an SQL-based language through Kylin's JDBC driver.
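The idea behind a MOLAP cube can be illustrated with a small, purely conceptual Python sketch: aggregates are precomputed for every combination of dimensions, so a later GROUP BY query becomes a lookup rather than a scan of the facts. This is not how Kylin is implemented (its build jobs run on the Hadoop cluster and the result is stored in HBase). The dimension names, the measure and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and fact rows.
DIMENSIONS = ("sex", "academic_year")
FACTS = [
    {"sex": "F", "academic_year": "2015/16", "credits_passed": 54},
    {"sex": "M", "academic_year": "2015/16", "credits_passed": 42},
    {"sex": "F", "academic_year": "2016/17", "credits_passed": 60},
    {"sex": "M", "academic_year": "2016/17", "credits_passed": 48},
]


def build_cube(facts, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in facts:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def query_cube(cube, group_by):
    """Answer a GROUP BY query by looking up the matching precomputed subset."""
    dims = tuple(d for d in DIMENSIONS if d in group_by)
    return cube[dims]


cube = build_cube(FACTS, DIMENSIONS, "credits_passed")
print(query_cube(cube, ["sex"]))
print(query_cube(cube, []))
```

With two dimensions the sketch stores four precomputed groupings. The number doubles with every added dimension, which is the trade-off between build time, storage and query speed that a cube design has to balance.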

When Kylin receives an SQL query, it decides whether the query can be resolved using the MOLAP cube in HBase (in milliseconds). If it cannot, Kylin builds its own query and executes it against the Apache Hive storage, although this case is rarely used.

Because Kylin has a JDBC driver, we can connect it to the most popular BI tools, such as Tableau, or to any framework that uses JDBC.
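The routing decision described above can be sketched as a simple coverage check. The snippet below is an illustration only and does not reproduce Kylin's actual logic; the function name `choose_engine` and the sets of dimensions and measures are hypothetical.

```python
# Hypothetical description of what the precomputed cube covers.
CUBE_DIMENSIONS = {"sex", "academic_year", "qualification"}
CUBE_MEASURES = {"credits_enrolled", "credits_passed"}


def choose_engine(group_by, measures):
    """Return "cube" when the cube covers the query, otherwise "hive"."""
    covered = (
        set(group_by) <= CUBE_DIMENSIONS and set(measures) <= CUBE_MEASURES
    )
    return "cube" if covered else "hive"


print(choose_engine(["sex", "academic_year"], ["credits_passed"]))
print(choose_engine(["student_name"], ["credits_passed"]))
```

In this sketch a query falls back to the slower path as soon as it references a column the cube was not built for, which is why the cube definition should cover the dimensions and metrics that users actually analyze.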


Information

STPivot4 is a very powerful OLAP viewer developed by StrateBI and is part of the Lince BI suite of BI tools.
With this viewer we intend to make this kind of tool easier to use, focusing on a successful user experience.

In addition, a query editor wizard, new graphs, multidimensional tables, an advanced formula editor and the possibility of exporting content in many formats are a few of the features we highlight in STPivot4, and they make it a leading technology among OLAP viewers.

STPivot works on an MDX engine, Mondrian.
For this reason STPivot can be used as a plugin in Pentaho BA Server (CE).


Information

As a Big Data source, we have generated academic data for the last 15 years of a university with more than a million students.

In the Data Warehouse we have 100 million rows with metrics such as sum of credits, passed subjects, failed subjects or enrolled subjects.

There are also derived metrics, such as performance rate and success rate, calculated from the relation between passed credits and enrolled credits.
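A derived metric of this kind can be sketched as a ratio of two base metrics. The exact definitions used in the demo are not given here, so the formula below (passed credits divided by enrolled credits) and the function names are assumptions made for illustration.

```python
def performance_rate(passed_credits, enrolled_credits):
    """Assumed definition: passed / enrolled credits; None if nothing enrolled."""
    if enrolled_credits == 0:
        return None
    return passed_credits / enrolled_credits


def group_performance_rate(rows):
    """Rate for a group of (passed, enrolled) rows, computed from the sums."""
    total_passed = sum(passed for passed, _ in rows)
    total_enrolled = sum(enrolled for _, enrolled in rows)
    return performance_rate(total_passed, total_enrolled)


# Hypothetical rows: (passed credits, enrolled credits) per student.
sample = [(54, 60), (12, 48)]
print(performance_rate(54, 60))
print(group_performance_rate(sample))
```

The group rate is computed from the summed credits rather than by averaging the per-student rates. The two results generally differ, and only the first stays consistent when the same metric is viewed at different levels of aggregation.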


R&D&I Big Data

At StrateBI we believe in the value of Big Data technologies for data processing and in the possibility of obtaining knowledge with them, with the goal of making decision-making easier in any industry. Our team does a great job on R&D&I in Big Data.

Research

We keep up to date with the news and scientific articles published about Big Data technologies.

We do this both for emerging technologies that we think have great potential and for consolidated ones.

In this way we detect new features that can improve the behavior or performance of our solutions.

Development

We put the results of the research phase into practice.

We deploy the improvements and validate their application in real use cases, similar to the ones we show in this demo.

Innovation

Once we have tested the usefulness and robustness of the improvements or new features, we introduce them into our solutions in different projects.

In this way StrateBI guarantees the use of cutting-edge Big Data technologies, previously tested and improved by our R&D&I in Big Data.


Technologies Used

Hadoop

Apache Hadoop is the most popular Big Data environment. It allows distributed computing on clusters of low-cost commodity hardware.

The basic default configuration of a Hadoop cluster includes distributed data storage (HDFS), a resource manager (YARN, Yet Another Resource Negotiator) and, running on top of the latter, the MapReduce framework, which performs the distributed processing of data.

Besides these components, there is another set of higher-level tools for storing and processing data, such as Hive or Spark. They offer an abstraction that simplifies development for this environment.

As mentioned before, Hadoop is the most popular Big Data environment because it offers a wide range of technologies and a very high level of robustness. It is well suited to the new concept of the Data Lake, for later analytics using powerful BI tools.

Flume

Flume is a distributed and reliable system for the efficient collection, aggregation and processing of streaming data.

Kafka

Kafka is a distributed messaging system that uses the publish-subscribe pattern. It is fault tolerant and horizontally scalable, and is well suited to stream data processing.
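The publish-subscribe pattern itself can be illustrated with a few lines of Python. The `Broker` class below is a hypothetical in-memory stand-in and not the Kafka API: real Kafka persists messages in partitioned logs and consumers pull them, whereas this sketch simply pushes each message to registered callbacks. The topic name and message are also made up.

```python
from collections import defaultdict


class Broker:
    """In-memory illustration of publish-subscribe (not the Kafka API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback that receives every message sent to the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to all subscribers; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received_a, received_b = [], []
broker.subscribe("enrollments", received_a.append)
broker.subscribe("enrollments", received_b.append)
delivered = broker.publish("enrollments", {"student_id": 1, "credits": 6})
print(delivered, received_a, received_b)
```

The publisher never refers to a specific consumer, only to a topic. That decoupling is what lets producers and consumers be added or scaled independently.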

Hortonworks and Cloudera

To make the installation, management and maintenance of a Hadoop cluster easier, we work with two main Hadoop distributions.

A Hadoop distribution is a software package that includes the basic components of Hadoop plus other technologies, frameworks and tools, and that can be installed through a web application.

For this reason, at Stratebi we recommend the use of a Hadoop distribution, Hortonworks and Cloudera being the leading distributions currently on the market. Our demo therefore runs on both a Cloudera distribution and a Hortonworks distribution.

Spark and Spark Streaming

Spark implements the MapReduce programming paradigm, making intensive use of RAM instead of disk.

Using Spark, we can improve the performance of MapReduce applications when implementing iterative algorithms, machine learning (MLlib), statistical analysis (R module) or real-time analytics (Spark Streaming). All of this is included in our demo.
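To make the MapReduce paradigm mentioned above concrete, here is a minimal single-process sketch in plain Python. It uses neither the Spark nor the Hadoop API and distributes nothing; it only shows the map, shuffle and reduce phases on hypothetical (academic year, passed credits) records.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input records: (academic year, passed credits).
RECORDS = [("2015/16", 54), ("2015/16", 42), ("2016/17", 60), ("2016/17", 48)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    return [(year, credits) for year, credits in records]


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single result (here, a sum)."""
    return {
        key: reduce(lambda left, right: left + right, values)
        for key, values in groups.items()
    }


totals = reduce_phase(shuffle_phase(map_phase(RECORDS)))
print(totals)
```

In a real cluster the map and reduce steps run in parallel on different nodes and the shuffle moves data between them. Keeping the intermediate groups in memory, as this sketch does trivially, is the kind of optimization that benefits iterative workloads.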