Kafka

Real-time retrieval of Wikipedia updates using Apache Kafka


Data Source

The demo shows three live streams, Wikipedia Articles, Wikimedia Articles and Wikidata Articles, each displayed with the columns Type, Date and Title.

Information

In the "Cluster view", we can see 3 brokers and 3 producers that emit tuples to the cluster kafka.

The "Kafka Producer" component connects to the Wikipedia stream and registers a listener, following the Observer pattern. When an update is generated on Wikipedia, it is received through the "Socket", which notifies the "Listener". The Listener contains an org.apache.kafka.clients.producer.KafkaProducer; the producer registers a callback that is notified when a message has been sent to Kafka, and the notification object contains the offset and partition of each message. In this step, once per minute, we send via the API the time in milliseconds and the offset for that time.
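
As an illustration, a minimal sketch of that Listener in Java follows; the topic name, broker addresses and the indexOffset helper are hypothetical placeholders, not the demo's actual code:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

// Sketch of the "Listener" that wraps the KafkaProducer (assumed names throughout).
public class WikipediaListener {

    private final KafkaProducer<String, String> producer;
    private volatile long lastIndexedMinute = -1;

    public WikipediaListener() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // assumed brokers
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Called by the Socket for every Wikipedia update (Observer pattern).
    public void onUpdate(String updateJson) {
        producer.send(new ProducerRecord<>("wikipedia-updates", updateJson),
            (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                    return;
                }
                // The callback exposes the partition and offset of the message just sent.
                long minute = System.currentTimeMillis() / 60_000;
                if (minute != lastIndexedMinute) {
                    lastIndexedMinute = minute;
                    // Hypothetical helper: once per minute, report (time, partition, offset)
                    // via the API so the mapping ends up in PostgreSQL.
                    indexOffset(minute * 60_000, metadata.partition(), metadata.offset());
                }
            });
    }

    private void indexOffset(long timestampMillis, int partition, long offset) {
        // Placeholder for the API call described in the text.
    }
}
```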

This information is stored in a PostgreSQL database. When a user selects a date, the system looks up in the database, in O(1) time, an offset for that time, and then we seek to that offset. The Kafka cluster stores data for 3 days.
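
A sketch of that lookup with plain JDBC could look like the following; the offset_index table, its columns and the connection string are assumptions made for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative lookup of the offset indexed for a requested time (schema is assumed).
public class OffsetIndex {

    // Returns the offset indexed closest to, but not after, the requested time.
    public static long findOffset(long requestedMillis) throws SQLException {
        String sql = "SELECT kafka_offset FROM offset_index " +
                     "WHERE event_time <= ? ORDER BY event_time DESC LIMIT 1";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/demo", "demo", "demo");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, requestedMillis);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalArgumentException("No offset indexed before " + requestedMillis);
                }
                return rs.getLong("kafka_offset");
            }
        }
    }
}
```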

Once we have the offset for the requested date, we ask the "Consumer Holder" for a "Thread Safe Kafka Consumer". The "Thread Safe Kafka Consumer" performs the seek and poll operations to position the consumer at that offset and start consuming data from there.
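
A minimal sketch of that seek-and-poll step with the standard Kafka consumer API follows; the topic, partition and broker address are placeholder assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekAndPoll {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "wikipedia-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");

        long offsetForDate = 123_456L; // value resolved from the PostgreSQL index
        TopicPartition tp = new TopicPartition("wikipedia-updates", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment so seek() is allowed
            consumer.seek(tp, offsetForDate);                // point the consumer at the requested date
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%d %d %s%n", record.partition(), record.offset(), record.value());
            }
        }
    }
}
```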

By default, an org.apache.kafka.clients.consumer.KafkaConsumer is not thread safe, so it cannot be used in a concurrent environment (like this one, where many users could view the demo simultaneously). To solve this problem we implemented a version of this consumer that allows multiple threads to access the object in a synchronized way.
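
One straightforward way to serialize access to a single consumer is sketched below; this is illustrative only and may differ from the demo's actual "Thread Safe Kafka Consumer":

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collection;

// Wrapper that guards every operation on the underlying KafkaConsumer with the same lock,
// so only one request thread touches it at a time.
public class ThreadSafeKafkaConsumer<K, V> {

    private final KafkaConsumer<K, V> delegate;

    public ThreadSafeKafkaConsumer(KafkaConsumer<K, V> delegate) {
        this.delegate = delegate;
    }

    public synchronized void assign(Collection<TopicPartition> partitions) {
        delegate.assign(partitions);
    }

    public synchronized void seek(TopicPartition partition, long offset) {
        delegate.seek(partition, offset);
    }

    public synchronized ConsumerRecords<K, V> poll(Duration timeout) {
        return delegate.poll(timeout);
    }

    public synchronized void close() {
        delegate.close();
    }
}
```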

R&D&I in Big Data

At StrateBI we believe in the value of Big Data technologies for data processing and in the possibility of extracting knowledge from data, with the goal of making decision-making easier in any industry. Our team does great work on Big Data R&D&I.

Research

We keep up to date with news and scientific articles published about Big Data technologies.

This covers emerging technologies that we think have great potential, as well as consolidated ones.

With this, we detect new features that can improve the behavior or performance of our solutions.

Development

We put into practice the results of the research phase.

We deploy the improvements and validate their application in real use cases, similar to the ones shown in this demo.

Innovation

Once we have tested the usefulness and robustness of the improvements or new features, we introduce them into our solutions in different projects.

In this way StrateBI guarantees the use of cutting-edge Big Data technologies, previously tested and improved by our Big Data R&D&I work.


Used Technologies

Hadoop

Apache Hadoop is the most popular Big Data environment; it allows distributed computing on clusters of low-cost commodity hardware.

The basic, default configuration of a Hadoop cluster includes distributed data storage (HDFS), a resource manager (YARN, Yet Another Resource Negotiator) and, running on top of it, the MapReduce framework, which performs the distributed processing of the data.
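
As a reminder of what the MapReduce layer looks like in practice, here is the classic word-count job in Java; it is a generic textbook example, not part of the demo, and the input/output paths are passed as arguments:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split stored on HDFS.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```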

Besides these components, there is another set of higher-level tools for storing and processing data, such as Hive or Spark. They offer abstractions that simplify development for this environment.

As mentioned before, Hadoop is the most popular Big Data environment because it offers a wide range of technologies and a very high level of robustness. It is ideal for the new Data Lake concept and for later analytics with powerful BI tools.

Flume

Flume is a distributed, reliable system for the efficient collection, aggregation and processing of streaming data.

Kafka

Kafka is a distributed messaging system based on the publish-subscribe pattern; it is fault tolerant, horizontally scalable and ideal for stream data processing.

Hortonworks / Cloudera

To simplify the management, installation and maintenance of Hadoop clusters, we work with the two main Hadoop distributions.

A Hadoop distribution is a software package that includes the basic Hadoop components plus other technologies, frameworks and tools, together with the possibility of installing everything through a web application.

For this reason, at StrateBI we recommend using a Hadoop distribution, Hortonworks and Cloudera being the leading distributions currently on the market. That is why our demo runs on both a Cloudera distribution and a Hortonworks distribution.

Spark / Spark Streaming

Spark implements the MapReduce programming paradigm, making intensive use of RAM instead of disk.

Using Spark we can improve the performance of MapReduce applications and implement iterative algorithms, machine learning (MLlib), statistical analysis (the R module), or real-time analytics (Spark Streaming); all of this is included in our demo.
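
As a rough illustration of the Spark Streaming piece, the sketch below consumes a Kafka topic and counts updates per micro-batch; the topic name, broker list and local master are assumptions for the example, not the demo's actual job:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class WikipediaStreamJob {
    public static void main(String[] args) throws InterruptedException {
        // Local master for testing only; a real deployment would run on the cluster.
        SparkConf conf = new SparkConf().setAppName("wikipedia-updates").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // assumed brokers
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "wikipedia-demo");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("wikipedia-updates"), kafkaParams));

        // Count the Wikipedia updates received in each micro-batch as a trivial real-time metric.
        stream.map(ConsumerRecord::value)
              .count()
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```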