spark_icono Spark Streaming

Real Time Analytics of wikipedia updates, using Spark Streaming

Wikipedia Update

Article Type

Article Language


In this example the Kafka cluster is used as a repository of real time data, that means that all data inserted in the cluster are available to kafka consumers, and guarantees persistence for a configurable amount of days, in this case, we have set it to 3 days.

The cluster have 3 brokers that manage the flow of messages "from" and "to" the producer and consumer respectively.
In this demo we have implemented a kafka producer, configurable by .property file, for connecting to the wiki real time stream.
The connection is made using WebSocket with Java client.

Every tuple we receive from wiki real time stream, is processed by our implementation of org.apache.clients.producer.KafkaProducer, at this point we add some useful data to each tuple, like time or wiki type, and finally this data is sent to the kafka cluster.

For more information, you can access the TodoBI blog post.


The kafka cluster are receiving all the data, that can be consumed by any external service using a org.apache.clients.consumer.KafkaConsumer, that connect to the cluster and start drinking data for the preferred topic.

Spark Streaming allows the analysis of a set of data received in real time in an specified window time, in this case, we have configured 4 seconds. In this demo we use the Kafka connector to Spark Streaming for getting that window time.

The analysis over that set of data y made using Map Reduce operations for grouping and count updates, by type, language and modality. Once the reduce operation is done we send the result to the Spring Framework API Rest.

Also we have implemented an http client using the package, to make it as lightweight as possible, we avoid to include any external dependency library in the Spark application.

As we can add custom functions inside Map or Reduce steps, in this case, we implemented a notifier, that uses the mentioned http client for send the result to our API REST, this solution is an example of how flexible Spark Streaming can be, to allow custom code.

For more information, you can access the TodoBI blog post.


At the time a user open the demo page, the page request a connections to the end point of our system that is sending data in real time, using a WebSocket in the client side.

On the server side, a connection is created to the client, and while the connection is open (the client computer is on and the browser is open), the systems looks for new data in the "Broadcast Queue". This component, at the time, are receiving data from the API REST sent by the implemented http client inside the Spark Streaming application

The "Broadcast Queue" implementation, allows that all connections to the server can get the data from the same queue data structure, in an optimal O(1) time (Computational Complexity of retrieve data from a Message Queue).

Also, the Message Queue allows an efficient communication between Spark Streaming and the Server Socket, in O(1) without counting network delays.

This implementation allows a high number of concurrent connections to the server, and client can see the information in real time.

For more information, you can access the TodoBI blog post.

I+D+i BigData

In StrateBI we believe in the value of Big Data technologies for data processing and the possibility of obtain knowledge using it, with the goal of making easier the process of decisions in any industry. Our team makes a great job on I+D+i in Big Data


We keep updated about news and scientific articles published about Big Data technologies.

Its made with emerging ones that we think have a great potential, as well as the consolidated ones.

With this, we detect new features that can improve the behavior or performance of our solutions.


We put in practice the results of the research phase.

We deploy the improvements and validate its application in real use cases, similar to the ones we show in this demo.


Once we test the usefulness and robustness of improvements or new features added we introduce in our solutions in different projects.

In this way StrateBI guarantees the use of cutting edge Big Data technologies, previous tests and improvements by out I+D+i in Big Data

Used Technologies


Apache Hadoop is the most popular Big Data environment, it allows the distributed computing on clusters with commodity hardware and low cost.

The basic and default configuration for a Hadoop cluster includes distributed storage of data using (HDFS), a resource manager (YARN) Yet Another Resource Negotiator, and running on top of this one, is the (Map Reduce) framework, that perform the distributed processing of data.

Besides these components, there are another set of higher level tools, for storing and processing data, like Hive or Spark, as an example. They offer the abstraction that simplifies the development for that environment.

As mentioned before, Hadoop is the most popular Big Data environment, the reason is because it offer a wide range of technologies and a very high robustness level. It is ideal for the new concept of Data Lake for the later analytics using powerful BI tools.


Flume is a distributed and trustworthy system for the efficient collection, aggregation and processing of Streaming Data.


Kafka is a distributed message system that use the pattern publish-subscribe, is fault tolerant, horizontal scalable and is ideal for Stream Data Processing

hortonworks cloudera

To make easier the management, installation and maintenance of hadoop cluster we work with two main Hadoop Distributions.

A hadoop distribution is a software package, that include the basic components of Hadoop, with a plus of other technologies, frameworks and tools and the possibility of installing using a web application.

About this, in Stratebi we recommend the use of a hadoop distribution. Being Hortonworks and Cloudera the leader distributions currently in the market. For this reason our demo is running over a Cloudera distribution and a Hortonworks distribution.

spark spark streaming

Spark implements the Map Reduce programming paradigm making intensive usage of RAM memory instead of disk.

Using Spark, we can improve the performance of Map Reduce applications by implementing iterative algorithms, machine learning (MLib), statistics analysis R module, or real time analytics Spark Streaming, all this is icluded in our demo.