Real Time Analytics of wikipedia updates, using Spark Streaming
In this example the Kafka cluster is used as a repository of real time data, that means that all data inserted in the cluster are available to kafka consumers, and guarantees persistence for a configurable amount of days, in this case, we have set it to 3 days.
The cluster have 3 brokers that manage the flow of messages "from" and "to" the producer and consumer respectively.
In this demo we have implemented a kafka producer, configurable by .property file, for connecting to the wiki real time stream.
The connection is made using WebSocket with Java client.
Every tuple we receive from wiki real time stream, is processed by our implementation of org.apache.clients.producer.KafkaProducer, at this point we add some useful data to each tuple, like time or wiki type, and finally this data is sent to the kafka cluster.
The kafka cluster are receiving all the data, that can be consumed by any external service using a org.apache.clients.consumer.KafkaConsumer, that connect to the cluster and start drinking data for the preferred topic.
Spark Streaming allows the analysis of a set of data received in real time in an specified window time, in this case, we have configured 4 seconds. In this demo we use the Kafka connector to Spark Streaming for getting that window time.
The analysis over that set of data y made using Map Reduce operations for grouping and count updates, by type, language and modality. Once the reduce operation is done we send the result to the Spring Framework API Rest.
Also we have implemented an http client using the java.net package, to make it as lightweight as possible, we avoid to include any external dependency library in the Spark application.
As we can add custom functions inside Map or Reduce steps, in this case, we implemented a notifier, that uses the mentioned http client for send the result to our API REST, this solution is an example of how flexible Spark Streaming can be, to allow custom code.
At the time a user open the demo page, the page request a connections to the end point of our system that is sending data in real time, using a WebSocket in the client side.
On the server side, a connection is created to the client, and while the connection is open (the client computer is on and the browser is open), the systems looks for new data in the "Broadcast Queue". This component, at the time, are receiving data from the API REST sent by the implemented http client inside the Spark Streaming application
The "Broadcast Queue" implementation, allows that all connections to the server can get the data from the same queue data structure, in an optimal O(1) time (Computational Complexity of retrieve data from a Message Queue).
Also, the Message Queue allows an efficient communication between Spark Streaming and the Server Socket, in O(1) without counting network delays.
This implementation allows a high number of concurrent connections to the server, and client can see the information in real time.