Introduction

This area is dedicated to Google Summer of Code (GSoC) projects, past and present.
Its primary purpose is to collect documentation relating to these projects, but most importantly it should act as a platform for anyone to contribute to Samza GSoC project initiatives.

2015

Proposal: Expanding Apache Samza’s consumers/producers to a wider spectrum of messaging systems

Description

Several JIRA issues were logged stating that Samza should be able to consume data from and produce data to other distributed publish/subscribe systems, e.g. Amazon Kinesis, Apache ActiveMQ, and Twitter's Kestrel. The community proposed that the integration with such systems be done as a Google Summer of Code project for 2015.

Student

Renato Marroquin

Mentor

Yan Fang

Full proposal

Proposal Title: Expanding Apache Samza’s consumers/producers to a wider spectrum of messaging systems
Student Name: Renato Marroquin
Student Email: renatoj.marroquin@gmail.com
Project Description: Apache Samza[1] is a distributed stream-processing framework that can be deployed on top of Apache Hadoop YARN in order to leverage existing Hadoop infrastructure. Samza’s main messaging system is Apache Kafka, and although Samza has been developed around Kafka’s architecture, it has become a very popular stream-processing system. To help Samza grow even more, this project adds the ability to read data from and write data to other message queues. It aims to expand Samza's capabilities for consuming and producing messages to two popular systems, Apache ActiveMQ[2] (SAMZA-587) and Amazon Kinesis[3] (SAMZA-489). ActiveMQ is a message broker implementing the Java Message Service API 1.1. It supports clients in many different programming languages, such as C, C++, and Erlang. Amazon Kinesis provides a cloud-based service for data processing over distributed data streams. It supports clients in Java, Python, and Ruby. Samza has an example application that uses Kafka as its messaging system. This sample application can be used to test the resulting integrations and to let users specify different messaging systems when trying out Samza.
Tentative project architecture: Apache Samza has a well-defined API for new system consumers and producers, described in [4]. Each of the new messaging systems will therefore implement the SystemProducer and SystemConsumer interfaces. The integration with Apache ActiveMQ will reside in a separate Maven module similar to the “samza-kafka” module. The integration with Amazon Kinesis will be an external module residing in an external repository, because a project committer and an Amazon engineer have already started work on integrating Amazon Kinesis with Samza. Although that code will not reside in the project’s main code repository, the integration will help grow the Samza community because of Amazon Kinesis's popularity. Both integrations will be modelled on the way the Kafka consumer and producer have been implemented. The three systems, ActiveMQ, Kinesis, and Kafka, provide similar functionality; however, there are differences that need to be studied and understood in order to write a successful integration. Examples include how they serialize and deserialize data, how they handle offsets internally, and the API each system offers.
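To make the producer/consumer split above more concrete, here is a minimal sketch of the shape such an integration might take. It does not use Samza's real classes: SketchProducer, SketchConsumer, SketchEnvelope, and InMemoryBroker are hypothetical, simplified stand-ins, loosely modelled on the register/send/flush and register/poll operations of the SystemProducer and SystemConsumer interfaces. The in-memory broker takes the place of Kinesis or ActiveMQ, and offsets are assumed to be plain log indexes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for Samza's SystemProducer/SystemConsumer.
interface SketchProducer {
    void register(String source);
    void send(String source, String stream, byte[] message);
    void flush(String source);
}

interface SketchConsumer {
    void register(String stream, long startOffset);
    List<SketchEnvelope> poll(String stream, int maxMessages);
}

// One message plus the position it was read from (-1 when not yet written).
final class SketchEnvelope {
    final String stream;
    final long offset;
    final byte[] message;

    SketchEnvelope(String stream, long offset, byte[] message) {
        this.stream = stream;
        this.offset = offset;
        this.message = message;
    }
}

// In-memory stand-in for an external broker: each stream is an append-only log.
final class InMemoryBroker {
    private final Map<String, List<byte[]>> logs = new HashMap<>();

    synchronized void append(String stream, byte[] message) {
        logs.computeIfAbsent(stream, s -> new ArrayList<>()).add(message);
    }

    synchronized List<byte[]> read(String stream, long fromOffset, int max) {
        List<byte[]> log = logs.getOrDefault(stream, new ArrayList<>());
        int from = (int) Math.min(fromOffset, log.size());
        int to = Math.min(from + max, log.size());
        return new ArrayList<>(log.subList(from, to));
    }
}

// Buffers messages per registered source and writes them out on flush.
final class InMemoryProducer implements SketchProducer {
    private final InMemoryBroker broker;
    private final Map<String, List<SketchEnvelope>> pending = new HashMap<>();

    InMemoryProducer(InMemoryBroker broker) { this.broker = broker; }

    public void register(String source) { pending.put(source, new ArrayList<>()); }

    public void send(String source, String stream, byte[] message) {
        List<SketchEnvelope> buffer = pending.get(source);
        if (buffer == null) throw new IllegalStateException("unregistered source: " + source);
        buffer.add(new SketchEnvelope(stream, -1L, message));
    }

    public void flush(String source) {
        List<SketchEnvelope> buffer = pending.get(source);
        if (buffer == null) return;
        for (SketchEnvelope e : buffer) broker.append(e.stream, e.message);
        buffer.clear();
    }
}

// Tracks the next offset per registered stream and advances it on each poll.
final class InMemoryConsumer implements SketchConsumer {
    private final InMemoryBroker broker;
    private final Map<String, Long> nextOffset = new HashMap<>();

    InMemoryConsumer(InMemoryBroker broker) { this.broker = broker; }

    public void register(String stream, long startOffset) { nextOffset.put(stream, startOffset); }

    public List<SketchEnvelope> poll(String stream, int maxMessages) {
        Long start = nextOffset.get(stream);
        if (start == null) throw new IllegalStateException("unregistered stream: " + stream);
        List<SketchEnvelope> out = new ArrayList<>();
        long offset = start;
        for (byte[] m : broker.read(stream, start, maxMessages)) {
            out.add(new SketchEnvelope(stream, offset++, m));
        }
        nextOffset.put(stream, offset);
        return out;
    }
}
```

In this sketch a task would register a source with the producer, send messages, and call flush before they become visible. A consumer registered on the same stream at offset 0 would then receive them in order, each tagged with its offset. A real Kinesis or ActiveMQ integration would replace InMemoryBroker with the respective client library and would have to map that system's notion of position onto Samza's offsets.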

Timeline:

Before 25 May. Getting familiar with how Amazon Kinesis and Apache ActiveMQ read and write data.
27 April - 24 May. Getting familiar with the Apache Samza code base. Helping to generate documentation for the project, writing unit tests for it, or solving JIRA issues important to the Samza community.
25 May - 26 June. Complete the integration with Amazon Kinesis by implementing the KinesisProducer and KinesisConsumer classes and the configuration classes they need to function correctly.
27 June - 27 July. Complete the integration with Apache ActiveMQ by implementing the ActiveMQProducer and ActiveMQConsumer classes and the configuration classes they need to function correctly.
28 July - 10 August. Profile the applications for further improvement while using Samza’s sample application as an end-to-end integration test between the projects.
11 August - 16 August. Write required tests and documentation for both integrations.
17 August - 21 August. Preparing for final release. Refactoring. Adding missing documentation for users.

JIRA Issue

Contribute to some JIRA issues to get more comfortable with the code base.

https://issues.apache.org/jira/browse/SAMZA-587
https://issues.apache.org/jira/browse/SAMZA-489

Project Deliverables

  • Kinesis-Samza consumer and producer systems
  • ActiveMQ-Samza consumer and producer systems
  • JUnit test cases for the integrated systems

Implementation Approach

Phase 1
Task | Status
Implement sample application based on Samza HelloWorld application. | DONE
Try out Amazon Kinesis basic functionality. | DONE
Implement basic prototype of SystemConsumer and SystemProducer API for Kinesis integration. | DONE
Improve prototype of Kinesis integration. | IN-PROGRESS

Phase 2
Task | Status
Try out Apache ActiveMQ basic functionality. | TODO
Implement basic prototype of SystemConsumer and SystemProducer API for ActiveMQ integration. | TODO
Optimize prototype of ActiveMQ integration. | TODO

Phase 3
Task | Status
Write tests for Kinesis integration. | TODO
Write tests for ActiveMQ integration. | TODO

Phase 4
Task | Status
Write documentation for new integrated systems. | TODO
Review checkpointing mechanism for new integrated systems. | TODO
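
The checkpointing review in Phase 4 depends on how each system represents a position in a stream, which the proposal lists as one of the differences to study. The sketch below illustrates one possible concern, under stated assumptions: positions are checkpointed as strings, Kafka-style offsets are decimal values that fit in a long, and Kinesis-style sequence numbers are decimal strings too large for a long. OffsetComparators and its members are hypothetical names used only for illustration. They are not part of Samza, Kafka, or the Kinesis client.

```java
import java.math.BigInteger;
import java.util.Comparator;

// Hypothetical helper: orders checkpointed positions that are stored as strings.
final class OffsetComparators {
    private OffsetComparators() {}

    // Assumption: Kafka-style offsets are decimal numbers that fit in a long.
    static final Comparator<String> KAFKA_STYLE =
            (a, b) -> Long.compare(Long.parseLong(a), Long.parseLong(b));

    // Assumption: Kinesis-style sequence numbers are decimal strings that may
    // exceed the range of a long, so they are compared as arbitrary-size integers.
    static final Comparator<String> KINESIS_STYLE =
            (a, b) -> new BigInteger(a).compareTo(new BigInteger(b));

    // Returns whichever of the two positions is further along in the stream.
    static String latest(Comparator<String> order, String a, String b) {
        return order.compare(a, b) >= 0 ? a : b;
    }
}
```

Comparing the raw strings would be wrong in both cases: lexicographically, "9" sorts after "10". A system without client-visible offsets would need a different approach altogether. ActiveMQ may be such a case, on the assumption that JMS tracks delivery through acknowledgements.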

Project Reports

Report Ending Week XXXX

Project description
Review of Previous Actions
Objectives
Future Actions
Mentor's Comments