Apache Flume

Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool that is principally designed to transfer streaming data from various sources to HDFS.

Data generators (such as Facebook and Twitter) generate data which gets collected by individual Flume agents running on them. Thereafter, a data collector (which is also an agent) collects the data from those agents, aggregates it, and pushes it into a centralized store such as HDFS or HBase.
[Figure: Flume Architecture]
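As a sketch of this agent-to-collector pattern, a web-server agent can ship its events to the collector agent over Avro. The agent names, hostname, and port below are hypothetical; the source/sink types and their properties are standard Flume ones.

# Hypothetical web-server agent: an Avro sink pointing at the collector host
web.sinks.k1.type=avro
web.sinks.k1.hostname=collector01
web.sinks.k1.port=4141

# Hypothetical collector agent: an Avro source receiving from the web-server agents
coll.sources.r1.type=avro
coll.sources.r1.bind=0.0.0.0
coll.sources.r1.port=4141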

Flume Event

An event is the basic unit of the data transported inside Flume. It contains a byte-array payload that is to be transported from the source to the destination, accompanied by optional headers. A typical Flume event has the following structure −
[Figure: Flume Event]
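For instance, an event carrying one web-server log line might look like the following (the header names and values are illustrative; headers such as timestamp and host are typically added by interceptors):

headers: { timestamp=1463155576849, host=web01 }
body:    the raw bytes of the line "GET /index.html HTTP/1.1 200"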

Flume Agent

An agent is an independent daemon process (JVM) in Flume. It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). Flume may have more than one agent. The following diagram represents a Flume agent.
[Figure: Flume Agent]
As shown in the diagram, a Flume agent contains three main components, namely source, channel, and sink.

Source

A source is the component of an agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives events from a specified data generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.
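As a sketch, a source is declared by setting its type plus type-specific properties; for an Avro source that means the address it listens on (the agent and component names here are hypothetical):

a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=4141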

Channel

A channel is a transient store which receives the events from the source and buffers them till they are consumed by sinks. It acts as a bridge between the sources and the sinks.
These channels are fully transactional and they can work with any number of sources and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
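For example, a memory channel can be declared with a bound on how many events it buffers (the names and capacity values below are illustrative):

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=1000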

Sink

A sink stores the data in centralized stores like HBase and HDFS. It consumes the data (events) from the channels and delivers it to the destination. The destination of the sink might be another agent or the central stores.
Example − HDFS sink
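A minimal HDFS sink declaration might look like this, naming the target directory and the on-disk format (the path and roll interval are illustrative):

a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/data/flume/events
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.rollInterval=30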
Note − A Flume agent can have multiple sources, sinks, and channels. We have listed all the supported sources, sinks, and channels in the Flume configuration chapter of this tutorial.

Additional Components of Flume Agent

What we have discussed above are the core components of the agent. In addition to these, there are a few more components that play a vital role in transferring the events from the data generator to the centralized stores.

Interceptors

Interceptors are used to alter or inspect Flume events as they are transferred between the source and the channel.
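For example, the built-in timestamp interceptor can be attached to a source so that every event gets a timestamp header (agent and component names are hypothetical):

a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp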

Channel Selectors

These are used to determine which channel the data should be written to when there are multiple channels. There are two types of channel selectors −
  • Default channel selectors − These are also known as replicating channel selectors; they replicate every event into each of the channels.
  • Multiplexing channel selectors − These decide which channel to send an event to based on a value in that event's header, as sketched below.
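A multiplexing selector might be configured like this, routing on a hypothetical "state" header (all names and mappings are illustrative):

a1.sources.r1.selector.type=multiplexing
a1.sources.r1.selector.header=state
a1.sources.r1.selector.mapping.CA=c1
a1.sources.r1.selector.mapping.NY=c2
a1.sources.r1.selector.default=c1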

Sink Processors

These are used to invoke a particular sink from a selected group of sinks. They are used to create failover paths for your sinks or to load-balance events across multiple sinks from a channel.
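As a sketch, two sinks can be placed in a group served by a load-balancing (or failover) processor (the group and sink names are hypothetical):

a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=load_balance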
============PRACTICALLY===========
# Agent "A": name its source, sink, and channel
A.sources=src
A.sinks=sk
A.channels=ch

# Source: watch a local spooling directory for new files
A.sources.src.type=spooldir
A.sources.src.spoolDir=/home/nitin/Desktop/DMART_DATA

# Sink: write events to HDFS
A.sinks.sk.type=hdfs
A.sinks.sk.hdfs.path=/data/flume/example_1_spooldir

# Channel: buffer events in memory
A.channels.ch.type=memory

# Binding: the source writes to the channel, the sink reads from it
A.sources.src.channels=ch
A.sinks.sk.channel=ch

# flume-ng agent -n <agent_name> -c conf -f <config_file_path>

# flume-ng agent -n A -c conf -f /home/nitin/Desktop/flume_spool2.conf
-----------------------
ANATOMY OF A FLUME CONFIG FILE

AGENT NAME
SOURCE NAME
CHANNEL NAME
SINK NAME

SOURCE TYPE
SOURCE MISCELLANEOUS PROPERTIES

CHANNEL TYPE
CHANNEL MISCELLANEOUS PROPERTIES

SINK TYPE
SINK MISCELLANEOUS PROPERTIES

BINDING
SOURCE-->CHANNEL
SINK<--CHANNEL
-----------------
=====================
1. Start all daemons.
2. Create a spool directory with the name DMART_DATA.
3. Create a config file and set the path of the above directory as the spoolDir property.
4. Run the Flume agent.
5. Put a test file in your DMART_DATA directory and check the contents in HDFS.
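To verify step 5, the HDFS contents can be inspected with commands along these lines (the path comes from the config above; hadoop fs -text also decodes the SequenceFile format that the HDFS sink writes by default):

# hadoop fs -ls /data/flume/example_1_spooldir
# hadoop fs -text /data/flume/example_1_spooldir/*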
================
=================NETCAT (intercepting chat/messages)=================


# Agent "a": name its source, sink, and channel
a.sources=s
a.sinks=sk
a.channels=c

# Source: listen for text lines on a TCP port (netcat-style)
a.sources.s.type=netcat
a.sources.s.bind=localhost
a.sources.s.port=12345

# Channel: buffer events durably on local disk
a.channels.c.type=file

# Sink: write events to HDFS as plain text (DataStream)
a.sinks.sk.type=hdfs
a.sinks.sk.hdfs.path=/data/flume/sp
a.sinks.sk.hdfs.fileType=DataStream

# Binding: the source writes to the channel, the sink reads from it
a.sources.s.channels=c
a.sinks.sk.channel=c

# flume-ng agent -n <agent_name> -c conf -f <conf_file_path>

# flume-ng agent -n a -c conf -f /home/nitin/Desktop/SMC.conf
==============================
Start your daemons.

1. Save your config file.
2. Start the Flume agent in one terminal.
3. Start telnet in another terminal (telnet localhost 12345).
4. Send a message and check the contents of HDFS.
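A telnet session for steps 3-4 might look like this (the OK lines are the netcat source acknowledging each received event, which it does by default):

$ telnet localhost 12345
Trying 127.0.0.1...
Connected to localhost.
hello flume
OK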
=================EXEC================= streaming a command's output============

# Agent "a": name its source, sink, and channel
a.sources=s
a.sinks=sk
a.channels=c

# Source: run a shell command (here jps) and emit its output lines as events
a.sources.s.type=exec
a.sources.s.command=jps

# Channel: buffer events durably on local disk
a.channels.c.type=file

# Sink: write events to HDFS as plain text
a.sinks.sk.type=hdfs
a.sinks.sk.hdfs.path=/data/flume/1234
a.sinks.sk.hdfs.fileType=DataStream

# Binding: the source writes to the channel, the sink reads from it
a.sources.s.channels=c
a.sinks.sk.channel=c

# flume-ng agent -n <agent_name> -c conf -f <conf_file_path>

# flume-ng agent -n a -c conf -f <path_of_the_exec_config_file_above>


=================END====================
