Twitter Feed to Apache Hive using Apache Flume

Posted on 29 April 2017 by Srinivas Nelakuditi


Step 1: Create a folder in HDFS to hold the tweet data from Twitter

Create a folder in HDFS (the -p flag also creates the parent /demo directory if it does not already exist):
hadoop fs -mkdir -p /demo/tweets
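
To confirm the folder exists before wiring up Flume, you can list the parent directory:

hadoop fs -ls /demo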

Step 2: Configure Flume by creating a configuration file with a source, channel and sink

NOTE: Please get your own credentials for Twitter by registering at twitter.com as a developer. Replace each xxxxxxxxxxx placeholder in the configuration below with your own credentials.

Create a file called twitter.conf with the following contents:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

TwitterAgent.sources.Twitter.consumerKey=xxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret=xxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken=xxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret=xxxxxxxxxxx
# Comma-separated list of keywords to track (defined once; a duplicate
# keywords property later in the file would silently override this one)
TwitterAgent.sources.Twitter.keywords=hadoop, spark, apache spark, java, unix, linux, tesla, trump, Vulab, BigData, Spark Streaming, Data Science

# Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://sandbox.hortonworks.com:8020/demo/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
# rollSize=0 disables size-based file rolling; files roll after 10000 events
# or 600 seconds, whichever comes first
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600

# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
# transactionCapacity must be at least the sink's batchSize (1000 here)
TwitterAgent.channels.MemChannel.transactionCapacity=1000

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

 

Step 3: Start the Flume agent to load data from the Twitter feed into the HDFS folder

Execute flume-ng, giving it the agent name and the configuration file:

flume-ng agent -n TwitterAgent -f twitter.conf
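
If flume-ng complains about a missing configuration directory, or you want to watch events scroll by in the console, a more explicit invocation helps. The /etc/flume/conf path below is an assumption; substitute your installation's Flume conf directory:

flume-ng agent --conf /etc/flume/conf --conf-file twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console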

 

Step 4: Verify the data in the HDFS folder

After a few minutes you should see tweet data populated in the HDFS folder /demo/tweets.

Step 5: Analyze the tweet files stored under /demo/tweets in HDFS

[root@sandbox ~]# hadoop fs -ls /demo/tweets
Found 10 items
-rw-r--r--   1 root hdfs    1151525 2017-04-29 22:02 /demo/tweets/FlumeData.1493503276896
-rw-r--r--   1 root hdfs     980837 2017-04-29 22:03 /demo/tweets/FlumeData.1493503337957
-rw-r--r--   1 root hdfs     991589 2017-04-29 22:04 /demo/tweets/FlumeData.1493503398855
-rw-r--r--   1 root hdfs    1002683 2017-04-29 22:05 /demo/tweets/FlumeData.1493503459830
-rw-r--r--   1 root hdfs     947843 2017-04-29 22:06 /demo/tweets/FlumeData.1493503519964
-rw-r--r--   1 root hdfs     969875 2017-04-29 22:07 /demo/tweets/FlumeData.1493503580889
-rw-r--r--   1 root hdfs     927216 2017-04-29 22:08 /demo/tweets/FlumeData.1493503641935
-rw-r--r--   1 root hdfs     978663 2017-04-29 22:09 /demo/tweets/FlumeData.1493503702912
-rw-r--r--   1 root hdfs     972084 2017-04-29 22:10 /demo/tweets/FlumeData.1493503763948
-rw-r--r--   1 root hdfs     696234 2017-04-29 22:11 /demo/tweets/FlumeData.1493503825045
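
Since the goal is to query these tweets from Apache Hive, an external table can be layered over the same folder. The Apache Flume TwitterSource writes Avro-encoded files, so the sketch below assumes Hive's Avro storage; the table name and column names are assumptions and must match the Avro schema embedded in the FlumeData files (inspect one with avro-tools getschema before creating the table):

-- Minimal sketch: external Avro table over the Flume output folder.
-- Column names are assumptions; verify against the embedded Avro schema.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  `text` STRING,
  user_screen_name STRING,
  retweet_count INT
)
STORED AS AVRO
LOCATION '/demo/tweets';

A quick sanity check once the table is in place:

SELECT user_screen_name, `text` FROM tweets LIMIT 10;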

 

NOTE: Flume is a good tool for getting tweets from Twitter into HDFS, but we suggest using HDF (Hortonworks DataFlow) or Apache NiFi for loading tweets from Twitter into HDFS or Hive for analytics. You can read more about NiFi at nifi.apache.org.

 

Enjoy your coding. If you would like Vulab's BigData team to work for you, or if you want to be trained by us, please contact us for Hadoop training.
