How to get the top N word counts using the Big Data Hadoop MapReduce paradigm with developer’s assistance


Aegis Softtech's big data analytics team presents this tutorial on getting the top N word frequency counts using the MapReduce paradigm. You can try your hands at the code shared in this post and share your experience afterwards.

Top N word count using MapReduce

We are introducing how to get the top N word counts from different articles and sort them accordingly using the Hadoop MapReduce paradigm.

MapReduce Problem Statement:

We have N articles in text format. We are interested in finding each word's frequency and sorting the words by that frequency, so that we can see which words occur most often across all those files.

I have tested the code in the following environment:

  • Java: 1.7.0_75
  • Hadoop: 1.0.4
  • Sample Input:

We have N files in text format. I have used 20 large text files for this test.

  • Data Preparation:

Once we have collected all the input files, we have to upload them to HDFS.

I have created an /input/articles directory and put all those files in that directory.
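For example, assuming the article files sit in a local directory named articles (the local path here is an assumption), the upload could look like this:

hadoop fs -mkdir /input/articles
hadoop fs -put articles/*.txt /input/articles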

  • Solution:

We will use 2 steps to perform this task.

  • 1. Using core MapReduce

We will use one mapper to parse the files and emit a count of 1 for each occurrence of a word.

We will use one reducer to total up the frequency count for each word.

Once the mapper and reducer tasks are completed, we will have a part file in our HDFS.

  • 2. Using the sort utility, we will sort the data based on the frequency counts (see the command at the end of this document).

I will give a detailed explanation of this program and how to run it at the end of this document.

My code looks like this:

TopNWordCountMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopNWordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        /* Retrieve tokens from the input line */
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            /* For each token found, emit an initial count of 1 */
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

TopNWordCountReducer.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        /* The count starts at 0 for each keyword */
        int total = 0;
        for (IntWritable value : values) {
            /* Add each emitted count to the running total */
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}

TopNWordCountDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TopNWordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) {
        int result = -1;
        try {
            Configuration configuration = new Configuration();
            Job job = new Job(configuration, "Word Frequency Count Job");
            job.setJarByClass(TopNWordCountDriver.class);
            job.setMapperClass(TopNWordCountMapper.class);
            job.setReducerClass(TopNWordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            /* Wait for the job to finish and record its exit status */
            result = job.waitForCompletion(true) ? 0 : 1;
            if (job.isSuccessful()) {
                System.out.println("Job is Completed Successfully");
            } else {
                System.out.println("Error in job...");
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        int response = ToolRunner.run(new Configuration(), new TopNWordCountDriver(), args);
        System.out.println("Result = " + response);
    }
}
  • Code Walk Through:

Most of the code is self-explanatory, so you can easily read through it and get a line-by-line understanding.

We are extracting data from the text files using the mapper class TopNWordCountMapper.java.

We are counting the total occurrences of a particular word using the reducer class TopNWordCountReducer.java.
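One optional refinement, not shown in the code above but a common optimization for word counting, is to also register the reducer as a combiner so that partial counts are merged on the map side before the shuffle. In TopNWordCountDriver.run(), right after job.setReducerClass(...), that would be a single extra line:

// Merge per-mapper counts locally before sending them to the reducer
job.setCombinerClass(TopNWordCountReducer.class);

This works here because summing counts is associative and commutative, so combining partial sums does not change the final totals.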

If you still do not know why the top N word frequencies matter, it is time to discuss it with the Aegis Softtech developers.

We know how to sort words by frequency using Hadoop MapReduce. Let us discuss how we can solve your problem effectively.

  • How to run this program:
  • Prepare Data:

Copy your data files into HDFS. I have put all the files in my HDFS in the /input/articles folder.

Now make a JAR file out of this project using the Eclipse JAR export facility.
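If you prefer the command line over Eclipse, a minimal sketch would be the following (it assumes the three .java files are in the current directory and that the Hadoop core JAR from your installation is at hand; the JAR name below matches Hadoop 1.0.4, adjust it for your version):

mkdir classes
javac -classpath hadoop-core-1.0.4.jar -d classes TopNWordCount*.java
jar -cvf TopNArticle.jar -C classes .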

Run the JAR file using the hadoop jar command.

I used,

hadoop jar <JarFileName>.jar <InputPath> <OutputPath>

I used the following command to run it on my local configuration.

hadoop jar TopNArticle.jar /input/articles /output/articles

Please note that when specifying the output path, the directory named “articles” must not already exist; it will be created automatically. Once the big data job is completed, the data will be ready, and your output directory (/output/articles) will contain a file whose name starts with part-r-*****. That is our intermediate data. Now, to retrieve the words and frequency counts from those files and sort them, use the following command.
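A minimal sketch of such a command, assuming the default tab-separated output of TextOutputFormat and an example cutoff of the top 20 words (the cutoff is an assumption; change it to whichever N you need):

hadoop fs -cat /output/articles/part-r-* | sort -k2 -n -r | head -20

This streams all the part files out of HDFS, sorts the lines numerically in descending order on the count column, and keeps only the 20 most frequent words.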
