Aegis big data Hadoop developers are posting this article to let the development community know how to get the top N word frequency counts across distinct articles, sorted, using the Hadoop MapReduce paradigm. You can try your hands on the code shared in this post and share your experience later.
We have N articles in text format, and we are interested in finding each word's frequency and sorting the results so that we can see which words occur most often across all those files.
I have tested the code in the following environment:
- Java: 1.7.0_75
- Hadoop: 1.0.4
- Sample Input:
We have N files in text format. I used 20 big text files for performing this test.
Once we have collected all the input files, we have to upload them to HDFS.
I created an /input/articles directory and put all the files in it, using commands like the following (the local path is just an example):
hadoop fs -mkdir /input/articles
hadoop fs -put /path/to/local/articles/*.txt /input/articles
We will use 2 steps to perform this task.
- 1. We will run a MapReduce job. One mapper parses the files and counts each single occurrence of a particular word, and one reducer computes the total frequency of each word. Once the mapper and reducer tasks are completed, we will have partition files in our HDFS.
- 2. We will sort that data by frequency count using the sort utility.
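The two steps above can be emulated locally with standard Unix tools. This is only a sketch to illustrate what the job computes (tokenize, count, sort by frequency), run on a tiny made-up sample file, not a replacement for the MapReduce job:

```shell
# Create a tiny sample "article" (hypothetical data, for illustration only)
printf 'the cat sat on the mat the end\n' > sample.txt

# Step 1 analogue: split into one word per line and count each word
# Step 2 analogue: sort by count, highest first, and keep the top 2
tr ' ' '\n' < sample.txt | sort | uniq -c | sort -rn | head -n2
# The most frequent word ("the", 3 occurrences) prints first
```

Here `uniq -c` plays the role of the reducer (summing occurrences of each word), while `sort -rn` is the second, sorting step that we will later run over the real Hadoop output.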
I will give a detailed explanation of this program and how to run it at the end of this document.
My code looks like this:
public class TopNWordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        /* Retrieving tokens from string input */
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        /* While tokens are found, emit each with an initial count of 1 */
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
public class TopNWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        /* Initial count will be 0 for a keyword */
        int total = 0;
        for (IntWritable value : values) {
            /* Add each partial count to the running total */
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
public class TopNWordCountDriver extends Configured implements Tool {
    public int run(String[] args) {
        int result = -1;
        try {
            Configuration configuration = new Configuration();
            Job job = new Job(configuration, "Word Frequency Count Job");
            job.setJarByClass(TopNWordCountDriver.class);
            job.setMapperClass(TopNWordCountMapper.class);
            job.setReducerClass(TopNWordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            result = job.waitForCompletion(true) ? 0 : 1;
            System.out.println(result == 0 ? "Job is Completed Successfully" : "Error in job...");
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        int response = ToolRunner.run(new Configuration(), new TopNWordCountDriver(), args);
        System.out.println("Result = " + response);
    }
}
Most of the code is self-explanatory, so you can easily get a line-by-line understanding of it.
We extract data from the text files using the mapper class TopNWordCountMapper.java.
We count the total occurrences of a particular word using the reducer class TopNWordCountReducer.java.
- How to run this program:
- Prepare Data:
Copy your data files into HDFS. I have put all the files in the /input/articles folder in my HDFS.
Now make a jar file out of this project using the Eclipse export jar facility.
Run the jar file using the hadoop jar command:
hadoop jar <JarFileName>.jar <InputPath> <OutPut Path>
I used following command to run at my local configuration.
hadoop jar TopNArticle.jar /input/articles /output/articles
Please note that when specifying the output path, the directory named "articles" must not already exist; it will be created automatically.
Once the job is completed, the data will be ready and your output directory (/output/articles) will contain a file whose name starts with part-r-*****; that is our intermediate data.
Now, to retrieve the words and their frequency counts from those files, use the following command:
hadoop fs -cat /output/articles/part* | sort -n -k2 -r | head -n500 > Top-500.txt
It will fetch the data from those files and write the output to Top-500.txt in your current directory, so after it completes you should see a Top-500.txt file in your current local directory.
A brief overview of the above command:
sort -n sorts numerically, -k2 makes sure that the second column (our frequency count) is used as the key while sorting, and -r reverses the order so the highest counts come first.
head -n500 means we keep only the top 500 records.
The result is the 500 words with the highest frequency counts.
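To see what the sort-and-head stage does in isolation, here is a small local example against a fabricated reducer-style output file (TAB-separated word and count; the words and numbers are sample values, not real article data):

```shell
# Fabricated part-r-00000 with "word<TAB>count" lines (sample data only)
printf 'hadoop\t42\ndata\t17\nmapreduce\t99\n' > part-r-00000

# Numeric sort (-n) on the second column (-k2), descending (-r); keep the top 2
sort -n -k2 -r part-r-00000 | head -n2
# Prints "mapreduce 99" first, then "hadoop 42"
```

The real command in this article is identical except that it reads the part-r-* files out of HDFS via hadoop fs -cat and keeps 500 records instead of 2.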
Professionals have posted this article for big data Hadoop developers and the worldwide community to help them learn and use the MapReduce paradigm for counting top N word frequencies. Do try it out and come back with your feedback.
For further information, mail us at firstname.lastname@example.org