CC292 Labs: Word Count Exercise

Introduction

The aim of this exercise is to work through a simple text processing problem, and to compare possible solutions to this in Java and in JavaScript.

The aim is to process a string, split it into white-space separated words, and count the occurrence of each word. The initial program should print out the words in order, with each word followed by the number of times it occurred.  For the Java example, we'll use alphabetic word order.  For the JavaScript example, we'll use order of increasing word frequency.

For example, given the input:

the cat sat on the mat

The program should output (exact format not important):

cat: 1, mat: 1, on: 1, sat: 1, the: 2

We'll first work through a possible Java solution, and then give some hints about the JavaScript solution.

Java Solution

We can split the solution into the following steps:

  • split the sentence into its constituent words (we use the split() method of String to do this).
  • for each word, keep increment the count each time we see it (starting at zero)
  • need some data structure to keep track of the words

A natural data structure to use for this is a Map - also called an associative array.  Also, the requirement was for the output to be in alphabetic order.  A Sorted Map will do this for us.  In Java, Sorted Map is an interface, and the implementation class to use is a Tree Map.

Here's the solution: I've missed out the import statements, but if you copy and paste this into Intellij, it will prompt you to import the necessary classes.

public class WordCount {
  public static void main(String[] args) {
    String s = "the cat sat on the mat";
    String[] sa = s.split(" ");
    SortedMap<String,Integer> map = new TreeMap<String,Integer>();
    for (String w : sa) {
      Integer c = map.get(w);
      if (c == null) c = 0;
      c++;
      map.put(w, c);
    }
  System.out.println(map);
  }
}

When run, it produces this output:

{cat=1, mat=1, on=1, sat=1, the=2}

JavaScript Exercise

We can develop a JavaScript solution along similar lines.  This will demonstrate the use of

  • regular expressions
  • flexible arrays (can use as stacks or lists)
  • maps (associative arrays)
  • sorting with comparators

The split method is similar to the Java version.  Note that both the Java and the JavaScript split functions can take a plain string or a regular expression to use as a basis for splitting words up (i.e. to act as word delimiters).

The word count method wc is shown below.  This takes a string as an argument and creates a map (an associative array) with each word as a key, and the number of times it occurred as a value.

       function wc(str) {
            var sa = str.split(" ");
            var map = {};
            for (var i = 0; i < sa.length; i++) {
                var w = sa[i].toLowerCase();
                var count = map[w];
                if (count == null) count = 0;
                count++;
                map[w] = count;
            }
            return map;
        }

Then we have the overall function to test this in a web browser - note the use of document.getElementById() to get the input and output nodes.  We use node.textContent to get all the text under and HTML element (stripping out any HTML mark-up in the process).

In the current version, this is then overwritten by a simple test string. 

We take the string, and call the word count function to create the map of key value pairs.  These have to then be put into a sortable data structure.  Here we use an array of arrays, where each inner array holds a word together with the number of times it occurred.  JavaScript arrays are very flexible - we can create an empty one and then simply push new values onto it, as follows.

            str = "the zigzag cat mat mat mat sat on the mat....";
            var map = wc(str);
            var ca = [];
            for (var i in map) {
                ca.push([map[i] , i]);
            }

Having got this data structure, we can now sort it by calling a sort method.  In this version we'll sort it based on the number of occurrences.  Note how this is done by passing a comparator function to the sort method.  Consider a list with items [a, b] in it.  The comparator should return a negative number if a should come before b, zero if they are equal in order, and positive if b should come before a.  JaavaScript allows new functions to be declared in-line, or assigned to variables and passed as parameters.  In this case we define it in-line:

            ca.sort(function(a, b) {
                return a[0] - b[0];
            });

Then all that remains is to build the output, and to assign it as the text-content of the output node in the document:

            var op = "";
            for (var j = 0; j < ca.length; j++) {
                op += "{" + ca[j][0] + " : " + ca[j][1] + "} ";
            }
            outputNode.textContent = op;
    

The final for loop could be omitted - we could simply use ca.toString() to convert the array of word occurrence pairs to a string, but the output would be a flat list of comma separated values, which does not look so good.  This produced the following output:

{1 : cat} {1 : zigzag} {1 : mat....} {1 : sat} {1 : on} {2 : the} {3 : mat}

Exercise:  The current version sorts in order of increasing word frequency. Modify it to sort alphabetically.

 

 

 

end of page