Thursday, March 22, 2007

Understanding filter, map and reduce

Most of the computation problems that I have faced in the past could easily be solved using a mix and match of filter, map and reduce operations. These operations could be performed on any set of objects that can be iterated.

Filtering operation is one of the most fundamental operations that we perform more frequently than we think. In simple terms, we define a predicate and check if this predicate is true for each object in the set, iterating over them one by one. Whenever the predicate is true, we append the object to the output, otherwise we ignore. Consider grep tool as an example. The lines in the file(s) are iterable. If the current line is L, the the predicate is the question: "does L contain the pattern XYZ?". The output set has utmost as many elements as input set has.

Mapping is the operation of producing an output for each element in the input, by performing a function on that input. Unlike filtering, which used a predicate to check, the map uses a function. The map operation produces exactly the same number of elements as its input set. An example is printing the square of the all the numbers from 1 to N.

Reduction is an iterative operation performed on each element using that element and the partial result obtained in the previous step. So reduction operation takes two inputs: a function and an initial value to perform the computation with the first element. Reduce operation has exactly only one output.

Python provides language level support for these three operations. In C++ too you could perform these operations using a combination of remove_copy_if (for filtering), transform (for mapping) and partial_sum (for reduction).

Here comes the most interesting part of all. I claimed that most of the computation problems could be solved easily by using a combination of these three operations. Let us take an example and explore.

Example of billing a subscriber for his phone calls
Let us assume that we are receiving a huge volume of call records. We would like to calculate the monthly bill amount for user A. How would we do that? In three steps:
  1. Extract all the records for user A. (This is filtering with predicate: "is this call record for user A?")
  2. Find the day and night time tariff rates as per the time of the call. (This is mapping. With a function that returns the call rate depending on the time of the call.) And multiply the rate with the actual duration. (Again a mapping operation.)
  3. Sum up all the charges. (Reduction with initial value of zero or may be his previous month balance)
An important this to understand is we could also perform interleaved mapping and reducing in steps (2) and (3) above. For e.g. in the above case, finding the rate, calculating the call charge and adding that to the partial charges so far could be done in the same iteration. Yet, it remains that they are map and reduce operations. Such implementation tweaks could always be done to boost the performance.

No comments: