Clojure Sequences

By Stephen Molitor, OCI Senior Software Engineer

MAY 2011

Introduction

In our last SETT article on Clojure, we introduced the immutable style of programming favored by Clojure: Immutability in Clojure - Part 1, Programming Without Variables. We showed how recursion can be used to do things like looping over a collection that would ordinarily require setting mutable variables.

Traditionally, the functional programming languages Scheme and LISP that Clojure inherits from relied heavily on recursion. However we also said that recursion is not used quite as extensively in Clojure; instead, Clojure provides 'lazy sequences' which can provide a more efficient alternative to recursion in many cases.

This SETT article examines how to use lazy sequences in Clojure. We will also show how immutable collection data structures are efficiently shared via Clojure's persistent collections.

First, however, we will introduce Clojure's normal (non-lazy) sequences.

Sequences Everywhere

In Clojure, every collection like data structure can be treated like a sequence:

lists
vectors
sets
maps

In addition, many other things can behave like sequences:

strings
regular expression matches
directories
I/O streams
XML documents
etc....

Sequence Core API

A sequence supports three core functions:

first: gets the first item in a sequence
rest: gets everything after the first item in a sequence.
cons: constructs a new sequence

Sequence Examples

In the following code examples the result of an expression is shown with the comment character ';' followed by '->' and the result, like this:

(+ 2 2) ; -> 4

(+ 2 2) is the actual Clojure expression, and 4 is the result of the expression.

Getting the first element of a sequence

The following code creates a vector of the three integers 1, 2, 3, and then retrieves the first element ('1'):

(first [1 2 3]) ; -> 1

The first function returns nil if passed an empty sequence or nil. The rest function returns another sequence every element following the first element:

 (rest [1 2 3]) ; -> (2 3)

rest will return an empty sequence when passed a sequence with less than two elements. The cons function constructs a new sequence from a single element and another sequence:

(cons 4 [1 2 3]) ; -> (4 1 2 3)

Building on these three core three sequence functions, first, rest and cons, Clojure's standard library provides functionality for filtering, querying and manipulating sequences similar to what one finds in the Java collections API, Guava collections, etc.

The full functionality of Clojure's sequences is outside the scope of this article, but let's take a peek at a few key functions to get the flavor of things.

Note that all of the examples that follow can be applied to any sequence like data structure in Clojure, including I/O streams, regular expression matches, XML documents, etc.

Filtering Sequences

(filter even? [1 2 3 4]) ; -> (2 4)

The filter function takes a test function, in this case, the even? function, calls the function on each element in the original sequence, and returns a new (lazy) sequence containing all the elements in the original sequence for which the test function returned true.

Querying Sequences

 (every? even? [1 2 3 4]) ; -> false
 (every? even? [2 4 6 8]) ; -> true

The every? function takes a test function, in this case even?, and returns true if the test function returns true for every element in the sequence.

Transforming Sequences

(map #(* 2 %) [1 2 3]) ; -> (2 4 6)

The map function takes a transformation function, calls the transformation function on each element on the sequence, and returns a new sequence of transformed elements. In this case we are using Clojure's abbreviated syntax for an anonymous function:

#(* 2 %0) ; -> creates an anonymous function that multiplies its argument by 2

The # character marks the beginning of an anonymous function definition, surrounded by parenthesis. The implicit single argument is referred to by the % symbol.

Manipulating Sequences - Persistent Collections and Structural Sharing

Clojure sequences such as lists, vectors, sets and maps are immutable. So how do we 'add' items to a collection, if we can't modify the collection?

Instead of modifying the original collection, we create a new collection that contains all the elements of the original collection, plus the new element that we want to add.

Here is an example using maps:

 (def m1 {"a" 1 "b" 2}) ; -> create map m1 with two key/value pairs
 (def m2 (assoc m1 "c" 3)) ; -> 'add' new key/value pair to m1, assign to m2

First we create map m1 using the Clojure curly brace syntax for map literals. The m1 has two key value pairs, "a" -> 1 and "b" -> 2. Next, we call the assoc function which takes three arguments: a key ("c"), a value (2), and an existing map (m1). The assoc call returns a new map containing all three key / value pairs.

A natural question arises here: Isn't this horribly inefficient? Surely we can't afford to make copies of entire collections every time we want to 'add' one element.

The answer is that Clojure's collection use a technique called 'structural sharing' to compute only the differences between two different versions of a collection and share the common bits in memory. Version control systems like Subversion or GIT use similar techniques.

Logically, a version control system stores separate copies of every version of a file. Physically, however, it only stores the diffs, and computes the full version of a file on demand. Clojure's 'persistent' collections work the same way.

Here is another example using lists. First let's create an initial list l1 with two elements:

(def l1 (list 1 2))

Next let's create new two lists, l2 and l3 from contents of l1 and 'adding' one extra element:

(def l2 (cons 3 l1))
(def l3 (cons 4 l1))

Logically, we have three separate lists. In memory however, lists l2 and l3 share the contents of the first list l1.

We can demonstrate this as follows.

First let's verify that the contents of the 'rest' of l2 and l3 are equivalent to the contents of l1:

(= (rest l2) l1) ; -> returns true
(= (rest l3) l1) ; -> returns true

That just proves that the cons function worked correctly.

But let's see if the contents are identical and therefore shared in memory:

(identical? (rest l2) l1) ; -> returns true
(identical? (rest l3) l1) ; -> returns true

The identical? tests to see if two objects point to the same region in memory. It is similar to the == operator in Java. As we can see, the contents of list l1 are shared between all three lists,l1, l2 and l3.

Lazy Sequences

With the introduction to basic sequences out of the way let's look at lazy sequences.

A lazy sequence in Clojure is a virtual sequence that behaves just like a 'real' sequence except that its elements are computed on demand, lazily.

The filter and map functions demonstrated above actually return lazy sequences.

Infinite Sequences

Since each element is realized on demand, we can even create infinite lazy sequences (as long as we don't try to realize every element in the infinite sequence).

An example of an infinite sequence would be the set of all whole numbers. Let's create a lazy sequence that produces whole numbers.

To create a lazy sequence we use the lazy-seq macro. The lazy-seq macro takes an expression that returns a sequence, and only invokes that expression as it is needed. Our first version creates a lazy sequence that increments from a starting number:

(defn whole-numbers-from [n]
   (cons n (lazy-seq (whole-numbers-from (inc n)))))

We use the cons function to create a sequence whose first element is n, and the rest is the lazy sequence returned by recursively calling whole-numbers-from again, incrementing the value of n by one. The key is the call to lazy-seq. Without this the function would recur forever, until the end of time or we run out of whole numbers, whichever comes first. (Actually we would get a stack overflow error first). With the lazy-seq call, the information needed to generate each subsequent element is remembered, or memorized, to be produced on demand, as needed. We can retrieve a finite set of elements from our logically infinite sequence using the take function:

take 3 (whole-numbers-from 0)) ; -> (0 1 2)

We use the cons function to create a sequence whose first element is n, and the rest is the lazy sequence returned by recursively calling whole-numbers-from again, incrementing the value of nby one. The key is the call to lazy-seq. Without this the function would recur forever, until the end of time or we run out of whole numbers, whichever comes first. (Actually we would get a stack overflow error first). With the lazy-seq call, the information needed to generate each subsequent element is remembered, or memoized, to be produced on demand, as needed. We can retrieve a finite set of elements from our logically infinite sequence using the take function:

(take 3 (whole-numbers-from 0)) ; -> (0 1 2)

As each element is produced, whole-numbers-from is called again with the incremented value ofn, and the information for the previous call to whole-numbers-from is discarded and eligible for garbage collection. Thus nothing is placed on the stack; the stack does not nest more than one level and we do not get a stack overflow, even for large values of n.

We can provide a simpler version that starts at 0 using the built in iterate function:

(def whole-numbers (iterate inc 1))

The iterate function takes a function and starting value as arguments, and produces a lazy sequence of n, (f n), (f (f n)) etc. Note that we do not need to create whole-numbers as a function definition using defn. Instead we assigned the lazy sequence returned by the iterate call directly to the symbol definition whole-numbers using def. We can call our whole-numbersdefinition like this:

(take 3 whole-numbers) ; -> (1 2 3)

We can call standard sequence functions including first, filter, map, etc on our lazy sequence:

(first whole-numbers) ; -> 1
(take 3 (map #(* 2 %) whole-numbers)) ; -> (2 4 6)

The call to take is very important - we can't transform the entire set of whole numbers without running out of memory first!

Simple Recursion, Tail Recursion, Lazy Sequences, Standard Sequence Library - A Tale of Four Fibonaccis

To review what we have covered so far, let's look at four implementations for calculating the numbers of the Fibonacci sequence.

The Fibonacci numbers are defined as follows: The first two Fibonacci numbers are 0 and 1, and each of the following numbers is the sum of the previous two. So the third Fibonacci number is 1, because 0 + 1 = 1. The fourth Fibonacci number is 2, because 1 + 1 = 2. The first ten Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.

We will create four implementations for calculating the Fibonacci sequence.

The first naive implementation will use normal, non-tail recursion. It should be easy enough to follow, but will not be efficient.
Second, we will refactor this implementation to use tail recursion. This version will be more efficient but harder to follow.
Next we will implement the sequence using lazy sequences.
Finally we will look at an elegant implementation that uses some of the built in functions in Clojure's standard sequence library.

1. Simple (non-tail) Recursion

Our first, naive implementation matches the problem definition well.

           ; naive implementation:
            (defn fib [n]
              (condp = n
                  0 0
                  1 1
                  (+ (fib (- n 1)) (fib (- n 2)))))
 
            ; call like this:
            (fib 9)  ; -> 34

The condp macro is a kind of switch statement. It takes the name of a two argument test function, = in this case, a value to use as the first argument to the test function, and a set of clauses where the first element of the clause is passed to the test function and the second part of the clause is the code to execute if the test evaluates to true.

So, our fib function returns 0 if n is 0, 1 if n is 1, and otherwise returns the sum of the previous two Fibonacci numbers. This fits our English language description of the Fibonacci sequence above well: the first Fibonacci number (0) is 0, the second (1) is 1, and each following number is the sum of the previous two. However, this implementation will not work for large values of n:

(fib 10000000) ; -> java.lang.StackOverflowError :(

Each call to fib recursively places two more calls on the stack. You will quickly overflow the JVM stack with this approach.

2. Tail Recursion

In the previous article in this series, we saw how to use the recur function to replace simple recursion with tail call recursion. Tail call recursion allows the compiler to unroll the recursion into a non-recursive loop that is more efficient.

In Clojure this is achieved by replacing the recursive function call with recur. The call to recur must be last expression evaluated in the function (thus the term 'tail call'). However, in our fib function the + function is the last expression. The two recursive calls to fib happen before the + call. If we attempt to replace the two calls to fibwith recur our code will throw a java.lang.UnsupportedOperationException. We will have to rearrange our function to allow for just one recursive call, and that call will need to be the last expression evaluated. One approach is to introduce two accumulator values to 'remember' the two previous Fibonacci numbers on the stack:

      (defn tail-fib
            ([n] (tail-fib n 0 1))
            ([n prev2 prev1]
               (condp = n
                   0 prev2
                   1 prev1
                   2 (+ prev1 prev2)
                   (recur (dec n) prev1 (+ prev1 prev2)))))

Our new function is overloaded based on the number of arguments based. If passed a single argument n, it calls itself with three arguments: (tail-fib n 0 1). The the last two arguments, prev1 and prev2 are the accumulator values that 'remember' the last two Fibonacci numbers calculated. Since the first two numbers of the sequence are 0 and 1 we can hard code those two and then start accumulating, with the call to recur. The recur is the last expression of the function.. This implementation works, even for large values of n, but is harder to read. It does not match the problem definition as well.

3. Replacing Recursion with Lazy Sequences

The phrase 'Fibonacci sequence' suggests an alternative implementation: a lazy sequence that calculates each succeeding Fibonacci number on demand. This approach turns out to be fairly straightforward (once one is comfortable with lazy sequences), and efficient:

       (def fib-seq 
             ((fn fib [prev1 prev2] 
                (lazy-seq (cons prev1 (fib prev2 (+ prev1 prev2)))))
              0 1))
 
        (take 10 fib-seq) ; -> (0 1 1 2 3 5 8 13 21 34)
        (nth fib-seq 9) ; -> 34

The definition of fib-seq is similar to whole-numbers-from above. The fn call creates a little helper function, fib, that takes the two previously accumulated numbers in the sequence, prev1 and prev2. It returns a lazy sequence whose first element is prev1. The following elements are the result of the fib helper function recursively calling itself, passing prev2 and the sum of the previous two numbers. After creating the helper function fib, our sequence definition invokes fib with the first two values of the sequence, 0 and 1. This is the reason for the double level of parenthesis around the fn call: first the function fib is defined, and then it is called with the arguments 0 and 1 to see our sequence. Since this recursive call is placed inside the call to lazy-seq, it is only invoked on demand, as needed, one element at a time. The stack does not nest and we do not run out of stack space, even for large values of n.

4. Using the Standard Sequence Library

The Clojure standard library comes with many useful functions that create lazy sequences. We have already seen two of them, iterate and map. We can use these functions to create a simpler version of the Fibonacci sequence:

            (def fibs (map first (iterate (fn [[prev1 prev2]] [prev2 (+ prev1 prev2)]) [0 1])))
 
            (take 10 fibs) ; -> (0 1 1 2 3 5 8 13 21 34)
            (nth fibs 9) ; -> 34

The call to iterate loops over pairs of the two previously accumulated Fibonacci numbers. Recall that iterate returns a lazy sequence. The Fibonacci sequence itself is simply the first element of each pair, so the call to (map first... simply translates the nested sequence of pairs into single elements, the elements of the Fibonacci sequence. The map function also returns a lazy sequence. This version was suggested by Christophe Grand and described in the book Programming Clojure by Stuart Halloway. This version reuses existing functions, is the simplest version, and is efficient.

Summary

In this SETT we demonstrated how Clojure's persistent collection library uses structural sharing to allow for efficient use of immutable collections. We also showed how Clojure's lazy sequences can create efficient, virtual collections over large (even infinite) sequences. We implemented different methods of calculating the Fibonacci sequence to show how to replace simple recursion with more efficient tail recursion, how to replace tail (or simple) recursion with lazy sequences, and how the lazy sequence functions in the Clojure standard library often remove the need to implement your own lazy sequences.

References

[1] Clojure Home Page
http://clojure.org
[2] Programming Clojure
http://pragprog.com/titles/shcloj/programming-clojure
[3] InfoQ article on Clojure Collections
http://www.infoq.com/articles/in-depth-look-clojure-collections

The Software Engineering Tech Trends (SETT) is a monthly newsletter featuring emerging trends in software engineering.

Clojure Sequences