Software Transactional Memory
By R. Mark Volkmann, OCI Partner
- Lock-based Concurrency
- Actor-based Concurrency
- Transactional Memory
- Software Transactional Memory (STM)
- Persistent Data Structures
- STM Implementations
- Clojure Overview
- Clojure Reference Types
- Clojure Validators and Watchers
- Clojure STM - High Level
- Clojure STM - Low Level
Writing software applications in which parts of the application run in concurrent threads introduces many challenges not present in single-threaded applications. Since the order of operations isn't fixed, failures can be difficult to repeat. Also, thoroughly testing the software requires more effort.
There are two general categories of multithreaded software. In one category the goal is to divide the work involved in processing a single "job" into pieces and run the pieces concurrently in order to complete the job sooner. In the other category the goal is to coordinate the execution of multiple "jobs", allowing them to execute concurrently so that as a group they complete sooner than if they were run serially. In both cases there can be contention over the data being accessed. However, the issue is perhaps more common and harder to avoid in the second category.
There are many models for creating concurrent software. The most popular today are based on locks, actors and transactional memory. This article briefly describes all three models and then dives deeper into software-based transactional memory (STM) .
The goals of this article are to:
- spread knowledge about STM
- support reasoning about the performance characteristics of STM
- encourage implementations of STM for many programming languages
- facilitate the ability of interested developers to suggest improvements to the STM implementation in the Clojure programming language
- provide developers with enough understanding of the Clojure STM implementation that they can develop tools to help with tuning STM usage (such as tracking the number of times a transaction retries and why it retries)
Please send feedback on errors and ways to improve explanations to firstname.lastname@example.org. Updates to this article that indicate the "last updated" date and provide a dated list of changes will be provided at .
Lock-based concurrency provides a mechanism for threads to safely access shared memory. In its simplest form, exclusive ownership of a lock must be acquired before a given block of code is executed. This is used to ensure that only one thread at a time can execute the code. In object-oriented programming languages, often the lock is associated with an object that is accessed by code in the block. Sometimes multiple objects are accessed and a single lock object is designated to provide exclusive access to that group of objects.
Benefits of using lock-based concurrency include the following:
- Developers have explicit control over when locks are obtained and released which allows optimal solutions.
- Many developers are already familiar with this form of managing concurrency.
- Many programming languages support the use of locks.
Issues with using lock-based concurrency include the following:
- It can be difficult to determine which lock(s) need to be obtained in order for a given block of code to execute safely.
- Variables for which a lock should be acquired can be accessed even when the wrong lock(s) are acquired and even when no locks are acquired.
- A thread is in a deadlock state when it is unable to acquire all the locks it needs to proceed because other threads own them and are themselves deadlocked. This can easily occur unless locks are acquired in a common order.
- Recovering from errors can be complicated because developers must remember to release locks that were acquired by the failing code.
- Correctly synchronized methods cannot be composed into compound methods without additional synchronization.
- The lock-based approach is pessimistic. It assumes that if multiple threads are each running sections of code that access the same memory then only one at a time can safely run. This is not always true. Making this assumption reduces the amount of concurrent processing that can occur.
The actor model  is an alternative to shared memory concurrency, such as lock-based concurrency and transactional memory. Actors are software entities that execute as separate processes or light-weight threads. Rather than accessing shared memory, actors only use and retain data that is passed to them in asynchronous messages. When an actor receives a message, it can do the following things concurrently or in any order:
- create new actors
- send messages to other actors
- specify how the next message it receives will be handled
Benefits of using actor-based concurrency include the following:
- Since no memory is shared between actors, access to data doesn't need to be synchronized.
Issues with using actor-based concurrency include the following:
- Since memory isn't shared between actors, some messages may need to be large. This negatively affects performance.
- Since memory isn't shared between actors, the data each actor holds can become inconsistent with the data held by other actors. There is no general mechanism for coordinating the activities of multiple actors.
The actor model is a central feature of the Erlang programming language. It is also supported by Scala and Haskell, but not Clojure .
For a more detailed description of actors, see http://en.wikipedia.org/wiki/Actor_model.
Transactional memory provides an alternative to lock-based and actor-based concurrency that simplifies writing applications in which concurrently running threads access shared memory.
The concept is similar to that of database transactions which provide ACID characteristics:
- "A" is for atomic.
- "C" is for consistent.
- "I" is for isolated.
- "D" is for durable.
"Atomic" means that either all the changes in a transaction will be made successfully (commit) or none of them will be (rollback). "Consistent" means that all the data seen by a transaction at its start and end will be consistent. In other words, constraints on the data will not be violated. "Isolated" means that changes made inside a transaction are not visible outside the transaction until it commits. "Durable" means that after a transaction has committed, changes it made will not be lost, even if there is a network, hardware or software malfunction.
Transactions are demarcated in code by a special syntax that varies across implementations. From the perspective of other threads, all the memory changes made within a transaction appear to happen at the same moment when a transaction is finished committing. The changes are not visible to other threads before then. Transactions operate on a consistent snapshot of the memory. If any memory that is written within transaction "A" is modified and committed by transaction "B" before "A" commits, the code in "A" is rerun. These characteristics make transactional memory atomic, consistent and isolated. Note that transactional memory is not durable. If the software crashes or there is a hardware malfunction, data in memory is typically lost. When durability is required, using a relational database is often recommended.
Transactional memory is optimistic. Each transaction assumes it will be able to run its code without concurrently running threads changing the values it writes. When this assumption doesn't hold, the transaction discards all the work it has done and retries from the beginning. The possibility of retries makes it necessary to avoid any actions (side effects) that shouldn't be repeated or cannot be undone. This includes I/O operations. Clojure provides a solution for this that involves a combination of Refs and Agents. This is discussed later.
In descriptions of the retry process later, this article distinguishes between a "transaction" and a "transaction try". A transaction includes one or more transaction tries. A transaction that completes without having to retry runs a single transaction try. Otherwise there are more than one.
Transactional memory distinguishes between committed values of variables and in-transaction values. Inside a transaction, the value of a variable starts out as either its initial value or the last value committed by a transaction. When the value of a variable is modified inside a transaction, that change is only seen and used in the transaction. When the transaction commits, changes become visible to code outside the transaction.
Benefits of using transactional memory include the following:
- It provides increased concurrency which means there are more opportunities for processing to be performed simultaneously instead of serially. This is especially true for transactions that only read data. Lock-based concurrency doesn't allow this kind of overlapping execution because it takes a pessimistic approach rather than an optimistic one.
- It is easier to write correct code using transactional memory than writing code that uses locks. The need to determine the locks to be acquired and the order in which to acquire them is removed. Instead, developers identify sections of code that require a consistent view of the set of variables it reads and writes.
- Implementations can guarantee that deadlock , livelock  and race conditions  will never occur.
Issues with using transactional memory include the following:
- There is a potential for a large number of transaction retries resulting in wasted work.
- There is overhead imposed by transaction bookkeeping such as storing histories of committed values, storing in-transaction values and acquiring locks before committing changes.
- Tool support is currently lacking. For example, having tools that identify how often each transaction retries and why they retry (such as learning which variables had write conflicts) would make it easier to tune applications when necessary.
Transactional memory works best in programming languages that distinguish between mutable and immutable variables and require mutable variables to be mutated inside a transaction (like in Clojure and Haskell). Without these features, developers are on their honor to only modify mutable variables inside transactions. They must ensure that no data that requires coordinated changes is modified outside an STM transaction. This is similar to the issue with lock-based concurrency where developers are on their honor to acquire the correct lock(s) before accessing variables where this is expected.
Garbage Collection Analogy
Dan Grossman, an associate professor at the University of Washington, wrote a paper titled "The Transactional Memory / Garbage Collection Analogy" . The paper illustrates several ways in which transactional memory (TM) is similar to garbage collection (GC). Some quotes from a related presentation follow:
"Many used to think GC was too slow without hardware."
"Many used to think GC was about to take over, decades before it did."
"Many used to think we needed a back door for when GC was too approximate."
These same thoughts are often expressed about TM today. Perhaps over time these thoughts will change in the same way they have changed for GC. Dan discusses this paper in episode 68 of the "Software Engineering Radio" podcast .
Software Transactional Memory (STM)
Transactional memory can be implemented by hardware or software. This article focuses on software implementations which are commonly referred to as STM. STM is an integral part of some programming languages. For others, it is supported through add-on libraries.
Persistent Data Structures
One way programming languages avoid the issue of data being modified by concurrently running threads is by providing immutable data structures or collection classes. Clearly data that cannot change doesn't need to be protected. It is often desirable to be able to create new data structures that are similar to existing ones, for example, a list with a new item added at one end or a hash map with a new key/value pair added. Persistent data structures  provide a way to do this where new data structures share memory with existing ones. Using these saves both memory and time. The benefit of having immutable data structures that are not persistent is questionable since creating new ones from them is slow and can consume a large amount of memory.
Many STM implementations are listed in the STM Wikipedia entry . Programming languages for which STM implementations exist include: C, C++, C#, Clojure, Common Lisp, Haskell, Java, MUMPS, OCaml, Perl 6, Python, Scheme and Smalltalk. For many of these languages there are multiple STM implementations from which to choose.
Just as there are many ways to implement garbage collection, there are many ways to implement STM. It is difficult to make general statements about the characteristics of STM implementations, such as memory usage and performance. The remainder of this article focuses on the Clojure implementation.
One aspect of the Clojure STM implementation that differs from many others is that data coordinated by transactions cannot be modified outside a transaction. With some implementations, developers are "on their honor" to use them correctly (for example, Deuce STM).
Clojure is a functional programming language that is a Lisp dialect and runs on the Java Virtual Machine (JVM). It was created by Rich Hickey. For a detailed discussion on Clojure, see my earlier article .
Clojure Reference Types
In Clojure all variables are immutable unless they refer to a reference object. There are four types of reference objects: Var, Atom, Agent and Ref. Each type is described below. For more detail, including sample code, see my earlier article.
Vars are variables that have a root value that is shared between all threads and can have thread-specific values. The functions that modify the value of a Var include
binding. The value of a Var is obtained by directly referencing it, as in other programming languages. Vars are mainly used for constants. In general, modifying the value of a Var is discouraged.
One use of Vars where modifying the value is seen as acceptable is for configuration variables. For example, if the special variable
*warn-on-reflection* is set to true, Clojure will output warning messages whenever it uses reflection to determine the type of an object. This is useful while tuning an application for performance as it provides guidance in determining where adding optional type hints may help.
Atoms are variables that have a single value that is shared across all threads. They are accessed in an atomic way, so it is safe for multiple threads to concurrently read and write them. The functions that modify the value of an Atom include
swap!. The value of an Atom is obtained by dereferencing it using
Agents are variables that, like Atoms, have a single value that is shared across all threads. They are modified asynchronously by invoking a function, called an "action", in another thread. The value returned by the action becomes the new value of the Agent. The value of an Agent is obtained by dereferencing it using
Actions sent to the same Agent are queued up so that only one action at a time will be run per Agent. The functions that queue an action are
send-off. They return immediately since the specified action is executed asynchronously. They differ only in the thread pool that is used. One pool has a fixed size and the other has a variable size. The action is is passed the current value of the Agent and any other arguments that were passed to
await function pauses the current thread until all actions sent to a given set of Agents have completed. The
await-for function is similar, but supports a timeout.
Actions sent to agents inside a transaction are held until after the transaction commits. This feature is sometimes used as a way to attach code that has side effects, such as I/O, to a transaction.
Refs are variables that, like Atoms and Agents, have a single value that is shared across all threads. They can only be modified inside an STM transaction. The functions that modify the value of a Ref include
commute. The value of a Ref is obtained by dereferencing it using
In this article, when discussing Refs, the terms "read" and "dereference" are used synonymously. A transaction is not required in order to dereference a Ref, but one is needed to get a consistent snapshot when dereferencing multiple Refs.
Refs are the only type of variable in Clojure that is coordinated by STM. For this reason, they are the focus of this article.
The most direct way to change the value of a Ref is to use the
ref-set function. It takes the Ref whose value is to be modified and the new value.
For both the
commute functions the first argument is the Ref whose value is to be modified and the second argument is a function that returns the new value. When this function is invoked, it is passed the current value of the Ref and any additional arguments that were passed to
commute. The return value becomes the new value of the Ref.
alter instead of
ref-set is typically seen as more idiomatic. A conversation with the CIP (Clojure Idiomatic Police) might go as follows:
CIP: Why did you use
ref-set instead of
You: Well, I already had the value I wanted to use for the Ref, so I just passed it to
CIP: Where did you get that value?
You: It was computed in a function I called.
CIP: Why didn't you just pass the Ref, the function and its arguments to
alter? Don't be afraid to pass functions to functions. This is encouraged in functional programming.
In most cases,
alter is preferred over
commute is only appropriate when the order of changes to the Ref made by concurrently running transactions doesn't matter. This is related to the mathematical definition of "commutative" which includes the phrases "the result obtained ... does not differ with the order". Choosing to use
commute basically says "I'm going to modify a Ref, but I don't care if another transaction modifies it before I commit my change. Decisions I make in this transaction based on my value of the Ref will still be correct. I want the value of the Ref to be recomputed when I'm ready to commit based on the newest value of the Ref."
Some situations where
commute makes sense include accumulating objects in a collection and storing values calculated from a collection (such as a minimum, maximum or average). Suppose a collection is held in a Ref. If two concurrent transactions add an object to the collection, in many cases it doesn't matter which adds to it first. There may be no reason for transaction "A" to retry just because transaction "B" added to the collection before "A" committed. Suppose another Ref holds a value computed from the collection such as the maximum value in it. There may be no reason for transaction "A" to retry just because transaction "B" committed a change to the maximum value. It can be safely and correctly computed again based on the latest contents of the collection.
alter is used and another transaction has committed a change to the Ref since the current transaction try began, the current transaction will be retried. When
commute is used, the current transaction will continue and the in-transaction value of the Ref will be passed to the function that is passed to
commute. This results in better performance because retries aren't required.
The functions and arguments passed to
commute during a single transaction execution are saved in a sorted map. The keys in the map are
Ref objects and the values are lists of objects that describe the functions and their arguments. The map is sorted based on the order in which the
Ref objects were created. When the transaction is committing, a write lock is acquired for each Ref in the map in sorted order and all its commute functions are called a second time to determine the Ref values that will be committed. Obtaining the write locks in this sorted order prevents deadlocks that might otherwise occur if multiple transactions attempt to commit their changes at the same time. The Ref values passed to the commute functions this time depend on whether another transaction has committed a new value since the current transaction try started. If so then the newest committed values are used. Otherwise, the in-transaction value is used. This means the values passed to the commute functions can differ from the values passed the first time they are called.
Within a transaction, a Ref cannot be set to a new value using
commute has been called on it.
Sometimes it is desirable to prevent other transactions from changing the value of a Ref that will be read by the current transaction, or perhaps modified later in the current transaction. (An example is preventing write skew which is discussed later.) This is achieved by calling
ensure on the Ref. While this guarantees that no other transaction can modify the Ref, it doesn't guarantee that the current transaction will be able to modify it. This is because any number of transactions can call
ensure on the same Ref.
The following table is useful for selecting between Atoms, Agents and Refs when creating mutable variables. The word "coordinated" here means managed by a transaction.
Clojure Validators and Watchers
Before diving into the implementation details of Clojure STM, it is important to have a basic understanding of some features of Clojure that are related to the implementation.
Validators are functions that are invoked whenever the value of a given reference type object is being modified. If the function determines that the change isn't "valid", it can return false or throw an exception to prevent the change from occurring. Each reference type object can have only one validator function. The
set-validator! function assigns a validator function to a reference object.
There are two mechanisms for being notified when the value of a reference object may have changed, watch functions and watcher Agents.
Watch functions must take four arguments which are an identifier "key", the reference object that may have changed, its old value and its new value. The key can be used to indicate the purpose of the watch or it can be any data to be made available to the watch function when it is invoked. Each reference object being watched can have a single watch function for each key value. A watch function is registered with a reference object by calling the
add-watch function which takes the reference object, a key and a watch function. A watch function is removed from a reference object by calling the
remove-watch function. which takes a reference object and a key.
Watcher Agents are notified that a reference object has changed by sending a given action to the Agent. The action function is passed the current value of the Agent and the reference object that changed, but not the old value. A watcher Agent is registered with a reference object by calling the
add-watcher function which takes a reference object, a send type (which identifies the thread pool to be used), an Agent, and the action function to be sent. The
add-watcher function creates an anonymous function that sends the action to the Agent and registers that with the reference object by passing it to the
add-watch function. The Agent is used as the key. A watcher Agent is removed from a reference object by calling the
remove-watcher function which takes the reference object and the Agent.
Clojure STM - High Level
This section describes the current Clojure STM implementation at a high level. Low level details are provided later in this article. The information is accurate as of Clojure 1.0 (git commit 94c4a6a on 8/3/2009), but is subject to change in the future. This article should not be viewed as a specification for Clojure STM.
It should be emphasized that understanding the details of the Clojure STM implementation is NOT required in order to use it successfully. Still, there are many reasons (listed in the goals of this article) that one might be interested in the details.
Currently Clojure is implemented by a combination of Java and Clojure source code. Nearly all the STM implementation is written in Java. An effort is underway to dramatically increase the amount that is implemented in Clojure. When that effort is complete, most or all of the code described in this and the next section will be discarded. It seems reasonable to speculate that the STM mechanisms will remain similar.
My intention is to update this article whenever the Clojure STM implementation changes so it remains an accurate source of information that is easier to digest than simply reading the source code.
The Clojure STM implementation is based on multi-version concurrency control (MVCC)  and snapshot isolation . The main difference between the standard definition of these concepts and the Clojure implementation is that Clojure works with in-memory values of variables instead of database tables, rows and columns. Wikipedia describes these concepts very well as follows. Parts related to databases are in square brackets. My clarifiations are in non-italic text.
"MVCC uses timestamps or increasing transaction IDs to achieve serializability. MVCC ensures a transaction never has to wait for a [database] object by maintaining several versions of an object. Each version would have a write timestamp and it would let a transaction read the most recent version of an object which precedes the transaction timestamp."
"If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk) (that also wants to write it), the timestamp of Ti must precede the timestamp of Tk for the object write operation to succeed. Which is to say a write cannot complete if there are outstanding transactions with an earlier timestamp."
"Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp, the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction." (The Clojure STM implementation does not use read timestamps.)
"The obvious drawback to this system is the cost of storing multiple versions of objects [in the database]. On the other hand reads are never blocked, which can be important for workloads mostly involving reading values [from the database]. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs."
"A transaction executing under snapshot isolation appears to operate on a personal snapshot [of the database], taken at the start of the transaction. When the transaction concludes, it will successfully commit only if the values updated by the transaction have not been changed externally since the snapshot was taken."
An issue with snapshot isolation is that it allows write skew . This happens when concurrent transactions read common sets of data and make changes to different data within that set and there are constraints on the data.
For example, suppose a town places a restriction (constraint) on the total number of dogs and cats that a family can own. Let's say the limit is three. When a person obtains a new dog or cat, they are entered in a database. John and his wife Mary have one dog and one cat. John adopts another dog while at the same time Mary adopts a cat. These transactions occur concurrently. Remember that transactions only see changes made by other transactions that have committed. John's transaction attempts to modify the number of dogs they own. The constraint isn't violated because they now have a total of three. Mary's transaction attempts to modify the number of cats they own. Like in the other transaction, the constraint isn't violated because they now have a total of three. Both transactions are allowed to commit, resulting in a total of four dogs and cats which violates the constraint. This is permitted because neither transaction attempts to commit a change to data that is being modified by another concurrent transaction.
Clojure provides a mechanism for avoiding write skew. See the
ensure function discussed later.
In Clojure an STM transaction is created by passing any number of expressions, called the body, to the
dosync macro (defined in
src/clj/clojure/core.clj). This greatly simplifies coding for concurrency over using explicit locks. It is not necessary to indicate which Refs might be modified in the transaction. However, developers still need to make good decisions about what code should be inside it a
dosync call, based on the set of Refs that need to be in a consistent state while the body executes.
The current Clojure STM implementation utilizes several Java concurrency classes. These include:
LockingTransaction and Ref Classes
The main classes in the Clojure STM implementation are
Ref, both in the
clojure.lang package. Both are located in the
src/jvm/clojure/lang directory. UML class diagrams showing the attributes and methods of those classes and related classes appear below. It will be useful to refer to these diagrams often while reading the remainder of this article.
dosync macro call that wraps each transaction body calls the
sync macro which calls the
LockingTransaction static method
sync macro passes body of the transaction to
runInTransaction as an anonymous function. Each thread has one
LockingTransaction object, maintained as a
ThreadLocal variable. If a
LockingTransaction object hasn't yet been created for the current thread then one is created. If the current thread is already running a transaction then the anonymous function is called inside it, i.e., transactions compose. Otherwise a new transaction is started and the function is called inside it.
The status of a transaction is always one of the following five values:
The meaning of each of these status values is self evident, except perhaps RETRY and KILLED.
When the status is RETRY, the transaction will attempt a retry but hasn't started the next try yet. When that occurs, the status changes to
There are two situations that cause the status of a transaction to be set to KILLED. The first is when the
abort method is called on a transaction. That sets the status to KILLED and throws an
AbortException which causes the transaction to stop without retrying. (In the current version of Clojure, no code calls
abort.) The second is when the
barge method (see the section "Barging" below) is called on a transaction. That sets the status to KILLED and allows the transaction to retry.
Ref object maintains a chain of its committed values (history) in its
tvals field. The length of the chain is controlled by the fields
maxHistory. These default to zero and ten, respectively, but can differ for each
Ref object. Other values can be specified when Refs are created by passing options to the
ref function. They can also be modified using the
ref-max-history functions. See the section "Faults" below to learn why the chain length matters.
Ref object has a
ReentrantReadWriteLock. Any number of concurrent transactions can hold its read lock OR a single transaction can hold its write lock. The only time one of these locks is held for the duration of a transaction is when
ensure is called on a Ref. In that case the transaction will hold the read lock until either the Ref is modified in the transaction or the transaction commits. There is no circumstance under which a write lock will be held for the duration of a transaction. Write locks are obtained only briefly when certain things occur during a transaction and then quickly released. They are acquired again when the transaction commits and then released when the commit completes. For more details on how these
ReentrantReadWriteLocks are used, see the "Clojure STM - Low Level" section which describes the
lock field in
alter on a Ref gives it an in-transaction value that isn't seen by other transactions until the transaction commits. This also sets the
tinfo field of the Ref to refer to an object that describes the transaction that modified the Ref, including the order in which it started relative to others and its current status. This is how one transaction knows that another one has modified a Ref and hasn't yet committed. Think of this like giving the Ref a ticket for admission into the commit party that will hopefully happen later. A Ref cannot have a ticket for more than one of these parties. For more details on how the
tinfo field of a Ref is set, see the "Clojure STM - Low Level" section which describes the
lock method in the LockingTransaction class.
LockingTransaction object maintains a map of in-transaction values for the Refs it has modified in its
vals field. The keys in this map are
Ref objects and the values are the
java.lang.Object values of the Refs. If a Ref is only read inside a transaction, its value is always retrieved from the
tvals chain. This requires walking the chain to find the newest value that was committed before the current transaction started. This can make it inefficient to dereference a Ref multiple times in a transaction unless it is modified before the second dereference. The first time a Ref is modified in transaction, its new value is placed in the
vals map. For the remainder of the transaction, its value is retrieved from the
A "fault" occurs when there is an attempt to read the value of a Ref inside a transaction, but the Ref has no in-transaction value and all the values in its chain of committed values were committed after the current transaction began. This can happen when other transactions have committed a change to the Ref since the current transaction try began. When a fault occurs, the transaction will retry. When a transaction commits a change to a Ref that has experienced a fault in another transaction AND the chain is shorter than
maxHistory, a new node is added to the chain of committed values for the Ref. This also happens if the chain is shorter than
minHistory. Lengthing the chain reduces the likelihood that another fault will occur for the Ref in the future. If no fault has occurred for Ref, instead of adding a new node, the last node in the chain becomes the first and it is modified to describe the newly committed value.
It should be clear from the discussion above that the length of the history chain for each Ref can differ. The chains adapt based on the faults that occur. If a fault never occurs for a given Ref and it has a
minHistory of zero (the default) then the length of its chain will never exceed one. These chains never become shorter. After a fault causes a chain to grow, it either remains at that length or grows even longer when more faults occur.
The term "barge" is used to describe the process of determining whether a given transaction should retry while the current transaction continues. When one transaction attempts to
barge another, it will only succeed if three conditions are met. First, the transaction doing the barge must have been running for at least 1/100th of a second (see the BARGE_WAIT_NANOS constant). Second, this transaction must have started before the transaction to be barged. This means older transactions are favored over newer ones. Third, the status of the other transaction must be RUNNING and must be successfully changed to KILLED. The check of the transaction status and changing it are done atomically. The last condition means that if the other transaction is in the process of committing (status = COMMITTING) its changes, it will not be barged.
When a transaction retries, all the in-transaction changes it has made to Refs are discarded and execution returns to the beginning of the transaction body. There are many scenarios that can cause a transaction to retry. Each is described below.
alterfunction is called on a Ref. This causes an attempt to "lock" the Ref, but one of the following situations occur:
- A write lock cannot be obtained for the Ref because another thread holds a read or write lock for the Ref.
- A change to the Ref has been committed since the current transaction try started.
- Another transaction has made an in-transaction change to the Ref that hasn't been committed yet and an attempt to
bargethe other transaction fails.
- An attempt is made to read the value of a Ref inside a transaction (i.e., dereference it), but one of the following situations occur:
- Another transaction has barged the current one.
- The Ref has no in-transaction value yet, and a fault occurs. This means the chain of committed values for the Ref doesn't contain a value that was committed before the current transaction began.
ensurefunction is called on a Ref, but another transaction has barged the current one.
- The current transaction is committing its changes, but another running transaction has an in-transaction change and an attempt to
A transaction won't retry indefinitely. There is a limit that is specified by the RETRY_LIMIT constant in the
LockingTransaction class. It is currently set to 10,000. This is used in the retry loop inside the
run method in the
LockingTransaction class. If it is exceeded then a generic
Exception object is thrown.
Retries are triggered by throwing a
RetryEx which is a custom subclass of
java.lang.Error defined in
LockingTransaction.java. It extends
Error instead of
Exception so that it won't be caught by user code that simply catches
Exception. The retry loop contains a try block with a catch for
RetryEx. The catch simply allows execution to return to the top of the loop. No other kinds of exceptions are caught, so they cause the retry loop to exit.
IllegalStateException is thrown from many methods in the Clojure implementation. When thrown inside a transaction, the transaction will not retry. Those related to STM are thrown when the following occur:
- An attempt is made to get the
LockingTransactionfor the current thread, but it hasn't been created yet or it isn't running. This happens when any of the following functions are called outside a transaction:
- An attempt is made to get the value of Ref, but no value has been bound to it yet (like an uninitialized variable).
- An attempt is made to set the in-transaction value of a Ref using
alter, but the
commutefunction has already been called on the Ref in the current transaction.
- A validation function for a Ref either returns false or throws an exception.
Deadlocks, Livelocks and Race Conditions
The Clojure STM implementation guarantees that deadlocks, livelocks and race conditions will not occur.
In the context of multithreaded software applications, these terms can be described as follows. A deadlock occurs when concurrently running threads cannot proceed because they are each waiting on a resource for which another has acquired exclusive access. A livelock occurs when concurrently running threads are performing work (as opposed to be being blocked, waiting on resources), but cannot complete due to something that other threads have done or not done. A race condition occurs when the outcome of a thread is affected by the timing of changes to shared state made by another concurrently running thread.
In Clojure STM, this applies to modifying a Ref within a transaction. Suppose transactions A and B start in that order and are running concurrently. If they both attempt to modify the same Ref using
alter (write conflict), preference is given to transaction A because it has been running the longest (provided it has been running for a minimal amount of time). This avoids livelock that might occur if retries were based on the order in which refs are set. Transaction B will retry, possibly multiple times, and will be able to modify the Ref after transaction A commits its changes. When transaction B finally runs to completion, it will do so with a consistent view of the changes committed by transaction A.
When transactions modify Refs using
commute, no write conflicts occurs. Deadlocks are avoided during the commit by obtaining write locks for the Refs in a consistent order (which is the order in which the
Ref objects were created).
Clojure STM Overhead
Using STM instead of explicit locks adds processing overhead. This doesn't necessarily mean that the STM approach is slower. Since STM is optimistic and locks are pessimistic, it is possible for STM to provide better overall performance for some applications due to greater concurrency.
When a Ref that has not yet been modified inside a transaction is read inside the transaction, a value is selected from its chain of committed values. As long as no other transaction has committed a new value for the Ref since the current transaction began, this will be fast since the value used will be that of the first node in the chain. Otherwise some chain walking will take place.
The value of a Ref can only be modified inside a transaction. When this is done, there is overhead incurred to verify that a transaction is running. It also must verify that the
commute function hasn't already been called on the Ref. (It wouldn't make sense to allow this because commute functions are called a second time during the commit and any other setting of the Ref wouldn't have any lasting effect after the transaction completes.) If this is the first time the Ref has been modified in the current transaction, there is overhead incurred in adding the Ref to a set of Refs that the transaction has modified and marking the Ref as being modified by the current transaction (see the description of the
lock method in the next paragraph). Finally, the new value is added to the map of in-transaction values for the current transaction. Subsequent reads of the Ref inside the transaction will use the value from that map instead of selecting a value from the chain of committed values.
lock method in the
LockingTransaction class introduces overhead. This involves many steps that are summarized below:
ensurewas called earlier in this transaction try on the Ref then release the read lock for the Ref.
- Attempt to acquire a write lock for the Ref. If it cannot be acquired then trigger a retry for the current transaction.
- If another transaction has committed a change to the Ref since the current transaction try began then trigger a retry for the current transaction.
- If another transaction has made an as-yet uncommitted change to the Ref, attempt to
bargeit. If that fails then trigger a retry for the current transaction.
- Mark the Ref is marked as being locked by the current transaction by setting its
tinfofield to refer to an
Infoobject that describes the current transaction.
- In any case, release the write lock for the Ref.
When the end of the transaction is reached, there is overhead incurred in committing the modified Refs, which makes their new, in-transaction values visible outside the current transaction. This involves many steps that are summarized below:
- Change the status of the current transaction from
- Rerun any
commutefunctions called on Refs in the transaction. This involves acquiring a write lock for each commuted Ref.
- Acquire write locks for all other Refs modified in the transaction.
- Call all the validate functions that were registered on the Refs that were modified. Retry the current transaction if any of them disapprove of a change.
- Add a node to the chain of committed values, or modify an existing node, for each Ref to be modified (depending on faults,
- Create a
Notifyobject for each Ref that was modified and has at least one registered watcher.
- Change the status of the current transaction from
- Release all the write locks acquired previously.
- Release all the read locks still held due to calls to
- Clear the
commutescollections in preparation for the next retry or new transaction in the current thread.
- If the commit was successful, notify all registered watchers of the changes using data in the
Notifyobjects and dispatch all actions sent to Agents in the transaction.
- Clear the
actionscollections in preparation for the next retry or new transaction in the current thread.
- Return the value of the last expression in the transaction body.
Clojure STM - Low Level
It is not necessary to learn the details presented in this section in order to effectively use Clojure STM. However, this information will be very useful to anyone that wishes to:
- understand exactly what is happening at runtime in Clojure STM
- read the source code for the Clojure STM implementation
- suggest improvements to Clojure STM
- add features to the Clojure STM
- compare the Clojure STM implementation to that of another language
- create an STM implementation for another language
Feel free to skip to the "Conclusion" section near the end if you don't have a need for additional detail.
While reading about the low-level details of the Clojure STM implementation, refer to the earlier class diagrams to aid with understanding the relationships between the many related classes and interfaces. It is also helpful to look at the source code in
LockingTransaction.java while reading this section.
These macros are defined in
|dosync||This macro accepts any number of expressions. It simply passes them to the
|sync||This macro accepts any number of expressions. It runs them inside a transaction. If it is called in the context of an existing transaction, the expressions are run inside it. Otherwise a new transaction is started.
It wraps the expressions in an anonymous function that implements the
These functions are defined in
ref-set functions must be called inside a transaction, i.e., inside a call to the
dosync macro. When the phrase "in-transaction value" is used below it means either the in-transaction value if there is one or the newest committed value before the transaction try began.
|add-watch||This function takes a reference object, a key and a function. It registers the function under the key for the reference object. When the value of the reference object changes, the function is invoked with the key, the reference object, the old value and the new value.
This calls the
|add-watcher||This function takes a reference object, a send type (which identifies the thread pool to be used), an Agent and the action function to be sent. It creates an anonymous function that sends the action to the Agent and registers that with the reference object by passing it to the
|alter||This function takes a Ref, a function that will return its new value, and optional, additional arguments. It changes the value of a Ref to the result of invoking the given function on the in-transaction value of the Ref and any additional arguments.
This calls the
|commute||This function is similar to
Second, when the transaction is committing its changes, all the calls to commute functions are rerun. They aren't necessarily run in the same order in which they were invoked within the transaction body.
In all calls to the commute functions, both in the body of the transaction and during the commit, the Ref value passed to the given function is determined in the same way. If
This calls the
|ensure||This function takes a Ref. It prevents other transactions from setting an in-transaction value for the Ref, which also prevents them from committing a new value. This can be used to avoid write skew. It also helps avoid retries due to write conflicts on the Ref. The current transaction will be able to set a new value for the Ref later in the transaction as long as other transactions haven't called
This calls the
|ref||This function creates and returns a
This calls a constructor in the
|ref-history-count||This function takes a Ref. It returns the current length of its history chain.
This calls the
|ref-max-history||This function takes a Ref and an optional length. If only a Ref is provided, it returns the maximum length of the history chain for the Ref. If a length is also provided, it sets the maximum length and returns it. If the history chain is already longer than the specified length, it is not shortened to the new maximum length. It just won't grow any longer.
This calls the
|ref-min-history||This function takes a Ref and an optional length. If only a Ref is provided, it returns the minimum length of the history chain for the Ref. If a length is also provided, it sets the minimum length and returns it.
This calls the
|ref-set||This function takes a Ref and a proposed new value. It changes the value of a Ref to the given value.
This passes the value to the
|remove-watch||This function takes a reference object and a key. It removes the watch function associated with the given key from the map of watch functions for the reference object.
This calls the
|remove-watcher||This function takes a reference object and a watcher Agent. It removes the watch function with a key that matches the Agent from the map of watch functions for the reference object.|
|send||This function takes an Agent, a function and optional arguments. It passes the function and optional arguments, along with a final argument of
|set-validator!||This function takes a Ref and a function. It sets the validator function for the Ref to the given function.
This calls the
ARef class is defined in
src/jvm/clojure/lang/ARef.java. It is the superclass of the
|validator:IFn||This is the single validator function associated with this Ref.|
|watches:IPersistentMap||This is a map of watch functions, one per key value, for this Ref.|
|addWatch||This method takes an
|getValidator||This method gets the validator function for this Ref.|
|getWatches||This method gets the map of watch functions for this Ref.|
|notifyWatches||This method takes old and new values of the Ref. It invokes all the watch functions for the Ref, passing them their key, this Ref, the old value and the new value.|
|removeWatch||This method takes an
|setValidator||This method takes an
|validate||This method takes a proposed new
Ref Nested Classes
Ref class is defined in
src/jvm/clojure/lang/Ref.java. It is a subclass of the
One static nested class is defined inside the
|TVal||Objects from this class are POJOs that represent versions of values for Refs. They are nodes in a doubly-linked, circular list. The fields of this class are all package-level so they can be accessed directly by the
|msecs:long||This is the creation system time of this object.|
|next:TVal||This is a reference to the
|point:long||This is an ordered identifier of a transaction commit. Values that were committed in the same transaction will have the same value for this field.|
|prior:TVal||This is a reference to the
|val:Object||This is a committed value for the Ref.|
All fields in the
Ref class are package level, so
LockingTransaction objects can directly access them.
|This field is used to count the number of times a transaction has attempted to retrieve the value of this Ref from its history chain, but found no value known before the read point which is an ordered identifier for the beginning of the current transaction try. This can happen if other transactions have committed new values for the Ref since the current try of the current transaction began. Transactions that encounter this situation will retry.
This field is reset to zero in the
|This field holds a unique, ordered identifier for the Ref. Its value is assigned in the Ref constructor by calling
|This field is used to assign a unique identifier to each
|This field maintains a pair of locks, one for readers and one for writers. Only one thread at a time can hold the write lock. When the write lock is held, no thread can hold the read lock. Otherwise any number of threads can hold the read lock. Being "reentrant" means that if a thread holds a lock and requests it again, perhaps through a recursive call, the request will succeed (since it already owns the lock) and execution will continue.
No other methods use the read and write locks for
|These fields are used to limit the number of historical values that are retained for the Ref. They default to zero and ten, respectively. Details on how these values are used is provided later in the description of the
|This field holds the status of the transaction that has made an uncommitted change to the value of this Ref, but hasn't yet committed the change. When not null, it serves as an indication that the Ref has an active writer.|
|tvals:TVal||This field holds the newest node in a circular chain of
|validator:IFn||This field represents a Clojure function that will be invoked whenever the value of the Ref is about to be modified. A reference object can have only one validator function at a time. The validator function should return a value that can be cast to a boolean whose value indicates whether the current value of the Ref is valid. Alternatively, it can throw an exception if the value is not valid. Returning false or throwing an exception prevents the change from taking place and causes an
|This field holds a map of functions that are called after a change to a Ref has been committed. See the earlier section "Clojure Validators and Watchers" for information on registering watchers with a reference object. After the
The methods in the
Ref class that play a significant role in the STM implementation are described below.
|constructors||This class has two constructors. One takes only an initial value as a
|alter||This method takes a function whose return value will become the new value of the Ref, and optional arguments to be passed to the function after the in-transaction value of the Ref. It gets the value of the Ref by calling the
|commute||This method takes a function whose return value will become the new value of the Ref, and optional arguments to be passed to the function after the current value of the Ref. It simply passes all its parameters to the
|currentVal||This method returns the newest committed value of the Ref. First, it obtains a read lock for the Ref. Next, it accesses the value of the first TVal object in the history chain for the Ref. If that chain is empty, meaning the Ref is unbound, an
|deref||This method returns the current value of the Ref. If the current thread is in a transaction then it calls the
|getHistoryCount||This method acquires the write lock for the Ref (blocking until successful), calls the
|histCount||This method returns the number of
|isBound||This method returns a boolean that indicates whether there is at least one
|set||This method takes a proposed new value for the Ref. It simply passes the Ref and the new value to the
|touch||This method simply passes the Ref to the
|trimHistory||This method removes all the but newest
LockingTransaction class is defined in
|BARGE_WAIT_NANOS||This specifies the minimum amount of time (1/100th of a second) that a transaction must have run in order for it to attempt to cause another conflicting transaction to retry.|
|LOCK_WAIT_MSECS||This specifies the amount of time (1/10th of a second) to wait for something to happen. It is used by the
|RETRY_LIMIT||This specifies the maximum number of times a transaction will be retried before giving up. The current value is 10,000. If it gives up then a plain
|These describe the status of a transaction. Three of these have an obvious meaning.
LockingTransaction Nested Classes
These are static nested classes defined inside the
|AbortException||This is a simple subclass of
|CFn||Objects from this class hold a reference to a function (
|Info||Objects from this class describe the status of a transaction. A new one is created at the beginning of each transaction try. The class has a single method,
|Notify||This class has three fields, one to hold a Ref that changed, one to hold its previous value, and one to hold its new value. An instance is created and added to a list during a transaction commit for each Ref that was modified during the transaction. After the transaction is committed, the information in these objects is used to notify the watchers of the Refs about the changes.|
|RetryEx||This is a simple subclass of
|This is a reference to a function that was passed to the
|This is a list of arguments to be passed to the function.|
|This latch object is created in the constructor of the
|This holds the current status of the associated transaction.|
|This value identifies the order in which the associated transaction started, relative to other transaction starts, retries and commits.|
|This is a Ref that was modified in a transaction.|
|This is the previous value of the Ref.|
|This is the new value of the Ref.|
All fields in the
LockingTransaction class are package level. However, in the current version of Clojure no other classes extend
LockingTransaction and no other classes in the same package directly access its fields.
|This field is a collection of Agent actions. When actions are sent to Agents within a transaction, they are saved in this collection (by the
|This field is a map from Refs to the commute functions that have been called on them.
|This field holds the set of Refs on which the
|This field holds an object that describes the status of the transaction, the system time when that status was set, and a CountDownLatch that starts with a value of one. It has a single method,
When an attempt is made to set a Ref using
In both the
See the description of the
|This field is used to provide long values that indicate the order in which significant transaction-related events occur. These include the start of each transaction, each retry, and each commit.|
|readPoint:long||This field is used to track the order of the beginning of each transaction try across all transactions. At the beginning of each retry, including the initial try, the
This field has two uses.
It is used by the
It is used by the
|This field holds the exception object that is thrown in several places to cause the transaction to retry.
|This field keeps track of all the Refs that have been modified inside the current transaction try using
|startPoint:long||This field is assigned the value of the
|startTime:long||This field holds the system time when the first try to execute the transaction began. It is used by the
|This field holds the single
|This field is a map where the keys are Refs whose values have been modified within the current transaction and the values are their in-transaction values.|
The methods in the
LockingTransaction class that play a significant role in the STM implementation are described below.
|abort||This method stops the transaction, sets the status of the transaction to
|barge||This method takes an
If the the current transaction has run for at least BARGE_WAIT_NANOS (1/100th of a second) and it started before the given transaction then an attempt will be made to retry the given transaction. This will succeed as long as the status of the other transaction is
|bargeTimeElapsed||This method determines whether the transaction has been running for more than BARGE_WAIT_NANOS (1/100th of a second). Transactions that have been running for less time than this cannot barge other transactions. Since they have barely started running, they will retry instead.|
|blockAndBail||This method takes a
This part of the design was inspired by the paper "Software Transactional Memory Should Not Be Obstruction-Free" .
|doCommute||This method takes a Ref, a function whose return value will become its new value, and arguments to be passed to the function after the current value of the Ref. It verifies that a transaction is running in the current thread (status =
During a transaction commit, in the
|doEnsure||This method takes a Ref. It verifies that a transaction is running in the current thread (status =
|doGet||This method takes a Ref whose value is to be returned. It verifies that a transaction is running in the current thread (status =
If the history of committed values for the Ref doesn't contain a value that was committed before the current transaction try began then it increments the number of faults for the Ref and triggers a retry. Faults cause more history to be retained for the Ref in the future (managed in the
This method is called by the
|doSet||This method takes a Ref and a proposed new value. Its goal is to change the in-transaction value of the Ref. It verifies that a transaction is running in the current thread (status =
If the Ref hasn't been modified yet in the current transaction, indicated by its absence from the
If the Ref has a validator function, it will not be called until the transaction is committing the changes. Note that a new history chain node (TVal) for the Ref isn't created until the transaction is ready to commit.
|enqueue||This method takes an
|getCommitPoint||This method takes no arguments. It returns a value that is used to indicate the order in which commits occur. It is called by the
|getEx||This method takes no arguments. It returns the
|getReadPoint||This method takes no arguments. It is called by the
|getRunning||This method takes no arguments. It returns the
|isRunning||This method takes no arguments. It returns a boolean that indicates whether the transaction associated with the current thread is currently running. A transaction is considered to be running if a
|lock||This method takes a Ref, marks it as having its value set since the beginning of the transaction try, and returns its most recently committed value.
Locking a Ref is done by setting its
If the Ref has a value that was committed after the current transaction try began (called the "read point"), it triggers a retry of the current transaction. Next it checks whether the Ref has been modified in a concurrently running transaction. If so, it attempts to
|releaseIfEnsured||This method takes a Ref. If the
|run||This method is the heart of Clojure STM. It takes a
This method then enters a loop that manages transaction retries. The loop exits if the transaction completes successfully, the maximum number of retries is performed (10,000), or an exception other than retry exception (
Inside the loop, the
The status of the transaction is stored in an
Still in the loop, the
After the body is successfully executed, the status is checked to verify that no other thread has barged this transaction. That changes the status from
Now the process of committing the transaction begins. This involves four major steps. The first step is to rerun all the commute functions. The second step is to obtain a write lock for every Ref that was modified. The third step is to verify that the validator of every Ref that was modified approves of the change. The fourth step is to commit the change to each modified Ref.
The first commit step iterates though each of the Refs that were commuted. If during the current transaction
In the second commit step, a write lock is acquired for each Ref that was modified in the transaction body. If any requested write lock cannot be acquired then the current transaction will retry.
In the third commit step, the validator function of every modified Ref that has one is called. If any validator function disapproves of the change being made to its Ref then the current transaction will retry.
In the fourth commit step , changes to Refs are written to their history chains. Either a new
The finally block performs cleanup steps for the transaction. It releases all the write locks acquired for Refs back in step 2. It releases all the read locks acquired by ensures that weren't released in step 1. This is necessary for Refs on which
|This accepts a
|stop||This method changes the status of the transaction to a given value and signals that the change has completed by decrementing the latch in itsInfo object to zero. The
When a transaction completes, the
|tryWriteLock||This method takes a Ref. It tries for up to LOCK_WAIT_MSECS (1/10th of a second) to acquire a write lock for the Ref. If this fails due to the read or write lock being held by another thread (in which case the timeout will expire) or the current thread being interrupted, a retry of the current transaction is triggered. This method is called by the
STM implementations are somewhat complicated. However, developing applications that use STM is less complicated and easier to get correct than using a lock-based approach. It seems likely that over time using STM will become as common and trusted as using garbage collection is today.
Watch for updates to this article as the Clojure STM implementation improves, eventually being replaced by an implementation written in Clojure instead of Java.
-  my STM site
-  my SETT article: Clojure - Functional Programming for the JVM
-  Wikipedia Actor Model entry
-  Wikipedia STM entry
-  Wikipedia deadlock entry
-  Wikipedia livelock entry
-  Wikipedia race condition entry
-  The Transactional Memory / Garbage Collection Analogy
paper by Dan Grossman
-  Software Engineering Radio podcast on STM and Garbage Collection
-  State, You're Doing It Wrong
slides by Jonas Boner
-  Software Transactional Memory Should Not Be Obstruction-Free
paper by Robert Ennals
-  Clojure: STMs vs Locks
discussion with Cliff Click, Rich Hickey and others
-  Clojure main site - http://clojure.org
-  Persistent data structures
-  STM implementations
-  Multi-version concurrency control (MVCC)
-  Snapshot isolation