Protocol Parsing in Scala, Part II
By Charles Calkins, OCI Principal Software Engineer
April 2016
Introduction
While binary data protocols have been around for decades, so too have text-based ones. The popularity of functional languages like Scala lead to applications that need to parse these protocols.
In Part I of this article, scodec, parser library for Scala to parse binary data was discussed. This article picks up and describes the Scala Standard Parser Combinator Library, which is useful for parsing text. These libraries allow a protocol grammar to be translated almost exactly into executable code, making the implementation easier to understand and more easily verifiable "by inspection," reducing the chance of programmer errors.
Source code for this article is available in the associated archive.
A Text Protocol
Text-based protocols are useful when interacting with devices, as commands and configuration information may be easily understood, or even directly applied by an operator without requiring a binary encoding step.
For example, in the late 1990s, the author participated in a project that required control of a Thermotron temperature chamber over a serial port, where issuing the IDEN?
command would return a device identification string, and SETP1?
and SETP1,temp
would read and write the temperature chamber set point, respectively. Today, the Hypertext Transfer Protocol (HTTP), Extensible Markup Language (XML), and the JavaScript Object Notation (JSON) for control and communication, are ubiquitous.
This article considers a text-based device configuration description that is reminiscent of one used in industrial control that has been used by the author.
Consider a building automation application where smart devices are networked for monitoring and control. Open/close sensors on windows and doors ensure a safe environment, while temperature and humidity sensors, coupled with actuators to manipulate a heating and cooling system, assists maintaining the comfort of occupants. Activity detected by motion sensors automatically turn lights on and off, so rooms are only lit when they are in use, saving energy. Display panels provide a user interface to show what devices are installed, and provide a means to interact with them, such as to turn lights on and off, or change the thermostat's temperature, from centralized locations.
These devices may be considered as being composed of zero or more sensors and/or actuators.
A sensor is something that measures, either continually, such as a temperature sensor that generates readings once a minute, or on an event basis, such as a light switch, where a change of state is a one-shot, instead of repetitive, occurrence.
A device may consist of both sensors and actuators, such as a light dimmer which may be set physically by the user (reporting its new setting as an event) or programmatically by the system.
Part I of this article described a binary protocol for interaction with hardware devices that provides addressing based on a specific hardware ID or network address in domain / subnet / node form. Zeroes used for the domain or subnet allow for broadcast messages to be sent. Messages are identified by a command, and are either in a request/response form, where an outgoing message expects a reply, or are unsolicited, where a message arrives independently of a request.
For this application, we have the following messages:
- Identification. When this message is sent, all devices reply with their hardware IDs. This allows the sender to determine what devices are available on the network.
- Set address. This message, sent to a device given its hardware ID, sets the network address. With this message, address conflicts can be resolved, and devices may be grouped into domains and subnets, based on where they are installed. For instance, the domain could be the floor, and the subnet the room where the device is located, so messages (such as "set value to 0", a.k.a. "turn off") broadcast across a floor or to a specific room have a geographic effect.
- Get configuration. When a device receives this message, it returns a text block describing its configuration. Parsing that block is the subject of this article.
- New reading. This unsolicited message is sent from sensors on a periodic basis that measure values continually, or as events occur, such as when a switch was flipped, or a photoelectric beam is broken when motion is detected.
- Set value. This message is sent to an actuator to activate it, such as turning a switch on or off, or setting the percentage of dimming a light dimmer produces.
In order to know which sensors and actuators are present, and their capabilities, their configuration needs to be described:
- Devices, sensors and actuators need names to help identify their purpose, such as "lobby overhead light" or "office 302 window security sensor."
- Sensors and actuators need to indicate their type, such as if they are a switch or temperature sensor, so data may be properly collated and correct graphical elements may be presented on a user interface.
- The range of values acceptable is also needed, especially for actuators to know what values are acceptable to set.
- The rate at which sensors report their readings, as well as whether they are enabled for use, is important to know.
- Lastly, sensors and actuators may generate diagnostic information, so a user can tell if readings are unreliable or if maintenance is necessary.
Device configuration can be represented in three sections of a configuration file.
- The first provides device information, such as a device name and its address.
- The second is a list of sensors
- The third a list of actuators
Each has its own properties. This may be expressed as a series of textual property names, prefixed by a % sign, grouped by section. An EBNF-style grammar is as follows, where int, double and bool (encoded as 0 or 1) are the usual types, qstring is a string inside of double quotes, hex is a hex digit, and nl is the newline character. A number in curly braces indicates an exact count is required, productions are in italic, and literal text is in bold.
name | := | %NAME nl qstring nl |
address | := | %ADDRESS nl int . int . int nl |
hardwareID | := | %HWID nl hex{8} nl |
devInfo | := | %DEVINFO < name address hardwareID > |
daqRate | := | %DR nl (E | P , int) nl |
range | := | %RANGE nl double , double nl |
type | := | %TYPE nl int nl |
enabled | := | %EN nl bool nl |
diagnostic | := | %DIAGSTART nl <any characters> %DIAGEND nl |
sensor | := | %SENSOR int < name [daqRate] [range] [type] [enabled] [diagnostic] > |
actuator | := | %ACTUATOR int < name [range] [type] [enabled] [diagnostic] > |
device | := | [ devInfo sensor{n} actuator{m} ] |
The Scala Standard Parser Combinator Library can be used to convert this grammar almost directly into code.
As of Scala 2.11, this library is no longer bundled with Scala itself, but can be used by adding it to the libraryDependencies
entry in build.sbt
:
libraryDependencies ++= Seq(
"org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.4"
)
One of the parser types included in this library are parsers based on regular expressions, expressed by the RegexParsers
trait.
To create a parser for the device configuration grammar, we begin by creating an object, which implements that trait. An option that can be set is whether whitespace is ignored or available to be parsed. As the grammar above specifies newlines in certain places, whitespace may not be skipped.
// src/main/scala/ConfigParser.scala
object ConfigParser extends RegexParsers {
override val skipWhitespace = false
We next define methods that correspond to the elements of the grammar. Each method must return a Parser[T]
, where T is the primitive type corresponding to the grammar element. To parse a newline character, we define a nl
function as:
def nl: Parser[String] = "\n"
As "\n"
is of String
type, the return type of nl
must be specified as Parser[String]
, instead of being inferred.
In comparison, a method to parse an integer:
def int = "[-+]?[0-9]+".r ^^ { _.toInt }
does not need an explicit return type, as it may be inferred from the expression.
A Scala regular expression follows standard rules for expressing regular expressions, where a single character between [
and ]
is matched, ?
matches zero or one occurrence of an expression, and +
matches one or more.
A string followed by .r
is converted into a Regex
object. ^^
is a parser combinator for function application. Given p ^^ f
, if p
succeeds, then the result is f
applied to the result of p
.
Here, the result of the parse is converted to an integer, giving the resulting type of int
as Parser[Int]
. Operator-style parser combinators, such as ^^
are defined in the Parsers
trait, where the RegexParsers
trait mentioned above extends Parsers
.
Parsing a double requires a more complex regular expression, but is otherwise the same.
def double = """[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?""".r ^^ { _.toDouble }
A string contained within double quotes can have its quotes stripped by using the ~>
and combinators, where the first keeps the right side of the parser, and the second keeps the left side.
def qstring = "\"" ~> ("\"\"" | "[^\"]".r).* <~ "\"" ^^ { _.mkString }
As with qstring
, literal text rather than a regular expression can be used in a parser.
In this method:
def bool = ("0" | "1") ^^ { _.toInt != 0 }
the literal strings 0
or 1
, if successfully parsed, are converted to an integer, that integer compared with zero, and the result returned as a boolean true or false.
With the parsers defined for the primitive elements, we combine them into more complex parsers.
The following will parse the name
element of the grammar:
def name = "%NAME" ~ nl ~ qstring ~ nl ^^ {
case _ ~ _ ~ name ~ _ => Name(name)
}
The ~
combinator combines parsers in sequence, where here, %NAME
is followed by a newline character, then a quoted string, and then another newline character.
As this results in a sequence of parsed results, a pattern match can be performed, and variables bound to the results of each parser combinator.
Only the quoted string is necessary to extract into the name
variable; the other results in the sequence may be ignored.
Rather than returning a primitive type, such as a String
or an Int
from the parser, an instance of our own Name
class is returned instead, giving the return type of name
as Parser[Name]
.
As in Part I of this article, case classes are useful to store data elements. It is defined as:
case class Name(name: String) {
override def toString() = s"""%NAME\n"$name"\n"""
}
Unlike the scodec parser library as described in Part I, which provided both decoding and encoding, "encoding" in our case is performed by expressing the case class as a string. Scala string interpolation is used as a compact way to include the value of name
in the output.
A dotted network address is parsed in a similar manner, extracting the domain, node and subnet from the address and storing them in a corresponding case class.
def address = "%ADDRESS" ~ nl ~ int ~ "." ~ int ~ "." ~ int ~ nl ^^ {
case _ ~ _ ~ domain ~ _ ~ subnet ~ _ ~ node ~ _ => Address(domain, subnet, node)
}
case class Address(domain: Int, subnet: Int, node: Int) {
override def toString() = s"%ADDRESS\n$domain.$subnet.$node\n"
}
A hardware ID consists of eight hexidecimal digits. Rather than having a separate method to represent a hex digit, the regular expression can be included directly into the parse.
In a Scala regular expression, {n}
specifies an exact number of matches of the preceding expression.
def hardwareID = "%HWID" ~ nl ~ "[0-9A-F]{8}".r ~ nl ^^ {
case _ ~ _ ~ id ~ _ => HardwareID(id)
}
case class HardwareID(id: String) {
override def toString() = s"%HWID\n$id\n"
}
With name
, address
, and hardwareID
defined, the devInfo
method can be specified. It continues to follow the same pattern.
def devInfo = "%DEVINFO" ~ "<" ~ name ~ address ~ hardwareID ~ ">" ^^ {
case _ ~ _ ~ name ~ address ~ hardware_id ~ _ =>
DevInfo(name, address, hardware_id)
}
case class DevInfo(name: Name, address: Address, hardwareID: HardwareID) {
override def toString() = s"%DEVINFO<$name$address$hardwareID>"
}
The data acquisition rate is of two forms:
- An event that may fire at any time
- A periodic sequence with a fixed interval
This may be expressed by first defining the alternatives:
def daqRate_Event = "E" ^^ {
case _ => DaqRate_Event()
}
def daqRate_Periodic = "P," ~ int ^^ {
case _ ~ interval => DaqRate_Periodic(interval)
}
trait DaqRateType {}
case class DaqRate_Event() extends DaqRateType {
override def toString() = "E"
}
case class DaqRate_Periodic(interval: Int) extends DaqRateType {
override def toString() = s"P,$interval"
}
and using them in the definition of daqRate
:
def daqRate = "%DR" ~ nl ~ (daqRate_Event | daqRate_Periodic) ~ nl ^^ {
case _ ~ _~ daqRate ~ _ => DaqRate(daqRate)
}
case class DaqRate(daqRate: DaqRateType) {
override def toString() = s"%DR\n$daqRate\n"
}
Defining the DaqRateType
trait, and using two case classes for the alternatives that extend the trait, allows for an easy definition of daqRate
, as the result of parsing either daqRate_Event
or daqRate_Periodic
is still of the same base type.
The implementation of range
, saType
(since type
is a Scala keyword), and enabled
is done in the same manner as those above.
A parser for diagnostic
is more complex. Suppose that some devices do not implement the diagnostic section correctly, and do not terminate it with %DIAGEND
.
Also, since the section may contain any text including nested strings, section names, or other text that may be difficult to express in a regular expression, a hand-written parser that absorbs all characters until either %DIAGEND
is found, or the end of the input is reached may be used.
To do this, the apply
method is implemented of Parser[Diagnostic]
, where the argument to apply
is the input to the parser, as follows:
def diagnostic = new Parser[Diagnostic] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
val secStart = "%DIAGSTART\n"
val secEnd = "%DIAGEND\n"
val text = source.subSequence(start, source.length).toString
val iStart = text.indexOf(secStart) // -1 if no match
if (iStart>=0) {
val contents = text.drop(iStart + secStart.length)
val iEnd = contents.indexOf(secEnd)
if (iEnd == -1)
Success(Diagnostic(contents),
in.drop(start + secStart.length + contents.length - offset))
else {
val strippedContents = contents.take(iEnd)
Success(Diagnostic(strippedContents),
in.drop(start + secStart.length + strippedContents.length +
secEnd.length - offset))
}
}
else
Failure(s"$secStart not found", in.drop(start - offset))
}
}
The parameter in
has two components:
source
, which is the text being parsedoffset
, which is the current index into the text being parsed
Here, text
is set to the text being parsed from the current position to the end of the string, and %DIAGSTART\n
is checked to see if it is within it.
If %DIAGSTART\n
is found, then %DIAGEND\n
is searched for.
If %DIAGEND\n
is missing, then all text after %DIAGSTART\n
through the end of the input is returned as the result of the parse; else only the text between %DIAGSTART\n
and %DIAGEND\n
is accepted.
In either case, characters are dropped from the input corresponding to the text that was parsed corresponding to the diagnostic contents, plus the start, and if it exists, end tags. Otherwise, if %DIAGSTART\n
is not found, then the parser returns Failure(error message)
.
With the diagnostic
parser complete, we can now parse the sensor configuration:
def sensor = "%SENSOR" ~ int ~ "<" ~ name ~ daqRate.? ~ range.? ~
saType.? ~ enabled.? ~ diagnostic.? ~ ">" ^^ {
case _ ~ index ~ _ ~ name ~ daqRate ~ range ~
saType ~ enabled ~ diagnostic ~ _ =>
Sensor(index, name, daqRate, range, saType, enabled, diagnostic)
}
While name
is required, the daqRate
, range
, saType
, enabled
and diagnostic
sections are optional (zero or one exists), as expressed by the ?
combinator. As the element may or may not exist, the result of the parse is contained within a Scala Option[T]
.
The corresponding case class is:
case class Sensor(index: Int, name: Name, daqRate: Option[DaqRate],
range: Option[Range], saType: Option[SAType], enabled: Option[Enabled],
diagnostic: Option[Diagnostic]) {
override def toString() = s"%SENSOR$index<$name" +
daqRate.map(_.toString).getOrElse("") +
range.map(_.toString).getOrElse("") +
saType.map(_.toString).getOrElse("") +
enabled.map(_.toString).getOrElse("") +
diagnostic.map(_.toString).getOrElse("") + ">"
}
The toString
method is implemented a bit differently here, as string interpolation does not produce the result that we need. For example, when stringifying an Option[T]
, we require either the stringification of T
, or nothing at all, depending on whether the value of the option is either Some(<value of type T>)
or None
. The default stringification of Option[T]
produces Some
and its contents, or None
as literal text.
The definition of actuator
is similar. With all of the various grammar elements complete, device
can be defined. Here, as zero or more sensors and/or actuators are present, the *
combinator is used, and the result of the parse of each is a List[T]
of elements.
def device = "[" ~> devInfo ~ sensor.* ~ actuator.* <~ "]" ^^ {
case devInfo ~ sensors ~ actuators => Device(devInfo, sensors, actuators)
}
case class Device(devInfo: DevInfo, sensors: List[Sensor], actuators: List[Actuator]) {
override def toString() = "[" + devInfo.toString +
sensors.mkString + actuators.mkString + "]"
}
As with Option[T]
, List[T]
does not stringify as needed when using string interpolation; so instead mkString
is called to perform the operation.
To parse text with ConfigParser
, we invoke the inherited parse
method, passing the grammar method to use, and a string to parse. As was seen in the implementation of diagnostic
, a parser returns an object indicating success, failure or an error in the parsing. We simplify this by defining a method that, given a string, returns an Option[Device]
, where a successful match produces Some(a device object)
, or None
if the parse fails.
def parseConfig(s: String): Option[Device] = {
ConfigParser.parse(ConfigParser.device, s) match {
case ConfigParser.Success(matched,_) => Some(matched)
case ConfigParser.Failure(msg,_) =>
{ println("Parse failure of " + s + " due to: " + msg); None }
case ConfigParser.Error(msg,_) =>
{ println("Parse error of " + s + " due to: " + msg); None }
}
}
Now that the parser is complete, tests ensure that it works correctly.
Testing
Specs2 is a commonly used testing framework for Scala that allows tests to be expressed either in narrative form, or in a more traditional programmatic way. The framework includes the org.specs2.matcher.ParserMatchers
trait to aid in creating tests for parser combinators.
To use Specs2 with support for parser combinator testing, both specs2-core
and specs2-matcher-extra
must be added to the libraryDependencies
entry in build.sbt
:
libraryDependencies ++= Seq(
"org.specs2" %% "specs2-core" % "3.6.6" % "test",
"org.specs2" %% "specs2-matcher-extra" % "3.6.6" % "test"
)
Tests are contained within a test specification class that inherits from Spec2's Specification
class, and specifications that need the additional support for testing parser combinators also extend ParserMatchers
. The parsers
variable also needs to be set to the class that contains the parser methods.
// src/test/scala/ConfigParserSpec.scala
import org.specs2.mutable._
import scala.util.parsing.combinator.RegexParsers
import org.specs2.matcher.ParserMatchers
class ConfigParserSpec extends Specification with ParserMatchers {
val parsers = ConfigParser
import ConfigParser._
Each element in the parser may be tested independently, such as this test for qstring
:
"qstring" should {
"recognize \"fred\"" in {
qstring must succeedOn("\"fred\"").withResult("fred")
}
"not recognize fred\"" in {
qstring must not succeedOn("fred\"")
}
"not recognize \"fred" in {
qstring must failOn("\"fred")
}
}
When expressing tests in narrative form, related tests are grouped together by the use of the should
keyword, and the in
keyword delimits an individual test.
The support for parser combinator testing is provided by the succeedOn
and withResult
methods, where succeedOn
is successful if the parser successfully parses the supplied text, and withResult
is used to confirm that the result of the parse matches a specific result. Rather than using not succeedOn
, failOn
yields the same behavior.
In these tests, a properly quoted string should result in the string with the quotes dropped, but strings that do not have quotes on both ends should fail to parse.
Running this test with sbt test
produces:
[info] ConfigParserSpec
[info]
[info] qstring should
[info] + recognize "fred"
[info] + not recognize fred"
[info] + not recognize "fred
If a test fails, [error]
instead of [info]
will be shown, an x instead of a + will be used, and an explanation of the test failure will be displayed.
Parser elements that are parsed successfully will return instances of case classes that should "round trip." That is, raw text is parsed into an instance of a case class, and, when that case class instance is stringified, it will produce the original raw text again. That may be expressed as:
"name" should {
"recognize '%NAME\n\"fred\"\n'" in {
name must succeedOn("%NAME\n\"fred\"\n").withResult(Name("fred"))
}
"encode Name(\"fred\") correctly" in {
Name("fred").toString === "%NAME\n\"fred\"\n"
}
}
Tests in narrative form may be included in test methods as well, which generalizes and simplifies the above:
def roundtrip[A](e: A, parser: ConfigParser.Parser[A]) = {
"roundtrip " + e.toString.replace("\n", "\\n") in {
parser must succeedOn(e.toString).withResult(e)
}
}
and which allows tests to be expressed in a very compact form.
The same test for name
now becomes:
roundtrip(Name("fred"), ConfigParser.name)
The call to replace("\n", "\\n")
converts literal newline characters to a textual representation that does not cause a line break, which makes the test output more readable, and the result of the test is shown as:
[info] + roundtrip %NAME\n"fred"\n
Tests for the other elements of the grammar are similar, and may be as complex as desired.
A test for device
, for example, may be:
roundtrip(Device(
DevInfo(Name("Weather Station"), Address(1,2,3), HardwareID("0123ABCD")),
List(
Sensor(1, Name("Temperature"), Some(DaqRate(DaqRate_Periodic(60))),
Some(Range(-20, 120)), Some(SAType(1)), Some(Enabled(true)),
Some(Diagnostic("I'm broken"))),
Sensor(2, Name("Humidity"), Some(DaqRate(DaqRate_Periodic(300))),
Some(Range(0, 100)), Some(SAType(2)), Some(Enabled(true)),
None),
Sensor(3, Name("IsRaining"), Some(DaqRate(DaqRate_Event())),
Some(Range(0, 1)), Some(SAType(3)), None, None)
),
List(
Actuator(1, Name("Alarm"),
Some(Range(0, 1)), Some(SAType(1)), Some(Enabled(true)), None))),
ConfigParser.device)
which displays the following as test output:
[info] + roundtrip [%DEVINFO<%NAME\n"Weather Station"\n%ADDRESS\n1.2.3\n
%HWID\n0123ABCD\n>%SENSOR1<%NAME\n"Temperature"\n%DR\nP,60\n
%RANGE\n-20.0,120.0\n%TYPE\n1\n%EN\n1\n%DIAGSTART\nI'm broken%DIAGEND\n>
%SENSOR2<%NAME\n"Humidity"\n%DR\nP,300\n%RANGE\n0.0,100.0\n%TYPE\n2\n%EN\n1\n>
%SENSOR3<%NAME\n"IsRaining"\n%DR\nE\n%RANGE\n0.0,1.0\n%TYPE\n3\n>
%ACTUATOR1<%NAME\n"Alarm"\n%RANGE\n0.0,1.0\n%TYPE\n1\n%EN\n1\n>]
Summary
Both the scodec and parser combinator libraries allow a grammar to be implemented in code in a way that is very similar to its original definition, which, as mentioned before, helps to ensure that, just by inspection, the parser is correct. Scala's expressiveness, in allowing methods to be defined with symbolic names, and its easy creation of case class instances, helps make this possible.
For additional information on Scala parser combinators, the author found these tutorials particularly helpful: Kerflyn's Playing with Scala Parser Combinator and poundblog's A Scala Parser Combinator Grammar for CSV.
References
- [1] Hypertext Transfer Protocol (HTTP)
https://tools.ietf.org/html/rfc2616 - [2] Extensible Markup Language (XML)
https://www.w3.org/TR/xml/ - [3] JavaScript Object Notation (JSON)
https://tools.ietf.org/rfc/rfc7159.txt - [4] Scala Standard Parser Combinator Library
https://github.com/scala/scala-parser-combinators - [5] Scala Regular Expressions
http://www.tutorialspoint.com/scala/scala_regular_expressions.htm - [6] Parsers.scala
https://github.com/scala/scala-parser-combinators/blob/1.0.x/src/main/scala/scala/util/parsing/combinator/Parsers.scala - [7] RegexParsers.scala
https://github.com/scala/scala-parser-combinators/blob/1.0.x/src/main/scala/scala/util/parsing/combinator/RegexParsers.scala - [8] scodec
http://scodec.org/ - [9] Specs2
https://etorreborre.github.io/specs2/ - [10] Playing with Scala Parser Combinator
https://kerflyn.wordpress.com/2012/08/25/playing-with-scala-parser-combinator/ - [11] A Scala Parser Combinator Grammar for CSV
https://poundblog.wordpress.com/2013/06/06/a-scala-parser-combinator-grammar-for-csv/
Software Engineering Tech Trends (SETT) is a regular publication featuring emerging trends in software engineering.
The Software Engineering Tech Trends (SETT) is a monthly newsletter featuring emerging trends in software engineering.