Attaining CORBA Interoperability Through Codeset Translation

Middleware News Brief (MNB) features news and technical information about Open Source middleware technologies.

September 02, 2002 - By Phil Mesnier, OCI Partner and Principal Software Engineer

Introduction

CORBA interoperability means different things at different levels.

At the level of the network, it means that processes may be able to communicate using varying transport mediums.
At the messaging layer, interoperability means that the participating ORBs use a common protocol, such as GIOP.
At the presentation layer, interoperability means that information contained in data may be interpreted and presented consistently.

This Middleware News Brief discusses interoperability at the presentation layer through codeset translation.

One of the challenges to distributed computing is dealing with applications using different numeric values representing text characters. There are many different character sets in use throughout the world, some of which use different values to refer to the same character. A collection of values that map to specific text characters is referred to as a codeset.

To successfully communicate text data between systems using different codesets, the text must be translated from the sender's codeset to that of the receiver.
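
As a small illustration of what such a translation involves, the sketch below maps a handful of characters between ISO-8859-1 and IBM-1047 (the two codesets used as an example later in this brief). The table is deliberately partial, and the names CodePair, to_ibm1047, and to_iso8859_1 are hypothetical; a real translator would carry a complete table for every byte value.

```cpp
// Partial mapping between two byte-oriented codesets. Only a few
// characters are listed; this is an illustration, not a full table.
struct CodePair {
  unsigned char iso8859_1;
  unsigned char ibm1047;
};

static const CodePair kSample[] = {
  {0x20, 0x40},  // space
  {0x30, 0xF0},  // '0'
  {0x41, 0xC1},  // 'A'
  {0x42, 0xC2},  // 'B'
  {0x61, 0x81},  // 'a'
};

// Sender uses ISO-8859-1, receiver uses IBM-1047.
// Returns false when the sample table has no entry for the byte.
bool to_ibm1047(unsigned char in, unsigned char& out) {
  for (const CodePair& p : kSample) {
    if (p.iso8859_1 == in) { out = p.ibm1047; return true; }
  }
  return false;
}

// The reverse direction: IBM-1047 sender, ISO-8859-1 receiver.
bool to_iso8859_1(unsigned char in, unsigned char& out) {
  for (const CodePair& p : kSample) {
    if (p.ibm1047 == in) { out = p.iso8859_1; return true; }
  }
  return false;
}
```

The same letter 'A' is 0x41 on one side and 0xC1 on the other, which is why untranslated text arrives as nonsense.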

Codeset Translation and CORBA

CORBA exists to ease the burden of distributed computing by handling internally many issues that application developers would otherwise have to address themselves. Character code value translation is no exception.

Since CORBA 2.2, the CORBA specification has defined the means for applications to declare a Native Code Set for character-based text (NCS-C). This is the codeset that the application uses to read and write files or to interact with a user.

An application may have a second Native Code Set specifically for wide characters (NCS-W). Applications may use translators to convert from their Native Code Sets to various conversion codesets (CCS-C, CCS-W), both for 8-bit and wide characters. The conversion codesets are those the application is able to translate to and from the native codeset.

CORBA makes use of two kinds of codesets:

One for byte-oriented text
The other for wide, or non-byte-oriented, text

The term codepoint is used to describe a numeric value for a character that is more than 8 bits long. Wide characters may be 2, 3, or 4 bytes long.

Because the length is fixed, some bytes in the codepoint may be 0. For this reason, wide character text manipulation requires separate system calls that examine more than one byte at a time when computing string length or doing string comparisons.

Byte-oriented codesets may also comprise codepoints that are longer than one byte; however, none of the codepoints will contain a byte with a value of 0. This allows strings of multibyte characters to be manipulated with traditional system calls.
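
The difference can be seen with the standard C library alone. The sketch below, whose helper names are hypothetical, counts the zero bytes embedded in a wide string and shows what a byte-oriented call such as strlen reports for the same memory. It assumes only that wchar_t is at least 2 bytes wide, which holds on common platforms.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Counts zero-valued bytes inside the characters of a wide string.
// The terminating wide NUL is not included in the count.
std::size_t embedded_zero_bytes(const wchar_t* text) {
  const std::size_t chars = std::wcslen(text);  // length in characters
  const unsigned char* raw = reinterpret_cast<const unsigned char*>(text);
  std::size_t zeros = 0;
  for (std::size_t i = 0; i < chars * sizeof(wchar_t); ++i) {
    if (raw[i] == 0) ++zeros;
  }
  return zeros;
}

// What a byte-oriented call sees when handed the same memory: it stops
// at the first zero byte, which lies inside the first wide character
// for any text whose characters fit in 8 bits.
std::size_t length_seen_by_strlen(const wchar_t* text) {
  return std::strlen(reinterpret_cast<const char*>(text));
}
```

For the wide string L"AB", wcslen reports 2 characters, while the byte-oriented view stops after at most one byte, depending on the platform's byte order.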

CORBA uses the Open Software Foundation (OSF) Character and Code Set Registry when communicating codeset identity information between applications. The registry is freely available and contains the following details:

A description of the codeset
A unique numeric identifier
The maximum number of bytes needed to hold one (possibly escaped) character
A list of the character sets contained within the codeset

A codeset may contain several character sets that may be distinct from each other.

Using the OSF codeset registry values as identifiers allows CORBA servers to encode their native and conversion codesets into IOR profiles. CORBA clients may use these codeset identifiers to select an appropriate Transmission codeset for byte-oriented characters (TCS-C), or wide characters (TCS-W). To do so, the client uses the following algorithm:

If the NCS of the server matches the NCS of the client, use that.
If the NCS of one side matches a CCS of the other side, use that.
If both sides have a common CCS, use that.
If both sides have codesets (NCS or CCS) that are similar (contain common character sets), use those.
Otherwise a CODESET_INCOMPATIBLE CORBA System exception must be raised.
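
The steps above can be sketched as a single selection function. This is an illustration of the algorithm as described here, not TAO's implementation: CodesetInfo, select_tcs, and the CodesetIncompatible exception type are hypothetical names. For step 4, the test for similar codesets is supplied by the caller (in practice it would consult the registry's character set lists), and the codeset chosen in that case is passed in as a fallback, which is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

typedef std::uint32_t CodesetId;  // an OSF registry value

struct CodesetInfo {
  CodesetId native;                   // NCS
  std::vector<CodesetId> conversion;  // CCS list
};

// Stand-in for the CODESET_INCOMPATIBLE system exception.
struct CodesetIncompatible : std::runtime_error {
  CodesetIncompatible() : std::runtime_error("CODESET_INCOMPATIBLE") {}
};

inline bool contains(const std::vector<CodesetId>& v, CodesetId id) {
  return std::find(v.begin(), v.end(), id) != v.end();
}

inline std::vector<CodesetId> all_codesets(const CodesetInfo& side) {
  std::vector<CodesetId> all(1, side.native);
  all.insert(all.end(), side.conversion.begin(), side.conversion.end());
  return all;
}

template <class Similar>
CodesetId select_tcs(const CodesetInfo& client, const CodesetInfo& server,
                     Similar similar, CodesetId fallback) {
  // 1. Both sides share a native codeset.
  if (client.native == server.native) return client.native;
  // 2. The NCS of one side is a CCS of the other. Testing the server's
  //    NCS first puts the translation work on the client; the opposite
  //    order would be equally valid.
  if (contains(client.conversion, server.native)) return server.native;
  if (contains(server.conversion, client.native)) return client.native;
  // 3. Both sides share a conversion codeset.
  for (CodesetId id : client.conversion) {
    if (contains(server.conversion, id)) return id;
  }
  // 4. Some pair of codesets is similar (caller-supplied test).
  for (CodesetId c : all_codesets(client)) {
    for (CodesetId s : all_codesets(server)) {
      if (similar(c, s)) return fallback;
    }
  }
  // 5. Nothing in common.
  throw CodesetIncompatible();
}
```

A client whose NCS is 0x00010001 talking to a server whose NCS is 0x10020417 and whose CCS list contains 0x00010001 would select 0x00010001 at step 2, leaving the translation to the server.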

Note that the second option is a little ambiguous. It could be the case that while the client and server have different NCSs, they might both have a CCS that matches the other's NCS. It then becomes a choice of placing the translation burden on the client or the server.

Codeset negotiation happens with the first client request. Once set, the TCS remains fixed for the remainder of the connection.

Codeset Translation and TAO

This section refers to implementation details of code still in development. Look for this in a future release of TAO.

Codeset translation in TAO is pluggable. This gives the greatest level of flexibility, allowing developers to add translators only as needed.

To provide a new translator to TAO, four components must be supplied.

First and foremost is the translator itself. This is a C++ class that interacts with the CDR stream to allow an application using one codeset to read and write stream data using another. Very diverse environments may require several translators.
Next, a translator factory is required to allow the ORB to make an instance (or many instances) of your translator.
Thirdly, the ORB must be made aware of the translator factories available to it. The resource factory has new initialization options that cause this to happen.
Finally, there must be a way to validate identifiers. The codeset registry performs this function.

The Translators

A translator is a C++ class that derives from either ACE_Char_Codeset_Translator or ACE_WChar_Codeset_Translator. These abstract base classes work closely with the ACE_InputCDR and ACE_OutputCDR classes in order to optimize reading and writing.

The interface for the translators is simple:

class ACE_Char_Codeset_Translator 
{
public:
  virtual ACE_CDR::Boolean read_char (ACE_InputCDR&,
                                      ACE_CDR::Char&) = 0;
  virtual ACE_CDR::Boolean read_string (ACE_InputCDR&,
                                        ACE_CDR::Char *&) = 0;
  virtual ACE_CDR::Boolean read_char_array (ACE_InputCDR&,
                                            ACE_CDR::Char*,
                                            ACE_CDR::ULong) = 0;
 
  virtual ACE_CDR::Boolean write_char (ACE_OutputCDR&,
                                       ACE_CDR::Char) = 0;
  virtual ACE_CDR::Boolean write_string (ACE_OutputCDR&,
                                         ACE_CDR::ULong,
                                         const ACE_CDR::Char*) = 0;
  virtual ACE_CDR::Boolean write_char_array (ACE_OutputCDR&,
                                             const ACE_CDR::Char*,
                                             ACE_CDR::ULong) = 0;
  static ACE_CDR::ULong ncs () {return 0;}
  static ACE_CDR::ULong tcs () {return 0;}
};

The read methods take text from the CDR stream encoded using the TCS and return to the application the text in the NCS. The write methods perform the inverse operation, taking NCS text from the application, and writing it to the stream using the TCS.

Of course, the example shown is used to translate byte-oriented codesets. The companion class, ACE_WChar_Codeset_Translator, has methods for reading and writing wchar text. See the full class definition in the header file ace/CDR_Stream.h for more information on the translator methods. An example translator implementation may be found in ace/Codeset_IBM1047.*.

The last two static methods are used to identify the Native Code Set and the Transmission Code Set the translator uses. When creating a translator, the appropriate OSF Character and Codeset registry values should be returned by these methods.
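
To make the division of labor concrete, here is a self-contained sketch of the same read/write shape. It does not use ACE: ToyStream is a stand-in for the CDR stream classes, and ToyIbm1047Translator is a hypothetical class that does not derive from ACE_Char_Codeset_Translator. It handles only the letters A through I, which are contiguous in both codesets. The ncs value is the IBM-1047 registry value shown later in this brief; the tcs value 0x00010001 is the registry value for ISO 8859-1.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a CDR stream: a byte buffer with a read cursor.
struct ToyStream {
  std::vector<unsigned char> bytes;
  std::size_t pos = 0;
};

// NCS is IBM-1047 (application side), TCS is ISO-8859-1 (wire side).
class ToyIbm1047Translator {
public:
  // Stream (TCS) to application (NCS).
  bool read_char(ToyStream& in, unsigned char& out) const {
    if (in.pos >= in.bytes.size()) return false;  // nothing left to read
    const unsigned char wire = in.bytes[in.pos];
    if (wire < 0x41 || wire > 0x49) return false; // outside the toy range
    ++in.pos;
    out = static_cast<unsigned char>(wire + 0x80); // 0x41..0x49 -> 0xC1..0xC9
    return true;
  }

  // Application (NCS) to stream (TCS).
  bool write_char(ToyStream& to, unsigned char c) const {
    if (c < 0xC1 || c > 0xC9) return false;       // outside the toy range
    to.bytes.push_back(static_cast<unsigned char>(c - 0x80));
    return true;
  }

  // Registry identifiers for the two codesets this translator bridges.
  static std::uint32_t ncs() { return 0x10020417; }  // IBM-1047
  static std::uint32_t tcs() { return 0x00010001; }  // ISO 8859-1
};
```

Writing the IBM-1047 byte 0xC1 ('A') places 0x41 on the stream, and reading it back returns 0xC1, so the application only ever sees its native codeset.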

The Translator Factory

Once you have created the translators you wish to make available to TAO, you must create a translator factory to instantiate the translator as needed.

The translator factory is an ACE_Service_Object, meaning that it may be dynamically loaded using the service configurator.

All that is required to build a translator factory for TAO is to instantiate a template.

template <class NCS_TO_TCS, int stateful /* optional */>
class TAO_Export TAO_Codeset_Translator_Factory_T

The template arguments are:

class NCS_TO_TCS. The actual translator class, derived from either ACE_Char_Codeset_Translator or ACE_WChar_Codeset_Translator.
int stateful. An optional argument indicating that the translator is stateful or stateless. Some codesets may employ special shift characters that operate on one or more following codepoints. In this case, the translator may not be reentrant, and therefore must not be shared by multiple threads.
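
The effect of the stateful argument can be pictured with a simplified factory. The sketch below is not the TAO template; ToyTranslatorFactory and ShiftTranslator are hypothetical, and the default of 0 (stateless) is an assumption made for the illustration. It shows one plausible policy: a stateless translator is created once and shared, while a stateful translator is created fresh for each caller so that no shift state leaks between threads.

```cpp
#include <memory>

// Hypothetical translator that remembers a shift state between calls.
struct ShiftTranslator {
  int shift_state = 0;
};

template <class Translator, int stateful = 0>
class ToyTranslatorFactory {
public:
  ToyTranslatorFactory() {
    // A stateless translator is safe to share, so build it up front.
    if (!stateful) shared_ = std::make_shared<Translator>();
  }

  std::shared_ptr<Translator> translator() const {
    // A stateful translator carries per-conversation state, so every
    // caller receives its own instance.
    if (stateful) return std::make_shared<Translator>();
    return shared_;
  }

private:
  std::shared_ptr<Translator> shared_;
};
```

With ToyTranslatorFactory<ShiftTranslator, 1>, two calls to translator() return distinct objects; with the stateless default they return the same one.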

To ease the burden of building new translator factories, there is a new TAO library, libTAO_CodeSet.so, which is built along with the other secondary ORB libraries. The directory $TAO_ROOT/tests/CodeSets contains a sample char translator and codeset factory for mapping between the ASCII codeset and EBCDIC. These codesets are more formally known as ISO-8859-1 and IBM-1047.

The Resource Factory

Now that you have your translator and factory implementations, the last required element is to load the codesets into TAO at runtime. This is done through the use of service configurator directives.

The first directive is used to load the translator factory.
The second directive sets the NCS for the application and informs the ORB which translators are available.

Here is an example of service configuration directives for loading a codeset translator:

dynamic Char_IBM1047_ISO8859_Factory
  Service_Object * TAO_CodeSet:_make_TAO_Char_IBM1047_ISO8859_Factory () 

static Resource_Factory 
  "-ORBNativeCharCodeset EBCDIC -ORBNativeWCharCodeset 0x10026352 -ORBCharCodesetTranslator Char_IBM1047_ISO8859_Factory"

The first directive loads a service object called Char_IBM1047_ISO8859_Factory, which is in the TAO_CodeSet library.

The second directive uses the resource factory to configure the NCS-C to the local codeset that is named "EBCDIC". A native codeset declaration may use either a name that corresponds to an entry in the codeset registry or a number. The second argument in the resource factory directive sets the native wide character code set using a numeric ID, which happens to correspond to "IBM-850 (CCSID 25426); Multilingual IBM PC Display-MLP". The final argument tells the resource factory to add the previously loaded translator to the list of available byte-oriented translators.

For character-oriented text, an application using these configuration options would be able to communicate with other applications that use either IBM-1047, in which case no translation is needed, or ISO-8859-1, using the provided translator. For wide characters, it would be able to communicate with applications that use IBM-850.

If no codeset information is configured, the ORB assumes that ISO-8859-1 is used as the byte-oriented codeset. There is no default for non-byte-oriented codesets.

If any interface includes WChar or WString data types, then at least -ORBNativeWCharCodeset must be specified.

The Codeset Registry

The value supplied with -ORBNativeCharCodeset or -ORBNativeWCharCodeset may be a number or a name. In either case, the resource factory validates the value by locating it in the Codeset Registry. ACE now has a class that provides a wrapper on platforms that natively support DCE/RPC.

On platforms that do not provide that support, ACE emulates the behavior by using its own version of the Codeset Registry database. As shipped, the ACE Codeset Registry DB is empty. However, a utility, mkcsregdb, will read a text file and generate the database.

One is not required to populate the Codeset Registry with all possible codesets. It is quite reasonable to build a registry with only the codesets you will actually support, as a subset of the entire registry available from the OSF. Simply construct a file containing a subset of the OSF's full codeset registry, add your own system-specific local names, and run mkcsregdb. Having run mkcsregdb, you will have to rebuild ACE to link in the new registry details.

Here is an example of a single entry from the OSF's codeset registry, version 1.2g:

start
description             IBM-1047 (CCSID 01047); Latin-1 Open System
loc_name        NONE
rgy_value               0x10020417
char_values             0x0011
max_bytes               1
end

Note that the local name (loc_name) is assigned the string "NONE" because local names are not defined by the OSF. For the configuration shown above to work:

Give loc_name the string "EBCDIC"
Regenerate the codeset database by running mkcsregdb
Rebuild ACE
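
To show what the edited entry carries, the sketch below embeds the entry above with loc_name changed to "EBCDIC" and parses it into a simple structure that can be searched by local name. This is only an illustration of the entry format: RegistryEntry, parse_registry, and find_by_name are hypothetical and are not part of ACE or mkcsregdb, and the parser assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct RegistryEntry {
  std::string description;
  std::string loc_name;
  std::uint32_t rgy_value = 0;
  unsigned max_bytes = 0;
};

// The sample entry with the local name filled in.
static const char kEntry[] =
  "start\n"
  "description     IBM-1047 (CCSID 01047); Latin-1 Open System\n"
  "loc_name        EBCDIC\n"
  "rgy_value       0x10020417\n"
  "char_values     0x0011\n"
  "max_bytes       1\n"
  "end\n";

// Parses "start ... end" blocks. Unrecognized keys are ignored.
std::vector<RegistryEntry> parse_registry(const std::string& text) {
  std::vector<RegistryEntry> entries;
  std::istringstream in(text);
  std::string line;
  RegistryEntry current;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string key;
    if (!(fields >> key)) continue;         // blank line
    std::string rest;
    std::getline(fields >> std::ws, rest);  // remainder of the line
    if (key == "start") current = RegistryEntry();
    else if (key == "description") current.description = rest;
    else if (key == "loc_name") current.loc_name = rest;
    else if (key == "rgy_value")
      current.rgy_value = static_cast<std::uint32_t>(std::stoul(rest, nullptr, 16));
    else if (key == "max_bytes")
      current.max_bytes = static_cast<unsigned>(std::stoul(rest));
    else if (key == "end") entries.push_back(current);
  }
  return entries;
}

// Looks up an entry by its local name; returns nullptr when absent.
const RegistryEntry* find_by_name(const std::vector<RegistryEntry>& db,
                                  const std::string& name) {
  for (std::size_t i = 0; i < db.size(); ++i) {
    if (db[i].loc_name == name) return &db[i];
  }
  return nullptr;
}
```

Looking up "EBCDIC" in the parsed database yields the entry whose rgy_value is 0x10020417, which is the link between the name given to -ORBNativeCharCodeset and the numeric identifier exchanged between ORBs.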

Many operating system vendors define their own local names, which may be related to Locales. In many cases, it is possible to obtain a localized codeset registry from a system vendor. Otherwise it is not difficult to produce your own.

Summary

CORBA is designed to achieve interoperability within the framework to minimize the requirements or restrictions on application developers. In order to provide this interoperability at the presentation layer, TAO now supports negotiated codeset translation.

Developers wishing to deploy a distributed application that interchanges text data between platforms using different character codesets may produce or otherwise acquire translators that simply plug into the framework and allow for data exchange without modifying the participating applications.

References

[1] CORBA/IIOP specification. The current version is 3.0; version 2.6.1 was referenced in the development of TAO's codeset translation capability.
http://www.omg.org/technology/documents/corba_spec_catalog.htm
[2] OSF Codeset Registry
http://www.opengroup.org/tech/rfc/rfc40.2.html