FAQ: Parsing

Parsing with XML4C

This section answers questions about programming with XML4C. It explains how to install and uninstall XML4C, how to set up your environment, and how to build an application that uses XML4C. It also describes the language encodings that XML4C supports and addresses questions about compliance with various standards.

  1. What compilers are being used on the supported platforms?
  2. I cannot run my sample applications. What is wrong?
  3. I just built my application using the XML4C parser. Why does it crash?
  4. How do I set up my build environment to build XML4C applications?
  5. How do I use the source code for building my application?
  6. What are the differences between the build settings on various UNIX platforms?
  7. Is XML4C thread-safe?
  8. How do I find out what version of XML4C I am using?
  9. How do I uninstall XML4C?
  10. How do I add an additional transcoding file to the existing set?
  11. How are entity reference nodes handled in DOM?
  12. What kinds of URLs are currently supported in XML4C?
  13. Can I use XML4C to parse HTML?
  14. Can I use the utility classes used in the sample programs?
  15. I keep getting an error: "invalid UTF-8 character". What's wrong?
  16. What encodings are supported by XML4C?

The Answers ...

What compilers are being used on the supported platforms?

XML4C has been built on the following platforms with these compilers:

Operating System          Compiler

Windows NT/98             MSVC 6.0
AIX 4.1.4 and higher      xlC 3.1
Solaris 2.6               CC version 4.2
HP-UX B10.2               aCC and CC
HP-UX B11                 aCC and CC
Linux                     gcc

I cannot run my sample applications. What is wrong?

There are two major installation issues that must be dealt with in order to use XML4C from your applications. First, the DLL or shared library must be locatable via the system's environment. Second, the locale files that XML4C uses for transcoding must be locatable.

On UNIX platforms you need to ensure that your library search environment variable includes the directory that contains the shared library (this is LIBPATH on AIX, LD_LIBRARY_PATH on Solaris and Linux, and SHLIB_PATH on HP-UX). Thus, if you extracted your binaries under $HOME/fastxmlparser, you need to add its lib directory to your library path:

export LIBPATH=$LIBPATH:$HOME/fastxmlparser/lib (AIX)

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/fastxmlparser/lib (Solaris, Linux)

export SHLIB_PATH=$SHLIB_PATH:$HOME/fastxmlparser/lib (HP-UX)

On Win32, ensure that the XML4C2 DLLs are in a directory listed in the PATH environment variable.

For the transcoding files, the most natural mechanism, which is used in the binary release, is to place them relative to the shared library or DLL. The transcoding converter files should be in the intlFiles/locales directory relative to the shared library or DLL. This will allow them to be located automatically.

However, if you redistribute XML4C within some other product, and cannot maintain this relationship, or if your build scenario does not allow you to maintain this relationship during debugging for instance, you can use the XML4C2INTLDIR environment variable to point to these locale files. This variable may be set system wide, within a particular command window, or just within the client application or higher level libraries, as is deemed necessary. It must be set before the XML system is initialized (see below.)

 

I just built my application using the XML4C parser. Why does it crash?

In order to work with the XML4C parser, you must first initialize the XML subsystem. The most common mistake is to forget this initialization. Before you make any calls to XML4C APIs, you must call XMLPlatformUtils::Initialize():

try {
    XMLPlatformUtils::Initialize();
}
catch (const XMLException& toCatch){
    // Do your failure processing here
}

This initializes the XML4C system and sets its internal variables. Note that you must include the <util/PlatformUtils.hpp> file for this to work.

The second common problem is the absence of the transcoding converter files. This problem has a simple fix once you understand how XML4C searches for the transcoding converter files.

XML4C first looks for the environment variable XML4C2INTLDIR. If it finds this variable in your environment settings, it assumes that the transcoding converter files are kept in that directory. For example, if you had set the environment variable as follows:

set XML4C2INTLDIR=d:\myxml4c2\intlFiles\locales

the transcoding converter files (convrtrs.txt and all files with the extension .cnv) will be searched for under d:\myxml4c2\intlFiles\locales.

If you have not set your environment variable, then the search for the transcoding converters is done relative to the location of the shared library IXXML4C2_2.dll (or libIXXML4C2_2.a on AIX and libIXXML4C2_2.so on Solaris and Linux, libIXXML4C2_2.sl on HP-UX). Thus if your shared library is in d:\fastxmlparser\bin, then your transcoding converter files should be in d:\fastxmlparser\bin\intlFiles\locales.

Before you run your application, make sure that you have covered the two possibilities mentioned above.
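
The search order described above can be sketched in a few lines of C++. This is only an illustration of the documented behaviour, not code taken from XML4C: the function names resolveConverterDir and resolveConverterDirFromEnv and the separator parameter are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper (not an XML4C API): mirrors the documented search order.
// envValue   - value of XML4C2INTLDIR, or nullptr/empty when it is not set
// libraryDir - directory that holds the XML4C shared library or DLL
std::string resolveConverterDir(const char* envValue,
                                const std::string& libraryDir,
                                char separator = '/')
{
    // 1. An explicitly set XML4C2INTLDIR takes precedence.
    if (envValue != nullptr && envValue[0] != '\0')
        return envValue;

    // 2. Otherwise fall back to intlFiles/locales relative to the library.
    assert(!libraryDir.empty());
    return libraryDir + separator + "intlFiles" + separator + "locales";
}

// Convenience wrapper that reads the real process environment.
std::string resolveConverterDirFromEnv(const std::string& libraryDir,
                                       char separator = '/')
{
    return resolveConverterDir(std::getenv("XML4C2INTLDIR"), libraryDir, separator);
}
```

Printing the result of resolveConverterDirFromEnv for your own library directory before calling XMLPlatformUtils::Initialize() is a quick way to check which of the two locations applies on your machine.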

 

How do I set up my build environment to build XML4C applications?

For Windows NT/95/98

In order to build applications using XML4C, you will need to set the following items in your build environment. You will need to use the mechanisms that your development environment provides for specifying these values. The mechanisms for Microsoft Visual C++ are described below.

Note: The Microsoft Visual C++ project files for building the sample programs that are included with the distribution of XML4C already include the necessary settings, so you don't need to change anything unless you are doing something special.

Add the directory <full_path_to_xml4c2_2_0>\include to your include file search path (use the "Project/Settings: C/C++->Preprocessor->Additional Include Directories" setting in MSVC, or set the INCLUDE environment variable on the command line).

Add the directory <full_path_to_xml4c2_2_0>\lib to your library path (use the "Link->Input->Additional Library Path" setting in MSVC, or set the LIB environment variable on the command line).

The C preprocessor variables must be defined in the environment to make the headers do the right kind of build, controlling the development environment and the character mode. These are set from within MSVC using the "Project/Settings: C/C++->General->Preprocessor Definitions" . Under Win98/95, the best setting would be _MBCS or _SBCS, but NT's native format is Unicode so _UNICODE would provide the optimum performance on that platform.

You must use the DLL version of the C/C++ runtime library. In MSVC pick "Multithreaded DLL" for your release builds and "Debug Multithreaded DLL" for your debug builds. This is done in the "Project/Settings: C/C++->Code Generation->Use Runtime Library" setting.

Add the XML4C import library to your project. It is named IXXML4C2.lib for this release, and should be set in the "Project/Settings: Link->General->Object/library Modules" setting in MSVC.

For UNIX (AIX, Solaris, HP-UX, Linux)

XML4C has been tested on AIX 4.1.4, Solaris 2.6, Linux 5.1, HP-UX 10.2 and HP-UX 11. In order to build applications using XML4C, you may need to write a Makefile for convenience.

The following steps describe in detail how to build and run your application:

Have the compiler set in your PATH variable. For example,
export PATH=$PATH:/usr/bin/xlC_r/bin (where xlC_r is the compiler used)

Put your application in any directory you like or, for convenience, under the <full_path_to_xml4c2_2_0>/samples/<applnname> directory.

Copy a makefile from a sample, for example from the
<full_path_to_xml4c2_2_0>/samples/Projects/AIX directory.

Set the environment variable ROOTDIR to <full_path_to_xml4c2_2_0>, and add its bin directory to your PATH.
For example, to extend PATH:
export PATH=$PATH:<full_path_to_xml4c2_2_0>/bin (if korn shell)
setenv PATH $PATH:<full_path_to_xml4c2_2_0>/bin (if c shell)

The makefiles shipped with XML4C use the ROOTDIR environment variable. These makefiles can be found in the samples/Projects/AIX directory.

Set the library path variable in the environment as
<full_path_to_xml4c2_2_0>/lib:/usr/lib.

Libraries under <full_path_to_intlFiles directory>/lib are shared libraries, and libraries under the <full_path_to_xml4c2_2_0>/lib directory are static libraries.

Check the directory for internationalization converter files (.cnv files). The most obvious place where you'll find the converter files is <full_path_to_xml4c2_2_0>/lib/intlFiles/locales. If the converter files are NOT in this directory, then you need to set an environment variable XML4C2INTLDIR giving the proper directory name where the files reside.

For example,
export XML4C2INTLDIR=<full_path_to_intlFiles> (if korn shell)
setenv XML4C2INTLDIR <full_path_to_intlFiles> (if c shell)

If you have the international converter files in the <full_path_to_xml4c2_2_0>/lib/intlFiles/locales directory, then make sure you unset the XML4C2INTLDIR environment variable. This can be done by executing unset XML4C2INTLDIR

The makefile needs to be updated with the application-related information. This can be done by updating the following things in the makefile:

Edit the OUTDIR variable to point to the <full_path_to_xml4c2_2_0>/bin/obj/<applnname> directory.

Edit the OBJS variable to list the application's object files. For example, for a sample consisting of foo1.cpp and foo2.cpp, edit
OBJS= ${OUTDIR}/foo1.o \
${OUTDIR}/foo2.o

Edit the SRC variable to point to the application source.

Under the 'makedir:' directive change the sample name to the appropriate application name. This process creates the necessary directories prior to building the application.

Edit the executable name to the desired name. For example, for an application whose executable is named foo, edit:
${EXEC}/foo : ${OBJS}
${LIB}/*.a (or .so on Solaris and .sl on HP-UX)

Make sure the library files are under the <full_path_to_xml4c2_2_0>/lib directory and that it contains libIXXML4C2_2.a (or libIXXML4C2_2.so). Make sure the include files are under the <full_path_to_xml4c2_2_0>/include directory. Make sure the internationalization transcoding converter data files are under the <full_path_to_xml4c2_2_0>/lib/intlFiles/locales directory or wherever you have pointed XML4C2INTLDIR.

Replace the .o and .cpp file names with the appropriate application file names. For example, for an application consisting of foo.cpp, update as follows:

$(OUTDIR)/foo.o: ${SRC}/foo.cpp
	xlC_r ${CMP} $(INCLUDES) -o$(OUTDIR)/foo.o ${SRC}/foo.cpp

Update the 'clean:' directive to clean the right application executable. For example, for the foo application executable, modify it as follows:

clean:
	rm -f ${OUTDIR}/*.o ${EXEC}/foo

After the Makefile is all set, run make as follows:

For an application makefile named foo.mak
make -f foo.mak COMPILESWITCH="-w -O" (for optimized builds)
make -f foo.mak COMPILESWITCH=-g (for debug builds)

To clean the build run as follows:
make clean -f foo.mak COMPILESWITCH="-w -O" (for optimized builds)
make clean -f foo.mak COMPILESWITCH=-g (for debug builds)

 

 

How do I use the source code for building my application?

For Windows platforms:

You will need to include a project file in your workspace to program your application. Otherwise, you can use the provided workspace and add your application to it as a separate project.

In the first case the project file is: \xml4c2\Projects\Win32\VC6\IXXML4C2\IXXML4C2\IXXML4C2.dsp

In the second case the workspace is: \xml4c2\Projects\Win32\VC6\IXXML4C2\IXXML4C2.dsw

You must make sure that you are linking your application with the IXXML4C2.lib library, and that the associated DLL is somewhere in your path. Note that you must either have the environment variable XML4C2INTLDIR set, or keep the international converter files relative to IXXML4C2_2.dll (as they came with the original binary drop) for the program to find them.

For AIX:

You will need to link the XML4C shared library libIXXML4C2_2.a to your application. When your application executes, it must find the same shared library in your LIBPATH environment variable. Thus, if your shared library is in $HOME/xml4csrc2_2_0/lib then you must type:

export LIBPATH=$HOME/xml4csrc2_2_0/lib:$LIBPATH

You must also make sure that the xlC_r compiler and the associated utility makeC++SharedLib_r are in your PATH environment variable. (These compiler files are usually installed in the /usr/lpp/xlC/bin directory.) Note that you must either have the environment variable XML4C2INTLDIR set, or keep the international converter files relative to libIXXML4C2_2.a (as they came with the original binary drop) for the program to find them.

For Solaris and Linux:

Everything in the above explanation holds, except for the environment variable, the compiler name and the shared library extension. The library search path environment variable is called LD_LIBRARY_PATH on Solaris and Linux. The compiler that needs to be on your PATH is CC (the Solaris native compiler) on Solaris and gcc on Linux. The shared library name is libIXXML4C2_2.so (instead of libIXXML4C2_2.a).

For HP-UX:

The same holds for HP-UX as well. The library search path environment variable is called SHLIB_PATH on HP-UX. The compiler that needs to be on your PATH is CC (the older CFront based compiler) or aCC (the advanced C++ compiler). The shared library name is libIXXML4C2_2.sl.

See Also: The FAQ Distribution page describes some scripts that can help you build the sources directly on UNIX.

 

What are the differences between the build settings on various UNIX platforms?

There are small differences between the various flavors of UNIX. Most UNIX programmers know these differences and work around them. For your convenience, here is a list of differences you should know in order to work smoothly with XML4C.

Environment variables: You need to set an environment variable to tell your executables where to search for shared libraries. That environment variable is LIBPATH on AIX, LD_LIBRARY_PATH on Solaris and Linux, and SHLIB_PATH on HP-UX. If your shared library resides in $HOME/xml4c2_2_0/lib, you must include this path in LIBPATH, LD_LIBRARY_PATH or SHLIB_PATH, depending on your platform.

Makefile names: The makefiles have different names to distinguish the platforms. Currently the makefiles are handwritten and tuned for the specific platform. In future versions, a generic makefile generator (like autoconf and automake) will be used to take care of the platform differences more elegantly. On Solaris the makefile name is Makefile.sun, on AIX you should use Makefile.aix, and on HP-UX you should use Makefile.hpaCC or Makefile.hpCC.
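
The per-platform differences can be captured in a small lookup, shown here as a hedged C++ sketch. The enum UnixPlatform and the three functions are names invented for this illustration and are not part of XML4C; the variable names and library suffixes are the ones listed in this FAQ.

```cpp
#include <cassert>
#include <string>

// Illustrative enum, not part of XML4C.
enum class UnixPlatform { AIX, Solaris, Linux, HPUX };

// Environment variable consulted when searching for shared libraries.
std::string libraryPathVariable(UnixPlatform platform)
{
    switch (platform) {
        case UnixPlatform::AIX:     return "LIBPATH";
        case UnixPlatform::Solaris: return "LD_LIBRARY_PATH";
        case UnixPlatform::Linux:   return "LD_LIBRARY_PATH";
        case UnixPlatform::HPUX:    return "SHLIB_PATH";
    }
    assert(false && "unhandled platform");
    return std::string();
}

// File name of the XML4C shared library on each platform.
std::string sharedLibraryName(UnixPlatform platform)
{
    const std::string base = "libIXXML4C2_2";
    switch (platform) {
        case UnixPlatform::AIX:     return base + ".a";
        case UnixPlatform::Solaris: return base + ".so";
        case UnixPlatform::Linux:   return base + ".so";
        case UnixPlatform::HPUX:    return base + ".sl";
    }
    assert(false && "unhandled platform");
    return std::string();
}

// Builds the korn-shell command that appends libDir to the search path.
std::string exportCommand(UnixPlatform platform, const std::string& libDir)
{
    const std::string var = libraryPathVariable(platform);
    return "export " + var + "=$" + var + ":" + libDir;
}
```

For instance, exportCommand(UnixPlatform::HPUX, "$HOME/xml4c2_2_0/lib") produces the SHLIB_PATH line you would type on HP-UX.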

 

Is XML4C thread-safe?

This is not a question that has a simple yes/no answer. Here are the rules for using XML4C in a multi-threaded environment:

Within an address space, an instance of the parser may be used without restriction from a single thread, or an instance of the parser can be accessed from multiple threads, provided the application guarantees that only one thread has entered a method of the parser at any one time.

When two or more parser instances exist in a process, the instances can be used concurrently, and without external synchronization.  That is, in an application containing two parsers and two threads, one thread can be running within the first parser concurrently with the second thread running within the second parser.

The same rules apply to XML4C DOM documents - multiple document instances may be concurrently accessed from different threads, but any given document instance can only be accessed by one thread at a time.

DOMStrings allow multiple concurrent readers.  All DOMString const methods are thread safe, and can be concurrently entered by multiple threads.  Non-const DOMString methods, such as appendData(), are not thread safe and the application must guarantee that no other methods (including const methods) are executed concurrently with them.
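
As a minimal sketch of the first rule, the following C++ wraps one shared parser instance in a lock so that only one thread is ever inside it. StubParser is a stand-in written for this example and is NOT an XML4C class; SharedParser and parseFromTwoThreads are likewise hypothetical names, and the code assumes a compiler that provides the standard <mutex> and <thread> headers.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>

// Stand-in for a parser instance; hypothetical, NOT an XML4C class.
class StubParser {
public:
    void parse(const std::string& systemId) { lastParsed_ = systemId; ++parseCount_; }
    int parseCount() const { return parseCount_; }
private:
    std::string lastParsed_;
    int parseCount_ = 0;
};

// One parser shared by several threads: every call takes the lock, so only
// one thread has entered a method of the parser at any one time.
class SharedParser {
public:
    void parse(const std::string& systemId)
    {
        std::lock_guard<std::mutex> guard(lock_);
        parser_.parse(systemId);
    }
    int parseCount()
    {
        std::lock_guard<std::mutex> guard(lock_);
        return parser_.parseCount();
    }
private:
    std::mutex lock_;
    StubParser parser_;
};

// Two threads drive the same shared parser; returns the total parse count.
int parseFromTwoThreads(SharedParser& shared, int documentsPerThread)
{
    assert(documentsPerThread >= 0);
    auto worker = [&shared, documentsPerThread](std::string name) {
        for (int i = 0; i < documentsPerThread; ++i)
            shared.parse(name);
    };
    std::thread first(worker, std::string("first.xml"));
    std::thread second(worker, std::string("second.xml"));
    first.join();
    second.join();
    return shared.parseCount();
}
```

If each thread instead owns its own parser instance, the second rule applies and no lock is needed at all, which is usually the simpler design.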

 

How do I find out what version of XML4C I am using?

The version string for XML4C is defined in one of the source files. Look inside the file src/com/ibm/xml/util/XML4CDefs.hpp and find out what the static variable gXML4C2FullVersionStr is defined to be. (It is usually of the form 2.1.0 or something similar.) This is the version of XML4C you are using.

If you don't have the source code, you have to find the version information from the shared library name. On Windows NT/95/98 right click on the DLL name IXXML4C2_2.dll in the bin directory and look up properties. The version information may be found on the Version tab.

On AIX, just look for the library name libIXXML4C2_2.a (or libIXXML4C2_2.so on Solaris/Linux and libIXXML4C2_2.sl on HP-UX). The version number is coded in the name of the library.

 

How do I uninstall XML4C?

XML4C installs itself in a single directory and does not set any registry entries. Thus, to uninstall it, you only need to remove the directory where you installed it, and all XML4C related files will be removed. The install directory for the binary drop is xml4c2_2_0 while the source files reside in the directory 'xml4csrc_2_n_n'.

 

How do I add an additional transcoding file to the existing set?

Transcoding files shipped with XML4C exist in the bin/intlFiles/locales directory on Win32 and in the lib/intlFiles/locales directory on AIX and Solaris. All transcoding files have the extension .cnv and are platform specific binary files. We provide the utility 'makeconv' to generate these binary files. To add an additional transcoding file, you need to first define your new code-set in ASCII format (which has the extension .ucm ). The coding format for an encoding may be obtained from one of the existing files in intlFiles/data/locales (in the source drop). After you create the .ucm file for your new language, you need to convert it to a binary form using makeconv.

Thus, if your new code-set is defined in file mynewcodeset.ucm, you would type

makeconv mynewcodeset.ucm

to create the binary transcoding file mynewcodeset.cnv. Make sure that this .cnv file is packaged in the same place as the others, i.e. in a directory ./intlFiles/locales relative to where your shared library is.

You can also add aliases for this encoding in the file 'convrtrs.txt', also present in the same directory as the converter files.

 

How are entity reference nodes handled in DOM?

If you are using the native DOM classes, the function setExpandEntityReferences controls how entities appear in the DOM tree. When setExpandEntityReferences is set to false (the default), an occurrence of an entity reference in the XML document will be represented by a subtree with an EntityReference node at the root whose children represent the entity expansion. The entity expansion will be a DOM tree representing the structure of the entity expansion, not a text node containing the entity expansion as text.

If setExpandEntityReferences is true, an entity reference in the XML document is represented by only the nodes that represent the entity expansion. The DOM tree will not contain any EntityReference nodes.
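
The two tree shapes can be modelled with a toy node type, as in the following sketch. Node, buildParagraph and countNodes are hypothetical names invented for this illustration; they are not the XML4C DOM classes and only mimic the structure described above for a document such as <p>Hello &who;</p>, assuming the entity who expands to the text "World".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for a DOM node; hypothetical, NOT an XML4C class.
struct Node {
    std::string type;             // "Element", "Text" or "EntityReference"
    std::string value;            // tag name, text content, or entity name
    std::vector<Node> children;
};

// Models the tree for <p>Hello &who;</p>, where the entity who expands to "World".
Node buildParagraph(bool expandEntityReferences)
{
    const Node expansion{"Text", "World", {}};
    Node p{"Element", "p", {}};
    p.children.push_back(Node{"Text", "Hello ", {}});
    if (expandEntityReferences) {
        // true: only the expansion appears, with no EntityReference node.
        p.children.push_back(expansion);
    } else {
        // false (the default): an EntityReference node roots the expansion.
        p.children.push_back(Node{"EntityReference", "who", {expansion}});
    }
    return p;
}

// Counts the nodes of the given type in the subtree rooted at node.
int countNodes(const Node& node, const std::string& type)
{
    assert(!type.empty());
    int count = (node.type == type) ? 1 : 0;
    for (const Node& child : node.children)
        count += countNodes(child, type);
    return count;
}
```

In both shapes the text of the expansion is present; the only difference is whether an EntityReference node sits between the element and the expansion.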

 

What kinds of URLs are currently supported in XML4C?

We now have a spec-compliant, but limited, implementation of the class URL.

  • The only protocol currently supported is "file://", which is used to refer to local files.
  • Only the 'localhost' string is supported in the host placeholder in the URL syntax.

This should work for command line arguments to samples as well as any usage in the XML file when referring to an external file.

Examples of what this implementation will allow you to do are:

e:\>domcount file:///e:/xml4c2/build/win32/vc6/debug/abc.xml

or

e:\>domcount file:///xml4c2/build/win32/vc6/debug/abc.xml
e:\>domcount file:///d:/abc.xml

or

e:\>domcount file://localhost/d:/abc.xml

An example of what you cannot do:

Refer to files using the 'file://' syntax and giving a relative path to the file.

This implies that if you are using the 'file://' syntax to refer to external files, you have to give the complete path to files even in the current directory.

You always have the option of not using the 'file://' syntax and referring to files by just giving the filename or a relative path to it as in:

domcount abc.xml
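
The restrictions above can be expressed as a small pre-check, sketched here in C++. The function isSupportedFileUrl is a hypothetical helper written for this FAQ, not part of the XML4C URL class, and it only encodes the three stated rules: the file protocol, an empty or 'localhost' host, and a complete path.

```cpp
#include <cassert>
#include <string>

// Hypothetical pre-check (not an XML4C API) for the restrictions listed above.
bool isSupportedFileUrl(const std::string& url)
{
    const std::string scheme = "file://";
    if (url.compare(0, scheme.size(), scheme) != 0)
        return false;   // only the file protocol is supported

    // Everything between the scheme and the next '/' is the host part.
    const std::string::size_type pathStart = url.find('/', scheme.size());
    if (pathStart == std::string::npos)
        return false;   // no complete path follows the host

    const std::string host = url.substr(scheme.size(), pathStart - scheme.size());
    return host.empty() || host == "localhost";
}
```

A relative reference such as file://abc.xml is rejected because abc.xml lands in the host position. Plain file names like abc.xml are not URLs at all and are handled separately, as described above.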

 

Can I use XML4C to parse HTML?

Yes, if it follows the XML spec rules. Most HTML, however, does not follow the XML rules, and will therefore generate XML well-formedness errors.

 

Can I use the utility classes used in the sample programs?

The utility classes used in the sample programs are only for illustration. They are very limited in their functionality and IBM does not give support for these classes. They were written primarily to support XML4C itself and have no external documentation. You can use them as long as you understand their limitations and take full responsibility to support your customers.

 

I keep getting an error: "invalid UTF-8 character". What's wrong?

There are many Unicode characters that are not allowed in your XML document, according to the XML spec. Typical disallowed characters are control characters, which are not allowed even if you escape them using the Character Reference form: &#xxxx; See the XML spec, sections 2.2 and 4.1, for details. If the parser is generating this error, it is very likely that there is a character in the document that you can't see. You can generally use a UNIX command like "od -hc" to find it.

Another reason for this error is that your file is in some encoding other than UTF or ASCII, but the file carries no encoding="" declaration to tell the parser what its real encoding is.
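
If od is not at hand, a short C++ check can locate the offending character. The sketch below follows the Char production in section 2.2 of the XML 1.0 spec; the function names isLegalXmlChar and findIllegalControlByte are invented for this example and are not XML4C APIs. The byte scan only looks at values below 0x80, so it assumes an ASCII-compatible encoding such as UTF-8.

```cpp
#include <cassert>
#include <string>

// True if codePoint matches the Char production of XML 1.0, section 2.2.
bool isLegalXmlChar(unsigned long codePoint)
{
    return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
        || (codePoint >= 0x20 && codePoint <= 0xD7FF)
        || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
        || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}

// Returns the offset of the first single-byte control character that XML
// forbids, or std::string::npos if there is none. Bytes of 0x80 and above
// belong to multi-byte sequences and are skipped by this simple scan.
std::string::size_type findIllegalControlByte(const std::string& bytes)
{
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80 && !isLegalXmlChar(b))
            return i;
    }
    return std::string::npos;
}
```

Reading the file into a std::string and passing it to findIllegalControlByte gives the byte offset to inspect in an editor or hex dump.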

 

What encodings are supported by XML4C?

XML4C uses a subset of IBM's International Classes for Unicode (ICU) for encoding & Unicode support.

Besides ASCII, the following encodings are currently supported:

  • UTF-8
  • UTF-16 Big Endian, UTF-16 Little Endian
  • IBM-1208
  • ISO Latin-1 (ISO-8859-1)
  • ISO Latin-2 (ISO-8859-2) [Bosnian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian (in Latin transcription), Serbo-Croatian, Slovak, Slovenian, Upper Sorbian and Lower Sorbian]
  • ISO Latin-3 (ISO-8859-3) [Maltese, Esperanto]
  • ISO Latin-4 (ISO-8859-4)
  • ISO Latin Cyrillic (ISO-8859-5)
  • ISO Latin Arabic (ISO-8859-6) [Arabic]
  • ISO Latin Greek (ISO-8859-7)
  • ISO Latin Hebrew (ISO-8859-8) [Hebrew]
  • ISO Latin-5 (ISO-8859-9) [Turkish]
  • Extended Unix Code, packed for Japanese (euc-jp, eucjis)
  • Japanese Shift JIS (shift-jis)
  • Chinese (big5)
  • Extended Unix Code, packed for Korean (euc-kr)
  • Russian Unix, Cyrillic (koi8-r)
  • Windows Thai (cp874)
  • Latin 1 Windows (cp1252)
  • cp858
  • EBCDIC encodings:
    • EBCDIC US (ebcdic-cp-us)
    • EBCDIC Canada (ebcdic-cp-ca)
    • EBCDIC Netherland (ebcdic-cp-nl)
    • EBCDIC Denmark (ebcdic-cp-dk)
    • EBCDIC Norway (ebcdic-cp-no)
    • EBCDIC Finland (ebcdic-cp-fi)
    • EBCDIC Sweden (ebcdic-cp-se)
    • EBCDIC Italy (ebcdic-cp-it)
    • EBCDIC Spain & Latin America (ebcdic-cp-es)
    • EBCDIC Great Britain (ebcdic-cp-gb)
    • EBCDIC France (ebcdic-cp-fr)
    • EBCDIC Hebrew (ebcdic-cp-he)
    • EBCDIC Switzerland (ebcdic-cp-ch)
    • EBCDIC Roece (ebcdic-cp-roece)
    • EBCDIC Yugoslavia (ebcdic-cp-yu)
    • EBCDIC Iceland (ebcdic-cp-is)
    • EBCDIC Urdu (ebcdic-cp-ar2)
    • Latin 0 EBCDIC

Additional encodings to be available later:

  • EBCDIC Arabic (ebcdic-cp-ar1)
  • Chinese for PRC (mixed 1/2 byte) (gb2312)
  • Japanese ISO-2022-JP (iso-2022-jp)
  • Cyrillic (koi8-r)

 

Copyright (c) IBM Corp. 1999, Center for Java Technology, Cupertino, USA