XML encoding: pom.xml, site.xml, ... 

Up to and including Maven 2.0.7, XML encoding support is buggy:

  • XML streams are read with the platform encoding, which causes problems with non-ASCII characters on ASCII-based platforms, and with every character on non-ASCII platforms (Z/OS with EBCDIC),
  • XML streams are converted to a String (with the platform encoding), and the resulting String is reworked a lot (interpolation...) before being parsed by an XML parser,
  • even if XML streams were passed directly to the XML parser, the MXParser used by Maven does not support encoding detection itself...
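To make the first point concrete, here is a minimal sketch in plain Java of what happens when bytes written in one encoding are decoded with another, which is what a reader using the platform encoding does whenever the platform and the file disagree. The class and method names are hypothetical and are not part of Maven or plexus-utils.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only: hypothetical names, not Maven or plexus-utils code. */
final class PlatformEncodingDemo {

    /** Encodes text with one charset and decodes the bytes with another, as a misconfigured reader would. */
    static String roundTrip(String text, Charset writtenWith, Charset readWith) {
        byte[] bytes = text.getBytes(writtenWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        String name = "<name>caf\u00e9</name>"; // contains one accented character

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Written as UTF-8 but read as ISO-8859-1: the accented character
        // becomes two unrelated characters.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

On an EBCDIC platform the same mismatch affects every character, including the markup itself, not just the accented ones.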

Changing the parser, and then the interpolation code, is a big task.

Solution: use the XmlReader class from Rome to detect the encoding of XML streams, as defined in the XML specification

It won't change much in the code: only the Reader instantiation. All other code (particularly interpolation) can remain the same.

Note: the corresponding XmlStreamWriter and WriterFactory have been added in plexus-utils 1.4.4, and XmlReader has been renamed to XmlStreamReader for consistency
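To give an idea of what "detecting the encoding" involves, here is a deliberately simplified sketch in plain Java. It is not the Rome or plexus-utils implementation: the class name, the method name and the 200-byte look-ahead are assumptions made for illustration. It only covers a byte order mark, an XML declaration in an ASCII-compatible encoding, and the UTF-8 default. UTF-16 without a byte order mark, UTF-32 and EBCDIC are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified illustration only: hypothetical class, not the Rome/plexus-utils XmlStreamReader. */
final class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration at the very start of the input.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the encoding name suggested by the first bytes of an XML stream. */
    static String detect(byte[] head) {
        // 1. A byte order mark, when present, decides.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        // 2. Otherwise look for an XML declaration. Decoding as ISO-8859-1 maps each byte
        //    to one char, which is enough to read an ASCII-compatible declaration.
        String prolog = new String(head, 0, Math.min(head.length, 200), StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        if (m.find()) {
            return m.group(1);
        }
        // 3. No hint at all: the XML default applies.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Once the encoding name is known, the stream can be wrapped in an ordinary InputStreamReader, which is why the rest of the code (interpolation included) does not need to change.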

Integration Level 1: detect XML encoding for user-written XML files

These files really need good XML encoding support, since users need accents and other local characters (Japanese, Greek, Cyrillic, ...)

  • [MODELLO-92]: use XmlStreamReader to read Modello .mdo files and update the misc. generators, DONE in modello 1.0-alpha-17
  • [MNG-2254]: use XmlStreamReader to read pom.xml, settings.xml and profiles.xml, DONE in Maven 2.0.8
  • [MANTTASKS-79]: add XML encoding detection support for pom.xml and settings.xml in Maven Ant Tasks, DONE in 2.0.8
  • [MINSTALL-44]: add XML encoding support when reading/writing POM files in install plugin, DONE in 2.3
  • [MDEPLOY-66]: add XML encoding support when reading/writing POM files in deploy plugin, DONE in 2.4
  • [MRELEASE-87]: Poms are written with wrong encodings, DONE in 2.0-beta-8
  • [MSITE-239]: use XmlStreamReader to read site.xml, DONE in maven-site-plugin 2.0-beta-6
  • [DOXIA-133]: XML encoding detection for xdoc, docbook, fml and xhtml files, DONE in doxia 1.0-alpha-9 and doxia-site 1.0-alpha-9
  • [MECLIPSE-56]: problem with non-ascii characters in generated .project-file, DONE in eclipse-plugin 2.5
  • [MREPOSITORY-10]: DONE in 2.1

Integration Level 2: detect XML encoding for internal XML files

These files shouldn't really need special characters, since they are technical descriptors (plexus.xml and so on). But this change is useful for non-ASCII platforms (Z/OS with EBCDIC), where even simple ASCII characters can't be read with the platform encoding.

  • [PLX-343]: use XmlStreamReader in plexus-container-default to load internal XML configuration files, DONE in 1.0-alpha-30, integrated in Maven 2.1-SNAPSHOT 31/7/2007
  • [MANTTASKS-14]: make Maven Ant Tasks work on Z/OS
  • TODO: use XmlStreamReader class wherever an XML stream has to be changed into a String/Reader
  • TODO: check correct encoding when XML data are written to a stream through a Writer, using XmlStreamWriter if necessary

Technical Notes 

new FileReader/Writer(File) vs Reader/WriterFactory.newPlatformReader/Writer(File)

When using the new FileReader(File) or new FileWriter(File) API, the platform encoding is used for conversion between bytes and characters.

The Java API documentation is explicit about this (if you read it carefully: look at the class description, not the constructor comments), but it is not obvious when using the API: developers tend to forget that they are choosing an encoding when they use it.

The ReaderFactory.newPlatformReader(File) and WriterFactory.newPlatformWriter(File) APIs simply call the previous API, but when using them, the encoding choice is explicit.

Once you have replaced your FileReader/Writer constructor with this API, which is explicit about the encoding choice, you realize that if the file being read or written is XML, the platform encoding is the wrong choice: you need XML encoding detection, which is the purpose of ReaderFactory.newXmlReader(File) and WriterFactory.newXmlWriter(File)...
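The hidden encoding choice can be shown with the JDK alone. The sketch below uses a hypothetical stand-in for a "platform reader" factory method (it is not the plexus-utils class) that spells out the charset, and checks that new FileReader(File) decodes a file in exactly the same way.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

/** Illustration only: hypothetical stand-in, not the plexus-utils ReaderFactory. */
final class PlatformReaderDemo {

    /** Same behavior as new FileReader(file), but the charset choice is visible in the code. */
    static Reader newPlatformReader(File file) {
        try {
            return new InputStreamReader(new FileInputStream(file), Charset.defaultCharset());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads everything from the reader and closes it. */
    static String readAll(Reader reader) {
        try (Reader in = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes to a temporary file and compares both ways of reading it. */
    static boolean matchesFileReader(byte[] content) {
        try {
            File file = File.createTempFile("platform-reader", ".xml");
            try {
                Files.write(file.toPath(), content);
                String viaFileReader = readAll(new FileReader(file));
                String viaExplicit = readAll(newPlatformReader(file));
                return viaFileReader.equals(viaExplicit);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both readers produce the same characters whatever the bytes are, which is the point: FileReader makes an encoding choice too, it just does not show it.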

Integrating XML encoding detection in Maven plugins

A lot of Maven plugins read and write XML files, and they currently do so with the platform encoding (i.e. FileReader/Writer): they should be changed to use Reader/WriterFactory.newPlatformReader/Writer.

But there is a problem with Maven versions earlier than 2.0.6: in Maven 2.0.5 and earlier, the plexus-utils version is forced by Maven Core and cannot be overridden by a plugin. MNG-2892 (released in Maven 2.0.6) fixed this limitation. So Maven 2.0.6 is a prerequisite for fixing plugins...

What can be done?

  1. In maven-site-plugin, the XML encoding classes from plexus-utils were copied into the plugin's sources (MSITE-242 tracks their removal): this plugin reads a lot of XML files and has a strong need for encoding support, so this bad solution was really the best one. But it would not be good to make such a copy in every plugin.
  2. A lightweight solution is to replace new FileReader( File ) with new InputStreamReader( new FileInputStream( File ), "utf-8" ): if XML encoding detection is not supported, at least reading the file with the default XML encoding, UTF-8, is both more powerful and more coherent (the lack of detection is then a missing feature rather than a bug).
  3. Another solution would be to have the XML encoding classes in a library other than plexus-utils...
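The second option needs nothing beyond the JDK. Here is a minimal sketch, with hypothetical class and method names, that writes an XML snippet as UTF-8 bytes and reads it back through an explicitly UTF-8 reader, so the result no longer depends on the platform encoding.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Illustration only: hypothetical names, not taken from any Maven plugin. */
final class Utf8XmlReaderDemo {

    /** Replacement for new FileReader(file): always decodes as UTF-8, the XML default. */
    static Reader newUtf8Reader(File file) {
        try {
            return new InputStreamReader(new FileInputStream(file), "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the text as UTF-8 bytes to a temporary file, then reads it back with newUtf8Reader. */
    static String writeThenRead(String xml) {
        try {
            File file = File.createTempFile("utf8-reader", ".xml");
            try {
                Files.write(file.toPath(), xml.getBytes(StandardCharsets.UTF_8));
                try (Reader in = newUtf8Reader(file)) {
                    StringBuilder sb = new StringBuilder();
                    char[] buf = new char[1024];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                    return sb.toString();
                }
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A file that declares another encoding in its XML declaration is still misread with this approach. Only real detection handles that case.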

In plugins reading and writing POM files (install, deploy and release), there is no choice: XML encoding support must be the same as in Maven Core, so the classes will be copied into the plugins. But in assembly, for example, the assembly.xml file is now simply read as UTF-8.

Here are the Jira issues that track where the classes have been copied, in order to schedule their removal when the prerequisite is upgraded to Maven 2.0.6+:

  • MSITE-242 for site plugin: done in maven-site-plugin 2.0
  • MINSTALL-46 for install plugin: done in maven-install-plugin 2.3
  • MDEPLOY-70 for deploy plugin: done in maven-deploy-plugin 2.5
  • MRELEASE-316 for release plugin: done in maven-release-plugin 2.0-beta-8
  • MODELLO-110 for Modello: done in modello 1.0-alpha-19

Subversion properties

XML files should ideally be marked as "text/xml" to let svn and other tools know that XML encoding detection should be used:

 svn propset svn:mime-type text/xml *.xml *.mdo *.fml *.xhtml

Quick tests with viewvc 1.0.3 showed that such a mark did not change anything: a UTF-16 XML file was considered binary, and no diff was provided.
