Character Encoding Issues

Questions

  1. What is the default character encoding of the request or response body?
  2. How do I change how GET parameters are interpreted?
  3. How do I change how POST parameters are interpreted?
  4. How can I test if my configuration will work correctly?
  5. How can I send higher characters in HTTP headers?
  6. What can you recommend to just make everything work?
  7. Why does everything have to be this way?
  8. I'm having a problem with character encoding in Tomcat 5

Answers

What is the default character encoding of the request or response body?

If a character encoding is not specified, the Servlet specification requires that an encoding of ISO-8859-1 is used. The character encoding for the body of an HTTP message (request or response) is specified in the Content-Type header field. An example of such a header is Content-Type: text/html; charset=ISO-8859-1 which explicitly states that the default (ISO-8859-1) is being used.
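As a rough illustration of that rule, the following sketch picks the charset out of a Content-Type value and falls back to ISO-8859-1 when none is given. The class and method names are hypothetical, and this is not how any particular container implements it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or ISO-8859-1 when the value carries no charset parameter.
    // Charset.forName throws an unchecked exception for unknown charset names.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8")); // prints UTF-8
        System.out.println(charsetOf("text/html"));                // prints ISO-8859-1
    }
}
```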

How do I change how GET parameters are interpreted?

...

  1. Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
  2. Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.
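To see why the choice of encoding matters, the sketch below decodes the same percent-encoded parameter value with two different charsets. It uses only java.net.URLDecoder and is not Tomcat's own decoding code; the class and method names are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecoding {

    // Hypothetical helper: decodes a raw (percent-encoded) parameter value
    // using the named charset, wrapping the checked exception.
    static String decode(String rawValue, String encoding) {
        try {
            return URLDecoder.decode(rawValue, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, percent-encoded from its UTF-8 bytes.
        String raw = "caf%C3%A9";

        // Decoded as UTF-8, the two escaped bytes form the single character U+00E9.
        System.out.println(decode(raw, "UTF-8"));

        // Decoded as ISO-8859-1, each escaped byte becomes its own character
        // (U+00C3 followed by U+00A9), which is the classic garbled result.
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```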

How do I change how POST parameters are interpreted?

...

webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java

How can I test if my configuration will work correctly?

...

<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
     <title>Character encoding test page</title>
   </head>
   <body>
     <p>Data posted to this form was:
     <%
       request.setCharacterEncoding("UTF-8");
       out.print(request.getParameter("mydata"));
     %>

     </p>
     <form method="POST" action="index.jsp">
       <input type="text" name="mydata">
       <input type="submit" value="Submit" />
       <input type="reset" value="Reset" />
     </form>
   </body>
</html>

How can I send higher characters in my HTTP headers?

You have to encode them in some way before you insert them into a header. Using URL encoding (a % followed by the two hexadecimal digits of each byte value) would be a good idea.
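One way to do this in Java is with java.net.URLEncoder and java.net.URLDecoder, as in the sketch below. The class and method names are hypothetical, and the choice of UTF-8 is an assumption: sender and receiver have to agree on it. Note that URLEncoder produces form encoding, so a space becomes + rather than %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class HeaderValueEncoding {

    // Encodes a value so that the result contains only US-ASCII characters.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Reverses encode() on the receiving side.
    static String decode(String headerValue) {
        try {
            return URLDecoder.decode(headerValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";        // "caf" followed by U+00E9
        String safe = encode(original);
        System.out.println(safe);             // prints caf%C3%A9
        System.out.println(decode(safe).equals(original)); // prints true
    }
}
```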

What can you recommend to just make everything work?

...

  1. Set URIEncoding="UTF-8" on your <Connector> in server.xml
  2. Use a character encoding filter (see "How do I change how POST parameters are interpreted?") with the default encoding set to UTF-8
  3. Change all your JSPs to set the correct Content-Type (use <%@page contentType="mime/type; charset=UTF-8" %>)
  4. Change all your servlets to set the content type for responses to UTF-8
  5. Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 as the content type
  6. Disable any valves or filters that may read request parameters before your character encoding filter or jsp page has a chance to set the encoding to UTF-8. For more information see http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html.

Why does everything have to be this way?

Everything covered in this page comes down to practical interpretation of a number of specifications. When working with Java servlets, the Java Servlet Specification is the primary reference, but the servlet spec itself relies on older specifications such as HTTP for its foundation. Here are a couple of references before we cover exactly where these items are located in them.

  1. [http://jcp.org/aboutJava/communityprocess/mrel/jsr154/index2.html Java Servlet Specification 2.5]
  2. [http://jcp.org/aboutJava/communityprocess/final/jsr154/index.html Java Servlet Specification 2.4]
  3. [http://www.w3.org/Protocols/rfc2616/rfc2616.txt HTTP 1.1 Protocol] ([http://www.w3.org/Protocols/rfc2616/rfc2616.html hyperlinked version])
  4. [http://www.ietf.org/rfc/rfc2396.txt URI Syntax]
  5. [http://www.w3.org/Protocols/rfc822/ ARPA Internet Text Messages]
  6. [http://www.w3.org/TR/html4 HTML 4]

Default encoding for GET

The character set for HTTP query strings (that's the technical term for 'GET parameters') can be found in sections 2 and 2.1 of the "URI Syntax" specification. The character set is defined to be [http://en.wikipedia.org/wiki/ASCII US-ASCII]. Any character that does not map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax specification says that characters outside of US-ASCII must be encoded using {{%}} escape sequences: each character is encoded as a literal {{%}} followed by the two hexadecimal digits of its character code. Thus, {{a}} (US-ASCII character code 0x61) is equivalent to {{%61}}.
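The mechanics of a % escape can be sketched in a few lines of Java. The helper below is hypothetical and escapes every byte, including characters that would not need it, purely to show the byte-to-hex mapping. For characters outside US-ASCII the result depends on which charset turns the character into bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEscape {

    // Hypothetical helper: writes every byte of the text, as encoded in the
    // given charset, as a % followed by two uppercase hexadecimal digits.
    static String escapeAll(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            // Mask to 0..255 so that bytes above 0x7F are not sign-extended.
            sb.append('%').append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAll("a", StandardCharsets.US_ASCII));        // prints %61
        System.out.println(escapeAll("\u00e9", StandardCharsets.ISO_8859_1)); // prints %E9
        System.out.println(escapeAll("\u00e9", StandardCharsets.UTF_8));      // prints %C3%A9
    }
}
```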

Some notes about the character encoding of URIs:

  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably. Most of the web uses ISO-8859-1 as the default for query strings.
  2. Many browsers are starting to offer (default) options of encoding URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note above regarding browser behavior for POST encoding).
  3. [http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars HTML 4.0] recommends the use of UTF-8 to encode the query string.
  4. When in doubt, use POST for any data you think might have problems surviving a trip through the query string.
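The first note can be checked directly. The sketch below (class and method names are illustrative only) decodes single bytes with both charsets and compares the results: every code from 0x20 to 0x7E decodes identically, while a byte such as 0xE9 does not, because it has no meaning in US-ASCII.

```java
import java.nio.charset.StandardCharsets;

class AsciiLatin1Overlap {

    // Returns true if the single byte decodes to the same text
    // in US-ASCII and in ISO-8859-1.
    static boolean decodesAlike(int code) {
        byte[] one = { (byte) code };
        String ascii = new String(one, StandardCharsets.US_ASCII);
        String latin1 = new String(one, StandardCharsets.ISO_8859_1);
        return ascii.equals(latin1);
    }

    // Checks the whole printable range 0x20..0x7E.
    static boolean sameInPrintableRange() {
        for (int code = 0x20; code <= 0x7E; code++) {
            if (!decodesAlike(code)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameInPrintableRange()); // prints true
        System.out.println(decodesAlike(0xE9));     // prints false
    }
}
```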

Default Encoding for POST

[http://en.wikipedia.org/wiki/Iso-8859-1 ISO-8859-1] is defined as the default character set for HTTP request and response bodies in the servlet specification (request encoding: section 4.9 for spec version 2.4, section 3.9 for spec version 2.5; response encoding: section 5.4 for both spec versions 2.4 and 2.5). This default is historical: it comes from sections 3.4.1 and 3.7.1 of the HTTP/1.1 specification.
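The practical effect of this default can be shown with plain byte-to-string conversion. In the sketch below, the hypothetical decodeBody helper stands in for whatever reads the body: when no charset is declared it falls back to ISO-8859-1, so bytes that a client actually produced as UTF-8 come out garbled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BodyDecoding {

    // Hypothetical helper: decodes body bytes with the declared charset,
    // or with the ISO-8859-1 default when none was declared (null).
    static String decodeBody(byte[] body, Charset declared) {
        Charset cs = (declared != null) ? declared : StandardCharsets.ISO_8859_1;
        return new String(body, cs);
    }

    public static void main(String[] args) {
        // The client sent U+00E9 encoded as UTF-8 (two bytes: 0xC3 0xA9).
        byte[] body = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // No charset declared: the default turns the two bytes into two characters.
        System.out.println(decodeBody(body, null).length());                   // prints 2

        // Charset declared as UTF-8: the original single character is recovered.
        System.out.println(decodeBody(body, StandardCharsets.UTF_8).length()); // prints 1
    }
}
```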

Some notes about the character encoding of a POST request:

...

Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.
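A quick way to find out whether a value can go into a header unchanged is to ask a US-ASCII encoder, as in this small sketch (the class and method names are illustrative only):

```java
import java.nio.charset.StandardCharsets;

class AsciiHeaderCheck {

    // Returns true if every character of the value is representable in US-ASCII.
    static boolean isUsAscii(String value) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        System.out.println(isUsAscii("text/html; charset=UTF-8")); // prints true
        System.out.println(isUsAscii("caf\u00e9"));                // prints false
    }
}
```

A value that fails this check needs to be encoded before it is placed in a header.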

I'm having a problem with character encoding in Tomcat 5

In Tomcat 5, there have been issues reported with respect to character encoding (usually of the form "request.setCharacterEncoding(String) doesn't work"). Odds are, it's not a bug. Before filing a bug report, see these bug reports as well as any bug reports linked to them:

...

  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=23929 23929]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=25360 25360]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=25231 25231]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=25235 25235]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=22666 22666]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=24557 24557]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=24345 24345]
  • [http://issues.apache.org/bugzilla/show_bug.cgi?id=25848 25848]