Character Encoding Issues

Questions

  1. Why
    1. What is the default character encoding of the request or response body?
    2. Why does everything have to be this way?
  2. How
    1. How do I change how GET parameters are interpreted?
    2. How do I change how POST parameters are interpreted?
    3. What can you recommend to just make everything work? (How to use UTF-8 everywhere).
    4. How can I test if my configuration will work correctly?
    5. How can I send higher characters in HTTP headers?
  3. Troubleshooting
    1. I'm having a problem with character encoding in Tomcat 5

Answers

Why

What is the default character encoding of the request or response body?

...

  1. Java Servlet Specification 4.0
  2. HTTP 1.1 Protocol: Message Syntax and Routing, HTTP 1.1 Protocol: Semantics and Content
  3. URI Syntax
  4. ARPA Internet Text Messages
  5. HTML 4, HTML 5

Default encoding for request and response bodies

...

  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably.
  2. Modern browsers encode URIs using UTF-8. Some browsers appear to use the encoding of the current page to encode URIs for links.
  3. HTML 4.0 recommends the use of UTF-8 to encode the query string.
  4. When in doubt, use POST for any data you think might have problems surviving a trip through the query string.
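
The effect of a charset mismatch on a query-string parameter can be sketched in Java as follows. This is a minimal illustration using only java.net.URLEncoder and java.net.URLDecoder; the class name QueryStringCharsetDemo and the sample value are hypothetical and are not part of Tomcat.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: shows how one percent-encoded value decodes
// differently depending on the charset the receiver assumes.
final class QueryStringCharsetDemo {

    // Percent-encode a value as a client using the given charset would.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Percent-decode a value as a server assuming the given charset would.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        String onTheWire = encode(original, "UTF-8");
        System.out.println(onTheWire); // caf%C3%A9

        // Same charset on both sides: the original text comes back.
        System.out.println(decode(onTheWire, "UTF-8").equals(original)); // true

        // Receiver assumes ISO-8859-1: bytes C3 A9 become two separate characters.
        System.out.println(decode(onTheWire, "ISO-8859-1").length()); // 5
    }
}
```

The wire format is identical in both cases; only the receiver's assumption differs, which is why the server-side setting for URI decoding matters.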

Default Encoding for POST

...

  1. RFC 2616 Section 3.4.1 stated that recipients of an HTTP message must respect the character encoding specified by the sender in the Content-Type header if the encoding is supported. A missing charset parameter allows the recipient to "guess" what encoding is appropriate.
  2. Most web browsers today do not specify the character set of a request, even when it is something other than ISO-8859-1, which seems to violate the HTTP specification. Most browsers appear to send the request body using the encoding of the page that generated the POST: if the <form> element came from a page with a specific encoding, that encoding is used to submit the POST data for that form.
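
The consequence of an unlabelled request body can be sketched as follows. This is a simplified Java illustration, assuming a form submitted from a UTF-8 page and a server that falls back to ISO-8859-1; it ignores percent-encoding of the body, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: models a body sent in one charset and read in another.
final class BodyCharsetDemo {

    // Bytes a browser would send for text typed into a form on a page in pageCharset.
    static byte[] encodeBody(String text, Charset pageCharset) {
        return text.getBytes(pageCharset);
    }

    // Text the server obtains when it decodes the body with assumedCharset.
    static String decodeBody(byte[] body, Charset assumedCharset) {
        return new String(body, assumedCharset);
    }

    public static void main(String[] args) {
        String typed = "\u00fc"; // u with diaeresis
        byte[] body = encodeBody(typed, StandardCharsets.UTF_8);

        // No charset in Content-Type, so the server falls back to ISO-8859-1.
        String guessed = decodeBody(body, StandardCharsets.ISO_8859_1);
        System.out.println(guessed.equals(typed)); // false: one character became two

        // With the correct charset the text survives intact.
        String correct = decodeBody(body, StandardCharsets.UTF_8);
        System.out.println(correct.equals(typed)); // true
    }
}
```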

Percent Encoding for application/x-www-form-urlencoded

...

Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.

How

How do I change how GET parameters are interpreted?

...

  1. Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
  2. Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.

Starting with Tomcat 8.0.0 (8.0.0-RC3, to be specific), the default value of the URIEncoding attribute on the <Connector> element depends on the "strict servlet compliance" setting. With strict compliance off (the default), URIEncoding defaults to UTF-8. If "strict servlet compliance" is enabled, the default value is ISO-8859-1.

...

  1. Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP Connector, AJP Connector.
  2. Set the default request character encoding in either the Tomcat conf/web.xml file or the web app's web.xml file, by setting <request-character-encoding> or by using a character encoding filter.
  3. Change all your JSPs to include the charset name in their contentType. For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for the pages in XML syntax (aka JSP Documents).
  4. Change all your servlets to set the content type for responses and to specify UTF-8 as the charset in that content type. Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8").
  5. Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate.
  6. Disable any valves or filters that may read request parameters before your character encoding filter or JSP page has a chance to set the encoding to UTF-8. For more information see http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html.

...

You have to encode them in some way before you insert them into a header. URL-encoding (percent-encoding) is a good choice: each byte is written as a % followed by its two hexadecimal digits, high digit first.
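
A minimal Java sketch of that percent-encoding step is shown below. It assumes the value is converted to bytes as UTF-8 and that only the "unreserved" characters from the URI Syntax spec are left as-is. Both choices are assumptions for illustration, and the receiver has to decode with the same charset. The class name HeaderValueEncoder is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns an arbitrary string into a US-ASCII-only header value.
final class HeaderValueEncoder {

    private static final String HEX = "0123456789ABCDEF";

    // Percent-encode every byte that is not an unreserved ASCII character.
    static String percentEncode(String value) {
        StringBuilder out = new StringBuilder();
        // Assumption: UTF-8 is the charset agreed with the receiver.
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int unsigned = b & 0xFF;
            if (isUnreserved(unsigned)) {
                out.append((char) unsigned);
            } else {
                out.append('%')
                   .append(HEX.charAt(unsigned >> 4))    // high hex digit
                   .append(HEX.charAt(unsigned & 0x0F)); // low hex digit
            }
        }
        return out.toString();
    }

    // Letters, digits, and - . _ ~ pass through unchanged.
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    public static void main(String[] args) {
        // "resume.pdf" with acute accents on both e characters
        System.out.println(percentEncode("r\u00e9sum\u00e9.pdf")); // r%C3%A9sum%C3%A9.pdf
    }
}
```

The result contains only US-ASCII characters, so it can be placed in a header without violating the constraint described above.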

Troubleshooting

I'm having a problem with character encoding in Tomcat 5

...
