Comment: Add note about working on a clean install

Character Encoding Issues

Questions

...

  1. What is the default character encoding of the request or response body?

...

  1. How do I change how GET parameters are interpreted?
  2. How do I change how POST parameters are interpreted?
  3. What can you recommend to just make everything work? (How to use UTF-8 everywhere).
  4. How can I test if my configuration will work correctly?
  5. How can I send higher characters in HTTP headers?

...

  1. I'm having a problem with character encoding in Tomcat 5

Answers

Why

What is the default character encoding of the request or response body?

If a character encoding is not specified, the Servlet specification requires that an encoding of ISO-8859-1 is used. The character encoding for the body of an HTTP message (request or response) is specified in the Content-Type header field. An example of such a header is Content-Type: text/html; charset=ISO-8859-1, which explicitly states that the default (ISO-8859-1) is being used.

References: HTTP 1.1 Specification, Section 3.7.1

The above general rules apply to Servlets. The behaviour of JSP pages is further specified by the JSP specification. The request character encoding handling is the same, but response character encoding behaves a bit differently. See chapter "JSP.4.2 Response Character Encoding". For JSP pages in standard syntax the default response charset is the usual ISO-8859-1, but for the ones in XML syntax it is UTF-8.

...

Everything covered in this page comes down to practical interpretation of a number of specifications. When working with Java servlets, the Java Servlet Specification is the primary reference, but the servlet spec itself relies on older specifications such as HTTP for its foundation. Here are a couple of references before we cover exactly where these items are located in them.

  1. Java Servlet Specification 2.5
  2. Java Servlet Specification 2.4
  3. HTTP 1.1 Protocol (hyperlinked version)
  4. URI Syntax
  5. ARPA Internet Text Messages
  6. HTML 4

Default encoding for request and response bodies

See 'Default Encoding for POST' below.

Default encoding for GET

The character set for HTTP query strings (that's the technical term for 'GET parameters') can be found in sections 2 and 2.1 of the "URI Syntax" specification. The character set is defined to be US-ASCII. Any character that does not map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax specification says that characters outside of US-ASCII must be encoded using % escape sequences: each byte is encoded as a literal % followed by the two hexadecimal digits of its value. Thus, the letter a (US-ASCII character code 97 = 0x61) is equivalent to %61. The specification does not say which character encoding turns a non-ASCII character into bytes in the first place: there is no default encoding for URIs specified anywhere, which is why there is a lot of confusion when it comes to decoding these values.

Some notes about the character encoding of URIs:

  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably. Most of the web uses ISO-8859-1 as the default for query strings.
  2. Many browsers are starting to offer (default) options of encoding URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note above regarding browser behavior for POST encoding).
  3. HTML 4.0 recommends the use of UTF-8 to encode the query string.
  4. When in doubt, use POST for any data you think might have problems surviving a trip through the query string.
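The missing default can be made concrete with a small sketch. It percent-encodes the same string twice, once per charset, using the standard java.net.URLEncoder (the Charset overload needs Java 10 or later). The class name QueryEncodingDemo and the sample value are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QueryEncodingDemo {
    // Percent-encode a query-string value using the given charset.
    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source file stays pure ASCII.
        String value = "caf\u00e9";

        // One byte for é in ISO-8859-1, two bytes in UTF-8.
        System.out.println(encode(value, StandardCharsets.ISO_8859_1)); // caf%E9
        System.out.println(encode(value, StandardCharsets.UTF_8));      // caf%C3%A9
    }
}
```

A server that receives caf%C3%A9 and decodes it as ISO-8859-1 sees two characters instead of one, which is the usual symptom of a mismatch between the client and the connector.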

Default Encoding for POST

ISO-8859-1 is defined as the default character set for HTTP request and response bodies in the servlet specification (request encoding: section 4.9 for spec version 2.4, section 3.9 for spec version 2.5; response encoding: section 5.4 for both spec versions 2.4 and 2.5). This default is historical: it comes from sections 3.4.1 and 3.7.1 of the HTTP/1.1 specification.
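As a rough illustration of what this default means in practice, the sketch below decodes the same body bytes twice: once with the ISO-8859-1 default and once with the charset the client actually used. It uses only the JDK, not the servlet API; the class name BodyDecodingDemo and the parameter name are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BodyDecodingDemo {
    // Decode raw body bytes with the charset the receiver assumes.
    static String decode(byte[] body, Charset assumed) {
        return new String(body, assumed);
    }

    public static void main(String[] args) {
        // The client sends "name=é" as UTF-8 bytes but declares no charset.
        byte[] body = "name=\u00e9".getBytes(StandardCharsets.UTF_8);

        // With the ISO-8859-1 default, each of the two UTF-8 bytes of é
        // becomes its own character (U+00C3 followed by U+00A9).
        String withDefault = decode(body, StandardCharsets.ISO_8859_1);
        System.out.println(withDefault.equals("name=\u00c3\u00a9"));

        // With the matching charset, the original text is recovered.
        String withUtf8 = decode(body, StandardCharsets.UTF_8);
        System.out.println(withUtf8.equals("name=\u00e9"));
    }
}
```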

Some notes about the character encoding of a POST request:

...


HTTP Headers

Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.

How

...

How do I change how GET parameters are interpreted?

Tomcat will use ISO-8859-1 as the default character encoding of the entire URL, including the query string ("GET parameters") (though see Tomcat 8 notice below).

There are two ways to specify how GET parameters are interpreted:

Set the URIEncoding parameter on the Connector element in server.xml

...

In Tomcat 8, starting with 8.0.0 (8.0.0-RC3, to be specific), the default value of the URIEncoding attribute on the <Connector> element depends on the "strict servlet compliance" setting. With strict compliance off (the default), URIEncoding defaults to UTF-8. If "strict servlet compliance" is enabled, the default value is ISO-8859-1.

References: Tomcat 7 HTTP Connector, Tomcat 7 AJP Connector, Tomcat 8 HTTP Connector, Tomcat 8 AJP Connector
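As a sketch of the URIEncoding setting, a connector definition in conf/server.xml might look like the fragment below. The port, timeout and redirect values are the ones commonly found in a stock server.xml and are only placeholders here; URIEncoding is the attribute that matters.

```xml
<!-- conf/server.xml: HTTP connector that decodes the URL and query string as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```

Tomcat reads server.xml at startup, so the change takes effect after a restart.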

How do I change how POST parameters are interpreted?

POST requests should specify the encoding of the parameters and values they send. Since many clients fail to set an explicit encoding, the default (ISO-8859-1) is used. In many cases this is not the preferred interpretation, so one can employ a javax.servlet.Filter to set the request encoding. Writing such a filter is trivial.

Furthermore, Tomcat already comes with such an example filter. Please take a look at:

6.x, 7.x::
4.x::

webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java

5.5.36+, 6.0.36+, 7.0.20+, 8.x:: Since Tomcat 7.0.20, 6.0.36 and 5.5.36 the filter has been a first-class citizen: it was moved from the examples into core Tomcat and is available to any web application without the need to compile and bundle it separately. See the documentation for the list of filters provided by Tomcat. The class name is:

org.apache.catalina.filters.SetCharacterEncodingFilter

Note: The request encoding setting is effective only if it is applied before the parameters are parsed. Once parsing happens, there is no way back. Parameter parsing is triggered by the first method that asks for a parameter name or value. Make sure that the filter is positioned before any other filters that ask for request parameters. The positioning depends on the order of the filter-mapping declarations in the WEB-INF/web.xml file, though since the Servlet 3.0 specification there are additional options to control the order. To check the actual order you can throw an Exception from your page and check its stack trace for filter names.
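A minimal sketch of enabling the bundled filter in WEB-INF/web.xml is shown below, assuming a Tomcat version that ships org.apache.catalina.filters.SetCharacterEncodingFilter. The filter name and the /* mapping are illustrative choices. The filter-mapping is declared first so that it runs before any filter that reads request parameters.

```xml
<!-- WEB-INF/web.xml: set the request encoding before any parameter is read -->
<filter>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>

<!-- Keep this mapping ahead of the mappings of other filters. -->
<filter-mapping>
  <filter-name>setCharacterEncodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
```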

...

Using UTF-8 as your character encoding for everything is a safe bet. This should work for pretty much every situation.

In order to completely switch to using UTF-8, you need to make the following changes:

...

webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java

6.x::

webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java

...

How can I test if my configuration will work correctly?

The following sample JSP should work on a clean Tomcat install for any input. If you set URIEncoding="UTF-8" on the connector, it will also work with method="GET".

<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
     <title>Character encoding test page</title>
   </head>
   <body>
     <p>Data posted to this form was:
     <%
       request.setCharacterEncoding("UTF-8");
       out.print(request.getParameter("mydata"));
     %>

     </p>
     <form method="POST" action="index.jsp">
       <input type="text" name="mydata">
       <input type="submit" value="Submit" />
       <input type="reset" value="Reset" />
     </form>
   </body>
</html>

...

How can I send higher characters in my HTTP headers?

You have to encode them in some way before you insert them into a header. URL-encoding (a % followed by the two hexadecimal digits of each byte) would be a good idea.
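One possible way to do this, sketched with the JDK's URLEncoder and URLDecoder (the Charset overloads need Java 10 or later). The class name HeaderValueCodec and the sample file name are hypothetical, and both ends of the connection have to agree on the scheme and on UTF-8. Note that URLEncoder follows the HTML form rules, so a space becomes + rather than %20.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class HeaderValueCodec {
    // Turn arbitrary text into a US-ASCII-only value that is safe in a header.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Reverse the encoding on the receiving side.
    static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "résumé.pdf", written with escapes so the source file stays ASCII.
        String original = "r\u00e9sum\u00e9.pdf";

        String encoded = encode(original);
        System.out.println(encoded);                           // r%C3%A9sum%C3%A9.pdf
        System.out.println(decode(encoded).equals(original));  // true
    }
}
```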

Troubleshooting

I'm having a problem with character encoding in Tomcat 5

In Tomcat 5 there have been issues reported with respect to character encoding (usually of the form "request.setCharacterEncoding(String) doesn't work"). Odds are, it's not a bug. Before filing a bug report, see these bug reports as well as any bug reports linked to them:

  • 23929: http://issues.apache.org/bugzilla/show_bug.cgi?id=23929
  • 25360: http://issues.apache.org/bugzilla/show_bug.cgi?id=25360
  • 25231: http://issues.apache.org/bugzilla/show_bug.cgi?id=25231
  • 25235: http://issues.apache.org/bugzilla/show_bug.cgi?id=25235
  • 22666: http://issues.apache.org/bugzilla/show_bug.cgi?id=22666
  • 24557: http://issues.apache.org/bugzilla/show_bug.cgi?id=24557
  • 24345: http://issues.apache.org/bugzilla/show_bug.cgi?id=24345
  • 25848: http://issues.apache.org/bugzilla/show_bug.cgi?id=25848