Character Encoding Issues

Questions

  1. Why
    1. What is the default character encoding of the request or response body?
    2. Why does everything have to be this way?
  2. How
    1. How do I change how GET parameters are interpreted?
    2. How do I change how POST parameters are interpreted?
    3. What can you recommend to just make everything work? (How to use UTF-8 everywhere).
    4. How can I test if my configuration will work correctly?
    5. How can I send higher characters in HTTP headers?
  3. Troubleshooting
    1. I'm having a problem with character encoding in Tomcat 5

Answers

Why

What is the default character encoding of the request or response body?

...

References: HTTP 1.1 Specification, Section 3.7.1

Why does everything have to be this way?

...

Section 3.1 of the ARPA Internet Text Messages specification (RFC 822) states that headers are always encoded in US-ASCII. Anything outside that range must be encoded. See the section above regarding query strings in URIs.

...

How

How do I change how GET parameters are interpreted?

By default, Tomcat uses ISO-8859-1 as the character encoding of the entire URL, including the query string ("GET parameters").

There are two ways to specify how GET parameters are interpreted:

  1. Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
  2. Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.
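To see why the choice of URI encoding matters, the following minimal sketch decodes the same percent-encoded query value with both charsets. It uses only the standard java.net.URLDecoder (the Charset overload requires Java 10 or later) and is not Tomcat's own decoding code; the class name UriEncodingDemo and the sample value are made up for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UriEncodingDemo {

    // Decodes a percent-encoded query value with the given charset, in the
    // same way a server has to pick one charset to turn the bytes into text.
    public static String decode(String rawValue, Charset charset) {
        return URLDecoder.decode(rawValue, charset);
    }

    public static void main(String[] args) {
        // The word "cafe" with an accented e (U+00E9), as a browser would
        // typically send it from a UTF-8 page: the two UTF-8 bytes of
        // U+00E9 are percent-encoded as %C3%A9.
        String raw = "caf%C3%A9";

        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);

        // With ISO-8859-1 every byte becomes its own character, so the
        // accented letter turns into two unrelated characters.
        System.out.println("ISO-8859-1: " + asLatin1.length() + " characters");

        // With UTF-8 the two bytes combine into the single character U+00E9.
        System.out.println("UTF-8:      " + asUtf8.length() + " characters");
    }
}
```

The garbled two-character result in the ISO-8859-1 case is the typical symptom of a connector that decodes UTF-8 query strings with the default encoding.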

References: Tomcat 6 HTTP Connector, Tomcat 6 AJP Connector

...

How do I change how POST parameters are interpreted?

POST requests should specify the encoding of the parameters and values they send. Because many clients do not set an explicit encoding, the default (ISO-8859-1) is used. That is often not the intended interpretation, so you can use a javax.servlet.Filter to set the request encoding. Writing such a filter is straightforward, and Tomcat already ships with an example filter:

Tomcat 5.x:

webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java

Tomcat 6.x:

webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
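The decision such a filter makes can be sketched without the servlet API. The example below is an illustration only, not the code of the bundled SetCharacterEncodingFilter: the class RequestEncodingSketch and its methods are hypothetical. It takes the charset from the Content-Type header when the client supplied one and falls back to a configured default otherwise, then decodes a form value accordingly. A real filter would normally do the equivalent by calling request.setCharacterEncoding(...) when request.getCharacterEncoding() returns null.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class RequestEncodingSketch {

    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is absent or has no charset parameter.
    // Charset.forName throws if the client names an unknown charset;
    // a production filter would need to handle that case.
    public static Charset resolveCharset(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    // Decodes one url-encoded form value using the resolved charset.
    public static String decodeParameter(String rawValue, String contentType, Charset fallback) {
        return URLDecoder.decode(rawValue, resolveCharset(contentType, fallback));
    }

    public static void main(String[] args) {
        String raw = "caf%C3%A9"; // UTF-8 bytes of U+00E9, percent-encoded
        String noCharset = "application/x-www-form-urlencoded";

        // No charset from the client: the fallback decides the result.
        System.out.println(decodeParameter(raw, noCharset, StandardCharsets.ISO_8859_1).length());
        System.out.println(decodeParameter(raw, noCharset, StandardCharsets.UTF_8).length());

        // An explicit charset from the client takes precedence over the fallback.
        String explicit = noCharset + "; charset=UTF-8";
        System.out.println(decodeParameter(raw, explicit, StandardCharsets.ISO_8859_1).length());
    }
}
```

Letting an explicit client charset win keeps well-behaved clients working, while the fallback covers the common case of a missing charset.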

...

What can you recommend to just make everything work? (How to use UTF-8 everywhere).

Using UTF-8 as your character encoding for everything is a safe bet. It works in nearly every situation.

To switch completely to UTF-8, make the following changes:

  1. Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP Connector, AJP Connector.
  2. Use a character encoding filter with the default encoding set to UTF-8.
  3. Change all your JSPs to include the charset name in their contentType. For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for pages in XML syntax (also known as JSP Documents).
  4. Change all your servlets to set the content type of their responses and to name UTF-8 as the charset in it. Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8").
  5. Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate.
  6. Disable any valves or filters that may read request parameters before your character encoding filter or JSP page has a chance to set the encoding to UTF-8. For more information see http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html.
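The response-side steps above matter because ISO-8859-1 cannot represent most characters at all. The following sketch, using only the standard String and Charset classes, encodes a short text with both charsets. It assumes nothing about Tomcat itself; the class name ResponseCharsetDemo is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseCharsetDemo {

    // Encodes text to bytes as a response writer would for the given charset.
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // A euro sign (U+20AC) followed by the digit 5.
        String text = "\u20ac5";

        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);

        // ISO-8859-1 has no euro sign, so String.getBytes substitutes the
        // charset's replacement byte ('?') and the character is lost.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));

        // UTF-8 encodes the euro sign as three bytes and round-trips it.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```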

...

How can I test if my configuration will work correctly?

The following sample JSP should work on a clean Tomcat install for any input. If you set URIEncoding="UTF-8" on the connector, it will also work with method="GET".

<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
   <head>
     <title>Character encoding test page</title>
   </head>
   <body>
     <p>Data posted to this form was:
     <%
       // setCharacterEncoding must be called before any parameter is read.
       request.setCharacterEncoding("UTF-8");
       out.print(request.getParameter("mydata"));
     %>
     </p>
     <form method="POST" action="index.jsp">
       <input type="text" name="mydata">
       <input type="submit" value="Submit" />
       <input type="reset" value="Reset" />
     </form>
   </body>
</html>

...

How can I send higher characters in my HTTP headers?

You have to encode them in some way before you insert them into a header. URL-encoding (each byte written as % followed by its two hexadecimal digits, high digit first) is a good choice.
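As a minimal sketch of that approach, the standard java.net.URLEncoder and URLDecoder classes can turn a value into pure ASCII and back (the Charset overloads require Java 10 or later). The class name HeaderValueCodec and the sample value are hypothetical, and both sides of the exchange must agree on UTF-8 as the underlying charset.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class HeaderValueCodec {

    // Percent-encodes the UTF-8 bytes of the value so that the result
    // contains only US-ASCII characters. Note that URLEncoder follows
    // form-encoding rules and writes a space as '+'.
    public static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Reverses encode(): restores the original text on the receiving side.
    public static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A file name containing two accented letters (U+00E9) and a space.
        String original = "r\u00e9sum\u00e9 2024.pdf";

        String safe = encode(original);
        System.out.println(safe);

        // The receiver decodes the header value to recover the text.
        System.out.println(decode(safe).equals(original));
    }
}
```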

...

Troubleshooting

I'm having a problem with character encoding in Tomcat 5

...