Character Encoding Issues
Questions
- Why
- How
- What can you recommend to just make everything work? – How to use UTF-8 everywhere.
- Why does everything have to be this way?
- Troubleshooting
Answers
Why
...
References: HTTP 1.1 Specification, Section 3.7.1
...
Section 3.1 of the ARPA Internet Text Messages spec (RFC 822) states that headers are always in US-ASCII encoding. Anything outside of that range needs to be encoded. See the section above regarding query strings in URIs.
...
How
Tomcat will use ISO-8859-1 as the default character encoding of the entire URL, including the query string ("GET parameters").
There are two ways to specify how GET parameters are interpreted:
- Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
- Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.
References: Tomcat 6 HTTP Connector, Tomcat 6 AJP Connector
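To make the effect of the URI encoding concrete, here is a minimal sketch in plain Java. It does not use Tomcat's own decoding code; it only assumes that a browser sent the query value "café" percent-encoded as UTF-8, and uses the standard java.net.URLDecoder to show what the same bytes become under each charset. The class and method names are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UriDecodingDemo {

    // Decodes a percent-encoded value with the given charset name.
    // The checked exception is wrapped so callers need no throws clause.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, percent-encoded as the UTF-8 bytes C3 A9.
        String raw = "caf%C3%A9";

        // Interpreted as UTF-8, the two bytes form the single character U+00E9.
        String asUtf8 = decode(raw, "UTF-8");

        // Interpreted as ISO-8859-1, each byte becomes its own character
        // (U+00C3 and U+00A9), which is the familiar garbled result.
        String asLatin1 = decode(raw, "ISO-8859-1");

        System.out.println(asUtf8.length());   // 4 characters
        System.out.println(asLatin1.length()); // 5 characters
    }
}
```

If the client and the connector disagree in this way, the parameter is already garbled by the time the application reads it, which is why the setting lives on the <Connector> rather than in the webapp.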
...
POST requests should specify the encoding of the parameters and values they send. Since many clients fail to set an explicit encoding, the default (ISO-8859-1) is used. In many cases this is not the preferred interpretation, so one can employ a javax.servlet.Filter to set the request encoding. Writing such a filter is trivial, and Tomcat already ships with an example filter. Please take a look at:
5.x:
```
webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
```
6.x:
```
webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
```
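The core decision such a filter makes can be sketched in a few lines. This is not the bundled SetCharacterEncodingFilter; it is an illustration of one plausible policy (apply a default only when the client declared no charset). So that it compiles without the servlet API, RequestLike and FakeRequest are hypothetical stand-ins for the two javax.servlet.ServletRequest methods involved.

```java
public class EncodingFilterSketch {

    // Hypothetical stand-in for the ServletRequest methods a filter would call.
    interface RequestLike {
        String getCharacterEncoding();          // null when the client sent no charset
        void setCharacterEncoding(String enc);  // the real method throws UnsupportedEncodingException
    }

    // Applies the default encoding only if the request did not declare one.
    // In a real filter this would run inside doFilter(), before
    // chain.doFilter(...) and before anything reads a request parameter.
    static void applyDefaultEncoding(RequestLike request, String defaultEncoding) {
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(defaultEncoding);
        }
    }

    // Trivial in-memory request, used only to exercise the sketch.
    static final class FakeRequest implements RequestLike {
        private String encoding;

        FakeRequest(String encoding) {
            this.encoding = encoding;
        }

        public String getCharacterEncoding() {
            return encoding;
        }

        public void setCharacterEncoding(String enc) {
            this.encoding = enc;
        }
    }

    public static void main(String[] args) {
        RequestLike silent = new FakeRequest(null);
        applyDefaultEncoding(silent, "UTF-8");
        System.out.println(silent.getCharacterEncoding());   // default was applied

        RequestLike explicit = new FakeRequest("Shift_JIS");
        applyDefaultEncoding(explicit, "UTF-8");
        System.out.println(explicit.getCharacterEncoding()); // client's choice is kept
    }
}
```

The ordering matters more than the code: once any component has read a parameter, the request body has been parsed and setting the encoding afterwards has no effect.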
...
Using UTF-8 as your character encoding for everything is a safe bet. This should work for pretty much every situation.
In order to completely switch to using UTF-8, you need to make the following changes:
- Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP Connector, AJP Connector.
- Use a character encoding filter with the default encoding set to UTF-8.
- Change all your JSPs to include the charset name in their contentType. For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for pages in XML syntax (aka JSP Documents).
- Change all your servlets to set the content type for responses and to include the charset name (UTF-8) in the content type. Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8").
- Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate.
- Disable any valves or filters that may read request parameters before your character encoding filter or JSP page has a chance to set the encoding to UTF-8. For more information see http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html.
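The response-side items in the list above matter for the same reason as the request-side ones. As a rough illustration, assuming a response that falls back to ISO-8859-1 when no charset is declared, the following plain-Java sketch shows what happens to a character that ISO-8859-1 cannot represent. The class name and the choice of the euro sign are only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseCharsetDemo {

    // Encodes text to bytes the way a response writer would for the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // euro sign, not present in ISO-8859-1

        // UTF-8 represents the euro sign as three bytes: E2 82 AC.
        byte[] utf8 = encode(euro, StandardCharsets.UTF_8);

        // ISO-8859-1 has no mapping for it, so String.getBytes substitutes
        // the charset's replacement byte, which is '?'.
        byte[] latin1 = encode(euro, StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);      // 3
        System.out.println((char) latin1[0]); // ?
    }
}
```

A question mark in the rendered page where a non-Latin character should be is therefore a hint that the response was written with the wrong charset, not that the data was stored incorrectly.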
...
The following sample JSP should work on a clean Tomcat install for any input. If you set the URIEncoding="UTF-8" on the connector, it will also work with method="GET".
```jsp
<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Character encoding test page</title>
</head>
<body>
<p>Data posted to this form was:
<%
  request.setCharacterEncoding("UTF-8");
  out.print(request.getParameter("mydata"));
%>
</p>
<form method="POST" action="index.jsp">
<input type="text" name="mydata">
<input type="submit" value="Submit" />
<input type="reset" value="Reset" />
</form>
</body>
</html>
```
...
You have to encode them in some way before you insert them into a header. URL-encoding (a % followed by two hexadecimal digits for each byte: the high four bits, then the low four bits) would be a good idea.
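As a minimal sketch of that suggestion, the standard java.net.URLEncoder and java.net.URLDecoder can produce and reverse such an ASCII-only form. UTF-8 as the underlying byte encoding is an assumption here (both sides must agree on it), and the class and method names are invented for this example. Note that URLEncoder implements form encoding, so a space becomes "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class HeaderValueEncodingDemo {

    // Converts an arbitrary string to an ASCII-only, percent-encoded form.
    static String toHeaderSafe(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Reverses toHeaderSafe on the receiving side.
    static String fromHeaderSafe(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00E9";            // "caf" followed by U+00E9
        String safe = toHeaderSafe(original);     // caf%C3%A9, pure US-ASCII
        System.out.println(safe);
        System.out.println(fromHeaderSafe(safe).equals(original)); // true
    }
}
```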
...
Troubleshooting
...