Comment: Updated sections related to percent encoding charset of HTML form posts.

...

Everything covered in this page comes down to practical interpretation of a number of specifications. When working with Java servlets, the Java Servlet Specification is the primary reference, but the servlet spec itself relies on older specifications such as HTTP for its foundation. Here are a couple of references before we cover exactly where these items are located in them. A more detailed list can be found on the Specifications page.

  1. Java Servlet Specification 4.0
  2. Java Servlet Specification 2.5
  3. Java Servlet Specification 2.4
  4. HTTP 1.1 Protocol: Message Syntax and Routing; Semantics and Content; hyperlinked version
  5. URI Syntax
  6. ARPA Internet Text Messages
  7. HTML 4, HTML 5

Default encoding for request and response bodies

...

The character set for HTTP query strings (that's the technical term for 'GET parameters') can be found in sections 2 and 2.1 of the "URI Syntax" specification. The character set is defined to be US-ASCII. Any character that does not map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax specification says that characters outside of US-ASCII must be encoded using % escape sequences: each octet is encoded as a literal % followed by the two hexadecimal digits of its value. Thus, a (US-ASCII character code 97 = 0x61) is equivalent to %61. Although the URI specification does not mandate a default encoding for percent-encoded octets, it recommends UTF-8, especially for new URI schemes, and most modern user agents have settled on UTF-8 for percent-encoding URI characters.
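The escaping rule above can be tried out with the JDK's own classes. The following is a minimal sketch, not part of Tomcat: the class name PercentEncodingDemo and its helper methods are made up for illustration, and it assumes Java 10 or later for the Charset overloads of URLEncoder and URLDecoder. Note that URLEncoder implements the HTML form flavour of percent-encoding (a space becomes +), which is close to, but not identical to, generic URI escaping.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name and helper methods are illustrative only.
final class PercentEncodingDemo {

    // Percent-encodes a value: characters are converted to bytes in the
    // given charset, and each unsafe byte is written as %XX.
    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    // Decodes %XX escapes and interprets the resulting bytes in the given charset.
    static String decode(String value, Charset charset) {
        return URLDecoder.decode(value, charset);
    }

    public static void main(String[] args) {
        // %61 is the escape sequence for 'a' (0x61).
        System.out.println(decode("%61", StandardCharsets.US_ASCII));
        // U+00E9 (e with acute accent) is two octets in UTF-8 but one in ISO-8859-1,
        // so the same character yields different escape sequences.
        System.out.println(encode("\u00e9", StandardCharsets.UTF_8));
        System.out.println(encode("\u00e9", StandardCharsets.ISO_8859_1));
    }
}
```

The last two lines show why the receiver has to know which charset the sender used: the escape sequence alone does not say.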

Some notes about the character encoding of URIs:

  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably. Most of the web uses ISO-8859-1 as the default for query strings.
  2. Modern browsers encode URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note above regarding browser behavior for POST encoding).
  3. HTML 4.0 recommends the use of UTF-8 to encode the query string.
  4. When in doubt, use POST for any data you think might have problems surviving a trip through the query string.

Default Encoding for POST

Older versions of the HTTP/1.1 specification (e.g. RFC 2616, sections 3.4.1 and 3.7.1) defined ISO-8859-1 as the default charset for text-based HTTP request and response bodies if no charset is indicated. Although RFC 7231 removed this default, the servlet specification still follows the older rule (request encoding: section 4.9 for spec version 2.4, section 3.9 for spec version 2.5; response encoding: section 5.4 for both spec versions 2.4 and 2.5). Thus the servlet specification indicates that if a POST request does not indicate an encoding, it must be processed as ISO-8859-1, except for application/x-www-form-urlencoded, which by default should be interpreted as US-ASCII (as it by definition should contain only characters within the ASCII range to begin with).

Some notes about the character encoding of a POST request:

  1. RFC 2616 Section 3.4.1 stated that recipients of an HTTP message must respect the character encoding specified by the sender in the Content-Type header if the encoding is supported. A missing charset allows the recipient to "guess" what encoding is appropriate.
  2. Most web browsers today do not specify the character set of a request, even when it is something other than ISO-8859-1. This seems to be in violation of the HTTP specification. Most web browsers appear to send a request body using the encoding of the page used to generate the POST (for instance, the <form> element came from a page with a specific encoding... it is that encoding which is used to submit the POST data for that form).
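To make the selection rule concrete, here is a rough sketch of how a recipient might pick the charset for a request body from the Content-Type header, falling back to the defaults described in this section when no charset parameter is present. This is not how Tomcat implements it: RequestCharsetDemo and charsetFor are hypothetical names, and the header parsing is deliberately simplified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; the class and method names are illustrative only.
final class RequestCharsetDemo {

    // Picks the charset for a request body from its Content-Type header value.
    // Simplified parsing: parameters are split on ';', so quoted values that
    // themselves contain ';' are not handled.
    static Charset charsetFor(String contentType) {
        if (contentType == null) {
            return StandardCharsets.ISO_8859_1;
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    // Respect the sender's charset when this JVM supports it.
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // Malformed charset name: fall through to the defaults.
                }
            }
        }
        // No usable charset parameter: apply the defaults described above.
        if (mediaType.equals("application/x-www-form-urlencoded")) {
            return StandardCharsets.US_ASCII;
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetFor("text/plain; charset=UTF-8"));
        System.out.println(charsetFor("application/x-www-form-urlencoded"));
        System.out.println(charsetFor("text/plain"));
    }
}
```

Because most browsers omit the charset parameter, the fallback branch is the one that usually runs, which is why the default matters so much in practice.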

Percent Encoding for application/x-www-form-urlencoded

The HTML 4.01 specification indicated that percent-encoding of non-ASCII characters of application/x-www-form-urlencoded (the default content type for HTML form submissions) should be performed using US-ASCII byte sequences. However, HTML 5 changed this to use UTF-8 byte sequences, matching the modern percent-encoding for URLs. Modern browsers therefore percent-encode UTF-8 sequences when submitting forms using application/x-www-form-urlencoded.

The servlet specification, however, requires servlet containers to interpret percent-encoded sequences in application/x-www-form-urlencoded as ISO-8859-1, which in a default configuration will result in corrupted content because of the charset mismatch. See below for how this can be reconfigured in Tomcat.
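The corruption caused by this mismatch is easy to reproduce outside a servlet container. The sketch below only imitates the two sides with the JDK's URLEncoder and URLDecoder (Java 10 or later assumed); FormCharsetMismatchDemo and its methods are hypothetical names, and no container code is involved.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it imitates a browser and a container, nothing more.
final class FormCharsetMismatchDemo {

    // What an HTML 5 browser is expected to send: UTF-8 bytes, percent-encoded.
    static String browserEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // What a container does when it applies the ISO-8859-1 default.
    static String decodeAsLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    // What a container does once the request encoding is set to UTF-8.
    static String decodeAsUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "J\u00fcrgen" is "Jurgen" with u-umlaut (U+00FC).
        String sent = browserEncode("J\u00fcrgen");
        System.out.println(sent);
        // Each UTF-8 byte is read as a separate ISO-8859-1 character,
        // so the single u-umlaut comes back as two unrelated characters.
        System.out.println(decodeAsLatin1(sent));
        // Decoding with the charset the browser actually used restores the text.
        System.out.println(decodeAsUtf8(sent));
    }
}
```

The two-character garbage produced by the ISO-8859-1 path is the typical symptom to look for when diagnosing this problem in a real application.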

HTTP Headers

Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.

...

How do I change how POST parameters are interpreted?

POST requests should specify the encoding of the parameters and values they send. Since many clients fail to set an explicit encoding, the default is used (US-ASCII for application/x-www-form-urlencoded and ISO-8859-1 for all other content types). In many cases this is not the preferred interpretation.

In addition, the servlet specification requires that percent-encoded sequences of application/x-www-form-urlencoded be interpreted as ISO-8859-1 by default which, as explained above, does not match the HTML 5 specification and the modern user agent practice of using UTF-8 to percent-encode characters. Nevertheless, the servlet specification requires the servlet container's interpretation of percent-encoded sequences of application/x-www-form-urlencoded to follow any configured character encoding. Thus appropriate interpretation of application/x-www-form-urlencoded byte sequences can be achieved by setting the request character encoding to UTF-8.

The container-agnostic approach for specifying the request character encoding is to set the <request-character-encoding> element (available since Servlet 4.0) in the web application's web.xml file:

<request-character-encoding>UTF-8</request-character-encoding>

Note: If you are using the Eclipse integrated development environment, as of Eclipse Enterprise Java Developers 2019-03 M1 (4.11.0 M1) the IDE does not recognize the <request-character-encoding> setting and will temporarily freeze the IDE and generate errors with any edit of web application files. You can track the latest status of this problem at Eclipse Bug 543377.

Otherwise one can employ a javax.servlet.Filter to set request encodings. Writing such a filter is trivial.
Tomcat 6.x, 7.x:
Tomcat already comes with such an example filter. Please take a look at:

...

webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java

...

Tomcat 5.5.36+, 6.0.36+, 7.0.20+, 8.x:
Since Tomcat 7.0.20, 6.0.36 and 5.5.36 the filter has been a first-class citizen: it was moved from the examples into core Tomcat and is available to any web application without the need to compile and bundle it separately. Note that if the filter is declared in the web application's own web.xml file, the application cannot be deployed in non-Tomcat servlet containers that do not provide this filter. See the documentation for the list of filters provided by Tomcat. The class name is:

...

org.apache.catalina.filters.SetCharacterEncodingFilter

...


It is also possible to define such a filter in the Tomcat installation configuration file conf/web.xml, which would set the request character encoding across all web applications without the need for any web.xml modifications. In fact the latest Tomcat versions come with sections in web.xml that already configure a filter to set the request character encoding to UTF-8. Simply edit conf/web.xml and uncomment both the definition and the mapping of the filter named setCharacterEncodingFilter.

Note: The request encoding setting is effective only if it is applied before the parameters are parsed. Once parsing has happened, the encoding cannot be changed. Parameter parsing is triggered by the first method call that asks for a parameter name or value. Make sure that the filter is positioned before any other filters that ask for request parameters. The position depends on the order of the filter-mapping declarations in the WEB-INF/web.xml file, though since the Servlet 3.0 specification there are additional options to control the order. To check the actual order you can throw an Exception from your page and check its stack trace for filter names.

...