...

How do I change how GET parameters are interpreted?

Tomcat will use ISO-8859-1 as the default character encoding of the entire URL, including the query string ("GET parameters").

There are two ways to specify how GET parameters are interpreted:

...

  1. Java Servlet Specification 2.5 (http://jcp.org/aboutJava/communityprocess/mrel/jsr154/index2.html)
  2. Java Servlet Specification 2.4 (http://jcp.org/aboutJava/communityprocess/final/jsr154/index.html)
  3. HTTP 1.1 Protocol (http://www.w3.org/Protocols/rfc2616/rfc2616.txt; hyperlinked version: http://www.w3.org/Protocols/rfc2616/rfc2616.html)
  4. URI Syntax (http://www.ietf.org/rfc/rfc2396.txt)
  5. ARPA Internet Text Messages (http://www.w3.org/Protocols/rfc822/)

...

  1. HTML 4 (http://www.w3.org/TR/html4)

Default Encoding for GET

The character set for HTTP query strings (the technical term for "GET parameters") is defined in sections 2 and 2.1 of the "URI Syntax" specification. The character set is defined to be US-ASCII (http://en.wikipedia.org/wiki/ASCII). Any character that does not map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax specification says that characters outside of US-ASCII must be encoded using % escape sequences: each character is encoded as a literal % followed by the two hexadecimal digits of its character code. Thus, a (US-ASCII character code 0x61, decimal 97) is equivalent to %61.
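The escaping rule described above can be sketched in plain Java. This is an illustrative helper only, not Tomcat's implementation: the class and method names (PercentEscapes, escape, unescape) are hypothetical, and the "safe" set is deliberately conservative (letters, digits, and a few marks) rather than the full set of unreserved characters in the URI Syntax specification.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PercentEscapes {

    // Hypothetical helper: turn text into bytes using the given charset,
    // then write every byte outside a conservative safe set as %XX.
    static String escape(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF;
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '~';
            if (safe) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    // Hypothetical helper: reverse the escaping, then interpret the
    // resulting bytes using the given charset. Malformed escapes such as
    // "%zz" are not handled here (parseInt would throw).
    static String unescape(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '%' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c); // assumes unescaped characters are US-ASCII
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        System.out.println(unescape("%61", StandardCharsets.US_ASCII));
        System.out.println(escape("a b", StandardCharsets.US_ASCII));
        System.out.println(escape("\u00e9", StandardCharsets.ISO_8859_1));
        System.out.println(escape("\u00e9", StandardCharsets.UTF_8));
    }
}
```

Note that the escape sequence only describes bytes. Which character those bytes represent depends on the charset the sender and the recipient each assume, which is why the same non-ASCII character produces different escapes in the last two calls.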

...

  1. ISO-8859-1 and ASCII are compatible for character codes 0x20 to 0x7E, so they are often used interchangeably. Most of the web uses ISO-8859-1 as the default for query strings.
  2. Many browsers are starting to offer (default) options of encoding URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note above regarding browser behavior for POST encoding).
  3. HTML 4.0 (http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars) recommends the use of UTF-8 to encode the query string.
  4. When in doubt, use POST for any data you think might have problems surviving a trip through the query string.
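To illustrate why the notes above matter, here is a small sketch of what happens when a client encodes a query value as UTF-8 and the recipient decodes it as ISO-8859-1. It uses the standard java.net.URLEncoder and java.net.URLDecoder classes with their Charset overloads, which assumes Java 10 or later. The class and method names (QueryCharsetMismatch, roundTrip) are hypothetical and exist only for this example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QueryCharsetMismatch {

    // Encode a value with the charset the sender uses, then decode it
    // with the charset the recipient assumes.
    static String roundTrip(String value, Charset sent, Charset assumed) {
        String encoded = URLEncoder.encode(value, sent);
        return URLDecoder.decode(encoded, assumed);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Both sides agree: the value survives.
        System.out.println(roundTrip(value,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Sender uses UTF-8, recipient assumes ISO-8859-1: the two UTF-8
        // bytes of the accented letter are read as two separate characters.
        System.out.println(roundTrip(value,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));

        // Pure US-ASCII values are unaffected by the mismatch.
        System.out.println(roundTrip("cafe",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Values limited to US-ASCII come through unchanged under either assumption, which is consistent with the observation that the two encodings are often used interchangeably.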

Default Encoding for POST

ISO-8859-1 (http://en.wikipedia.org/wiki/Iso-8859-1) is defined as the default character set for HTTP request and response bodies in the servlet specification (request encoding: section 4.9 for spec version 2.4, section 3.9 for spec version 2.5; response encoding: section 5.4 for both spec versions 2.4 and 2.5). This default is historical: it comes from sections 3.4.1 and 3.7.1 of the HTTP/1.1 specification.

Some notes about the character encoding of a POST request:

  1. Section 3.4.1 of HTTP/1.1 states that recipients of an HTTP message must respect the character encoding specified by the sender in the Content-Type header if the encoding is supported. If the character set is missing, the recipient is allowed to "guess" what encoding is appropriate.
  2. Most web browsers today do not specify the character set of a request, even when it is something other than ISO-8859-1. This seems to be in violation of the HTTP specification. Most web browsers appear to send a request body using the encoding of the page used to generate the POST: if the <form> element came from a page with a specific encoding, that encoding is used to submit the POST data for that form.
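The rule in the first note can be sketched as a small helper that reads the charset parameter from a Content-Type header value and falls back to ISO-8859-1 when the parameter is missing or names an unsupported encoding. This is a simplified illustration, not how Tomcat parses the header: the class and method names (RequestCharset, charsetOf) are hypothetical, and the parsing ignores edge cases such as semicolons inside quoted parameter values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class RequestCharset {

    // Hypothetical helper: pick the charset for a request body from its
    // Content-Type header value, defaulting to ISO-8859-1.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; parameters follow it.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    if (name.length() >= 2
                            && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf(
                "application/x-www-form-urlencoded; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

Because most browsers omit the charset parameter, a helper like this would usually return the default, even when the body was actually sent in the encoding of the originating page.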

HTTP Headers

Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.
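As a rough illustration, the following sketch checks whether a header value can be represented in US-ASCII before it is sent. The class and method names (HeaderAscii, isUsAscii) are hypothetical; the check relies on the standard CharsetEncoder.canEncode method.

```java
import java.nio.charset.StandardCharsets;

public class HeaderAscii {

    // Returns true if every character in the value is representable
    // in US-ASCII, false if some character would need to be encoded.
    static boolean isUsAscii(String value) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        System.out.println(isUsAscii("text/html"));   // plain ASCII value
        System.out.println(isUsAscii("caf\u00e9"));   // contains a non-ASCII letter
    }
}
```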

...