.. _unicode:

=======
Unicode
=======

.. module:: werkzeug

Unicode support has been part of all default Python builds since the early
Python 2 days. It allows developers to write applications that deal with
non-ASCII characters in a straightforward way. Working with unicode does
require a basic understanding of the subject, though, especially when working
with libraries that do not support it.

Werkzeug uses unicode internally wherever text data is expected, even though
the HTTP standard itself is not unicode aware. All incoming data is decoded
from the specified charset (`utf-8` by default) so that you no longer operate
on bytestrings. Outgoing unicode data is then encoded into the target charset
again.

Unicode in Python
=================

In Python 2 there are two basic string types: `str` and `unicode`. `str` may
carry encoded unicode data, but it is always represented as bytes, whereas the
`unicode` type does not contain bytes but code points. What does this mean?
Imagine you have the German umlaut `ö`. You cannot represent that character
in ASCII, but you can in the `latin-1` and `utf-8` character sets. It looks
different in each of them when encoded:

>>> u'ö'.encode('latin1')
'\xf6'
>>> u'ö'.encode('utf-8')
'\xc3\xb6'

So an `ö` might look totally different depending on the encoding, which makes
it hard to work with. The solution is to use the `unicode` type (as we did
above; note the `u` prefix before the string). The unicode type does not
store the bytes for `ö` but the information that this is a
``LATIN SMALL LETTER O WITH DIAERESIS``.

``len(u'ö')`` always gives the expected "1", but ``len('ö')`` might give
different results depending on the encoding of ``'ö'``.

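The same distinction can be checked on Python 3, where `str` plays the role
of Python 2's `unicode` type and `bytes` the role of Python 2's `str`. The
following is a minimal sketch using only the standard library; the variable
names are illustrative:

```python
import unicodedata

# On Python 3 a plain string literal is already unicode text.
# The character is written as an escape to avoid source encoding issues.
text = "\u00f6"  # ö

latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

print(len(text))                # number of code points
print(len(latin1_bytes))        # number of bytes in latin-1
print(len(utf8_bytes))          # number of bytes in utf-8
print(unicodedata.name(text))   # the character's official unicode name
```

The length of the text stays the same regardless of how it is later encoded;
only the byte representations differ.
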
Unicode in HTTP
===============

The problem with unicode is that HTTP does not know what unicode is. HTTP
is limited to bytes, but this is not a big problem because Werkzeug
automatically decodes all incoming data and encodes all outgoing data for us.
This means that data sent from the browser to the web application is by
default decoded from a utf-8 bytestring into a `unicode` string. Data sent
from the application back to the browser that is not yet a bytestring is then
encoded back to utf-8.

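The decode-on-input, encode-on-output round trip can be sketched without
Werkzeug. This is only an illustration of the idea on Python 3;
`handle_request` and `CHARSET` are hypothetical names and not part of
Werkzeug's API:

```python
CHARSET = "utf-8"  # assumed default charset


def handle_request(raw_body):
    """Hypothetical handler: bytes in, bytes out, text in between."""
    # Incoming bytes are decoded once at the boundary ...
    name = raw_body.decode(CHARSET)
    # ... the application logic only ever sees unicode text ...
    greeting = "Hello " + name.upper()
    # ... and outgoing text is encoded again before it is sent.
    return greeting.encode(CHARSET)


response = handle_request(b"j\xc3\xb6rg")  # utf-8 bytes for "jörg"
print(response)
```

Upper-casing handles the `ö` correctly here because it operates on text.
Calling `upper()` on the raw utf-8 bytes would leave the two bytes of `ö`
untouched.
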
Usually this "just works" and we don't have to worry about it, but there are
situations where this behavior is problematic. For example, the Python 2 IO
layer is not unicode aware. This means that whenever you work with data from
the file system you have to decode it yourself. The correct way to load
a text file from the file system looks like this::

    f = open('/path/to/the_file.txt', 'rb')
    try:
        # read the raw bytes first, then decode them into a unicode string
        text = f.read().decode('utf-8')    # assuming the file is utf-8 encoded
    finally:
        f.close()

There is also the :mod:`codecs` module, which provides an `open` function that
decodes automatically from the given encoding.

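A minimal sketch of reading a file through `codecs.open` might look like
this. The temporary file and its contents are made up for the example, so
that the snippet is self-contained:

```python
import codecs
import os
import tempfile

# Create a small utf-8 encoded file to read back (illustrative data only).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"sch\u00f6n".encode("utf-8"))

try:
    # codecs.open decodes while reading, so read() returns unicode text
    # instead of a bytestring.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print(repr(text))
```

On Python 2.6 and later, `io.open` accepts the same `encoding` argument and
can be used in the same way.
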
Error Handling
==============

From Werkzeug 0.3 onwards you can further control the way Werkzeug works with
unicode. In the past Werkzeug silently ignored encoding errors on incoming
data. This decision was made to avoid internal server errors if the user
tampered with the submitted data. However, there are situations where you
want to abort with a `400 BAD REQUEST` instead of silently ignoring the error.

All the functions that do internal decoding now accept an `errors` keyword
argument that behaves like the `errors` parameter of the builtin string method
`decode`. The following values are possible:

`ignore`
    This is the default behavior and tells the codec to silently ignore
    characters that it doesn't understand.

`replace`
    The codec will replace unknown characters with a replacement character
    (`U+FFFD` ``REPLACEMENT CHARACTER``).

`strict`
    Raise an exception if decoding fails.

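Because these values behave like the `errors` parameter of the builtin
`decode` method, their effect can be tried out with plain Python. The sketch
below uses Python 3 `bytes` and made-up sample data; it does not call any
Werkzeug function:

```python
# latin-1 bytes for "schön"; the byte \xf6 is not valid utf-8
data = b"sch\xf6n"

# "ignore" drops the undecodable byte.
ignored = data.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for it.
replaced = data.decode("utf-8", "replace")

# "strict" raises an exception instead.
try:
    data.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```
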
Unlike regular Python decoding, Werkzeug does not raise a
:exc:`UnicodeDecodeError` if decoding fails but an
:exc:`~exceptions.HTTPUnicodeError`, which is a direct subclass of both
`UnicodeError` and the `BadRequest` HTTP exception. The reason is that if this
exception is not caught by the application but a catch-all for HTTP exceptions
exists, a default `400 BAD REQUEST` error page is displayed.

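Why the double inheritance is useful can be shown with stand-in classes. The
classes below only mimic the names mentioned above and are not Werkzeug's
real implementations; `decode_strict` and `handle` are hypothetical helpers
written for this sketch:

```python
class HTTPException(Exception):
    """Stand-in for a base HTTP exception (assumption, not Werkzeug's class)."""
    code = None


class BadRequest(HTTPException):
    code = 400


class HTTPUnicodeError(BadRequest, UnicodeError):
    """Both an HTTP error and a unicode error."""


def decode_strict(data, charset="utf-8"):
    try:
        return data.decode(charset, "strict")
    except UnicodeError as exc:
        # Re-raise as an exception that HTTP-level handlers understand.
        raise HTTPUnicodeError(str(exc))


def handle(data):
    """Hypothetical catch-all that turns HTTP exceptions into status codes."""
    try:
        return 200, decode_strict(data)
    except HTTPException as exc:
        return exc.code, None


print(handle(b"ok"))
print(handle(b"\xff"))
```

Code that only knows about `UnicodeError` can still catch the exception,
while a generic handler for HTTP exceptions turns it into a 400 response.
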
Werkzeug also provides an additional error handling mode called `fallback`,
which is an extension to the regular codec error handling. Often you want to
use utf-8 but also support latin1 as a legacy encoding if decoding fails. For
this case you can use the `fallback` error handling. For example, you can
specify ``'fallback:iso-8859-15'`` to tell Werkzeug to try `iso-8859-15` if
`utf-8` fails. If this decoding fails too (which should not happen for most
legacy charsets such as `iso-8859-15`), the error is silently ignored as if the
error handling was `ignore`.

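The fallback behavior can be approximated in a few lines. The function below
is a hypothetical sketch on Python 3, not Werkzeug's implementation. In
particular, it assumes that the final "ignore" step is applied while decoding
with the fallback charset:

```python
def decode_with_fallback(data, charset="utf-8", errors="fallback:iso-8859-15"):
    """Hypothetical helper illustrating the 'fallback:<charset>' idea."""
    if not errors.startswith("fallback:"):
        # Regular codec error handling: ignore, replace or strict.
        return data.decode(charset, errors)

    fallback_charset = errors[len("fallback:"):]
    try:
        # First try the primary charset strictly.
        return data.decode(charset, "strict")
    except UnicodeError:
        # Assumption: retry with the fallback charset and silently drop
        # anything that still cannot be decoded.
        return data.decode(fallback_charset, "ignore")


print(decode_with_fallback(b"sch\xc3\xb6n"))  # valid utf-8
print(decode_with_fallback(b"sch\xf6n"))      # legacy single-byte data
```

Both calls produce the same text, although the second input is not valid
utf-8 and only decodes through the fallback charset.
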
Further details are available as part of the API documentation of the concrete
implementations of the functions or classes working with unicode.

Request and Response Objects
============================

As request and response objects usually are the central entities of Werkzeug
powered applications, you can change the default encoding Werkzeug operates on
by subclassing these two classes. For example, you can easily set the
application to utf-7 and strict error handling::

    from werkzeug.wrappers import BaseRequest, BaseResponse

    class Request(BaseRequest):
        charset = 'utf-7'
        encoding_errors = 'strict'

    class Response(BaseResponse):
        charset = 'utf-7'

Keep in mind that error handling is only customizable for decoding, not for
encoding. If Werkzeug encounters an encoding error it will raise a
:exc:`UnicodeEncodeError`. It's your responsibility not to create data that
cannot be represented in the target charset (a non-issue with all unicode
encodings such as utf-8).

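The difference between a unicode encoding and a limited charset on the
encoding side can be seen with plain Python. This is a small Python 3 sketch
with made-up sample text and no Werkzeug involved:

```python
text = "price: 5 \u20ac"  # contains the euro sign

# utf-8 can represent the euro sign, so encoding succeeds.
utf8_bytes = text.encode("utf-8")

# latin-1 has no euro sign, so encoding raises UnicodeEncodeError.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(utf8_bytes, latin1_failed)
```
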
.. _filesystem-encoding:

The Filesystem
==============

.. versionchanged:: 0.11

Up until version 0.11, Werkzeug used Python's stdlib functionality to detect
the filesystem encoding. However, several bug reports against Werkzeug have
shown that the value of :py:func:`sys.getfilesystemencoding` cannot be
trusted on traditional UNIX systems. The usual problems come from
misconfigured systems, where ``LANG`` and similar environment variables are not
set. In such cases, Python would default to ASCII as the filesystem encoding, a
very conservative default that is usually wrong and causes more problems than
it avoids.

Therefore Werkzeug forces the filesystem encoding to ``UTF-8`` and issues a
warning whenever it detects that it is running under BSD or Linux and
:py:func:`sys.getfilesystemencoding` returns an ASCII encoding.

See also :py:mod:`werkzeug.filesystem`.

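The detection described in this section could be sketched roughly as follows.
This is not the code from `werkzeug.filesystem`; `pick_filesystem_encoding`
and its two parameters are hypothetical and only exist so that the logic can
be tried with different inputs:

```python
import codecs
import sys
import warnings


def pick_filesystem_encoding(reported=None, platform=None):
    """Hypothetical helper: distrust an ASCII result on BSD and Linux."""
    reported = reported or sys.getfilesystemencoding()
    platform = platform or sys.platform

    try:
        # Normalize aliases so that every spelling of ASCII is recognized.
        is_ascii = codecs.lookup(reported).name == "ascii"
    except LookupError:
        is_ascii = False

    on_bsd_or_linux = platform.startswith("linux") or "bsd" in platform
    if is_ascii and on_bsd_or_linux:
        warnings.warn(
            "Detected an ASCII filesystem encoding, which is probably a "
            "misconfiguration. Using utf-8 instead."
        )
        return "utf-8"
    return reported


print(pick_filesystem_encoding())
```

Passing explicit values, for example
`pick_filesystem_encoding("ascii", "linux")`, makes it possible to exercise
the override without changing the environment of the running process.
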