Reading a multibyte text file in Windows - how does it detect newlines? (Python 2)
I thought the caveat of the Unicode world was that you cannot correctly process a byte stream without knowing what its encoding is. If you assume an encoding, you might get valid, but incorrect, characters showing up.
Here's a test. A file containing the text:
hi1
hi2
is stored on disk in a 2-byte Unicode encoding. The Windows newline characters \r\n are stored as the 4-byte sequence 0d 00 0a 00.
I open it in Python 2 with default encodings, which I think means it's expecting ASCII at 1 byte per character (or just a stream of bytes), and it reads:
>>> open('d:/t/hi2.txt').readlines()
['\xff\xfeh\x00i\x001\x00\r\x00\n', '\x00h\x00i\x002\x00']
It's not decoding 2 bytes into 1 character, yet the 4-byte line ending sequence has been detected as 2 characters, and the file has been correctly split into 2 lines.
Presumably, then, Windows opened the file in 'text mode', as described here: Difference between files written in binary and text mode, and fed the lines to Python. But how did Windows know the file was multibyte encoded, and that it should look for four bytes of newlines, without being told, as per the caveat at the top of the question?
- Does Windows guess, using a heuristic, and can it therefore be wrong?
- Is there more cleverness in the design of Unicode, something that makes Windows newline patterns unambiguous across encodings?
- Is my understanding wrong, and is there a correct way to process any text file without being told the encoding beforehand?
The result in this case has nothing to do with Windows or the standard I/O implementation of Microsoft's C runtime. You'll see the same result if you test this in Python 2 on a Linux system. It's just how file.readlines (2.7.12 source) works in Python 2. See line 1717, p = (char *)memchr(buffer+nfilled, '\n', nread), and line 1749, line = PyString_FromStringAndSize(q, p-q). It naively consumes everything up to and including the \n byte, which is why the actual UTF-16LE \n\x00 sequence gets split up.
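To see the effect without a file on disk, the byte-level scan can be imitated in a few lines. This is only a sketch of the behavior described above, not the CPython implementation; naive_readlines is a hypothetical helper, and the code is written to run on both Python 2.7 and Python 3.

```python
import codecs

# The same bytes as the test file: UTF-16LE BOM followed by the encoded text.
data = codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le')

def naive_readlines(raw):
    """Split after every 0x0a byte, knowing nothing about the encoding."""
    lines = []
    start = 0
    while start < len(raw):
        pos = raw.find(b'\n', start)
        if pos == -1:
            # No further newline byte: the remainder is the last line.
            lines.append(raw[start:])
            break
        lines.append(raw[start:pos + 1])
        start = pos + 1
    return lines

lines = naive_readlines(data)
print(lines)
```

The first line ends at the 0x0a byte, so the 0x00 that belongs to the newline character ends up at the start of the second line, as in the question's output.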
If you had opened the file using Python 2's universal newlines mode, e.g. open('d:/t/hi2.txt', 'U'), the \r\x00 sequences would be naively translated to \n\x00. The result of readlines would instead be ['\xff\xfeh\x00i\x001\x00\n', '\x00\n', '\x00h\x00i\x002\x00'].
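The same kind of byte-level imitation shows where the stray '\x00\n' line comes from. This is a sketch that assumes universal newlines simply maps \r\n pairs and lone \r bytes to \n; it uses io.BytesIO, whose readlines splits on the \n byte only, in place of a real file.

```python
import codecs
import io

data = codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le')

# Byte-wise universal newlines: \r\n -> \n first, then any remaining \r -> \n.
# In UTF-16LE the \r byte is followed by \x00, not \n, so it counts as a lone \r.
translated = data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

# BytesIO splits lines after each \n byte, just like the naive scan.
lines = io.BytesIO(translated).readlines()
print(lines)
```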
Thus your initial supposition is correct. You need to know the encoding, or at least know to look for a Unicode BOM (byte order mark) at the start of the file, such as \xff\xfe, which indicates UTF-16LE (little endian). To that end I recommend using the io module in Python 2.7, since it properly handles newline translation. codecs.open, on the other hand, requires binary mode on the wrapped file and ignores universal newline mode:
>>> codecs.open('test.txt', 'U', encoding='utf-16').readlines()
[u'hi1\r\n', u'hi2']
io.open returns a TextIOWrapper, which has built-in support for universal newlines:
>>> io.open('test.txt', encoding='utf-16').readlines()
[u'hi1\n', u'hi2']
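Checking for a BOM before choosing an encoding could look like the following sketch. sniff_bom_encoding and the temporary test file are hypothetical and used for illustration only; the default of 'utf-8' is an assumption, since a file without a BOM still has to be described by the caller.

```python
import codecs
import io
import os
import tempfile

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM starts with \xff\xfe.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom_encoding(path, default='utf-8'):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with io.open(path, 'rb') as f:
        head = f.read(4)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return default

# Create a UTF-16LE test file with a BOM and a Windows newline.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le'))

encoding = sniff_bom_encoding(path)
# io.open decodes first and translates newlines afterwards.
with io.open(path, encoding=encoding) as f:
    lines = f.readlines()
os.remove(path)
print(lines)
```

The 'utf-16' and 'utf-32' codecs consume the BOM themselves and use it to pick the byte order, which is why the sketch does not return the endian-specific codec names.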
Regarding Microsoft's CRT, it defaults to ANSI text mode. Microsoft's ANSI codepages are supersets of ASCII, so the CRT's newline translation will work for files encoded with an ASCII-compatible encoding such as UTF-8. On the other hand, ANSI text mode doesn't work with a UTF-16 encoded file, i.e. it doesn't remove the UTF-16LE BOM (\xff\xfe) and doesn't translate newlines:
>>> open('test.txt').read()
'\xff\xfeh\x00i\x001\x00\r\x00\n\x00h\x00i\x002\x00'
Thus using standard I/O text mode for a UTF-16 encoded file requires the non-standard ccs flag, e.g. fopen("d:/t/hi2.txt", "rt, ccs=UNICODE"). Python doesn't support this Microsoft extension to the open mode, but it does make the CRT's low I/O (POSIX) _open and _read functions available in the os module. While it might surprise POSIX programmers, Microsoft's low I/O API also supports text mode, including Unicode. For example:
>>> O_WTEXT = 0x10000
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 100)
'h\x00i\x001\x00\n\x00h\x00i\x002\x00'
>>> os.close(fd)
The O_WTEXT constant isn't made directly available in Windows Python because it's not safe to open a file descriptor with this mode as a Python file using os.fdopen. The CRT expects all wide-character buffers to be a multiple of the size of wchar_t, i.e. a multiple of 2. Otherwise it invokes the invalid parameter handler, which kills the process. For example (using the cdb debugger):
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 7)
ntdll!NtTerminateProcess+0x14:
00007ff8`d9cd5664 c3              ret
0:000> k8
Child-SP          RetAddr           Call Site
00000000`005ef338 00007ff8`d646e219 ntdll!NtTerminateProcess+0x14
00000000`005ef340 00000000`62db5200 KERNELBASE!TerminateProcess+0x29
00000000`005ef370 00000000`62db52d4 MSVCR90!_invoke_watson+0x11c
00000000`005ef960 00000000`62db0cff MSVCR90!_invalid_parameter+0x70
00000000`005ef9a0 00000000`62db0e29 MSVCR90!_read_nolock+0x76b
00000000`005efa40 00000000`1e056e8a MSVCR90!_read+0x10d
00000000`005efaa0 00000000`1e0c3d49 python27!Py_Main+0x12a8a
00000000`005efae0 00000000`1e1146d4 python27!PyCFunction_Call+0x69
The same applies to the CRT's other Unicode text modes, _O_U8TEXT and _O_U16TEXT.
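If the low I/O text mode is used anyway, one way to respect the buffer-size requirement is to round every read size up to an even number. This is a sketch under assumptions: O_WTEXT is the value quoted above, even_size and read_wtext are hypothetical helpers, and the Windows-only branch has not been verified against every CRT version.

```python
import os
import sys

O_WTEXT = 0x10000  # value quoted above; not defined by the os module

def even_size(n):
    """Round a byte count up to a multiple of 2, the size of wchar_t on Windows."""
    return n + (n % 2)

def read_wtext(path, size):
    """Read up to `size` bytes (rounded up to even) in the CRT's Unicode text mode."""
    if sys.platform != 'win32':
        raise OSError('O_WTEXT is specific to the Microsoft CRT')
    fd = os.open(path, os.O_RDONLY | O_WTEXT)
    try:
        # An odd size here is what triggers the invalid parameter handler.
        return os.read(fd, even_size(size))
    finally:
        os.close(fd)
```

Decoding the file with io.open, as recommended earlier, avoids the problem entirely and remains the simpler choice.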