Reading a multibyte text file in Windows - how does it detect newlines? (Python 2)


I thought a caveat of the Unicode world was that you cannot correctly process a byte stream as writing without knowing what the encoding is. If you assume an encoding, you might get valid, but incorrect, characters showing up.

Here's a test: a file containing this writing:

hi1
hi2

It is stored on disk with a 2-byte Unicode encoding:

[Image: hex editor view of the file]

The Windows newline characters \r\n are stored as the 4-byte sequence 0D 00 0A 00. If I open it in Python 2 with the default encodings, which I think expects ASCII at 1 byte per character (or just a stream of bytes), it reads:

>>> open('d:/t/hi2.txt').readlines()
['\xff\xfeh\x00i\x001\x00\r\x00\n', '\x00h\x00i\x002\x00']

It's not decoding 2 bytes into 1 character, yet the 4-byte line-ending sequence has been detected as 2 characters, and the file has been correctly split into 2 lines.

Presumably, then, Windows opened the file in 'text mode', as described here: Difference between files written in binary and text mode

and fed the lines to Python. But how did Windows know the file was multibyte encoded, and to look for four bytes of newlines, without being told, per the caveat at the top of the question?

  • Does Windows guess, with a heuristic, and can it therefore be wrong?
  • Is there more cleverness in the design of Unicode than I realise, something that makes Windows newline patterns unambiguous across encodings?
  • Is my understanding wrong, and is there a correct way to process any text file without being told the encoding beforehand?

The result in this case has nothing to do with Windows or the standard I/O implementation of Microsoft's C runtime. You'll see the same result if you test this in Python 2 on a Linux system. It's just how file.readlines (2.7.12 source) works in Python 2. See line 1717, p = (char *)memchr(buffer+nfilled, '\n', nread), and then line 1749, line = PyString_FromStringAndSize(q, p-q). It naively consumes bytes up to and including a \n byte, which is why the actual UTF-16LE \n\x00 sequence gets split up.
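The byte-oriented splitting can be imitated in a few lines to see the effect without a file on disk. This is a minimal sketch, not the CPython implementation: naive_readlines is a hypothetical helper, and the sample bytes are built in memory to match the file in the question.

```python
import codecs

def naive_readlines(data):
    """Split a byte string after every 0x0A byte, ignoring the encoding."""
    lines = []
    start = 0
    while start < len(data):
        end = data.find(b'\n', start)
        if end == -1:
            # No more 0x0A bytes: the remainder is the last line.
            lines.append(data[start:])
            break
        lines.append(data[start:end + 1])
        start = end + 1
    return lines

# BOM followed by "hi1\r\nhi2" encoded as UTF-16LE, as in the question's file.
raw = codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le')
lines = naive_readlines(raw)
```

The first element ends with the 0x0A byte, and the 0x00 that belongs to the same UTF-16 code unit becomes the first byte of the second element.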

If you had opened the file using Python 2's universal newlines mode, e.g. open('d:/t/hi2.txt', 'U'), the \r\x00 sequences would naively be translated to \n\x00. The result of readlines would instead be ['\xff\xfeh\x00i\x001\x00\n', '\x00\n', '\x00h\x00i\x002\x00'].
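The same experiment can be repeated for universal newlines. The sketch below assumes the translation amounts to turning \r\n and lone \r bytes into \n before splitting; translate_universal is a hypothetical name used only for illustration.

```python
import codecs

def translate_universal(data):
    """Byte-level newline translation: CR LF and lone CR both become LF."""
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

raw = codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le')
translated = translate_universal(raw)
# After translation only 0x0A bytes remain as line terminators.
lines = translated.splitlines(True)
```

Because the \r and \n bytes are separated by a 0x00 byte, they are treated as two independent line endings, which produces the extra '\x00\n' element.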

Thus your initial supposition is correct. You need to know the encoding, or at least know to look for a Unicode BOM (byte order mark) at the start of the file, such as \xff\xfe, which indicates UTF-16LE (little endian). To that end I recommend using the io module in Python 2.7, since it properly handles newline translation. codecs.open, on the other hand, requires binary mode on the wrapped file and ignores universal newline mode:

>>> codecs.open('test.txt', 'U', encoding='utf-16').readlines()
[u'hi1\r\n', u'hi2']

io.open returns a TextIOWrapper that has built-in support for universal newlines:

>>> io.open('test.txt', encoding='utf-16').readlines()
[u'hi1\n', u'hi2']
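Putting the BOM check and io.open together might look like the sketch below. The sniff_bom helper and the temporary file are assumptions made for illustration. The check is only as good as the BOM: a file without one returns the default, and a UTF-16LE file whose first character is U+0000 would be mistaken for UTF-32.

```python
import codecs
import io
import os
import tempfile

# Longest BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom(path, default=None):
    """Return a codec name chosen from the file's BOM, or default if none."""
    with io.open(path, 'rb') as f:
        head = f.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default

# Recreate the question's file in a temporary location.
fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le'))
    encoding = sniff_bom(path)
    # Default newline handling translates \r\n to \n after decoding.
    with io.open(path, encoding=encoding) as f:
        translated = f.readlines()
    # newline='' still splits on line endings but leaves them untouched.
    with io.open(path, encoding=encoding, newline='') as f:
        untranslated = f.readlines()
finally:
    os.remove(path)
```

Passing newline='' gives the same lines as the codecs.open call above, which makes it easy to compare the two behaviours from a single API.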

Regarding Microsoft's CRT, it defaults to ANSI text mode. Microsoft's ANSI codepages are supersets of ASCII, so the CRT's newline translation will work for files that are encoded with an ASCII-compatible encoding such as UTF-8. On the other hand, ANSI text mode doesn't work with a UTF-16 encoded file, i.e. it doesn't remove the UTF-16LE BOM (\xff\xfe) and doesn't translate newlines:

>>> open('test.txt').read()
'\xff\xfeh\x00i\x001\x00\r\x00\n\x00h\x00i\x002\x00'
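The difference between the two encodings can be reproduced with a rough model of an ANSI text-mode read. This is a simplified sketch, assuming the only relevant step is collapsing adjacent CR LF bytes; crt_ansi_text_read is a hypothetical name, and other text-mode details such as Ctrl-Z handling are ignored.

```python
import codecs

def crt_ansi_text_read(data):
    """Rough model of an ANSI text-mode read: adjacent CR LF bytes become LF."""
    return data.replace(b'\r\n', b'\n')

utf8_bytes = u'hi1\r\nhi2'.encode('utf-8')
utf16_bytes = codecs.BOM_UTF16_LE + u'hi1\r\nhi2'.encode('utf-16-le')

utf8_result = crt_ansi_text_read(utf8_bytes)
utf16_result = crt_ansi_text_read(utf16_bytes)
```

In UTF-8 the CR and LF bytes are adjacent, so they are translated. In UTF-16LE a 0x00 byte sits between them, so the data passes through unchanged, BOM included.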

Thus using standard I/O text mode for a UTF-16 encoded file requires the non-standard ccs flag, e.g. fopen("d:/t/hi2.txt", "rt, ccs=UNICODE"). Python doesn't support this Microsoft extension to the open mode, but it does make the CRT's low I/O (POSIX) _open and _read functions available in the os module. While it might surprise POSIX programmers, Microsoft's low I/O API also supports text mode, including Unicode. For example:

>>> O_WTEXT = 0x10000
>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 100)
'h\x00i\x001\x00\n\x00h\x00i\x002\x00'
>>> os.close(fd)

The O_WTEXT constant isn't made directly available in Windows Python because it's not safe to open a file descriptor with this mode as a Python file using os.fdopen. The CRT expects all wide-character buffers to be a multiple of the size of a wchar_t, i.e. a multiple of 2. Otherwise it invokes the invalid parameter handler, which kills the process. For example (using the cdb debugger):

>>> fd = os.open('test.txt', os.O_RDONLY | O_WTEXT)
>>> os.read(fd, 7)
ntdll!NtTerminateProcess+0x14:
00007ff8`d9cd5664 c3              ret
0:000> k8
Child-SP          RetAddr           Call Site
00000000`005ef338 00007ff8`d646e219 ntdll!NtTerminateProcess+0x14
00000000`005ef340 00000000`62db5200 KERNELBASE!TerminateProcess+0x29
00000000`005ef370 00000000`62db52d4 MSVCR90!_invoke_watson+0x11c
00000000`005ef960 00000000`62db0cff MSVCR90!_invalid_parameter+0x70
00000000`005ef9a0 00000000`62db0e29 MSVCR90!_read_nolock+0x76b
00000000`005efa40 00000000`1e056e8a MSVCR90!_read+0x10d
00000000`005efaa0 00000000`1e0c3d49 python27!Py_Main+0x12a8a
00000000`005efae0 00000000`1e1146d4 python27!PyCFunction_Call+0x69

The same applies to the CRT's UTF-8 and UTF-16 text modes, _O_U8TEXT and _O_U16TEXT.
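If you do experiment with these modes through os.read, one possible guard against the odd-size crash is to round every request to a multiple of 2 before it reaches the CRT. The helper below is a hypothetical sketch of that idea only. It does not set any CRT mode flag itself, and it has not been verified against every CRT version.

```python
import os

def read_even(fd, size):
    """Call os.read with size rounded down to a multiple of 2 (minimum 2).

    Note that a request for 1 byte is rounded up to 2, so the caller may
    receive one byte more than it asked for.
    """
    even_size = max(2, size - (size % 2))
    return os.read(fd, even_size)
```

A call such as read_even(fd, 7) issues a 6-byte read, which keeps the buffer size a whole number of 2-byte units.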

