Regex unicode in Python 2.x vs 3.x
I have a simple function for tokenizing words.
import re

def tokenize(string):
    return re.split("(\W+)(?<!')", string, re.UNICODE)
In Python 2.7 it behaves like this:
In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']
And in Python 3.5.0 like this:
In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
The problem is that 'é' should not be treated as a character to tokenize on. I thought re.UNICODE was enough to make \w work the way I mean.
How do I get the same behaviour as Python 3.x in Python 2.x?
You'll want to use unicode strings, and note that the third parameter of split is not flags, it's maxsplit:
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of
    all groups in the pattern are also returned as part of the
    resulting list.  If maxsplit is nonzero, at most maxsplit
    splits occur, and the remainder of the string is returned
    as the final element of the list.
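As an aside of mine (not part of the original answer), this shows why the original call silently misbehaves: re flags are plain integers, so a flag passed positionally is consumed as maxsplit and never reaches the regex engine:

>>> import re
>>> int(re.UNICODE)   # flags are just integers
32
>>> # so the question's call was effectively
>>> # re.split("(\W+)(?<!')", string, maxsplit=32)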
Example:
#!coding:utf8
from __future__ import print_function
import re

def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))
Output:
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
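A related note (my own aside, not from the original answer): in Python 3 the re module applies Unicode matching to str patterns by default, so the flag is redundant there; passing re.ASCII brings back the 2.x-style ASCII-only \w and \W:

>>> import re
>>> re.split(r"(\W+)(?<!')", 'perché.')                  # Unicode matching is the default for str
['perché', '.', '']
>>> re.split(r"(\W+)(?<!')", 'perché.', flags=re.ASCII)  # opt back in to ASCII-only classes
['perch', 'é.', '']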