Regex Unicode in Python 2.x vs 3.x


I have a simple function for tokenizing words.

import re

def tokenize(string):
    return re.split("(\w+)(?<!')", string, re.UNICODE)

In Python 2.7 it behaves like this:

In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']

In Python 3.5.0 it behaves like this:

In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']

The problem is that 'é' should not be treated as a tokenizing character. I thought re.UNICODE was enough to make \w work in the way I mean?
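To make the byte/character difference concrete, here is a minimal sketch (my addition, assuming a UTF-8 source file under Python 2) of what the interpreter actually sees:

# -*- coding: utf-8 -*-
# In Python 2, 'perché' is a byte string: é is the two bytes
# \xc3\xa9, and the ASCII-only \w stops matching at them.
print(repr('perché'))   # 'perch\xc3\xa9' -- seven bytes
print(repr(u'perché'))  # u'perch\xe9'    -- six code points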

How do I get the same behaviour as Python 3.x in Python 2.x?

You'll want to use Unicode strings, and the third parameter of split is not flags, but maxsplit:

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
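As a quick sanity check (my sketch, not part of the original answer): re.UNICODE is only an integer constant, so passing it positionally hands it to maxsplit and no flag is applied at all:

import re

# re.UNICODE is just an integer, so in the third positional slot
# it is read as maxsplit, not as a flag.
print(int(re.UNICODE))  # 32
# The original call was therefore equivalent to:
# re.split("(\w+)(?<!')", string, 32)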

Example:

#!coding:utf8
from __future__ import print_function
import re

def tokenize(string):
    return re.split(r"(\w+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))

Output:

C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
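A variation on the same fix (a sketch of mine, using only the standard library): pre-compiling the pattern attaches the flag at compile time, so the compiled pattern's split() has no flags slot left to confuse with maxsplit, and the code runs identically on Python 2 and 3:

# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Compile once with the flag; pattern.split() only takes
# maxsplit, so the positional-argument pitfall disappears.
TOKEN_RE = re.compile(r"(\w+)(?<!')", re.UNICODE)

def tokenize(string):
    return TOKEN_RE.split(string)

print(tokenize(u'perché.'))  # [u'perch\xe9', u'.', u''] under Python 2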
