Process non-ASCII characters such as the pound sign in Python
This question already has an answer here:
- Decoding UTF-8 strings in Python (2 answers)
I want to process a sentence such as: "The gift costs £100"
The sentence is in a text file. When I read it in Python and print it, I get:
print "text",text
text The gift costs Â£100.
I tried to replace the character with a code (and, when I finish processing, use the function unmapstrangechars to get the original data back):
def mapstrangechars(text):
    text = text.replace("£","1pound1 ")
    return text

def unmapstrangechars(text):
    text = text.replace("1pound1 ","£")
    return text
But I get an error saying that £ is not an ASCII character. How can I fix it?
It would be helpful to learn at least how I could replace a non-ASCII character with a specific string and later recover the character. Example: original: The gift costs £100. copy1: The gift costs 11pound11 100. output: The gift costs $100.
The output is actually:
print text
The whole code (the text file says "The gift costs £100."):
if 1==1:
    import os
    script_dir = os.path.dirname(os.path.realpath(__file__))
    rel_path = "results/article.txt"
    abs_file_path = os.path.join(script_dir, rel_path)
    thefile = open(abs_file_path)
    text = thefile.read()
    print "text",text

    def mapstrangechars(text):
        #text = text.replace("fdfdsfds","1pound1 ")
        return text

    def unmapstrangechars(text):
        #text = text.replace("1pound1 ","fdfdsfds")
        return text

    text = mapstrangechars(text)
    #process text
    text = unmapstrangechars(text)
    print "text",text #this is the output
It's because the encoding of the text file is UTF-8, but your terminal/IDE is using the Windows-1252 encoding.
In UTF-8, the pound sign is encoded as 2 bytes: 0xc2 0xa3
If you opened the file in a hex editor, that is what you'd see.
When you printed it, your terminal/IDE interpreted 0xc2 0xa3 as Windows-1252. Like other 8-bit code pages, Windows-1252 expects each byte to map to exactly one character. Therefore, when 0xc2 0xa3 is interpreted as Windows-1252, each byte is mapped to its own character instead, and the following happens:
0xc2 displays as Â
0xa3 displays as £
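The byte-level explanation above can be reproduced in a few lines. This is only a sketch: the variable names are made up for illustration, and the pound sign is written as the escape `\u00a3` so the snippet itself stays pure ASCII.

```python
# The pound sign as a unicode string (escape keeps this source file ASCII-only).
pound = u"\u00a3"

# Encoding to UTF-8 yields the two bytes a hex editor would show.
utf8_bytes = pound.encode("utf-8")
hex_bytes = ["0x%02x" % b for b in bytearray(utf8_bytes)]
print(hex_bytes)

# Decoding those same bytes as Windows-1252 maps each byte to its own
# character, which is where the stray extra character comes from.
misread = utf8_bytes.decode("cp1252")
print(len(misread))

# Decoding with the codec the file was actually written in recovers one character.
correct = utf8_bytes.decode("utf-8")
print(len(correct))
```

The misread string has two characters (U+00C2 followed by U+00A3), while the correctly decoded string has one.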
The solution is to decode the text file into a special Python string type called a "unicode string". Once you have a Python unicode string, Python is able to re-encode it for your terminal type, i.e. Python will decode from UTF-8 and then encode to Windows-1252.
To achieve this, use the io module's open() function and pass in the encoding argument:
import io
thefile = io.open(abs_file_path, encoding="utf-8")
When you read() from thefile, you will get a <type 'unicode'> object. It functions just like a regular string. When you pass it to print, Python will automatically encode it so that it displays correctly on your terminal.
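As a rough, self-contained illustration of the difference between reading raw bytes and reading decoded text, the sketch below writes the sentence to a temporary file first. The temporary directory and file name are stand-ins for the asker's results/article.txt, not part of the original code.

```python
import io
import os
import shutil
import tempfile

# Hypothetical stand-in for the asker's results/article.txt.
tmp_dir = tempfile.mkdtemp()
abs_file_path = os.path.join(tmp_dir, "article.txt")

# Store the sentence as UTF-8 bytes, as the original text file is assumed to be.
with io.open(abs_file_path, "wb") as f:
    f.write(u"The gift costs \u00a3100.".encode("utf-8"))

# Binary read: undecoded bytes, where the pound sign occupies two bytes.
with io.open(abs_file_path, "rb") as f:
    raw = f.read()

# Text read with an explicit encoding: a unicode string, one character per symbol.
with io.open(abs_file_path, encoding="utf-8") as thefile:
    text = thefile.read()

print(len(raw), len(text))

# Clean up the temporary directory.
shutil.rmtree(tmp_dir)
```

The byte string is one element longer than the decoded text because the pound sign takes two bytes in UTF-8 but is a single character once decoded.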
You no longer need mapstrangechars() and unmapstrangechars().
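If a placeholder round trip is still wanted for some other reason, it could look roughly like the sketch below once the text is a unicode string. The constant names POUND and PLACEHOLDER are invented here. Writing the pound sign as `\u00a3` avoids putting a literal non-ASCII character in the source, which is the usual trigger for the "Non-ASCII character" error in Python 2 when the file has no encoding declaration.

```python
# Hypothetical constants; the escape keeps the source file pure ASCII.
POUND = u"\u00a3"
PLACEHOLDER = u"1pound1 "


def mapstrangechars(text):
    """Swap the pound sign for an ASCII placeholder."""
    return text.replace(POUND, PLACEHOLDER)


def unmapstrangechars(text):
    """Restore the pound sign from the placeholder."""
    return text.replace(PLACEHOLDER, POUND)


original = u"The gift costs \u00a3100."
mapped = mapstrangechars(original)
# ... process the ASCII-only text here ...
restored = unmapstrangechars(mapped)

print(mapped)
```

This only round-trips cleanly if the placeholder never occurs in the real text and the processing step leaves it untouched.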
Note: This is particular to Python 2.x, where the built-in open() does no decoding and returns raw byte strings. Python 3 opens files in text mode by default and will use your locale/language settings to determine the encoding if one is not given.