Process non-ASCII characters such as the pound sign in Python

- June 15, 2013

This question already has an answer here:

Decoding UTF-8 strings in Python (2 answers)

I want to process a sentence such as: "The gift costs £100"

The sentence is in a text file. When I read it in Python and print it, I get:

    print "text",text
    text The gift costs Â£100.

I tried to replace the character in code (and when I finish processing, I use the function unmapstrangechars to get the original data back):

    def mapstrangechars(text):
        text = text.replace("Â£","1pound1 ")
        return text

    def unmapstrangechars(text):
        text = text.replace("1pound1 ","Â£")
        return text

But I get an error saying that Â£ is not an ASCII character. How can I fix it?

It would be helpful to learn at least how I could replace a non-ASCII character with a specific string and later recover the original character. For example:

    original: The gift costs Â£100.
    copy1: The gift costs 11pound11 100.
    output: The gift costs $100.

The output is actually:

print text

The whole code (the text file says "The gift costs £100."):

    if 1==1:

        import os
        script_dir = os.path.dirname(os.path.realpath(__file__))
        rel_path = "results/article.txt"
        abs_file_path = os.path.join(script_dir, rel_path)

        thefile = open(abs_file_path)
        text = thefile.read()

        print "text",text

        def mapstrangechars(text):
            #text = text.replace("fdfdsfds","1pound1 ")
            return text

        def unmapstrangechars(text):
            #text = text.replace("1pound1 ","fdfdsfds")
            return text

        text = mapstrangechars(text)

        #process text

        text = unmapstrangechars(text)

        print "text",text  # this is the output

It's because the encoding of the text file is UTF-8, but your terminal/IDE is using the Windows-1252 encoding.

In UTF-8, the pound sign is encoded as 2 bytes: 0xC2 0xA3. If you opened the file in a hex editor, that is what you'd see.
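The two-byte encoding can be checked without a hex editor. The following is a minimal sketch that runs on both Python 2 and Python 3; the variable names are illustrative only, and the pound sign is written as an escape so the source file itself stays pure ASCII.

```python
# The pound sign (U+00A3), written as an escape to keep this source ASCII-only.
pound = u"\u00a3"

# Encode the single character to UTF-8 and inspect the resulting bytes.
utf8_bytes = pound.encode("utf-8")

# bytearray yields integers on both Python 2 and Python 3.
hex_pairs = ["0x%02x" % b for b in bytearray(utf8_bytes)]

print(len(utf8_bytes))
print(hex_pairs)
```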

When you printed it, your terminal/IDE interpreted 0xC2 0xA3 as Windows-1252. Like other 8-bit code pages, Windows-1252 expects each byte to map to exactly one character. Therefore, when 0xC2 0xA3 is interpreted as Windows-1252, each byte is mapped to its own character instead, and the following happens:

0xC2 displays as Â
0xA3 displays as £
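The same misreading can be reproduced in code by decoding identical bytes with the two different codecs. This is only a sketch of the mechanism, with illustrative variable names; "cp1252" is Python's codec name for Windows-1252.

```python
# The raw bytes as they sit in the UTF-8 text file.
raw = b"\xc2\xa3100"

# Decoded correctly: the two bytes 0xC2 0xA3 become one character.
as_utf8 = raw.decode("utf-8")

# Decoded as Windows-1252: every byte becomes its own character,
# so 0xC2 turns into A-circumflex and 0xA3 into the pound sign.
as_cp1252 = raw.decode("cp1252")

print(repr(as_utf8))
print(repr(as_cp1252))
```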

The solution is to decode the text file into a special Python string type called a "unicode string". Once you have a Python unicode string, Python is able to re-encode it for your terminal type, i.e., Python will decode from UTF-8 and then encode to Windows-1252.
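The decode-then-encode round trip can be written out by hand to see what Python does on your behalf when printing. This is a sketch under the assumption that the terminal uses Windows-1252; the variable names are illustrative.

```python
# Bytes as read from the UTF-8 file.
raw = b"The gift costs \xc2\xa3100."

# Step 1: decode the UTF-8 bytes into a unicode string.
text = raw.decode("utf-8")

# Step 2: encode the unicode string for a Windows-1252 terminal (assumed here).
for_terminal = text.encode("cp1252")

# In Windows-1252 the pound sign is the single byte 0xA3.
print(repr(for_terminal))
```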

To achieve this, use the io module's open() function and pass in the encoding argument:

    import io
    thefile = io.open(abs_file_path, encoding="utf-8")

When you read() from thefile, you will get a <type 'unicode'>. It will function like a regular string. When you pass it to print, Python will automatically encode it so that it displays correctly on your terminal.
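A self-contained way to try this is sketched below. It writes the UTF-8 bytes to a temporary file (a hypothetical stand-in for abs_file_path) and reads them back with io.open(), which behaves the same on Python 2 and Python 3. The repr is printed rather than the text itself so the example does not depend on the terminal's encoding.

```python
import io
import os
import shutil
import tempfile

# Create a throwaway file holding the UTF-8 bytes of the sentence.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "article.txt")  # stand-in for abs_file_path
with io.open(path, "wb") as f:
    f.write(b"The gift costs \xc2\xa3100.")

# Open in text mode with an explicit encoding; read() returns a unicode string.
with io.open(path, encoding="utf-8") as thefile:
    text = thefile.read()

print(type(text))
print(repr(text))

# Remove the temporary directory again.
shutil.rmtree(tmp_dir)
```

The pound sign now counts as one character rather than two bytes, so len(text) and slicing behave as expected.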

You no longer need mapstrangechars() and unmapstrangechars().
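If a placeholder step is still wanted for some other reason, as the question asks, the replacement works directly once the text is a unicode string. The sketch below reuses the "1pound1 " placeholder from the question; the function and constant names are hypothetical, and it assumes the placeholder never occurs in the real text.

```python
# Hypothetical helpers operating on unicode strings, not on raw bytes.
POUND = u"\u00a3"          # the pound sign, as an ASCII-safe escape
PLACEHOLDER = u"1pound1 "  # placeholder taken from the question


def map_strange_chars(text):
    """Swap the pound sign for an ASCII placeholder."""
    return text.replace(POUND, PLACEHOLDER)


def unmap_strange_chars(text):
    """Restore the pound sign from the placeholder."""
    return text.replace(PLACEHOLDER, POUND)


original = u"The gift costs \u00a3100."
mapped = map_strange_chars(original)
restored = unmap_strange_chars(mapped)

print(repr(mapped))
print(restored == original)
```

In Python 2, writing the pound sign literally in the source instead of as an escape would additionally require an encoding declaration such as `# -*- coding: utf-8 -*-` at the top of the file.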

Note: this is particular to Python 2.x, where open() returns undecoded byte strings (str) by default. Python 3 opens files in text mode by default and uses your locale/language settings to determine the encoding if one is not given.
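To see which encoding the locale settings would select, the standard locale module can be queried. This is a small sketch; the value printed depends entirely on the machine it runs on, so no particular output is assumed.

```python
import locale

# The encoding derived from the current locale settings. Passing False
# avoids changing the process-wide locale as a side effect.
default_encoding = locale.getpreferredencoding(False)

print(default_encoding)
```

Because this default differs between machines, passing encoding="utf-8" explicitly, as in the answer above, is the more predictable choice when the file's encoding is known.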


Jal
