Forum Archive

cvp

May 02, 2016 - 17:50

I have Dropbox file names containing accented characters, like é or è.
When I ask Dropbox to generate a URL, it copies a link as ...../%C3%A9, which is normally é in UTF-8.
If I want to get the file name in my script, using

url = urllib.unquote(url.decode('utf-8'))

gives Ã©.
To get "my" é, I need to encode('latin1')
Is that normal?
We say in French 'perdre son latin', for 'lost in translation' 🤕

dgelessus

May 02, 2016 - 17:57

I think what you need to do is urllib.unquote(url).decode("utf-8"). The %C3%A9 is é encoded in UTF-8, so you first need to convert the escaped UTF-8 bytes to normal bytes, then decode that to a Unicode string.

Under Python 3, urllib.parse.unquote (the replacement for urllib.unquote) does the decoding automatically, so there you can just write urllib.parse.unquote(url) and you get a proper Unicode é.
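
For illustration, here is a minimal Python 3 sketch of the two steps described above (escaped bytes to raw bytes, then raw bytes to text). It assumes only the standard library's urllib.parse module. The variable names are made up for the example. It also shows where the Ã© from the question probably comes from: the same two bytes read as Latin-1 instead of UTF-8.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%C3%A9"

# Step 1: turn the percent-escapes back into raw bytes.
raw = unquote_to_bytes(escaped)       # two bytes: 0xC3 0xA9

# Step 2: decode those bytes as UTF-8 to get the actual character.
text = raw.decode("utf-8")            # 'é'

# Reading the same bytes as Latin-1 gives one character per byte,
# which is the garbled result from the original question.
mojibake = raw.decode("latin-1")      # 'Ã©'

# unquote() performs both steps in one call and assumes UTF-8 by default.
combined = unquote(escaped)

print(text, mojibake, combined)
```

Encoding the garbled string back to Latin-1 recovers the original bytes, which would explain why encode('latin1') appeared to "fix" things in the question.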

omz

May 02, 2016 - 18:00

I think this would be the correct way to decode the URL:

url = unicode(urllib.unquote(url), 'utf-8')

or alternatively (but more confusing):

url = urllib.unquote(url).decode('utf-8')

Edit: Looks like @dgelessus was faster than me...

cvp

May 02, 2016 - 18:06

Thanks, champions! My code had a misplaced right parenthesis, which gave the wrong result.
One more time, shame on me.

cvp

May 02, 2016 - 18:18

Sorry, but I still have problems with that.

Try this short script:
if the URL is passed in by appex, it is NOT OK;
if the URL is set as text for testing, it's OK.

# coding: utf-8
import urllib
import appex

#url = 'https://www.dropbox.com/s/5mmxh7h7vu2lwnp/La%20vie%20tr%C3%A8s%20priv%C3%A9e%20de%20Monsieur%20Sim.png?dl=0'
url = appex.get_url()
print url
print urllib.unquote(url).decode('utf-8')

omz

May 02, 2016 - 18:22

appex.get_url() returns a unicode string, so you need an extra encode there...

import urllib

# This is a unicode string literal (note the 'u' before the quotes), to simulate the behavior of appex.get_url():
url = u'https://www.dropbox.com/s/5mmxh7h7vu2lwnp/La%20vie%20tr%C3%A8s%20priv%C3%A9e%20de%20Monsieur%20Sim.png'
print urllib.unquote(url.encode('utf-8')).decode('utf-8')

And no, you're not the only one who finds this very confusing. ;)

cvp

May 02, 2016 - 18:25

My god! (Not you, but almost)

omz

May 02, 2016 - 18:28

The good news is, this kind of stuff is generally a bit easier in Python 3 because pretty much every string is unicode there, and urllib.parse.unquote (the Python 3 equivalent of urllib.unquote) can handle unicode, so it would be just urllib.parse.unquote(url) in Python 3, regardless of whether url was defined as a normal string literal, or returned by appex.get_url.
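
A rough Python 3 sketch of that, using the Dropbox link from earlier in the thread as a plain string literal. In a share-sheet script the string would presumably come from appex.get_url() instead. Splitting off the file name with urlsplit and posixpath.basename is an addition for the example, not something from the posts above.

```python
import posixpath
from urllib.parse import unquote, urlsplit

# Stand-in for the value returned by appex.get_url() (assumption: a str in Python 3).
url = ('https://www.dropbox.com/s/5mmxh7h7vu2lwnp/'
       'La%20vie%20tr%C3%A8s%20priv%C3%A9e%20de%20Monsieur%20Sim.png?dl=0')

# Keep only the path part, so the "?dl=0" query string is dropped.
path = urlsplit(url).path

# Take the last path segment, then undo the percent-escaping (UTF-8 by default).
file_name = unquote(posixpath.basename(path))

print(file_name)  # La vie très privée de Monsieur Sim.png
```

No encode or decode call is needed here, because the input and the output are both ordinary Python 3 strings.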

dgelessus

May 02, 2016 - 18:32

Well this is confusing. Though here the issue looks like it's with urllib.unquote - it seems to be designed for str strings and gets confused with unicode strings. In Python 3 it's a lot better (as always) - there the string is decoded as UTF-8 by default, and you can set a different encoding if necessary.
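
A small sketch of the encoding option mentioned above, assuming Python 3's urllib.parse.unquote with its encoding keyword. The %E9 input is a made-up example of a link that was escaped as Latin-1 rather than UTF-8.

```python
from urllib.parse import unquote

# UTF-8 is the default, so the Dropbox-style escape decodes directly.
utf8_result = unquote("%C3%A9")                       # 'é'

# A Latin-1 escaped é is the single byte 0xE9; name the encoding explicitly.
latin1_result = unquote("%E9", encoding="latin-1")    # 'é'

# Without the keyword, 0xE9 is not valid UTF-8. The default error handler
# ('replace') substitutes U+FFFD instead of raising an exception.
wrong_guess = unquote("%E9")

print(utf8_result, latin1_result, repr(wrong_guess))
```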

cvp

May 02, 2016 - 18:32

I'm really still a beginner in Python and, of course, I'll buy the next version, but I hope you'll give some explanation of how to convert my scripts for this version when it becomes available.

dgelessus

May 02, 2016 - 18:35

@cvp There is the 2to3 tool, which can do most of the dumb work for you (e.g. putting parentheses around your print calls). I'm not sure how well it corrects the bytes/str/unicode mess that you need in Python 2. Probably not very well, as it's hard to guess whether an encode or decode is actually necessary or just a compatibility hack.
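
One way to keep the encode/decode hack in a single place while converting scripts is a small wrapper like the sketch below. The helper name unquote_text is hypothetical, and this is only one possible approach. The Python 2 branch repeats the encode/unquote/decode sequence from earlier in the thread; only the Python 3 branch is exercised in this example.

```python
import sys

if sys.version_info[0] >= 3:
    from urllib.parse import unquote as _unquote

    def unquote_text(url):
        """Return the unescaped URL as text (Python 3: str in, str out)."""
        return _unquote(url)
else:
    from urllib import unquote as _unquote

    def unquote_text(url):
        """Return the unescaped URL as a unicode string (Python 2)."""
        # Python 2's unquote is meant for byte strings, so encode unicode first.
        if isinstance(url, unicode):  # noqa: F821 (name only exists in Python 2)
            url = url.encode('utf-8')
        return _unquote(url).decode('utf-8')

print(unquote_text('La%20vie%20tr%C3%A8s%20priv%C3%A9e'))
```

The rest of a script can then call unquote_text(url) without caring which Python version it runs on.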

cvp

May 02, 2016 - 18:38

OK, I'll try to remember that when I use Python 3, thanks.