TomD

Mar 25, 2019 - 09:27

I downloaded a webpage successfully, and its content looks correct. But when I print slices of it, the characters are different.

Specifically, when I print the first character it shows the first plus the next 3 characters and an apostrophe on the end. So one character becomes 5 characters.

On printing longer slices of the webpage the number of characters is also greater and the apostrophe is always added on the end.

What is happening?

Tom

cvp

Mar 25, 2019 - 10:39

@TomD could you post your code here?

cvp

Mar 25, 2019 - 13:09

@TomD If you download with something like

data = requests.get(url).content

data is bytes and when you print it, you convert it to string as b'xxx'
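A minimal sketch of why one byte prints as five characters. It uses a hard-coded `sample` value (made up for illustration, not the real download) so it runs without a network connection:

```python
# Hypothetical stand-in for the downloaded page body (not the real response).
sample = b'\r\n\r\n<!DOCTYPE html>'

# print() on a bytes object shows its repr: a b prefix, quotes and escapes.
shown = str(sample[0:1])
print(shown)               # b'\r'
print(len(shown))          # 5 characters are displayed: b ' \ r '
print(len(sample[0:1]))    # 1 byte is actually in the slice
```

The extra characters are only part of how Python displays bytes; they are not in the data itself.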

TomD

Mar 25, 2019 - 13:10

The "with" statement overflows into the next line due to this narrow comment box

import urllib.request
with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
    tda = response.read()

Print the entire html string so I know what is in it

The output of this print statement starts:

b'\r\n\r\n\r\n<!DOCTYPE

print (tda)

Separately print the first 5 characters in the html string

The output of this is, including spaces between items:

b'\r' b'\n' b'\r' b'\n' b'\r'

print (tda[0:1],tda[1:2],tda[2:3],tda[3:4],tda[4:5])

Print the first 5 characters in the string

The output of this is:

b'\r\n\r\n\r'

print(tda[0:5])
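One related detail, sketched with a made-up `sample` value rather than the real page: slicing a bytes object gives another bytes object, while indexing a single position gives an integer (the byte value).

```python
# Hypothetical stand-in for the first bytes of the downloaded page.
sample = b'\r\n\r\n\r\n<!DOCTYPE html>'

first_slice = sample[0:1]          # slicing bytes returns bytes
first_index = sample[0]            # indexing bytes returns an int
first_five = list(sample[0:5])     # the five byte values as integers

print(first_slice)   # b'\r'
print(first_index)   # 13
print(first_five)    # [13, 10, 13, 10, 13]
```

Here 13 is carriage return (\r) and 10 is line feed (\n), so the five-byte slice really does contain five items.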

TomD

Mar 25, 2019 - 13:14

CVP, so no printed string slice takes the html one character at a time. It combines them into groups and adds apostrophes.

cvp

Mar 25, 2019 - 13:18

@TomD You can see the string is shown between b' '
Characters written with \ are escape sequences for non-printable characters, e.g. \n = new line
So b'\n' is only one character, "new line"
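A small sketch to check this, comparing an escape sequence with a raw literal (the variable names are only for illustration):

```python
escaped = b'\n'     # escape sequence: a single line-feed byte
raw = rb'\n'        # raw literal: a backslash byte followed by the letter n

print(len(escaped))   # 1
print(len(raw))       # 2
print(escaped[0])     # 10, the byte value of line feed
```

So the backslash seen in the printed output is display notation, not an extra byte in the data.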

TomD

Mar 25, 2019 - 13:23

Thanks CVP. That has me onto something.
I am data scraping. Maybe better off using a package like beautifulsoup?

cvp

Mar 25, 2019 - 13:24

@TomD try this

st = tda.decode('utf8')
print(st)

And you will see that there are empty lines at the beginning, which are the \n

TomD

Mar 25, 2019 - 13:31

It doesn't like
print (st)

cvp

Mar 25, 2019 - 13:32

@TomD try this script

import urllib.request
with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
    tda=response.read()
st = tda.decode('utf8')
print(st)
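The same decode step can be tried offline. This is a sketch on a made-up `raw` value, assuming the page is UTF-8 encoded:

```python
# Hypothetical stand-in for response.read(); the real page will differ.
raw = b'\r\n\r\n\r\n<!DOCTYPE html>\r\n<html></html>'

text = raw.decode('utf-8')     # bytes -> str
cleaned = text.lstrip()        # drop the leading blank lines

print(type(text).__name__)     # str
print(repr(text[0:1]))         # '\r' (no b prefix any more)
print(cleaned[:15])            # <!DOCTYPE html>
```

Once decoded, slices are ordinary strings and print without the b'...' wrapper.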

TomD

Mar 25, 2019 - 13:41

I see, so I could work with that decoded string more easily

cvp

Mar 25, 2019 - 13:42

@TomD st contains a string, thus yes, good luck

TomD

Mar 25, 2019 - 13:44

Much appreciated. You have helped me around an obstacle.

mikael

Mar 25, 2019 - 18:16

@TomD, I definitely recommend using BeautifulSoup or a webview with JavaScript, the latter especially if you are trying to scrape pages with dynamic content.
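To show the idea of parsing rather than slicing, here is a rough sketch. BeautifulSoup is a third-party package, so this uses the standard library's html.parser as a stand-in. The `LinkCollector` class and the `page` string are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Made-up sample HTML; a real page would be the decoded download.
page = '<html><body><a href="/a.pdf">A</a> <a href="/b.pdf">B</a></body></html>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)   # ['/a.pdf', '/b.pdf']
```

A parser works on the tag structure, so it does not depend on where in the string a given character happens to sit.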

Forum Archive

Webpage Slices are Different from what is There
