Forum Archive

Webpage Slices are Different from what is There

TomD

I downloaded a webpage successfully and its content looks correct. But when I print slices of it, the characters are different.

Specifically, when I print the first character it shows the first plus the next 3 characters and an apostrophe on the end. So one character becomes 5 characters.

On printing longer slices of the webpage the number of characters is also greater and the apostrophe is always added on the end.

What is happening?

Tom

cvp

@TomD could you post your code here?

cvp

@TomD If you download with something like

data = requests.get(url).content

data is bytes, and when you print it, it is converted to a string of the form b'xxx'
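
A minimal sketch of that effect, using a hard-coded sample instead of a real download (the name `data` and its contents are made up for illustration). It shows that the extra characters belong only to the printed form, not to the slice itself:

```python
# Sample bytes standing in for a downloaded page (hypothetical content).
data = b"\r\n\r\n<!DOCTYPE html>"

# print() on a bytes object displays its repr: a b prefix plus quotes.
shown = repr(data[0:1])
print(shown)

# The slice is still a single byte; the b, the quotes and the backslash
# exist only in the displayed text.
print(len(data[0:1]), len(shown))
```

Here a one-byte slice is displayed with five characters, which matches the "one character becomes 5 characters" observation.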

TomD

The "with" statement overflows into the next line due to this narrow comment box

import urllib.request
tda=str
with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
    tda=response.read()

Print the entire html string so I know what is in it

The output of this print statement starts:

b'\r\n\r\n\r\n<!DOCTYPE

print (tda)

Separately print the first 5 characters in the html string

The output of this is, including spaces between items:

b'\r' b'\n' b'\r' b'\n' b'\r'

print (tda[0:1],tda[1:2],tda[2:3],tda[3:4],tda[4:5])

Print the first 5 characters in the string

The output of this is:

b'\r\n\r\n\r'

print(tda[0:5])

TomD

CVP, so no printed string slice takes the html one character at a time. It combines them into groups and adds apostrophes.

cvp

@TomD You can see that the content is shown between b' and '
Sequences starting with \ stand for non-printable characters, e.g. \n = new line
Thus b'\n' is only one character, "new line"
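
As a small illustration (the variable name `newline` is only for this example), the escape sequence is written with two characters but stored as one byte:

```python
newline = b"\n"

# Written as backslash + n in source, but it is a single byte.
print(len(newline))

# Indexing a bytes object gives the integer value of that byte.
print(newline[0])

# Slicing gives a bytes object of length one, printed as b'\n'.
print(newline[0:1])
```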

TomD

Thanks CVP. That has me onto something.
I am data scraping. Maybe better off using a package like beautifulsoup?

cvp

@TomD try this

st = tda.decode('utf8')
print(st)

And you will see that there are empty lines at the beginning, which come from the \r\n characters

TomD

It doesn't like
print (st)

cvp

@TomD try this script

import urllib.request
with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
    tda=response.read()
st = tda.decode('utf8')
print(st)

TomD

I see, so I could work on that decoded string more easily

cvp

@TomD st contains a string, thus yes, good luck
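
A short sketch of working with the decoded value, assuming the page is UTF-8 and using a hard-coded sample in place of the real download (the sample content is made up):

```python
# Sample bytes standing in for response.read() (hypothetical content).
tda = b"\r\n\r\n<!DOCTYPE html>"

# Decoding turns the bytes into an ordinary str.
st = tda.decode("utf-8")
print(type(st).__name__)

# Slices of a str are themselves str, so no b prefix appears.
print(repr(st[0:1]))

# strip() removes the leading \r\n pairs before further processing.
print(st.strip()[:9])
```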

TomD

Much appreciated. You have helped me around an obstacle

mikael

@TomD, I definitely recommend using BeautifulSoup or a webview with JavaScript. The latter especially if you are trying to scrape pages with dynamic content.
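
To give a rough idea of what parser-based scraping looks like, here is a sketch that uses the standard library's html.parser as a dependency-free stand-in; it is not BeautifulSoup, which offers a friendlier interface for the same job. The class name `LinkCollector` and the sample page are invented for illustration and do not reflect the real ASX page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content standing in for a real download.
sample = b'\r\n<html><body><a href="/one.pdf">One</a> <a href="/two.pdf">Two</a></body></html>'

collector = LinkCollector()
collector.feed(sample.decode("utf-8"))  # the parser expects str, not bytes
print(collector.links)
```

A parser of this kind only sees the HTML that was downloaded, so content that the page builds with JavaScript after loading would not show up in it.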