Forum Archive

Is advanced web scraping possible?

ihf

I would like to scrape some data from a website that requires that I log in and post data. However, it is done in several steps on different web pages. Specifically:

  1. The login page for the username has this input:
  2. This must be followed by a button press on that page which looks like this:
  3. Now logged in, I must post a value for a query:
  4. This is followed by a button press:
  5. Finally I need to get back the html result so that I can parse it.

Any help would be greatly appreciated.

JonB

@ihf you need to look at the form's action property. It will either be a JavaScript call, which you will have to look at in order to replicate its action, or a URL, in which case you can use requests to GET or POST the form data. (Use a requests.Session to carry over authentication and cookies.) You may need to use BeautifulSoup to find any hidden form inputs, which is sometimes how websites handle authentication tokens or other page differentiation.
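
For illustration, a rough sketch of that pattern with requests and BeautifulSoup might look like the following (the URL and field names here are made up; check the developer tools for the real form action and input names):

```
import requests
from bs4 import BeautifulSoup

# Placeholder login URL; every site is different
LOGIN_URL = 'https://example.com/login'

sess = requests.Session()          # carries cookies between requests

# Fetch the login page and read the form's action attribute
page = sess.get(LOGIN_URL)
soup = BeautifulSoup(page.content, 'html.parser')
form = soup.find('form')
action = requests.compat.urljoin(LOGIN_URL, form.get('action', LOGIN_URL))

# Post the credentials to the form's action; the session keeps the cookies
resp = sess.post(action, data={'username': 'YOURUSERNAME',
                               'password': 'YOURPASSWORD'})
```

The same Session object can then be reused for any follow-up query posts.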

I'd recommend doing this on a PC first, using your browser's developer tools (F12 on Edge or Chrome), then transferring to Pythonista. Much easier to debug.

I use this approach to automatically check my library account for books ready to renew, and to send an alert when I reach the renewal limit. That involves basically one login page, two or three "clicks", then programmatically "checking" some checkboxes to select books, submitting another form, and then parsing the response to check for failures (like books on hold).

mikael

@ihf, I just posted an example on this thread. Funnily enough, the code was developed for library book renewal, just like @JonB's. I suspect they are using the same library software everywhere.

mikael

Those JS functions have turned out to be pretty convenient, although in need of more documentation, but there is a lot to be said for the ease of development on a PC.

As there seems to be some interest in scraping, I think I will add a feature where you can click on elements and get the XPath for it.

ihf

As to the action property, I am assuming that, at least for the first login page that accepts the username, it is as shown below. But I've only used requests for really simple things, so this may be beyond my reach at this point. In any case, thanks for your replies.

```
...
```

P.S. I am trying to scrape the name and address from a callsign query on qrz.com.
There is a Python module that will use an API for basic data, but more data is available by logging in.

mikael

@ihf, this looks potentially undoable with requests due to the JavaScript involved, but trivial with the WebView-based code I shared in the other thread.

ihf

@mikael where do I find the module inheritable? Actually, it is imported but I don't see that it is used.

mikael

@ihf, just remove the import. I removed that dependency for the express purpose of making this dependency-free, and then forgot to remove the import. (Fix committed to GitHub as well.)

ihf

@mikael Thanks. Jswrapper looks as if it could be quite useful, though I remain unsure how to use it for qrz.com, or whether requests could submit the username, password, and search term without it.

mikael

@ihf, yes, I have in many cases been able to submit login credentials with requests. Often you need to find some special magic string included on the form, and send it along with your login post request. Again, with the WebView approach you do not have to do that, or worry about cookies etc. Also, many sites these days have lots of dynamic JS content, which requests cannot help you with at all.
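
As a sketch of what that can look like (the token field name here is invented; what the "magic string" is called varies by site):

```
import requests
from bs4 import BeautifulSoup

sess = requests.Session()
login_page = sess.get('https://example.com/login')   # placeholder URL
soup = BeautifulSoup(login_page.content, 'html.parser')

# Suppose the form carries a hidden field such as 'csrf_token'
token = soup.find('input', {'name': 'csrf_token'})['value']

# Send the "magic string" back along with the credentials
sess.post('https://example.com/login', data={
    'username': 'YOURUSERNAME',
    'password': 'YOURPASSWORD',
    'csrf_token': token,
})
```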

mikael

@ihf, after thinking long and hard about it, I have decided to accept your challenge. :-) What is it that you want to get out of that site?

ihf

@mikael I merely want to do a search by callsign, which is on the upper left side of the main web page. However, unless you can log in, the result is very limited. If you click on login you will be taken to qrz.com/login, where you enter a username and press the Next button, which takes you to a similar page to enter your password, with a button to click to log in. It is then that the search can be done, which will result in a more detailed response. If you could help get me through the login (even without proper credentials), I could probably take it from there.

ihf

If it would help, I can send logs from the browser inspector tool (not sure how but surely there is a way).

ihf

Here are the inputs and buttons that will need to be programmed:

Next

Sign In

A page comes up briefly which says wait … and then it goes back to the main page (not sure if this complicates matters).

The search is then done:

It is the resulting page that I need to capture.

JonB

So, I used the Microsoft Edge dev tools (F12) to look at the page source and request headers for each step.
This actually looks fairly simple to handle with standard techniques.

In the initial qrz.com/login response, there is some dynamic code. If you parse that, you will find "loginTicket", which you'll need to store:

```
if (step == 1) {
    jQuery.ajax({
        dataType: "json",
        url: '/login-handshake',
        data: {'loginTicket': TOKENSTRING, 'username': jQuery('.login-container #username').val(), 'step': 1},
        method: 'post',
```

(here I've replaced the actual string).
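
For example, something along these lines could pull the ticket out of the page source (untested; the exact regex depends on how the value appears in the generated code):

```
import re
import requests

sess = requests.Session()
page = sess.get('https://www.qrz.com/login')

# Guess: the ticket shows up as 'loginTicket': 'SOMESTRING' in the page;
# adjust the pattern after checking the real source in the dev tools.
match = re.search(r"'loginTicket'\s*:\s*'([^']+)'", page.text)
ticket = match.group(1) if match else None
```
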
Typing the username and pressing submit results in a request to /login-handshake, which contains these fields:
loginTicket: TOKENSTRING
step: 1
username: YOURUSERNAME

Next, typing the password results in another post to /login-handshake, containing:
loginTicket: TOKENSTRING
password: YOURPASSWORD
step: 2
username: YOURUSERNAME

If the handshake works, there is a final post to /login:

    '2fcode': ''  (empty string)
    'flush': 1
    'login_ref': 'https%3A%2F%2Fwww.qrz.com%2F'
    'password': YOURPASSWORD
    'target': '%2F'
    'username': YOURUSERNAME

After that, you get a cookie, which you keep using.
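
Pieced together, replicating the handshake would look roughly like this (untested; the ticket regex is a guess at how the value appears in the source, and YOURUSERNAME / YOURPASSWORD are the same placeholders as below):

```
import re
import requests

sess = requests.Session()

# Fetch the login page and pull out the loginTicket (pattern is a guess)
page = sess.get('https://www.qrz.com/login')
ticket = re.search(r"'loginTicket'\s*:\s*'([^']+)'", page.text).group(1)

# Step 1: send the username
sess.post('https://www.qrz.com/login-handshake',
          data={'loginTicket': ticket, 'username': YOURUSERNAME, 'step': 1})

# Step 2: send the password
sess.post('https://www.qrz.com/login-handshake',
          data={'loginTicket': ticket, 'username': YOURUSERNAME,
                'password': YOURPASSWORD, 'step': 2})

# Final post to /login with the same fields the browser sends
sess.post('https://www.qrz.com/login',
          data={'2fcode': '', 'flush': 1,
                'login_ref': 'https%3A%2F%2Fwww.qrz.com%2F',
                'password': YOURPASSWORD, 'target': '%2F',
                'username': YOURUSERNAME})
```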

Turns out, the whole handshake seems unnecessary:

```
import requests

data = {'2fcode': '', 'flush': 1, 'login_ref': 'https%3A%2F%2Fwww.qrz.com%2F',
        'password': YOURPASSWORD, 'target': '%2F', 'username': YOURUSERNAME}
sess = requests.Session()
sess.get('http://qrz.com/login')
sess.post('http://qrz.com/login', data=data)

querydata = {'tquery': 'VA6BH', 'mode': 'callsign'}
r = sess.post('https://qrz.com/lookup', data=querydata)
```

You can then pass r.content into BeautifulSoup:

```
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')
csdata = soup.find_all('td', id='csdata')
```

You can then parse the resulting table.
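
For example, continuing from the csdata result above, something like this could pull out the visible text lines:

```
# Rough sketch: extract the text fragments from the csdata cell
for td in csdata:
    lines = list(td.stripped_strings)   # visible text, whitespace stripped
    print(lines)
```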

ihf

@JonB You've done it again! Thank you. I ran the script and viewed the result using a webview (just to see it), and I am indeed logged in; however, the search does not appear to be executing. There is no error message; it is just sitting at the advanced search screen after login. If I repeat the search post, it does the search and I get the desired results. Thank you again. You always make it look easy, but I know it would have taken me a long time to get it right.

ihf

After searching for the callsign w2aee, the result set from:

```
csdata = soup.find_all('td', id='csdata')
```

gives

```
[
W2AEE
USA flag USA

COLUMBIA UNIVERSITY AMAT RAD CLUB, W2AEE
144 Washburn Rd
Briarcliff Manor, NY 10510
USA

Email: Use mouse to view..

Page managed by KD2DDT Lookups: 5677

</td>]
```

What is the best way to parse this for the name and address: HTMLParser, BeautifulSoup, or just string functions? (I tried feeding this as a string to an HTMLParser instance and got no output.)