Forum Archive

Web scraping

sulcud

I made this Python script for web scraping. It can map a website into a nested dictionary, for example:

{url:[url_type,{url1_in_url:[url1_in_url_type,{...}],url2_inurl:[...],...}]}
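To make that shape concrete, here is a minimal sketch of how such a nested map could be built. It is not the code from the repository. `map_site`, `LinkParser`, `fake_fetch` and the `PAGES` table are hypothetical names, and it assumes `url_type` means the content type of the response. The fetch function is passed in, so the sketch runs without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_site(url, depth, fetch, seen=None):
    """Return {url: [url_type, {child_url: [...], ...}]} down to `depth` levels.

    `fetch` is a hypothetical callable returning (content_type, text).
    """
    seen = set() if seen is None else seen
    seen.add(url)
    url_type, body = fetch(url)
    children = {}
    # Only HTML pages are parsed for further links.
    if depth > 0 and url_type == "text/html":
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            child = urljoin(url, href)  # resolve relative links
            if child not in seen:       # avoid loops between pages
                children.update(map_site(child, depth - 1, fetch, seen))
    return {url: [url_type, children]}


# Fake site used instead of real HTTP requests (assumption for the demo).
PAGES = {
    "http://example.test/": ("text/html", '<a href="/a">A</a> <a href="/logo.png">logo</a>'),
    "http://example.test/a": ("text/html", '<a href="/">home</a>'),
    "http://example.test/logo.png": ("image/png", ""),
}


def fake_fetch(url):
    return PAGES[url]


site = map_site("http://example.test/", 2, fake_fetch)
```

The `seen` set is what stops the crawl from looping when two pages link to each other.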

If you also want to download all the files on that website, you only need to set the variable 'descargar' to True:

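 url = 'https://en.wikipedia.org/wiki/Roy_Clark'
 descargar = True       # "download": also save the files that are found
 profundidad = 2        # "depth": how many levels of links to follow
 archivo = 'clark.json' # "file": where the resulting map is written
 s = Scraper()
 s.lineal(url,profundidad,descargar,archivo)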

GitHub

I want to implement some kind of threading to speed up the analysis.

Note: it sometimes throws an error when downloading files.
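For the threading idea, a rough sketch of what that could look like with the standard library's thread pool. This is not what the script does today. `fake_download` and `download_all` are hypothetical names, and the sleep stands in for a blocking HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a blocking download; returns (url, size in bytes)."""
    time.sleep(0.05)  # pretend to wait for the server
    return url, len(url)


def download_all(urls, workers=8):
    """Run the downloads in a pool of threads and collect {url: size}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order and re-raises any worker exception.
        return dict(pool.map(fake_download, urls))


urls = [f"https://example.test/file{i}.txt" for i in range(8)]
sizes = download_all(urls)
```

Because the threads spend most of their time waiting on the network, they can overlap that waiting even though only one of them runs Python code at a time.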

mikael

@sulcud, would be interesting to know if you have considered the pros and cons of threading vs. asyncio, and why you would pick one or the other.

sulcud

@mikael I tried to implement the async function. I really don't know if I did it well, but it now downloads everything much faster than before. I also fixed the link extraction function, because sometimes (most of the time ☹️) it only returned 20-50 URLs; now it extracts all, or nearly all, of the links in the page. I also made a setup.py file, but I honestly don't know if it works.

Now the way to use it is:

from scrapthor import scrap
url = "some url"
scrap(url)

Please, can you check it?

mikael

@sulcud, your code still looks serial to me. I think you need aiohttp for this - check e.g. this tutorial.
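As a rough illustration of what I mean by serial (not taken from your repo): awaiting each request inside a loop still handles one page at a time, while `asyncio.gather` lets the waits overlap. The `fetch` coroutine here is a hypothetical stand-in that just sleeps. With aiohttp you would replace it with a real request made through a shared `aiohttp.ClientSession`.

```python
import asyncio
import time


async def fetch(url, delay=0.1):
    """Hypothetical stand-in for a network request."""
    await asyncio.sleep(delay)  # simulates waiting for the response
    return f"<html>{url}</html>"


async def fetch_serial(urls):
    # Each await finishes before the next request starts: still serial.
    return [await fetch(u) for u in urls]


async def fetch_concurrent(urls):
    # All coroutines are scheduled together, so the waits overlap.
    return await asyncio.gather(*(fetch(u) for u in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


urls = [f"https://example.test/page{i}" for i in range(10)]
serial_pages, serial_time = timed(fetch_serial(urls))
concurrent_pages, concurrent_time = timed(fetch_concurrent(urls))
```

Both versions return the same pages in the same order; only the total waiting time differs.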

sulcud

@mikael WOW, with that package the speed increased a lot, thanks. Now I know the real power of async programming.