Crawling websites using curl and bash.
Every now and then I need to crawl a website or automate a web form. The easiest way to accomplish this is with a simple shell script, parsing the data manually. I did this recently with www.samair.ru and decided to do a quick writeup.
Finding open proxy servers that work is often a little annoying: very few of the sites that give you the ability to download the complete database allow it for long, and most try to obfuscate the data to prevent crawling. Take a look at http://www.samair.ru/proxy/proxy-01.htm. It seems easy enough to grab the data from this site, but since it updates occasionally a script would be better. When we look at the page source, however, we get the following:
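Roughly, each row looks something like this (a reconstruction; the IP address, variable names, and attributes here are made up for illustration):

```html
<tr>
  <td>212.42.102.174<script type="text/javascript">document.write(":"+v+x+d+d+s)</script></td>
  <td>HTTP</td><td>anonymous</td><td>Russia</td>
</tr>
```

The port isn’t in the HTML at all; it gets written out by JavaScript from single-letter variables.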
This shows that they tried to complicate things a bit. Looking at the top of the page, we can see the definitions they use for outputting the port numbers; these change every couple of minutes, so we can’t simply hard-code them either:
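Something along these lines (again illustrative; the variable names and digits change with every update):

```html
<script type="text/javascript">
v=2;s=3;d=4;x=8;r=0;m=1;
</script>
```

From there a small bash script can do the whole job: fetch each page with curl, turn the definitions into sed substitutions, and use those to decode the ports. What follows is a minimal sketch assuming the layout shown above (single-letter variables, one digit each); the URL is the real one, the parsing details are assumptions:

```bash
#!/bin/bash
# Crawl the numbered pages and print ip:port pairs.
for i in $(seq -w 1 99); do
    page=$(curl -s "http://www.samair.ru/proxy/proxy-$i.htm")

    # Turn the definitions at the top of the page ("v=2;s=3;...") into a
    # sed program that maps each variable back to its digit: s/v/2/g;s/s/3/g;...
    defs=$(printf '%s\n' "$page" | grep -o '[a-z]=[0-9]' \
        | sed 's#\(.\)=\(.\)#s/\1/\2/g;#' | tr -d '\n')

    # Pull out the ip<script>document.write(":"+v+x+d)</script> fragments,
    # strip the JavaScript down to ip:letters, then substitute the digits in.
    printf '%s\n' "$page" \
        | grep -o '[0-9.]\{7,\}<script[^<]*</script>' \
        | sed -e 's#<script[^>]*>document.write(":"#:#' \
              -e 's#)</script>##' -e 's#[+"]##g' \
        | sed "$defs"
done
```

Building the sed program from the page itself is what makes the rotating definitions a non-issue: every run decodes with whatever mapping the page currently serves.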
There’s a lot of room for improvement, detecting redirecting proxies for example. Interestingly, the pages are only numbered up to 25, but we can crawl much farther than that; the script maxed out at 72 the last time I used it. Detecting when we’ve grabbed the last proxy doesn’t always work though: you can hit an empty page while the following pages still have more loot, so we just crawl all the way to page 99.
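As a starting point for weeding out redirecting proxies, you could fetch a page with known content through each one and keep only the proxies that return it unmodified. A rough sketch, where proxies.txt is a hypothetical file holding the ip:port pairs from the crawl above:

```bash
#!/bin/bash
# Keep only proxies that actually return the page we asked for.
# proxies.txt holds ip:port pairs, one per line (hypothetical filename).
while read -r proxy; do
    body=$(curl -s -m 10 -x "$proxy" http://www.example.com/)
    printf '%s\n' "$body" | grep -q 'Example Domain' && echo "$proxy"
done < proxies.txt
```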