Crawling websites using curl and bash.

Every now and then I need to crawl a website or automate a web form. The easiest way to do this is usually a simple shell script that parses the data by hand. I did this recently with www.samair.ru and decided to do a quick writeup.

Finding open proxy servers that actually work is often a little annoying. Very few of the sites that let you download their complete database allow it for long, and most try to obfuscate the data to prevent crawling. Take a look at http://www.samair.ru/proxy/proxy-01.htm: it seems easy enough to grab the data from this site by hand, but since it updates occasionally a script would be better. When we look at the page source, however, we get the following:

<tr><td>62.148.136.79<script type="text/javascript">document.write(":"+o+s)</script>

So they tried to complicate this a bit. Looking at the top of the page we can see the definitions they use for outputting port numbers. These change every couple of minutes, so we can’t simply hard-code them either:

<script type="text/javascript">
q=7;s=0;u=4;z=3;g=6;i=9;p=1;o=8;h=2;m=5;</script></head>
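Reading the two snippets together, each letter in the document.write() call stands for one digit, so o and s give port 80 here. A minimal sketch of doing that substitution with grep and sed follows. It is not the script linked further down in this post: decode_proxies is a hypothetical name, and the code assumes the page layout matches the two snippets above exactly (single-letter variables, one digit each).

```shell
# Sketch only: decode_proxies is a hypothetical helper, not the author's script.
# It reads one saved page on stdin and prints ip:port lines.
decode_proxies() {
    local html defs row ip expr name value port
    html=$(cat)

    # The definitions look like "q=7;s=0;...": take the first such run.
    defs=$(printf '%s\n' "$html" | grep -oE '([a-z]=[0-9];){2,}' | head -n 1)

    # Each row looks like: 62.148.136.79<script ...>document.write(":"+o+s)
    printf '%s\n' "$html" |
        grep -oE '[0-9]+(\.[0-9]+){3}<script[^>]*>document\.write\(":"(\+[a-z])+\)' |
        while IFS= read -r row; do
            # Keep the text before the first "<", which is the IP address.
            ip=${row%%<*}
            # Keep only the variable list, e.g. "+o+s".
            expr=$(printf '%s\n' "$row" | sed 's/.*":"//; s/)$//')
            port=
            for name in $(printf '%s\n' "$expr" | tr '+' ' '); do
                # Look up the digit assigned to this letter.
                value=$(printf '%s\n' "$defs" | sed -n "s/.*$name=\([0-9]\);.*/\1/p")
                port=$port$value
            done
            printf '%s:%s\n' "$ip" "$port"
        done
}
```

Something like curl -s http://www.samair.ru/proxy/proxy-01.htm | decode_proxies would then print one proxy per line. Because the definitions are read from the same page as the rows, it does not matter that they change every few minutes.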

Since curl doesn’t support JavaScript we have to evaluate this ourselves, which is fairly simple using sed and grep: replace each variable in the document.write() call with its digit from the definitions, so with the values above ":"+o+s becomes :80. Not all of these proxies work, so we have to test them as well. Some block certain regions of the net, others may never have worked at all. Either way, using curl we connect to a site through the proxy and search for some known piece of text. If we don’t find it, the proxy doesn’t work. Easy. If anyone is interested, here’s the script for crawling the site. On a complete pass it gives about 150 working HTTP proxies. You can use these for faking online polls, SQL injection, you name it.

http://pastebin.com/90Vh0AVY
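Independently of the linked script, the proxy check described above could be sketched roughly like this. The names check_proxy and filter_working are hypothetical, and the test URL and expected text are assumptions: swap in any page whose content you know.

```shell
# Sketch only: check_proxy and filter_working are hypothetical helpers.
# The test URL and the expected text are assumptions, pick a page you know.
TEST_URL=${TEST_URL:-http://example.com/}
TEST_TEXT=${TEST_TEXT:-Example Domain}

# Succeeds only if the known text comes back through the proxy.
check_proxy() {
    local proxy="$1" url="${2:-$TEST_URL}" text="${3:-$TEST_TEXT}"
    curl -s -x "$proxy" --connect-timeout 5 --max-time 15 "$url" |
        grep -qF "$text"
}

# Reads ip:port lines on stdin and prints the ones that pass the check.
filter_working() {
    local proxy
    while IFS= read -r proxy; do
        if check_proxy "$proxy" </dev/null; then
            printf '%s\n' "$proxy"
        fi
    done
}
```

The timeouts matter more than anything else here. Dead proxies tend to hang rather than refuse the connection, so without --connect-timeout and --max-time a full pass would take forever.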

There’s a lot of room for improvement, detecting redirecting proxies for example. Interestingly, the pages are only numbered up to 25, but we can crawl much farther than that: the script maxed out at 72 the last time I used it. Detecting when we have reached the last page doesn’t always work though. You can get an empty page while the following pages still have more loot, so we just crawl to page 99.
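The "just crawl to page 99" approach could look something like the sketch below. page_url and crawl_pages are hypothetical names, and the code assumes every page follows the same proxy-NN.htm pattern as the first one, which this post only shows for page 01.

```shell
# Sketch only: page_url and crawl_pages are hypothetical helpers.
# Assumes every page follows the proxy-NN.htm pattern of the first one.
BASE_URL=http://www.samair.ru/proxy

# Prints the URL for page number $1, zero-padded to two digits.
page_url() {
    printf '%s/proxy-%02d.htm\n' "$BASE_URL" "$1"
}

# Fetches pages 1..last (default 99) and writes their HTML to stdout.
crawl_pages() {
    local last="${1:-99}" n=1
    while [ "$n" -le "$last" ]; do
        # Empty or failed pages are skipped, not treated as the end.
        curl -s --max-time 30 "$(page_url "$n")" || true
        n=$((n + 1))
    done
}
```

Since an empty page in the middle is simply skipped, a fixed upper bound avoids the problem of stopping too early, at the cost of a few wasted requests at the end.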

~ by s3c on 2011/05/19.
