Fun with data mining

Chris Monteiro
Published in pirate dot london
Jan 2, 2011


So a friend of mine has started a new job and been given the rather boring task of taking 2k company names and finding the associated address, phone and email details, for the purpose of calling them up and selling them his company’s services.

He mentioned this to me, knowing I always appreciate a challenge with t’internet. There were probably alternative approaches I could have taken, but once I had my heart set on it, form posting automation was what I chose.

Scouting around, I found an online phone book site, quite a mainstream one, from BT, http://www.thephonebook.bt.com

Crucial to my crude attempt, the site had to accept the search parameters in the query string, which this one did. I then converted my list of company names to CSV format, encoding characters like & as %26 and spaces as ‘+’, as per how the site was working.
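For anyone who does write code, the encoding step could be sketched in Python roughly as below. This is an illustration only, not the process actually used: the base URL and the `name` parameter are made-up placeholders, since the real site's query string format isn't reproduced here. `quote_plus` from the standard library happens to apply the same rules described above (& becomes %26, space becomes +).

```python
from urllib.parse import quote_plus

# Hypothetical placeholders: the real site's search path and parameter
# name are not given in this post, so these are assumptions.
BASE_URL = "http://phonebook.example/search"
PARAM = "name"


def encode_name(name: str) -> str:
    """Encode a company name for use in a query string."""
    return quote_plus(name.strip())


def build_urls(names):
    """Turn a list of company names into a list of search URLs."""
    return [f"{BASE_URL}?{PARAM}={encode_name(n)}" for n in names if n.strip()]


if __name__ == "__main__":
    for url in build_urls(["Smith & Sons", "Acme Ltd"]):
        print(url)
```

Blank names are skipped so the list doesn't end up with empty searches in it.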

I then created a complete list of 2k valid search URLs from this. Next, how to automate the requests and capture the results? Well, this was the tricky bit, and new to me. If I were a programmer, I could likely have written this entire process end to end using one language and a couple of libraries, but I’m not, so I improvise when I get stuck.

I downloaded the trial of WinAutomation, a highly versatile tool which I’d recommend to anyone who does repetitive tasks at their job using a computer.

Using its scripting / GUI (quite unique!) I created a ‘programme’ (the inverted commas are in case any actual programmers are reading ;) ) which would read the contents of my CSV and slap them into a variable. It would then iterate through the list, perform an HTTP GET against each URL and append the output to a file.
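The same read, loop, GET and append routine might look something like this as a Python script. It is a sketch under assumptions rather than a copy of the WinAutomation job: the function names are invented for illustration, and the CSV is assumed to hold one URL per row in its first column.

```python
import csv
import urllib.request


def read_urls(csv_path: str) -> list:
    """Read the first column of each non-empty row as a URL."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def fetch(url: str) -> str:
    """Perform an HTTP GET and return the body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def harvest(urls, out_path: str, fetch_page=fetch) -> int:
    """GET each URL in turn and append the raw output to one file."""
    count = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            # Mark where each page starts so the file can be split later.
            out.write(f"<!-- {url} -->\n")
            out.write(fetch_page(url))
            out.write("\n")
            count += 1
    return count
```

Passing the fetcher in as `fetch_page` is just a convenience so the loop can be tried out with a fake function before pointing it at a real site.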

This worked well at first, then I found I was downloading the same contents again and again. It turned out you can only make 50 searches before having to enter a captcha. I modified the script to parse the output, look for the captcha prompt text, and pause so I could manually enter the captcha each time.
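The detect-and-pause idea can be sketched in Python as follows. The marker text is a hypothetical stand-in, as the actual captcha prompt wording isn't recorded here, and the sketch assumes that solving the captcha by hand unblocks whatever session the script is using.

```python
# Hypothetical prompt text: the real wording on the captcha page is an
# assumption, swap in whatever the site actually shows.
CAPTCHA_MARKER = "please enter the characters"


def looks_like_captcha(html: str) -> bool:
    """True if the page contains the captcha prompt instead of results."""
    return CAPTCHA_MARKER in html.lower()


def fetch_with_pause(url, fetch_page, wait_for_human=input, max_tries=3) -> str:
    """Fetch a page, pausing for a human whenever a captcha appears."""
    for _ in range(max_tries):
        html = fetch_page(url)
        if not looks_like_captcha(html):
            return html
        # Block until the operator has solved the captcha by hand.
        wait_for_human("Captcha detected. Solve it, then press Enter to retry... ")
    raise RuntimeError(f"Still seeing a captcha after {max_tries} tries: {url}")
```

Checking every response, rather than counting to 50, means the script keeps working even if the site changes how often it asks.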

Finally, I manually cleaned up the raw HTML output using a mixture of Notepad++ replacements and delimiter points to split the output.
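A scripted version of that clean-up could split on delimiters in much the same way. The snippet below is a rough sketch: the `<span class="phone">` markup in the usage note is invented for illustration, as the real page structure isn't shown here, and the email pattern is deliberately simple rather than a full address validator.

```python
import re

# Simple pattern for things that look like email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def between(text: str, start: str, end: str) -> list:
    """Return every substring found between a start and an end delimiter."""
    pieces = []
    for chunk in text.split(start)[1:]:
        if end in chunk:
            pieces.append(chunk.split(end, 1)[0].strip())
    return pieces


def extract_emails(html: str) -> list:
    """Return the unique email-like strings in the text, sorted."""
    return sorted(set(EMAIL_RE.findall(html)))
```

With made-up markup, `between(raw, '<span class="phone">', '</span>')` would pull out each phone number in turn, which is the same trick as splitting on delimiter points by hand.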

Final result, hundreds of emails, addresses and phone numbers!

What did I learn through this exercise? One, BT could quite easily have stopped me data mining their site by throttling my requests or increasing the frequency of the captcha prompts, but they failed to do so. I wonder why? It also makes me realise just how easy it is for a technical person (not even a programmer) to data mine any search site once the initial configuration is in place.

It has opened my eyes further to why leaving your email on the internet leads to simple bots picking it up and then spamming you. It also shows that information in a website database, even if you need to search on specific fields to return it, is effectively open to anyone with the right tools, given a large enough and relevant dataset to search with.



Pirate, sysadmin, transhumanist, internet hipster. Researches cybercrime and Tor scams. Not getting paid enough for this shit