Saturday, August 4, 2007

Fun with Python: Domainspotter

In an effort to learn Python better, I will be writing a number of relatively small scripts to perform set tasks. I find that giving myself some goal and having to code towards it is more effective in learning syntax than going through coding exercises in a tutorial.

For the first such project, I wanted to determine how many words in a current English dictionary are not taken as domain names. This could have become a rather complicated exercise, so I kept the scope narrow. There are plenty of domains built from acronyms, combined names, made-up words, and many other forms, but I was curious just how many of the more common terms in the language were still available.

The code as it stands is at the end of this post.

Before you look at how I did it, however (which is far from optimal anyway), I would encourage you to try the exercise on your own, given the basic requirements stated above. I found this program was perfect for learning more Python, since it required performing a number of disparate activities: reading and parsing files, using external modules, performing web lookups, and displaying the desired data. These are, of course, some of the most general tasks that most useful programs must perform.
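As a sketch of the file-reading half of the exercise (the helper name `load_candidates` and the `.com` default are my own choices for illustration, not the script's actual interface):

```python
def load_candidates(path, tld=".com"):
    """Read a word list (one word per line) and turn each entry
    into a candidate domain name for the given TLD."""
    candidates = []
    with open(path) as wordfile:
        for line in wordfile:
            word = line.strip()
            # Skip blank lines and entries that can't appear in a hostname.
            if word and word.isalpha():
                candidates.append(word + tld)
    return candidates
```

Filtering with `isalpha()` is a simplification; a real pass over a dictionary file would need a policy for hyphens, apostrophes, and the like.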

A few notes. The whois lookup was the only part of the program that I expected would require something outside of core Python. I came across a module for performing whois lookups, and I have left the code that uses it in the program for illustration, but it did not meet my needs: it could not locate many domains I had confirmed were taken using the simple command whois. So I fell back on something I knew: the command whois itself. That worked fine for my purposes.
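The availability test ultimately comes down to spotting the registry's "No match" line in the whois output. Here is a hedged sketch of that parsing step in isolation (the helper name `looks_available` and the sample outputs in the usage note are mine); scanning every line, rather than a fixed line index, is more tolerant of registries that pad their output differently:

```python
def looks_available(whois_output, domain):
    """Return True if the whois output contains the registry's
    'No match' line for this domain, i.e. it appears unregistered."""
    marker = 'No match for "%s".' % domain.upper()
    return any(line.strip() == marker for line in whois_output.split('\n'))
```

For example, `looks_available('No match for "SOMEWORD.COM".', "someword.com")` is true, while output beginning `Domain Name: EXAMPLE.COM` is not.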

Additionally, while I know the code could stand optimization all over the place, I was curious how long the web lookups would take. To find out, I modified the lookup function to the following:

import time
import pdb

self.totalchecked = 0
self.availdomains = []
times = []
for potdomain in self.potdomains:
    start = time.time()
    runthis = "whois %s" % potdomain
    check = commands.getoutput(runthis).split('\n')
    if check[8] == 'No match for "%s".' % potdomain.upper():
        self.availdomains.append(potdomain)
    end = time.time()
    timeforquery = end - start
    times.append(timeforquery)  # record how long this query took
pdb.set_trace()                 # drop into the debugger when finished
I ran this against a much-shortened dictionary file (the top 10 lines of the real one). Once the debugger launched, I did the following:
(Pdb) print sum(times) / len(times)
This gave me the average time per web query. Note that I only worked this out after I had already started the final run for the first time... Since the actual dictionary file has 112,505 entries, I estimated the entire series should take around 27.6 hours. So I started my final run at around 7:32pm Saturday evening, expecting it to finish around 11pm Sunday evening. Instead I got an error during the night:
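For the record, the arithmetic behind that estimate: 27.6 hours over 112,505 entries implies an average of roughly 0.88 seconds per query. A quick sketch of the projection (the per-query average here is back-derived from those two figures, not a fresh measurement):

```python
entries = 112505      # words in the dictionary file
avg_seconds = 0.883   # rough average time per whois query
est_hours = entries * avg_seconds / 3600.0
# 112,505 queries at ~0.883 s each works out to about 27.6 hours.
```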

  File "", line 125, in lookup
if check[8] == 'No match for "%s".' % potdomain.upper():
IndexError: list index out of range

This, I am pretty certain, was due to my internet connection dropping for a moment. Since that is hardly an unlikely event, I should add handling for such situations, and perhaps write the results out in chunks instead of all at once! So the complete list will have to wait. Still, running against the beginning of the list showed there are some exciting domains still out there:
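That hardening step might look something like the following sketch (the retry count, the `lookup_one` stand-in for the whois call, and the flush-every-N policy are all my own choices, not the script's):

```python
def spot_domains(candidates, lookup_one, outfile, chunk=100, retries=3):
    """Run lookup_one(domain) over candidates, retrying transient
    failures and flushing results to outfile in chunks, so a crash
    mid-run doesn't lose everything."""
    pending = []
    for domain in candidates:
        for attempt in range(retries):
            try:
                if lookup_one(domain):
                    pending.append(domain)
                break  # lookup succeeded; stop retrying
            except (IndexError, OSError):
                # Dropped connection or malformed output; try again,
                # giving up on this domain after the last attempt.
                pass
        if len(pending) >= chunk:
            outfile.write('\n'.join(pending) + '\n')
            outfile.flush()
            pending = []
    if pending:
        outfile.write('\n'.join(pending) + '\n')
```

Writing and flushing every `chunk` results means a mid-run crash costs at most one chunk of work rather than the whole night's lookups.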
Domains that seem to be available:

One last note you might already be asking yourself about: the AGID-4 dictionary contains both proper and common nouns. I left the proper nouns capitalized. It could be argued I should lower() them first, but I don't think one can buy a domain name that differs only in capitalization.
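Since DNS names are case-insensitive, two dictionary entries differing only in capitalization map to the same domain, so a case-insensitive dedupe before looking anything up would save redundant queries. A small sketch (the helper name is mine):

```python
def dedupe_case_insensitive(words):
    """Collapse entries that differ only in capitalization,
    keeping the first form seen."""
    seen = set()
    unique = []
    for word in words:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            unique.append(word)
    return unique
```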

Current code of domainspotter:

[EDIT] The code base of domainspotter has changed around a lot since I posted this. Instead of having a big block of code here that is static, check out the latest form on my Trac site.

