Fun with Python: Domainspotter
In an effort to learn Python better, I will be writing a number of relatively small scripts that perform set tasks. I find that giving myself a goal and coding toward it teaches syntax more effectively than working through the exercises in a tutorial.
For the first such project, I wanted to determine how many words in a current English dictionary are not taken as domain names. This could have become a rather complicated exercise, but I limited it to the following:
- single English words, using the AGID-4 dictionary
- .com and .net TLDs only
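In other words, the candidate list is just each word paired with each TLD. A quick sketch (words here is a stand-in for the parsed AGID-4 word list):

words = ["aardvark", "abacus"]  # stand-in for the parsed AGID-4 word list
potdomains = [word + tld for word in words for tld in (".com", ".net")]
# ['aardvark.com', 'aardvark.net', 'abacus.com', 'abacus.net']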
The code as it stands is at the end of this post.
Before you look at how I did it, however (which is far from optimal anyway), I would encourage you to try out the exercise on your own, given the basic requirements stated above. I found that this program was perfect for learning more Python, since it required learning how to perform a number of disparate activities, such as reading and parsing files, using external modules, performing web lookups, and displaying desired data. These are, of course, some of the most general tasks that most useful programs must perform.
A few notes. The whois lookups were the only part of the program that I assumed would require something outside of core Python. I came across rwhois.py, a module for performing whois lookups. I still have the code that uses it in the program as it stands, for illustration, but found it did not meet my needs: it could not locate many domains that I confirmed were taken using the simple shell command whois. So, I decided to fall back on something I knew: calling that whois command directly! This seemed to work fine for my needs.
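As a rough illustration, the heart of that fallback boils down to something like this (a sketch only; it assumes, as the program does, that the registry answers 'No match for "DOMAIN".' when a .com/.net name is unregistered):

import commands  # Python 2; subprocess fills this role in Python 3

def is_available(potdomain):
    # Run the system whois and look for the registry's
    # "no match" marker anywhere in the reply.
    output = commands.getoutput("whois %s" % potdomain)
    return 'No match for "%s".' % potdomain.upper() in output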
Additionally, while I know the code could stand optimization all over the place, I was curious how long the web lookups would take. To find out, I modified the lookup function to the following:
import time  # commands is already imported at the top of domainspotter.py

self.totalchecked = 0
self.availdomains = []
times = []
for potdomain in self.potdomains:
    start = time.time()
    # Shell out to the system whois and split the reply into lines.
    runthis = "whois %s" % potdomain
    check = commands.getoutput(runthis).split('\n')
    # For unregistered .com/.net names, the registry's reply carries
    # 'No match for "DOMAIN".' on its ninth line.
    if check[8] == 'No match for "%s".' % potdomain.upper():
        self.availdomains.append(potdomain)
    else:
        pass
    self.totalchecked += 1
    # Record how long this query took.
    end = time.time()
    timeforquery = end - start
    times.append(timeforquery)
# Drop into the debugger once the whole run finishes.
import pdb
pdb.set_trace()
I ran this against a much shortened dictionary file (the top 10 lines of the real one). Once the debugger launched, I did the following:

(Pdb) print sum(times) / len(times)
0.884103870392

This gave me the average time per web query. Now, note that I found this out after I started the final run for the first time... And since the actual dictionary file has 112,505 entries, I estimated the entire series should be done in around 27.6 hours (112,505 entries × ~0.884 seconds ≈ 99,500 seconds). So I started my final run at around 7:32pm Saturday evening, and expected it to run until around 11pm Sunday evening. But instead I got an error during the night:
File "domainspotter.py", line 125, in lookupThis, I am pretty certain, was due to my internet connection dropping for a moment. This not being an unlikely event, I should add in handling for such situations, and perhaps write to the results output in chunks, instead of all at once! So the complete list will have to wait. But, running against the beginning of the list showed there are some exciting domains still out there:
if check[8] == 'No match for "%s".' % potdomain.upper():
IndexError: list index out of range
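A sketch of what that handling might look like (this is not code from domainspotter; the check_domain name, the retry count, and the available.txt filename are made up for illustration). Searching the whole reply instead of indexing line 9 also sidesteps the IndexError itself:

import commands  # Python 2, as in the program itself
import time

def check_domain(potdomain, retries=3):
    # Retry transient failures: an empty reply usually means the
    # connection hiccuped, not that the domain is taken.
    for attempt in range(retries):
        output = commands.getoutput("whois %s" % potdomain)
        if 'No match for "%s".' % potdomain.upper() in output:
            return True               # registry says unregistered
        if output.strip():
            return False              # a real reply came back: taken
        time.sleep(2 ** attempt)      # back off before retrying
    return False                      # give up after repeated failures

# Append each hit to disk as it is found, so a crash loses nothing;
# potdomains stands for the program's self.potdomains list.
results = open("available.txt", "a")
for potdomain in potdomains:
    if check_domain(potdomain):
        results.write(potdomain + "\n")
        results.flush()
results.close()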
So the complete list will have to wait. But running against the beginning of the list showed there are some exciting domains still out there.

Domains that seem to be available:
AMusD.net
Abelmoschus.net
Aberdonian.net
Abkhas.com
Abkhas.net
Abkhasian.com
Abkhasian.net
Abkhazian.net
Abnaki.net
Abramis.net
One last note you might already be asking yourself about: the AGID-4 dictionary contains proper and common nouns, and I left the proper ones capitalized. It could be argued I should lower() them first, but I don't think one can buy a domain name that differs merely in capitalization.
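If one did want to collapse case variants before querying, a one-liner would do it; words again stands for the parsed word list:

unique_words = sorted(set(word.lower() for word in words))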
Current code of domainspotter.py:

[EDIT] The code base of domainspotter has changed around a lot since I posted this. Instead of having a big block of code here that is static, check out the latest form on my Trac site.