A Lesson on Python, DNS, and Threads

  2007-09-18


While working on a firewall project (that I’ll hopefully be posting about soon), I ran across the need to resolve thousands of domain names to IP addresses as quickly as possible (without duplicates). It was a simple, straightforward problem which proved to be a great learning experience!

To follow along at home you’ll need to download proxies.txt and put it in the same directory as the Python scripts below. Update: I removed the proxiest.txt file… too many hits for all the wrong reasons I’m sure.

Spoiler Obvious Warning: If you need to do lots of DNS queries, don’t reinvent the wheel: use an asynchronous DNS library. These examples are useful for people interested in learning about threading and optimizing simple tasks.

First Try: Synchronous DNS Lookups

Always try the simplest solution first: loop over the list of domain names and add each resolved IP to a set. A set is by definition a unique list, so it won’t contain duplicate IP entires.

The Python code is short enough to post here, but WordPress really likes messing with my whitespace, so I uploaded it to pastebin.

Unfortunately doing synchronous DNS lookups on 3,000 domain names took about 2 hours on my machine!

Bottom line: 2 hours is not acceptable.

Second Try: Threads and Semaphores

I know threads are overkill, but it just wouldn’t be fun if I went straight to using asynchronous DNS. 😉 I also wanted to try using threads on Python which in general have a bad reputation due to the GIL.

[

Download try2.py]4 and execute it with “time python try2.py” to see how long it takes to do the lookups. It took 40 seconds on my computer.

However, if you set “easyadns_debug = False” in the code, suddenly I get a “thread.error: can't start new thread” exception. I believe this is because the code is starting about 3,000 new threads in a tight loop which isn’t a very nice thing to do.

I use a counting semaphore with a value of 50 to ensure only 50 lookups are performed at a time, but this doesn’t stop the script from spawning 3,000 threads immediately! I’m guessing all of the print statements that occur when easyadns_debug is true slow down the process just enough to allow Python to sanely handle all of the new threads.

Bottom line: 40 seconds is good, but causing difficult to debug exceptions is unacceptable.

Third Try: Thread Pool

At this point I should have bitten the bullet and sought out a real asynchronous DNS library, but I had to at least try threads properly before giving up on them.

Download try3.py and execute it with “time python try3.py“.

Just like with try2.py you can set easyadns_debug to False to disable all of the output, but unlike try2.py, this version won’t crash if you do!

My third try uses a thread pool to handle the lookups. 50 threads are created and pop domain names off of the master list. This method requires a lock on both the input and output data variables which adds overhead.

With debugging on the script takes 3 minutes 20 seconds to do the lookups, and disabling debugging output seems to save about 20 seconds.

Bottom Line: 3 minutes isn’t bad, but from the 2nd try we know there’s a more efficient way.

Solution: adns-python

adns-python’s nearly complete lack of documentation scared me away at first, but with a little toying around in ipython its pretty straightforward. Its just a “sudo apt-get install python-adns” away in Debian Sid.

Download try4.py and execute it with “time python try4.py“.

As always you can turn off debugging messages easily. However, even with debugging on this script completes in a little over 20 seconds.

Using a callback instead of the while loop starting on line 15 would be preferable, but I didn’t see how to use callbacks in the version of adns-python I was using. Callbacks would also introduce the need to lock data before adding IPs to it.

Bottom line: No threads. No locks. Less code. Good performance. We have a winner!

Some random things I learned while doing all of this:

  • The bottleneck for DNS lookups is the timeout. Failed lookups will wait multiple seconds before timing out while successful DNS queries will return results in a fraction of a second.
  • Python threading is not worthless because of the GIL. It will be interesting to see where experiments in Python concurrency lead, but I don’t see the GIL as a limiting factor for Python.
  • Because of the GIL I may not have need to lock around my variables. This could have improved performance, but I wanted to write the code using the correct principles. I would hope you could port these examples to another language and get similar results.
  • It’s hell to post code in WordPress. I thought installing a code formatting plugin would help, but it doesn’t seem to stop WordPress from mangling my whitespace randomly.

Aug 1st, 2010 – Update: Disabled comments on this post as it seems to attract a disproportionate amount of spam.