TL;DR: There are over 500’000 legitimate mailservers on the Internet. If you are not managing the IPs of your own mailservers at dnswl.org yet, you should do it!
At dnswl.org, we attempt to “enumerate the goodness”: to identify which mailservers are “good”, as opposed to blacklists, which attempt to “enumerate the badness”.
But just how many “good” mailservers are out there?
We collect data to measure the relative size of individual mailservers. We do not directly observe SMTP traffic, but we do log DNS traffic. We then use the DNS logs as a proxy to estimate the volume of email being sent from a particular IP.
While this will never give us correct absolute numbers (due to DNS caching), it does give us relative numbers: more DNS queries means a bigger SMTP sender.* Each IP's relative size is then assigned a value on a logarithmic scale, its “magnitude”:
|Magnitude||Percent of overall mail traffic|
Daily data for such statistics may be subject to day-of-week or other short-term influences. Therefore we compute the stats on a monthly aggregate. We then group mailservers by their magnitude:
|Magnitude||Number of mailservers|
Those with “< 1.0” are in fact “0.0”, i.e. they have not been seen in the current month and may drop off our database soon.
This leaves us with around 500’000 legitimate mailservers on the Internet today.
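To make the magnitude idea concrete, here is a minimal sketch of how per-IP query counts could be mapped onto a logarithmic scale and then grouped. The formula (10 × ln(queries) / ln(total queries), rounded to one decimal) is an assumption chosen for illustration, not the formula dnswl.org actually uses; the function names and the `monthly_queries` input are hypothetical as well.

```python
import math
from collections import Counter


def magnitude(queries: int, total_queries: int) -> float:
    """Hypothetical log-scale magnitude between 0.0 and 10.0.

    ASSUMPTION: this formula is illustrative only; the post does not
    state the exact formula behind its magnitudes.
    """
    if queries <= 0 or total_queries <= 1:
        return 0.0  # not seen in the period
    return round(10 * math.log(queries) / math.log(total_queries), 1)


def group_by_magnitude(monthly_queries: dict[str, int]) -> Counter:
    """Count how many IPs fall into each whole-number magnitude bucket."""
    total = sum(monthly_queries.values())
    buckets: Counter = Counter()
    for ip, queries in monthly_queries.items():
        buckets[math.floor(magnitude(queries, total))] += 1
    return buckets
```

Under this assumed formula, an IP with no queries in the month lands in bucket 0, which corresponds to the “0.0” case described above.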
The number of IPs dnswl.org currently publishes is slightly different from the above. We have deactivated some entries for various reasons, and we still publish some IPs which have not been seen in quite some time, because these IPs are likely to be used again in the future.
Notes on the methodology and data handling
We clean the data by excluding our internal blacklist (a few networks known to be very bad) and those IPs which are younger than 5 days (from experience, a lot of those IPs will soon be found to be bad and removed again from our data set).
We exclude those IPs which have not been seen for 30 days. If an IP does not send for 30 days, it may still be a valid mailserver, but it’s unlikely to be significant in terms of volumes in the near future.
Very low-volume IPs (fewer than 50 queries per 24 hours) are excluded, because we assume they are not significant mailservers.
We further clean some other noise in the data (internal IPs, hostnames, random gibberish, people trying to use our nameservers like public resolvers etc).
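The cleaning rules above can be sketched as a single filter. This is an illustration under stated assumptions, not our actual pipeline: the `IpStats` record, its field names, and the `is_counted` function are hypothetical, and only the thresholds (5 days, 30 days, 50 queries per 24 hours) come from the text.

```python
import ipaddress
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class IpStats:
    """Hypothetical per-IP record; field names are assumptions."""
    ip: str
    first_seen: date
    last_seen: date
    queries_per_day: float


def is_counted(rec: IpStats, today: date, blacklist: list) -> bool:
    """Return True if the record survives the cleaning rules."""
    try:
        addr = ipaddress.ip_address(rec.ip)
    except ValueError:
        return False  # hostnames, random gibberish
    if addr.is_private or addr.is_loopback:
        return False  # internal IPs
    if any(addr in net for net in blacklist):
        return False  # internal blacklist of known-bad networks
    if today - rec.first_seen < timedelta(days=5):
        return False  # younger than 5 days
    if today - rec.last_seen >= timedelta(days=30):
        return False  # not seen for 30 days
    if rec.queries_per_day < 50:
        return False  # very low volume
    return True
```

The `blacklist` argument is expected to be a list of `ipaddress.ip_network(...)` objects. The checks run from cheapest to most specific, so malformed input is rejected before any date or volume comparison.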
We are using July 2017 data for this post. As of the time of writing this on July 22nd, we have data available from July 1st to July 21st 2017.
* Because DNS caching is more effective for higher-volume senders, lower-volume senders are likely overrepresented in our data set (fewer of the DNS queries for higher-volume senders reach our logfiles). We assume that the statistical error follows a continuous function and that we can ignore this inconsistency for most of our data set.