Sometime today, Nov 27 2010, amidst the hardware problems with one of our servers, we silently passed the milestone of 100’000 active entries in the dnswl.org database (it’s slightly more IP addresses, because there are also some ranges of IP addresses in our database). That data is used by about 50’000 organisations world-wide.
Based on our statistics, we cover about 90% of the volume of e-mail, and about two thirds of the IPs that send e-mail. Our database is still missing a “long tail” of smaller mailservers, which make up roughly one third of all sending IPs.
You may be curious to know how we arrive at these numbers. We obviously cannot look into everyone’s e-mail logfiles, but we can look into the DNS traffic on our nameservers. There, we look not only at who is querying our data, but also at what they are querying.
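To illustrate what “what they are querying” means in practice: whitelist lookups follow the usual DNSBL convention, where the IP in question is encoded with its octets reversed in front of the zone name. The following is only a minimal sketch of how the queried IP could be recovered from a query name. The function name `queried_ip` and the default zone `list.dnswl.org` are assumptions made for this example, not a description of our actual log processing.

```python
import ipaddress

# Assumed zone suffix for this sketch.
ZONE = "list.dnswl.org"


def queried_ip(qname, zone=ZONE):
    """Return the IPv4 address encoded in a DNSBL-style query name, or None."""
    name = qname.rstrip(".").lower()
    suffix = "." + zone
    if not name.endswith(suffix):
        return None
    labels = name[: -len(suffix)].split(".")
    if len(labels) != 4:
        return None
    try:
        # The octets appear in reverse order in the query name.
        return ipaddress.IPv4Address(".".join(reversed(labels)))
    except ipaddress.AddressValueError:
        return None
```

A lookup for `4.3.2.1.list.dnswl.org` would thus be counted as a query about the IP 1.2.3.4, while anything that does not decode to an IPv4 address yields `None`.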
We aggregate this data on a daily and monthly basis to reduce the volume. We then filter out the “noise”: IPs with extremely low query volumes, IPs which are clearly in dynamic/end-user space (“dynamic” in the hostname etc), and IPs caught by some other tests. The remaining IPs are added to a queue of IPs to be reviewed and assigned to appropriate DNSWL records.
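The aggregation and filtering steps can be sketched roughly as follows. This is an illustration only: the threshold `MIN_QUERIES`, the data layout (a dict of daily counts keyed by day and IP) and the function names are hypothetical, and the real filter includes further tests that are not shown here.

```python
from collections import Counter

# Hypothetical cut-off for "extremely low query volumes".
MIN_QUERIES = 10


def monthly_totals(daily_counts):
    """Roll daily per-IP lookup counts up into monthly totals.

    daily_counts maps (day, ip) to a lookup count, with day as "YYYY-MM-DD".
    """
    monthly = Counter()
    for (day, ip), count in daily_counts.items():
        monthly[(day[:7], ip)] += count  # "2010-11-27" -> "2010-11"
    return monthly


def review_queue(monthly, month, hostnames, min_queries=MIN_QUERIES):
    """Return the IPs of one month that survive the noise filter, sorted."""
    queue = []
    for (m, ip), count in monthly.items():
        if m != month or count < min_queries:
            continue  # wrong month, or too few lookups to matter
        if "dynamic" in hostnames.get(ip, "").lower():
            continue  # hostname suggests dynamic/end-user space
        queue.append(ip)
    return sorted(queue)
```

Here `hostnames` is assumed to be a pre-computed mapping from IP to reverse-DNS name; doing those lookups is left out of the sketch.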
From all IPs (including the filtered ones and those already assigned), we compute “magnitudes”. A magnitude basically indicates an IP’s share of total world-wide e-mail traffic. Now, we do not directly measure e-mail traffic, but DNS lookups, from which we infer e-mail traffic based on the assumption that those with many DNS lookups are those with a lot of e-mail.
Given the caching mechanisms in DNS, our setup has a tendency to under-estimate the volume of the big senders. It’s a flaw we are willing to accept, especially as this effect is distributed over a very large number of IPs and so does not heavily distort the analysis.
Since an individual IP generally has a very low percentage of overall e-mail traffic, we would be dealing with very small numbers. We therefore use logarithmic magnitudes (see table at the bottom of this posting). All the “unassigned” IPs together are usually in the area of magnitude 9.0, ie about 10%. These 10% are (with some daily/weekly fluctuation) about 100’000 IPs. A considerable number of these IPs are later found to be snow-shoe spam or otherwise spammish and are thrown away, so we estimate that there are still between 50’000 and 75’000 IPs which we have not covered (ie a total of 150’000 to 175’000 e-mail sending IPs).
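As a rough illustration of the logarithmic scale: the only anchor point given in this posting is that magnitude 9.0 corresponds to about 10% of lookups. The sketch below assumes a base-10 scale on which magnitude 10 would mean 100%, which is consistent with that anchor point but is an assumption of this example. The function names are hypothetical as well.

```python
import math


def magnitude(lookups, total_lookups):
    """Logarithmic magnitude of a share of all lookups.

    Assumes magnitude = 10 + log10(share), so that a 10% share
    lands at about 9.0. The scale itself is an assumption here.
    """
    share = lookups / total_lookups
    return 10.0 + math.log10(share)


def share_percent(mag):
    """Inverse of magnitude(): percentage of overall lookups."""
    return 100.0 * 10 ** (mag - 10.0)
```

Under this assumption, an IP responsible for 1 in 10 lookups gets a magnitude of about 9.0, and each step down by one magnitude means a tenth of the share. The scale keeps the numbers readable even for IPs that contribute only a tiny fraction of a percent.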
Notes about our data
- We collect DNS usage data from six of our 14 mirrors. While it can be rather safely assumed that the large senders will be covered in any case, there may be regional differences which we do not account for (there is some regional bias in the geographical distribution of our nameservers).
- There is a considerable number of outright broken DNS queries to our nameservers. RFC1918 IPs, hostnames instead of IPs, Multicast IPs, IPs from 18.104.22.168/8 before it was assigned – and most likely there are many more erroneous setups which we cannot distinguish from regular traffic. We assume that this will not unduly distort our data in aggregated form.
- Some IPs are observed only for a short period of time. This may be typical snow-shoe behavior or other similar usage patterns. Strictly speaking, we should remove those IPs. We are however deliberately slow in removing them, because they may only be used sporadically (newsletters once a month, …).
- We only started to collect full magnitude data a short while ago when we moved the database to a dedicated, big, fat machine. The monthly magnitudes seem to be pretty stable, but the time before Christmas may have unusual traffic patterns.
- We do not have such usage data for traffic between the big players in the e-mail space (eg between Yahoo and Hotmail) or internal to such and other organisations. Our data is based on the observation of DNSWL queries from the mostly small- to mid-size organisations that use our whitelisting information.
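Some of the broken queries mentioned in the notes above are easy to recognise mechanically. The following sketch shows how the three simplest cases (hostnames instead of IPs, RFC1918 addresses, multicast addresses) could be told apart with Python’s standard `ipaddress` module. The function name `classify` and its labels are made up for this example. Detecting queries for address space that was not yet assigned at the time would need a dated allocation table and is not attempted here.

```python
import ipaddress

# The three private IPv4 ranges defined in RFC 1918.
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def classify(queried):
    """Label a queried value as 'not-an-ip', 'rfc1918', 'multicast' or 'ok'."""
    try:
        ip = ipaddress.IPv4Address(queried)
    except ipaddress.AddressValueError:
        return "not-an-ip"  # e.g. a hostname was queried instead of an IP
    if any(ip in net for net in RFC1918):
        return "rfc1918"
    if ip.is_multicast:
        return "multicast"
    return "ok"
```

A query that comes out as "ok" here may of course still stem from an erroneous setup; the sketch only catches the obvious cases.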
| Magnitude | Percent of overall lookups |