Merge/Purge of E-Mail Addresses

February 7, 2001

Companies that want to send marketing messages to rented e-mail addresses place rental orders with managers and owners of many different lists. Recipients of marketing messages who get more than one copy of the same message tend to be irritated at the duplicate e-mails and are generally less responsive to the marketing messages themselves.

Because the e-mail marketing industry is unsophisticated compared with the traditional postal list world, e-mail list owners are hesitant to provide the actual lists to mailers or their service bureaus. As a result, there is no simple method of ensuring that lists rented from multiple sources do not contain duplicates. There is, however, a methodology that combines patented cryptographic technology with publicly available algorithms and allows list owners to retain possession of their lists while still enabling the elimination of duplicates.

Over the past 40 years, the traditional direct mail and direct marketing world has developed techniques that allow list owners to ship the actual names and addresses of their customers to mailers for various purposes without fear of the names and addresses being misused. The purposes include:

Application of statistical models to identify the best prospects.
Deduplication among the multiple sources of lists.
Suppression of prospects on rental files who are already customers of the mailer.
Overlaying demographic and psychographic data to increase response rates.
Segmentation for testing purposes.
Control over the logistics of the overall mailings.
Achieving media and postage cost efficiencies.
Various database functions.

The list owners’ concerns include:

Use of lists for offers not approved by owners.
Mailing of lists on unapproved dates.
Use of lists in excess of the usage contracted (multiple mailings).
Passing lists on to other parties who use the lists without permission.

The universal technique in traditional direct marketing is the use of decoy, or seed, names. In this process, the list owner, or its computer service bureau, inserts a unique set of records into each list order that is shipped out. While these records look like and have the same format as the legitimate records on the file, an identifier unique to each list rental order is inserted somewhere within each decoy record.

This allows the seed names to pass the normal processing steps of the mailer without giving any indication that they are anything other than real customer names and addresses. The list owner keeps the nature of the seed names secret.

Because the decoy record is created by the list owner or its service bureau, the detail in that record exists nowhere outside the control of the list owner. Therefore, all mail received by these decoys has to originate from specifically approved list rental transactions. When these decoy records receive mail, the coded information allows the list owner to identify the original list rental file that was sent out and to monitor the usage.

The technique has proved sufficiently secure that, after a few attempts by dishonest mailers were thwarted and successfully prosecuted, misappropriation of a list has become rare. Despite the hundreds of thousands of list orders that are fulfilled each year, fewer than five cases of misappropriation have been reported publicly over the past two years. In most of the reported cases, the misappropriation has proved to be the result of human error, not malice.

In the traditional postal world, identifying duplicates is no longer a problem.

The E-Mail World

The use of e-mail addresses for marketing is a relatively new phenomenon. True direct marketing to e-mail addresses was first recorded in 1994, and in most cases, the marketers were Internet companies, not traditional merchandisers. They had no understanding of, or experience in, the traditional postal world and so created their own set of experiences to draw on. In addition, many of the e-mail marketers began their careers in the shadier segment of marketing, where vendors existed in cyberspace, and there was no physical framework to allow for normal validation of a vendor’s genuineness. Many of the early e-mail marketing campaigns were dishonest offers mailed by unscrupulous marketers.

As a result, the e-mail industry adopted a standard practice whereby the list owners themselves retained possession of their customer e-mail addresses, and marketers who rented the lists had to rely on their message being sent out by the list owners themselves. This led to three major problems:

If a mailer rented names from more than one list owner, there was a good chance that some people would receive the same e-mail several times, once from each list owner whose list they appeared on.
A mailer could not ensure that his existing customers would not receive solicitations, with obvious negative feedback from his existing customers.
There was no consistency in the formatting of the messages and, as a result, no easy way to compare the results of mailing to two different lists accurately.

In the halcyon days of e-mail marketing, mailers lived with these limitations because results were generally good and recipients were unsophisticated. But as Internet use has become more widespread, this has changed. Response rates are being carefully measured for the first time, and mailers are realizing that duplicate messages are expensive.

The e-mail world is not yet sufficiently populated with sophisticated traditional mailers, and so list owners still refuse to release e-mail lists to mailers. It is likely that the first step in the evolution of the e-mail world will be to ship files to trusted third-party service bureaus. In the absence of that, a method described as "merge/purge by proxy" meets the requirements and overcomes the objections of both mailers and list owners.

Merge/Purge by Proxy

In the computing world, a technique known as "hashing" has been developed. Among other things, hashing can be used to obtain a single integer value that represents, usually uniquely, a string of data. Simply put, an arithmetic calculation is applied to a sentence, word or set of characters, and the result is a single number. Given the integer values for two strings, one can quickly determine whether the strings are the same. In other words, we can convert English sentences into numeric values, and by comparing the values, we can easily tell whether the sentences are different.

As a simple example, let’s look at these two sentences:

The elephant is blue.

The monkey laughs.

If we were to apply a simple formula to these two sentences, wherein we apply a numeric value to each letter, starting with a=1, b=2, c=3... z=26, and add the numbers, we would end up with two integers.

The elephant is blue = 20+8+5+5+12+5+16+8+1+14+20+9+19+2+12+21+5 = 182.

The monkey laughs = 20+8+5+13+15+14+11+5+25+12+1+21+7+8+19 = 184.
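
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the toy formula; the function name letter_sum is an illustrative choice, and the decision to ignore spaces and punctuation is an assumption that matches the sums shown.

```python
def letter_sum(text: str) -> int:
    """Sum letter positions (a=1 ... z=26), ignoring anything that is not a-z."""
    return sum(ord(ch) - ord("a") + 1 for ch in text.lower() if "a" <= ch <= "z")


print(letter_sum("The elephant is blue."))  # 182
print(letter_sum("The monkey laughs."))     # 184
```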

To see whether the two pieces of data are the same, instead of having to compare every character, we look at the resultant integers. Obviously, 182 does not equal 184, so the data must be different. Let’s transfer this knowledge to the e-mail world.

If a list owner applied the simple, public formula we have above (a=1, b=2, etc...) to each e-mail address on the list segment it supplied to a mailer and the mailer applied the same formula to his list of e-mail addresses, any e-mail addresses in the list owner’s file that did not match the mailer’s customer file would, by definition, be unique, and safe to mail. Of course, there is a practical flaw in this: A simple formula like this applied to a big list would create many records on the mailer’s own list with the same numbers. For example, consider the following two sentences:

The elephant is neat = 20+8+5+5+12+5+16+8+1+14+20+9+19+14+5+1+20 = 182.

The elephant is blue = 20+8+5+5+12+5+16+8+1+14+20+9+19+2+12+21+5 = 182.

In this example, if someone gave you the number 182, you would not be able to tell which of the two sentences they meant; thousands of sentences could generate the same number. We do know that if the numbers are different, the sentences must be different, but possessing the number tells us nothing certain about the original sentence. For the most accurate comparison of e-mail lists we would prefer unique numbers, yet the e-mail application also requires that no one can recover an e-mail address from its corresponding integer. If the translation produces unique integers, it raises the concern that a reverse translation could be constructed, that is, that someone who knows the number could reverse-engineer the original e-mail address.
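
The same weakness shows up as soon as the toy formula is applied to addresses. The sketch below groups a handful of made-up addresses (all on the reserved example.com domain) by their letter sum; any two addresses whose letters are merely rearranged land on the same number. The helper name colliding_groups is hypothetical.

```python
from collections import defaultdict


def letter_sum(text: str) -> int:
    """Toy formula from the text: a=1 ... z=26, everything else ignored."""
    return sum(ord(ch) - ord("a") + 1 for ch in text.lower() if "a" <= ch <= "z")


def colliding_groups(addresses):
    """Group addresses by letter sum and keep only sums shared by two or more."""
    groups = defaultdict(list)
    for addr in addresses:
        groups[letter_sum(addr)].append(addr)
    return {total: addrs for total, addrs in groups.items() if len(addrs) > 1}


sample = ["amy@example.com", "may@example.com", "bob@example.com"]
print(colliding_groups(sample))
# {146: ['amy@example.com', 'may@example.com']}
```

A match on 146 would wrongly suppress one of two different people, which is why the toy formula cannot be used for real deduplication.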

As we note above, the last two sentences have the same number but are different. So we have to look for an algorithm, or formula, that will ensure, for all practical purposes, that each unique e-mail address produces a unique resultant. In mathematics and in cryptography, several algorithms will allow this to happen. The difficult part is to create a unique answer that cannot be reversed. In other words, we must ensure that it is practically impossible to take the resultant numbers and re-create the original address. One way to do this is to ensure that when the hash is calculated, an insignificant part of the intermediate data is discarded.

A solution to this challenge can be found easily in the mathematical world. Probably the most appropriate algorithm is known as MD5, or Message-Digest Algorithm 5. An alternative appropriate algorithm is SHA-1, or Secure Hash Algorithm 1.
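
Both algorithms are available in Python's standard hashlib module, so hashing an address takes only a few lines. The sketch below is illustrative: the function name hash_address, the choice of SHA-1 as the default, and the UTF-8 encoding of the address are assumptions, not details given in this article.

```python
import hashlib


def hash_address(address: str, algorithm: str = "sha1") -> str:
    """Return the hexadecimal digest of an e-mail address.

    `algorithm` is any name accepted by hashlib.new(), e.g. "sha1" or "md5".
    """
    digest = hashlib.new(algorithm)
    digest.update(address.encode("utf-8"))
    return digest.hexdigest()


h1 = hash_address("amy@example.com")
h2 = hash_address("may@example.com")
print(len(h1), h1 != h2)  # 40 True
```

Unlike the letter-sum formula, rearranging the letters of an address produces a completely different digest, and the same address always produces the same digest no matter who computes it.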

By using one of these algorithms, "merge/purge by proxy" (M/PBP) will allow list owners to protect the confidentiality of their names while allowing mailers to deduplicate their mailings.

To provide a method of deduplicating e-mail lists using merge/purge by proxy, these steps are taken in a hypothetical situation:

Mailer A orders 10,000 e-mail addresses from each of 10 list owners, Owner 1 through Owner 10 inclusive. Together with the list orders, Mailer A provides the specifics of the algorithm employed. Each list owner applies the algorithm to his e-mail list and creates a list of 10,000 hash values. These hashed files are sent to a neutral third-party service bureau. The mailer also applies the algorithm to the e-mail addresses of his house file. He then forwards this file of hashes to the independent third-party service bureau.

The service bureau then goes through these steps:

The bureau examines each record from the house file and deletes any matching hashes it identifies on each of the 10 rental files. At this point, the service bureau has ensured that the mailer will not mail to an existing customer.
The service bureau then merges the 10 files and identifies all records that occur on more than one of the 10 lists. Using a fair allocation system, the service bureau suppresses all but one occurrence of any given hash. The bureau now can ensure that no duplicates remain in any of the rental lists, and no more than one e-mail message will be sent to any one address.
The bureau splits the unique file back into its 10 component lists and returns those files to their list owners.
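
The three bureau steps above can be sketched as follows. The bureau only ever sees hash values, shown here as short placeholder strings. The function name bureau_merge_purge and the data layout (a dictionary of owner name to list of hashes) are assumptions for illustration, and the allocation rule used here, where the first file containing a hash keeps it, is a deliberate simplification rather than the fair allocation system the process calls for.

```python
def bureau_merge_purge(house_hashes, rental_files):
    """Suppress house-file hashes, then keep one occurrence of every other hash.

    rental_files maps owner name -> list of hashes. The result has the same
    shape, so each owner's surviving hashes can be returned to that owner.
    """
    house = set(house_hashes)
    seen = set()  # hashes already allocated to some owner
    result = {}
    for owner, hashes in rental_files.items():
        kept = []
        for h in hashes:
            if h in house or h in seen:
                continue  # existing customer, or duplicate of an earlier record
            seen.add(h)
            kept.append(h)
        result[owner] = kept
    return result


rentals = {
    "owner1": ["h1", "h2", "h3"],
    "owner2": ["h2", "h4", "h9"],
    "owner3": ["h3", "h4", "h5"],
}
cleaned = bureau_merge_purge(house_hashes=["h9"], rental_files=rentals)
print(cleaned)
# {'owner1': ['h1', 'h2', 'h3'], 'owner2': ['h4'], 'owner3': ['h5']}
```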

The list owners now have a file of hashes known to represent e-mail addresses that do not appear on either the mailer’s customer file or on any of the other nine post-deduplication files. The list owners then match the hashes to their master files and extract the corresponding e-mail addresses. These are then the addresses that are to be mailed for the mailer.
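
On the list owner's side, matching the returned hashes back to addresses is a simple lookup, because the owner can recompute the hash of every address on the master file. A minimal sketch, assuming SHA-1 is the agreed algorithm and using the hypothetical names hash_address and addresses_to_mail:

```python
import hashlib


def hash_address(address: str) -> str:
    """Hash one address with the algorithm agreed for the order (SHA-1 assumed)."""
    return hashlib.sha1(address.encode("utf-8")).hexdigest()


def addresses_to_mail(master_file, returned_hashes):
    """Return the master-file addresses whose hashes came back from the bureau."""
    by_hash = {hash_address(addr): addr for addr in master_file}
    return [by_hash[h] for h in returned_hashes if h in by_hash]


master = ["amy@example.com", "bob@example.com", "cat@example.com"]
returned = [hash_address("amy@example.com"), hash_address("cat@example.com")]
print(addresses_to_mail(master, returned))
# ['amy@example.com', 'cat@example.com']
```

Only the owner holds the table that links a hash to an address, so the addresses themselves never leave the owner's possession.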

Practical Issues

To ensure that each list owner is fairly compensated for records that also appear on other lists, the first duplicate between list 1 and list 2 is allocated to list 1, and the second is allocated to list 2, etc. This also applies to triplicates and so on. It is conceivable that an address may occur on all 10 lists, and so tables are kept to ensure that allocations are fair.
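
One possible reading of this rotation is sketched below. For every group of lists that share a duplicate, a table records whose turn it is, and successive duplicates are handed to each list in the group in turn. The function name allocate_fairly and the exact bookkeeping are assumptions; the article does not specify how the tables are kept.

```python
from collections import defaultdict


def allocate_fairly(rental_files):
    """Give each hash to exactly one owner, rotating among owners that share it.

    rental_files maps owner name -> list of hashes.
    """
    owners_of = defaultdict(list)
    for owner, hashes in rental_files.items():
        for h in dict.fromkeys(hashes):  # drop repeats within one file, keep order
            owners_of[h].append(owner)

    turn = defaultdict(int)  # the "table": how many duplicates each group has had
    result = {owner: [] for owner in rental_files}
    for h, owners in owners_of.items():
        group = tuple(owners)
        winner = group[turn[group] % len(group)]
        turn[group] += 1
        result[winner].append(h)
    return result


files = {
    "list1": ["a", "b", "c", "d"],
    "list2": ["a", "b", "c", "e"],
}
print(allocate_fairly(files))
# {'list1': ['a', 'c', 'd'], 'list2': ['b', 'e']}
```

Here the first shared record goes to list 1, the second to list 2, and the third back to list 1, while records unique to one list always stay with that list.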

Care should be taken to ensure that addresses are in a canonical or basic form so that various permitted derivations of a unique address are not identified as unique addresses themselves. This includes ensuring that addresses in the form name+[some random value]@domain name are edited back to name@domain name, and dealing with upper- and lower-case characters.
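
A minimal sketch of such a canonical form follows. It assumes every party agrees to lower-case the whole address and to drop anything from the first "+" up to the "@"; the function name canonicalize is hypothetical, and input without an "@" is not handled. Whatever rules are chosen, they must be applied identically by the mailer and by every list owner before hashing, or matching addresses will produce different hashes.

```python
def canonicalize(address: str) -> str:
    """Lower-case an address and drop any +tag from the part before the @."""
    local, _, domain = address.strip().rpartition("@")
    local = local.split("+", 1)[0]  # name+tag -> name
    return f"{local}@{domain}".lower()


print(canonicalize("Amy+News2001@Example.COM"))  # amy@example.com
```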

Until the e-mail list world becomes comfortable with the systems used by the traditional list world to maintain the integrity of lists in the face of theft and misuse, "merge/purge by proxy" provides the only acceptable solution.

Rodney Joffe is founder of Whitehat.com, Tempe, AZ.