Merge/Purge of E-Mail Addresses
February 7, 2001
Companies that want to send marketing messages to rented e-mail addresses place rental orders with managers and owners of many different lists. Recipients of marketing messages who get more than one copy of the same message tend to be irritated at the duplicate e-mails and are generally less responsive to the marketing messages themselves.
Because the e-mail marketing industry is unsophisticated compared with the traditional postal list world, e-mail list owners are hesitant to provide the actual lists to mailers or their service bureaus. As a result, there is no simple method of ensuring that lists rented from multiple sources do not contain duplicates. There is a methodology that incorporates patented cryptographic technology as well as publicly available algorithms and allows list owners to retain possession of their lists while still enabling the elimination of duplicates.
Over the past 40 years, the traditional direct mail and direct marketing world has developed techniques that allow list owners to ship the actual names and addresses of their customers to mailers for various purposes without fear of the names and addresses being misused. The purposes include:
- Application of statistical models to identify the best prospects.
- Deduplication among the multiple sources of lists.
- Suppression of prospects on rental files who are already customers of the mailer.
- Overlaying demographic and psychographic data to increase response rates.
- Segmentation for testing purposes.
- Control over the logistics of the overall mailings.
- Achieving media and postage cost efficiencies.
- Various database functions.
The list owners’ concerns include:
- Use of lists for offers not approved by owners.
- Mailing of lists on unapproved dates.
- Use of lists in excess of the usage contracted (multiple mailings).
- Passing lists on to other parties who use the lists without permission.
The universal technique in traditional direct marketing is the use of decoy, or seed, names. In this process, the list owner, or its computer service bureau, inserts a unique set of records into each list order that is shipped out. While these records look like and have the same format as the legitimate records on the file, an identifier unique to each list rental order is inserted somewhere within each decoy record.
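Because e-mail list owners did not release their lists to mailers, early e-mail marketing had several limitations:
- If a mailer rented names from more than one list owner, there was a good chance that some people would receive duplicate e-mails, one from each list owner.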
- A mailer could not ensure that his existing customers would not receive solicitations, with obvious negative feedback from his existing customers.
- There was no consistency in the formatting of the messages and, as a result, no easy way to compare the results of mailing to two different lists accurately.
In the halcyon days of e-mail marketing, mailers lived with these limitations because results were generally good and recipients were unsophisticated. But as Internet use has become more widespread, this has changed. Response rates are being carefully measured for the first time, and mailers are realizing that duplicate messages are expensive.
The e-mail world is not yet sufficiently populated with sophisticated traditional mailers, and so list owners still refuse to release e-mail lists to mailers. It is likely that the first step in the evolution of the e-mail world will be to ship files to trusted third-party service bureaus. In the absence of that, a method described as "merge/purge by proxy" meets the requirements and overcomes the objections of both mailers and list owners.
Merge/Purge by Proxy
In the computing world, a technique known as "hashing" has been developed. Hashing can be used to, among other things, obtain a single integer value that can represent, usually uniquely, a string of data. Simply put, an arithmetic calculation is applied to a sentence, word or set of characters, and the result is a single number. Given the integer values for two strings, one can quickly determine whether the strings are the same. In other words, we can convert English sentences into numeric values, and by comparing the values, we can easily tell whether the sentences are different.
As a simple example, let’s look at these two sentences:
The elephant is blue.
The monkey laughs.
If we were to apply a simple formula to these two sentences, wherein we apply a numeric value to each letter, starting with a=1, b=2, c=3... z=26, and add the numbers, we would end up with two integers.
The elephant is blue = 20+8+5+5+12+5+16+8+1+14+20+9+19+2+12+21+5 = 182.
The monkey laughs = 20+8+5+13+15+14+11+5+25+12+1+21+7+8+19 = 184.
To see whether the two pieces of data are the same, instead of having to compare every character, we look at the resultant integers. Obviously, 182 does not equal 184, so the data must be different. Let’s transfer this knowledge to the e-mail world.
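The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the toy formula, not a usable hash; the function name `letter_sum` is an illustrative choice, and the sketch assumes that spaces and punctuation count as zero.

```python
def letter_sum(text: str) -> int:
    """Add up letter positions (a=1 ... z=26), ignoring case and non-letters."""
    return sum(ord(ch) - ord("a") + 1 for ch in text.lower() if "a" <= ch <= "z")


blue = letter_sum("The elephant is blue.")
monkey = letter_sum("The monkey laughs.")

print(blue, monkey)    # 182 184
print(blue == monkey)  # False: different numbers, so the sentences must differ
```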
If a list owner applied the simple, public formula above (a=1, b=2, etc.) to each e-mail address on the list segment it supplied to a mailer, and the mailer applied the same formula to his own list of e-mail addresses, any number in the list owner’s file that did not match a number in the mailer’s customer file would, by definition, represent an address the mailer did not already have, and so would be safe to mail. Of course, there is a practical flaw in this: A simple formula like this applied to a big list would assign the same number to many different records. For example, consider the following two sentences:
The elephant is neat = 20+8+5+5+12+5+16+8+1+14+20+9+19+14+5+1+20 = 182.
The elephant is blue = 20+8+5+5+12+5+16+8+1+14+20+9+19+2+12+21+5 = 182.
In this example, if someone gave you the number 182, you would not be able to tell which of the two sentences they meant. There are thousands of sentences that could generate the same number: 182. We do know that if the numbers were different, the sentences would have to be different. But we have no idea what the original sentence was by being in possession of the number. While we would prefer unique numbers to provide the most accurate comparison of e-mail lists, the e-mail application also requires that one cannot recover an e-mail address given its corresponding integer. If the translation produces unique integers, this raises the concern of whether a reverse translation can be constructed (i.e. whether one can reverse-engineer the e-mail address from the integer). In other words, we want to make sure that if someone knows the number, they can’t translate it back to the original e-mail address.
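To make the ambiguity concrete, the sketch below groups a few sentences by their number under the same toy formula (a=1 ... z=26, other characters ignored). The names `letter_sum` and `by_value` are illustrative only.

```python
from collections import defaultdict


def letter_sum(text: str) -> int:
    """Toy formula from the text: a=1 ... z=26, everything else counts as 0."""
    return sum(ord(ch) - ord("a") + 1 for ch in text.lower() if "a" <= ch <= "z")


sentences = ["The elephant is blue", "The elephant is neat", "The monkey laughs"]

# Reverse lookup: which sentences could a given number stand for?
by_value = defaultdict(list)
for sentence in sentences:
    by_value[letter_sum(sentence)].append(sentence)

print(by_value[182])  # ['The elephant is blue', 'The elephant is neat']
print(by_value[184])  # ['The monkey laughs']
```

Holding the number 182 alone does not tell you which sentence produced it, which is good for confidentiality but bad for accurate matching.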
As noted above, the last two sentences are different but produce the same number. So we have to look for an algorithm, or formula, under which each unique e-mail address will, for all practical purposes, produce a unique resultant. In mathematics and in cryptography, several algorithms allow this. The difficult part is to create a unique answer that cannot be reversed. In other words, it must be computationally infeasible to take the resultant numbers and re-create the original address. One way to do this is to ensure that when the hash is calculated, an insignificant part of the intermediate data is discarded.
A solution to this challenge can be found easily in the mathematical world. Probably the most appropriate algorithm is known as MD5, or Message-Digest Algorithm 5. An appropriate alternative is SHA-1, or Secure Hash Algorithm 1.
By using one of these algorithms, "merge/purge by proxy" (M/PBP) will allow list owners to protect the confidentiality of their names while allowing mailers to deduplicate their mailings.
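As a rough illustration, both algorithms are available in Python's standard `hashlib` module. The helper name `hash_address` and the sample addresses are hypothetical, and the normalization step (trimming whitespace and lowercasing) is an assumption: the article does not say how addresses should be prepared, only that every party must apply the same algorithm.

```python
import hashlib


def hash_address(address: str, algorithm: str = "md5") -> str:
    """Return the hex digest of an e-mail address.

    Assumption: all parties normalize the same way (trim + lowercase)
    so that identical addresses always yield identical hashes.
    """
    normalized = address.strip().lower()
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()


a = hash_address("Jane.Doe@example.com")
b = hash_address(" jane.doe@example.com ")
c = hash_address("john.roe@example.com")

print(a == b)  # True: same address after normalization
print(a == c)  # False: different addresses
print(len(a), len(hash_address("jane.doe@example.com", "sha1")))  # 32 40
```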
To provide a method of deduplicating e-mail lists using merge/purge by proxy, these steps are taken in a hypothetical situation:
Mailer A orders 10,000 e-mail addresses from each of 10 list owners, Owner 1 through Owner 10. Together with the list orders, Mailer A provides the specifics of the algorithm to be used. Each list owner applies the algorithm to his e-mail list and creates a list of 10,000 hash values. These hashed files are sent to a neutral third-party service bureau. The mailer also applies the algorithm to the e-mail addresses of his house file. He then forwards this file of hashes to the same service bureau.
The service bureau then goes through these steps:
- The bureau examines each record from the house file and deletes any matching hashes it identifies on each of the 10 rental files. At this point, the service bureau has ensured that the mailer will not mail to an existing customer.
- The service bureau then merges the 10 files and identifies all records that occur on more than one of the 10 lists. Using a fair allocation system, the service bureau suppresses all but one occurrence of any given hash. The bureau now can ensure that no duplicates remain in any of the rental lists, and no more than one e-mail message will be sent to any one address.
- The bureau splits the unique file back into its 10 component lists and returns those files to their list owners.
The list owners now have a file of hashes known to represent e-mail addresses that do not appear on either the mailer’s customer file or on any of the other nine post-deduplication files. The list owners then match the hashes to their master files and extract the corresponding e-mail addresses. These are then the addresses that are to be mailed for the mailer.
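The whole exchange can be sketched in a few functions. This is a minimal illustration under stated assumptions, not a description of any particular product: the function names, the sample addresses, the choice of SHA-1, and the lowercase normalization are all hypothetical, and the "fair allocation" rule shown here (give each duplicate to whichever holder currently has the fewest kept records) is just one possible reading, since the article does not define it.

```python
import hashlib


def hash_address(address: str) -> str:
    """Hash applied identically by the mailer and every list owner (assumed SHA-1)."""
    return hashlib.sha1(address.strip().lower().encode("utf-8")).hexdigest()


def bureau_dedupe(house_hashes, rental_files):
    """Service bureau side: works on hashes only, never on addresses.

    rental_files maps an owner name to that owner's list of hashes.
    Returns a mapping of owner name to the set of hashes that owner may mail.
    """
    house = set(house_hashes)
    holders = {}  # hash -> owners that supplied it
    for owner, hashes in rental_files.items():
        for value in set(hashes) - house:  # step 1: suppress existing customers
            holders.setdefault(value, []).append(owner)

    kept = {owner: set() for owner in rental_files}
    # Step 2: keep exactly one occurrence of each hash. Hashes held by a single
    # owner are placed first; duplicates then go to the least-loaded holder.
    for value in sorted(holders, key=lambda v: (len(holders[v]), v)):
        winner = min(holders[value], key=lambda o: len(kept[o]))
        kept[winner].add(value)
    return kept  # step 3: one file per owner, returned to that owner


def owner_resolve(master_addresses, kept_hashes):
    """List owner side: map returned hashes back to addresses on the master file."""
    lookup = {hash_address(a): a for a in master_addresses}
    return sorted(lookup[v] for v in kept_hashes if v in lookup)


# Hypothetical data: one house file and two rental lists.
house_file = ["carol@example.com"]
masters = {
    "Owner 1": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "Owner 2": ["bob@example.com", "dave@example.com"],
}

kept = bureau_dedupe(
    [hash_address(a) for a in house_file],
    {owner: [hash_address(a) for a in addrs] for owner, addrs in masters.items()},
)
mailed = {owner: owner_resolve(masters[owner], kept[owner]) for owner in masters}
print(mailed)
```

In this sketch the bureau never handles an address, and each owner only learns which of its own records survived.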
Rodney Joffe is founder of Whitehat.com, Tempe, AZ