| Contact Analysis | |
|---|---|
| Topic Started: May 18 2008, 06:40 PM (235 Views) | |
| jdege | May 18 2008, 06:40 PM Post #1 |
|
Elite member
|
Contact Analysis

In cryptanalysis, contact analysis is the study of the frequency with which certain symbols precede or follow other symbols. The method is used as an aid to breaking classical ciphers.

Contact analysis is based on the fact that, in any sample of any written language, certain symbols appear adjacent to other symbols with varying frequencies. Moreover, these frequencies are roughly the same for almost all samples of that language, even when the distribution of the symbols themselves differs significantly from normal. This is true regardless of whether the symbols being used are words or letters. In some ciphers, these properties of the natural language plaintext are preserved in the ciphertext, and have the potential to be exploited in a ciphertext-only attack.

Although in a sense contact analysis can be considered a type of frequency analysis, most discussions of frequency analysis concern themselves with the simple probabilities of the symbols in the text:

P(X(i) = a) or P(X(i) = a ∩ X(i+1) = b)

Contact analysis is based on the conditional probability that certain letters will precede or succeed other letters:

P(X(i) = b | X(i-1) = a), or
P(X(i) = c | X(i-2) = a ∩ X(i-1) = b), or even
P(X(i) in S | X(i-1) in T ∩ X(i+1) in T),

where S and T are subsets of the alphabet being used. Where frequency analysis is based on first-order statistics, contact analysis is based on second or third-order statistics.

The process of contact analysis often begins with an attempt to categorize the symbols into classes that have different contact behavior. When the symbols represent letters, the first attempted categorization is usually between vowels and consonants, which have very different contact behaviors in all alphabetic languages.

Contact analysis for simple substitution ciphers

A common technique for applying the principles of contact analysis to simple substitution ciphers is the consonant line method.
In this method, the cryptanalyst first creates a chart indicating which ciphertext letters appear next to which other ciphertext letters. From this, a determination is made as to which letters are vowels and which are consonants.

English, as with other alphabetic languages, uses letters to spell out phonetically pronounceable words. This places limitations on the sequence of letters. Vowels and consonants tend to alternate, and when we see consecutive vowels or consonants, they are limited to those which can be pronounced.

Vowels tend to occur adjacent to consonants, and consonants tend to occur next to vowels. Each will be found in contact with the letters of the other class more often than with letters of its own.

Vowels tend to contact more distinct letters than do consonants. Vowels commonly contact many different consonants, while consonants most commonly contact the much smaller set of vowels.

When vowels contact other vowels they show typical patterns: e rarely contacts o, ea is common, ae is rare, and ei and ie are both common.

Using these relationships, many letters within the ciphertext can be classified as vowels (AEIOUY), high-frequency consonants (TNSHRDL), medium-frequency consonants (CMWFGPB), and low-frequency consonants (VKJXQZ), and certain letters within each group can be given provisional identifications.

An example

(For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext or guesses at plaintext, and X~t is used to express a guess that ciphertext letter X represents the plaintext letter t.)

Suppose Eve has intercepted the cryptogram below, and suspects that it is an English text encrypted using a simple substitution cipher:

She will have prepared herself with charts describing the conditional probabilities for the plaintext language. Her first task is to create a contact chart.
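The charts of conditional probabilities that Eve prepares in advance could be estimated from any large sample of the plaintext language. The following is only a minimal sketch of that idea, not the method of any particular reference: the function name `successor_probabilities` and the tiny sample sentence are made up for illustration, and a real chart would need a much larger corpus.

```python
from collections import Counter, defaultdict


def successor_probabilities(text):
    """Estimate P(X(i) = b | X(i-1) = a) for every letter pair seen in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    pair_counts = Counter(zip(letters, letters[1:]))
    # Count each letter only where it has a successor, so each row sums to 1.
    first_counts = Counter(letters[:-1])
    table = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        table[a][b] = n / first_counts[a]
    return table


# Tiny stand-in sample (an assumption); a real chart needs a large corpus.
sample = "the theory of the thing is that the other thing then thins out"
probs = successor_probabilities(sample)
for follower, p in sorted(probs["t"].items(), key=lambda kv: -kv[1]):
    print(f"P({follower} | t) = {p:.2f}")
```

Even on a sample this small, h dominates the letters that follow t, which is the kind of regularity the rest of the method leans on.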
Contact Chart

A contact chart is a table with a column for every letter in the ciphertext, under which is written a pair of letters for each instance of that letter in the ciphertext, the first being the letter prior to that instance of the letter, the second being the letter after. At the top of each column two numbers are written: the number of times that letter appears in the ciphertext (the letter frequency, which is the same as the number of letter pairs in that column), and the number of distinct letters that appear in that column, called the Variety of Contact or VOC.

Notice how in the column for H, there are 26 letter pairs, which indicates that H appears 26 times, but there are only 16 distinct letters in that column, giving H a VOC of 16.

While preparing this chart would have taken Eve a bit longer than generating a simple frequency count, it provides her with far more information about the structure of the text. It includes the single-character frequency count as the top line of the chart. She can easily calculate the digram frequencies by counting the number of times each distinct letter appears in the left position in each column. HK appears six times. She can calculate the trigram frequencies by counting the number of times each distinct letter pair appears in each column. HKR appears three times.

In practice, she would not list all of the trigram and digram frequencies, but would simply identify the highest by visual inspection of the chart, perhaps highlighting those of interest. In this ciphertext, the most frequent letters are R, H, and X, the most frequent digrams are CR, HK, and RO, and the most frequent trigrams are GYN, HKR and YNV.

Just as a ciphertext can sometimes be broken by simple examination alone, or using only simple frequency distributions, sometimes the contact chart alone will provide enough insight into the text to provide the crack necessary to break it. But for the sake of continuing the example, Eve failed to accomplish this.
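For anyone who would rather have a program build the chart, here is a rough sketch of the construction described above. The function names and the "-" placeholder used at the two ends of the text are my own assumptions, and the short demo string is made up (it is not Eve's cryptogram).

```python
from collections import defaultdict


def contact_chart(ciphertext):
    """Map each letter to its list of (letter before, letter after) pairs."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    chart = defaultdict(list)
    for i, ch in enumerate(letters):
        before = letters[i - 1] if i > 0 else "-"  # "-" marks the text edge
        after = letters[i + 1] if i + 1 < len(letters) else "-"
        chart[ch].append((before, after))
    return dict(chart)


def frequency(chart, letter):
    """Letter frequency: the number of pairs in that letter's column."""
    return len(chart.get(letter, []))


def variety_of_contact(chart, letter):
    """VOC: the number of distinct letters appearing in the column."""
    pairs = chart.get(letter, [])
    return len({c for pair in pairs for c in pair if c != "-"})


def digram_count(chart, first, second):
    """Count `first` in the left position of the column for `second`."""
    return sum(1 for before, _ in chart.get(second, []) if before == first)


def trigram_count(chart, first, middle, last):
    """Count the pair (first, last) in the column for `middle`."""
    return chart.get(middle, []).count((first, last))


# Made-up demo text, only to show the shape of the output.
chart = contact_chart("ABCABA")
for letter in sorted(chart):
    print(letter, frequency(chart, letter), variety_of_contact(chart, letter),
          chart[letter])
```

Reading digrams and trigrams straight off the columns, as `digram_count` and `trigram_count` do, mirrors the way Eve counts HK and HKR by eye.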
Identifying vowels and consonants

Eve's next step would be to make a guess as to whether one or more letters were a consonant or a vowel. In the ciphertext, Eve might assume that R is a vowel. It has a very high contact number - 20 - indicating that it contacts every letter in the ciphertext save four.

But Eve knows that the most reliable assumption is usually that low-frequency, low-contact letters are consonants. The low-frequency consonants make up around 20% of a typical text. There are 314 characters in the ciphertext, 20% of which is about 62. Using the chart, Eve identifies the lowest-contact letters that have a combined frequency as close to 62 as possible, without going over.

From these, Eve would begin to build her vowel and consonant lines.

Vowel and Consonant Lines

A Consonant Line takes the form of a tee, with the letters thought to be consonants above the top line, and with the letters that precede or follow the letters at the top written below on each side of the vertical line. A Vowel Line is analogous.

Eve has identified a number of probable consonants. She would have created a consonant line by taking each probable consonant in turn, writing it at the top of the line, and then going down the column for that letter in the contact chart, taking each letter pair and writing the left letter on the left side of the consonant line and the right on the right.

This shows which letters contact the low-frequency consonants. Those that do so the most are likely to be vowels. Any letter that does not contact the low-frequency consonants at all is likely to be a consonant itself, and can be added to the line. In this case, there are none.

This first consonant line gives Eve two probable vowels, R and X. R had been strongly suggested simply by its high VOC. Eve decides to classify R and X as vowels in this pass. So she prepares a vowel line. This convinces Eve that O and H are consonants.
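The first pass (pick the low-contact letters up to about 20% of the text, then tally what touches them) could be sketched roughly as follows. This is an illustration under assumptions: the function names are invented, ties are broken by lower frequency and then alphabetically, and the selection simply stops at the first letter that would exceed the budget, which is a greedy simplification of "as close as possible without going over".

```python
from collections import Counter


def letter_stats(ciphertext):
    """Return the letter list, letter frequencies, and contact sets."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    freq = Counter(letters)
    contacts = {ch: set() for ch in freq}
    for a, b in zip(letters, letters[1:]):
        contacts[a].add(b)
        contacts[b].add(a)
    return letters, freq, contacts


def probable_consonants(freq, contacts, share=0.20):
    """Greedily take the lowest-contact letters within `share` of the text."""
    budget = share * sum(freq.values())
    chosen, used = [], 0
    # Lowest variety of contact first; ties by frequency, then by letter.
    for ch in sorted(freq, key=lambda c: (len(contacts[c]), freq[c], c)):
        if used + freq[ch] > budget:
            break  # stop at the first letter that would go over the budget
        chosen.append(ch)
        used += freq[ch]
    return chosen


def consonant_line(letters, consonants):
    """Tally the letters found to the left and right of the given consonants."""
    top = set(consonants)
    left, right = Counter(), Counter()
    for i, ch in enumerate(letters):
        if ch in top:
            if i > 0:
                left[letters[i - 1]] += 1
            if i + 1 < len(letters):
                right[letters[i + 1]] += 1
    return left, right


# Made-up demo text, not Eve's cryptogram.
letters, freq, contacts = letter_stats("XAXBXCXAXB")
top = probable_consonants(freq, contacts, share=0.5)
left, right = consonant_line(letters, top)
print(top, (left + right).most_common())
```

The letters with the largest combined tally are the vowel candidates; any letter with a tally of zero is a candidate for adding to the top of the line. Calling `consonant_line` with a list of suspected vowels gives the analogous vowel line.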
C also shows up as frequently contacting vowels, but the consonant line shows it contacting consonants about as often, so she'll make no determination for it yet.

Adding O and H produces a second consonant line. Examining this second consonant line convinces Eve that B and N are the other two high-frequency vowels.

Vowel spacing

As Eve was identifying vowels, she was marking up the ciphertext, putting a bar over every letter that represents a vowel. Long consonant clusters are rare in English. Five sequential consonants happen rarely (e.g., "running through"), more than five happen almost not at all. Similarly, long sequences of vowels are rare. She used these sequences, along with the number of contacts shown in the consonant line and the vowel line, in determining which letters were vowels and consonants.

After having identified the four high-frequency vowels, Eve sees that there are no instances of more than two vowels in sequence. Had there been any, it would have been an indication that she had decided incorrectly. She finds only four five-long non-vowel sequences, EZGHQK. Of these, she has already identified E, Z, G, H, and Q as consonants. Either K is a vowel, or one of the others was mistakenly identified. (It's not uncommon for u and y to appear in the low 20%, and to be picked as consonants in the initial division.)

Vowel-vowel contact

Eve's next step is to try to determine which of the four high-frequency vowels are which. She does this by looking at the frequencies of the digrams containing only these four letters.

Eve knows that e is the vowel that is by far most likely to contact other vowels. She provisionally assigns R~e. The vowel that is most likely to precede e is i, which might be N, though the counts are so small as to make it uncertain. She leaves the others, for now, and moves on to the trigrams to see if she can identify the high-frequency consonants.
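The vowel-spacing check from a few paragraphs back is easy to mechanize. The sketch below is only one way to do it; the function names and the default threshold of five are assumptions taken loosely from the "five happen rarely" remark, and the demo uses ordinary English as a stand-in for a ciphertext with a guessed vowel set.

```python
from collections import Counter


def non_vowel_runs(text, vowels):
    """Split the text into maximal runs of letters not marked as vowels."""
    runs, current = [], []
    for ch in text.upper():
        if not ch.isalpha():
            continue  # ignore spaces and punctuation
        if ch in vowels:
            if current:
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


def long_runs(text, vowels, min_len=5):
    """Runs long enough to suggest that a vowel is still unidentified."""
    return [r for r in non_vowel_runs(text, vowels) if len(r) >= min_len]


def letters_in_long_runs(text, vowels, min_len=5):
    """For each letter, the number of long runs it appears in."""
    tally = Counter()
    for run in long_runs(text, vowels, min_len):
        tally.update(set(run))
    return tally


# Plain English as a stand-in; Y has deliberately been left out of the vowels.
text = "STRENGTHS AND RHYTHMS"
print(long_runs(text, set("AEIOU")))
print(long_runs(text, set("AEIOUY")))
```

With Y left out of the vowel set the second long run is nine letters, which is the sort of signal that says a vowel has been missed; adding Y breaks it into two short clusters.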
High-Frequency Consonants

The high-frequency letters in the ciphertext are R, H, X, B, O, C, Y, N, J, and I. Eve has identified R as e, and B, N, and X as high-frequency vowels. Remaining are H, O, C, Y, J, and I, and the high-frequency consonants t, n, s, r and d are likely among them.

Eve examines the high-frequency trigrams to see if they can help her identify the remaining high-frequency consonants.

The one she immediately notices is HKR. She knows R~e, that H is a consonant, and that K is not a, e, i or o. Of the trigrams consisting of two consonants followed by e, "the" is overwhelmingly the most common. She also sees that HH appears as a double letter, which doesn't contradict the assumption. So she provisionally assigns HKR~the.

Now she looks at the digrams, starting with BH. She has assumed that H is t, and is pretty sure that B is a vowel other than e. Of the four possibilities, at, it, ot, and ut, at is the most common. She decides to give B~a a try.

At this point, she has two high-frequency vowels left, i and o, which have to be either XN or NX. Purely arbitrarily, she tries NX~oi.

Examining the partial plaintext, she sees that the message ends BHNXY~atoi_, which doesn't seem right. Both oi and io are reasonably common, but as the ending of a word, ation is common, and atoi? is not. So she changes her mind and tries NXY~ion.

She now has the four high-frequency vowels BRNX~aeio, and three consonants, KYH~hnt.

The remaining high-frequency letters in the ciphertext are O, C, J, and I. The unidentified high-frequency consonants are s, r and d. She looks through the partial plaintext to see where these might fit. She notices:

Most particularly, she notices that the unknown letters in this sequence include repeats of the high-frequency ciphertext letters O and I. O is the higher-frequency of the two, so she mentally replaces it with each of the unidentified high-frequency consonants in turn.
The assignment O~r immediately leads to I~m, and the plaintext
Which is, of course:
At which point her plaintext is:
And from there, breaking the rest is trivial. |
| When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl. | |
|
| Donald | May 19 2008, 12:25 PM Post #2 |
|
Elite member
|
Fantastic! Thank you! I added it to the tutorial list. Funny story: I was researching Variety of Contact just last week. I'm working on a program that generates contact info, and I needed to know more about how it is calculated and how to use it. I've read the section on it in Gaines, but while she has WONDERFUL information, her explanations aren't the most user-friendly in the world. So, I was Googling for more info, and what should come up but a scratch Wikipedia page by jdege himself with an incredibly detailed description of how to use contact info and work on the consonant line. And then here, just a few days later, we get all the details! Thanks, jdege, it was EXACTLY what I was looking for! |
|
| jdege | May 19 2008, 01:35 PM Post #3 |
|
Elite member
|
I'd noticed that the Wikipedia page on cryptanalysis was very weak on attacks on classical ciphers. It mentioned, IIRC, frequency analysis, Kasiski, and IC. Nothing on pattern words, anagramming, or contact tables. I tried to write something, patterned on the existing page on frequency analysis, and it quickly became too long. That's what was on the scratch page you Googled. I took the first part, and posted it as contact analysis, figuring to put a second page up on cluster analysis later.

That's when I hit Wikipedia's citation rules. Writing up something like this is a lot more fun than tracking down citations, trying to figure out which book I had learned these various ideas from. So I pretty much gave up on it. With the recent surge in activity here, I decided to grab what I had intended for Wikipedia, and to rewrite it for here.

When I said I was trying to figure out how to deal with Pats - simple substitution without word separation - this is what I meant. If a Pat is long enough for the statistics to be clear, they're pretty easy to deal with. If you have a good crib - as you do for the first half-dozen Pats in every issue of The Cryptogram - they're pretty easy to deal with. But short Pats can be a problem. For them, these techniques sometimes work.

The basic approach, trying to distinguish vowels from consonants, is pretty much the essential first step. The consonant line is one method for trying to do so. I've seen others, and I often try the others. I've not found any that always work.

Another form of contact chart:

What you look for here are high-frequency letters that make frequent contact with low-frequency letters. These are usually vowels.

Another I've seen was to look at the 8 highest-frequency letters and the 18 highest-frequency digrams. In this case:

What you're looking for is which of the high-frequency letters appears in the greatest number of distinct digrams. This is usually e.
Of the remaining 7 high-frequency letters, the three that are found least often in contact with e are most often a, i, and o. But again, these don't always work.

What works best for me is to lay out all of the info, frequency counts, digrams, trigrams, repeated letters, ABA patterns, contact tables, consonant and vowel lines, etc., and to try to make guesses. Keeping my mind focused on what I'm looking for next is essential. I can read over something like "_anne_e__e_e__e__hatthe" dozens of times, without picking up that it means "canneverrememberwhatthe". It was only when I was looking at that with an eye to seeing where an r might fit, that I recognized the solution.

Ditto for keeping track of vowel separation. If you have your vowels right, your consonant clusters will be small, and you will have few long groups of consonants. When you have some, but not all, of your vowels, you can be assured that the remaining vowels will be among the set of letters that appear in all of your consonant clusters.
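The "8 letters, 18 digrams" heuristic mentioned above is simple enough to try by program. What follows is my own rough reading of it, not a reference implementation: the function names are invented, "contact with e" is counted only within the top-digram list, and ties fall back on frequency order.

```python
from collections import Counter


def top_letters_and_digrams(ciphertext, n_letters=8, n_digrams=18):
    """Return the most frequent letters and the most frequent digrams."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    freq = Counter(letters)
    digrams = Counter(a + b for a, b in zip(letters, letters[1:]))
    top_l = [ch for ch, _ in freq.most_common(n_letters)]
    top_d = dict(digrams.most_common(n_digrams))
    return top_l, top_d


def guess_e(top_l, top_d):
    """The top letter appearing in the most distinct top digrams."""
    return max(top_l, key=lambda ch: sum(1 for d in top_d if ch in d))


def guess_aio(top_l, top_d, e_guess, how_many=3):
    """The top letters seen least often next to the guessed e."""
    def contact_with_e(ch):
        return top_d.get(ch + e_guess, 0) + top_d.get(e_guess + ch, 0)

    rest = [ch for ch in top_l if ch != e_guess]
    return sorted(rest, key=contact_with_e)[:how_many]


# Made-up demo text in which X plays the part of e.
top_l, top_d = top_letters_and_digrams("XAXBXCXD" * 3, n_letters=5)
e_guess = guess_e(top_l, top_d)
print(e_guess, guess_aio(top_l, top_d, e_guess))
```

As with the other shortcuts in this thread, the output is a set of guesses to test against the rest of the evidence, not an answer.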
The idea of using contact behavior to partition the ciphertext into distinct sets isn't limited to simple substitution. I was in a conversation with someone in an SF forum, not long ago, over how strong his cipher system was. What he'd proposed was, in effect, a nomenclator: a small list of code words plus a cipher for spelling out words that weren't in the list. He was adamant that no one would be able to tell which of his code groups represented words and which represented letters. That is, of course, nonsense. The groups representing letters would be used to spell out words - that means they would appear in clusters. Ditto for the groups representing words. Examining the frequency with which each code group appeared in contact with each other group should quickly allow you to partition the code groups into the letters and the words. The letters can then be attacked as simple substitution. The words can then be guessed from context. |
|
| Revelation | May 21 2008, 08:30 AM Post #4 |
|
Administrator
|
Thanks for this great tutorial
We may be able to put some contact analysis in the ADFGVX cracker. |
|
RRRREJMEEEEEPVKLWENFNVJKEEEEEAOLKAFKLXCFZAASDJXZTTTTTTTLSIOWJXMOKLAFJNNKFNXN RAGRBAQEMHIGDJVDSEOXVIYCELFHWLELJFIENXLRATALSJFSLCYTKLASJDKMHGOVOKAJDNMNUITN RRRRLJVEEEEECLYVYHNVPFTAEEEEEMWLMEIRNGLARWJAKJDFLWNTIERJMIPQWOTZEOCXKNUBNXCN RJIRPOWEANFUSNCZVDVZNMSFEKLOEPZLDKDJWSAAAAAAAOERHJCTNCKFRIMVKSOFOMKMANREWNBN RZUDRGXEEEEENFQIDVLQNCKNEEEEEDGLLLLLLAWIOSNCDARLODMTOEJXMILDFJROTKJSDNLVCZNN | |
|
| jdege | May 21 2008, 06:00 PM Post #5 |
|
Elite member
|
Well, once you figure out how to reverse the transposition, you'll have a simple substitution cipher without word divisions, like the one in my example. How might you use contact information in reversing the transposition? I could easily see using digram or trigram statistics in your scoring function. Whether that constitutes contact analysis depends upon how you're looking at it. |
|