jdege
Contact Analysis

In cryptanalysis, contact analysis is the study of the frequency with which certain symbols precede or follow other symbols. The method is used as an aid to breaking classical ciphers.

Contact analysis is based on the fact that, in any sample of any written language, certain symbols appear adjacent to other symbols with varying frequencies. Moreover, these frequencies are roughly the same for almost all samples of that language, even when the distribution of the symbols themselves differs significantly from normal. This is true regardless of whether the symbols being used are words or letters.

In some ciphers, these properties of the natural language plaintext are preserved in the ciphertext, and have the potential to be exploited in a ciphertext-only attack.

Although in a sense contact analysis can be considered a type of frequency analysis, most discussions of frequency analysis concern themselves with the simple probabilities of the symbols in the text: P(X(i) = a) or P(X(i) = a ∩ X(i + 1) = b).

Contact analysis is based on the conditional probability that certain letters will precede or succeed other letters: P(X(i) = b | X(i − 1) = a), or P(X(i) = c | X(i − 2) = a ∩ X(i − 1) = b), or even P(X(i) ∈ S | X(i − 1) ∈ T ∩ X(i + 1) ∈ T), where S and T are subsets of the alphabet being used.
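As a rough illustration of the first of these conditional probabilities, here is a minimal Python sketch that estimates P(X(i) = b | X(i − 1) = a) from a sample of text. The function name conditional_probs and the toy sample are made up for this example, and spaces are dropped so that contacts run across word boundaries, as they do in a cryptogram written in five-letter groups.

```python
from collections import Counter

def conditional_probs(text):
    """Estimate P(X(i) = b | X(i-1) = a) from adjacent letter pairs."""
    letters = [c for c in text.lower() if c.isalpha()]
    # Count every adjacent pair (a, b) and every letter that has a successor.
    pair_counts = Counter(zip(letters, letters[1:]))
    prev_counts = Counter(letters[:-1])
    return {(a, b): n / prev_counts[a] for (a, b), n in pair_counts.items()}

# Toy sample only; real charts would be built from a large body of text.
probs = conditional_probs("the theme then")
print(probs[("t", "h")])   # how often t is followed by h
print(probs[("e", "t")])   # how often e is followed by t
```

A table like this, built from a large sample of the plaintext language, is the kind of chart Eve is assumed to have on hand in the example further down.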

Where frequency analysis is based on first-order statistics, contact analysis is based on second- or third-order statistics.

The process of contact analysis often begins with an attempt to categorize the symbols into classes that have different contact behavior. When the symbols represent letters, the first attempted categorization is usually between vowels and consonants, which have very different contact behaviors in all alphabetic languages.

Contact analysis for simple substitution ciphers

A common technique for applying the principles of contact analysis to simple substitution ciphers is the consonant line method. In this, the cryptanalyst first creates a chart indicating which ciphertext letters appear next to which other ciphertext letters. From this, a determination is made as to which letters are vowels and which are consonants.

English, as with other alphabetic languages, uses letters to spell out phonetically pronounceable words. This places limitations on the sequence of letters. Vowels and consonants tend to alternate, and when we see consecutive vowels or consonants, they are limited to those which can be pronounced. Vowels tend to occur adjacent to consonants, and consonants tend to occur next to vowels. Each will be found in contact with the letters of the other class rather than in contact with its own.

Vowels tend to contact more distinct letters than do consonants. Vowels commonly contact the many different consonants, while consonants most commonly contact the far smaller number of vowels. When vowels contact other vowels they show typical patterns: e rarely contacts o; ea is common, ae is rare; ei and ie are both common.

Using these relationships, many letters within the ciphertext can be classified as vowels (AEIOUY), high-frequency consonants (TNSHRDL), medium-frequency consonants (CMWFGPB), and low-frequency consonants (VKJXQZ), and certain letters within each group can be given provisional identifications.

An example

(For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext or guesses at plaintext, and X~t is used to express a guess that ciphertext letter X represents the plaintext letter t.)

Suppose Eve has intercepted the cryptogram below, and suspects that it is an English text encrypted using a simple substitution cipher:
Code:
 
GYNVN CBFXH IXORD XIMFN DBHRJ HKBYD MIXTD XGOCR HKRHS MNDBF
GYNVK BDERO DBYYR UROOR IRIZR OQKBH HKRMO NYHDX IIBYJ NCDBF
FRJHK NCQRR EZGHQ KRYNH PRHCO NPKHJ XQYHX NHGYN VNCBP FXONT
NRJUN JRXPB IRMRX MFRJX YHJXC RONXG CQXOE XYGYN VCSCH RICHK
RSCRY JAXER CBOXG YJHKR QXOFJ XYGCR YRHXO QONHR BJURY HGORP
BIRCB YJORC RBODK MBMRO CRMXC HORBF MOXPO BIIRO CJXYH GCRMB
CDBFJ BHBIB HNXY
She will have prepared herself with charts describing the conditional probabilities for the plaintext language.

Her first task is to create a contact chart.

Contact Chart

A contact chart is a table with a column for every letter in the ciphertext, under which is written a pair of letters for each instance of that letter in the ciphertext: the first is the letter just before that instance, the second is the letter just after. (A hyphen marks the missing neighbor at the very start or end of the text.) At the top of each column two numbers are written: the number of times that letter appears in the ciphertext (the letter frequency, which is the same as the number of letter pairs in that column), and the number of distinct letters that appear in that column, called the Variety of Contact or VOC.
Code:
 
 1 24 21 11  4 10 11 26 13 15 11  0 11 20 23  6  7 40  3  2  3  4  0 25 21  2
 2 12 13 12  5 10  7 16  8 11  9  0 10 15 15  8  7 20  4  3  3  3  0 17 10  4
 A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
JX CF NB RX DR BX -Y XI HX RH HB    IF YV XR HR OK OD HM XD RR NN    FH GN IR
   DH OR NB RZ MN XO BR XM YN HR    DI VC GC NK CR HJ CC NN JN NK    IO BD EG
   KY ND YM OX BG FY JK MX RH VB    SN FD RD BF HK CH RC    JR NN    DI GN  
   DF NQ TX XR BF ZH RK RR HX QB    RO MD RO XB XY KH          NC    IT BY  
   KD HO NB    FR HY RS RZ RU HR    RR YV OR RB CX EO                DG YR  
   DY NB BE    PX XC BH XI NR HN    XF OY RQ XO RX YU                DI NH  
   KH XR OB    MR YY HK IB RX QR    KB JC MN    OO UO                JQ BJ  
   IY GQ HX    OJ XY YD BR HX PH    BR KC CN       OI                HN RN  
   DF VS CB    BM YC JK RC YA HR    RX YH XN       II                FO QH  
   CP SH OK    BJ HO GQ BR YH HR    FO OP RN       ZO                RP GN  
   PI IH CB       HC NP BI FX DM    RB XH XE       KM                RM XH  
   CO SR             RC IR BU          YV BX       FJ                JY XG  
   RJ RB             KJ BB YO          VC XF       QR                JC GN  
   PI GR             YX    CX          OT XQ       RE                NG RJ  
   CY RB             NG    FB          TR QN       KY                QO GJ  
   RO RR             YJ                UJ GR       PH                EY XG  
   MM OR             CR                OX JR       NJ                AE RR  
   RF XH             CK                YV BD       JX                OG RH  
   OI OJ             JK                OH RC       IM                QO BJ  
   MC GR             RX                HX HR       MX                JY XH  
   DF BD             NR                   MX       FJ                HO X-  
   JH                YG                   PB       CO                MC      
   HI                CO                   RC       HI                OP      
   IH                YG                            KS                JY      
                     BB                            CY                NY      
                     BN                            EC                        
                                                   KQ                        
                                                   CY                        
                                                   YH                        
                                                   HB                        
                                                   UY                        
                                                   OP                        
                                                   IC                        
                                                   OC                        
                                                   CB                        
                                                   MO                        
                                                   CM                        
                                                   OB                        
                                                   IO                        
                                                   CM                        
Notice how in the column for H, there are 26 letter pairs, which indicates that H appears 26 times, but there are only 16 distinct letters in that column, giving H a VOC of 16.
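Building the chart by hand is tedious, so here is a minimal Python sketch of the same bookkeeping. The names contact_chart and variety_of_contact are made up for this example; the ciphertext is the cryptogram above, with the five-letter grouping ignored.

```python
from collections import defaultdict

CIPHERTEXT = """
GYNVN CBFXH IXORD XIMFN DBHRJ HKBYD MIXTD XGOCR HKRHS MNDBF
GYNVK BDERO DBYYR UROOR IRIZR OQKBH HKRMO NYHDX IIBYJ NCDBF
FRJHK NCQRR EZGHQ KRYNH PRHCO NPKHJ XQYHX NHGYN VNCBP FXONT
NRJUN JRXPB IRMRX MFRJX YHJXC RONXG CQXOE XYGYN VCSCH RICHK
RSCRY JAXER CBOXG YJHKR QXOFJ XYGCR YRHXO QONHR BJURY HGORP
BIRCB YJORC RBODK MBMRO CRMXC HORBF MOXPO BIIRO CJXYH GCRMB
CDBFJ BHBIB HNXY
"""

def contact_chart(ciphertext):
    """Map each letter to its list of before/after neighbor pairs."""
    letters = [c for c in ciphertext if c.isalpha()]
    chart = defaultdict(list)
    for i, letter in enumerate(letters):
        before = letters[i - 1] if i > 0 else "-"                # "-" marks start of text
        after = letters[i + 1] if i + 1 < len(letters) else "-"  # "-" marks end of text
        chart[letter].append(before + after)
    return chart

def variety_of_contact(pairs):
    """Number of distinct letters found next to this letter (the VOC)."""
    return len({ch for pair in pairs for ch in pair if ch != "-"})

chart = contact_chart(CIPHERTEXT)
for letter in sorted(chart):
    pairs = chart[letter]
    # One line per column: letter, frequency, VOC, then the pairs.
    print(letter, len(pairs), variety_of_contact(pairs), " ".join(pairs))
```

Each printed line corresponds to one column of the chart, turned on its side.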

While preparing this chart would have taken Eve a bit longer than generating a simple frequency count, it provides her with far more information about the structure of the text. It includes the single-character frequency count as the top line of the chart.

She can easily calculate the digram frequencies by counting the number of times each distinct letter appears in the left position in each column. HK appears six times.

She can calculate the trigram frequencies by counting the number of times each distinct letter pair appears in each column. HKR appears four times (the pair HR occurs four times in the K column).

In practice, she would not list all of the trigram and digram frequencies, but would simply identify the highest, by visual inspection of the chart, and perhaps highlighting those of interest. In this ciphertext, the most frequent letters are R, H, and X, the most frequent digrams are CR, HK, and RO, and the most frequent trigrams are GYN, HKR and YNV.
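If Eve had a computer handy, the digram and trigram counts could also be taken straight from the ciphertext. The sketch below is one way to do it; ngram_counts is a name made up for this example, and n-grams are allowed to span the five-letter groups, since the grouping carries no meaning.

```python
from collections import Counter

CIPHERTEXT = """
GYNVN CBFXH IXORD XIMFN DBHRJ HKBYD MIXTD XGOCR HKRHS MNDBF
GYNVK BDERO DBYYR UROOR IRIZR OQKBH HKRMO NYHDX IIBYJ NCDBF
FRJHK NCQRR EZGHQ KRYNH PRHCO NPKHJ XQYHX NHGYN VNCBP FXONT
NRJUN JRXPB IRMRX MFRJX YHJXC RONXG CQXOE XYGYN VCSCH RICHK
RSCRY JAXER CBOXG YJHKR QXOFJ XYGCR YRHXO QONHR BJURY HGORP
BIRCB YJORC RBODK MBMRO CRMXC HORBF MOXPO BIIRO CJXYH GCRMB
CDBFJ BHBIB HNXY
"""

def ngram_counts(ciphertext, n):
    """Count every run of n consecutive letters, ignoring spaces and newlines."""
    letters = "".join(c for c in ciphertext if c.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

digrams = ngram_counts(CIPHERTEXT, 2)
trigrams = ngram_counts(CIPHERTEXT, 3)

# Show the most frequent entries, highest count first.
print(digrams.most_common(5))
print(trigrams.most_common(5))
```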

Just as a ciphertext can sometimes be broken by simple examination alone, or using only simple frequency distributions, sometimes the contact chart alone will provide sufficient insight into the text to provide the crack necessary to break it. But for the sake of continuing the example, suppose Eve failed to accomplish this.

Identifying vowels and consonants

Eve's next step would be to make a guess as to whether one or more letters are consonants or vowels. In the ciphertext, Eve might assume that R is a vowel. It has a very high variety of contact (20), indicating that it contacts every letter that appears in the ciphertext save four.

But Eve knows that the most reliable assumption is usually that low-frequency, low-contact letters are consonants. The low-frequency consonants make up around 20% of a typical text. There are 314 characters in the ciphertext, 20% of which is about 62. Using the chart, Eve identifies the lowest-contact letters that have a combined frequency as close to 62 as possible, without going over.
Code:
 
Letter Freq VoC
P         6   8
I        13   8
Q         7   7
G        11   7
E         4   5
Z         2   4
S         3   4
V         4   3
U         3   3
T         2   3
A         1   2
Total: 56
From these, Eve would begin to build her vowel and consonant lines.
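The selection of the low-contact letters can be sketched in Python as below. This is only one reading of the rule (take letters in order of increasing VOC until the next one would push the total past 20% of the text); the function name low_contact_letters and the tie-break on frequency are choices made for this example. The figures are copied from the top two rows of the contact chart.

```python
import string

# Frequency and VOC for A..Z, copied from the top two rows of the contact chart.
FREQS = [1, 24, 21, 11, 4, 10, 11, 26, 13, 15, 11, 0, 11,
         20, 23, 6, 7, 40, 3, 2, 3, 4, 0, 25, 21, 2]
VOCS = [2, 12, 13, 12, 5, 10, 7, 16, 8, 11, 9, 0, 10,
        15, 15, 8, 7, 20, 4, 3, 3, 3, 0, 17, 10, 4]
FREQ = dict(zip(string.ascii_uppercase, FREQS))
VOC = dict(zip(string.ascii_uppercase, VOCS))

def low_contact_letters(freq, voc, percent=20):
    """Pick the lowest-VOC letters whose combined frequency stays within percent of the text."""
    budget = sum(freq.values()) * percent // 100
    present = [l for l in freq if freq[l] > 0]      # skip letters that never occur
    chosen, total = [], 0
    # Lowest VOC first; ties broken by lower frequency (an arbitrary choice).
    for letter in sorted(present, key=lambda l: (voc[l], freq[l])):
        if total + freq[letter] > budget:
            break                                    # the next letter would go over
        chosen.append(letter)
        total += freq[letter]
    return chosen, total

chosen, total = low_contact_letters(FREQ, VOC)
print("".join(chosen), total)
```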

Vowel and Consonant Lines

A Consonant Line takes the form of a tee, with the letters thought to be consonants above the top line, and with the letters that precede or follow the letters at the top written below on each side of the vertical line. A Vowel Line is analogous.

Eve has identified a number of probable consonants. She would have created a consonant line by taking each probable consonant in turn, writing it at the top of the line, and then going down the column for that letter in the contact chart, taking each letter pair and writing the left letter on the left side of the consonant line and the right on the right.
Code:
 
              ATUVSZEGQIP
              ---------------------
               BBBB | BBBB                
                III | II                  
           RRRRRRRR | RRRRRRRRRR          
                  F | F                  
                CCC | CCCCCCC            
         XXXXXXXXXX | XXXXXX              
                  D | D                  
                  E |                    
                    | G                  
                 YY | YYYYY              
            HHHHHHH | H                  
                OOO | OOOO                
                    | KKKK                
                  M | MM                  
                JJJ |                    
              NNNNN | NNN                
                  Z | ZZ                  
This shows which letters contact the low-frequency consonants. Those that do so the most are likely to be vowels. Any letter that does not contact the low-frequency consonants at all is likely to be a consonant itself, and can be added to the line. In this case, there are none.
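The tallying behind a consonant line (or a vowel line) is mechanical, and a minimal Python sketch of it follows. The name side_contacts is made up for this example. It counts the letters found immediately before and after any member of the chosen group, and the loop at the end prints them in the tee layout, busiest letters first. The counts it prints come straight from the ciphertext and are not guaranteed to match a hand-made line letter for letter.

```python
from collections import Counter

CIPHERTEXT = """
GYNVN CBFXH IXORD XIMFN DBHRJ HKBYD MIXTD XGOCR HKRHS MNDBF
GYNVK BDERO DBYYR UROOR IRIZR OQKBH HKRMO NYHDX IIBYJ NCDBF
FRJHK NCQRR EZGHQ KRYNH PRHCO NPKHJ XQYHX NHGYN VNCBP FXONT
NRJUN JRXPB IRMRX MFRJX YHJXC RONXG CQXOE XYGYN VCSCH RICHK
RSCRY JAXER CBOXG YJHKR QXOFJ XYGCR YRHXO QONHR BJURY HGORP
BIRCB YJORC RBODK MBMRO CRMXC HORBF MOXPO BIIRO CJXYH GCRMB
CDBFJ BHBIB HNXY
"""

def side_contacts(ciphertext, group):
    """Tally the letters seen just before (left) and just after (right) any letter in group."""
    letters = [c for c in ciphertext if c.isalpha()]
    left, right = Counter(), Counter()
    for i, letter in enumerate(letters):
        if letter not in group:
            continue
        if i > 0:
            left[letters[i - 1]] += 1
        if i + 1 < len(letters):
            right[letters[i + 1]] += 1
    return left, right

# First-pass consonants from the low-contact table.
left, right = side_contacts(CIPHERTEXT, set("ATUVSZEGQIP"))
for letter in sorted(set(left) | set(right), key=lambda l: -(left[l] + right[l])):
    print(f"{letter * left[letter]:>20} | {letter * right[letter]}")
```

Calling the same function with the group set to the probable vowels gives the tallies for a vowel line.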

This first consonant line gives Eve two probable vowels, R and X. R had been strongly suggested simply by its high VOC.

Eve decides to classify R and X as vowels in this pass. So she prepares a vowel line.
Code:
 
              RX
              ---------------------
                BBB | BBBB                
                    | FFF                
                JJJ | JJ                  
         OOOOOOOOOO | OOOOO              
               CCCC | CCCCC              
                  R | RRRRRRRR            
                  D | DD                  
                 XX | XXXXX              
                 EE | EE                  
                    | ZZ                  
              HHHHH | HHHHH              
                    | K                  
                  S | S                  
              IIIII | II                  
                  U | U                  
              MMMMM | MM                  
                  N | NNN                
                 QQ | QQ                  
                PPP | P                  
             YYYYYY | YY                  
                GGG | G                  
                  T |                    
This convinces Eve that O and H are consonants. C also shows up as frequently contacting vowels, but the consonant line shows it contacting consonants about as often, so she'll make no determination for it, yet.

Adding O and H produces a second consonant line.
Code:
 
              ATUVSZEGQIPOH
              ---------------------
         BBBBBBBBBB | BBBBBB              
          HHHHHHHHH | HH                  
           NNNNNNNN | NNNNNNNNN          
  RRRRRRRRRRRRRRRRR | RRRRRRRRRRRRRRRRRR  
                III | III                
                  D | DDDD                
   XXXXXXXXXXXXXXXX | XXXXXXXXXX          
                  F | FF                  
            CCCCCCC | CCCCCCCCCC          
                  K | KKKKKKKK            
               OOOO | OOOOOO              
                  E | E                  
                GGG | GGG                
             YYYYYY | YYYYY              
                  Q | QQQ                
                MMM | MM                  
              JJJJJ | JJ                  
                  P | P                  
                  Z | ZZ                  
                    | S                  
Examining this second consonant line convinces Eve that B and N are the other two high-frequency vowels.

Vowel spacing

As Eve was identifying vowels, she was marking up the ciphertext, putting a bar over every letter that represents a vowel. Long consonant clusters are rare in English. Five sequential consonants happen rarely (e.g., "running through"), more than five happen almost not at all. Similarly, long sequences of vowels are rare. She used these sequences, along with the number of contacts shown in the consonant line and the vowel line, in determining which letters were vowels and consonants.

After having identified the four high-frequency vowels, Eve sees that there are no instances of more than two vowels in sequence. Had there been any, it would have been an indication that she had decided incorrectly.

She finds only one non-vowel sequence longer than five letters, EZGHQK. Of these six letters, she has already identified E, Z, G, H, and Q as consonants. Either K is a vowel, or one of the others was mistakenly identified. (It's not uncommon for u and y to appear in the low 20%, and to be picked as consonants in the initial division.)
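The vowel-spacing check can also be done mechanically. The sketch below, assuming the four letters R, X, B and N are the vowels found so far, lists every run of five or more non-vowels and reports the longest run of vowels. The function names are made up for this example.

```python
import re

CIPHERTEXT = """
GYNVN CBFXH IXORD XIMFN DBHRJ HKBYD MIXTD XGOCR HKRHS MNDBF
GYNVK BDERO DBYYR UROOR IRIZR OQKBH HKRMO NYHDX IIBYJ NCDBF
FRJHK NCQRR EZGHQ KRYNH PRHCO NPKHJ XQYHX NHGYN VNCBP FXONT
NRJUN JRXPB IRMRX MFRJX YHJXC RONXG CQXOE XYGYN VCSCH RICHK
RSCRY JAXER CBOXG YJHKR QXOFJ XYGCR YRHXO QONHR BJURY HGORP
BIRCB YJORC RBODK MBMRO CRMXC HORBF MOXPO BIIRO CJXYH GCRMB
CDBFJ BHBIB HNXY
"""

def non_vowel_runs(ciphertext, vowels, min_len=5):
    """Return every maximal run of non-vowel letters at least min_len long, in text order."""
    letters = "".join(c for c in ciphertext if c.isalpha())
    pattern = "[^" + "".join(sorted(vowels)) + "]+"
    return [run for run in re.findall(pattern, letters) if len(run) >= min_len]

def longest_vowel_run(ciphertext, vowels):
    """Length of the longest run of consecutive vowel letters."""
    letters = "".join(c for c in ciphertext if c.isalpha())
    pattern = "[" + "".join(sorted(vowels)) + "]+"
    return max((len(run) for run in re.findall(pattern, letters)), default=0)

VOWELS = "RXBN"   # the four high-frequency vowels identified so far
print(non_vowel_runs(CIPHERTEXT, VOWELS))
print(longest_vowel_run(CIPHERTEXT, VOWELS))
```

Any run the first function reports is worth a second look, since it suggests a vowel hiding among the letters provisionally marked as consonants.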

Vowel-vowel contact

Eve's next step is to try to determine which of the four high-frequency vowels is which. She does this by looking at the frequencies of the digrams containing only these four letters. In the table, the row gives the first letter of the digram and the column gives the second.
Code:
 
    B   N   R   X
B   0   0   0   0
N   0   0   1   2
R   3   0   1   2
X   0   1   0   0
Eve knows that e is the vowel that is by far most likely to contact other vowels. She provisionally assigns R~e. The vowel that is most likely to precede e is i, which might be N, though the counts are so small as to make it uncertain.
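The vowel-vowel table can be regenerated from the ciphertext with a few lines of Python. In this sketch (vowel_digrams is a made-up name) the row is the first letter of the digram and the column is the second.

```python
from collections import Counter

CIPHERTEXT = """
GYNVN CBFXH IXORD XIMFN DBHRJ HKBYD MIXTD XGOCR HKRHS MNDBF
GYNVK BDERO DBYYR UROOR IRIZR OQKBH HKRMO NYHDX IIBYJ NCDBF
FRJHK NCQRR EZGHQ KRYNH PRHCO NPKHJ XQYHX NHGYN VNCBP FXONT
NRJUN JRXPB IRMRX MFRJX YHJXC RONXG CQXOE XYGYN VCSCH RICHK
RSCRY JAXER CBOXG YJHKR QXOFJ XYGCR YRHXO QONHR BJURY HGORP
BIRCB YJORC RBODK MBMRO CRMXC HORBF MOXPO BIIRO CJXYH GCRMB
CDBFJ BHBIB HNXY
"""

def vowel_digrams(ciphertext, vowels):
    """Count adjacent pairs in which both letters belong to the vowel set."""
    letters = [c for c in ciphertext if c.isalpha()]
    return Counter((a, b) for a, b in zip(letters, letters[1:])
                   if a in vowels and b in vowels)

VOWELS = "BNRX"
counts = vowel_digrams(CIPHERTEXT, VOWELS)

# Print as a table: row = first letter, column = second letter.
print("    " + "   ".join(VOWELS))
for first in VOWELS:
    print(first + "   " + "   ".join(str(counts[(first, second)]) for second in VOWELS))
```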

She leaves the others, for now, and moves on to the trigrams to see if she can identify the high-frequency consonants.

High-Frequency Consonants

The high-frequency letters in the ciphertext are R, H, X, B, O, C, Y, N, J, and I.

Eve has identified R as e, and B, N, and X as high-frequency vowels.

Remaining are H, O, C, Y, J, and I, and the high-frequency consonants, t, n, s, r and d are likely among them.
Code:
 
     4 GYN
     4 HKR
     4 YNV
     3 DBF
     3 JHK
     3 JXY
     2 BIR
     2 BYJ
     2 CDB
     2 CRM
     2 CRY
     2 DXI
     2 FRJ
     2 GCR
     2 HJX
     2 NCB
     2 NDB
     2 NVN
     2 OCR
     2 PBI
     2 QXO
     2 RCB
     2 RJH
     2 ROC
     2 VNC
     2 XYG
     2 XYH
     2 YHG
Eve examines the high-frequency trigrams to see if they can help her identify the remaining high-frequency consonants. The one she immediately notices is HKR. She knows R~e, that H is a consonant, and that K is not a, e, i or o. Of the trigrams consisting of two consonants followed by e, "the" is overwhelmingly the most common. She also sees that HH appears as a double letter, which doesn't contradict the assumption H~t. So she provisionally assigns HKR~the.

Now she looks at the digrams:
Code:
 
     7 CR
     6 HK
     6 RO
     5 BF
     5 DB
     5 GY
     5 JX
     5 KR
     5 ON
     5 OR
     5 XO
     5 XY
     5 YH
     5 YN
     4 BH
     4 BI
     4 BY
     4 CB
     4 IR
     4 NC
     4 NV
     4 RH
     4 RJ
     4 RM
     4 RY
     4 YJ
She looks at BH. She has assumed that H is t, and is pretty sure that B is a vowel other than e. Of the four possibilities, at, it, ot, and ut, at is the most common. She decides to give B~a a try.

At this point, she has two high-frequency vowels left, i and o, which have to be either XN or NX. Purely arbitrarily, she tries NX~oi. Examining the partial plaintext, she sees that the message ends: BHNXY~atoi_, which doesn't seem right. Both oi and io are reasonably common, but as the ending of a word, ation is common, and atoi? is not.

So she changes her mind and tries NXY~ion. She now has the four high-frequency vowels BENX~aeio, and three consonants, KYH~hnt. The remaining high-frequency letters in the ciphertext are: O, C, J, and I. The unidentified high-frequency consonants are s, r and d.
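Checking a set of guesses against the ciphertext is easy to automate. Here is a minimal Python sketch, with a made-up helper apply_key, that writes the guessed plaintext letter in lowercase wherever one is known and an underscore everywhere else, using the seven assignments Eve has made so far.

```python
# Provisional assignments so far: B~a, R~e, N~i, X~o, K~h, Y~n, H~t.
KEY = dict(zip("BRNXKYH", "aeiohnt"))

def apply_key(ciphertext, key):
    """Replace known ciphertext letters with plaintext guesses and unknown ones with '_'."""
    return "".join(key.get(c, "_") if c.isalpha() else c for c in ciphertext)

print(apply_key("BHNXY", KEY))                     # the last five letters of the message
print(apply_key("DBYYRUROORIRIZROQKBHHKR", KEY))   # a stretch from the second line
```

Spaces and line breaks pass through unchanged, so the whole cryptogram can be fed in at once to get a partial plaintext to read through.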

She looks through the partial plaintext, looking to see where these might fit. She notices:
Code:
 
_ a n n e _ e _ _ e _ e _ _ e _ _ h a t t h e
D B Y Y R U R O O R I R I Z R O Q K B H H K R
Most particularly, she notices that the unknown letters in this sequence include repeats of the high-frequency ciphertext letters O and I. O is the higher-frequency of the two, so she mentally replaces it with each of the unidentified high-frequency consonants in turn. The assignment O~r immediately leads to I~m, and the plaintext:
Code:
 
_ a n n e _ e r r e m e m _ e r _ h a t t h e
D B Y Y R U R O O R I R I Z R O Q K B H H K R
Which is, of course:
Code:
 
c a n n e v e r r e m e m b e r w h a t t h e
D B Y Y R U R O O R I R I Z R O Q K B H H K R


At which point her plaintext is:
Code:
 
_ n i _ i _ a _ o t m o r e c o m _ _ i c a t e _ t h a n c _ m o _ c
G Y N V N C B F X H I X O R D X I M F N D B H R J H K B Y D M I X T D

o _ r _ e t h e t _ _ i c a _ _ n i _ h a c _ e r c a n n e v e r r e
X G O C R H K R H S M N D B F G Y N V K B D E R O D B Y Y R U R O O R

m e m b e r w h a t t h e _ r i n t c o m m a n _ i _ c a _ _ e _ t h
I R I Z R O Q K B H H K R M O N Y H D X I I B Y J N C D B F F R J H K

i _ w e e _ b _ t w h e n i t _ e t _ r i _ h t _ o w n t o i t _ n i
N C Q R R E Z G H Q K R Y N H P R H C O N P K H J X Q Y H X N H G Y N

_ i _ a _ _ o r i _ i e _ v i _ e o _ a m e _ e o _ _ e _ o n t _ o _
V N C B P F X O N T N R J U N J R X P B I R M R X M F R J X Y H J X C

e r i o _ _ w o r _ o n _ n i _ _ _ _ t e m _ t h e _ _ e n _ _ o _ e
R O N X G C Q X O E X Y G Y N V C S C H R I C H K R S C R Y J A X E R

_ a r o _ n _ t h e w o r _ _ o n _ _ e n e t o r w r i t e a _ v e n
C B O X G Y J H K R Q X O F J X Y G C R Y R H X O Q O N H R B J U R Y

t _ r e _ a m e _ a n _ r e _ e a r c h _ a _ e r _ e _ o _ t r e a _
H G O R P B I R C B Y J O R C R B O D K M B M R O C R M X C H O R B F

_ r o _ r a m m e r _ _ o n t _ _ e _ a _ c a _ _ a t a m a t i o n
M O X P O B I I R O C J X Y H G C R M B C D B F J B H B I B H N X Y


And from there, breaking the rest is trivial.
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.