Welcome to Crypto. We hope you enjoy your visit. You're currently viewing our forum as a guest. This means you are limited to certain areas of the board and there are some features you can't use. If you join our community, you'll be able to access memberonly sections, and use many memberonly features such as customizing your profile, sending personal messages, and voting in polls. Registration is simple, fast, and completely free. Join our community! If you're already a member please log in to your account to access all of our features: 
The Index Of Coincidence  The Chi Test, The Kappa  

Topic Started: May 5 2008, 01:18 AM (9,396 Views)  
jdege  May 5 2008, 01:18 AM Post #1 
NSA worthy

The Index of Coincidence  the Chi Test, the Kappa Test, and the mixedalphabet Vig. William Friedman's discovery of the Incidence of Coincidence was arguably the greatest breakthrough in cryptanalysis since Al Kindi's development of frequency analysis in the 9th century. The Index of Coincidence was first described by Friedman in a paper written by Friedman in 1920, and published as Riverbank Publication 22. "The Incidence Of Coincidence And Its Application In Cryptography". Friedman revised and improved his technique over the years, and was expanded on by his subordinates, after he began working for the War Department. Most significantly by Solomon Kullbeck, who created what became the phi and chi tests, derived from Friedman's Index of Coincidence, which became known as the kappa test. A reprint of Friedman's original paper is available from Aegean Park Press as "The Riverbank Publications, Volume III". A later revision is also available as "The Index Of Coincidence And Its Applications In Cryptanalysis", Soloman Kullback's text describing the phi and chi tests is available as "Statistical Methods In Cryptanalysis", both from Aegean Park Press. I'm going to try to explain how to calculate the Chi Test and the Kappa test, and how to use them in simple problems. We'll start with a known fact. Plain language text has a distinctive frequency distribution. We've all seen it: "etaoin shrdlu", or whatever. Some letters appear frequently, some less frequently. Plotted in a graph, these form a recognizable shape. For a large sample of text, these are quite consistent.
For a smaller sample of text, there can be variations, but the general shape is quite similar, even if the height of the peaks might be different:
Now suppose you were writing a computer program to automatically crack Caesar shift ciphers. Yes, I know that it's easy enough to work out all 26 possibilities, and then to pick out the right solution by eye. Or to write a computer program to generate all 26 possibilities, then to pick out the right solution by eye. But you're trying to write a program that would pick out the right solution automatically. What test would you use? You could try to replicate what you're doing by eye  search the text to see if parts of it match words in a dictionary, see if bits match common English digrams, trigrams, tetragrams, etc. All of these are easy to do by eye, but they can be complicated to implement on computer. But what a computer can do, quickly and easily, is to determine how closely the frequency distribution of the text matches that of ordinary English text. That's the Chi test. It gives you a measure of how similar the frequency distributions of two strings of text are. To calculate it you need the frequency counts of the letters in the two strings, and the total number of letters in each string. It's calculated by multiplying the frequency count of one letter in the first string by the frequency count of the same letter in the second string, and then doing the same for all the other letters, and summing the result. This is divided by the product of the total number of letters in each string. A poor attempt at displaying math in ASCII:
Where k is the number of distinct letters in the alphabet, f is the number of times the i'th letter appears in the first string and f' is the number of times the i'th letter appears in the second string. And n and n' are the total number of characters in the first and second strings. For our automated Caesar breaker, we'd use frequency counts generated from our test decryption for one set of frequencies, and standardized frequency counts generated from large samples of English text for the other. That's the chi test. It gives us a way to determine if two strings have the same frequency distribution. It doesn't, though, tell us if they have similar frequency distributions. Consider a simple substiturion cipher. It doesn't have anything like the same distribution as ordinary text when it comes to the distances between peaks, but it has the same number of peaks, of the same height. They simply appear in different places, It would be nice to have a simple test by which we could determine if a text was a simple substitution of another. And we do. That's the Index of Coincidence, or the kappa test. It's calculated by using the chi test to compare a text to itself, and then comparing the result to that of performing the kappa test on a sample of ordinary text, and to that of a random string of letters. So where for the chi test, we did this:
for the kappa test, we do:[/i]
(There's some sort of statistician's black magic that makes them reduce counts by one when comparing distributions to themselves. I'm not going to try to explain why, I'm just going to point out that it's what they do, and you don't need to worry about it.) What the kappa test measures is the probability that if you choose two letters from the text at random, they'll be the same letter. (OK, I am going to try to explain the 1. If you're selecting a letter from each of two distinct texts, you are selecting each randomly from a population of size n and a population of size n'. If you are selecting two letters from the same text, and you're not allowing the same letter to be chosen twice, the first letter is chosen from a population of size n, and the second from a population of size n1.) If you're working with an alphabet with 26 letters, the kappa of a string of random letters will be 1/26, or 0.0385, 3.85%. If you take the kappa of normal english text, your kappa will be around 0.0665. The Index of Coincidence is done by scaling the kappa test by the kappa of random text. So if you're kappa is 0.0665, your IC is 0.0665/0.0385, or 1.73. What happens when you calculate the IC of any of our attempts at cracking the Caesar shift? All of them, correct and incorrect, give you the same value. So will the result of applying any simple substitution cipher to the same text. The chi test can be of use in recognizing when a monoalphabetic cipher has been broken, the kappa and the IC are of no use at this. Where the IC comes into its own is in polyalphabetic ciphers  where a number of different alphabets are used in encrypting a text. Consider, now, the Vigenere cipher, where the text is encrypted by adding to each letter of the plaintext successive letters of the keyword, rotating back to the first letter of the plaintext after the last has been used.
The IC of vigenere ciphertext is very different than that of ordinary text  far closer to 1.0, or that of random text, than the 1.7 or thereabouts we'd expect from a plaintext or a monoalphabetic substitution, The more alphabets used, the closer the IC will be to 1.0. So, how do we break a Vig? We find the period. How do we find the period? We look for slices of the ciphertext that have ICs that look like normal text. We start by working through possible key lengths. We assume the key length is two, and then calculate the IC of the even characters, and the IC of the odd characters, then take the average of these two ICs, and call that the result for key length two. Then we assume the key length is three, split the text into three pieces, one containing the 1st, 4th, 7th, characters, one the 2nd, 5th, 8th, one the 3rd, 6th, 9th, and calculate ICs for each of them. We average these three ICs and call it the result for key length three. Continue until you've calculated results for key lengths twice what you think is the maximum likely key length. It's tedious work by hand, absurdly easy with a computer. Look at your results. Most of them will be close to 1.0. But your results for the actual key length will be close to 1.7, and so will be the results for multiples of the actual key length. If your actual key length is 6, you'll see peaks in your results at 6, 12, and 18. So now you know your key length. What next? Well, if the key length was six, then your text was encrypted with six different alphabets, and since we're dealing with a standard Vig, each of those six is a simple Caesar shift. And we know how to crack Caesar shifts, we shift each 26 times, using the chi test to compare their frequency distributions to that of ordinary text. Again, a bit tedious to do by hand, but blindingly simple for a computer. But, you say, what if it wasn't a standard Vig? What if it was a mixed alphabet Vig? What, if they didn't use the standard Vig tableau:
What if, instead, they used a mixedalphabet tableau?
This looks as if it would be a great deal more complicated, but its not. Yes, each of the six alphabets was encrypted using a mixed alphabet, but they're the same alphabet shifted by six different amounts. Again, we use the chi test. But now, instead shifting each piece of the text, and comparing its frequency distribution to that of standard english, we compare it to the other pieces. We shift piece one 26 times, and use the chi test to compare it to piece two. When the chi test peaks, we'll have aligned piece 1 and 2. Do the same with each piece of the text, until they're all aligned with each other. What do we have, when we're done? We've eliminated the shifts between the separate alphabets. So what we have now is a monoalphabetic substitution. In other words, a simple substitution cipher. Which can easily be broken using the techniques we're all familiar with. (This approach is described in Abraham Sinkov's "Elementary Cryptanalysis: A Mathematical Approach". He and Soloman Kullbeck were two of William Friedman's first three hires when he set up the Signals Intelligence Service for the War Department. The third being Frank Rowlett, who was later credited for being the key figure in the cracking of the Japanese PURPLE cipher.) 
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.  
Donald  May 5 2008, 03:04 AM Post #2 
NSA worthy

Cool! This is a new technique to me! AND, a wonderfully detailed description of the IOC, CHI and Kappa. Thanks!!!! 
jdege  May 5 2008, 03:17 AM Post #3 
NSA worthy

I thought it was cool, when I finally figured out enough of the math in Sinkov's book to understand what it was he was talking about. What actually works best is to compare each of the alphabets against each of the others, at every shift, and to merge the two that provide the highest chi value over all of them. Then to repeat, using the frequencies of your combined alphabets. For short texts, the differences don't always stand out. By merging the texts with the highest chitest first, we will be working with longer texts when we get to the alphabets where the differences were indistinct. 
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.  
Revelation  May 5 2008, 09:15 AM Post #4 
Administrator

Very nice tutorial! I've learned something new 
RRRREJMEEEEEPVKLWENFNVJKEEEEEAOLKAFKLXCFZAASDJXZTTTTTTTLSIOWJXMOKLAFJNNKFNXN RAGRBAQEMHIGDJVDSEOXVIYCELFHWLELJFIENXLRATALSJFSLCYTKLASJDKMHGOVOKAJDNMNUITN RRRRLJVEEEEECLYVYHNVPFTAEEEEEMWLMEIRNGLARWJAKJDFLWNTIERJMIPQWOTZEOCXKNUBNXCN RJIRPOWEANFUSNCZVDVZNMSFEKLOEPZLDKDJWSAAAAAAAOERHJCTNCKFRIMVKSOFOMKMANREWNBN RZUDRGXEEEEENFQIDVLQNCKNEEEEEDGLLLLLLAWIOSNCDARLODMTOEJXMILDFJROTKJSDNLVCZNN  
jdege  May 5 2008, 12:21 PM Post #5 
NSA worthy

I was looking at some of the tutorials. Rot13's new way of attacking vigs.  he's matching the pattern of the frequency distributions by calculating the distances between the peaks. It's doing what the chi test does, but in a way that is a lot easier to calculate by hand. 
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.  
jdege  May 30 2008, 04:49 PM Post #6 
NSA worthy

It's not a very general technique, though. Using chi to match the different alphabets to each other depends on being able to shift the ciphertext component of the alphabet. Which means you know the sequence of the ciphertext components. Which you do in Vig, Beaufort, Variant, and Quag I, but which you don't in Gronsfeld or Quag IIIV. Something I noticed, last night, reading MILCRYPT I. I'd skimmed it before, skipping over the sections on standard and reverse standard alphabets as being too obvious to be worth wasting time on. This time, I actually read those sections. They discussed running down the alphabets  taking the first ten letters of the ciphertext, and the writing the successive letters below until the plaintext appeared:
Then they discussed doing the same for a reverse alphabet  by converting the ciphertext using any arbitrary reverse alphabet. The result will be encrypted with some standard alphabet, which can then be walked down in the normal manner. Which got me thinking  wouldn't this work for any known cipher component sequence? Porta uses a nonstandard alphabet, so it can't be shifted in the normal manner. But it always uses the same sequence. Would simply decrypting with any Porta sequence result in standard cipher component sequences that can be chitested? Would it be possible, with the Gronsfeld and the Quag II, to reconstruct enough of the cipher component sequence, to do a trial decryption, and to see if the result is close enough to standard for it to be chitested? 
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.  
jdege  May 31 2008, 12:13 PM Post #7 
NSA worthy

One correction  I was under the impression that the Gronsfeld used a mixed alphabet. Under ACA rules, at least, it does not. The ACA Gronsfeld is a plain Vig, with a numeric key, so chitesting against the standard alphabet works. 
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.  
dabombguyman  May 24 2009, 05:21 AM Post #8 
Member

It's almost like the 'birthday paradox' applied to crypography. Except instead of 365 days, there are 26 letters. 
Q3J5cHRvZ3JhcGh5IGlzIGFuIGFydCBmb3Jt  
1 user reading this topic (1 Guest and 0 Anonymous) 
« Previous Topic · Tutorials · Next Topic » 