Welcome Guest [Log In] [Register]
Welcome to Crypto. We hope you enjoy your visit.


You're currently viewing our forum as a guest. This means you are limited to certain areas of the board and there are some features you can't use. If you join our community, you'll be able to access member-only sections, and use many member-only features such as customizing your profile, sending personal messages, and voting in polls. Registration is simple, fast, and completely free.


Join our community!


If you're already a member please log in to your account to access all of our features:

Username:   Password:
Add Reply
Cryptostat; My work in progress
Topic Started: Sep 8 2005, 11:18 PM (411 Views)
PulsarSL
Super member
[ *  *  *  * ]
Since I'm not very good at making algorithims, I've decided to try something a bit different with crypto software.

I'm working on a program that will accept input (an encoded message) and will analyze the frequency of the letters in it to try and predict (obviously, it can't be perfect) which letters are which and also provide a best-guess of the message.

I realize this will only work with simple encryptions and none of the harder stuff will work with it, but it's both a cryptography and C++ learning experience for me.

I'm pretty busy, so work is pretty slow. :D

PulsarSL

P.S. - We need more members... this site isn't listed in Google is it??
Offline Profile Quote Post Goto Top
 
Revelation
Member Avatar
Administrator
[ *  *  *  *  * ]
Oh we do need more members! I don't know if it is listed.

This is my code for the frequency check:

Code:
 

procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
  k: array [33..256] of Integer;
begin
 Memo2.Lines.Clear;

 for i := 33 to 256 do
 begin
   k[i] := 0;
 end;

 for i := 1 to Length(Memo1.Text) do
 begin
   if Ord(Memo1.Text[i]) >= 8 then
     k[ord(Memo1.Text[i])] := k[Ord(Memo1.Text[i])] + 1;
 end;

 for i := 33 to 256 do
 begin
   if k[i] <> 0 then
     Memo2.Lines.Add(Chr(i) + '  : ' + IntToStr(k[i]));
 end;
end;
RRRREJMEEEEEPVKLWENFNVJKEEEEEAOLKAFKLXCFZAASDJXZTTTTTTTLSIOWJXMOKLAFJNNKFNXN
RAGRBAQEMHIGDJVDSEOXVIYCELFHWLELJFIENXLRATALSJFSLCYTKLASJDKMHGOVOKAJDNMNUITN
RRRRLJVEEEEECLYVYHNVPFTAEEEEEMWLMEIRNGLARWJAKJDFLWNTIERJMIPQWOTZEOCXKNUBNXCN
RJIRPOWEANFUSNCZVDVZNMSFEKLOEPZLDKDJWSAAAAAAAOERHJCTNCKFRIMVKSOFOMKMANREWNBN
RZUDRGXEEEEENFQIDVLQNCKNEEEEEDGLLLLLLAWIOSNCDARLODMTOEJXMILDFJROTKJSDNLVCZNN
Offline Profile Quote Post Goto Top
 
codebreaker11235
Just registered
[ * ]
If you would like here is my Java code that performs basically the same task, it also calculates the observed phi, the expected monographic phi, and the expected random phi. I plan to add a few more statistical tests to it to help in analysis.

Code:

Code:
 

BufferedReader stdin = new BufferedReader (
               new InputStreamReader(System.in));
  String cipherText;
  char[] cipherV = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'};
  int[] frequency = new int[27];
  char cipher;
 
  System.out.println("input ciphertext");
  cipherText = (String)(stdin.readLine());
 
  int i = 0, j = 0;
  System.out.println("\n" + cipherText.length());

  for (i = 0; i < cipherText.length(); i++)
  {
   cipher = cipherText.charAt(i);

   for (j = 0; j < 26; j++)
   {
    if (cipher == cipherV[j])
    {
     frequency[j] += 1;
    }
   }
  }
  System.out.println();
 
  for (i = 0; i < 26; i++)
  {
   System.out.println(cipherV[i] + ": " + frequency[i] + "\t: " + (((float)frequency[i])/((float)cipherText.length()))*100 + "%");
  }
 
  int phi = 0;
 
  for (i = 0; i < 26; i++)
  {
   phi = phi + (frequency[i] * (frequency[i] - 1));
  }
 
  System.out.println("Expected plaintext phi: " + 0.0661 * cipherText.length() * (cipherText.length() - 1));
 
  System.out.println("Expected random phi   : " + 0.038 * cipherText.length() * (cipherText.length() - 1));
  System.out.println("Observed ciphertext phi: " + phi);
}//main


Edit Revelation: Code put between code tags
Offline Profile Quote Post Goto Top
 
insecure
Elite member
[ *  *  *  *  * ]
Unless I am mistaken, all the programs shown to date deal only with alphabetical characters. Unfortunately, many encryption mechanisms do not restrict themselves to alphabetics.

The following program (which is written in standard C) will display frequency stats for a file named in its command line, irrespective of the kind of data stored in that file. I apologise that it's a bit long (around 100 lines), but the alternative is to write "clever" C, which doesn't really help people to read it easily - and it is a complete program, not just a fragment:

Code:
 

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <ctype.h>
#include <assert.h>

struct frequency_map_
{
 unsigned char data;
 unsigned long count;
};

typedef struct frequency_map_ frequency_map;

int FrequencyAnalyse(frequency_map *pfm,
                    size_t nelems,
                    FILE *input,
                    unsigned long *counter)
{
 int ch = 0;
 assert(nelems == UCHAR_MAX + 1);
 assert(counter != NULL);

 while((ch = getc(input)) != EOF)
 {
   ++*counter;         /* track total number of characters */
   ++pfm[ch].count;    /* store the count */
   pfm[ch].data = ch;  /* track what this count is for */
 }

 return ferror(input);
}

int FrequencyCompare(const void *vf1, const void *vf2)
{
 const frequency_map *f1 = vf1;
 const frequency_map *f2 = vf2;

 return (f1->count < f2->count) - (f1->count > f2->count);
}

int FrequencyDisplay(frequency_map *pfm,
                    size_t nelems,
                    unsigned long counter,
                    int suppress_zero)
{
 size_t i = 0;
 do
 {
   if(pfm[i].count > 0 || !suppress_zero)
   {
     printf("Code %3d (", pfm[i].data);
     if(isprint(pfm[i].data))
     {
       printf(" %c", pfm[i].data);
     }
     else
     {
       printf("!!");
     }
     printf(") %6lu (%6.2f%%)\n",
            pfm[i].count,
            pfm[i].count * 100.0 / counter);
   }
 } while(++i < nelems);

 return ferror(stdout);
}

int main(int argc, char **argv)
{
 int rc = 1;

 if(argc > 1)
 {
   FILE *fp = fopen(argv[1], "rb");
   if(fp != NULL)
   {
     frequency_map fm[UCHAR_MAX + 1] = {0};
     unsigned long counter = 0;
     rc = FrequencyAnalyse(fm,
                           sizeof fm / sizeof fm[0],
                           fp,
                           &counter);
     fclose(fp);
     if(0 == rc)
     {
       qsort(fm,
             sizeof fm / sizeof fm[0],
             sizeof fm[0],
             FrequencyCompare);
       rc = FrequencyDisplay(fm,
                             sizeof fm / sizeof fm[0],
                             counter,
                             1);
     }
   }
 }
 else
 {
   fputs("Specify a file to frequency-analyse\n", stderr);
 }
 return rc;
}
Offline Profile Quote Post Goto Top
 
Revelation
Member Avatar
Administrator
[ *  *  *  *  * ]
Mine takes every ascii character starting at 33. See here why. I like it that much people here are programmers (and in different languages!). :)
RRRREJMEEEEEPVKLWENFNVJKEEEEEAOLKAFKLXCFZAASDJXZTTTTTTTLSIOWJXMOKLAFJNNKFNXN
RAGRBAQEMHIGDJVDSEOXVIYCELFHWLELJFIENXLRATALSJFSLCYTKLASJDKMHGOVOKAJDNMNUITN
RRRRLJVEEEEECLYVYHNVPFTAEEEEEMWLMEIRNGLARWJAKJDFLWNTIERJMIPQWOTZEOCXKNUBNXCN
RJIRPOWEANFUSNCZVDVZNMSFEKLOEPZLDKDJWSAAAAAAAOERHJCTNCKFRIMVKSOFOMKMANREWNBN
RZUDRGXEEEEENFQIDVLQNCKNEEEEEDGLLLLLLAWIOSNCDARLODMTOEJXMILDFJROTKJSDNLVCZNN
Offline Profile Quote Post Goto Top
 
insecure
Elite member
[ *  *  *  *  * ]
Yes, I understand why you start at 33 (basically it's because printing characters with lower codes is a PITA). That's fine for some purposes, but some ciphers do map some plaintext characters to ciphertext character values below 33, so it's as well to have a way to analyse such ciphertexts.
Offline Profile Quote Post Goto Top
 
PulsarSL
Super member
[ *  *  *  * ]
insecure
Sep 10 2005, 10:20 AM
Yes, I understand why you start at 33 (basically it's because printing characters with lower codes is a PITA). That's fine for some purposes, but some ciphers do map some plaintext characters to ciphertext character values below 33, so it's as well to have a way to analyse such ciphertexts.

But how are those displayed?
Offline Profile Quote Post Goto Top
 
PulsarSL
Super member
[ *  *  *  * ]
Revelation
Sep 10 2005, 08:46 AM
Mine takes every ascii character starting at 33. See here why. I like it that much people here are programmers (and in different languages!). :)

That's how I'm planning on writing mine, but I have to get around to it :D
Offline Profile Quote Post Goto Top
 
PulsarSL
Super member
[ *  *  *  * ]
As for listing with google, you can do it here

http://www.google.com/addurl/?continue=/addurl
Offline Profile Quote Post Goto Top
 
insecure
Elite member
[ *  *  *  *  * ]
PulsarSL asks how to print non-printable characters. Of course, the answer is that you can't. But what you can do is print their code points. My program produces output (which I've wrapped in code tags for you) like this:

Code:
 

Code   0 (!!)   3895 ( 28.91%)
Code  44 ( ,)    573 (  4.25%)
Code  95 ( _)    540 (  4.01%)
Code  40 ( ()    396 (  2.94%)
Code  41 ( ))    391 (  2.90%)


Here, the code point is given in decimal. This is followed by two characters in parentheses. These are EITHER a space and the printing character corresponding to the code point, OR the sequence !!, which indicates that no printing character is available.

The next figures describe the frequency (first the absolute frequency, and then expressed as a percentage). The characters are displayed in descending order of frequency, so the most significant are at the top.

By far the most common way of displaying non-printable characters, however, is to replace them with a single dot. Look at any hex dump, and you'll see this immediately. For example, here's how I got a hex dump of the object file corresponding to my C program (it's piped through head to keep the output down to a few lines):

Code:
 

~/dev/crypto/ft> hexdump -C ft.o | head

00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  01 00 03 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  ac 2e 00 00 00 00 00 00  34 00 00 00 00 00 28 00  |¬.......4.....(.|
00000030  0e 00 0b 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  90 eb 0d 90 90 90 90 90  90 90 90 90 90 90 90 90  |.ë..............|
00000050  55 89 e5 83 ec 0c 57 56  53 ba 00 00 00 00 e8 fc  |U.å.ì.WVSº....èü|
00000060  ff ff ff 8b 5d 08 8b 7d  10 8b 75 14 81 7d 0c 00  |ÿÿÿ.]..}..u..}..|
00000070  01 00 00 74 1b 68 00 00  00 00 6a 12 68 11 00 00  |...t.h....j.h...|
00000080  00 68 16 00 00 00 e8 fc  ff ff ff 90 8d 74 26 00  |.h....èüÿÿÿ..t&.|
00000090  85 f6 75 25 68 00 00 00  00 6a 13 68 11 00 00 00  |.öu%h....j.h....|


(Ignore the first column, which is just a file offset, in hex.) As you can see, there are quite a few non-printable characters here, most of which are simply ASCII 0 (although there are a few others too). The hex dumper displays their hexadecimal value easily (always in two digits, since my system - like many desktop systems - uses 8-bit bytes). Then it gives the printable value if possible, but replaces with a dot if necessary.

Personally, I prefer ciphertexts to be expressed in hexadecimal, sixteen bytes per line, with no file offset column and no print translation, e.g. as follows:

Code:
 

7F 45 4C 46 01 01 01 00 00 00 00 00 00 00 00 00
01 00 03 00 01 00 00 00 00 00 00 00 00 00 00 00
AC 2E 00 00 00 00 00 00 34 00 00 00 00 00 28 00
0E 00 0B 00 00 00 00 00 00 00 00 00 00 00 00 00
90 EB 0D 90 90 90 90 90 90 90 90 90 90 90 90 90
55 89 E5 83 EC 0C 57 56 53 BA 00 00 00 00 E8 FC
FF FF FF 8B 5D 08 8B 7D 10 8B 75 14 81 7D 0C 00
01 00 00 74 1B 68 00 00 00 00 6A 12 68 11 00 00
00 68 16 00 00 00 E8 FC FF FF FF 90 8D 74 26 00
85 F6 75 25 68 00 00 00 00 6A 13 68 11 00 00 00


This provides a nice consistent way to express ciphertexts, irrespective of algorithm, and makes it easy to write parsers to read them. Also, you can see every single byte, irrespective of its value.

If everybody here produced ciphertexts in this form, each of us would only need to write (or steal!) one ciphertext-parsing routine, which we could use for any challenge or exercise posted in this forum. That would leave us free to concentrate on the challenge or exercise itself.
Offline Profile Quote Post Goto Top
 
Donald
Elite member
[ *  *  *  *  * ]
Quote:
 
If everybody here produced ciphertexts in this form


The problem as I see it, is that posting your challenge in hex will alienate
some of the pen and paper cryptographers who aren't into programming.

Ok not a lot, I mean it's just one more level of encryption, but it might
scare off a few who are new to the entire thing. Give them a string
of letters and they might be willing to try, give them a hex table and
they are going to assume right off that it's too complicated for them.

BUT, we can have the best of both worlds. It's pretty simple to write a
program that will convert 'A' to 'Z' (or any ascii text) into hex.

So unless your challenge has unusual characters in it, just post it with
the ordinary character stream and we programmers (and others using
such tools) can convert to hex as needed.

Donald
Offline Profile Quote Post Goto Top
 
insecure
Elite member
[ *  *  *  *  * ]
It's a fair point. It seems to me that we can get the best of both worlds by deciding, when posting a ciphertext, how it can best be presented to maximise the chance that someone will be motivated to have a crack at it (if you will pardon the pun).

If the domain of the algorithm is text-based, then it makes sense to present the ciphertext in letters. If it is not, then a hexadecimal presentation works better.

For example, let's say I decide that my monoalphabetic substitution cipher is uncrackable (yeah, right!), or perhaps I am simply providing an exercise in monoalphabetic cracking. I could reasonably suggest that we are concerned only with letters of the alphabet, and display a ciphertext like this:

FGV OL ZIT ZODT YGK QSS UGGR DTF ZG EGDT ZG ZIT QOR GY ZITOK HQKZN.

(Here, I have retained the original word spacing, to make it a bit easier to crack. Sorry about ZIT, by the way - just luck. You shouldn't have too many problems decoding this - it's just an ordinary cryptogram, using simple letter substitution - but it's supposed to be an illustration, not a challenge!)

Nobody is going to be put off from cracking this. But if we look at the challenges some people have posted here already, we will see that they contain bizarre characters such as that little y with the two dots over it (which I think is the Windows rendition of code point 255). Now, this might come as a shock to some people, but not all operating systems interpret and display non-ASCII characters in the same way. ASCII is a 7-bit code, so any code point > 127 is not, strictly speaking, an ASCII character, and so is not covered by the ASCII standard, and ASCII-based systems are free to render it in any way they like, or not at all (some systems just display a little box there instead, to mean "er, huh?").

Actually, not all computers even use ASCII! But I think it's fair to assume that people reading this forum will be on ASCII-based machines. When a character is not in the ASCII range, though, all bets are off. It is in these circumstances that a hexadecimal display makes sense.

When you are designing your algorithm, it's a good idea to decide in advance whether the domain of the ciphertext "alphabet" will be strictly printable-ASCII. If so, then by all means let's display it as such. But if there are some characters that look "weird", that's a hint that we may need to present it in a way that is utterly unambiguous.

To make this a bit more concrete, consider this code to encipher a plaintext using a one-time pad:

Code:
 

#include <ctype.h>
#include <assert.h>

/* one time pad - domain is A-Z only, p is null-terminated. */
/* This is untested code - just a scribble. */

void otp_ascii_encrypt(char *c, const char *p, const char *k)
{
 char t;
 while(*p != '\0')
 {
   assert(isupper(*p)); /* assertion is not a great way to */
   assert(isupper(*k)); /* validate, but this is just a sketch! */
   t = *p++;
   t += (*k++ - 'A');
   if(t > 'Z')
   {
     t -= 26;
   }
   *c++ = t;
 }
 *c = '\0';
}


The input domain (for both the plaintext and the key) is A-Z, and the output domain is also A-Z. So it makes lots of sense to display the ciphertext in - ha! - "English".

But the following version does not assume or require restrictions on either its input domain or its output domain:

Code:
 

/* one time pad - domain is binary.          */
/* p and k have at least n bytes, and c      */
/* points to at least that much storage too. */

/* This is untested code - just a scribble.  */

void otp(unsigned char *c,
        const unsigned char *p,
        const unsigned char *k,
        size_t n)
{
 while(n--)
 {
   *c++ = *p++ ^ *k++;
 }
}


This code could produce any old junk as its output, and to display it as ASCII would scare most pencil-and-paper junkies away very quickly, but a hex output would make a pencil-and-paper analysis much more feasible (provided the cryptanalyst had a firm grasp of the principle of XOR, of course!).

So I propose a three-pronged approach:

1) If the output domain of the cipher is pure printable ASCII, display the ciphertext as ASCII:

FGV OL ZIT ZODT YGK QSS UGGR DTF ZG EGDT ZG ZIT QOR GY ZITOK HQKZN.

2) If the output domain is ASCII, but you want to make it a bit harder, remove spaces (to eliminate word-length information), and then display the ciphertext in five-letter groups:

FGVOL ZITZO DTYGK QSSUG GRDTF ZGEGD TZGZI TQORG YZITO KHQKZ N.

3) If the output domain is not restricted to pure printable ASCII, display the ciphertext in hexadecimal:
Code:
 

46 47 56 20 4F 4C 20 5A 49 54 20 5A 4F 44 54 20
59 47 4B 20 51 53 53 20 55 47 47 52 20 44 54 46
20 5A 47 20 45 47 44 54 20 5A 47 20 5A 49 54 20
51 4F 52 20 47 59 20 5A 49 54 4F 4B 20 48 51 4B
5A 4E 20 20 20 20 0A


I think you and I are broadly in agreement here. Perhaps we should do a poll or something, to find out what most people think about this suggestion?
Offline Profile Quote Post Goto Top
 
Donald
Elite member
[ *  *  *  *  * ]
Quote:
 
I think you and I are broadly in agreement here.

Yes, I think we are. If your cipher has unusual characters in it, putting it up in hex seems a good idea.

Donald
Offline Profile Quote Post Goto Top
 
PulsarSL
Super member
[ *  *  *  * ]
Donald
Sep 11 2005, 04:18 AM
Quote:
 
I think you and I are broadly in agreement here.

Yes, I think we are. If your cipher has unusual characters in it, putting it up in hex seems a good idea.

Donald

Sounds good to me.
Offline Profile Quote Post Goto Top
 
PulsarSL
Super member
[ *  *  *  * ]
I am reviving this thread from the dead to tell you that I'm nearly done with my simple frequency counter.

More to come.
Offline Profile Quote Post Goto Top
 
1 user reading this topic (1 Guest and 0 Anonymous)
Go to Next Page
« Previous Topic · General · Next Topic »
Add Reply