Welcome Guest [Log In] [Register]
Welcome to Crypto. We hope you enjoy your visit.


You're currently viewing our forum as a guest. This means you are limited to certain areas of the board and there are some features you can't use. If you join our community, you'll be able to access member-only sections, and use many member-only features such as customizing your profile, sending personal messages, and voting in polls. Registration is simple, fast, and completely free.


Join our community!


If you're already a member please log in to your account to access all of our features:

Username:   Password:
Add Reply
Pattern Word Dictionaries
Topic Started: Apr 13 2008, 01:41 PM (2,040 Views)
jdege
Member Avatar
NSA worthy
[ *  *  *  *  *  * ]
In another thread, we've been discussing pattern word dictionaries. These are a tool for pattern word matching, which is a technique that's been used in cryptanalysis basically forever.

The idea is to create a system that indicates letter repetitions within words. People have done this in a number of ways. The letter repetitions within the word "people", for example, could be expressed as "1-4,2-6", meaning the 1st and fourth, and second and fifth are the same. Or it could be expressed as "12-1-2". You'll see forms such as these in the older crypto books. Neither is particularly easy for a computer to produce or process. The more common form, these days, is "123142" or "abcadb". I prefer the latter.

There are two tasks, then. One is creating a list of words with their patterns, the other is searching the list for words that match a particular pattern. Both require a routine for calculating the pattern for a given word.

I wish I could tell you I had some polished, sophisticated program for doing either of these tasks. I don't. I have a collection of Unix shell scripts, using the standard Unix text-processing tools, cat, tr, sort, awk, sed.

Still, if you're running Windows, you can get a set of these tools easily from the WingGNU project. If you're running on a Mac, you really are running Unix, despite all the pretty pictures, and all of these tools are already installed on your system. All you need is to figure out how to open up a command shell.

So:

Task 0: Get a bunch of text files containing english text. Project Gutenberg has put thousands of books up on the net in plain ASCII format. Grab a selection, download them onto your computer.

Task 1: Create a word-list. From the above text files, extract all the distinct words. I use the following shell script:
Code:
 
#!/bin/sh
# genwordlist.sh

cat "$@" |            # 1
tr A-Z a-z |          # 2
tr -cs a-z"'" \\n |   # 3
sort -u               # 4

Step 1: "cat" prints the contents of all the files that were passed on the command-line to stdout.
Step 2: "tr" reads stdin, converts all the uppercase letters to lowercase, and writes to stdout.
Step 3: "tr" converts all the characters that aren't either lowercase letters or single quotes to newlines.
Step 4: "sort" sorts stdin, removing duplicates.

The result is a file containing all the unique words.

Task 2: Create a pattern word list.

This I do with awk:
Code:
 
#!/bin/sh
# canonize.sh

cat "$@" |
sort -u |
awk -f canonize.awk |
sort -t ' ' -k1,1 -k2,2
Code:
 
BEGIN {
 letters = "abcdefghijklmnopqrstuvwxyz"
}

{
 for (a in used) used[a] = 0;
 used["'"] = "'"

 word = tolower($0)

 n = 1
 result = ""
 for (i=1; i<=length(word); i++) {
  l = substr(word, i, 1)

  if (!used[l]) {
   used[l] = substr(letters, n, 1)
   n += 1
  }

  result = result used[l]
 }

 print result, word
}

The shell script sorts and removes duplicates from the input (which wouldn't really be necessary, if the output was already sorted), and passes the result through the awk script, which prints the pattern and the word, separated by a space. The results are then sorted first by the pattern and then by the word.

The output looks like this:
Code:
 
a a
a i
aab eel
a'ab e'en
aabc eels
aabc ooze
aabca eerie
aabcd aaron
aabcd lloyd
aabcd oozed
aabcde aarhus
aabcde eerily
aabcdeff eelgrass
ab ad
ab al
ab am
ab an
ab as
ab at
ab be
ab by
ab do

The resulting file is no more than a start. As you use it, you'll find that you'll want to edit it, both to add words that aren't in it and should be, and to remove words that may have been in the text you started with, but only confuse things. What I got from the scripts had nearly every single letter as a valid word, I removed all but 'a' and 'i'. All of the state abbreviations showed up as valid two-letter words, I removed them as well.

Task 3: Search the file for words that match a pattern.

This is yet another shell script:
Code:
 
#!/bin/sh
# matches.sh

pattern=$( echo "$@" |             #1
 awk -f canonize.awk |             #2
 cut -d' ' -f1 )                   #3

grep "^${pattern} " wordlist.txt | #4
cut -d' ' -f2                      #5

Step 1: Print the command-line arguments to stdout
Step 2: Print the patternword-space-word
Step 3: print only the patternword
(Because the above is in a subshell expression, it's result is assigned to the shell variable "pattern".)
Step 4: Search the file "wordlist.txt" for lines that match the regular expression "^pattern " - that, lines that start with the pattern followed by a space.
Step 5: For the output lines, print only the second field, that is, just the word.

So, "match crbcyr" outputs all the words that have a pattern of "abcadb":
Code:
 
aerate
asians
balboa
briber
indian
leslie
people
proper
thatch
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.
Offline Profile Quote Post Goto Top
 
Revelation
Member Avatar
Administrator
[ *  *  *  *  * ]
Nice :) This makes things much easier. Maybe I'll create a C-version to make sure that non-linux users can match patterns too.
RRRREJMEEEEEPVKLWENFNVJKEEEEEAOLKAFKLXCFZAASDJXZTTTTTTTLSIOWJXMOKLAFJNNKFNXN
RAGRBAQEMHIGDJVDSEOXVIYCELFHWLELJFIENXLRATALSJFSLCYTKLASJDKMHGOVOKAJDNMNUITN
RRRRLJVEEEEECLYVYHNVPFTAEEEEEMWLMEIRNGLARWJAKJDFLWNTIERJMIPQWOTZEOCXKNUBNXCN
RJIRPOWEANFUSNCZVDVZNMSFEKLOEPZLDKDJWSAAAAAAAOERHJCTNCKFRIMVKSOFOMKMANREWNBN
RZUDRGXEEEEENFQIDVLQNCKNEEEEEDGLLLLLLAWIOSNCDARLODMTOEJXMILDFJROTKJSDNLVCZNN
Offline Profile Quote Post Goto Top
 
jdege
Member Avatar
NSA worthy
[ *  *  *  *  *  * ]
Just to keep things all in one place, there is a pattern word dictionary available online at:

http://fiziwig.com/crypto/pattern.html

The format isn't the same as what is produced above, so the match script won't work with it, but the following awk program will convert it:

Code:
 
/^[A-Z]/ {
 pattern = tolower($1)
 $1 = ""
}

{
 for (i=1; i<= NF; i++) {
  if ($i != "") print pattern, $i
 }
}
When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.
Offline Profile Quote Post Goto Top
 
fiziwig
Elite member
[ *  *  *  *  * ]
jdege
Apr 14 2008, 02:22 PM
Just to keep things all in one place, there is a pattern word dictionary available online at:

http://fiziwig.com/crypto/pattern.html

The format isn't the same as what is produced above, so the match script won't work with it, but the following awk program will convert it:

Code:
 
/^[A-Z]/[space]{
[space] pattern[space]=[space]tolower($1)
[space] $1[space]=[space]""
}

{
[space] for[space](i=1;[space]i<=[space]NF;[space]i++)[space]{
[space] [space]if[space]($i[space]!=[space]"")[space]print[space]pattern,[space]$i
[space] }
}
FYI The above link is no longer active.

The same pattern word dictionary is now available at my new domain: http://cryptopian.com/pword.txt

Offline Profile Quote Post Goto Top
 
1 user reading this topic (1 Guest and 0 Anonymous)
« Previous Topic · Utilities · Next Topic »
Add Reply