| Pattern Word Dictionaries | |
|---|---|
| Topic Started: Apr 13 2008, 01:41 PM (2,040 Views) | |
| jdege | Apr 13 2008, 01:41 PM Post #1 |
|
NSA worthy
|
In another thread, we've been discussing pattern word dictionaries. These are a tool for pattern word matching, a technique that has been used in cryptanalysis basically forever.

The idea is to create a notation that indicates letter repetitions within words. People have done this in a number of ways. The letter repetitions within the word "people", for example, could be expressed as "1-4,2-6", meaning the first and fourth letters are the same, and the second and sixth are the same. Or it could be expressed as "12-1-2". You'll see forms such as these in the older crypto books. Neither is particularly easy for a computer to produce or process. The more common form these days is "123142" or "abcadb". I prefer the latter.

There are two tasks, then. One is creating a list of words with their patterns; the other is searching the list for words that match a particular pattern. Both require a routine for calculating the pattern for a given word.

I wish I could tell you I had some polished, sophisticated program for doing either of these tasks. I don't. I have a collection of Unix shell scripts, using the standard Unix text-processing tools: cat, tr, sort, awk, sed. Still, if you're running Windows, you can get a set of these tools easily from the WingGNU project. If you're running on a Mac, you really are running Unix, despite all the pretty pictures, and all of these tools are already installed on your system. All you need is to figure out how to open up a command shell.

So:

Task 0: Get a bunch of text files containing English text. Project Gutenberg has put thousands of books up on the net in plain ASCII format. Grab a selection and download them onto your computer.

Task 1: Create a word-list. From the above text files, extract all the distinct words. I use the following shell script:
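A minimal sketch of such a script, reconstructed from the four steps listed below rather than taken from the original post. The function name `make_wordlist` is an assumption, and the final `sed` is an extra step, not one of the four, that drops the empty line left behind by runs of punctuation.

```shell
#!/bin/sh
# Sketch of a word-list builder; make_wordlist is an assumed name.
make_wordlist() {
    # Step 1: print the named files (or stdin when no files are given).
    cat "$@" |
        # Step 2: convert uppercase letters to lowercase.
        tr 'A-Z' 'a-z' |
        # Step 3: turn every character that is not a-z or ' into a newline.
        tr -c "a-z'" '\n' |
        # Step 4: sort and remove duplicates (C locale for a stable order).
        LC_ALL=C sort -u |
        # Extra step (an assumption): delete the empty line that remains.
        sed '/^$/d'
}
```

It could be run as `make_wordlist *.txt > words.txt`, where `words.txt` is a hypothetical output name.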
Step 1: "cat" prints the contents of all the files that were passed on the command-line to stdout.
Step 2: "tr" reads stdin, converts all the uppercase letters to lowercase, and writes to stdout.
Step 3: "tr" converts all the characters that aren't either lowercase letters or single quotes to newlines.
Step 4: "sort" sorts stdin, removing duplicates.

The result is a file containing all the unique words.

Task 2: Create a pattern word list. This I do with awk:
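A sketch of what such a script might look like, written from the description that follows rather than copied from the original post. The function name `make_patterns` and the awk variable names are assumptions. Each new letter in a word is given the next unused letter of the alphabet, so "people" becomes "abcadb".

```shell
#!/bin/sh
# Sketch of a pattern-list builder; make_patterns is an assumed name.
make_patterns() {
    # Sort and de-duplicate the input words (files, or stdin if none).
    LC_ALL=C sort -u "$@" |
        awk '
        NF > 0 {
            word = $0
            pattern = ""
            next_code = 0        # number of distinct characters seen so far
            split("", seen)      # empty the lookup table for each word
            for (i = 1; i <= length(word); i++) {
                c = substr(word, i, 1)
                if (!(c in seen)) {
                    # first occurrence: assign the next unused pattern letter
                    seen[c] = substr("abcdefghijklmnopqrstuvwxyz", ++next_code, 1)
                }
                pattern = pattern seen[c]
            }
            # pattern, a space, then the word
            print pattern, word
        }' |
        # Byte-order sort puts lines in pattern order, then word order.
        LC_ALL=C sort
}
```

Assuming the word list from Task 1 was saved as `words.txt` (a hypothetical name), `make_patterns words.txt > wordlist.txt` would produce the file that Task 3 searches. Note that an apostrophe inside a word is treated as just another character and receives its own pattern letter.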
The shell script sorts and removes duplicates from the input (which isn't really necessary if the input is already sorted and free of duplicates, as the word-list from Task 1 is), and passes the result through the awk script, which prints the pattern and the word, separated by a space. The results are then sorted first by the pattern and then by the word. Each output line has the form "pattern word", for example "abcadb people".
The resulting file is no more than a start. As you use it, you'll find that you'll want to edit it, both to add words that aren't in it and should be, and to remove words that were in the text you started with but only confuse things. What I got from the scripts had nearly every single letter as a valid word; I removed all but 'a' and 'i'. All of the state abbreviations showed up as valid two-letter words; I removed them as well.

Task 3: Search the file for words that match a pattern. This is yet another shell script:
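A sketch of such a lookup, reconstructed from the five steps listed below and not the original script. The function name `match_pattern` and the `WORDLIST` variable are assumptions; the default file name "wordlist.txt" comes from Step 4.

```shell
#!/bin/sh
# Sketch of the lookup; match_pattern and WORDLIST are assumed names.
match_pattern() {
    # Steps 1-3: echo the arguments, compute "pattern word", keep the pattern.
    pattern=$(echo "$@" |
        awk '{
            word = $0
            pattern = ""
            n = 0
            split("", seen)
            for (i = 1; i <= length(word); i++) {
                c = substr(word, i, 1)
                if (!(c in seen))
                    seen[c] = substr("abcdefghijklmnopqrstuvwxyz", ++n, 1)
                pattern = pattern seen[c]
            }
            print pattern, word
        }' |
        awk '{ print $1 }')

    # Step 4: keep lines that start with the pattern followed by a space.
    # Step 5: print only the second field, the word itself.
    grep "^$pattern " "${WORDLIST:-wordlist.txt}" | awk '{ print $2 }'
}
```

The trailing space in the regular expression matters: without it, a search for "abcadb" would also pick up longer patterns that merely begin with "abcadb", such as "abcadbe".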
Step 1: Print the command-line arguments to stdout.
Step 2: Print the patternword-space-word.
Step 3: Print only the patternword. (Because the above is in a subshell expression, its result is assigned to the shell variable "pattern".)
Step 4: Search the file "wordlist.txt" for lines that match the regular expression "^pattern ", that is, lines that start with the pattern followed by a space.
Step 5: For the output lines, print only the second field, that is, just the word.

So, "match crbcyr" computes the pattern "abcadb" and outputs all the words in the list that have it.
|
| When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl. | |
|
| Revelation | Apr 13 2008, 04:24 PM Post #2 |
|
Administrator
|
Nice
This makes things much easier. Maybe I'll create a C-version to make sure that non-linux users can match patterns too.
|
|
RRRREJMEEEEEPVKLWENFNVJKEEEEEAOLKAFKLXCFZAASDJXZTTTTTTTLSIOWJXMOKLAFJNNKFNXN RAGRBAQEMHIGDJVDSEOXVIYCELFHWLELJFIENXLRATALSJFSLCYTKLASJDKMHGOVOKAJDNMNUITN RRRRLJVEEEEECLYVYHNVPFTAEEEEEMWLMEIRNGLARWJAKJDFLWNTIERJMIPQWOTZEOCXKNUBNXCN RJIRPOWEANFUSNCZVDVZNMSFEKLOEPZLDKDJWSAAAAAAAOERHJCTNCKFRIMVKSOFOMKMANREWNBN RZUDRGXEEEEENFQIDVLQNCKNEEEEEDGLLLLLLAWIOSNCDARLODMTOEJXMILDFJROTKJSDNLVCZNN | |
|
| jdege | Apr 14 2008, 02:22 PM Post #3 |
|
NSA worthy
|
Just to keep things all in one place, there is a pattern word dictionary available online at: http://fiziwig.com/crypto/pattern.html The format isn't the same as what is produced above, so the match script won't work with it, but the following awk program will convert it:
|
|
| fiziwig | Jan 18 2014, 10:47 PM Post #4 |
|
Elite member
|
FYI The above link is no longer active. The same pattern word dictionary is now available at my new domain: http://cryptopian.com/pword.txt |
|