Tidy v 0.2

I’m excited to announce that I’ve added a slew of new features to Tidy, my Rust command-line tool that helps users combine and clean large word lists.

Some of the more interesting new features include the ability to:

guarantee a maximum shared prefix length
enforce a minimum edit distance between words
print corresponding dice rolls before words, separated by a tab. E.g. 35466 ladle
cut generated list to set length by randomly removing words.

This is in addition to being able to remove prefix words and other useful functionality included in earlier versions.

To install Tidy, check the README.

What is “maximum shared prefix length”?

This feature was inspired by a quality of one of the EFF’s short word lists, for which the following is true:

Each word has a unique three-character prefix. This means that future software could auto-complete words in the passphrase after the user has typed the first three characters.

I thought this was a pretty neat feature of a word list, so I decided to make this a new option in Tidy.

In Tidy v 0.2, users can use the -x flag to set what I’m calling a “maximum shared prefix length”. Setting this value to say, 4, means that knowing the first 4 characters of any word on the generated list is sufficient to know which word it is. As an example, we’d know that if a word starts with “radi”, we know it must be the word “radius” (if “radical” had been on the list, it Tidy would have removed it).

This is useful if you intend the list to be used by software that uses auto-complete. For example, a user will only have to type the first 4 characters of any word before a program could successfully auto-complete the entire word.

(Users of Tidy v 0.2 can use the “attributes” flag twice (-AA) to get information about shared prefix length for a generated list. Tidy will print both “Longest shared prefix” and “Unique character prefix”, which is longest shared prefix + 1.)

Enforcing a minimum edit distance

That same EFF short list also had the nifty feature that:

All words are at least an edit distance of 3 apart. This means that future software could correct any single typo in the user’s passphrase (and in many cases more than one typo).

Thus, I added an option in Tidy to enforce a minimum edit distance in the generated word list (using -d flag). I did this by swiping an edit_distance function from this repository of common algorithms written in Rust (MIT licensed!). I think it’s correct! And it saved me a big headache!

Again using this function, Tidy can also calculate the minimum edit distance of list (use -AA).

How short can the shortest word on a word list (safely) be?

Through a discussion with a fellow Fediverse user, I started to think if there was a mathematical way to determine how short the shortest word on a word list could be while maintaining the wonderful entropy-per-word guarantees that lead us to use these word lists to create passphrases in the first place.

In fact, Joseph Bonneau, one of the creators of the EFF word lists, writes in a 2016 blog post announcing the new word lists that this ease of computing the entropy of resulting passphrase is a key advantage of the method:

Estimating the difficulty of guessing or cracking a human-chosen password is very difficult. It was the primary topic of my own PhD thesis and remains an active area of research. (One of many difficulties when people choose passwords themselves is that people aren’t very good at making random, unpredictable choices.)

Measuring the security of a randomly-generated passphrase is easy. The most common approach to randomly-generated passphrases (immortalized by XKCD) is to simply choose several words from a list of words, at random. The more words you choose, or the longer the list, the harder it is to crack. Looking at it mathematically, for k words chosen from a list of length n, there are n^k possible passphrases of this type. It will take an adversary about (n^k)/2 guesses on average to crack this passphrase.

Later in the post, writing about the then-new EFF long list, Bonneau notes that “We took all words between 3 and 9 characters from the list”, but doesn’t explain the reason they chose a 3-character minimum.

We can wonder: If we had a 7,776 word list had a lot of 1- and 2-character words on it, would the passphrases generated from that list be as secure as from the EFF long list (which also has 7,776 words, but a minimum word length of 3)? My gut says no, but I wanted to try to figure out an answer grounded in (my very amateur) math.

An attempt at mathematically determining a safe minimum word length

The EFF long list has 7,776 words on it, so we can expect that adding one word from it to a passphrase adds about 12.925 bits of entropy (log2(7776)). So as an example, a three-word passphrase would have 38.775 bits of entropy. This is the math Bonneau is referring to when he writes “Measuring the security of a randomly-generated passphrase is easy.” And I agree – it’s easy!

But what if all three words we got for our passphrase were three-letter words? Or what if we used that hypothetical list that has “a” on it? Randomly selecting words from the list, we hypothetically could get a passphrase of “a-a-a”. Can we claim that “a-a-a” has 38.775 bits of entropy? A brute-force alphabet attack (putting aside the hyphens) would suggest an entropy of 3 * log2(26) which is 14.10 bits! We’ve greatly over-estimated the amount of entropy per word.

I contend that this edge-case thought experiment may be useful when we ask what the shortest word on a good word list should be. There may be an established method for determining what this minimum word length should be, but if there is I don’t know about it yet. Here’s the math I’ve worked out on my own.

Entropy per letter

This mental exercise of taking the shortest word from the list and assuming the worst-case – that the user generates a passphrase entirely of words of that shortest length – gave me an idea.

Let’s return to the list with “a” on it. Given that the list is 7,776 words long, we’re expecting every word to provide 12.925 bits of entropy. Since this list has at least one 1-character word on it, we’re effectively claiming “The word ‘a’ provides 12.925 bits of entropy.” We’re asserting that the entropy of single letter is 12.925.

This is clearly false, since there are only 26 lowercase English letters. At best, a single letter adds 4.7 bits of entropy (log2(26)). As we saw before, we’re over-estimating entropy per word.

Let’s run through another example. Let’s say we wise up and replace all 1-character words with longer words, but we keep some 2-character words like “of”, “to”, “if” etc. Do we still make a bad assumption? Let’s see.

In this case, we’re expecting the word “of” to provide 12.925 (log2(7776)) bits of entropy. But when we realize there are only 26 letters in English, we see that, at best, two of these letters only generates 2 * log2(26) == 2 * 4.7 == 9.4 bits of entropy. Again, 12.925 is an overestimate!

Now that we’ve run through two examples, let’s see if we can generalize this checking process.

If we take the entropy per word from a list (log2(list_length)) and divide it by the length of the shortest word on the list, we get a value we might call “assumed entropy per letter”. In our first example with “a” on the list, we’d get as assumed entropy per letter of log2(7776) / 1 == 12.925. In the second example with “of” on the list, we’d get log2(7776) / 2 == 6.463.

Comparing both of these numbers (12.925 and 6.463) to the more realistic assumption of entropy per letter, log2(26) == 4.7, we see our same overestimates calculated differently.

Kind of feels like we should aim for an “assumed entry per letter” of, at least, under 4.7 bits… We might call this “the brute force line”.

The “brute force line”

If this “assumed entropy-per-letter” number is greater than 4.7, we can say that the list falls below what I’m calling the brute-force line (I’m using “below” here because it corresponds to the minimum word length being too low). I call the brute-force line because the 4.7 bits-per-letter is a representation of a brute force attack in which an attacker would try every letter combination.

The EFF long list clears this line, since log2(7776) / 3 == 12.92 / 3 == 4.3 bits per letter, which is less than 4.7.

Checking our work against advice from Arnold Reinhold

The creator of a popular diceware list, researcher Arnold Reinhold, writes in an FAQ on diceware passphrases:

Why shouldn’t I use a Diceware passphrase shorter than 19 characters?

One way someone might use to find your passphrase is to write a computer program that tries all combinations of characters up to some length. If your Diceware passphrase is very short, such a program would come up with you passphrase eventually. Using a passphrase that is at least 19 characters in length, including the spaces between the words, makes such an attack as difficult as searching all six word Diceware passphrases. By the way, it is very unlikely that the dice will give you a passphrase that short.

First of all, it’s nice to see a published researcher worrying about the same thing I am! Specifically, his warning against 6-word passphrases, created from his 7,776-word list, that are fewer than 19 characters. Let’s see if we can explain how he got that number 19.

A 6-word passphrase from a 7,776-word list should give us 77.55 bits on entropy. Let’s say we have a 6-word passphrase that just meets Reinhold’s minimum of 19 characters. We can calculate the entropy per letter as 77.55/19 == about 4.08 bits per letter. This clears our brute-force line of 4.7 bits per letter, but not by a ton, offering us hope that Reinhold did some similar math when he wrote the 19-character minimum.

But I’m not sure this is how Reinhold got to that 19 minimum, since both 18-character and 17-character passphrases also clears the brute-force line. (Running through that math: we assume the 18-character passphrase has 77.55 bits of entropy, which means we’re assuming each letter contributes 77.55/18 == 4.31 bits of entropy, which is safely below 4.7. Likewise, a 17-character passphrase clears it with 4.56 bits per letter.) Or maybe he integrated the 5 word-separating spaces (plus the numbers and punctuation marks in his list?) into his math to get to something closer to that 4.08 number. Or maybe he just wanted to give advice with a bit of safety cushion.

A “stricter” line

Let’s make yet another leap.

Of course, there are many combinations of two and three letters that are not words in English, and thus, assuming the words on the list are English, have a low chance of appearing on the generated word list.

If we go by a 1951 Claude Shannon paper (page 54), each letter in English actually only gives 2.62 bits of entropy. Users can see if their generated word list falls above this (stricter) line, which I’m calling the “Shannon line,” by using the -A/--attributes flag.

Interesting, the EFF long list does not clear this line, since log2(7776) == 12.92 / 3 == 4.3 bits per letter, which is greater than 2.62.

Does any of this realistically matter?

I don’t really know! I’m not a mathematician or cryptographer. For one thing, the odds of getting a three-word passphrase of 3-letter words from the EFF list is about 0.0001%. But more importantly, I’m not even sure how an attacker could effectively use this information to crack a passphrase faster.

Like, in 2018, 1Password’s word list of 18,328 words had at least some 3-letter words on it, meaning it had an assumed entropy per letter of 4.72 – below both our lines. And yet, when they held a passphrase-cracking competition that year, I don’t think any participants used a method other than running through all 18328^3 possible combinations. (Here’s my blog post about that challenge.)

How I folded some of this math into Tidy

Undeterred, and assuming all the above is sound math, I went ahead and had Tidy v 0.2 calculate this “assumed entropy per letter” metric for generated lists (use the -A flag to see it). It doesn’t affect the generated word list in any way – it just informs users of how their generated list measures up (and only if they use that option flag).

Here’s the Rust code, found in the display_information.rs file:

fn assumed_entropy_per_letter(list: &[String]) -> f64 {
    let shortest_word_length = get_shortest_word_length(list) as f64;
    let assumed_entropy_per_word = calc_entropy(list.len());

    assumed_entropy_per_word / shortest_word_length
}

For convenience, Tidy also checks if this assumed_entropy_per_letter value is greater than 4.7 or 2.62.

What are prefix words (aka prefix codes)?

Tidy v0.2 retains the ability to remove prefix words from a given word list.

A word list that doesn’t have any prefix words (also known as “prefix codes”) can better guarantee more consistent entropy when combining words from the list randomly and without punctuation between the words.

As a brief example, if a list have “boy”, “hood”, and “boyhood” users who specified they wanted two words worth of randomness (entropy) might end up with “boyhood”, which an attacker guessing single words would try. Removing prefix words – in this case “boy” – prevents this possibility from occurring.

You can read more about this issue here.

Printing list attributes information

Given these new features, I thought it’d also be nice if Tidy could print attributes of a given list to the terminal.

Running Tidy v 0.2 on the EFF long list with flags -AA (tidy -AA -ts --dry-run eff.txt), Tidy will print:

List length               : 7776
Length of shortest word   : 3 (aim)
Length of longest word    : 9 (zoologist)
Free of prefix words      : true
Entropy per word          : 12.9248
Assumed entropy per letter: 4.3083
Above brute force line    : true
Above Shannon line        : false
Shortest edit distance    : 1
Longest shared prefix     : 8
Unique character prefix   : 9

Here are the same attributes on the current 1Password list:

List length               : 18176
Length of shortest word   : 3 (abe)
Length of longest word    : 8 (zwieback)
Free of prefix words      : false
Entropy per word          : 14.1497
Assumed entropy per letter: 4.7166
Above brute force line    : false
Above Shannon line        : false
Shortest edit distance    : 1
Longest shared prefix     : 7
Unique character prefix   : 8

Cutting lists down and printing dice rolls

Since dice are often used in conjunction with these word lists, I also wanted Tidy to be able to print dice roll information next to words.

aback
abandons
abated
abbey
abbot
abbreviated
abdomen
abducted
aberrant
abide
ability
abject
abnormally
etc.

Up for a challenge, I allowed the user to set the number of sides of their dice to between 2 and 36. This involved some fun handling of non-base-10 numbers, a bit of zero-padding, and a challenging off-by-one error!

use radix_fmt::*;
pub fn print_as_dice(n: usize, base: u8, list_length: usize) -> String {
    // Set width for zero-padding

    // First, get the literal width of the largest number we'll be printing.
    // This is, by definition the length of the list.
    // We want the length of the number in the base we want to print all
    // the numbers, so use radix function.
    let n_as_base = radix(n, base);

    // Pad dice roll numbers with zeros
    let n_width = n_as_base.to_string().len();
    let pad_width = radix(list_length - 1, base).to_string().len();

    let mut padded_n = String::new();
    for _i in n_width..pad_width {
        padded_n.push('0');
    }
    // Now that we have the appropriate number of zeros
    // in `padded_n`, it's time to add our number
    padded_n += &n_as_base.to_string();

    // Print the dice rolls in slightly different ways,
    // depending on the value of the base.
    match base {
        // Values of 0 and 1 should have been caught earlier,
        // so we'll panic! if we have them here
        0 | 1 => panic!("Too few dice sides entered"),
        // If base is 2 or 3, just print as-is, zero-indexed.
        2 | 3 => padded_n,
        // If base is a common dice size (between 4 and 8), we'll add
        // one to each digit, to make it one-indexed and thus easier to 
        // compare to actual rolled dice
        4..=8 => padded_n
            .chars()
            .map(|ch| (ch.to_string().parse::<usize>().unwrap() + 1).to_string())
            .collect::<String>(),
        // If base is over base 9, we'll print each digit as zero-index, but
        // we'll add a hyphen _between_ values to make it easier to read.
        9..=36 => padded_n
            .chars()
            .map(|ch| ch.to_string() + "-")
            .collect::<String>()[0..padded_n.chars().count() * 2 - 1]
            .trim()
            .to_string(),
        _ => panic!("Amount of dice sides received is too high"),
    }
}

However, these diceware lists are often a set length so they use every possible dice roll (to use the EFF lists as an example once again: 6^5 == 7,776 words for the long list, and 6^4 == 1,294 words for their short lists). So I also needed to implement a way to cut lists to a set length. I made sure to make these cuts randomly, rather than cut, say, the last 100 words from an alphabetized list.

// User can cut words from nearly finished list.
// Does so randomly.
tidied_list = match req.cut_to {
    Some(amount_to_cut) => {
        let mut rng = thread_rng();
        tidied_list.shuffle(&mut rng);
        tidied_list.truncate(amount_to_cut);
        tidied_list
    }
    None => tidied_list,
};
// Finally, sort and de-duplicate list (for final time)
tidied_list = sort_and_dedup(&mut tidied_list);

Users can do this with the -c/--cut-to flag. And since I expect users to frequent use exponent math to calculate this particular value, Tidy can now take such math in this option in the form of 6**5.

/// Parse user's input to the `cut_to` option, either directly as a `usize`,
/// or, if they entered Python exponent notation (base**exponent). Either
/// way, return a `usize` or `expect`/`panic!`.
///  
/// This is useful when making lists fit to a specific amount of dice and
/// dice sides. (As an example, five rolls of a six-sided dice would be: 6**5).
fn eval_cut_length(input: &str) -> usize {
    match input.split("**").collect::<Vec<&str>>().as_slice() {
        [] => panic!("Please specify a number."),
        [num_string] => num_string
            .parse::<usize>()
            .expect("Unable to parse cut-to! Enter a number or a base**exponent"),
        [base_string, exponent_string] => {
            let base: usize = base_string
                .parse::<usize>()
                .expect("Unable to parse base of cut-to!");
            let exponent: u32 = exponent_string
                .parse::<u32>()
                .expect("Unable to parse exponent of cut-to!");
            base.pow(exponent)
        }
        _ => panic!("You can only specify one exponent! Use format: base**exponent"),
    }
}

An example

These two features make Tidy pretty handy! For example, here’s how to make a 7,776-word list from 1Password’s word list, removing prefix words and guaranteeing 4 characters can auto-complete any word, and add corresponding 6-sided dice role for each word.

tidy -P -x 4 --cut-to 6**5 --dice 6 1password-2021.txt

Sold?

For up-to-date instructions on how to install Tidy, check the README.