About five years ago, I started thinking about passphrases and the word lists used to generate them. At first, I only built tools to audit, and later to help create, word lists, rather than making any lists myself. In 2020, I finally started making lists of my own.
My work building word list tools has culminated in the latest versions of Tidy, which, among other things, incorporates a pretty neat process I created for making a list uniquely decodable.
Today, I’ve polished up four of my word lists and am freshly publishing them as Orchard Street Wordlists, named after the first street I lived on in Manhattan. I’m hoping that a little branding will help them get some attention and use.
The lists are composed of words taken from two sources: Google Books Ngram data and a Wikipedia word frequency project. I basically blended the most frequently used words from both sources, then cut profane and strange words. I then made all of the lists uniquely decodable using Schlinkert pruning. Crucially, this means that users can create passphrases without separators between words (e.g. “adjudicationhisssynodmanlyacculturationinextricably”), and any such passphrase can be split back into list words in only one way. That makes the lists suitable for use with password managers like KeePassXC, which allow users to omit the word separator.
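To make "uniquely decodable" a bit more concrete, here is a minimal Python sketch of a check based on the Sardinas-Patterson algorithm. To be clear about what this is not: it is not Tidy's implementation, and it is not Schlinkert pruning (which decides which words to remove). It only tests whether a list already has the property. The function names and the tiny example lists are made up for illustration, and the pairwise comparison is only practical for small lists.

```python
def dangling_suffixes(prefixes, words):
    """Return every non-empty w such that p + w == word
    for some p in `prefixes` and some word in `words`."""
    found = set()
    for p in prefixes:
        for word in words:
            if len(word) > len(p) and word.startswith(p):
                found.add(word[len(p):])
    return found


def is_uniquely_decodable(word_list):
    """Sardinas-Patterson test: True if no concatenation of list words
    can be split into list words in two different ways."""
    code = set(word_list)
    # Start with what is left over when one word is a prefix of another.
    current = dangling_suffixes(code, code)
    seen = set()
    while current:
        # A leftover that is itself a word means an ambiguous split exists.
        if current & code:
            return False
        seen |= current
        following = dangling_suffixes(code, current) | dangling_suffixes(current, code)
        current = following - seen  # only explore leftovers not yet examined
    return True


# "news" is a prefix of "newspaper", yet this list is still fine...
print(is_uniquely_decodable(["news", "newspaper"]))
# ...until "paper" is added, because "newspaper" could also be "news" + "paper".
print(is_uniquely_decodable(["news", "newspaper", "paper"]))
```

The first example shows why a list does not need to drop every prefix word to be safe without separators: "news" can stay as long as the leftover "paper" never leads to a second valid split.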
The Orchard Street Long List is a 17,576-word list. It provides a robust 14.1 bits of entropy per word, meaning a 7-word passphrase gives almost 99 bits of entropy. (It is a new version of my UD1 list.) I think it’s a pretty solid choice for use with KeePassXC if the user wants an extra bump in security (about 1.2 bits per word more than the EFF long list provides).
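For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes each word is picked uniformly at random from the list, in which case the entropy per word is simply the base-2 logarithm of the list length. The function name is my own, for illustration.

```python
import math


def bits_per_word(list_length: int) -> float:
    """Entropy, in bits, of one word chosen uniformly at random from a list."""
    return math.log2(list_length)


long_list = bits_per_word(17_576)  # Orchard Street Long List
eff_long = bits_per_word(7_776)    # any 7,776-word list, such as the EFF long list

print(f"Long List, per word: {long_list:.2f} bits")
print(f"Long List, 7 words:  {7 * long_list:.2f} bits")
print(f"Gain over a 7,776-word list: {long_list - eff_long:.2f} bits per word")
```

The per-word gain comes out to roughly 1.18 bits, which adds up to a bit more than 8 bits over a 7-word passphrase.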
The Orchard Street Medium List is my version of the classic Diceware lists of 7,776 words, like the EFF long list. The EFF long list has become popular since its release in 2016. For example, KeePassXC uses a slightly modified version of it as its default word list. I hope that my Medium List offers the slight advantage of containing more common words, since I made the list uniquely decodable with Schlinkert pruning rather than by removing all prefix words. (I also hope it’s not too confusing that my “medium” list is the same length as EFF’s “long” list, but part of my claim is that one can create a 17k-word list of English words that is usable.)
And lastly, I included two short lists from my Remote Words project. Orchard Street Alpha and Orchard Street QWERTY both have 1,296 words that are optimized for inputting into devices like smart TVs and video game consoles when using TV remotes and controllers. You can read more about these lists in this post or the Remote Words repo.
I also included versions of the Medium and Short lists with corresponding dice roll numbers (e.g. 34565 holiday), in case users want to use dice to create passphrases (see EFF’s guide on how to do that).
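As a rough illustration of how dice roll numbers line up with list positions, here is a Python sketch. It assumes labels are assigned in list order, starting at all ones, with each die face standing for one base-6 digit; the published files may have been generated differently. Both function names are hypothetical.

```python
def dice_label(index: int, dice: int) -> str:
    """Map a 0-based list position to a string of `dice` die faces (1-6)."""
    if not 0 <= index < 6 ** dice:
        raise ValueError("index does not fit in that many dice")
    faces = []
    for _ in range(dice):
        index, remainder = divmod(index, 6)
        faces.append(str(remainder + 1))  # base-6 digit 0-5 becomes face 1-6
    return "".join(reversed(faces))


def index_from_roll(roll: str) -> int:
    """Inverse of dice_label: turn rolled faces back into a 0-based position."""
    index = 0
    for face in roll:
        index = index * 6 + (int(face) - 1)
    return index


# Five dice cover a 7,776-word list (6**5); four dice cover 1,296 words (6**4).
print(dice_label(0, 5), dice_label(7775, 5))
print(index_from_roll("34565"))
```

Under this numbering, a 7,776-word list needs five dice per word and a 1,296-word list needs four.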
Currently, I’m licensing all of the lists under a Creative Commons Attribution-ShareAlike 3.0 Unported License, since that’s what Wikipedia’s text is licensed under. I’d prefer to use CC-BY-SA 4.0 if possible, but I’m not sure if that’s technically legal.
However, I admit that it may not be possible to copyright an alphabetized list of words, no matter their source or what manipulations I have performed on them beforehand. If this is the case, maybe I should use CC0. I’m not a lawyer!
If you have thoughts on licensing this project, I’ve created a related GitHub Issue, and I also welcome input on Mastodon.
Running git push on this particular project felt like the culmination of all of my work with word lists. After five years of thinking about very niche questions surrounding word lists, part of me hopes I can let these lists be as they are and maybe find something else to think about in my spare, creative moments!