Jeff Mixon / ML for password modeling: PassGAN through PassGPT, plus a rule generator

Created Thu, 09 Mar 2023 00:00:00 -0800 Modified Mon, 22 Jun 2026 16:04:57 -0700

Why model passwords at all

Password modeling is the part of password cracking that asks: given the passwords humans actually pick, what are the next million passwords most worth trying? The classical answer is rule-based — start from a wordlist, mutate it with a rule engine like hashcat’s, prioritize by frequency. The ML answer is generative — train a model on a leaked corpus, sample from it, throw the samples at hashes you are trying to crack.
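To make the rule-based side concrete, here is a minimal sketch of what a rule engine does to a wordlist. It interprets only a small subset of hashcat's rule functions (`:`, `l`, `u`, `c`, `r`, `d`, `$X`, `^X`); the helper names `apply_rule` and `candidates` are made up for illustration and are not part of hashcat.

```python
from typing import Callable, Dict, Iterable, Iterator

# Single-character rule functions that take no argument.
SIMPLE_OPS: Dict[str, Callable[[str], str]] = {
    ":": lambda w: w,                              # no-op
    "l": lambda w: w.lower(),                      # lowercase everything
    "u": lambda w: w.upper(),                      # uppercase everything
    "c": lambda w: w[:1].upper() + w[1:].lower(),  # capitalize first, lower rest
    "r": lambda w: w[::-1],                        # reverse
    "d": lambda w: w + w,                          # duplicate the word
}


def apply_rule(rule: str, word: str) -> str:
    """Apply one rule line (functions optionally separated by spaces)."""
    i = 0
    while i < len(rule):
        op = rule[i]
        if op == " ":
            i += 1
        elif op in SIMPLE_OPS:
            word = SIMPLE_OPS[op](word)
            i += 1
        elif op in "$^":
            if i + 1 >= len(rule):
                raise ValueError(f"rule function {op!r} needs a character")
            ch = rule[i + 1]
            # "$X" appends X, "^X" prepends X.
            word = word + ch if op == "$" else ch + word
            i += 2
        else:
            raise ValueError(f"unsupported rule function {op!r}")
    return word


def candidates(wordlist: Iterable[str], rules: Iterable[str]) -> Iterator[str]:
    """Yield every word mutated by every rule."""
    rules = list(rules)
    for word in wordlist:
        for rule in rules:
            yield apply_rule(rule, word)


print(list(candidates(["summer"], [":", "c $!", "u"])))
```

Each rule is a cheap, deterministic string transform, which is why a rule pass generates candidates far faster than sampling from a model.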

Both work. Neither dominates. The interesting research question over the last several years has been which combinations of the two do better than either alone.

I’ve spent a multi-year personal-research arc on this, reading the literature and re-implementing it, on both the model side and the operational-toolchain side. This post is a tour of what that arc actually was.

The model arc

Four implementations, each with a different theoretical lens:

  • PassGAN is the foundational paper — a Wasserstein GAN trained on a password corpus to produce password-shaped strings. It reignited interest in generative password modeling and is the baseline everything since has compared against.
  • IWGAN (Improved Wasserstein GAN) is the more stable training recipe for the same generative goal: a gradient penalty in place of the original WGAN’s weight clipping, with materially better convergence behavior.
  • GNPassGAN applies gradient normalization to the password-modeling problem, aiming for better coverage of the corpus distribution rather than just better samples near the mode.
  • PassGPT is the LLM-era reimagining: a transformer language model trained on a password corpus, generating passwords token by token rather than as fixed-length strings. The strengths and weaknesses are exactly what you’d expect from making that swap — more flexible, less efficient per-sample, very different failure modes.
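The "token by token" generation in the last bullet can be sketched without a transformer. In the example below a character bigram count table stands in for the trained language model; that substitution is an assumption made to keep the example tiny, and it is not how PassGPT is built. The sampling loop has the same shape either way: condition on what has been generated so far, draw the next token, stop at an end marker.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List

# Arbitrary control characters used as start/end markers (illustrative choice).
BOS, EOS = "\x02", "\x03"


def fit_bigram(corpus: List[str]) -> Dict[str, Counter]:
    """Count next-character frequencies: a crude stand-in for a trained LM."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for pw in corpus:
        prev = BOS
        for ch in pw + EOS:
            counts[prev][ch] += 1
            prev = ch
    return counts


def sample_password(counts: Dict[str, Counter], rng: random.Random,
                    max_len: int = 16) -> str:
    """Generate one password character by character until EOS or max_len."""
    out: List[str] = []
    prev = BOS
    while len(out) < max_len:
        options = counts[prev]
        chars = list(options.keys())
        weights = list(options.values())
        ch = rng.choices(chars, weights=weights, k=1)[0]
        if ch == EOS:
            break  # the model decided the password is finished
        out.append(ch)
        prev = ch
    return "".join(out)


rng = random.Random(0)
model = fit_bigram(["abc123", "abc!"])
samples = [sample_password(model, rng) for _ in range(20)]
print(sorted(set(samples)))
```

Output length is decided by the model emitting the end marker, not by a fixed-size output tensor, which is the flexibility the bullet refers to. The price is one model call per character.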

I re-implemented or ported all four — PassGAN and its derivatives in PyTorch (with a TensorFlow 2 port of the original in flight at one point), PassGPT in the HuggingFace ecosystem. The reproductions are not contributions in themselves — the papers exist — but the act of reproducing them is what builds the intuition for which one is the right tool for a given operational situation.
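For reference, the difference between the clipped Wasserstein critic and the gradient-penalty version is easiest to see in the loss. The notation below follows the Improved WGAN paper (Gulrajani et al.) as I understand it: D is the critic, P_r the real password distribution, P_g the generator's distribution.

```latex
\begin{align*}
% Original WGAN critic loss; the Lipschitz constraint is enforced
% by clipping every critic weight to [-c, c] after each update.
L_{\text{clip}} &= \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[D(\tilde{x})\big]
                 - \mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big] \\[4pt]
% Gradient-penalty critic loss; no clipping, a soft constraint instead.
L_{\text{GP}} &= \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[D(\tilde{x})\big]
               - \mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big]
               + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
                 \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big] \\[4pt]
\hat{x} &= \epsilon x + (1 - \epsilon)\,\tilde{x},
\qquad \epsilon \sim U[0, 1]
\end{align*}
```

The penalty term pushes the critic's gradient norm toward 1 at points interpolated between real and generated samples. The paper's experiments use λ = 10.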

Pantagrule — the rule-side artifact

On the rule side, I built and maintain pantagrule: a Python generator that takes a password corpus and produces hashcat-compatible rule files derived from the mutations actually observed in real-world breaches. The input is whatever corpus you point it at.

It is not glamorous and it is the part I use most. A well-tuned set of rules applied to a focused wordlist will outperform almost anything you do with a generative model on a constrained hashing budget. The generative side becomes interesting when you’ve exhausted the rule-based candidates and need to keep going.
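The general idea of deriving rules from observed mutations can be sketched in a few lines. This is not pantagrule's actual algorithm. It is a deliberately narrow illustration that assumes a lowercase wordlist and recognizes only an optional case change followed by appended characters; the names `infer_rule` and `mine_rules` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set


def infer_rule(base: str, password: str) -> Optional[str]:
    """Return a hashcat rule that turns `base` into `password`, or None.

    Handles only: optional case change of the stem, then appended characters.
    """
    stem, suffix = password[:len(base)], password[len(base):]
    if stem == base:
        case_op = ""
    elif stem == base[:1].upper() + base[1:].lower():
        case_op = "c"  # capitalize first letter, lowercase the rest
    elif stem == base.upper():
        case_op = "u"  # uppercase everything
    else:
        return None    # a mutation this sketch does not model
    parts = ([case_op] if case_op else []) + [f"${ch}" for ch in suffix]
    return " ".join(parts) if parts else ":"


def mine_rules(words: Set[str], passwords: Iterable[str],
               top_n: int = 10) -> List[str]:
    """Count inferred rules over a corpus and return the most frequent ones."""
    counts: Counter = Counter()
    for pw in passwords:
        # Try the longest stem first so "password1" matches "password".
        for cut in range(len(pw), 0, -1):
            base = pw[:cut].lower()
            if base in words:
                rule = infer_rule(base, pw)
                if rule is not None:
                    counts[rule] += 1
                    break
    return [rule for rule, _ in counts.most_common(top_n)]


words = {"password", "summer", "dragon"}
leak = ["Password1", "Summer1", "dragon1", "DRAGON", "password"]
print(mine_rules(words, leak))
```

Ranking by observed frequency is what makes the resulting rule file corpus-specific: the rules that explain the most real passwords are tried first.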

The closed loop

The thing the literature mostly doesn’t talk about is how the two sides combine in practice. The shape I converged on:

  1. Generate with the rule pipeline first. Its candidates are the cheapest to produce per hash attempt. Run until rule-driven candidates plateau in cracks-per-hour.
  2. Switch to generative candidates — PassGPT for diversity, IWGAN/GNPassGAN for in-distribution depth — and run those into the same distributed cracker.
  3. Feed back. The new cracks become input to the next round of pantagrule rule generation, which biases the next rule pass toward the patterns that worked on this corpus.
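The handoff in steps 1 and 2 depends on deciding when a phase has plateaued. One way to express that decision is sketched below. The window size, the 10%-of-peak threshold, and the function names are assumptions picked for illustration, not tuned values from real runs.

```python
from typing import Dict, List, Sequence


def has_plateaued(cracks_per_hour: Sequence[float], window: int = 3,
                  floor_ratio: float = 0.1) -> bool:
    """True when the mean of the last `window` hours drops below
    `floor_ratio` times the best hour seen so far in this phase."""
    if len(cracks_per_hour) < window:
        return False
    peak = max(cracks_per_hour)
    if peak == 0:
        return True  # nothing cracked for a full window: move on
    recent = sum(cracks_per_hour[-window:]) / window
    return recent < floor_ratio * peak


def schedule(hourly_rates: Dict[str, List[float]]) -> List[str]:
    """Replay recorded hourly crack rates and return which phase ran each hour.

    Rules run first; the generative phase starts once rules plateau.
    """
    timeline: List[str] = []
    for phase in ("rules", "generative"):
        seen: List[float] = []
        for rate in hourly_rates[phase]:
            seen.append(rate)
            timeline.append(phase)
            if has_plateaued(seen):
                break
    return timeline


rates = {
    "rules": [100, 60, 8, 5, 4, 3, 2],
    "generative": [20, 15, 1, 1, 1, 1],
}
print(schedule(rates))
```

Step 3 then closes the loop: the plaintexts recovered in both phases become the corpus for the next round of rule generation.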

The distributed cracker is Hashtopolis, a coordinator/agent system in which multiple GPU machines run a Python agent that pulls tasks from a central server, runs hashcat against them, and reports back. It is the layer that turns “I have an idea about which candidates to try” into “I have eight GPUs trying them at once.”

The loop is not novel; the specific shape of which model goes after which kind of remaining hash is where the operational judgment lives, and you only develop that by running the loop many times.

What the research arc actually taught me

The thing the arc taught me that the papers themselves did not: the right comparison metric for a password model is not perplexity, it is expected cracks per compute-hour against a held-out hash set. Models that look better on language-modeling metrics frequently look worse on the operational metric, because the operational metric weights coverage of common-but-non-modal mass differently than perplexity does. Once you internalize this, a lot of the apparent contradictions between published results stop being contradictions and start being differences in what the authors were actually optimizing for.
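A minimal sketch of the operational metric, under simplifying assumptions: held-out plaintexts stand in for the hash set (a real evaluation would hash each candidate), and a fixed `guesses_per_hour` stands in for measured cracker throughput. Both function names are hypothetical.

```python
from typing import Iterable, Set


def cracks_within_budget(candidates: Iterable[str], held_out: Set[str],
                         budget: int) -> int:
    """Count distinct held-out passwords hit by the first `budget` guesses.

    Duplicate candidates still consume budget, as they would on a real
    cracker, but each held-out password is counted once.
    """
    hit: Set[str] = set()
    for i, cand in enumerate(candidates):
        if i >= budget:
            break
        if cand in held_out:
            hit.add(cand)
    return len(hit)


def cracks_per_hour(candidates: Iterable[str], held_out: Set[str],
                    hours: float, guesses_per_hour: int) -> float:
    """Distinct held-out cracks divided by the hours of guessing allowed."""
    budget = int(hours * guesses_per_hour)
    return cracks_within_budget(candidates, held_out, budget) / hours


held_out = {"a", "b", "c"}
stream = ["a", "a", "x", "b", "c"]  # note the wasted duplicate "a"
print(cracks_per_hour(stream, held_out, hours=2, guesses_per_hour=2))
```

Two properties of this metric have no counterpart in perplexity: wasted guesses (duplicates, or candidates a previous phase already tried) cost budget, and a correct guess only counts if it arrives before the budget runs out.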

The other thing it taught me: this is genuinely a place where ML practice and security operations have something to teach each other, and where most people on either side haven’t met someone fluent in the other. That gap is the slot I have been working in.

Closer

The model side and the rule side are not in opposition. The papers tend to compare a new generative model against rule-based baselines as if the question is which to use; the operational answer is to use both, in the right order, with the right handoff. That handoff is where the interesting engineering is, and it is what most of this research arc has actually been about.