The math is easy. Can you explain why he's wrong?

Assume I use diceware and assume I give you the dictionary that I used to generate passwords (7776 words). Assume I tell you my password is at most 6 words long. Calculate the key space taking into account what you know:

7776^6 = 2.2E23

Compare to a 12 character "random, but easily typed character" password: a-zA-Z0-9 and all of the typical symbols: !@#$[];',. etc. Let's just call it 80 characters.

Sigma(n=1,12) 80^n = 7.0E22

So 6 random words from a dictionary that the attacker knows give a search space roughly 3 times larger (the next order of magnitude up) than 12 random characters.
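The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same calculation: the 7776-word list and the 80-character alphabet are the figures from the post, the function names are made up for illustration, and the bit counts are just log2 of the same two numbers.

```python
import math

DICEWARE_WORDS = 7776  # word list size, as stated in the post
ALPHABET = 80          # the post's round figure for "easily typed" characters


def exact_length_space(symbols: int, length: int) -> int:
    """Number of sequences of exactly `length` symbols."""
    return symbols ** length


def up_to_length_space(symbols: int, max_length: int) -> int:
    """Number of sequences of 1..max_length symbols (the Sigma in the post)."""
    return sum(symbols ** n for n in range(1, max_length + 1))


diceware = exact_length_space(DICEWARE_WORDS, 6)
random_chars = up_to_length_space(ALPHABET, 12)

print(f"6 diceware words:       {diceware:.2e}")
print(f"up to 12 random chars:  {random_chars:.2e}")
print(f"ratio:                  {diceware / random_chars:.1f}x")
print(f"bits: {math.log2(diceware):.1f} vs {math.log2(random_chars):.1f}")
```

Python integers are arbitrary precision, so the powers are exact; only the printed values are rounded.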

My comparison assumes the 'best case' for random passwords: brute force search of the entire key space. I also assumed the worst case for diceware passwords (the attacker knows exactly which words are valid in my password, that I used only lower case letters to type them, that it's exactly 6 words long - not 4, not 7) and diceware still comes out ahead of 12 random characters. Bumping it to 16 random characters vs 6 random words does not erase the advantage diceware has if you allow me a minor change like "maybe I don't use spaces" or "maybe I capitalize some words".

The XKCD comic restricted the comparison space - he assumed the attacker knew the strategies in both cases and tuned his algorithm accordingly. He was also considering the common advice to start with a random word and modify it in some way - that ends up with much less entropy than a purely random password. I tried to correct for these shortcomings in my example just to show that his advice still holds.

In his example, and given his concerns (how hard is it to generate and memorize a strong password), things favour the random words approach even more. If the attacker doesn't have information about what passwords should look like and resorts to brute forcing the entire a-z0-9+symbols search space, then the longer password will be stronger - that tends to favour diceware for the reason he highlighted.

Using your recommended site to evaluate passwords:

First I used diceware to make a password of 6 random words (the minimum recommended length):

- Password: cash party island beset waxen coil
- Search Space: 1.65E60
- Massive Cracking Array Scenario: 5.23 trillion trillion trillion centuries

Using keychain to generate a 12 character random password:

- Password: zXn6(iy77&:r
- Search space: 5.23E23
- Massive Cracking Array Scenario: 1.74 centuries

Note that the advantage calculated here is much higher than in my example, because the calculator assumes the attacker only knows that he has to search a-z plus spaces, not that he can restrict his key space to combinations of a specific list of 7776 words.

Assume compute speed doubles every year and that 1.74 centuries starts looking pretty damn small. If you're sending 'sexy pictures' with a 12 character password to a mistress now, they'll be pretty easy to crack (about 2 months) in 10 years, when your wife is looking to divorce you over a history of cheating. What are the odds those files end up lying about in a Gmail account waiting for a subpoena?
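To put a number on that shrinkage, here is a rough sketch. It assumes, as the paragraph above does, that attacker speed doubles once a year, and it takes the 1.74 centuries figure from the calculator output; the function name and the one-year doubling period default are illustrative choices, not anything the calculator provides.

```python
def crack_time_after(initial_years: float, years_elapsed: float,
                     doubling_period_years: float = 1.0) -> float:
    """Crack time (in years) for an attack started `years_elapsed` from now,
    assuming attacker speed doubles every `doubling_period_years`."""
    speedup = 2 ** (years_elapsed / doubling_period_years)
    return initial_years / speedup


TODAY_YEARS = 174.0  # 1.74 centuries, from the calculator output above

for wait in (0, 5, 10, 20):
    remaining = crack_time_after(TODAY_YEARS, wait)
    print(f"attack started in {wait:2d} years: "
          f"{remaining:10.4f} years ({remaining * 12:9.2f} months)")
```

Ten doublings is a factor of 1024, so a figure measured in centuries today drops to a matter of months under this assumption. A slower doubling period can be tried by passing `doubling_period_years=2.0`.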

In order to reach the same "durability" as I had with diceware I had to use a 30 character random password. That seems to demonstrate exactly the point Randall was making: a few random words are just as strong as, and infinitely easier to memorize than, random passwords or the common strategy of mangling an uncommon word in predictable ways.

You're right, it's not completely the same, but the fact that it's using real words from a dictionary means it's not all that strong either. Essentially the difference is between a 4 character password where each character can be one of ~70 choices and a 4 character password where each character can be one of ~10,000 choices (arbitrary example). Yes, it is stronger, but it still takes a sane amount of time to crack. An order of magnitude, as you have calculated, is not really that much stronger in terms of passwords. The mistake that Randall made is exactly the one that you pointed out in the haystack calculator that I linked - it doesn't take into account dictionary attacks. Steve Gibson's method, on the other hand, is not vulnerable to a dictionary attack (of course, it might have other weaknesses of its own).