This may be old news to some of you, but I just noticed the other day that Amazon has added a whole panel of "text stats" for many of its books. I noticed it because my last book The Ghost Map just came out in paperback (go read it people -- it's a lot more fun than this post will turn out to be) and so I'm back into the swing of checking Amazon a few times a day. Text Stats is a pretty wonky page -- everything from some of the "readability" indices, to overall word count, to what Amazon calls "Fun stats" like "Words per dollar." (Quotes you never hear at Barnes and Noble: "This copy of Infinite Jest is such a bargain at only 39,574 words per dollar!")
But the two stats that I found totally fascinating were "Average Words Per Sentence" and "% Complex Words," the latter defined as words with three or more syllables -- words like "ameliorate", "protoplasm" or "motherf***er." I've always thought that sentence length is a hugely determining factor in a reader's perception of a given work's complexity, and I spent quite a bit of time in my twenties actively teaching myself to write shorter sentences. So this kind of material is fascinating to me, partially because it lets me see something statistically that I've thought a great deal about intuitively as a writer, and partially because I can compare my own stats to other writers' and see how I fare. (Perhaps there's a literary Rotisserie league lurking somewhere on those Text Stats pages.)
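For anyone who wants to try these two numbers on their own prose, here is a minimal Python sketch. It is not Amazon's method, which isn't documented in this post: the sentence splitter, the vowel-group syllable counter, and the function names (`count_syllables`, `text_stats`) are all assumptions made for illustration, and the syllable heuristic will miscount some English words.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption, not Amazon's rule): count vowel groups,
    then drop one for a likely silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)

def split_sentences(text):
    """Naive split on runs of ., ! or ? (abbreviations will fool it)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def text_stats(text, complex_threshold=3):
    """Return average words per sentence and the percentage of 'complex' words,
    where complex means three or more syllables, as defined above.
    Assumes the text contains at least one sentence and one word."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= complex_threshold]
    return {
        "words_per_sentence": len(words) / len(sentences),
        "pct_complex": 100 * len(complex_words) / len(words),
    }
```

Calling `text_stats` on a chapter's worth of plain text gives a rough analogue of the two figures plotted below; expect the percentages to drift a little from Amazon's, since the syllable counting here is only approximate.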
So I spent a few hours last week plugging in the numbers for my books, as well as a few other authors that I assembled in an entirely unscientific fashion: Malcolm Gladwell, Steven Pinker, Seth Godin, Christopher Hitchens -- and then, just to see how far I'd come, I threw in my intellectual (and, sadly, stylistic) heroes from my early twenties, the post-structuralist legends Michel Foucault and Fredric Jameson. I compiled stats for 3-4 books for each author, except Gladwell who has written two, and then plotted them on a scatter chart, with the y axis representing % complex words and the x axis representing words per sentence. The results were pretty fascinating:
Some observations:
1. There's a clear cluster of Hitchens/Johnson/Pinker in the center. (From eyeballing some other Amazon pages, I think Dawkins, Michael Pollan, E. O. Wilson would have been in that general area as well.) But what I thought was so striking was that even in that cluster, each author's books are closer to his other books than they are to the other two authors' books. In other words, each of us has a certain sweet spot of complexity that we come back to book after book. My first and last books, Ghost Map and Interface Culture, had the exact same words per sentence, down to the decimal point: 24.6. (My longest sentences turned out to be in Emergence, followed closely by Everything Bad at 25.8 and 25.7.) Pinker tends to be just slightly less complex syntactically (with the one outlier Blank Slate, which is more complex than anything I've written). And Hitchens tends to write longer sentences by a couple of words.
2. Gladwell's sentences are fully 25% shorter than mine. I'm not sure if the average reader would notice the difference between the Johnson/Hitchens/Pinker cluster, but a 25% drop in sentence length has to alter the reading experience dramatically. Clearly, the only things separating me from selling ten million copies of my books are those extra 6.5 words per sentence.
3. Check out Foucault and Jameson. They are literally on another planet. The top spot goes to Jameson's "Postmodernism" book, which I read like scripture my first year of grad school: 53 words per sentence! Interestingly, most of the variation shows up in sentence length, not in word complexity -- you often hear people complain about the impenetrable jargon of critical theory, but it looks here like the sentence length is at least as much of a culprit.
4. I would love to see some stats on dynamic range here: not just average sentence length, but how much the sentence lengths vary over the course of each book. One of the things I learned when I started writing in a less academic style (largely when I was doing FEED) is the importance of throwing in a very short sentence for emphasis at regular intervals. (Come to think of it, I may have learned this from reading Gladwell's early pieces in the New Yorker.)
5. Is there a Literature grad school version of the Lazy Web? If so, I would love to see a study that cross-referenced sales and syntactical complexity across thousands of books and determined who had the highest sales-to-complexity ratio of all time.
6. After looking at the Jameson number, I went back to one of my papers from junior year at Brown to see how awful my prose was. I pulled up the scariest sentence in the first paragraph and did a quick word count: 75 words. 75! And no semi-colons either. I bet Fred Jameson's pretty psyched I never finished that PhD...
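The "dynamic range" wish in item 4 is easy to approximate at home. The sketch below is one possible reading of it, not something Amazon's Text Stats offers: it reports the spread of sentence lengths (standard deviation, shortest, longest) alongside the mean. The function names and the naive punctuation-based sentence splitter are assumptions for illustration.

```python
import re
import statistics

def sentence_lengths(text):
    """Word count of each sentence, using a naive split on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    return [n for n in lengths if n > 0]

def length_spread(text):
    """Summarize how much sentence length varies, not just its average.
    Assumes the text contains at least one sentence."""
    lengths = sentence_lengths(text)
    return {
        "mean": statistics.mean(lengths),
        "stdev": statistics.pstdev(lengths),  # population standard deviation
        "shortest": min(lengths),
        "longest": max(lengths),
    }
```

Two books with the same average could look very different here: a writer who drops in a three-word sentence every paragraph would show a much larger `stdev` than one who writes uniformly mid-length sentences.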

Steven, this is awesome! I hadn't noticed this on Amazon until you pointed it out.
I'm sure there's got to be some correlation with sales here. I wonder what the median and range are for something like the NYT bestseller lists. Just as each author has their personal sweet spot, surely book buyers as a whole and within their niches have their own.
Beyond that, I'm curious about re-readability, which of course is tougher to measure. Though I got a few hours' enjoyment the first time around, I don't think I'll ever go back to read a Godin or Gladwell book. Ever. But for books in the Johnson/Pinker/Hitchens/et al range and beyond, I'm nearly certain I will.
Posted by: Mark Larson | October 21, 2007 at 06:21 PM
Another cool toy to play with on Amazon pages is the Concordance feature - it lists the most common words (aside from "the" "and" etc.) in tag cloud format. There's nothing quite like seeing a book you spent years writing boiled down to 100 key words, but it's an interesting interface into the text.
Posted by: Jason Mittell | October 21, 2007 at 07:26 PM
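A concordance of the kind described in the comment above can be roughed out in a few lines of Python. This is only a sketch of the general idea (count words, skip the very common ones, keep the top 100); the tiny `STOPWORDS` set and the `concordance` function name are placeholders, and how Amazon actually builds its list isn't stated here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "it", "is", "that"}

def concordance(text, top_n=100):
    """Return the top_n most frequent non-stopword words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)
```

Running it on two books by the same author and comparing the resulting lists is a quick way to see how much vocabulary they share.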
Interesting tool! Tolstoy, Dostoyevsky, Austen all write at about 18-20 words per sentence, with words usually around 1.5 syllables. But then I thought, what about Hemingway? Surely his terse style would yield a different outcome. And yet it didn't. After a cursory search, I find Hemingway's numbers are the same as these other writers.
So what's being lost in the aggregation of this data?
Posted by: Michael Patrick Gibson | October 22, 2007 at 06:26 AM
Interesting stuff, Steven. To my baseball-addled mind, the short-sentence trick works just like a change-up for a pitcher: just when you think you've got him all figured out and can take him for granted, pow!, he changes the rhythm.
Re Michael's comment, "Surely his terse style would yield a different outcome." I once had a professor who walked our writing class through an analysis of Hemingway's sentence length. The surprising fact was that Hemingway *often* used sentences that were quite long -- above 50 words, even above 80 words. But they would still be in his terse style, as they would often be comprised of several short independent clauses joined by "and." It makes some intuitive sense when you remember the opening sentence of ~The Old Man and the Sea~, which includes three independent clauses in less than 30 words, but Hemingway used even bigger run-ons of sentencelets (?) in some of his earlier work.
Posted by: Tim Walker | October 22, 2007 at 07:28 AM
I emend my Hemingway stat. The books I looked at were Farewell to Arms and For Whom The Bell Tolls. So to broaden the search, I went to his complete edition for short stories. (I figured since it collects from across his career, it'd be fairly representative.) On this, he's even terser than I thought:
Syllables per Word: 1.3
Words per Sentence: 10.3
Posted by: Michael Patrick Gibson | October 22, 2007 at 09:03 AM
"you often hear people complain about the impenetrable jargon of critical theory, but it looks here like the sentence length is at least as much of a culprit."
This may be because people tend to read one sentence at a time. It seems natural to pause at the end of a sentence and think "what did that just say?", but not so natural to pause mid-sentence. So in text composed of long sentences, one is more likely to get lost -- a 50-word sentence is more likely to throw someone off the track than two 25-word sentences.
Posted by: Isabel Lugo | October 22, 2007 at 10:59 AM
Great stuff Steven. Isn't this related to what you wrote in Interface Culture on Apple's V-Twin search tool? If I remember it correctly V-Twin picked words with more than six letters, or maybe seven, in all the documents on a computer. This enabled comparisons of documents and V-Twin could tell which ones were "related", the ones containing the same long words.
Posted by: Rikard Linde | October 22, 2007 at 12:59 PM
I'm surprised that Hannah Arendt doesn't score higher. The Portable HA only hits 17% and 28.6, while her personal correspondence scored lower (12%, 16.8, though possibly mixed with others' writing). One should always keep Mark Twain in mind when reading anything written by someone with a Germanic background.
http://ccat.sas.upenn.edu/jod/texts/twain.german.html
Posted by: Eric H | October 22, 2007 at 06:33 PM
MPG,
One thing that is probably being lost in Hemingway's case is that he practiced two rather different styles. He was terse at times, but as Tim notes, he could also tease a sentence along when he wanted to. Hemingway was a more bi-modal writer than most. Another thing, lost not in aggregation but rather in translation, is Tolstoy's and Dostoyevsky's syllable count. In the original, they may have very different syllable counts because of the nature of the language, or because of the style of the translator. The same may be true of sentence length.
Posted by: kharris | October 23, 2007 at 06:23 AM
If you want to test your own writing (without getting a book published and sold on Amazon), you can do it within Microsoft Word (2000 or later, I think).
1. Under the Tools menu, choose Options.
2. On the Spelling and Grammar tab, check the boxes toward the bottom for "Check Grammar" and "Show readability statistics".
3. Click OK.
4. Under Tools, choose "Check Spelling and Grammar".
5. Click through all of the grammar mistakes that Word found in your document.
6. When it is done checking your grammar, Word will display readability statistics. It includes Words Per Sentence, but not syllables per word (although it does contain the Flesch Reading Ease score and Flesch-Kincaid score, which are partially based on both of these metrics).
Posted by: Drew | October 24, 2007 at 06:00 AM
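For reference, the two Flesch scores mentioned in the comment above are simple functions of exactly the two averages discussed in this thread. The sketch below uses the standard published formulas; the function names are made up for illustration, and whether Word computes the underlying syllable counts the same way is not something this post establishes.

```python
def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

Plugging in the Hemingway short-story figures quoted earlier in the comments (10.3 words per sentence, 1.3 syllables per word) gives a Reading Ease of roughly 86 and a grade level of about 3.8. Note that syllables per word carries far more weight in both formulas than sentence length does.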
There are also more "industrial strength" tools developed by computer linguists for this kind of thing:
http://portal.tapor.ca/portal/portal
Best, Max
Posted by: Max | October 24, 2007 at 12:40 PM
Hey, I won!
What do I get?
Posted by: Seth Godin | October 25, 2007 at 03:55 AM
So that's why more people read Seth's blog than mine! I used to just think he was smarter. ;-)
Posted by: Tim Peter | October 25, 2007 at 09:08 AM
Perhaps this is why a Seth Godin read always feels 'fresh', and never labored. The thought of reading his books and blog posts alike never brings with it a hesitation that "this is going to be draining", or "written solely for the sake of writing".
RE Hemingway: This is where the stats can be misleading. Taking averages belies an author like Hemingway. While he may in fact have many long sentences, his 'voice' is clearly established by the short, redundant statements that often followed them.
This was his mastery; to SAY a lot in the lengthy sentences, capped by a brief repeat of the slice of the previous prose. Which was how and what he wanted the reader to remember from each page.
Now then, where's that book with the long title? Oh yes, The Dip.
Thanks for your thoughtful post.
Posted by: Ed (NextInstinct) | October 25, 2007 at 09:22 AM
Here's something a little scary: John Henry Cardinal Newman's Essay on the Development of Christian Doctrine sits right on top of... one of Christopher Hitchens' books, at 17% complex words and 31.8 words per sentence.
No, it's not his book on atheism.
Posted by: Joe Marier | October 25, 2007 at 10:06 AM
Ooh; so nice to be reminded of this *before* I send my tome off to the printers . . . although perhaps I should just change my last name to something beginning with 'G'
I wonder how much web-style has affected this trend? I do know that my blog posts tend to be punchy; the stuff intended for dead-tree versions tends to be, well, less punchy.
Posted by: Joel D Canfield | October 25, 2007 at 11:44 AM
Back in the late 80's there was an application, Corporate Voice, that let you feed it writing samples, and it would tell you how close you were to those samples. It worked. Unfortunately, the application was not a success in the market.
Posted by: David Locke | October 25, 2007 at 01:21 PM
Short is good. It's almost hard to leave that sentence with so few words...
Posted by: Levi | October 25, 2007 at 01:25 PM
Try Steinbeck. He is said to have experimented with his writing style.
Posted by: David Locke | October 25, 2007 at 01:26 PM
Very interesting stuff, and a fantastic feature for Amazon to add - I'm sure they've been using it internally for ages.
The concordance thing is also interesting. I compared a few of Bill Bryson's travel books and the top 100 words are almost identical per book - I guess if you hit upon a winning formula then stick with it:
Book A: http://tinyurl.com/2n4zce
Book B: http://tinyurl.com/2obmb3
I wonder what further information could be gleaned from this. It certainly bodes well for the essays I currently have to write (my average words/sentence is 18.9).
Posted by: Joe | October 25, 2007 at 01:43 PM
Thanks for the tip. I'm doing more and more writing, which I'm glad for. I appreciate as many "rails" as I can get to help guide the process.
I'll keep your post for future reference.
-Andrew
Posted by: Andrew Robinson | October 25, 2007 at 04:18 PM
Pretty impressive study. If an author could match short sentences (aiming for high-selling books) with good content (to benefit the readers), then he/she would accomplish the perfect formula. Nice post.
Posted by: Leonardo Kuba | October 25, 2007 at 06:56 PM
I wonder how this translates to "hear-ability" as I really like Godin and Gladwell audiobooks when read by the author, but have a harder time reading them on the printed page.
Posted by: Tara Jacobsen | October 25, 2007 at 07:27 PM
We just had a demo at work last week of a tech writing plug-in for our docs that checks sentence length, among other things. It flags any sentence longer than 26 words as too long to be fully comprehended by our audience (which may include non-English readers).
When we ran this on our existing docs, practically every 2nd sentence was too long by these standards.
Posted by: Bryan | October 26, 2007 at 11:27 AM
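The 26-word rule described in the comment above is straightforward to imitate. The following is a hedged sketch of that kind of check, not the plug-in itself (which isn't named here): the `flag_long_sentences` function, its whitespace-after-punctuation sentence splitter, and the `MAX_WORDS` constant are all assumptions for illustration.

```python
import re

MAX_WORDS = 26  # threshold quoted in the comment above

def flag_long_sentences(text, max_words=MAX_WORDS):
    """Return (word_count, sentence) pairs for sentences over the limit."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    flagged = []
    for sentence in sentences:
        n_words = len(re.findall(r"[A-Za-z0-9']+", sentence))
        if n_words > max_words:
            flagged.append((n_words, sentence))
    return flagged
```

A sentence of exactly 26 words passes; one of 27 is flagged. Lowering `max_words` makes the check stricter.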
I'm just wondering what the graph would have looked like if you had plotted Gayatri Chakravorty Spivak on it...
Posted by: dhamini | October 27, 2007 at 06:37 AM