Tool For Thought
This week's edition of the Times Book Review features an essay that I wrote about the research system I've used for the past few years: a tool for exploring the couple thousand notes and quotations that I've assembled over the past decade -- along with the text of finished essays and books. I suspect there will be a number of you curious about the technical details, so I've put together a little overview here, along with some specific observations. For starters, though, go read the essay and then come back once you've got an overview.
The software I use now is called DevonThink, and I'm sorry to report that it is only available for Mac OS X. (I know there are a number of advanced search tools available for Windows, so I'm sure most of what I describe here could be reproduced -- I just don't know enough about the search tools on that platform to recommend anything.)
I talked in the Times essay about using the tool as a springboard for new ideas and inspiration. Here's what that process looks like in practice. This is the window that shows me an overview of part of my "research library" in DevonThink:

These are all books from which I have transcribed passages over the past 10 years or so -- the little number in parentheses after each title shows how many quotes I have from that book. Oftentimes I'll start the exploration with a straightforward keyword search, in this case: "urban ecosystem." I plug that in, and get back one result, a short quote from Manuel DeLanda's excellent A Thousand Years of Nonlinear History.

This is where it gets interesting. I take that quote, and click on the "see also" button, which generates an instant list of other documents or quotes that have some semantic connection to the original one. I can see a few words from the entry, along with the author and book title.

I find another, more elaborate quote from DeLanda in that bunch:

And then I perform a "see also" on that quote. I get back a few pointers to essays that I've actually written -- and completely forgotten about -- including a review of an E.O. Wilson book on biodiversity that I wrote about three years ago. Ultimately, I end up with this wonderful quote from Jane Jacobs that draws an explicit analogy between natural and man-made ecosystems. The whole process takes me no more than a minute.

Over the past few years of working with this approach, I've learned a few key principles. The system works for three reasons:
1) The DevonThink software does a great job at making semantic connections between documents based on word frequency.
2) I have pre-filtered the results by selecting quotes that interest me, and by archiving my own prose. The signal-to-noise ratio is so high because I've eliminated 99% of the noise on my own.
3) Most of the entries are in a sweet spot where length is concerned: between 50 and 500 words. If I had whole eBooks in there, instead of little clips of text, the tool would be useless.
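To make the first point a little more concrete, here is a minimal sketch of how a "see also" list could be built from word frequency alone. It is not DevonThink's actual algorithm, which I can't speak to; it assumes the simplest possible approach, comparing raw word counts with cosine similarity. The names word_counts, cosine_similarity and see_also, and the tiny sample library, are all made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def see_also(entry, library, top_n=3):
    """Return the titles in `library` most similar to `entry`, best first.

    Entries that share no words at all with `entry` are left out.
    """
    target = word_counts(entry)
    scored = [(cosine_similarity(target, word_counts(text)), title)
              for title, text in library.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


# Hypothetical three-entry library, just to exercise the functions.
library = {
    "cities": "Cities behave like ecosystems, with flows of energy and matter.",
    "reefs": "A coral reef is an ecosystem built on flows of energy.",
    "cooking": "Whisk the eggs, then fold in the flour.",
}
print(see_also("urban ecosystems and flows of energy", library))
```

A real tool would almost certainly weight rare words more heavily than common ones like "of" and "and"; raw counts are used here only to keep the sketch short.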
I think #3 is the point that needs to be drilled home to people working on desktop search. It's been hidden from us largely because the web itself is broken up into pages that are often in that 500 word sweet spot. Think about the difference between Google and Google Desktop: Google gives you URLs in return for your search request; Google Desktop gives you files (and email messages or web pages where appropriate.) On the web, a URL is an appropriate search result because it's generally the right scale: a single web page generally doesn't include that much information (and of course a blog post even less.) So the page Google serves up is often very tightly focused on the information you're looking for.
But files are a different matter. Think of all the documents you have on your machine that are longer than a thousand words: business plans, articles, ebooks, pdfs of product manuals, research notes, etc. When you're making an exploratory search through that information, you're not looking for the files that include the keywords you've identified; you're looking for specific sections of text -- sometimes just a paragraph -- that relate to the general theme of the search query. If I do a Google Desktop search for "Richard Dawkins" I'll get dozens of documents back, but then I have to go through and find all the sections inside those documents that are relevant to Dawkins, which saves me almost no time.
So the proper unit for this kind of exploratory, semantic search is not the file, but rather something else, something I don't quite have a word for: a chunk or cluster of text, something close to those little quotes that I've assembled in DevonThink. If I have an eBook of Manuel DeLanda's on my hard drive, and I search for "urban ecosystem" I don't want the software to tell me that an entire book is related to my query. I want the software to tell me that these five separate paragraphs from this book are relevant. Until the tools can break out those smaller units on their own, I'll still be assembling my research library by hand in DevonThink.
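As a rough illustration of what a paragraph-level result might look like, here is a small sketch that takes the full text of a book and returns the handful of paragraphs that mention the query words, rather than the book as a whole. It assumes plain text with blank lines between paragraphs and uses simple keyword counting, which is much cruder than real semantic matching. The function names split_paragraphs and relevant_paragraphs are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def relevant_paragraphs(book_text, query, top_n=5):
    """Return up to top_n (index, paragraph) pairs that mention query words,
    ordered by how many query-word occurrences each paragraph contains."""
    query_words = set(re.findall(r"[a-z']+", query.lower()))
    scored = []
    for index, paragraph in enumerate(split_paragraphs(book_text)):
        words = re.findall(r"[a-z']+", paragraph.lower())
        hits = sum(1 for w in words if w in query_words)
        if hits:
            scored.append((hits, index, paragraph))
    # Most hits first; ties keep the order they appear in the book.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(index, paragraph) for hits, index, paragraph in scored[:top_n]]


# Made-up four-paragraph "book" for demonstration.
book = (
    "An introduction about rivers.\n\n"
    "The urban ecosystem recycles matter.\n\n"
    "A note on cooking.\n\n"
    "Every ecosystem, urban or wild, needs energy."
)
for index, paragraph in relevant_paragraphs(book, "urban ecosystem"):
    print(index, paragraph)
```

The point of returning the paragraph index along with the text is that the software could then jump straight to that spot in the book instead of just naming the file.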
I wonder whether it might be possible to have software create those smaller clippings on its own: you'd feed the program an entire e-book, and it would break it up into 200-1000 word chunks of text, based on word frequency and other cues (chapter or section breaks perhaps.) Already DevonThink can take a large collection of documents and group them into categories based on word use, so theoretically you could do the same kind of auto-classification within a document. It still wouldn't have the pre-filtered property of my curated quotations, but it would make it far more productive to just dump a whole eBook into my digital research library.
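Here is one very simple way such automatic clipping could work, offered as a sketch rather than a finished design. It ignores word frequency entirely and relies only on paragraph breaks and word counts: whole paragraphs are packed together until a chunk reaches the lower bound, and a chunk is closed early if the next paragraph would push it past the upper bound. The function name chunk_text and its min_words and max_words parameters are assumptions of mine, with defaults taken from the 200-1000 word range mentioned above.

```python
import re


def chunk_text(text, min_words=200, max_words=1000):
    """Greedily pack whole paragraphs into chunks of roughly
    min_words..max_words words each.

    Limitations: a single paragraph longer than max_words is kept whole,
    and the final chunk may be shorter than min_words.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        length = len(paragraph.split())
        # Close the current chunk if adding this paragraph would overshoot.
        if current and current_len + length > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += length
        # Close the chunk once it is long enough to stand on its own.
        if current_len >= min_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Stand-in for an e-book: five paragraphs of 120 words each.
paragraph = " ".join(["word"] * 120)
book = "\n\n".join([paragraph] * 5)
for chunk in chunk_text(book):
    print(len(chunk.split()), "words")
```

A smarter version would look for shifts in vocabulary or chapter headings before deciding where to cut, which is closer to what I have in mind, but even this naive packing keeps the clippings near the sweet spot.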
The other thing that would be fascinating would be to open up these personal libraries to the external world. That would be a lovely combination of old-fashioned book-based wisdom, advanced semantic search technology, and the personality-driven filters that we've come to enjoy in the blogosphere. I can imagine someone sitting down to write an article about complexity theory and the web, and saying, "I bet Johnson's got some good material on this in his 'library.'" (You wouldn't be able to pull down the entire database, just query it, so there wouldn't be any potential for intellectual property abuse.) I can imagine saying to myself: "I have to write this essay on taxonomies, so I'd better sift through Weinberger's library, and that chapter about power laws won't be complete without a visit to Shirky's database."
These extra features would be wonderful, but the truth is I'm thrilled to have the software work as well as it does in its existing form. I've been fantasizing about precisely this kind of tool for nearly twenty years now, ever since I lost an entire semester building a Hypercard-based app for storing my notes during my sophomore year of college. There's a longstanding assumption that the modern, web-enabled PC is the realization of the Memex, but if you go back and look at Bush's essay, he was describing something more specific -- a personal research tool that would learn as you interacted with it. That's what I think about whenever I use this system to stumble across a genuinely useful new idea: finally, I have a Memex!
I'm trying to replicate your system. Do you name the individual entries with the text of the quote?
Also, can you go a bit into your quote-harvesting process? Do you input as you read, or ...?
Thanks.
Posted by: Pedro | January 29, 2005 at 03:43 AM
I think plain old paragraphs fit your #3 requirement pretty well. They're units of text whose size is usually on the smaller side of the 50-500 "sweet spot", and almost always carry enough information to be somewhat self-contained in relation to the text around them.
Each file type usually has a specific way of defining paragraphs, and even in plain text there are a few common strategies most people use, such as keeping a blank line between two paragraphs, or preceding each one with a tab or a few spaces. For this reason, making a program to fetch paragraphs from a document wouldn't be too hard.
Posted by: Bira | January 29, 2005 at 04:04 AM
I use DevonThink for a similar purpose, and I love it. Although I should mention that in my browser I can't actually see your screenshots (?)
Also:
something I don't quite have a word for: a chunk or cluster of text
Have you considered using "lexia" as the word you're looking for?
Posted by: Jeremy Bushnell | January 29, 2005 at 04:26 AM
I would like to see something akin to this for images. Any ideas?
Posted by: ed | January 29, 2005 at 04:29 AM
One small, mildly off-topic request: would you mind changing the images in this post to be in PNG or JPEG format? Neither Firefox nor IE on Windows seems to be able to load them.
Posted by: Evan DiBiase | January 29, 2005 at 05:01 AM
Sorry about the images -- could have sworn they were jpegs before. They should be viewable now.
As for how I capture the quotes themselves, I have long used an advanced piece of software called a "research assistant" to type in passages that I've marked. I just started experimenting with scanning and OCR'ing in though, which seems to work fairly well...
Posted by: Steven Johnson | January 29, 2005 at 05:16 AM
Very interesting -- thanks for sharing.
As for the entire book vs. quote -- I use a program on Windows called DTSearch which is basically a full-text search program on steroids.
One of the things it can do is show the results in context and use fuzzy searches, proximity settings, etc. so if I search for "concept X", rather than saying "oh, it's somewhere in this e-book here" it will show the relevant parts of the book that match the search.
Still a long way from being perfect and it can't do some of the things it looks like you're doing with DevonThink, but works pretty well.
I've looked at a lot of this stuff on Mac and Wintel, and it's kind of odd just how primitive the tools are on either OS for this sort of thing. If you'd asked me in the mid-1990s, I'd have assumed that organizing and searching free-form info would have progressed a lot farther than it has.
Posted by: Brian Carnell | January 29, 2005 at 06:10 AM
This Devonthink app seems a lot like the new Spotlight feature in the upcoming Mac OS 10.4 Tiger. What sorts of features does Devonthink offer that Spotlight won't?
(As in, why should I buy Devonthink instead of waiting to upgrade to Tiger?)
Posted by: Tarek | January 29, 2005 at 07:03 AM
Can someone recommend an equivalent to DevonThink for Windows? I don't even know how to do a google search for the software because I don't know what it is called in the general sense.
Posted by: Halfer | January 29, 2005 at 08:05 AM
DevonThink vs Spotlight:
http://www.devon-technologies.com/products/devonthink/background/spotlight.php
Posted by: Matthew Amster-Burton | January 29, 2005 at 09:08 AM
When and where will your piece on London sewers appear - sounds interesting (for a civil engineer like me anyway).
I'll second the request above for the names of Windows programs equivalent to Devon. Shouldn't all Devon's competitors be deluging you with emails after your article?
Posted by: Ethan | January 29, 2005 at 11:43 AM
I'm very interested in mindhandling software and I am thus very glad about your post about DevonThink. Right now I'm testing it and will most probably buy it.
Please keep us informed about thinking and expression tools, such as DevonThink, Ulysses and others.
Cheers, Stefan
Posted by: Stefan Herzog | January 29, 2005 at 12:19 PM
Sorry -- comments were down for a few hours. Should be back up now.
Posted by: Steven | January 30, 2005 at 02:21 AM
Suddenly, DevonThink makes sense. As a returning student after many, many years away, I'm trying to find how to take best advantage of the technology which simply didn't exist before. DevonThink is a tool I've downloaded and tried, and never really had it click. It's clicking now.
What's problematic, however, is that now I've got one more tool which does one thing and that's it. Sure, I could compose in DT, but it's not its strength. So I compose in one location, save my research in DT, and my bibliographic info in EndNote (which I might drop for Sente or Bookends anyway). I suppose three tools isn't that bad, now that I think about it.
Posted by: Jeffrey | January 30, 2005 at 02:39 AM
If you use Windows, check out www.asksam.com
Posted by: Adnan | January 30, 2005 at 02:51 AM
Questia (an online library of ebooks) can do semantic searches, except that it is handicapped by the fact that you're searching through whole e-books, even though it lets you search inside the book.
(www.questia.com)
Posted by: Adnan | January 30, 2005 at 02:59 AM
I tried DevonThink some months ago. I initially liked it but then stopped using it, as it lacks multilingual capabilities. I usually store quotes or chunks of text in the language they are written in, and that approach unfortunately prevents DevonThink from doing its magic. I'm still looking for a piece of software with that capability.
Posted by: Ricardo Montiel | January 30, 2005 at 03:47 AM
I'd add that the useful chunk size online is often not the URL of a main page or an index but a permalink pointing to a specific, often brief, entry in a weblog.
btw, SBJ, ever experiment with Voodoo Pad?
Posted by: xian | January 30, 2005 at 05:01 AM
Steven - this is a poignant post about search. We just completed a book titled "Lucene in Action" and I built a "search inside" the book website for it. The search results are granular to book sections, not pages. I am also capturing, but not yet exposing, each page of a section in order to have better information displayed. I've also linked a blog into the table of contents page - so I can add commentary/errata after the fact to a book section. I will be building in "see related" types of connections that are not made explicit.
I'd be grateful if you would review what I've built and offer suggestions to further enhance this type of thing. I have not yet considered handling multiple books, but our publisher is definitely interested in adopting the system I've built and these types of inter-book connections would be a great thing to have.
Posted by: Erik Hatcher | January 30, 2005 at 07:57 AM
A much simpler (and of course less powerful) program for writers to keep track of notes of any kind (I use it for quotes) is Notational Velocity. It's free and is OS X only. It's my most used app. You can get it here: http://pubweb.nwu.edu/~zps869/nv.html
Posted by: dobbs | January 30, 2005 at 08:33 AM
Steven:
You said: "I wonder whether it might be possible to have software create those smaller clippings on its own"
I have two possible solutions you could investigate:
1) Book2Pod is free, and converts etext into iPod-notes sized chunks -- each chunk is about 4K big, which works out to about 680 words - a bit higher than the sweet spot, but maybe not so bad. http://www.tomsci.com/book2pod/
2) The O'Reilly network published a 3 parter on how to build an eDoc reader for the iPod here: http://www.macdevcenter.com/pub/a/mac/2004/12/14/ipod_reader.html I think they have the finished software available for download, but, since they give you the source, you can probably hack it to generate notes much smaller than 4K (ie: somewhere in the 50-500 word zone)
Both of these are free. The second one is interesting because it can format text from pdf's into iPod sized notes.
Anyway: Thanks for sharing DevonThink with us. I've seen it before, but I think I'll go have a closer look at it in light of what you just wrote.
Posted by: Robert Hahn | January 30, 2005 at 08:46 AM
Hi Steven, Glad to come across your article in the Times and your site. I'm really curious how you digitize/save all your quotes from other sources. Are they Word documents, emails to self, some kind of database? I'm doing the same but am pretty haphazard about it and would love to hear your method. Thanks.
Posted by: Larry Straus | January 30, 2005 at 09:37 AM
For almost a decade from ~1988 I kept my reading & research commonplace book in Persoft's IZE, a DOS textbase -- orphaned all too soon -- that did simple but very useful things with keywords presented in an indented hierarchy. The more entries and keywords I gave it, the more the hierarchies took on increasingly interesting and suggestive sequences; i.e. they looked more like *outlines.* IZE seemed to understand the content of the passages.
I knew perfectly well that appearance was "just" a reflection of my choices of keywords -- an embodiment of how I used and related words -- but it felt uncanny all the same.
Norretranders quotes Kline quotes Hertz on Maxwell's equations: "One cannot escape the feeling that these equations have an existence and an intelligence of their own, that they are wiser than we are, wiser even than their discoverers, that we get more out of them than was originally put into them."
Posted by: Monte Davis | January 30, 2005 at 09:44 AM
"One of the new applications that came out last year was Google Desktop -- using the search engine's tools to filter through your personal files." Loading this Google software into at least a Windows machine opens a back door to the computer. Anyone can open this door and walk into your computer.
Posted by: Karen | January 30, 2005 at 10:00 AM
And, of course, ten minutes later I trip over New Scientist on semantic search for Google... now Slashdotted...
http://www.newscientist.com/article.ns?id=dn6924
Posted by: Monte Davis | January 30, 2005 at 10:12 AM