This week's edition of the Times Book Review features an essay that I wrote about the research system I've used for the past few years: a tool for exploring the couple thousand notes and quotations that I've assembled over the past decade -- along with the text of finished essays and books. I suspect there will be a number of you curious about the technical details, so I've put together a little overview here, along with some specific observations. For starters, though, go read the essay and then come back once you've got an overview.
The software I use now is called DevonThink, and I'm sorry to report that it is only available for Mac OS X. (I know there are a number of advanced search tools available for Windows, so I'm sure most of what I describe here could be reproduced -- I just don't know enough about the search tools on that platform to recommend anything.)
I talked in the Times essay about using the tool as a springboard for new ideas and inspiration. Here's what that process looks like in practice. This is the window that shows me an overview of part of my "research library" in DevonThink:

These are all books that I have transcribed digital passages from over the past 10 years or so -- you can see how many quotes for each book in the little number in parentheses after each title. Oftentimes I'll start the exploration with a straightforward keyword search, in this case: "urban ecosystem." I plug that in, and get back one result, a short quote from Manuel DeLanda's excellent A Thousand Years of Nonlinear History.

This is where it gets interesting. I take that quote, and click on the "see also" button, which generates an instant list of other documents or quotes that have some semantic connection to the original one. I can see a few words from the entry, along with the author and book title.

I find another, more elaborate quote from DeLanda in that bunch:

And then I perform a "see also" on that quote. I get back a few pointers to essays that I've actually written -- and completely forgotten about -- including a review of an E.O. Wilson book on biodiversity that I wrote about three years ago. Ultimately, I end up with this wonderful quote from Jane Jacobs that draws an explicit analogy between natural and man-made ecosystems. The whole process takes me no more than a minute.

Over the past few years of working with this approach, I've learned a few key principles. The system works for three reasons:
1) The DevonThink software does a great job at making semantic connections between documents based on word frequency.
2) I have pre-filtered the results by selecting quotes that interest me, and by archiving my own prose. The signal-to-noise ratio is so high because I've eliminated 99% of the noise on my own.
3) Most of the entries are in a sweet spot where length is concerned: between 50 and 500 words. If I had whole eBooks in there, instead of little clips of text, the tool would be useless.
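One way to picture the word-frequency matching in the first point is to compare notes by the cosine similarity of their word counts. This is only a sketch under assumptions: DevonThink's actual algorithm isn't described here, and the function names, the tiny stopword list, and the three-note library are all invented for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, just enough for the demo below.
STOPWORDS = {"and", "the", "with", "like", "both"}


def word_counts(text):
    """Count each lowercase word of three or more letters, skipping stopwords."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def see_also(query_title, library, limit=3):
    """Rank the other notes in `library` by similarity to the note named `query_title`."""
    query = word_counts(library[query_title])
    scored = [
        (cosine_similarity(query, word_counts(text)), title)
        for title, text in library.items()
        if title != query_title
    ]
    scored.sort(reverse=True)
    # Drop notes that share no words at all with the query note.
    return [title for score, title in scored[:limit] if score > 0]


# Hypothetical three-note library, invented for illustration.
library = {
    "delanda": "cities behave like an ecosystem with flows of energy and matter",
    "jacobs": "a city ecosystem and a natural ecosystem both depend on diversity",
    "recipe": "whisk the eggs and fold in the flour",
}
print(see_also("delanda", library))
```

With short notes, a single shared word like "ecosystem" carries a lot of weight in the score, which is consistent with the point about entry length below.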
I think #3 is the point that needs to be drilled home to people working on desktop search. It's been hidden from us largely because the web itself is broken up into pages that are often in that 500 word sweet spot. Think about the difference between Google and Google Desktop: Google gives you URLs in return for your search request; Google Desktop gives you files (and email messages or web pages where appropriate.) On the web, a URL is an appropriate search result because it's generally the right scale: a single web page generally doesn't include that much information (and of course a blog post even less.) So the page Google serves up is often very tightly focused on the information you're looking for.
But files are a different matter. Think of all the documents you have on your machine that are longer than a thousand words: business plans, articles, ebooks, pdfs of product manuals, research notes, etc. When you're making an exploratory search through that information, you're not looking for the files that include the keywords you've identified; you're looking for specific sections of text -- sometimes just a paragraph -- that relate to the general theme of the search query. If I do a Google Desktop search for "Richard Dawkins" I'll get dozens of documents back, but then I have to go through and find all the sections inside those documents that are relevant to Dawkins, which saves me almost no time.
So the proper unit for this kind of exploratory, semantic search is not the file, but rather something else, something I don't quite have a word for: a chunk or cluster of text, something close to those little quotes that I've assembled in DevonThink. If I have an eBook of Manuel DeLanda's on my hard drive, and I search for "urban ecosystem" I don't want the software to tell me that an entire book is related to my query. I want the software to tell me that these five separate paragraphs from this book are relevant. Until the tools can break out those smaller units on their own, I'll still be assembling my research library by hand in DevonThink.
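To make the file-versus-passage distinction concrete, here is a minimal sketch of a search that returns matching paragraphs rather than whole files. It is not how Google Desktop or DevonThink work; the `find_passages` name and the sample documents are hypothetical, and the matching is plain substring matching rather than anything semantic.

```python
import re


def find_passages(documents, query):
    """Return (document name, paragraph) pairs where the paragraph contains every query word.

    Matching is a case-insensitive substring test, so "urban" also matches "suburban".
    """
    terms = query.lower().split()
    hits = []
    for name, text in documents.items():
        # Treat one or more blank lines as a paragraph break.
        for paragraph in re.split(r"\n\s*\n", text):
            lowered = paragraph.lower()
            if terms and all(term in lowered for term in terms):
                hits.append((name, paragraph.strip()))
    return hits


# Hypothetical documents, invented for illustration.
documents = {
    "book.txt": "Rivers shape valleys over time.\n\nAn urban ecosystem recycles energy.",
    "plan.txt": "Quarterly revenue targets.\n\nHiring plan for next year.",
}
for name, paragraph in find_passages(documents, "urban ecosystem"):
    print(name, "->", paragraph)
```

The result points at one paragraph inside `book.txt` instead of at the file as a whole, which is the scale of answer the exploratory search needs.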
I wonder whether it might be possible to have software create those smaller clippings on its own: you'd feed the program an entire e-book, and it would break it up into 200-1000 word chunks of text, based on word frequency and other cues (chapter or section breaks, perhaps). Already DevonThink can take a large collection of documents and group them into categories based on word use, so theoretically you could do the same kind of auto-classification within a document. It still wouldn't have the pre-filtered property of my curated quotations, but it would make it far more productive to just dump a whole eBook into my digital research library.
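A rough sketch of the simplest version of that idea follows. It uses only paragraph breaks as the cue and ignores word frequency entirely, so it is a stand-in for the auto-classification imagined above, not an implementation of it. The `chunk_text` name and its parameters are hypothetical.

```python
import re


def chunk_text(text, min_words=200, max_words=1000):
    """Group consecutive paragraphs into chunks of roughly min_words to max_words words.

    A single paragraph longer than max_words is kept whole, and the final
    chunk may be shorter than min_words.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = []
    current_len = 0
    for paragraph in paragraphs:
        length = len(paragraph.split())
        # Close the current chunk if this paragraph would push it past the ceiling.
        if current and current_len + length > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += length
        # Close the chunk once it is long enough to stand on its own.
        if current_len >= min_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical input: five 60-word paragraphs standing in for an e-book.
paragraph = " ".join(["word"] * 60)
book = "\n\n".join([paragraph] * 5)
for chunk in chunk_text(book, min_words=100, max_words=150):
    print(len(chunk.split()), "words")
```

Splitting only at paragraph boundaries keeps each clipping readable on its own; a fuller version might also watch for shifts in vocabulary between neighboring paragraphs before deciding where to cut.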
The other thing that would be fascinating would be to open up these personal libraries to the external world. That would be a lovely combination of old-fashioned book-based wisdom, advanced semantic search technology, and the personality-driven filters that we've come to enjoy in the blogosphere. I can imagine someone sitting down to write an article about complexity theory and the web, and saying, "I bet Johnson's got some good material on this in his 'library.'" (You wouldn't be able to pull down the entire database, just query it, so there wouldn't be any potential for intellectual property abuse.) I can imagine saying to myself: "I have to write this essay on taxonomies, so I'd better sift through Weinberger's library, and that chapter about power laws won't be complete without a visit to Shirky's database."
These extra features would be wonderful, but the truth is I'm thrilled to have the software work as well as it does in its existing form. I've been fantasizing about precisely this kind of tool for nearly twenty years now, ever since I lost an entire semester building a Hypercard-based app for storing my notes during my sophomore year of college. There's a longstanding assumption that the modern, web-enabled PC is the realization of the Memex, but if you go back and look at Bush's essay, he was describing something more specific -- a personal research tool that would learn as you interacted with it. That's what I think about whenever I use this system to stumble across a genuinely useful new idea: finally, I have a Memex!
Steven, are you familiar with Simpy (my name should link to it)? Simpy currently does for web pages what you described in this post, and the upcoming Simpy release will have support for Notes, which will work _much_ like you described your tool. I hope to make the new release in about a week.
Posted by: Otis | January 30, 2005 at 10:38 AM
Steven,
Thanks for the interesting article and followup here on your web site. I've been using DevonThink on and off for some time now, and you have supplied me with a schema for using it that I was close to and yet, at the same time, far from.
I've noticed that you are simply putting page numbers with your quoted text, but you are not putting the source (since you have them filed in source-specific folders). I've long wondered how much biblio info to put into each note, usually falling on the more-the-better side, because I worry about my note getting dissociated from its folder at some time in the future. There is also the problem of tracing your note back to the containing folder: if I'm reading a note, which source is it from? DevonThink doesn't seem to have any command to track the note back to its containing folder (unless I'm missing something).
This brings up my primary wish for a future revision of DevonThink: the ability to include metadata tags for each note, which in this case could include the full reference, page number, etc. Another program, Tinderbox, handles this metadata beautifully, by putting it in headers at the top of each note, and by making these fields customizable. Of course, that one feature hasn't been enough to make me move out of DevonThink.
Doug
Posted by: Douglas Holschuh | January 30, 2005 at 10:44 AM
Steven,
I've spent the last three years doing doctoral research and development in this space. I was reading over your blog entries and thinking how closely what you've written here matches a lot of what I find in my own notes (I also like De Landa's work, BTW).
I think Topic Maps are the technology you're looking for to enable sharing "personal libraries to the world." The phrase that Steven Newcomb (one of its inventors) uses is "global knowledge interchange." The ISO Topic Map standard is ideally suited for creating a graph-structured, subject-based index of a set of information resources. These Topic Map documents can be merged or federated with others in controlled ways, even maintaining the contexts between who said what.
My own project is called Ceryle, uses a graph visualization of Topic Maps as the primary organizational metaphor, heavily uses Dublin Core metadata, works cross-platform, and I've recently been filling in some of the bibliographic support features as I'm using it to organize my own dissertation. The software is currently in evaluation and will eventually be released into open source. I'd be happy to discuss the project in greater detail if you wish.
Thanks very much for your informative article -- Murray
Posted by: Murray Altheim | January 31, 2005 at 03:12 AM
I have about 200 mb of blog entries I'd like to somehow import into this. I wonder if that's even doable.
Posted by: Mark Crane | January 31, 2005 at 11:44 AM
Great essay in the NYT Book Review; I had no idea such technology existed. Guess I will have to return to the Mac.
Posted by: Larry White | January 31, 2005 at 12:04 PM
Do you still use a reference manager like EndNote as well or is this the only app you use to organize your sources? If you still use a reference manager, does it play nice with DevonThink? Import/export, etc.?
Posted by: Tanya | January 31, 2005 at 12:52 PM
I'd like to know of a product even remotely close on Windows as well. I have AskSam-- like many others it can search quickly through text-- but the next level (the "see also" in the description above) isn't there, and the UI is not particularly good...
Posted by: Chris L | February 01, 2005 at 10:10 AM
Steve, how do you title your quotes/documents? It seems that each title is just an indiscriminate first few phrases of the note. Why not use page numbers?
Posted by: John Beeler | February 01, 2005 at 10:45 AM
Steve, a very good and timely essay in the New York Times Book Review! In addition to the very positive comments, I would add that DEVONthink would seem to be using what is known in the Information Retrieval (IR) and Natural Language Processing (NLP) fields as "collocation of words" that can be defined as:
"The 'collocation' of words refers to the regular patterns of co-occurrence in which words may be found in a given context; the way words are found together. e.g. We expect to see fish with chips, goods with chattels, break with enter, blue with sky. In certain circumstances, indeed, such items would be foregrounded if they did not occur together. When this happens it is 'unusual collocation'".
Thanks,
John
Posted by: John T Kane | February 01, 2005 at 12:13 PM
fascinating! thanks for sharing your tool. i really like the idea of small personal library that people could query for research purposes.
going further, let's say a group of people within the same intellectual domain pool their libraries and open them up for more semantic linking... and so on....
eventually this could potentially lead to a small sampling of what Sir Tim Berners-Lee had been dreaming all along: The Semantic Web.
Posted by: coolmel | February 02, 2005 at 04:27 AM
For the PC, the Orbis part of the NotaBene academic suite would be quite good.
www.notabene.com
Posted by: ReaderFella | February 02, 2005 at 06:26 AM
I've started using a wiki to do similar tasks, and I would love to see this sort of search/linking ability incorporated into it.
Posted by: bs23 | February 03, 2005 at 08:35 AM
Steven Johnson's essay in TNYT resonates with every researcher and writer who seeks connections among words, concepts and especially the fertile relationships that make their ideas come alive, sustaining their work with rich evidence. Ideas feed on ideas, which, in turn, inform the manner in which we build convincing arguments. In the Windows environment, NOTA BENE achieves much of what Steven presents in his essay (www.notabene.com). Except that NOTA BENE is designed as an integrated package of three applications that, together, provide sophisticated hypertext searches which are automatically joined to their respective bibliographic sources and are then served to the word processing application. In addition, all the components of the writer's document, including not only the writing, but also found texts, references and bibliographies, are automatically formatted according to major academic style manuals. Thus, hypertext, bibliographic management and word-processing are combined into a seamless whole.
But to some of Steven's specific points, ORBIS, which is the hypertext application in NOTA BENE, does more than respond to boolean operators as it searches across the user's computer network. One of its more powerful features is the ability to search associated terms. For example, a writer could search for "church" and be presented with passages containing "ecclesiastic," "tithes," "anticlericalism," or "priest," among others. The user brings the bibliographic reference associated with each selected item as it is brought into the writer's evolving document. The bibliographic reference then takes on the academic style for that document. In my many years of using NOTA BENE, I have not ceased to marvel at discovering relationships that are neither obvious nor likely to be remembered in our cluttered minds and that are found in years' worth of accumulated notes and texts scanned into the computer. But then again, I also marvel at the masses who actually believe that merely by typing into their computers they have endowed their writing with a dynamic, living personality when all they've done is, well, type.
Mark
Posted by: Mark Szuchman | February 05, 2005 at 01:09 AM
Two tools may be of interest:
1. MS Word's indexing function is very flexible for capturing connections you want to be reminded of, including your own added comments that don't appear in the body of written material being indexed and which aren't indexed by automatic methods. Such indexes are easy to skim, scan, search, and change.
2. FLIPP is a way to clarify explanation of how to use complex systems by putting content information in non-symbolic, non-verbal visual frameworks that look like game boards. Instead of describing rules of complex logic, it displays all at once all the scenarios that make any kind of sense for a given complex system. Users simply select the scenario that fits their situation and meets their objectives then follow it to conclusion. Several things make it remarkably friendly: all logical connections in the scenarios are shown without words, symbols, icons, or spaghetti connecting lines. The number of text explanation pages is reduced typically by 90%. User preference has been universal. "Logic revealed to a degree beyond belief." Translation among any languages is vastly simpler because the logic -- the gameboard formats -- remain unchanged across all languages, even those written right-to-left. Computers, while convenient, aren't required.
The method is now in the public domain freely available to all and demonstrated at http://www.flipp-explainers.org
Posted by: David Cox | February 05, 2005 at 09:30 AM
A related and useful tool I've found is called Furl (www.furl.net). It saves the actual content of a webpage you've visited to your free profile on the Furl server (like Google's cache). It also will save your comments and category description if you choose to enter them. You can save webpages with one click as you surf and amass a searchable history of your most interesting finds.
It also has a public feature that allows you to search other people's collections.
Posted by: Sarah | February 17, 2005 at 02:26 AM
Hi Steven,
In the associative thinking space (is there such a space?), one of the more notable programs is called IdeaFisher, found at http://www.ideafishing.com.
Marsh Fisher, the co-founder of Century 21 Real Estate, discovered that it's the association between disparate words and ideas that creates the most valuable end products. He used "real estate" and "franchise" to arrive at the company that he took public. But in the last 5 years of his mentoring, he's shown me a much bigger world through the EXPANSION of associations (seeing how far from the root you can wander) to the DRILLING-DOWN on specific concepts (through specific questions posed by seasoned "experts").
It's a fascinating area of study, and his software does a pretty good job of both those operations.
The current versions run on XP and Mac Classic, but the company is releasing a version for OSX and upcoming Microsoft OS. There's a blog of screen shots and descriptions at http://www.ideafisher-upgrade.com.
So... here's the API issue: who can make a SAFE, easy to use system that does what DEVONthink does (searching and organizing associative content) with a search-engine that can drill-down on that content, and works seamlessly between the web and the desktop... but also keeps results in enough of a linear form that users don't get lost on tangents when they are in "brainstorming" mode, but is free-form enough to allow the interface to not get in the way of the user experience?
I'll try DEVONthink, and see how this relates. For desktop content, the Mac's own integrated search app is pretty slick, but the results are not persistent... and Google still works slicker when it comes to web-centric content...
It was really enlightening to run across your NY Times article, and I look forward to seeing how you expand on this concept.
Best,
ME
Posted by: eAgent | February 17, 2005 at 07:17 AM
An historical note: Index Cards, Clean Copies, and Research Assistants - from Jerry Monaco
For years I have been writing on index cards. The precedent was of course Nabokov, who wrote his novels on index cards, but also the chess players I knew in my youth, which was b.c. (i.e. before personal computers). All the great chess players used to remember opening variations, innovations, etc. by keeping vast indexes of their favorite openings on index cards.
I would arrange my index cards by quotes and books and date and potentially each card was cross-indexed. When I stopped writing seriously and only wrote for myself in my journals, I stopped indexing all of my paragraphs and quotes. What I soon realized, after I stopped indexing, is that the availability of cards at my fingertips had also created an accessibility of the thoughts on those cards in my memory.
I assume that computer indexing and access has a similar, though less manual, effect with the added plus of being able to use the computer as a supplementary brain.
The working habit of writers is not a very well understood process. Both Melville and Tolstoy needed help (their wives) to create clean copies of their manuscripts. What typewriters allowed writers to do is create their own clean copies without collaboration. Computers have now allowed us to create indexes of our own thoughts without collaboration also. This used to be the job of research assistants.
Jerry Monaco
His Blog
Shandean Postscripts to Politics and Culture
Posted by: Jerry Monaco | February 21, 2005 at 03:03 AM
"Already Devonthink can take a large collection of documents and group them into categories based on word use, so theoretically you could do the same kind of auto-classification within a document."
Hi Steven,
You may be interested in trying theConcept by Mesa Dynamics (disclaimer: my company) which resolves documents (one or more), or search engine results from Google (and other search engines) into an index of the most significant key words and phrases in the overall text/web page results.
It isn't quite the memex you're looking for, but like DevonThink, the idea is to break up thousands of words into semantic concepts. However, instead of "searching" for information, theConcept builds an index that helps a user understand the prevailing topics in the text. Each concept can then be explored more deeply by looking for citations from specific places where the key words or phrases were discovered.
If you do try it, I'd be happy to answer any questions or respond to comments and/or suggestions.
All the best,
Danny Espinoza
Mesa Dynamics
http://www.mesadynamics.com
Posted by: Danny Espinoza | February 24, 2005 at 12:40 PM