2 December 2005

Allowing chaos

One of my cube neighbors, a new-ish employee, said that he didn't want to keep his desk clean because he did not yet have a clear understanding of the product he was working on. I understood what he meant, and I think it's important. Only after he understands the system can he organize his environment to fit that system.

My note-taking process begins on a small stack of paper-to-be-recycled, white side up, sitting in front of my keyboard. I scribble notes and drawings and UML diagrams as needed. From there, if they're valuable and not just scribbles, I move them to the appropriate location in my development wiki and HTML-ify them with wiki links and external links. Eventually, I may add further notes, link other articles to them, or move them to a better location as I get a better understanding of the domain...

Clay Shirky has an article criticizing the goals of the semantic Web as, to put it mildly, flawed (he may have used the phrase "utter failure"). He argues that the syllogisms we can derive from RDF triples give us far less than what we are already approaching in the messy world of HTML. His primary point is that imposing a strict ontology pushes out the possibility of generalization. There are points in his argument that I disagree with--significantly, he assumes that larger knowledge can't emerge from thousands of pieces of smaller knowledge--but I value his appreciation of vagueness and messiness.

I had recently gotten re-interested in natural language processing, and was thinking about semantic extraction and transformations. What methods are there to process domain-specific documents and convert them into the list of domain topics they contain? For example: you have a collection of resumes and want to tag them with jobs (sales, graphic artist, etc.), experience, and so on. One approach is to build a lexicon of the domain, parse the text of each document into phrases, then transform the deep meaning of the phrases into their respective topics. This requires building and maintaining a rigid ontology that maps concepts to topics. Another approach is to use word clustering against "model" documents that represent the topics. Matching a document against the model documents would determine its similarity to each one, and therefore the likelihood that it belongs to that topic.
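
To make the model-document idea concrete, here is a rough sketch in Python. It is only one way to do it: I'm assuming a plain bag-of-words count and cosine similarity as the matching measure, and the model documents, topic names, and function names (bag_of_words, cosine_similarity, rank_topics) are made up for illustration. A real system would want stemming, stop words, and far larger model documents.

```python
import math
import re
from collections import Counter

# Hypothetical "model" documents: one short text standing in for each topic.
MODEL_DOCUMENTS = {
    "sales": "sales quota customers accounts revenue territory closing deals",
    "graphic artist": "design illustration typography layout logo branding photoshop",
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_topics(document, models=MODEL_DOCUMENTS):
    """Score a document against every model document, best match first."""
    doc_vector = bag_of_words(document)
    scores = {
        topic: cosine_similarity(doc_vector, bag_of_words(model_text))
        for topic, model_text in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    resume = "Managed customer accounts and exceeded sales quota in my territory"
    for topic, score in rank_topics(resume):
        print(f"{topic}: {score:.2f}")
```

Notice that no ontology appears anywhere: the only "structure" is the word overlap between the resume and whichever texts were picked as models.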

This second approach is the one that search engines take, and it is best illustrated in clustering search engines (such as Clusty or Vivisimo). With clustering, the engine dynamically separates results containing different homographs or different semantic domains. "Turkey" the bird is sorted separately from "Turkey" the country, and "Turkey" farming is sorted separately from "Turkey" recipes for Thanksgiving. Examination of the raw text can provide the semantic ontology. Or at least part of it.
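
A toy version of that separation, again in Python, looks something like the following. I have no idea how Clusty or Vivisimo actually do it; this sketch assumes a single greedy pass that groups snippets by Jaccard overlap of their words, with the query word and a few stop words dropped. The stop-word list, the threshold, and the names (content_words, jaccard, cluster_results) are all invented for the example.

```python
import re

# A tiny, hypothetical stop-word list; a real engine would use a much larger one.
STOP_WORDS = {"a", "the", "of", "in", "for", "and", "to", "is", "with"}


def content_words(text, query):
    """Set of words in the text, minus stop words and the query term itself."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - STOP_WORDS - {query.lower()}


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_results(snippets, query, threshold=0.1):
    """Greedy single pass: join the most similar existing cluster or start a new one."""
    clusters = []  # each cluster: {"words": set of words, "members": list of snippets}
    for snippet in snippets:
        words = content_words(snippet, query)
        best = max(clusters, key=lambda c: jaccard(words, c["words"]), default=None)
        if best is not None and jaccard(words, best["words"]) >= threshold:
            best["members"].append(snippet)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "members": [snippet]})
    return [cluster["members"] for cluster in clusters]


if __name__ == "__main__":
    results = [
        "Roast turkey recipes for Thanksgiving dinner",
        "Istanbul is the largest city in Turkey",
        "Thanksgiving turkey recipes with stuffing",
        "Travel guide to Istanbul and Ankara, Turkey",
    ]
    for group in cluster_results(results, "turkey"):
        print(group)
```

The groups are never named or defined ahead of time; they fall out of which words happen to keep each other company in the results.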

All this comes down to the question: when is it beneficial to allow chaos? The choices are either to pre-define a structure that information should fit in, or to allow information to manifest itself and post-define rules that can order that information into a structure.

[ posted by sstrader on 2 December 2005 at 6:54:09 PM in Language & Literature , Programming ]