Tales from the Dark Side - continued
Under the gun for my editorial in IEEE Intelligent Systems, I decided I would continue with the “dark side” theme and see if I could state it coherently in a form that non-PlanetRDF AI folks might understand. I’m not sure if I did, but here’s a preprint of what I submitted (sorry about the length - just stop reading here if you want). It will undergo a couple more rewrites and some editing before it appears, but according to the media the blogosphere values rapidity over style, so here goes:
The Dark Side of the Semantic Web
Intelligent Readers,
I’ve recently started giving a talk with the same name as this column at various locales. My original hope was that the provocative title would catch people’s eyes, and might indeed raise eyebrows, as I’m obviously not known as a critic of Semantic Web technology. On the contrary, I remain a committed tech-evangelist for this important new technology and my first couple of slides make it clear – I’m not using “dark side” as in Darth Vader and the Dark Side of the Force, but rather as in “dark side of the moon.” The point is there’s a lot happening in the Semantic Web space that is exciting and important, but which is coming from the “Web” side, rather than the AI space. As a result, to many AI researchers this is an unknown part of the technology, and thus the “dark side” allusion.
To understand this trend, which some have started to call “Web 3.0” (perhaps unfortunately), it is important to understand the Web not as a linked set of documents, but as a technical construct of protocols, processes, languages and tools that make it all work. While there’s no way to go into all that in this short editorial, we can gain some insight by looking at some of the emerging trends on the Web, and examine how a little bit of AI (and I’ll return and stress that “little bit” part later) can have a big effect.
Learning from “Web 2.0”
One of the problems with the ubiquity and importance of the Web is that it is often hard to tell the marketing from the meat for various Web applications. The trend that came to be known as “Web 2.0” (perhaps unfortunately) started from a fairly specific core of technologies (essentially Web Services and Ajax) but became the name by which almost everything new and exciting on the Web came to be known. Wikipedia, Flickr, the “blogosphere,” social networking sites and YouTube, to name just a few, have been new Web applications associated with this so-called next generation Web. (In a quick aside, it’s worth noting that Tim Berners-Lee, as you can read in his book Weaving the Web, included such applications in his original Web vision, so they might more aptly be considered the realization of the original “Web 1.0,” and not a new generation of technology, but such is the marketing needed to make things happen in Silicon Valley.)
From an AI point of view, the most interesting thing about “Web 2.0″ applications has been the use of tagging technology as a means of associating keywords with non-textual items. Photo and video sites have taken great advantage of this approach, as have social bookmarking approaches (like del.icio.us) and various players in the Wiki and blogging space. “Ahh,” cried the critics, “The Semantic Web is overkill. Folksonomies and social processes are all we need to make this work.” Perhaps they were right, that is, up to a certain point.
And that point is being reached now. In retrospect it seems obvious to many (and to many in the AI community it was obvious from the beginning) that this was a technology that can only scale to a certain level. Here’s a simple thought experiment. Suppose one were to take every photograph in Flickr and tag it with all the text needed to capture every concept in the photo – the “thousand words” that the picture is worth. Then take the tens of millions of these photo documents and ask how you might search for them based on these keywords. Sounds a lot like what Google™ was created for, doesn’t it? So how, one could ask, would these unstructured, undisambiguated, non-semantically aligned keywords create, as many of their advocates claimed, a naturally occurring semantics that would challenge the rule of the keyword-based search engine? Rather, up to a certain size, statistics look great and work well (and clustering was a key to early “Web 2.0” successes), but beyond a certain size, making statistical retrieval work for language is non-trivial (and a mainstay of many of AI’s human-language technology researchers, to whom the claims of the taggers never held water). In short, the taggers are learning one of the recurring themes of AI – that what looks easy in the small is often much harder in the large.
That said, however, I must admit that those of us pushing AI on the Web also had a lot to learn from “Web 2.0.” Clay Shirky, for example, who has been wrong in almost every one of his criticisms of the Semantic Web, got one thing right – the realization that the social aspects of these new Web applications were critical to their success. The need to organize knowledge in some formal way, such as the expressive ontologies so dear to us in the AI community, is only one way to approach things, especially when there are social processes in place to help one navigate the tangled mess that the Web provides. Or to use just one specific example, as an information retrieval challenge YouTube is a disaster, but as a way of spreading video in a viral way across the social structures of the World Wide Web, it is an unmatched success.
A little Semantics
For many AI researchers, this social part of the Web really is like the dark side of the moon. We’re so used to thinking that “knowledge is power” that we slide down a slippery slope into the “more is better” fallacy. If some expressivity is good, lots must be great, and in some cases this is correct. What we forget, however, is something I’ve been saying for so long that it has become a sort of catch phrase in Semantic Web circles: “a little semantics goes a long way.” In fact, something I’m just now beginning to understand is exactly how little is needed to go a long way on something as mind-bogglingly huge and unorganized as the World Wide Web.
A key realization that Berners-Lee had with respect to the design of RDF is that having unique names for different terms, with a social convention for precisely differentiating them, could in and of itself be an important addition to the Web. If you and I decide that we will use the term “http://www.cs.rpi.edu/~hendler/elephant” to designate some particular entity, then it really doesn’t matter what the other blind men think it is; they won’t be confused when they use the natural language term “elephant”, which is not even close, lexicographically, to the longer term you and I are using. And if they choose to use their own URI, “http://www.other.blind.guys.org/elephant”, it won’t get confused with ours.
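To make the naming point concrete, here is a minimal sketch in Python. The two URIs are the ones used above; the "looksLike" predicate, the descriptions, and the describe helper are invented purely for illustration and are not part of RDF or of any real vocabulary.

```python
# Hypothetical illustration: statements are keyed by full URI, not by the bare word.
OURS = "http://www.cs.rpi.edu/~hendler/elephant"
THEIRS = "http://www.other.blind.guys.org/elephant"

# A toy set of (subject, predicate, object) statements.
# The predicate name and the descriptions are made up for this sketch.
triples = {
    (OURS, "looksLike", "a large grey mammal"),
    (THEIRS, "looksLike", "a rope"),
}

def describe(term):
    """Return every description recorded for exactly this term."""
    return sorted(o for s, p, o in triples if s == term and p == "looksLike")

print(describe(OURS))
print(describe(THEIRS))
print(describe("elephant"))  # the bare word names nothing in this store
```

Because each party's statements hang off its own URI, neither set of descriptions leaks into the other, and the plain word "elephant" matches neither.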
The trick comes, of course, as we try to make these things more interoperable. It would be nice if someone from outside could figure out, and even better assert in some machine-readable way, that these two URIs were really designating the same thing – or different things, or different parts of the same thing, or … ooops, notice how quickly we’re on that slippery slope. If we want to have all the ways we could talk about how these things relate, we’re back to rediscovering knowledge representation in all its glory (and the morass of reasoning issues that come with it). But what if we stop somewhere, and only allow a little bit of this? Suppose we simply go with “same” or “different.” Sounds pretty boring, and not at all precise, certainly not something that is going to get you an article in IEEE Intelligent Systems.
Ahh, but now let’s move to the Web. Whenever a user creates a blog entry in LiveJournal (livejournal.com), a small machine-readable description becomes available against a somewhat minimal person ontology called FOAF (for “Friend of a Friend”). I read recently that LiveJournal has about 15,000,000 of these FOAF entries. There are other blogging and social networking sites that also create FOAF files, accounting for at least fifty million little machine-readable web documents. And FOAF contains a little piece of OWL, the Semantic Web’s ontology language, which says that if two entries have the same email address, they should be assumed to be by the same person. This one little piece of “same” information suddenly allows a lot of interoperability and a jump-start for many kinds of data-mining applications. The rule may not be 100% correct, but then on the Web, what is?
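The email rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profile URLs and addresses below are invented, profiles are reduced to a set of mailbox strings, and merge_by_mbox is a hypothetical helper (a small union-find), not the way any particular FOAF tool or OWL reasoner is implemented.

```python
from collections import defaultdict

def merge_by_mbox(profiles):
    """Group profile ids that share an email address, directly or transitively.

    profiles maps a profile id to the set of mailbox strings it lists.
    Returns a sorted list of sorted groups of profile ids.
    """
    parent = {pid: pid for pid in profiles}

    def find(x):
        # Walk up to the group representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # mailbox -> first profile id that listed it
    for pid, mboxes in profiles.items():
        for mbox in mboxes:
            if mbox in first_seen:
                union(first_seen[mbox], pid)  # same mailbox => assume same person
            else:
                first_seen[mbox] = pid

    groups = defaultdict(set)
    for pid in profiles:
        groups[find(pid)].add(pid)
    return sorted(sorted(g) for g in groups.values())

# Invented example data: three profiles linked by shared addresses, one unrelated.
profiles = {
    "http://site-a.example/alice/foaf": {"mailto:alice@example.org"},
    "http://site-b.example/users/al": {"mailto:alice@example.org", "mailto:al@example.net"},
    "http://site-c.example/al": {"mailto:al@example.net"},
    "http://site-a.example/bob/foaf": {"mailto:bob@example.org"},
}

people = merge_by_mbox(profiles)
for group in people:
    print(group)
```

The first three profiles end up in one group even though the first and third share no address directly, which is the sense in which one small "same" rule lets data from separate sites be joined. As the text notes, the rule is heuristic: a shared family mailbox would wrongly merge two people.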
Believe it or not, several startup companies are looking quite successful by using this rule, and other similar assertions about equality (or inequality), as a basis of helping to create personal information management tools for web data or to do a better job of matching advertising to Web users (I hate to say it, but such matching is the biggest legal money-maker on the Web). By being able to, even heuristically, equate things found in different web applications to one another, a whole range of mash-ups and other Web applications become possible. A very little piece of semantics, multiplied by the billions of things it can be applied to on the Web, can be a lot of power.
Semantic Raisins
There’s more to this dark side story that gets into technical aspects of Web application development. Forthcoming Semantic Web standards, such as the SPARQL query language or the GRDDL mechanism for extracting RDF from annotated XHTML pages, make it much easier to embed these little bits of semantics into other Web applications (including those of “Web 2.0”). So while the funding from places like DARPA and the NSF in the US, and from the EU’s IST program, has been looking at what we might call the “high end” of the Semantic Web, the leading edge of the Web world has been bumping into the “low end” and finding useful solutions available in the RDF and OWL world.
Expert Systems never really made it as a stand-alone technology, but they were far more successful when appropriately embedded in other applications (Pat Winston’s “raisins in the raisin bread”). Semantic Web developers are beginning to understand that our technology can similarly gain use by being successfully embedded into the somewhat chaotic, but always exciting, world of Web Applications. This opens up a brand-new playground for us to explore largely unexamined approaches in which a little AI, coupled with the very “long tail” of the Web, opens up new and exciting possibilities for a very different class of (just a little bit) intelligent systems.
Welcome to the Dark Side,
(Signature)

December 14th, 2006 at 8:34 pm
Great stuff! (Semantic Raisins had me puzzled for a minute…)
Coincidentally I just got back the edited draft of part 2 of a thing I’m doing for a column in IEEE Internet Computing; I’ll tidy it in the morning (part one’s here). Same general area, slightly different angle. I’m suggesting that although it might not involve RDF, a lot of the Web 2.0 stuff is heading towards the Semantic Web by putting data and new kinds of wiring on the Web. (All it’s lacking are the languages to join it all together…). Sometimes seems like it might be an emergent property of the Web to fill in its own missing bits, though that’s heading into weird territory.
December 14th, 2006 at 11:25 pm
Yeah, but we also want to avoid things being separately rediscovered outside the standards - I’ve seen a lot of Web 2.0 stuff that would be better with some use of SW standards instead of their own homebrew ways of doing the same thing.
I used to be asked “can’t you just do this with XML” and our answer was “yes, we did” (interestingly I don’t get asked that much anymore — people are slowly beginning to realize what XML is and isn’t good for)
December 15th, 2006 at 4:23 am
[…] Jim Hendler continues his exploration of the dark side of the semantic web with a must-read editorial for IEEE Intelligent Systems, the well-respected AI journal: A key realization that Berners-Lee had with respect to the design of RDF is having unique names for different terms, with a social convention for precisely differentiating them, could in and of itself be an important addition to the Web. If you and I decide that we will use the term “http://www.cs.rpi.edu/~hendler/elephant” to designate some particular entity, then it really doesn’t matter what the other blind men think it is, they won’t be confused when they use the natural language term “elephant” which is not even close, lexigraphically, to the longer term you and I are using. And if they choose to use their own URI, “http://www.other.blind.guys.org/elephant” it won’t get confused with ours. […]
December 16th, 2006 at 2:22 pm
I think I’m in agreement with your general direction but found it odd that you link the low end with “useful solutions available in the RDF and OWL world”. You are really high end if that’s low end to you. One suggestion would be for the OWL/RDF world to attempt to come to some agreement on a very restricted list of name spaces for low level semantics (Dublin core, etc.). I like microformats for that reason, the community insists on brevity and avoids duplication, while OWL/RDF space boggles the mind.
Correct me please, but I see RDF people representing the same semantic concepts in a myriad of name spaces, which is ultimately self-defeating. For example, storing the “description” of something. In a typical OWL doc you’ll find four or five different ways of storing that text: at the RDF level, then the OWL level, then there’s the Dublin Core namespace, and often the domain ontology will declare its own.
I’m sure there’s some derision at microformats, but finally we have a clear way to indicate the geo co-ordinates of an event, place, etc. and it’s easy and lucid to the rest of us. That’s a “raisin” that brings tremendous value. Once all images are geo tagged, we need not append mounds of non-temporal meta data to the image, as the location acts as a key to gather more info about where the picture was taken.
December 18th, 2006 at 6:52 am
The dark side of the semantic web…
A pre-publication article for IEEE Intelligent Systems, for people who don’t particularly like RDF…
December 19th, 2006 at 6:27 am
[…] I haven’t said anything much about semantic web stuff for a while as I’ve been occupied with other things. However Jim Hendler’s ‘Tales from the Dark Side’ piece in IEEE Intelligent Systems reawoke an old interest. In short: I still think the RDF people have got it wrong with URIs, and so far nobody’s convinced me otherwise. […]
December 19th, 2006 at 4:25 pm
[…] Mindswap Weblog » Blog Archive » Tales from the Dark Side - continued Under the gun for my editorial in IEEE Intelligent Systems, I decided I would continue with the “dark side” theme, and see if I could state it coherently in a form that non-PlanetRDF AI folks might understand. (tags: www.mindswap.org 2006 web_semântica rdf editorial blog_post xml ontology semantic_web) […]
December 22nd, 2006 at 6:46 pm
Inasmuch as the W3C Semantic Web and AI are breaking new and important ground, I still see three major hurdles that suggest that at some point, innovation must give way to real invention. First, I do not see W3C as semantic, but rather as linguistic. It’s all about language and descriptions, not the “language of thought,” that pure conceptual meaning we have in our heads. Language is a good system for general communication, especially diplomatic, but not so for precise communication of precise concepts and ideas. The idea of making an arbitrarily agreed upon system of communication (language), one that is adrift and ever changing, a technological standard does not make sense on a small or large scale. There are far too many religions in technology to think that a world of creative cats will all of a sudden align to someone else’s standard. I think science will set that standard, not organizations. When it does, we will make the jump across the linguistic/semantic gap into the world of machines that understand the language of thought.
Secondly, I think the word “scalability” is right on the money. Isn’t that what it’s all about? When I think of all the variables associated with every facet of human knowledge, it seems to me that the most pressing problem that technology faces is piercing the complexity barrier with a new architecture that scales in proportion to content: a new invention that can integrate unlimited variables to capture precise meaning. I agree with the writer that this is the core goal/value of AI. Otherwise, we are left with more layers of structure, logic and language - with all of that, machines will never learn.
Finally, I see the internet and the network-centric world in general as a conveyor for information, not knowledge. Any time people have to interpret data, documents and objects to make sense of their content, it is a sure sign that only a small part of knowledge is being conveyed. The missing ingredient is theory, the conditional reasoning power that humans carry around in their brains. When theory and information combine, there is knowledge - the stuff that decreases the uncertainty of a quantum world.
Kudos for a good article.
January 15th, 2007 at 3:44 pm
[…] I came across the blog of the creator of SWOOP and found some very interesting posts. I’ll quote one post, Tales from the Dark Side - continued, which is due to become an editorial for IEEE IS: The point is there’s a lot happening in the Semantic Web space that is exciting and important, but which is coming from the “Web” side, rather than the AI space. As a result, to many AI researchers this is an unknown part of the technology, and thus the “dark side” allusion. The need to organize knowledge in some formal way, such as the expressive ontologies so dear to us in the AI community, is only one way to approach things, especially when there are social processes in place to help one navigate the tangled mess that the Web provides. Or to use just one specific example, as an information retrieval challenge YouTube is a disaster, but as a way of spreading video in a viral way across the social structures of the World Wide Web, it is an unmatched success. For many AI researchers, this social part of the Web really is like the dark side of the moon. We’re so used to thinking that “knowledge is power,” that we fall into a slippery slope, “more is better,” fallacy. If some expressivity is good, lots must be great, and in some cases this is correct. What we forget, however, is something I’ve been saying for a long time, it’s become sort of a catch phrase in Semantic Web circles, “a little semantics goes a long way.” In fact, something I’m just now beginning to understand, is exactly how little is needed to go a long way on something as mind-boggling huge and unorganized as the World Wide Web. By being able to, even heuristically, equate things found in different web applications to one another, a whole range of mash-ups and other Web applications become possible. A very little piece of semantics, multiplied by the billions of things it can be applied to on the Web, can be a lot of power.
Semantic Web developers are beginning to understand that our technology can similarly gain use by being successfully embedded into the somewhat chaotic, but always exciting, world of Web Applications. This opens up a brand-new playground for us to explore largely unexamined approaches in which a little AI, coupled with the very “long tail” of the Web, opens up new and exciting possibilities for a very different class of (just a little bit) intelligent systems. collaborative tagging FOAF ontology owl RDF web2.0 […]