Prof. Gerhard Weikum (53) is a Director at the Max Planck Institute for Computer Science in Saarbrücken, where he heads the Databases and Information Systems department. His previous employers include the Swiss Federal Institute of Technology (ETH Zurich). His special field of interest is the automated and intelligent search for information in data systems and the World Wide Web. Weikum is one of the world’s leading researchers on statistical methods for efficiently and precisely tapping the knowledge scattered throughout the Internet. Weikum, who refers to his mathematical search procedures simply as “knowledge harvesters,” works with Siemens in training doctoral candidates.
Your vision is to bring order to the knowledge that is scattered throughout the Internet and make it available to everyone. One of the search programs you’ve developed is called NAGA — “Not another Google answer.” What’s wrong with Google answers?
Weikum: Search engines such as Google are great; there’s no doubt about it. But they’re still comparatively dumb, because they’re unable to answer complex questions. Let’s say your question is “Which famous scientist outlived his four children?” To answer a question like that, the best search engines will supply you with thousands of websites containing the words “scientist” and “children,” but the correct answer is Max Planck. In other words, such engines provide only a fraction of the knowledge that is available on the Internet.
Can your software do more?
Weikum: It would certainly be able to identify Max Planck. It establishes logical, semantic connections between concepts, and it understands the context. But when it comes to knowledge on the Internet, there are much more exciting issues than search engines. Take the question of how best to exploit the knowledge of the many millions of people who use the Internet. How can we harvest the implicit human knowledge that is to be found in blogs, Internet forums, and other kinds of websites?
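To make the idea of semantic search a little more concrete, here is a minimal, hypothetical sketch of answering the Max Planck question over structured facts rather than keywords. The triple schema, predicate names, and selection of facts are invented for illustration and do not represent NAGA’s actual knowledge base or query language.

```python
# Toy "knowledge harvesting" sketch: facts are stored as
# subject-predicate-object triples and queried structurally,
# rather than by keyword matching. Schema and facts are illustrative only.

facts = [
    ("Max_Planck", "type", "scientist"),
    ("Max_Planck", "diedIn", 1947),
    ("Max_Planck", "hasChild", "Karl_Planck"),
    ("Max_Planck", "hasChild", "Erwin_Planck"),
    ("Karl_Planck", "diedIn", 1916),
    ("Erwin_Planck", "diedIn", 1945),
    ("Charles_Darwin", "type", "scientist"),
    ("Charles_Darwin", "diedIn", 1882),
    ("Charles_Darwin", "hasChild", "Francis_Darwin"),
    ("Francis_Darwin", "diedIn", 1925),
]

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known fact."""
    return [o for s, p, o in facts if s == subject and p == predicate]

def scientists_who_outlived_all_children():
    scientists = [s for s, p, o in facts if p == "type" and o == "scientist"]
    answers = []
    for person in scientists:
        children = objects(person, "hasChild")
        death = objects(person, "diedIn")
        if not children or not death:
            continue
        child_deaths = [objects(c, "diedIn") for c in children]
        if all(d and d[0] < death[0] for d in child_deaths):
            answers.append(person)
    return answers

print(scientists_who_outlived_all_children())  # ['Max_Planck']
```

The point is only that a structured query can combine separate facts (children, death years) in a way that keyword matching cannot.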
In other words, how can we best tap the collective intelligence of the Web?
Weikum: Collective intelligence is something of a myth. Why should one million non-experts know more than one expert? If a bunch of non-experts write a heap of nonsense, there’s no reason why that should add up to the truth. For a long time, search engines fed with the terms “Barack Obama” and “country of origin” came up with the answer “Kenya,” simply because there had been lots of speculation on the Internet that President Obama was not a U.S. citizen. The challenge is therefore to separate the wheat from the chaff and filter out the vagueness and untruths from the information available on the Internet. Distilling that information to produce collective knowledge only makes sense if you stick to high-quality sources. Machine systems have an advantage here in that sensors don’t lie. So in the case of such systems, it really is true, statistically speaking, that the sum of all the items of information is more reliable than the best single item.
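The statistical claim about sensors can be illustrated with a small simulation (not part of the interview): if many sensors measure the same quantity with independent, unbiased noise, averaging their readings gives a more reliable estimate than any single reading.

```python
# Small simulation of the statistical point: averaging many unbiased,
# independently noisy sensor readings is usually closer to the true value
# than any single reading. Numbers below are invented for illustration.

import random
import statistics

random.seed(42)
TRUE_VALUE = 20.0      # the quantity being measured, e.g. a temperature
NUM_SENSORS = 1000
NOISE_STDDEV = 0.5     # each sensor is unbiased but noisy

readings = [random.gauss(TRUE_VALUE, NOISE_STDDEV) for _ in range(NUM_SENSORS)]

single_error = abs(readings[0] - TRUE_VALUE)                 # one arbitrary sensor
mean_error = abs(statistics.mean(readings) - TRUE_VALUE)     # the "collective" estimate

print(f"error of a single sensor:            {single_error:.3f}")
print(f"error of the mean of {NUM_SENSORS} sensors: {mean_error:.3f}")
# The mean's error shrinks roughly like stddev / sqrt(n); a crowd of
# non-experts writing contradictory text has no comparable guarantee.
```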
Is collective intelligence therefore easier to achieve in the machine world?
Weikum: That’s difficult to say. It’s just a different kind of problem, that’s all. Just because thousands of sensors are supplying data, that doesn’t mean the system is intelligent. The important thing is what you do with the data. Equipping all the consumers in the power grid with intelligent sensor systems would produce a level of networking that has never been achieved before. But that would not be intelligent per se.
How can we develop an intelligent overall system?
Weikum: The main task we will face in the future is to make machine systems so fast and so intelligent that they can react to change in real time. The biggest challenge here is to ensure that a system possesses sufficient dynamism to be able to adjust to a new situation instantaneously. That requires an enormous amount of computing power. It would be a bit like cars being equipped with ice sensors that could issue warnings in real time about which sections of the road are dangerous, so that other cars can then be diverted onto alternative routes. A smart power grid could react in a similar way to fluctuations in demand or in generating capacity. But even then, you’re still going to push up against physical limits at some point or other. Events such as black ice or a downed transmission line can throw the whole system out of kilter. I suspect that a system is genuinely intelligent only if it also knows what to do in such exceptional circumstances.
Are you saying that we need rapid adaptability in the machine world and high data quality on the Internet?
Weikum: Yes. I think there’s a clear distinction to be made between the social intelligence of the Web and the technical intelligence of industrial applications. However, there are some overlaps. Take the common practice of adding tags to images posted on the Internet so as to describe their content. This enables other users to search for such images much more quickly and precisely. A sea view, for example, could be tagged with the words “cliff,” “sea” or “sailboat.” Machine systems, by contrast, have great difficulty recognizing the precise content of images, such as a waterfall or a scene shrouded in mist. Tagging therefore contains implicit, collective human intelligence. Or take the time when Internet users were asked to scan satellite images for the wreckage of a missing sailboat. Hundreds of thousands of people took part. It’s possible to imagine something similar happening with the evaluation of medical images. For many years now, there have been software systems that search CT scans and similar images for tumors or other signs of disease. This kind of software works on the basis of statistical models and is primed by being fed with training images. It is certainly conceivable that several hundred specialists in the field, working in an Internet forum, might first annotate the individual characteristics of such images. That way, their rich knowledge would be incorporated into the image recognition software.
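As a rough illustration of how expert annotations might prime such a statistical model, the toy sketch below trains a naive-Bayes-style scorer on tags that hypothetical specialists have attached to training images, then scores a new case. All tags, labels, and data are invented; real CT analysis software works on pixel data and far richer features.

```python
# Toy sketch: from human annotations to a statistical model.
# Specialists attach descriptive tags to training images; a simple
# naive-Bayes-style scorer then classifies a new, unseen case.

from collections import Counter, defaultdict
import math

# (tags annotated by specialists, diagnosis confirmed later) -- invented data
training = [
    ({"irregular_margin", "high_contrast_uptake"}, "suspicious"),
    ({"irregular_margin", "calcification"},        "suspicious"),
    ({"smooth_margin", "low_contrast_uptake"},     "benign"),
    ({"smooth_margin", "calcification"},           "benign"),
]

labels = Counter(label for _, label in training)
tag_counts = defaultdict(Counter)
for tags, label in training:
    for tag in tags:
        tag_counts[label][tag] += 1

def score(tags, label):
    """Log-probability of the label given the annotated tags (add-one smoothing)."""
    logp = math.log(labels[label] / len(training))
    for tag in tags:
        logp += math.log((tag_counts[label][tag] + 1) / (labels[label] + 2))
    return logp

new_case = {"irregular_margin", "calcification"}
best = max(labels, key=lambda lbl: score(new_case, lbl))
print(best)  # 'suspicious' for this toy data
```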
That kind of forum would, of course, be limited to experts. But how can we go about using the collective intelligence of the Internet as a whole?
Weikum: That’s precisely the difficulty. For example, at present our software only utilizes more or less trustworthy sources such as Wikipedia and news portals, which we evaluate according to their quality and reliability. We’re very conservative in this regard. We still don’t use blogs. Nonetheless, that type of forum is very interesting. For example, the medical portals where people talk about the side effects of particular drugs can contain valuable information that goes far beyond the general guidance found in patient information leaflets. The challenge here is to find a way of utilizing these “soft” data on the Internet and making them systematically available to users. Or take services such as Twitter, which are excellent for spotting new trends. They could also be an important source of information for service providers. Say, for example, hundreds of people tweet about a delayed train. Car rental companies or bus operators could then advertise alternative arrangements on cell phones. The value of the collective intelligence that is brought together on the Internet is largely derived from the diversity of the information and opinions involved. The task that will face us in the future is to make this knowledge available in such a way that we preserve its diversity.
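One simple way to picture the idea of evaluating sources “according to their quality and reliability” is reliability-weighted voting over conflicting claims, as in the sketch below. The sources, weights, and claims are invented for illustration and are not the weighting scheme actually used by Weikum’s group.

```python
# Toy sketch: distilling a "collective" answer by weighting each source's
# claim with an estimated reliability, so that many unreliable sources
# do not outvote a few trustworthy ones. All values are illustrative.

from collections import defaultdict

# (source, claimed country of origin, estimated reliability in [0, 1])
claims = [
    ("encyclopedia_article", "United States", 0.9),
    ("news_portal",          "United States", 0.8),
    ("anonymous_blog",       "Kenya",         0.2),
    ("forum_post",           "Kenya",         0.2),
]

votes = defaultdict(float)
for source, answer, reliability in claims:
    votes[answer] += reliability        # reliability-weighted voting

best_answer = max(votes, key=votes.get)
print(best_answer)  # 'United States': two weak sources do not outvote strong ones
```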