Data mining at Siemens. Dr. Volker Tresp (right) and colleague Christof Störmann from Corporate Technology have developed software that automatically recognizes fraudulent behavior among cell-phone users
If you order a book or a CD from Amazon, you'll immediately be provided with information on related products that may be of interest to you. These product suggestions generally hit the mark, which means higher revenues for the online retailer. Data mining is what enables companies like Amazon to provide this service. It's a little like electronic treasure hunting: computer programs analyze existing data in order to obtain new and useful information. Data mining is primarily viewed as a tool for service providers such as mail-order companies, wireless network operators and banks. It allows them to collect the information gained from millions of clicks at their websites, payment transactions, or phone-call data and apply it toward creating better marketing strategies or more targeted customer service.
"Data mining first appeared on the scene about ten years ago," says Hans-Peter Kriegel, a pioneer in the field and a professor of information technology at the University of Munich. It began with supermarket scanners that made possible a detailed analysis of each product bought. "One of the first results is now legend in the data-mining research community," says Kriegel. "People who buy large bags of diapers generally also purchase beer." Algorithms were utilized to generate thousands of such associations, some of which were useful, many strange, and most irrelevant. "It quickly became clear that you could only obtain valuable information if you asked the right questions," Kriegel explains.
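The diapers-and-beer rule can be reproduced in a few lines of code. The sketch below is illustrative only (the item names and thresholds are invented): it counts how often pairs of items appear together and keeps only rules with sufficient support and confidence, which is the "asking the right questions" filter Kriegel describes.

```python
from collections import Counter
from itertools import combinations

def association_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find simple one-to-one rules (A -> B) from a list of itemsets,
    keeping only rules above the support/confidence thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n          # how often A and B occur together
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # P(y in basket | x in basket)
            if confidence >= min_confidence:
                rules.append((x, y, round(support, 2), round(confidence, 2)))
    return rules

baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]
print(association_rules(baskets))
```

With these toy baskets, the rule "diapers → beer" survives with support 0.6 and confidence 0.75; weak or rare pairings are discarded, which is exactly how the flood of "strange and irrelevant" associations gets thinned out.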
Data mining is more than just data analysis (see graphic). It is part of a procedure that begins with the processing of data and generally does not end even after results have been interpreted, as the latter activity only brings forth new questions. "Individuals still play an important role, since human intervention is necessary at many stages of the process, and such intervention is only possible with some degree of advance knowledge," says Ulrich Reincke, head of the Analytical Solutions Excellence Center in Germany for U.S. software company SAS. "Unlike statistics, where data is created, selected and analyzed for a particular purpose, data mining involves the analysis of an uncontrolled flood of data as a means of drawing conclusions about the future," he adds. The trick is to not only find interesting correlations in the present, but also to be able to make predictions about future behavior.
Dr. Ralph Neuneier and his researchers at Siemens Corporate Technology are using data-mining tools from Panoratio to analyze commercial websites. Such tools make it possible to examine "clicking behavior" in real time. Web surfers can thus be presented with links that other users displaying similar clicking behavior also visited. The Siemens website, which was recently named the best among 200 companies by ComputerBild magazine, has over a million visitors a month. Panoratio software compresses the data generated by mouse clicks from one gigabyte to about five megabytes and makes it accessible in a networked form. Information collected includes duration of visit, user domain, pages previously visited and the order of the clicks. Researchers can thus see how users navigate the site and which content interests them. Personalization measures could then be implemented that would make the site even more appealing. For instance, frequent visitors could be recognized and greeted at the homepage with a list of their preferred links.
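The "users with similar clicking behavior" idea amounts to simple co-visitation counting. A minimal sketch, with made-up page names rather than anything from the actual Siemens site or the Panoratio software: pages are scored by how strongly their sessions overlap with the current visitor's path.

```python
from collections import Counter

def recommend(sessions, current_pages, top_n=2):
    """Suggest links visited by users whose click paths overlap the
    current visitor's pages (a simple collaborative filter)."""
    scores = Counter()
    current = set(current_pages)
    for session in sessions:
        pages = set(session)
        overlap = len(pages & current)
        if overlap:
            # Pages the visitor has NOT yet seen, weighted by overlap.
            for page in pages - current:
                scores[page] += overlap
    return [page for page, _ in scores.most_common(top_n)]

sessions = [
    ["home", "products", "motors"],
    ["home", "products", "motors", "support"],
    ["home", "careers"],
]
print(recommend(sessions, ["products", "motors"]))
```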
Data Jungle Pitfalls. The first step in mining involves preparing the data. That's a big challenge, according to Reincke, since much of the data comes in volumes of ten or more terabytes and the data sets are generally in different formats. Anyone who has ever attempted to convert an address file with more than ten fields per entry from an e-mail program into tabular form is well aware of the pitfalls of the data jungle. First and last names often get switched, and zip codes can end up in the field for street names. But real data mining is even more difficult, as it usually involves data formats from several different divisions of a company. And once you've finally got the formats in line, you still have to search for and correct errors and outliers (statistical anomalies). Experts estimate that preparing the data for processing accounts for 60 to 90 percent of data-mining costs.
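Two of the pitfalls mentioned, swapped address fields and statistical outliers, can be caught with simple heuristics. This is a toy sketch; the five-digit zip rule and the two-sigma threshold are illustrative assumptions, not how any particular product does it.

```python
import re
import statistics

def clean_record(record):
    """Normalize one address record: swap fields when a zip code has
    landed in the street field, and strip stray whitespace."""
    rec = {k: v.strip() for k, v in record.items()}
    # Hypothetical heuristic: a German zip code is exactly five digits.
    if re.fullmatch(r"\d{5}", rec["street"]) and not re.fullmatch(r"\d{5}", rec["zip"]):
        rec["street"], rec["zip"] = rec["zip"], rec["street"]
    return rec

def flag_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

raw = {"name": " Anna Schmidt ", "street": "80333", "zip": "Maxstr. 4"}
print(clean_record(raw))
print(flag_outliers([120, 115, 130, 125, 118, 990]))
```

Real preparation pipelines apply hundreds of such rules across formats from different company divisions, which is why this stage dominates the cost.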
Database specialists choose their data mining tool according to the type of issue in question. "Decision trees are very helpful in determining rules," says Dr. Volker Tresp from Siemens Corporate Technology (CT) in Munich. Such trees, which are similar to computer flow charts, are constructed in stages on the basis of the questions being posed and the possible answers. In the process, IT engineers use part of the data they already have in order to test the tree's predictive ability. Once they've set up a system that correctly depicts all factors, they can make fairly good predictions and explain them in a logical way. Neural networks, which are based on the structure of the human brain, function in a similar manner. "Such networks are more robust with regard to faulty data, but they're also harder to interpret and are therefore mostly used in conjunction with very complicated problems," says Tresp. Siemens uses neural networks to optimize rolling mills and paper production, for example, and has also developed a sales forecast program for products such as cell phones (see Pictures of the Future, Fall 2003, Simulation and Optimization).
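How such a tree classifies, and how its predictive ability is tested on held-back data, can be sketched in a few lines. The attributes, thresholds and labels below are invented for illustration; a real tree would be learned from the data rather than written by hand.

```python
# Hypothetical churn-risk tree: each node asks about one attribute,
# each leaf is a plain prediction label.
tree = {
    "attr": "monthly_minutes",
    "threshold": 100,
    "low": {"attr": "complaints", "threshold": 2,
            "low": "stays", "high": "cancels"},
    "high": "stays",
}

def classify(tree, record):
    """Walk the decision tree until a leaf (a plain label) is reached."""
    node = tree
    while isinstance(node, dict):
        branch = "high" if record[node["attr"]] > node["threshold"] else "low"
        node = node[branch]
    return node

def accuracy(tree, labelled_records):
    """Predictive ability on held-back data, as the engineers test it."""
    hits = sum(classify(tree, r) == label for r, label in labelled_records)
    return hits / len(labelled_records)

holdout = [
    ({"monthly_minutes": 300, "complaints": 5}, "stays"),
    ({"monthly_minutes": 40, "complaints": 4}, "cancels"),
    ({"monthly_minutes": 60, "complaints": 0}, "stays"),
]
print(accuracy(tree, holdout))  # 1.0
```

The appeal of trees is visible here: every prediction can be read off as a chain of if/then questions, which is exactly the "explain them in a logical way" property Tresp mentions and what a neural network gives up in exchange for robustness.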
A third method is cluster analysis, which is useful for establishing similarities and is especially good for identifying outliers. Kriegel and his team used such a procedure as the basis for developing a prototype of an interactive similarity-search system. The program, known as Boss, specializes in the management of CAD components, such as automotive parts, but can also be applied to the organization of any type of object. Another cluster procedure takes similar objects (screws, for example) and positions them next to one another in a two-dimensional depiction. Boss places the objects in hierarchical order and generates appropriate depictions of those that are similar. "A vehicle or aircraft designer can then quickly determine whether or not an existing part can also be used for a new model," Kriegel explains. "Engineers can search the database without having to further specify the component, and that saves time and money."
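Grouping similar parts by their measurements is the core of such a cluster analysis. Below is a minimal sketch using a greedy single-link rule; the screw names and dimensions are made up, and a real CAD similarity search like Boss compares full geometries rather than two numbers. Note how the dissimilar part ends up alone, which is how clustering exposes outliers.

```python
import math

def cluster(parts, max_distance):
    """Greedy single-link clustering: a part joins the first cluster that
    already contains a sufficiently similar part; otherwise it starts
    a cluster of its own."""
    clusters = []
    for name, features in parts:
        for group in clusters:
            if any(math.dist(features, f) <= max_distance for _, f in group):
                group.append((name, features))
                break
        else:
            clusters.append([(name, features)])
    return clusters

# Hypothetical screw catalogue: (length in mm, head diameter in mm)
screws = [
    ("M4x20", (20.0, 7.0)),
    ("M4x22", (22.0, 7.0)),
    ("M10x80", (80.0, 17.0)),
    ("M10x85", (85.0, 17.0)),
    ("odd_part", (300.0, 50.0)),  # outlier: ends up in its own cluster
]
for group in cluster(screws, max_distance=10.0):
    print([name for name, _ in group])
```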
One Step Ahead of Machine Failure. However, it's not just development departments that can benefit from data mining; the procedure also leads to improvements in production. Whether it's semiconductor production, cell-phone manufacturing or auto production, supplier parts may individually fulfill all requirements, but they can cause problems in combination when installed in the final product. The networking of all process data can reveal correlations that an individual engineer cannot see and quality-control experts would have difficulty finding.
Each slice of the "share-price pie" (above) depicts the development of a share price in an index over 20 years, starting at the center and moving outward. The prices are standardized and comparable. The lighter the color, the higher the price. Identically colored rings stand for identical developments. The VisDB system (below) is used to analyze large databases. It allows various visualizations to be made
A data-mining tool such as a neural network recognizes the parameters that are critical for ensuring high quality and which therefore must be most carefully monitored. It can also make forecasts, regarding maintenance, for example. "Data on equipment and processes can be employed to make forecasts of the likelihood of machine failure," Reincke explains.
Recognition of deviations is also a feature of systems designed to track cases of credit-card fraud. Here, data on account transactions is used to generate a pattern for the typical behavior of a normal customer. If, for example, a credit card is used to withdraw cash from several ATMs in a short period of time, or to purchase large amounts of expensive electronic equipment, the algorithm interprets such behavior as suspicious. The system then warns the issuing bank, which treats the card as possibly having been stolen. Such analyses and interpretation require powerful computers, as the huge amounts of data have to be processed as quickly as possible in a cyclical manner.
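The ATM example reduces to counting withdrawals inside a sliding time window. Below is a toy version of such a deviation check; the threshold and window length are arbitrary illustrative choices, not parameters of any real fraud system.

```python
from datetime import datetime, timedelta

def suspicious_withdrawals(events, max_count=3, window=timedelta(hours=1)):
    """Flag a card when more than `max_count` cash withdrawals fall
    inside one sliding time window, i.e. behavior that deviates from
    the typical-customer pattern."""
    times = sorted(t for t, kind in events if kind == "atm_withdrawal")
    for i in range(len(times)):
        j = i
        while j + 1 < len(times) and times[j + 1] - times[i] <= window:
            j += 1
        if j - i + 1 > max_count:
            return True
    return False

t0 = datetime(2004, 5, 1, 12, 0)
# Four withdrawals within 30 minutes: suspicious.
rapid = [(t0 + timedelta(minutes=10 * k), "atm_withdrawal") for k in range(4)]
print(suspicious_withdrawals(rapid))  # True
```

A production system evaluates many such patterns against each transaction stream in near real time, which is why the article stresses the need for powerful computers and cyclical processing.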
Data mining is a complex process. Although it takes place inside computers, it still calls for skillful human intervention at several stages along the way. First of all, the data is combined into an appropriate form so that it can be analyzed. Before the actual data-mining process begins, certain data is selected. Following analysis, experts have to interpret the patterns that have been found. This allows them to see whether the software has uncovered interesting correlations or only information that is irrelevant
Siemens CT has developed another type of forecast tool for banks to help them plan precisely when to fill ATMs and how much cash to put in them. Up until now, ATMs have been stocked with much more cash than they needed. The sums range from 20,000 to 40,000 euros per machine, which adds up to a mountain of money considering that in Germany alone there are some 50,000 ATMs. Estimates indicate that banks could earn approximately 50 million euros per year on this cash at about 5 percent interest.
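The estimate is easy to verify arithmetically, taking the surplus at the lower bound of roughly 20,000 per machine:

```python
atms = 50_000             # ATMs in Germany (figure from the article)
surplus_per_atm = 20_000  # excess cash per machine, lower bound
interest_rate = 0.05      # about 5 percent

idle_capital = atms * surplus_per_atm   # one billion tied up in machines
annual_earnings = idle_capital * interest_rate
print(annual_earnings)    # 50 million per year
```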
Recognizing Cell-Phone Fraud. Siemens CT has also developed a system that recognizes fraud in wireless networks. The system is now being used by a network operator with approximately one million customers. "Our software sounds an alarm when a cell-phone customer begins behaving in an unusual manner," Tresp explains. This might involve making several consecutive calls abroad or to special service numbers that charge high rates. Such fraud often involves people who sign a cell-phone contract although they don't have enough money in their account to cover the costs, and then defraud the network operator by selling call time cheaply for cash until the company blocks the SIM card.
In another application, SAS uses data-mining techniques to determine which customers are highly likely to cancel their cell-phone contract in the near future. The advantage for the network operator is that it can offer the customer a better package before he or she decides to cancel, thus boosting customer loyalty. The SAS system applies all telephone-related data to the analysis, including time and duration of calls as well as the numbers dialed and the corresponding costs. It then compares this information with data on the area in which the customer lives, which it obtains from data providers and which can contain up to 100 parameters. This is supplemented by information from call centers: complaints, questions or special requests. All of the data on individual customers is encoded and made anonymous, which means it cannot be traced back.
Due to data-protection considerations, data-mining tools store raw data for only a short period and then combine them into larger, completely anonymous sets at regular intervals. "We feed the software with data collected in the past," says Reincke. "This leads to the generation of patterns that are characteristic of those customers who have actually canceled contracts." The patterns are applied to current data and the system then produces a cancellation probability statement for each customer.
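One simple way to turn historical cancellation patterns into a per-customer probability is to average per-feature cancellation rates learned from past data. This is a crude illustration, not the SAS method, and the feature names are invented:

```python
from collections import defaultdict

def train(history):
    """Learn, per feature value, the share of past customers with that
    value who actually cancelled: the 'pattern' mined from old data."""
    seen = defaultdict(lambda: [0, 0])   # value -> [cancelled, total]
    for features, cancelled in history:
        for value in features:
            seen[value][1] += 1
            if cancelled:
                seen[value][0] += 1
    return {v: c / t for v, (c, t) in seen.items()}

def cancellation_probability(rates, features):
    """Average the learned per-feature rates for one current customer."""
    known = [rates[v] for v in features if v in rates]
    return sum(known) / len(known) if known else 0.0

history = [
    ({"many_complaints", "short_calls"}, True),
    ({"many_complaints", "long_calls"}, True),
    ({"no_complaints", "long_calls"}, False),
    ({"no_complaints", "short_calls"}, False),
]
rates = train(history)
print(cancellation_probability(rates, {"many_complaints", "short_calls"}))
```

Applied to current data, the same scoring step produces the per-customer cancellation probability statement the article describes.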
Evaluating e-mails. Calls made to call centers are converted into text documents, which are then examined using a new method known as text mining. Here a program recognizes similar documents by comparing the words they contain. "The program automatically breaks a text down into its individual elements," Reincke explains. The software puts all words into their basic forms and selects simple synonyms in order to reduce complexity. Similarities are determined on the basis of how often certain words are used in the texts. A "well-trained" data-mining system can, for example, classify incoming e-mails and automatically forward them to the right destination.
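Comparing documents "on the basis of how often certain words are used" is typically done with word-frequency vectors. Here is a minimal sketch with a tiny synonym map; the synonym list and example texts are invented, and real systems also stem words and weight them by rarity.

```python
import math
import re
from collections import Counter

def bag_of_words(text, synonyms=None):
    """Tokenize, lowercase, and map simple synonyms to one base form."""
    synonyms = synonyms or {}
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(synonyms.get(w, w) for w in words)

def cosine_similarity(a, b):
    """Similarity of two documents from their shared word frequencies."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical synonym map used to reduce complexity, as Reincke describes.
synonyms = {"bill": "invoice", "billing": "invoice"}
complaint = bag_of_words("My bill is wrong, the bill charges too much", synonyms)
other = bag_of_words("Question about my invoice and invoice charges", synonyms)
print(round(cosine_similarity(complaint, other), 2))
```

A "well-trained" classifier compares an incoming e-mail against labelled examples this way and forwards it to the destination whose examples it most resembles.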
Networks of telephone and e-mail connections. Special software uses telephone data to determine the density of phone calls in the U.S. (left). Hewlett-Packard has compared its organizational structure with e-mail communication among its employees and discovered many virtual commonalities in the process (right). Each dot stands for an employee and each line for communication made via e-mail
Researchers at CT also wrote Teklis, a text-mining program that sorts incoming conventional mail. One project involved scanning letters and analyzing them with the program. Afterwards, they were distributed electronically. Text mining can also automatically analyze website content.
Following its merger with Compaq, Hewlett-Packard used SAS software to reorganize the product range of its different brands into new categories. This process involved the comparison of more than one million products. The classification, which, if manually executed, would have occupied an entire team for a long time, was completed by a single employee within just a few weeks with an accuracy of 95 percent. "What's more, by using data-mining methods we also found out that our organizational structure does not always reflect the actual connections between employees," stated HP Lab Director Bernardo A. Huberman at a conference in March 2004. HP had automatically analyzed its employees' e-mail connections and discovered many commonalities that the structure of the organization did not take into account.
The tremendous potential of data mining has only just begun to be exploited. Now, the lucrative market it represents is attracting companies like Microsoft, which sent a Beta version of its Yukon data-mining platform to more than 10,000 software developers for testing. Microsoft intends to use Yukon to compete with similar products from Oracle and IBM. "The platform will contain several different algorithms for things like decision trees, clustering and association rules," says Surajit Chaudhuri, manager of the data management, exploration and mining group at Microsoft. "In the long term we are also interested in text mining." This could some day allow even private users of Microsoft products to structure and analyze old data from their Office programs.
Another forward-looking enterprise is Panoratio (see box and Pictures of the Future, Fall 2003, Preventing Blackouts). The company has developed a type of MP3 system for databases that reduces terabytes of data down to several hundred megabytes, thus enabling extremely rapid analyses even on standard PCs. Such technology will not only help Amazon tip off customers about the right book. It will also enable industrial companies to mine their own data more efficiently in the future and thus optimize their logistic and production processes.
Norbert Aschenbrenner