Success Stories in Data/Text Mining[1]

Christophe Giraud-Carrier

Department of Computer Science

Brigham Young University


Abstract

This document presents a collection of successful implementations of Data/Text Mining. It consists mainly of excerpts from various sources, including vendors’ literature and web sites, general data/text mining web sites, and proceedings of relevant conferences. It is organized by domains of activity.


Some statistics about this document:

Number of stories reported: 146

Insurances: 12, Retail and Direct Marketing: 22, Telecommunications: 10, Banks and Financial Institutions: 32, Health Care: 20, Public Transport (Air, Rail, etc.): 7, Pharmaceutical Industry and Bioinformatics: 7, Manufacturing and Production: 13, Miscellaneous: 23

 

Number of DM tools/vendors referenced: 30

Clementine (SPSS), Intelligent Miner (IBM), DD Series (DataDistilleries), Enterprise Miner (SAS), DataScope (Cygron), digiMine Services, WizRule/WizWhy (WizSoft), Data Miners, Quadstone System (Quadstone), Recon (Lockheed Martin), KnowledgeSEEKER/KnowledgeSTUDIO (Angoss), S-PLUS (Insightful), PolyAnalyst (Megaputer Intelligence), Cognos, Exclusive Ore, Business Objects, CART/MARS (Salford Systems), NeuroShell (Ward Systems Group), Z Solutions, CleverPath Predictive Analysis Server, Teradata Warehouse Miner, The Neural Network Toolbox (MATLAB), KXEN Analytic Framework, PolyVista Discovery Technology, E.piphany's product suite, Model Builder (Fair Isaac), ModelMAX, ScorXPRESS (ASA)


Insurances

Liverpool Victoria

Winterthur Insurance

ALKA[i]

Independence Blue Cross

Spaarbeleg

Empire Blue Cross and Blue Shield of New York

FBTO

Medical Benefits Fund of Australia

Amicon

Isapre Cruz Blanca

The Hartford Steam Boiler Inspection & Insurance Co. (HSB)

Major Insurance Company

Retail and Direct Marketing

J. Sainsbury’s

Safeway UK

Marks & Spencer

Plow & Hearth

Williams-Sonoma

Thomas Cook

Argos, UK

UNICEF Germany

CG2 Direct

Direct Wines

Allrecipes.com

Woodcraft Supply Corp

Envision EMI, Inc.

CustomerLinx

Cécile Co., Japan

Miami Herald Publishing Co.

Jubii, Denmark

Eddie Bauer

Fingerhut

The Vermont Country Store

Experian

Medium-sized software company

(Tele)Communications

Globo.com

One 2 One

US West

British Telecommunications

ECtel

Verizon Wireless

Hutchison Telecommunications

MTN

Large Wireless Provider

British Telecom

Bell Canada

Banks and Financial Institutions

Banco Espírito Santo

Lloyds TSB

Associates Finance

Barclaycard

Keystone Financial

Morgan Stanley

Provident Financial

HSBC Bank plc

Fireman’s Fund

Beneficial National Bank

Postbank

Credit Suisse

Teikoku Databank

Standard Life Bank

M&T Bank

PrimeCredit Limited

BankFinancial

Kookmin Bank

NASDAQ

Central Institute of Mathematics in Economy, Moscow

Hang Seng Bank

Fleet Bank

JCB Co., Ltd

AXA Financial Inc.

Mellon Bank

Chelsea Building Society

Skandia Bank

Bank of Montreal

Bank of America

The Dreyfus Corp.

Marshall & Ilsley Corporation

Stock Selection Using Recon[23]

Health Care

Pediatrix[ii]

Children’s Memorial Research Center

St George’s Hospital NHS Trust

Highmark Inc.[iii]

Pfizer, Inc.

o   « Pfizer, Inc., a major research-based, global health care company, is at the forefront of research on therapies for male erectile dysfunction (ED). Recently, the company has received FDA approval for a new treatment called Viagra, the first oral treatment for this condition….Pfizer awarded an ED research grant to a team led by Dr. Raymond C. Rosen, an internationally recognized ED expert. The research resulted in the development of…the International Index of Erectile Function (IIEF)…this questionnaire is a self-administered, 15-item measure that is cross-culturally valid and psychometrically sound…Pfizer led a worldwide market research effort…to determine the IIEF’s usability… The overall findings indicated that an abbreviated version of the IIEF would further increase acceptance by doctors and patients…Pfizer then tasked its researchers to use proven statistical methods to reduce the 15-item IIEF to five questions that would conform to the National Institutes of Health (NIH) definition of ED2…The SHIM:IIEF-5 was developed using data from four major studies of men diagnosed with ED and two control samples of men without a history of ED….The data were analysed using CART[24] and logistic regression methodologies in concert. CART was used to rate the relative importance of each of the IIEF’s 15 items in terms of their ability to discriminate between the presence and absence of ED….Dr. Cappelleri and his Pfizer colleagues found firm agreement between the CART results and the NIH definition of ED….The next step after selecting the questions was to develop a scoring system that would be easy to administer. In this case, Dr. Cappelleri wanted to determine a cut-off point at which men scoring at that point or lower on the SHIM:IIEF-5 could be classified as having ED, while men scoring higher could be classified as having normal erectile functionality….Dr. Cappelleri then used CART to develop a scoring system to determine an objective SHIM:IIEF-5 score that gave a high level of sensitivity (high probability of correctly identifying ED) and specificity (high probability of correctly identifying men without ED)…. As a tool in the identification of such an underdiagnosed condition, the SHIM:IIEF-5 is a crucial part of the Outcomes Research Tools and marketing programs for Viagra. »
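The cut-off idea in the excerpt above (scores at or below a threshold are classified as ED, higher scores as normal) can be illustrated with a small sketch. This is not CART and not Pfizer's procedure: it is a plain exhaustive search over candidate thresholds that keeps the one with the highest sensitivity plus specificity, which is roughly what a single-split tree would look for. The scores and labels are made up, and the function names are hypothetical.

```python
def sensitivity_specificity(scores, has_condition, cutoff):
    """Treat score <= cutoff as a positive (condition present) call."""
    pairs = list(zip(scores, has_condition))
    tp = sum(1 for s, y in pairs if y and s <= cutoff)       # true positives
    fn = sum(1 for s, y in pairs if y and s > cutoff)        # missed cases
    tn = sum(1 for s, y in pairs if not y and s > cutoff)    # true negatives
    fp = sum(1 for s, y in pairs if not y and s <= cutoff)   # false alarms
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, has_condition):
    """Return (cutoff, sensitivity, specificity) maximising sens + spec."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, has_condition, cutoff)
        if best is None or sens + spec > best[1] + best[2]:
            best = (cutoff, sens, spec)
    return best


# Made-up questionnaire totals (illustration only, not study data).
scores = [8, 12, 15, 19, 23, 20, 22, 24, 25, 25]
has_condition = [True] * 5 + [False] * 5

cutoff, sens, spec = best_cutoff(scores, has_condition)
print(cutoff, sens, spec)
```

On real data the trade-off between the two rates would normally be weighed against clinical priorities rather than simply summed.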

Olympic Games 2002 and State of Pennsylvania

Universidad de Santiago de Compostela

American Healthways

New York City Health Department

First Choice Medical Management

Baylor Health Care System[iv]

University of Sheffield

US Defense Department

National Research Center for Surgery, Moscow

University Clinic for Anesthesiology, Universität Ulm

San Francisco Heart Institute

Anthem, Inc.

Bridgeport Hospital

Olsten Health Services

Cardiff Public Health Laboratory (UK)

MEDai

Public Transport (Air, Rail, etc.)

Warwickshire County Council

Swissair MZ-VPP (Airport Policies and Infrastructure)

Qantas Airways

Southwest Airlines

Atraxis AG, Swissair Group

IAURIF

Metropolitan Transportation Authority of New York

Pharmaceutical Industry and Bioinformatics

Unilever

o   « UK based Unilever’s Environmental Safety Laboratory – a state-of-the-art toxicology facility – approves the safety of numerous new and proposed products each year....With Clementine’s rapid modelling environment for data mining, Unilever modelled the corrosivity of organic acids, bases and phenols, critical ingredients in many new products. Then putting Clementine’s neural network models to work, Unilever trained the models to judge corrosivity based on several descriptive attributes. Clementine’s models enabled Unilever to go beyond the limited “corrosive” or “non-corrosive” categories previously established. Since the most strongly corrosive or non-corrosive substances quickly gravitated to their respective extremes, and the substances in between were scored to reveal a gradation of corrosivity, Unilever was able to create a more complete way to test for corrosive substances in new products....With Clementine and this new process, Unilever is leading the way toward in computero research, and away from in vivo and in vitro experimentation. The end result? They saved significant time and money in product development cycles and minimised the need for animal testing. »

Children’s Memorial Hospital[28]

o   « Each year, nearly 3,000 children in the U.S. are diagnosed with brain tumors. Almost half will die within five years, making it the most fatal cancer among children. If a child does survive a brain tumor, the long-term effects can be significant, and can include neurological disabilities, retardation and psychological problems. Beyond surgery, successful treatments for pediatric brain tumors are rare. Dr. Eric Bremer, director of brain tumor research at Children’s Memorial Hospital in Chicago, is one of the leading scientists searching for a better way to treat pediatric brain tumors. One of Dr. Bremer’s main goals is to build a gene expression database for pediatric brain tumors, and to then correlate this with both past and ongoing research on effective treatments. As a result of the mapping of the human genome, researchers have gained new tools to study these genetic variations, but the work can quickly produce an overwhelming amount of data. His challenge is to make sense of the 7,000 to 30,000 data points for each brain tumor sample. To do this, Dr. Bremer uses Clementine®, a data mining workbench from SPSS Inc., which enables him to quickly analyze this voluminous amount of data in different ways, and identify patterns and relationships. As an example of Clementine’s use, Dr. Bremer combined his own data with that of a publicly available data set resulting in a total of 133 tumor samples from the six major pediatric brain tumor types. Clementine classified these tumors with greater than 95 percent accuracy. He then uses SPSS’ LexiQuest™ Mine, a text mining technology, to sift through mountains of scientific literature to extract patterns that, for example, when combined with genetic patterns identified from his brain tumor database with the help of Clementine, can be used to help him evaluate prime drug targets that would form the basis for a cancer cure – Dr. Bremer’s ultimate goal….As a result of Dr. Bremer’s and other researchers’ work, types of pediatric brain tumors can be more accurately diagnosed, and the life expectancy for children with brain tumors has grown from five months to 39 months. »

Critical Outcome Technologies

o   « The ability to predict biological activity based on molecular structure is leading researchers to breakthroughs in the most complex challenges of medicine. Using a combination of artificial intelligence tools, Dr. Wayne Danter of Critical Outcome Technologies (London, Ontario, Canada) has developed a method to predict whether specific molecular structures are effective against a disease. Currently under study is the HIV1 virus….Modeling each molecule and predicting its effectiveness using standard statistical methods is virtually impossible because of the enormous number of variables. Dr. Danter uses CART® (Classification and Regression Trees), a software package from Salford Systems to help build models that isolate the most important variables. Working with public domain, molecular HIV data, Danter trains CART and complementary systems to predict if a given molecular structure is biologically active against a disease….To satisfy Dr. Danter’s specialized modelling needs in his HIV research, he inputs the results into another Salford Systems product, MARS® (Multivariate Adaptive Regression Splines), then into a neural network program from Ward Systems Group, NeuroShell® Classifier.[29] MARS is a non-parametric regression procedure that extends Dr. Danter’s work by improving the accuracy of predictions. NeuroShell® Classifier then categorizes a molecule’s activity based on patterns derived from CART and MARS….The ability to analyze molecular structure and predict effectiveness helps Dr. Danter look for existing drugs to battle diseases like HIV, as well as to develop potential new medications….The results to date are impressive. In a recent study conducted by Dr. Danter, he analysed 311 drugs with known in vitro activity against the HIV1 virus. The system correctly classified more than 96% of the molecules….During the past several months, Dr. Danter has also used CART in developing models to study central nervous system receptors, anti-arthritic medications, and antibiotics, among others. »

AnVil and HealthSouth

o   « …AnVil, a small bioinformatics firm in Burlington, is hoping to unveil potentially lucrative secrets hidden inside the reams of patient data collected by one of the nation’s largest networks of health-care providers. The company is set to announce today [29 July 2002] that it has reached a deal with Birmingham, Ala.-based HealthSouth to apply its experience in analyzing complex databases for drug researchers to HealthSouth's mammoth repository of patient records. The companies are betting on the analysis, which will focus on drugs to treat stroke and orthopedic patients, to produce insights that will not only help HealthSouth improve medical care but also aid drug makers in their efforts to discover and develop new drugs…. The hope is that such data could reveal how drugs work in real-world settings, shedding light on questions that are not answered in tightly controlled clinical trials. Researchers could test hypotheses in large numbers of patients over extended periods of time. They could spot subsets of patients who don't respond to existing medications, revealing a promising avenue of research for new drugs or for improved versions of existing drugs…. HealthSouth runs a nationwide network of outpatient surgery centers, diagnostic imaging clinics, and rehabilitative services, treating between 1 million and 2 million patients a year. Given that amount of data, the potential exists to answer questions about the safety and effectiveness of drugs in a way that cannot be replicated in clinical trials. »

UCB Pharma

o   « The UCB Group is a major Belgian corporation with core business centered on chemicals, films, and pharmaceuticals...its UCB Pharma division researches, produces, and markets medical products covering the central nervous system, the cardiovascular system, and immuno-allergology. The pharmaceutical market has changed dramatically over recent decades....The presence of fierce competition on all sides has prompted UCB Pharma to examine the market very carefully before it introduces any new drug....“Omega, our data mining tool, was developed in SAS software by SPS, a Quality Partner of SAS Institute.”...Omega examines, interprets, and reports on trends in purchasing behavior....[It] facilitates strategic decisions on our existing product range, as well as assisting with new product development and the way we do business generally. »

Human Genome Project

o   « The HGP has been trying for more than a decade to decipher the complete sequence of the human genetic code.…the hope is that the research eventually will reveal what genetic patterns cause diseases, with the possibility of cures to follow.…Bruce Weir, Ph.D., a DNA analysis expert at North Carolina State University in Raleigh, used SAS Enterprise Miner to analyze SNP [single nucleotide polymorphism] data from patients with Alzheimer’s disease. His team, which includes geneticists and statisticians from both the public and private sectors, found genetic patterns associated with the disease. »

NIEHS – Predictive Toxicology Challenge

o   « A second round of the challenge (PTE2) [1999] consists of predicting the outcome for 30 chemical bioassays for carcinogenesis...The data provided includes both structural [atoms and bonds making up the molecules] and non-structural [short term toxicity assays] information. ...Provided a sufficiently expressive representation language, good solutions may be obtained from the structural information only. ...Experiments...with STEPS support this claim....The rules obtained by STEPS using structural information only, are comparable in terms of accuracy to those obtained using both structural and non-structural information by all PTE2 participants. In addition, this approach may produce insights into the underlying chemistry of carcinogenicity, one of the principal aims of the PTE2 challenge. Furthermore...carcinogenic activity for a new chemical can be predicted without the need to obtain the non-structural information from laboratory bioassays. Hence, the results may be expected in a more economical and timely fashion, while also reducing reliance on the use of laboratory animals. »

o   The Predictive Toxicology Challenge was devised to provide Machine Learning programs with the opportunity to participate in an enterprise of immense humanitarian and scientific value. Details of the Predictive Toxicology Challenge 2000-2001 are at http://www.informatik.uni-freiburg.de/~ml/ptc/.

Manufacturing and Production

Hewlett Packard

DaimlerChrysler

Toyota Motor Corp.

Herlitz AG

R.R. Donnelley & Sons

Halliburton Energy Services

Compaq

Bayernwerke AG

John Deere Waterloo Works

Tauernkraftwerke AG

Southeastern US Electric Utility

Northeastern US Electric Utility

Miscellaneous

EDF Energy[v]

o   « EDF Energy has reduced customer bad debt by 60 per cent after using data mining software to analyse its customers’ financial performance. It has also used the technology to identify 1.7 million potential clients, by evaluating its criteria for new subscribers against data on UK householders. Clifford Budge, EDF Energy customer insight manager, says the Clementine data mining software from specialist vendor SPSS has become an essential tool for examining data. ‘There is definitely demand to have more insight into the business,’ said Budge. ‘We use Clementine to look at customer retention, product information and customer acquisition.’ The customer insight team was most recently asked to re-examine the rules governing the system, which scours individual customer records for common occurrences of bad debt. ‘With any piece of work like this, we are looking at a problem spread over five million customer records,’ said Budge. ‘When we were asked to re-evaluate these rules, the model we came up with was 60 per cent better than it was before.’ By predicting which geographical areas are most likely to have a high concentration of unpaid bills, the new rules are helping EDF to cut 60 per cent of the bad debt that would otherwise be wiped from its balance sheet....Similar work to improve the targeting capabilities of the sales and marketing department has increased the potential sales pool by more than 25 per cent….‘The work we do is equivalent to creating 10 models a year at about £14,000 per model, where other companies can pay as much as hundreds of thousands of pounds for just one,’ said Budge. »

Anderson Analytics[vi]

o   « It all started when Anderson Analytics Managing Partner and Founder, Tom Anderson, was looking for a means to compare different diamonds to each other, as well as different diamond retailers to each other, to see which diamonds were a better value for the price. “I thought that diamond retailers must have a formula for valuating diamonds and setting the correct selling and buying price based on what they call the 4 C’s (cut, color, clarity and carat),” said Anderson. “But diamond merchants never reveal their methods for determining value, and to my surprise there was no such formula anywhere on the Internet.” So Tom turned to Anderson Analytics to do what it does best -- take large amounts of data, such as general diamond pricing and 4 ‘C’ information for thousands of diamonds, and analyze it. Using advanced statistical software Anderson Analytics literally ‘mined’ all the data available on over 44,000 diamonds! The result of the multivariate analysis was a methodology for predicting the price and market value of diamonds with a very high degree of certainty. »

Ohio Department of Natural Resources

o   « In addition to resource management, ODNR is responsible for promoting leisure services and recreational opportunities for the public….“We began to realize that we needed to think about our business in a new way,” said Mike Costello, program administrator of the Ohio Division of Wildlife. Licensing fees support the cost of maintenance operations….“With hunting and fishing licenses on the decline, we knew we had to change from a product- to a customer-focused business,” explains Costello….A shift in business philosophy was leading ODNR toward a customer relationship management approach. Costello’s team began to explore the benefits of data mining. With the recent automation of its licensing processes, ODNR has gained the ability to examine enormous amounts of previously unavailable data. From the first and second years of data collection, ODNR identified a 50 percent churn rate. Of all the people who bought fishing or hunting licenses the first year, only half came back. The agency lost 350,000 customers but gained an amazing 325,000 new ones. ODNR’s CRM strategy quickly became focused on maintaining the loyalty of current customers. Predictive modelling helped assess which customers were more likely to lapse, and the agency created marketing campaigns, including postcards and ads, to strengthen customer commitment and bring them ‘back to the woods.’ As a result of its CRM strategies, ODNR generated more than half a million dollars in direct licensing revenues. “Convincing management to buy the technology soon became a no-brainer,” said Costello. “Our return on investment was clearly evident and quickly attainable.” »
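The churn figures quoted above can be turned into a small worked calculation. The sketch below assumes that the 350,000 lost customers correspond exactly to the stated 50 percent churn, which the excerpt implies but does not state outright; the derived totals are therefore illustrative, not reported figures.

```python
# Figures taken from the excerpt.
lost = 350_000        # first-year customers who did not return
gained = 325_000      # new customers in the second year
churn_rate = 0.50     # share of first-year customers lost

# Derived quantities (assumption: `lost` is exactly `churn_rate` of year one).
year1_customers = round(lost / churn_rate)
retained = year1_customers - lost
year2_customers = retained + gained
net_change = year2_customers - year1_customers

print(year1_customers, retained, year2_customers, net_change)
```

Under that assumption the customer base shrinks slightly even though acquisition is strong, which is consistent with the excerpt's emphasis on retaining current customers.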

State of Texas[vii]

o   « The state of Texas has harnessed SPSS predictive analytics software to enhance its "Advanced Database System (ADS)" tax compliance function. Predictive models are a key component of ADS and played an important role in recovering over $400 million in unpaid taxes since its inception in 1998. The Audit Division of the Texas Comptroller of Public Accounts (CPA) uses SPSS and Elite Analytics, LLC, a data mining consulting service, to maximize taxpayer compliance and maintain revenue streams. State tax agencies are charged with reducing the "tax gap" between the tax owed and the amount collected. Audits are critical to enforcing tax laws and helping tax agencies achieve revenue objectives. With SPSS' data mining workbench, Clementine, Texas CPA and Elite Analytics developed an audit selection strategy that more accurately predicts which audit leads are more likely to yield greater tax adjustments. "State tax agencies have limited staff and resources, and predictive analytics enable the agencies to more efficiently and effectively identify delinquent taxpayers and make effective resource allocation decisions," said Daniele Micci-Barreca, PhD, a principal at Elite Analytics. "With SPSS predictive analytics, the Texas CPA was able to refine its traditional audit selection strategies to produce more accurate results." »

Waco Police Services

o   « In the heart of Texas, analysts with Waco Police Services use SAS to predict criminal activity and crime patterns. Based primarily on crime reports and histories, the Waco crime model is used to create weekly reports that map crime by area and predict the city’s top 10 “hot spots.” “The crime model complements our ability to predict and track crime,” says Sgt. Dennis Kidwell, bureau chief for the crime analysis section, “because we’re able to analyze and track a large number of observations simultaneously.” Those observations include the time, date and location of all crimes, broken down into 11 major classifications, including residential burglary, vehicle burglary, criminal mischief, homicide and assault….Using SAS for analysis and software from MapInfo for GIS mapping, Waco Police Services has become more effective in preventing crime. Says Kidwell, “The crime model has been highly accurate in predicting the occurrence of criminal activity, which allows us to be more responsive in assigning the manpower to counter criminal activity.” »

Center Parcs

o   « Center Parcs is Europe’s market leader in the field of short vacations….The most important challenge for Center Parcs is how to reach maximum occupancy for the vacation parks. This is also called yield management. The average occupancy of the approximately nine thousand bungalows in Europe is about 90%, which is unequalled in the leisure industry. Center Parcs believes that there is still room to increase the occupancy rate and the profitability. Center Parcs’ primary marketing challenge was to decrease the number of mail-packs and to optimally target the customers most likely to respond, thus realizing a higher occupancy rate with lower costs….Center Parcs has implemented DataDistilleries’ analytical Customer Relationship Management (aCRM) to optimise the yield. The advanced Customer Behavior Modeling technology in DD Series gives Center Parcs accurate and up-to-date insight into their customer base. Center Parcs has divided its customer base into four segments. DataDistilleries’ role in achieving maximum occupancy is to optimise Center Parcs’ direct-mail and brochure channel towards the four segments. DD Series targets customers that are likely to book a vacation for arrival within a certain arrival window, based on their past behavior, and to predict response rates for these marketing campaigns. »

Cabrillo College

o   « In order to better uphold its mission, Cabrillo wanted to determine which students were most likely to drop out in order to improve student retention by offering a more relevant selection of classes scheduled at convenient times. With SPSS Inc.’s Clementine® as its data mining solution, Cabrillo College is gaining a deep understanding of student enrolment patterns and tendencies….Clementine allows Cabrillo to explore and evaluate a range of variables and predict each student’s probability of completing a class, transferring out of a class or leaving the school altogether. “By predicting which students may need some attention or reinforcement of their education, we can provide each student on an individual basis with relevant information or discuss how we might be able to help them overcome the obstacles negating their staying,” said Dr. Luan. Cabrillo used a combination of both segmentation and clustering techniques to establish typologies and to understand grouping dynamics as well as predictive modeling….“We can adjust our curriculum to add programs or subtract classes that are disadvantageous to our students’ learning,” said Dr. Luan. “Moreover, we can determine what classes should be offered at what times. For example, we found students with particular profiles were most likely to take night classes. Rather than print and distribute hundreds of class catalogues, with the information we obtained using Clementine, we were able to adjust class schedules and match them to students’ preferences,” added Dr. Luan. “This reduced our marketing budget while increasing our effectiveness.” »

West Midlands Police Department

o   « While many cases lacking evidence were filed away, the department is now re-examining them, and doing it more quickly than ever before. In Clementine, Adderley uses two Kohonen networks to cluster similar physical descriptions and MOs [modus operandi]. He then combines clusters to see whether groups of similar physical descriptions coincide with groups of similar MOs. If he finds a good match, and perpetrators are known for one or more of the offenses, it is possible the unsolved cases were committed by the same individuals. Adderley’s analytical team further investigates the clusters, using statistical methods to verify the similarities’ importance. If clusters indicate the same criminal might be at work, the department is likely to re-open and investigate the other crimes. Or, if the criminal is unknown but a large cluster indicates the same offender, the leads from these cases can be combined – and the case reprioritized. Adderley is also investigating the behavior of prolific repeat offenders, with the goal of identifying crimes that seem to fit their behavioural pattern. »
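The "combine clusters" step described above can be sketched as a cross-tabulation of two cluster assignments. The sketch below does not train Kohonen networks: it assumes each case already carries a description-cluster label and an MO-cluster label (from whatever clustering produced them), and it simply groups cases that share both labels. The case records, offender names and the `linked_cases` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical records:
# (case_id, description_cluster, mo_cluster, known_offender or None)
cases = [
    ("c1", 0, 2, "offender_A"),
    ("c2", 0, 2, None),
    ("c3", 0, 2, None),
    ("c4", 1, 2, None),
    ("c5", 1, 0, "offender_B"),
    ("c6", 2, 1, None),
]


def linked_cases(cases, min_size=2):
    """Group cases sharing both cluster labels and report unsolved ones."""
    groups = defaultdict(list)
    for case_id, desc_cluster, mo_cluster, offender in cases:
        groups[(desc_cluster, mo_cluster)].append((case_id, offender))

    leads = {}
    for key, members in groups.items():
        if len(members) < min_size:
            continue  # a single case gives no link to follow up
        unsolved = [cid for cid, off in members if off is None]
        offenders = sorted({off for _, off in members if off is not None})
        if unsolved:
            leads[key] = {"unsolved": unsolved, "known_offenders": offenders}
    return leads


leads = linked_cases(cases)
print(leads)
```

A group with known offenders suggests suspects for its unsolved cases; a group with none corresponds to the excerpt's case of an unknown offender whose leads can be pooled. As the excerpt notes, such matches would still need statistical and investigative verification.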

State Revenue Agencies (MA, CA, TX)

o   « State revenue agencies across the nation are hunting for tax evaders with new high-tech tools: computer programs that mine an increasing number of databases for clues on the finances of people and businesses….The tax agencies' "data warehouses" can stockpile data from state and federal agencies and, in some cases, private sources. And they are using new tools to analyze the data, including "data-mining" software that can scrutinize mountains of information to find patterns or establish relationships….The Massachusetts system mixes databases from the IRS and Customs, along with state motor vehicle, incorporation and professional licensing records. The state tax agency says it uses other databases, but won't name them….The Massachusetts agency has brought in $47 million thanks to the system since its June 2002 inception, LeBovidge [Massachusetts Revenue Commissioner] said. California officials estimate that for the four years ending in fiscal 2003, their new system brought in $260.6 million -- while Texas says its data-mining tech has harvested more than $362 million since the late 1990s. As an example of a successful case, Massachusetts officials said IRS records led them to a man who worked in the state but had not bothered to file state income taxes. He had to cough up $33,000. »

Cox Communications

o   « Cox Communications, a Fortune 500 company, is a multiservice communications company serving approximately 6.4 million customers nationwide [USA]. Prior to the September 2002 installation of KXEN Analytic Framework, Cox was using tools such as Oracle’s Discoverer and SQL Navigator. These tools are not suited for building predictive models to support customer retention, acquisition and lifetime valuation which are fundamental company objectives….The product has reduced elapsed time for model creation, start to finish, by approximately 80 percent and reduced model building time from three weeks to one. By using this tool, the churn rate has reduced by a percentage point, and the company has realized the return on investment in the two months it has been in service….Predictive model generation is Cox’s primary use, recognizing the output varies depending on the data input and designated objective. Cox uses these models to identify a customer’s propensity to purchase products and services, desire to terminate a contract or whether a customer is a potential credit risk. »

Federal Bureau of Investigation (FBI)

o   « The bureau’s reorganization followed bruising revelations from field agents that their memos about suspicious enrolments of Middle Eastern men at flight-training schools and requests to investigate Zacarias Moussaoui, the alleged 20th hijacker, were overlooked. Critics can't hold Mueller accountable for those intelligence failures-he became FBI director on Sept. 3. But he's the point man now. To combat terrorism, the director says, the FBI will expand its use of data mining and financial-record and communication-analysis tools. He envisions the day when artificial-intelligence systems match data points to identify possible terrorist activity. In its March budget request, the bureau sought close to $70 million to consolidate its investigative data warehouses, develop a secure network to share data with other intelligence and law-enforcement agencies, and implement new analytical and visualization software. »

o   « Attorney General John Ashcroft recently announced [July 2002] that the Justice Department was loosening its guidelines to allow FBI agents to, among other things, dig into the vast commercial treasure house of data on consumers’ buying habits, preferences and traits….While the FBI hasn't publicly specified how agents would use the data, experts say the bureau likely would employ a sophisticated technique called data mining to spot relationships in enormous amounts of data no human could possibly detect….Because potential terrorists purchase products, rent apartments and use credit cards, experts said the FBI hopes that analyzing the data would reveal patterns that could help prevent future attacks. »

AC Milan Football Club

o   « For years team doctors and coaches have looked for crystal balls that would show ACL injuries in the making, soothsayers who could hold forth on hamstrings that might blow, genies that could warn of a rotator cuff about to explode in the new hot prospect’s shoulder. AC Milan may have found such an oracle: a computer smart enough to recognize the signs of an athlete coming apart. The renowned Italian soccer club – which has four players competing in this year’s [2002] World Cup – has teamed up with Computer Associates International [using CleverPath Predictive Analysis Server] to test the feasibility of using neural networks, a form of artificial intelligence, to predict injuries and optimise conditioning for each athlete, perhaps even to help select which players to sign….An 18-month test of the system gave promising results….The pilot program showed that injury prediction was a possibility….Ultimately, the neural network correctly predicted injuries 84 percent of the time….The new system could be of great help to coaches trying to predict which players are at risk, said Dr. Arthur Bartolozzi, chief of sports medicine at the Pennsylvania Hospital in Philadelphia. “A lot of factors can contribute to injury: conditioning, fatigue, equipment, weather conditions, surface conditions,” Bartolozzi said. If a system could be developed to take all of these things into account, it might allow team doctors to pull players before they get injured, he added….The program might also have saved AC Milan a lot of money, had it been around when the team was bidding for Fernando Redondo. “We spent an enormous amount of money buying a player who got hurt after three minutes on the treadmill,” Meersseman [head of AC Milan’s medical team] said. “Maybe we would have thought twice about buying him. Maybe the price would have been different.” And if the team had gone through with the deal anyway, the neural network might have told them how to avoid the big injury. »

Polyphonic HMI

o   « The magic ingredient set to revolutionise the pop industry is, simply, a piece of software that can "predict" the chance of a track being a hit or a miss. This computerised equivalent of the television programme Juke Box Jury is known as Hit Song Science (HSS). It has been developed by a Spanish company, Polyphonic HMI…It isolates and separates 20 aspects of song construction including melody, harmony, chord progression, beat, tempo and pitch and identifies and maps recurrent patterns in a song, before matching it against a database containing 30 years' worth of Billboard hit singles - 3.5m tunes in all. The program then accords the song a score, which registers, in effect, the likelihood of it being a chart success….HSS confidently predicted Norah Jones's meteoric success (tipping no less than 10 songs on her debut album Come Away with Me) well in advance of her chart-topping appearances and in the face of an industry unconvinced she would have any commercial impact….Of course, the appeal to record labels is obvious, as it offers a rational underpinning for commercial decisions. With the recordings themselves being the least expensive element of launching an act, the marketing resource being the greatest, and most companies being run by bean counters, we can be certain that this kind of analytical software won't go away….It's all in the clusters, you see. Hit songs, typically, fall into one of a number of groupings - there are around 50 in the US and 60 in the UK where, traditionally, tastes have been more diverse. Belonging to the same cluster does not mean songs sound the same, though, more that they are mathematically similar. And the analysis has thrown up some very unlikely musical bedfellows: Some U2 songs are in the same cluster as Beethoven, while spandex ultra rocker Van Halen sits right alongside MOR piano babe Vanessa Carlton.
It is for this reason that Polyphonic are confident their software won't homogenise our already stratified and similar sounding charts. They are already working with one radio station to expand their playlist without losing audience share by selecting songs with the correct mathematical rhythms. In a world where drearily repetitive playlists have become the norm this could be the answer to an oft-uttered prayer. This strategic approach may seal the software's place in history. McCready [CEO of Polyphonic] explains how they are helping a very well known "smooth male jazz crooner" who is finding it difficult to break into the US market. The label's marketing department are promoting him to the Norah Jones audience. But Polyphonic's analysis has shown that the crooner's song patterns are more similar to Linkin Park, Aerosmith and Jay-Z. This kind of interpretation offers an unprecedented rationale for appealing to a seemingly unlikely demographic….Ric Wake, producer of international acts such as Jennifer Lopez and Anastacia, has drawn the technology into the heart of the creative process. When you're only a few "mathematical rhythms" away from a great hit this could save hours, days, even weeks of studio grind. At the end of each day relevant tracks are downloaded and feedback is presented the next morning. Supporters of the software argue that it does not detract from the artistic process; it is still the humans who must find the solutions to a low-scoring song. »

Hungarian Museum of War History

o   « In 2001, the Institute of Hungarian History of War purchased the records of those who died in the camps from the Hungarian government and began working on them. The researchers found that because the Soviet record-keepers had used Cyrillic letters rather than Latin, and because much of the most vital information (such as Hungarian names and cities) was missing, traditional technological solutions could not be used to match the Soviet records to the Hungarian military records….By chance, Zoltan Benedek heard about the data quality problems and had an idea. Benedek, a consultant in the data mining division of KFKI ISYS Information Systems Ltd., contacted János Bús, director of central archives at the Hungarian Museum of War History, with an offer to help….KFKI ISYS staff members recognized that the problem facing the institute was similar to one they themselves had faced while consolidating databases and building SAS data warehouses for clients. In these situations, various departments and systems referred to specific records by several different names. In addition, numerous human errors and missing values made it a challenge to sort records for analysis. To formulate a solution, they relied on the power of SAS software and a data mining technique called “textual link analysis.”….The project did not take long for the data mining team at KFKI ISYS to complete, and they were able to publish the search results on the Internet. The cleaned data sets brought to light information on approximately 27,000 people who had disappeared during the war. The first person identified had lived five years longer than stated in official Hungarian government records. 
He had been imprisoned for five years and had worked in a Soviet mine several thousand kilometres away from his home after his family believed him to be dead….The KFKI ISYS team is receiving e-mails daily from grateful Hungarians all over the world who are finally able to discover information about their relatives’ remains and exact dates of death. »
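
The matching problem described above (names recorded in Cyrillic that must be linked to Latin-script Hungarian records despite spelling variation) can be illustrated with a small sketch. This is not the “textual link analysis” carried out in SAS, whose details the excerpt does not give; it swaps in a much simpler approach, transliteration followed by approximate string matching with Python's standard difflib module. The transliteration table is a simplified assumption, and the names and the 0.6 threshold are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

# Simplified Cyrillic-to-Latin table (an assumption; a real project needs a fuller mapping).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh", "з": "z",
    "и": "i", "й": "j", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "ch",
    "ш": "sh", "щ": "shch", "ы": "y", "э": "e", "ю": "yu", "я": "ya", "ь": "", "ъ": "",
}

def normalise(text):
    """Lower-case, transliterate Cyrillic letters, and strip diacritics."""
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
    decomposed = unicodedata.normalize("NFKD", latin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def best_match(cyrillic_name, candidates, threshold=0.6):
    """Return (candidate, similarity) for the closest candidate, or (None, similarity)
    when even the closest one falls below the threshold."""
    key = normalise(cyrillic_name)
    scored = [(SequenceMatcher(None, key, normalise(c)).ratio(), c) for c in candidates]
    score, name = max(scored)
    return (name, score) if score >= threshold else (None, score)

# Hypothetical records, for illustration only.
military_records = ["Nagy János", "Kovács István", "Kovács Imre"]
match, similarity = best_match("Ковач Иштван", military_records)
print(match, round(similarity, 2))
```

A similarity threshold trades missed links against false links; in a setting like this one, candidate matches near the threshold would presumably be reviewed by hand.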

FineArtExplorer.com

o   « Discovering art you like just got a lot easier. And for artists, being discovered is now a lot easier as well. Z Solutions, an Atlanta-based developer of adaptive learning technology, has announced the launch of its FineArtExplorer website (http://www.FineArtExplorer.com) powered by the MyMuze™ Discovery Engine….The FineArtExplorer site is of value to both artists and art buyers. Artists don’t have to describe or characterize their art in words to be found and enjoyed online. Art buyers are directed to works that match their taste without extensive searching. FineArtExplorer.com is like hav[ing] a shopping adviser who learns your taste then displays only items of interest. Users are walked through a gallery of fine art online. As each image is rated by the viewer, a profile of the user’s taste is created. After surveying two dozen pieces, the discovery engine goes to work and creates a personal gallery of art that it recommends. The user can then click through to the host site of the artist and learn more, see more or purchase art. The personal gallery can be saved and is updated as new art matching the profile is discovered. »

DynMeridian

o   « DynMeridian, a DynCorp company, is a professional services firm that provides comprehensive analytical and technical support in arms control, national security affairs and related high technology to U.S. government and industry clients....The project involved exploring recent technological advancements in data mining and XML to determine the feasibility of leveraging these techniques in support of early identification of and/or prediction of personnel retention trends, as well as the efficient transfer of the information to the decision makers. PolyAnalyst was used to data mine the personnel database to generate insightful information and entity-level propensity-to-lose results....In this project, scoring test data with the developed classification model produced results where the top 10 percent of the cases of the predicted most loyal personnel contained over 60 percent of all people who indeed served for a long time. This represents a greater than six-fold lift and allows the client to better target expensive personnel loyalty programs to the most qualified candidates. »
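
The lift figure quoted above follows from simple arithmetic: if the top 10 percent of scored cases contain 60 percent of all long-serving personnel, the model finds them six times more often than random selection would. The sketch below computes lift at a chosen cut-off from model scores and true outcomes. The function name and the toy data are hypothetical and chosen only to reproduce the six-fold case; they are not taken from the DynMeridian project.

```python
def lift_at_top(scores, labels, fraction=0.10):
    """Lift of the top `fraction` of cases when ranked by descending score.

    Lift = (share of positives in the top slice) / (share of positives overall).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    hits_in_top = sum(label for _, label in ranked[:cutoff])
    total_hits = sum(labels)
    # Computed with integer products to avoid avoidable rounding error.
    return (hits_in_top * len(ranked)) / (cutoff * total_hits)

# Hypothetical toy data: 100 people, 10 of whom served a long time (label 1).
# Six of those ten sit among the ten highest-scored cases.
labels = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86
scores = [1.0 - i / 100 for i in range(100)]  # already in descending order

print(lift_at_top(scores, labels))
```

With these numbers the top decile captures 6 of the 10 positives, so the function reports a lift of 6; the excerpt's "over 60 percent" corresponds to a lift slightly above that.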

Northwood Inc.

o   « Northwood Inc. is a major integrated forest products company located in Prince George, British Columbia. Northwood operates five lumber mills, a pulp mill, a plywood mill and a treated wood plant...The company also operates a forest center and nursery, which produces approximately eight million seedlings a year... For a young seedling, vegetation can be deadly. In order to protect its investment, Northwood had to apply for herbicide permits on a case-by-case basis to control vegetation and protect newly planted trees. In 1997, the government of British Columbia changed the law so timber companies could submit a pre-planned, integrated pest management plan for approval, eliminating the need for individual permits. Before developing the plan, Northwood had to determine under which conditions a tree was threatened by vegetation and when it was necessary to apply herbicide. In order to accurately define standards and thresholds for vegetation management, Northwood had to identify patterns among the numerous variables in a database of 15,000 trees, containing data collected over nine years....Northwood used Cognos Scenario[33] to identify patterns of tree mortality and growth and rank critical factors in a tree’s ability to reach maturity. Based on the information mined from the data through Scenario, Northwood was able to outline at what thresholds trees were under-performing. With this knowledge, Northwood drew up an integrated pest management plan that was approved by the government of British Columbia. Northwood is now better able to manage vegetation, while using less herbicide and seeing more trees reach full growth. »

East Ayrshire Council

o   « Due to the introduction of the 4 C’s of the Best Value Regime, challenge, consult, compare and compete, all Councils must demonstrate how they consult citizens, service users and others for their views....Since 1999, East Ayrshire Council have developed and distributed a number of different questionnaires in order to support their Best Value initiatives....Since every Department within the Council must consult with citizens, the reliance on a manual system started to cause problems. One of the major problems was that by the time the information was analysed, it was almost out of date. Analysts at East Ayrshire Council held the view that they needed a more modern approach to replace the existing pen, paper and graphs in Excel methodology....By adopting an analytical solution from SPSS, East Ayrshire Council were more able to deliver better service to their citizens....By using SPSS Base and Data Entry software to code, input and analyse questionnaires, East Ayrshire Council have saved a huge amount of staff time whilst also improving the accuracy and quality of the results. »

MetLife Insurance Company

o   « While there appears to be no shortage of applications that are candidates for conventional auditing tools such as CAATs, there have not been many such examples for the ‘exotic tools’ such as data mining and data visualization software. This situation should change dramatically as auditing units start to achieve some success in using these tools. Human Resource Applications represent one of the virgin areas for data mining....WizRule will indicate anomalies based upon deviations from salary averages for specific levels. While these may or may not point to a problem area, auditors should satisfy any concerns they may have related to the deviations....If you include the distance from the home to the office or time to travel this distance as a variable, WizRule will determine any variance from the norm. While on the surface this may appear to be an unimportant finding, it can point out some very serious areas that should be of concern to the auditor. Employees who have to travel great distances to get to work are good candidates for excessive lateness, absences and eventually leaving for a position closer to their residence....Employees who do not participate in 401K matching programs, medical benefit programs, etc., may be trying to hide something. While this is not always the case, it is an area worth investigating. (E.g. the employee is receiving two payroll checks!)....As an example, WizRule will indicate that 67 of 68 employees in the Detroit office belong to the Central Claims group. One employee belongs to the Southern Claims group. (Again a possible ghost employee)....These are just a few examples that can be employed using Data Mining Tools in most Corporate Human Resource Systems. »
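
The Detroit example above is an instance of a general auditing pattern: find "if X then Y" rules that hold for almost every record, then list the few records that break them. The sketch below illustrates that idea only; it is not WizRule's actual algorithm, which the excerpt does not describe. The function name, the field names, the 95 percent confidence threshold and the Atlanta records are hypothetical.

```python
from collections import Counter, defaultdict

def rule_exceptions(records, if_field, then_field, min_confidence=0.95):
    """Find near-universal 'if if_field = x then then_field = y' rules
    and return them together with the records that violate them."""
    groups = defaultdict(list)
    for record in records:
        groups[record[if_field]].append(record)

    findings = []
    for if_value, group in groups.items():
        then_value, count = Counter(r[then_field] for r in group).most_common(1)[0]
        confidence = count / len(group)
        # Rules that always hold have no exceptions, so they are of no audit interest.
        if min_confidence <= confidence < 1.0:
            exceptions = [r for r in group if r[then_field] != then_value]
            findings.append((if_value, then_value, count, len(group), exceptions))
    return findings

# Hypothetical data mirroring the excerpt: 67 of 68 Detroit employees in Central Claims.
employees = [{"id": i, "office": "Detroit", "group": "Central Claims"} for i in range(67)]
employees.append({"id": 67, "office": "Detroit", "group": "Southern Claims"})
employees += [{"id": 100 + i, "office": "Atlanta", "group": "Southern Claims"} for i in range(20)]

for office, group, count, total, exceptions in rule_exceptions(employees, "office", "group"):
    ids = [e["id"] for e in exceptions]
    print(f"{count} of {total} employees in {office} belong to {group}; exceptions: {ids}")
```

As the excerpt stresses, an exception is a lead for the auditor to follow up, not proof of a problem.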

US Department of Agriculture

o   « Commercial lenders use the technology to predict loan-default or poor-repayment behaviors at the time they decide to make a loan. The USDA’s principal interest, on the other hand, lies in predicting problems for loans already in place. Isolating problem loans lets the USDA devote more attention and assistance to such borrowers, thereby reducing the likelihood that their loans will become problems....The USDA retained my company, Exclusive Ore[34], to provide it with data mining training. As part of that training, my colleagues and I performed a preliminary study with...a small sample of current mortgages...roughly 2 percent of the USDA’s current database....At the USDA, our goal was to build a model that would predict the loan classification based on information about the loan, borrower, and property....For the USDA, the initial models revealed that the important factors to loan outcome included loan type, such as regular or construction; type of security, such as first mortgage or junior mortgage; marital status; and monthly payment size....The USDA’s preliminary data mining study sought to demonstrate the technology’s potential as a predictor and learning tool...the department plans to expand the limited number of attributes available...to include payment histories....Eventually, the USDA will use these models to identify loans for added attention and support, with the goal of reducing late payments and defaults. »

o   « Agriculture Department officials were alerted to more than $250 million in fraudulent crop insurance claims in the past three years after they began using data mining. Based on that initial success, officials in the USDA's Risk Management Agency are exploring additional uses of data mining for improving Federal Crop Insurance Corp. policies and procedures….Since 2000, when lawmakers allocated $20 million for a five-year study to reduce waste, fraud and abuse in the insurance program, USDA officials have documented a better than 20 to 1 return on the $13 million they have invested in data mining....In the past few years, agency officials paid about $3 billion a year for legitimate crop loss claims, Westmoreland [director of strategic data acquisition and analysis at the agency] said. But data mining has shown that about 1,800 out of 1.5 million people enrolled in the crop insurance program attempt to file fraudulent claims. Before officials discovered data mining, the USDA typically lost about $7 million a year from fraudulent claims. Officials tried to recover the money, but litigation is difficult and time-consuming, Westmoreland said. "We would rather people make sure they file a valid claim and get paid for a valid claim than try to rectify claims that may have been invalid," he said. "It just works out better for everyone involved." » 

Compaq Computer Corp.

o   « Sales and marketing executives at Compaq Computer Corp. count on text mining tools to analyze company descriptions in their prospect database. The results help executives target customers for new sales and marketing campaigns. »

University of Louisville Medical Center

o   « A new text mining project at the University of Louisville Medical Center will let doctors make better use of medical databases such as Medline, PsychInfo and Toxline for evidence-based medicine. Search results of these medical databases can often yield 2,000 matches, but advanced modelling with Enterprise Miner can reduce the results to 100 highly relevant documents and sort those 100 documents into smaller subgroups or categories. »

 



[1] An earlier version of this document appeared as Data/Text Mining Case Studies Repository, An ELCA Informatique SA White Paper, 2004.

[2] Clementine is SPSS Inc.’s Data Mining tool. See http://www.spss.com for details.

[3] See http://www.datadistilleries.com for details of DataDistilleries’ range of analytical CRM products.

[4] See http://www.ca.com for details of Computer Associates International’s analytical products and services.

[5] DataScope is Cygron Pte. Ltd’s Data Mining tool. See http://www.cygron.com for details.

[6] WizRule is WizSoft, Inc.’s data auditing and cleansing application, specialised in anomaly/deviation detection. See http://www.wizsoft.com for details.

[7] Intelligent Miner is IBM’s Data Mining tool. See http://www.ibm.com for details.

[8] ModelMAX is ASA’s Data Mining tool. See http://www.asacorp.com/products for details.

[9] DigiMine Services is a fully hosted data warehousing and data mining solution which provides advanced analytics to e-businesses. See http://www.digimine.com for details.

[10] Enterprise Miner is SAS Institute’s Data Mining tool. See http://www.sas.com for details.

[11] Quadstone System is Quadstone, Inc.’s Data Mining tool. See http://www.quadstone.com for details.

[12] KXEN Analytic Framework is KXEN’s suite of Data Mining tools. See http://www.kxen.com for details.

[13] WizWhy is WizSoft, Inc.’s Data Mining tool. See http://www.wizsoft.com for details.

[14] See http://www.businessobjects.com/applications/aa_products.htm for details of Business Objects’ analytical products and services.

[15] KnowledgeSEEKER is Angoss Software Corporation’s original Data Mining package. See http://www.angoss.com for details.

[16] See http://www.data-miners.com for details.

[17] KnowledgeSTUDIO is Angoss Software Corporation’s advanced Data Mining tool. See http://www.angoss.com for details.

[18] See http://www.fairisaac.com for details of Fair, Isaac’s analytics offerings.

[19] ScorXPRESS is ASA’s customised Data Mining solution for fraud detection. See http://www.asacorp.com/products for details.

[20] See http://www.teradata.com for details of Teradata’s DWH and DM products and services.

[21] PolyAnalyst is Megaputer Intelligence, Inc.’s Data Mining tool. See http://www.megaputer.com for details.

[22] BusinessMiner remains part of Cognos’ BusinessObjects package. It is however being extended by new products and services.

[23] Recon is a trademark and service mark of Lockheed Martin Missiles and Space Company, Inc. Further details are in relevant papers by G.H. John (see http://robotics.stanford.edu/~gjohn).

[24] CART is Salford Systems’ Data Mining tool. See http://www.salford-systems.com for details.

[25] See http://www.polyvista.com for details of PolyVista’s suite of analytics software.

[26] See http://www.mathworks.com for details of MATLAB’s extensive range of statistics and data analysis techniques, in particular its Neural Network Toolbox.

[27] S-PLUS is Insightful Corporation’s (formerly MathSoft, Data Analysis Products Division) data analysis tool. See http://www.insightful.com for details.

[28] This work was recognized by Computerworld Honors Foundation for Outstanding Achievement in Medicine and is now part of the Computerworld Honors Foundation’s collection of case studies (20044978) and the Smithsonian Institution’s permanent collection.

[29] See http://www.neuroshell.com for further details on the NeuroShell product range.

[30] Text Miner is SAS’s text mining tool, part of Enterprise Miner. See http://www.sas.com for details.

[31] The Neural Network Toolbox is one of MATLAB’s specialised toolboxes for advanced data analysis. See http://www.mathworks.com/products/neuralnet for details.

[32] See http://www.zsolutions.com for details of Z Solutions’ Data Mining software and consultancy services, specialised in neural networks.

[33] See http://www.cognos.com/products/datamining.html for details of Cognos’ Data Mining products and services.

[34] Exclusive Ore, Inc. is a data mining and database management consultancy. See http://www.xore.com for details.



[i] Leading Danish Insurer Improves Claim Handling & Customer Support with SPSS Predictive Analytics, KDnuggets News, No. 4, Item 40, 2006 (www.kdnuggets.com/news/2006/n04/40i.html)

[ii] Pediatrix Data Mining Identifies Favorable Antibiotic Combination for At-Risk Infants, Genetic Engineering News, Breaking News, 27 January 2006 (www.genengnews.com/news/bnitem.aspx?name=1153932XSL_NEWSML_TO_NEWSML.xml)

[iii] Data Mining: Solving Care, Cost Capers, Greg Gillespie, Health Data Management, January 2005 (www.healthdatamanagement.com/html/current/PastIssueStory.cfm?ArticleId=10300&issuedate=2004-11-01)

[iv] Baylor Health Care System Implements SAS Software to Evaluate Technology Efforts, DM Direct Newsletter: Industry Implementations, DMReview Web Editorial Staff, 8 April 2005 (www.dmreview.com/editorial/newsletter_article.cfm?nl=dmdirect&articleId=1025098&issue=20167)

[v] EDF Energy keeps eye on bad debt, Miya Knights, Computing, 10 November 2005 (www.vnunet.com/computing/news/2145830/edf-energy-keeps-eye-bad-debt)

[vi] Anderson Analytics: Data Mining for Diamonds, CRM Today, 13 February 2006 (www.crm2day.com/news/crm/117346.php)

[vii] SPSS Predictive Analytics Helps State of Texas Recover $400 Million in Unpaid Taxes, DM Direct Newsletter: Industry Implementations, DMReview.com Web Editorial Staff, 8 July 2005 (www.dmreview.com/article_sub.cfm?articleId=1031670)