Research Areas Overview
Possible Research Areas/Issues in Data Mining
As you embark on advanced research, remember that research is hard work. It is very demanding, very unforgiving, and yet extremely rewarding. Because of the nature of research, you will be most successful if you select a topic that you are really interested in so that, when the hard days come, you can get through them for love of the topic! You should also make sure that you will have access to the resources, both in terms of documents and people, necessary to assist you through the research process.
With this in mind, here is a list of potential research topics in the Data Mining Lab. There may be others, but a project in any one of these areas may be initiated at any time.
Meta-learning / DM Process Automation
Meta-learning differs from base-level learning in the scope of adaptation. Whereas learning at the base-level focuses on accumulating experience on a specific learning task (e.g., credit rating, medical diagnosis, mine-rock discrimination, fraud detection, etc.), learning at the meta-level is concerned with accumulating experience on the performance of multiple applications of a learning system. Thus meta-learning addresses such issues as:
- What is the range of applicability of our current algorithms? What assumptions do they make? How can we decide in advance whether the target hypothesis is likely to be found by the algorithm?
- Do we need commonsense deduction to know what bias to use?
- Can we develop diagnostics that can tell us when our algorithms are not doing well and that can help us identify what is going wrong?
- Can we define properties of data sets that can be measured to understand those data sets and to generate synthetic data for experiments at the meta-level?
- Can we generate meta-models that map data sets or applications to mining algorithms (or combinations thereof)?
- What is the potential role of ontology in meta-learning, especially in the building of effective data mining assistants?
- What is the relevance of the No Free Lunch (NFL) Theorems to meta-learning? Is there a set of "interesting problems" or do problems truly span the whole universe? How much of that universe is covered by current algorithms?
- Can we make data mining more accessible to non-experts? How much of the data mining process can be automated? It has been stated that "when it comes to improving the efficiency of the knowledge discovery process as a whole, additional research on efficient mining algorithms will have diminishing returns if the rest of the process remains difficult and manual."
- Most approaches in model selection have focused on accuracy, i.e., finding the model or DM process that will maximize accuracy. Can we define multi-criteria evaluation functions, including factors beyond accuracy?
We have done plenty of work in this area, published a number of papers, and organized several workshops. We know most of the main players and are regarded as experts in this domain. There seems to be renewed activity within the field here.
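As one illustration of the questions above about measurable data-set properties and meta-models, here is a minimal sketch in Python. It is not the lab's method: the function names `meta_features` and `recommend`, the four chosen properties, and the nearest-neighbour rule are all assumptions made for illustration only.

```python
import math
from collections import Counter


def meta_features(rows, labels):
    """Compute a few simple, measurable properties of a data set.

    rows   -- list of equal-length numeric feature vectors
    labels -- list of class labels, one per row
    """
    n = len(rows)
    counts = Counter(labels)
    # Shannon entropy of the class distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "n_instances": n,
        "n_features": len(rows[0]),
        "n_classes": len(counts),
        "class_entropy": entropy,
    }


def recommend(query, experience):
    """Toy meta-model: return the algorithm recorded for the most similar
    previously seen data set (1-nearest neighbour on raw meta-features).

    experience -- list of (meta_feature_dict, algorithm_name) pairs,
                  a hypothetical record of past mining runs
    """
    def distance(a, b):
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

    return min(experience, key=lambda pair: distance(query, pair[0]))[1]
```

In practice the meta-features would need to be rescaled before distances are compared, since raw instance counts dominate the entropy term; the sketch leaves that out for brevity.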
Prior Knowledge
It is an accepted fact that the more we know the better we learn, and it is rather uninteresting to re-invent the wheel. One would rather focus on discovering new insight than unearthing existing knowledge. Issues in this area include:
- Can we find effective ways to incorporate prior knowledge into our learning algorithms and the data mining process in general?
- What form should prior knowledge take (e.g., UML-like diagrams)? What type of prior knowledge is most useful? Where does such prior knowledge come from and how do we extract it?
- What about commonsense knowledge? What is it? How is it developed?
- How do we bring problem-specific knowledge to bear effectively on the data mining process? For example, how do we add useful structure to the data that augments the traditional feature-vector representation? How do we incorporate such intangibles as company preference into the process?
- How do we design improved utility/interestingness functions from domain-specific knowledge (e.g., for rule post-processing)? How do we use domain knowledge to know which patterns are "good" and which are not?
- How do we integrate human and computer data mining (Mitchell's mixed-initiative data mining)?
We have done some work in this area. As a whole there is little work in the field on this topic, except in the area of inductive logic programming. We are not experts in this field but know many of them and could pursue research in this area.
Never-ending and Incremental Learning
Most learning algorithms, and by extension data mining applications, run under a kind of one-shot, batch mode. In other words, they require all the data to be available a priori and build a model from that data once. If new data becomes available, the process typically has to be restarted. This is not representative of many domains (e.g., data streams) nor of the way we most effectively learn. Topics in this area include:
- Can we design efficient algorithms for learning incrementally?
- What are the implications of a learner that runs 24 hours a day, 7 days a week? If it is turned on today, what should we expect in one year?
- Are there inherently incremental applications? What are they?
- Does incremental learning somehow free us from the NFL Theorems?
- Is never-ending learning a prerequisite to long-term survival (i.e., learn, adapt, evolve)?
We have done some work in this area and published a few papers, most recently Brent's MS thesis and the work with Keith Copsey from QinetiQ, PLC. We can certainly pursue research in this area, which remains relatively untouched overall.
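To make the contrast between batch and incremental learning concrete, here is a minimal sketch of a learner that updates its model one example at a time and never stores past data. The class name `IncrementalCentroidClassifier` and the choice of a nearest-centroid model are assumptions made for illustration, not a description of any algorithm developed in the lab.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier updated one example at a time."""

    def __init__(self):
        self.counts = {}      # label -> number of examples seen
        self.centroids = {}   # label -> running mean feature vector

    def learn_one(self, x, label):
        """Fold a single labelled example into the model."""
        n = self.counts.get(label, 0) + 1
        self.counts[label] = n
        centroid = self.centroids.setdefault(label, [0.0] * len(x))
        for i, xi in enumerate(x):
            # Running-mean update: no earlier example needs to be kept.
            centroid[i] += (xi - centroid[i]) / n

    def predict_one(self, x):
        """Return the label whose centroid is closest to x."""
        def squared_distance(centroid):
            return sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))

        return min(self.centroids,
                   key=lambda label: squared_distance(self.centroids[label]))
```

Each call to `learn_one` costs time proportional to the number of features, so the model can keep absorbing a stream indefinitely, which is the property the batch setting lacks.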
Closed-loop Data Mining
Most data mining applications are built under the assumption that what works well now will work well in the future, with little thought as to how one might monitor whether this is the case, or how to adapt to potential changes. This is related to the notion of incremental learning, but does not have to be, i.e., one could rebuild a model from scratch once one knew that the current model was no longer trustworthy or reliable. Research in control theory focuses on closed systems with a feedback loop, generally where input and output values are continuous. Questions in this area include:
- How does one extend the continuous setting of control theory to the discrete setting of data mining?
- While you select data and build a model, something may change in the underlying phenomenon so that unseen test data can no longer be guaranteed to have any relation to the training data. How do you detect such changes?
- What is the relationship to concept drift? What are sources of drift? What mechanisms can be devised to accommodate them?
- Can we define a kind of software engineering of machine learning? What guarantees can we give about a system's future performance following deployment? Can we get a clearer understanding (and possible formalization) of the data mining process? How do we validate the results of the data mining process?
We have done nothing of substance in this area, but it is an interesting problem that ought to be addressed if DM systems are to become part of business-as-usual.
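One simple way to picture the monitoring half of such a feedback loop is to compare a deployed model's recent error rate with its error rate just after deployment. The sketch below does only that; the class name `WindowDriftMonitor`, the fixed window size, and the `tolerance` threshold are illustrative assumptions, not an established drift-detection method.

```python
from collections import deque


class WindowDriftMonitor:
    """Flag possible drift when the recent error rate exceeds the
    error rate observed right after deployment by more than `tolerance`."""

    def __init__(self, window=50, tolerance=0.2):
        self.window = window
        self.tolerance = tolerance
        self.reference = []                  # first `window` outcomes
        self.recent = deque(maxlen=window)   # sliding window of later outcomes

    def update(self, correct):
        """Record one prediction outcome; return True if drift is suspected."""
        error = 0 if correct else 1
        if len(self.reference) < self.window:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window:
            return False
        reference_rate = sum(self.reference) / self.window
        recent_rate = sum(self.recent) / self.window
        return recent_rate - reference_rate > self.tolerance
```

A True result would be the signal to close the loop, for instance by rebuilding the model from scratch or switching to an incremental update. Note that the sketch needs the true outcome of each prediction, which is not always available promptly after deployment.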
Transfer Learning / Multi-task Learning
Learning is seldom done in isolation. Often related tasks are learned in some kind of sequence, or what we know in some domain may certainly help learning in another. This is related to the issue of prior knowledge, except that here there may not be an explicit representation of the knowledge, simply an implicit transfer from one task to another. Issues in this area include:
- How do we measure the "distance" between tasks and/or classifiers? How can we determine whether two tasks are similar enough?
- From an experimental standpoint, how do we generate similar tasks? Chances are that in current repositories (e.g., UCI), there are far more dissimilar tasks than there are similar ones. This makes it difficult to demonstrate anything involving similarity, i.e., how do we generate a reasonable number of positive examples for transfer learning?
- How much should be transferred between tasks? Can we identify the relevant subsets?
- Can we design effective and efficient transfer mechanisms for all classes of learning models? Most of the work has focused on neural networks and probabilistic models. What about others?
- How can we guarantee that the cost of transfer does not exceed the cost of re-learning from scratch?
- How can a system keep a persistent memory of what it learns, i.e., how can what it learns today be used tomorrow?
We have done a little work in this area, most recently the work of Jun on transfer learning in decision trees. This is a very interesting area, which overlaps some with the never-ending learning topic. We know several of the players in this field and can continue research around this topic.
Human Learning-directed Machine Learning
So far, humans remain our best examples of effective and rather efficient learning systems. As such, much can still be learned from human learning to guide efforts in machine learning. Questions include:
- Where do human and machine learning align?
- Where do human and machine learning differ?
- Human traits, such as motivation, seem to play a crucial role in learning. How can they be incorporated into machine learning?
- Language acquisition and mastery seem to have a lot to do with learning ability. What is the impact on machine learning?
We have done nothing in this area, although it is rather fascinating and, in a sense, the implicit foundation of all our work. We have not yet pursued it as a separate, well-defined endeavor.
Structure-rich Data Mining
The standard representation in data mining is the feature-vector representation or the single relational table representation. Although widespread and useful, this representation is limited in expressiveness and sometimes requires a possibly lossy transformation from richer representations. Issues in this area include:
- What are structure-rich applications where higher-order representations are warranted (e.g., bio-informatics, molecular biology)?
- Can we design efficient algorithms to learn within these far richer domains?
- Is it simply a trade-off between costly data engineering with simple learning algorithms vs. little data engineering with costly learning algorithms? Or is there something at stake in the representation itself?
We have done some work in this area, published a few papers, although a few years back now. There has been some resistance in the community to accepting some of our ideas. Yet, it seems that there is great potential and value in pursuing this area.
Record Linkage
Record linkage consists of discovering duplicate records within a data collection, such that records that are believed to refer to the same entity are treated as a single entity. Issues in this area include:
- Can we find effective similarity measures across heterogeneous data types?
- How do we deal with uncertainty and highly sparse records (e.g., genealogical applications)?
- What is the role of prior or domain-specific knowledge? How much can be learned automatically vs. hard-coded by experts?
We have done some work in this area, most recently, Burdette's MS thesis and Steve's paper on MBGRL. We have generated some interest at the Church's FCHD and are committed to continue research in this area.
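The basic record linkage task described above can be sketched in a few lines: score pairs of records with a similarity measure, then merge every pair that scores above a threshold. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in string similarity and a union-find structure for merging; the function names, the 0.85 threshold, and the all-pairs comparison are assumptions made for illustration, and this is not the MBGRL approach mentioned above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(records, threshold=0.85):
    """Group records believed to refer to the same entity.

    Compares all pairs (quadratic cost) and merges those whose
    similarity reaches `threshold`, using union-find so that
    matches chain together transitively.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())
```

A single string measure like this one does not address the questions above about heterogeneous data types or sparse records; real records would need per-field measures and some way of handling missing fields.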
Social Network Analysis / Link Analysis
Much of the research in machine learning and data mining has focused on extracting knowledge from the characteristics of entities, with little interest in the relationships among them. Social network and link analysis focus on discovering knowledge from the patterns of interaction and interconnectivity among entities, social actors or others. Issues in this area include:
- What constitutes an interesting pattern, when dealing with relationships? What type of social behavior, beyond hubs and authorities, exists on the Web?
- Many graph-mining problems are NP-hard or NP-complete. Can we come up with good polynomial-time approximations so that mining networks (which are but graphs) is a feasible activity?
- What is the role of identifiers, which are ignored in traditional DM applications? Can they serve as proxy for hidden variables (e.g., people visiting this shop are more likely to be terrorists)?
- What are the interactions between explicit social networks and implicit social networks? What can be learned from overlaying the two?
- What contribution can the study of implicit vs. explicit social networks make to political science, especially in relation to the notion of social capital? Can social capital be formalized and quantified in a well-defined theoretical framework?
We have done some work in this area, most recently Matt's MS thesis, and have had some encouraging feedback from Robert Putnam. It is unclear what the contribution to Computer Science will be, but there may be a very interesting contribution to the social and political sciences. This remains to be shown, but there is still potential in this general area of research.
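For readers unfamiliar with the "hubs and authorities" mentioned above, the following is a minimal sketch of the iterative hub/authority scoring scheme (commonly known as HITS) on a small directed graph. The function name `hits`, the edge-list input format, and the fixed iteration count are assumptions made for illustration.

```python
import math


def hits(edges, iterations=50):
    """Compute hub and authority scores for a directed graph.

    edges -- list of (source, target) pairs
    Returns two dicts: (hub_scores, authority_scores).
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {node: v / norm for node, v in scores.items()}

    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        auth = normalize(auth)
        # A good hub points to good authorities.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        hub = normalize(hub)
    return hub, auth
```

The scores depend only on the pattern of links, not on any attribute of the nodes themselves, which is the shift in emphasis this research area is about.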
Medical Data Analysis
Medical data is unique in a number of ways (e.g., privacy, impact on real lives, statistical significance) and thus makes for a very interesting and challenging application area in data mining. In general, the work is less likely to focus on the design of algorithms (although this may still be useful) and more on pre-processing, quality assessment, actionability and ultimate relevance to medical practice. This is an area where the interplay between prior domain knowledge and data mining is at the fore, and the subject of much debate. It is also an area where one may feel excited about the prospect of making a real difference in the lives of people. Questions in this domain include:
- What is the role of observational studies based on data mining in the overall process of acquiring and applying medical knowledge?
- What safeguards should be in place so that all conclusions are thoroughly validated and statistically significant?
- How do we handle the various forms of bias that may appear as a result of patient selection, data pre-processing, etc.?
- How do we account for and control the many parameters that may affect outcomes? Are statistical models the only ones that can be trusted or is there added value in some of the data mining algorithms?
- How do we design the process and conduct the data mining study to ensure acceptance from the medical community?
We have done some work in this area recently, mostly with RemedyMD and bariatrics data. There is a commitment on both sides to continue research in this area, both in terms of actual mining studies and in terms of automatic analytics (see Meta-learning above). We can certainly support research in this field.
Applied Data Mining
Data Mining is a business-driven activity. Much can be learned about it and how to improve it through actual projects with industrial partners. We have several such partners (and always look for new ones) and can support a number of interesting projects that allow us to apply the results of our research as well as help unearth interesting issues that may feed back into the research. Currently, we can support applications in:
- E-commerce
- Bariatric surgery
Additional Issues
The following are additional issues, generally regarded as open questions in the Data Mining community, that could therefore also serve as excellent research topics. In general, we have no solid expertise in these areas, but we know enough to get started.
- Semi-supervised Learning
- Multimedia Mining
  - Mining audio, images and video
  - Mining across heterogeneous media (e.g., both text and images)
- Web Mining
- Text Mining
  - Nathan's thesis on semantic distance is a potential contribution to this area
  - There is plenty of expertise available locally with Dr. Ringger and his group.
- Bioinformatics Data Mining
- Scaling Up
  - High-dimensional data
  - Huge amounts of data (e.g., exabytes)
  - Fast streaming data (i.e., real-time DM)
- Stochastic/Probabilistic Data Mining
- Sequential and Time-series Data Mining
- Distributed and Multi-agent Data Mining
- Security, Privacy and Data Integrity Issues
- Mining Unbalanced, Cost-sensitive Data
