Data Mining and Information Retrieval
The term data mining refers loosely to the process of semiautomatically analyzing
large databases to find useful patterns. Like knowledge discovery in artificial
intelligence (also called machine learning) or statistical analysis, data mining
attempts to discover rules and patterns from data. However, data mining differs
from machine learning and statistics in that it deals with large volumes of data,
stored primarily on disk. That is, data mining deals with “knowledge discovery
in databases.”
Some types of knowledge discovered from a database can be represented by
a set of rules. The following is an example of a rule, stated informally: “Young
women with annual incomes greater than $50,000 are the most likely people to buy
small sports cars.” Of course, such rules are not universally true, but rather have
degrees of “support” and “confidence.” Other types of knowledge are represented
by equations relating different variables to each other, or by other mechanisms
for predicting outcomes when the values of some variables are known.
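The notions of support and confidence can be made concrete with a small sketch. The customer records below are invented for illustration, and the definitions used are one common convention that the text itself does not spell out: support is the fraction of all records that satisfy both the condition and the conclusion of the rule, and confidence is the fraction of records satisfying the condition that also satisfy the conclusion.

```python
# Hypothetical records: (age_group, gender, annual_income, bought_sports_car).
# The data is made up purely to illustrate the two measures.
customers = [
    ("young", "F", 62000, True),
    ("young", "F", 55000, True),
    ("young", "F", 58000, False),
    ("young", "M", 70000, False),
    ("older", "F", 80000, False),
    ("young", "F", 30000, False),
    ("older", "M", 45000, True),
    ("young", "M", 52000, True),
]


def matches_condition(row):
    """Condition of the rule: young woman with income above $50,000."""
    age_group, gender, income, _ = row
    return age_group == "young" and gender == "F" and income > 50000


def rule_support_confidence(rows):
    """Return (support, confidence) for the rule 'condition => bought car'."""
    n_condition = sum(1 for r in rows if matches_condition(r))
    n_both = sum(1 for r in rows if matches_condition(r) and r[3])
    support = n_both / len(rows)
    confidence = n_both / n_condition if n_condition else 0.0
    return support, confidence


support, confidence = rule_support_confidence(customers)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this made-up data, three records satisfy the condition and two of those are buyers, so the rule holds for a quarter of all records and for two thirds of the records it applies to.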
There are a variety of possible types of patterns that may be useful, and
different techniques are used to find different types of patterns. In Chapter 20 we
study a few examples of patterns and see how they may be automatically derived
from a database.
Usually there is a manual component to data mining, consisting of preprocessing
data to a form acceptable to the algorithms, and postprocessing of discovered
patterns to find novel ones that could be useful. There may also be more than
one type of pattern that can be discovered from a given database, and manual
interaction may be needed to pick useful types of patterns. For this reason, data
mining is really a semiautomatic process in real life. However, in our description
we concentrate on the automatic aspect of mining.
Businesses have begun to exploit the burgeoning data online to make better
decisions about their activities, such as what items to stock and how best to
target customers to increase sales. Many of their queries are rather complicated,
however, and certain types of information cannot be extracted even by using SQL.
Several techniques and tools are available to help with decision support.
Several tools for data analysis allow analysts to view data in different ways.
Other analysis tools precompute summaries of very large amounts of data, in
order to give fast responses to queries. The SQL standard contains additional
constructs to support data analysis.
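The idea of precomputing summaries can be sketched as follows. This is only an illustration under assumed data: the sales rows, the grouping columns, and the function name `build_summary` are hypothetical, and real analysis tools use far more elaborate storage. The point is that totals are computed once, so that later queries are answered by a lookup rather than by rescanning the detail rows.

```python
from collections import defaultdict

# Hypothetical detail rows: (item, region, quantity sold).
sales = [
    ("shirt", "east", 3),
    ("shirt", "west", 5),
    ("pants", "east", 2),
    ("shirt", "east", 4),
    ("pants", "west", 7),
]


def build_summary(rows):
    """Scan the detail rows once and keep totals at two levels of grouping."""
    by_item_region = defaultdict(int)
    by_item = defaultdict(int)
    for item, region, quantity in rows:
        by_item_region[(item, region)] += quantity
        by_item[item] += quantity
    return by_item_region, by_item


by_item_region, by_item = build_summary(sales)

# Later queries read the precomputed totals instead of the raw rows.
print(by_item_region[("shirt", "east")])
print(by_item["pants"])
```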
Large companies have diverse sources of data that they need to use for making
business decisions. To execute queries efficiently on such diverse data, companies
have built data warehouses. Data warehouses gather data from multiple sources
under a unified schema, at a single site. Thus, they provide the user with a single
uniform interface to data.
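A minimal sketch of gathering data under a unified schema is shown below. The two sources, their field names, and the target schema are all assumptions made up for this example. Each source gets a small conversion function, and the combined records can then be queried through one set of field names.

```python
# Hypothetical source 1: an in-store system that records amounts in cents.
store_orders = [
    {"cust": "C1", "sku": "A10", "amount_cents": 1999},
]

# Hypothetical source 2: a web shop that records amounts in dollars.
web_orders = [
    {"customer_id": "C2", "item": "A10", "total": 5.50},
]


def from_store(record):
    """Map a store record onto the assumed unified schema."""
    return {
        "customer_id": record["cust"],
        "item_id": record["sku"],
        "amount_cents": record["amount_cents"],
        "source": "store",
    }


def from_web(record):
    """Map a web record onto the same schema, converting dollars to cents."""
    return {
        "customer_id": record["customer_id"],
        "item_id": record["item"],
        "amount_cents": round(record["total"] * 100),
        "source": "web",
    }


# The "warehouse" here is just one list holding records in the unified schema.
warehouse = [from_store(r) for r in store_orders]
warehouse += [from_web(r) for r in web_orders]

# One query now covers both sources.
total_for_item = sum(r["amount_cents"] for r in warehouse if r["item_id"] == "A10")
print(total_for_item)
```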
Textual data, too, has grown explosively. Textual data is unstructured, unlike
the rigidly structured data in relational databases. Querying of unstructured
textual data is referred to as information retrieval. Information retrieval systems
have much in common with database systems—in particular, the storage and
retrieval of data on secondary storage. However, the emphasis in the field of
information retrieval is different from that in database systems, concentrating on
issues such as querying based on keywords; the relevance of documents to the
query; and the analysis, classification, and indexing of documents.
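Keyword querying and relevance ranking can be sketched with a small inverted index. The documents below are invented, and the scoring uses TF-IDF (term frequency times inverse document frequency), which is one widely used relevance measure chosen here as an assumption; the text does not prescribe a particular formula.

```python
import math
from collections import Counter, defaultdict

# Hypothetical document collection.
documents = {
    "d1": "database systems store structured data",
    "d2": "information retrieval ranks documents by relevance",
    "d3": "keyword queries retrieve relevant documents from text data",
}


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(query, index, n_docs):
    """Rank documents by the sum of TF-IDF scores of the query keywords."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # keyword appears in no document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


index = build_index(documents)
results = search("keyword documents", index, len(documents))
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Documents that contain none of the keywords do not appear in the result at all, and a keyword that occurs in fewer documents contributes more to the score than one that occurs in many.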