Data integration

From Wikipedia, the free encyclopedia

Jump to: navigation, search

Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data [1]. This process emerges in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories). Data integration appears with increasing frequency as the volume and the need to share existing data explodes. It has been the focus of extensive theoretical work and numerous open problems remain to be solved. In management practice, data integration is frequently called Enterprise Information Integration.

Contents

[edit] History

Figure 1: Simple schematic for a data warehouse. The information from the source databases is extracted, transformed then loaded into the data warehouse.
Figure 2: Simple schematic for a data integration solution. A system designer constructs a mediated schema over which a user poses queries. The source databases are interfaced with wrapper code if needed.

The problem of combining heterogeneous data sources under a single query interface is not a new one. The rapid adoption of databases after the 1960s naturally led to the need to share or merge existing repositories. This merging can be done at several levels in the database architecture[2]. One popular approach is Data Warehousing (see figure 1). Here data from several sources are extracted, transformed, and loaded into source and can be queried with a single schema. This can be perceived architecturally as a tightly coupled approach because the data reside together in a single repository at query time. Problems with tight coupling can arise with the "freshness" of data, for example when an original data source is updated, but the warehouse still contains the older data and the ETL process needs to be executed again. It is also difficult to construct data warehouses when you only have a query interface to the data sources and no access to the full data. This problem frequently arises when integrating several commercial query services like travel or classified advertisement web applications.

The recent trend in data integration has been to loosen the coupling between data. Here the idea is to provide a uniform query interface over a mediated schema (see figure 2). This query is then transformed into specialized queries over the original databases. This process can also be called as view based query answering because we can consider each of the data sources to be a view over the (nonexistent) mediated schema. Formally such an approach is called Local As View (LAV) — where "Local" refers to the local sources/databases. An alternate model of integration is one where the mediated schema is designed to be a view over the sources. This approach called Global As View (GAV) — where "Global" refers to the global (mediated) schema — is often used due to the simplicity involved in answering queries issued over the mediated schema. However, the obvious drawback is the need to rewrite the view for mediated schema whenever a new source is to be integrated and/or an existing source changes its schema.

Some of the current work in data integration research concerns the Semantic Integration problem. This problem is not about how to structure the architecture of the integration, but how to resolve semantic conflicts between heterogeneous data sources. For example if two companies merge their databases, certain concepts and definitions in their respective schemas like "earnings" inevitably have different meanings. In one database it may mean profits in dollars (a floating point number), while in the other it might be the number of sales (an integer). A common strategy for the resolution of such problems is the use of ontologies which explicitly define schema terms and thus help to resolve semantic conflicts. This approach is also called ontology based data integration.

[edit] Example

Consider a web application where a user can query a variety of information about cities such as crime statistics, weather, hotels, demographics, etc. Traditionally, the information must exist in a single database with a single schema. Information of this breadth, however, is difficult and expensive for a single enterprise to collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases, weather websites, and census data.

A data integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in "virtual data integration". This means application developers construct a schema to best model the kinds of answers their users want. This virtual schema is called the mediated schema. Next, they design "wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution (see figure 2). When an application-user queries the mediated schema, the data integration solution transforms this query into appropriate queries over the respective data sources. Finally, the results of these queries are combined into the answer to the user's query.

A convenience of this solution is that new sources can be added by simply constructing an adapter for them. This contrasts with ETL systems or a single database solution where the entire new dataset must be manually integrated into the system.

[edit] Theory of Data Integration

The theory of data integration[1] is a subset of database theory and formalizes the underlying concepts of the problem in first-order logic. Its results tell us whether data integration is possible and how difficult it is to perform. While its definitions may appear abstract, they are general enough to accommodate all manner of integration systems.

[edit] Definitions

Data integration systems are formally defined as a triple \left \langle G,S,M\right \rangle where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets comprised of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G and the mapping then asserts connections between the elements in the global schema and the source schemas.

A database over a schema is defined to be a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S would be the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually be a collection of disconnected databases. The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence are Global as View or GAV and Local as View or LAV.

Figure 3: Illustration of tuple space of the GAV and LAV mappings[3]. In GAV, the system is constrained to the set of tuples mapped by the mediators while the set of tuples expressible over the sources may be much larger and richer. In LAV, the system is constrained to the set of tuples in the sources while the set of tuples expressible over the global schema can be much larger. Therefore LAV systems must often deal with incomplete answers.

In GAV, the global database is modeled as a set of views over S. In this case M associates to each element of G a query over S. Query processing becomes a straightforward operation because the associations between G and S are well-defined. The burden of complexity is placed on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources are added to the system, considerable effort may be necessary to update the mediator, thus the GAV approach should be favored in cases where the sources are not likely to change.

In a GAV approach to the example data integration system above, the system designer would first develop mediators for each of the city information sources and then design the global schema around these mediators. For example, consider if one of the sources was for a weather website. The designer would likely then add a corresponding element for weather to the global schema. Then the bulk of effort is to write the proper mediator code that will transform predicates on weather into a query over the weather website. This effort can be complicated if there is another source related to weather because the designer is charged with the task of writing code to properly combine the results from the two sources.

On the other hand, in LAV, the source database is modeled as a set of views over G. In this case M associates to each element of S a query over G. Here the exact associations between G and S are no longer well-defined. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of an LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is more likely to change.

In an LAV approach to the example data integration system above, the system designer designs the global schema first and then simply inputs the schemas of the respective city information sources. Consider again if one of the sources was for a weather website. The designer would add corresponding elements for weather to the global schema only if none existed already. Then an adapter or wrapper for the website would be written and a schema description of the website's results added to the source schemas. The complexity of adding the new source is moved from the designer to the query processor.

[edit] Query Processing

The theory of query processing in data integration systems is commonly expressed using conjunctive queries [4]. One can loosely think of a conjunctive query as a logical function applied to the relations of a database such as "f(A,B) where A < B". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then we consider that tuple as part of the set of answers in the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries are classified as conjunctive queries as well.

An important property of conjunctive queries (in terms of data integration) is query containment. A query A contains another query B (denoted A \supset B) if the results of applying B are a subset of the results of applying A for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important because in both GAV and LAV systems, the user's conjunctive queries are posed over a virtual schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by our user's query. This corresponds to the problem of answering queries using views (AQUV) [5].

In GAV systems, a system designer writes mediator code to define the query rewriting. Each element in the user's query corresponds to a substitution rule just as each element in the global schema corresponds to a query over the source. Query processing is simply expanding the subgoals of the user's query according to the rule specified in the mediator and thus the resulting query is likely to be equivalent. While the majority of the work is done beforehand by the designer, some GAV systems such as Tsimmis involve simplifying the mediator description process.

In LAV systems, queries undergo a more radical process of rewriting. This is because there is no mediator to align the user's query with a simple expansion strategy. The integration system must execute a search over the space of possible queries in order to find the best rewrite. The resulting rewrite may not be an equivalent query but maximally contained, and the resulting tuples may be incomplete. The MiniCon algorithm[5] is currently the leading query rewriting algorithm for LAV data integration systems.

In general, the complexity of query rewriting is NP-complete[5]. If the space of rewrites is relatively small this is not a problem even for integration systems with hundreds of sources.

[edit] Enterprise Information Integration

Enterprise Information Integration (EII), is the commercial application of Data Integration. Unlike the theoretical problems described above, the private sector is more concerned with the problems of data integration as a viable product [6]. Emphasis is neither on correctness nor tractability but on speed and simplicity. The industry for EII has emerged, but many professionals believe it is not performing to its full potential. Practitioners cite the following major problems with EII must be addressed for the industry to become mature:

EII must be simple to understand 
Answering queries with views is interesting from a theoretical standpoint, but it is difficult to understand how to incorporate it as an enterprise solution. Some developers believe it should be merged with EAI. Others believe it should be incorporated with ETL systems, citing customers' confusion over the differences between the two services.
EII must be simple to employ 
Even if EII is recognized as a solution to a problem, it is currently time-consuming and complex to apply most EII software solutions to the problem. A variety of schema-less solutions such as "Lean Middleware" [7] have been posed, however ease of use and speed of employment are inversely proportional to the generality of such systems. Others cite the need for standard data interfaces to speed and simplify the integration process in practice.
EII must handle higher order information 
It is difficult even with a functioning information integration system to determine if a given application will be satisfied by the sources in the database. Answering these kinds of questions about a set of repositories requires semantic information like metadata and/or ontologies. The few commercial tools that leverage this information are in their infancy.

[edit] See also

[edit] References

  1. ^ a b Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective". PODS 2002: 233-246. 
  2. ^ Patrick Ziegler and Klaus R. Dittrich (2004). "Three Decades of Data Integration - All Problems Solved?". WCC 2004: 3-12. 
  3. ^ Christoph Koch (2001). Data Integration against Multiple Evolving Autonomous Schemata. http://www.csd.uoc.gr/~hy562/Papers/thesis_final.pdf. 
  4. ^ Jeffrey D. Ullman (1997). "Information Integration Using Logical Views". ICDT 1997: 19-40. 
  5. ^ a b c Alon Y. Halevy (2001). "Answering queries using views: A survey". The VLDB Journal: 270-294. 
  6. ^ Alon Y. Halevy et al. (2005). "Enterprise information integration: successes, challenges and controversies". SIGMOD 2005: 778-787. 
  7. ^ David A. Maluf et al. (2005). "Lean middleware". SIGMOD 2005: 788-791. 

[edit] External links

Personal tools
Languages