Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author. DISTRIBUTION DESIGN IN OBJECT ORIENTED DATABASES A t hesis p resented in partia l fulfi lment of t he requi rements for the degree of MAST ER O F I NFO RMATION S C IENCE I N I NFORMATION SYSTEMS at Massey University, Palmerston North, New Zeala nd Hui Ma 2003 Abstract The advanced development of object oriented database systems has attracted much research. However, very few of them contribute to the distribution design of object oriented databases. The main tasks of distribution design are fragmenting the database schema and allocating the fragments to different sites of a network. The aim of fragmentation and allocation is to improve the performance and increase the avai lability of a database system. Even though much research has been done on distributed databases, the research a lmost always refers to the relational data model (RDM). Very few efforts provide distribution design techniques for distributed object oriented databases. The aim of this work is to generalise distribution design techniques from relational databases for object oriented databases. First, the characteristics of distributed databases in general and the techniques used for fragmentation and allocation for the RDM are reviewed. Then , fragmentation operations for a rather generic object oriented data model (OODM) are de­ veloped. As with the RDM, these operations include horizontal and vertical fragmentation. A third operation named splitting is also introduced for OODM. Finally, normal predicates are introduced for OODM. A heuristic procedure for horizontal fragmenting of OODBs is also presented. The adaption of horizontal fragmentation techniques for relational databases to object oriented databases is t he main result of this work . Acknowledgements I would like to thank Professor Klaus-Dieter Schewe, my supervisor , for his patience, guid­ ance, suggestions and constant support during this research. I am also thankful to Markus Kirchberg for his encouragement and guidance through the early stage of chaos and con­ fusion. A special thanks goes to Madre Chrystal for her kindly devoting valuable time to proof read my draft. The Massey Masterate Scholarship, which was awarded to me for the period February 2002 - February 2003 for graduate studies, was crucial to t he successful completion of this project. Finally, I am grateful to my husband and my parents for t heir patience and love. Without them this work would never have come into existence (litera lly). Hui Ma March 31 , 2003 iii Table of Contents 1 Introduction 1.1 What is a Distributed Database? 1.2 Why Distribution? ...... . . 1.3 Distribution Design: Fragmentation and Allocation 1.4 Objective: Fragmentation in Object Oriented Databases 1.5 The Outline of the Thesis ................ . 2 Distributed Databases 2 .1 Characteristics . . . 2.1.l Data Independence . 2.1.2 Network Transparency (Distribution Transparency) 2.1.3 Replication Transparency .. 2.1.4 Fragmentation Transparency 2.2 Key Concepts . . . . 2.2.1 Heterogeneity 2.2.2 Autonomy .. 2.2.3 Distribution . 2.2.4 Classification of Distributed DBMS . 2.3 A Framework for Distributed Databases . . 2.3.l The Objective of the Design of Data Distribution . 2.3.2 The Reasons for Fragmentation . 2.3 .3 Alternative Design Strategies . v 1 2 2 4 5 6 7 7 8 9 9 10 10 10 11 11 12 13 14 15 16 3 Distribution Design for Relational Databases 3.1 3.2 3.3 3.4 3.5 3.6 3.7 The Relational Data Model Characteristics ...... Horizontal Fragmentation Vertical Fragmentation . Mixed Fragmentation Allocation . . Related Work 3.7.l Horizontal Fragmentation 3.7.2 Vertical Fragmentation. 4 Object Oriented Databases 4.1 Fundamentals of the OODM 4.1.1 Type Definitions 4.1.2 Class Definitions 4.1.3 Schema Definit ion 4.2 An Example of Object Oriented Database Schema 4.3 Queries . . . . . . . . . 4.3.1 Path Expressions 4.3.2 Queries ..... 5 Fragmentation Operations in Object Oriented Databases 5.1 Split Fragmentation ... 5.2 Horizontal Fragmentation 5.2.1 Horizontal Fragmentation on Class Level 5.2.2 Horizontal Fragmentation on Type Level. 5.3 Vertical Fragmentation for Object Oriented Databases .. .... .... ... . . .. . . 5.3.1 Vertical Fragmentation on Class Level 5.3.2 Vertical Fragmentation on Type Level 5.4 Fragmentation Strategies 5.5 Related Work ...... . vi 19 20 21 22 26 29 29 30 30 35 37 38 38 42 42 43 51 51 54 61 62 64 64 66 68 68 70 71 72 6 A M ethod for Horizontal Fragmentation in Object Oriented D at abases 75 6.1 Simple P redicates . 75 6.2 Normal Predicates 78 6.3 The Heuristic Fragmentat ion Process . 80 6.4 A Cost Model . . . . 82 6.4. l Query-Trees . 82 6.5 6.6 6.4.2 Calculation of Size for Classes . 6.4.3 Calculation of Size for Fragments or Intermediate Nodes . 6.4.4 Allocate Intermediate Nodes to Sites 6.4.5 Calculation of Query Costs . . . . . A Heuristic P rocedure for Horizontal Fragmentation Summary . .. .......... . 7 Conclusion and Possible Ex tens ion 7.1 Summary . . 7.2 Future Work Bibliography 83 88 89 90 92 97 99 99 100 103 Chapter 1 Introduction Relational database systems have been well accepted because they reflect the nature of the structures of many organizations and enable the possibility of efficiently and effectively sharing the data. As computer-based systems have penetrated all areas, there are increased d emands for the non-conventional applications, such ru:; computer-aided design (CAD) , ge­ ographic information systems, image, and graphic database systems, etc. However, the complex structure of the data in these applications cannot be adequately modelled by tra­ ditional relational databases. Consequently, object oriented databases are due to take the place of relational ones and become more and more popular. At the same time network fa­ cilities enable the distribution of database systems. However, approaches have been limited to t he relational data model which lead to the emergence of distributed relational database management systems (DRDBMSs). But distributed object oriented database management systems (DOODBMSs) are due as well. It is desirable to design a distributed object oriented database in such a way that the system can perform efficiently and effectively. The techniques of distribution design of relational databases have been intensely studied from the 1980s. There is also substantial research contributing to the study of the object oriented data model , but there is little research on the distribution design of object oriented databases. The aim of this thesis is to study the existing (mature) distribution design techniques for relational databases and to adopt them into an object oriented approach. The following sections of this chapter will first review the definit ion of a distributed database and then study the reasons of the emergence of distributed database systems. Two main distribution 1 2 design techniques will also be briefly defined. After reviewing some research work that has been done in DRDBMSs we will set up the objective of studying fragmentation in DOODBMSs. Finally, the outline of this thesis will be presented. 1.1 What is a Distributed D atabase? Ceri and Pelagatti [6] define a distributed database as a collection of data that logically belongs to the same system but is spread over the sites of a computer network. Ozsu and Valduriez [24] give a similar definition: a distributed database system is a collection of multiple, logically interrelated databases distributed over a computer network. They explain that the logically related files, which are individually stored at each site of a computer network, are not enough to form a distributed database. There need to be a structure among them. Ceri and Pelagatti [6] support this view and state that the data at different sites must have propert ies that tie them together, and that access to the files should be via a common interface. They explain that physical distribut ion means that data does not reside at the same site in the same processor. Ozsu and Valduriez [24] point out that physical distribution does not necessarily imply that the computer systems are geographically distributed. The sites among the network could even have the same address. They could be in the same room, but the communication between them is done over a network instead of shared memory, and the communication network is the only shared resource. 1.2 Why Distribut ion? Ceri and Pelagatti [6] and Ozsu and Valduriez [24] describe the motivation for the develop­ ment of distributed databases. Distributed database research is motivated by the reliability, performance, and economic concerns of distributed databases in organizations. Reliability refers to the ability to tolerant faults. Performance refers to the ability to reduce query response time and increase throughput. Economic concerns relate to the reduction of data communication and update synchronization costs. • Organizational Reasons 3 In recent years, the demand for more information by industries, governments and aca­ demic institutes has led to databases that have exceeded the physical limitations of centralized systems. The advances of telecommunication techniques make distributed database systems more affordable and useful [14, l]. Ceri & Pelagatti [6] state that distributed databases are motivated by organizational reasons: many organizations , especially global organizations are often decentralized. For such organizations, imple­ menting the information system in a decentralized way might be more suitable. • Economic Reasons Ceri & Pelagatti [6] state that economic reasons are another motivation for the devel­ opment of distributed databases. They argue that large, centralized computer centres are becoming questionable with respect to economies of scale. Ozsu & Valduriez [24] support this view and state that it normally costs much less to put together a sys­ tem of smaller computers with the equivalent power of a single big machine due to the advance of minicomputers and microcomputers. The authors also state that the communication cost can be reduced when distributed databases are implemented. If databases are geographically dispersed and the application accessing them are at the intersection of dispersed data, it will be much more economical to partition the re­ lations and the applications so that the data processing can be done locally at each site. • Reliability and Availability Improved reliability and availability is one of the potential advantages of distributed databases which the centralized databases lack [6, 24]. When replications of data have been placed at different sites, the crash of one site or the failure of the communication link would not necessarily make the data impossible to reach. When the system crashes and the communication link fails, even though some of the data will not be accessible, the distributed database system still provides limited services. • Interoperability of Existing Databases Ceri & Pelagatti [6] also mention that when there are several databases already ex­ isting in an organization and there is the necessity of executing global applications, 4 distributed databases are the natural solution. In this case, the distributed database is designed bottom-up from existing local databases. These local databases may need to be reconstructed to some degree, but it is much cheaper than building a new inte­ grated distributed database. • Expandabili ty Ozsu & Valduriez [24] and Ceri & Pelagatti [6] state that it is easier to accommodate increasing database sizes in a distributed environment. If an organization grows by adding new and relatively autonomous branches or warehouses, then the distributed database approach supports the information needs of the new sites with the minimum degree of impact on the existing system. • Local Autonomy Local autonomy is emphasized by Ceri & Pelagatti [6] as a major reason that many business organizations consider a distributed information system. Since data is dis­ tributed, a group of users that commonly share such data can have this data placed at the site they work. Thus, the local controls are allocated to local users to enable them to take partial responsibility for information management in the distributed database . 1.3 Distribution Design: Fragmentation and Allocation Distribut ion design is one of the major research problems whose solution will enhance performance of the distributed databases. It involves data acquisition, fragmentation of databases, allocation and replication of the fragments, and local optimization. Fragmenta­ tion and allocation are the most important elements of a distributed database design phase. They play important roles in the development of a cost efficient system [24] . Fragmentation is a design technique to divide a single database into two or more partitions such that the combinat ion of the partitions yields the original database without any loss or addit ion of information [25]. This reduces the amount of irrelevant data accessed by the application, thus reducing the number of disk accesses. The result of the fragmentation process is a set of fragments defined by a fragmentation schema. Fragmentation can be either horizontal, vertical or mixed. 5 Horizontal fragmentation partitions a relation or a class into disjoint unions (fragments), which will have exactly the same structure but different contents. Thus a horizontal frag­ ment of a relation or class contains a subset of the whole relation or class instance. Vertical fragm entation results in attributes and methods being partitioned into different fragments and therefore reduces irrelevant data accessed by applications [28]. Allocation is the process of assigning a node on the network to each fragment after the database has been properly fragmented [24]. When data is allocated, it may either be replicated or maintained as a single copy. The replication of fr agments will improve the reliability and efficiency of read-only queries. The intention of allocation is to minimize the data transfer cost and the number of messages needed to process a given set of applications, so that the system functions effectively and efficiently [17, 33, 24]. For the sake of simpleness , we will not consider replication of fragments when we discuss fragmentation in this thesis. 1.4 Objective: Fragmentation in Object Oriented Databases The techniques of fragment ing relational databases have been intensely studied since the early 1980s. There are many different approaches to fragmentation and allocation for dis­ tribution design in relational databases. Navathe & Ra [22] develop a vertical partitioning algorithm using a graphical technique. Navathe, Karlapalem & Ra [21] propose a mixed fragmentation methodology for distributed database design. Tamhankar & Ram [33] pro­ pose an integrated methodology for fragmentation and allocation. Chu [8] designs two methods for partitioning attribute which treat the transaction as the decision variable. Even though much research has been dedicated to the issue of object-oriented databases, little has been related to the distribution design of object oriented databases. The fragmen­ tation of object oriented databases is a complex and still open research problem because the semantic model of the object oriented data model is much richer and more complicated than that of the relational model. The object oriented data model allows not only record con­ structors to be used but also set and other bulk type constructors. These constructions may even appear not only on the class level but also in nested structures [28]. The aim of this thesis is to generate distribution design techniques from traditional RDM to the OODM. 6 This thesis will concentrate on fragmentation, especially on horizontal fragmentation for OODM. 1.5 The Outline of the Thesis The thesis will start with a brief review of the issues related to distribution design in relational databases in Chapter 2. The general characteristics of distributed databases will be reviewed. Some key concepts related to distributed databases and a framework for distribution design will also be presented. Chapter 3 covers characteristics of distribution design in relational databases. Fragmenta­ tion and allocation in RDM are studied in this chapter. Also, an overview of related works is provided in this chapter. In Chapter 4, the fundamentals of the object oriented data model will be discussed. At the same time, an example of the object oriented database schema will be presented in this chapter. Chapter 5 will concentrate on some fragmentation techniques that can be used in the distri­ bution design of object oriented databases. Splitting, horizontal fragmentation and vertical fragmentation will be presented and analyzed. Some design strategies will also be presented. Finally, related works will be reviewed. Chapter 6 will introduce normal predicates in the object oriented data model. Then some heuristics of horizontal fragmentation in object oriented databases will be presented. Finally, Chapter 7 contains the conclusion of this work. Future work is also listed in this chapter. Chapter 2 Distributed Databases Distributed Database Management Systems (DDBMSs) are characterized by several levels of transparencies that they can support. The classification of distributed databases is based on autonomy and heterogeneity. Distribution design is performed under a framework that sets the objective of the design. There are two approaches to the design of distribution. They constitute different approaches to the design process. But both of them might be applied to complement one another [24]. This chapter defines the fundamental concepts and sets the framework for discussing distributed databases . We start by reviewing the characteristics of a distributed database system. Then we will present a architectural model for distributed DBMSs. Two alternative design strategies will also be discussed in this chapter. 2 .1 Characteristics Some desirable functions that should be supported by a true distributed DBMS are proposed by Ozsu & Valduriez [24] . Data independence is considered to be one of the main motivations for introducing databases. In a distributed database, data independence has the same importance as in traditional databases. However, a new aspect is added to the usual notion of data independence, namely distribution transparency [6]. Atre & Advisor [3] explain that data access should be transparent and synchronized to preserve database integrity. Ozsu & Valduriez [24] define transparency as separation of the higher-level semantics of a system from lower-level implementation issues. In other words, transparency means that when the users are accessing the data, they do not need to know where the data is stored, in what 7 8 format it is stored, or how it is to be accessed [3]. In a distributed relational database environment, each relation can be partitioned into a set of fragments on the basis of relevant informational content. The fragments of the relation are stored at different sites. Furthermore, it might be preferable to duplicate some of this data at other sites for performance and reliability reasons [24] . The result is a distributed database which is fragmented and replicated. The database users would see a logically integrated, single image database even though it may be physically distributed. The DDBMS should enable users to access the distributed database as if it were a centralized one. The ideal form with full transparency would imply that a query language interface to a distributed DBMS is not different from a query interface to a centralized DBMS. An ideal DBMS should provide a number of different types of transparencies. In the fol­ lowing sections the different levels of transparency will be reviewed. 2.1.1 Data Independence Data independence is a fundamental form of transparency that centralized database systems can provide. Data independence refers to the fact that the definition and maintenance of data are independent from the applications and are controlled by a server of the DBMS [23]. In other words, data independence means that the actual organization of data is transparent to the application programmer. Programs are written with a conceptual schema [6]. Ozsu & Valduriez [24]propose two types of data independence: • Logical data independence refers to the immunity of user applications to changes in the logical structure of the database. For example, if a user application operates on a subset of the attributes of a relation, it should not be affected later when new attributes are added to the same relation. • Physical data independence refers to hiding the details of the storage structure from user applications. Programmers should not be concerned with the details of physical data organization when they design a user application. And the user application should not need to be modified when data organizational changes occur with respect 9 to what data type the data is assigned and what storage hierarchies t he data is distributed across . 2.1.2 N etwork Transparency (Distribution Transparency) Network transparency or distribution transparency means that programs can be written as if the database were not distributed [6]. Ozsu & Valduriez [24] state that there should be no difference between database applications that would run on a distributed database and those that would ruu on a centralized one. They also state that the user should be protected from operational details of the network and even the existence of the network. T hey separate the distribution transparency into location t ransparency and naming t ransparency. • Location t ransparency means that the command used to perform a task is independent of both the location of the data and the system on which an operation is carried out. • Naming transparency refers to the fact that a unique name is provided for each object in the database. If there is no naming transparency, users arc required to embed the location name (or an identifier) as part of the object name. 2.1.3 Replication Transparency Ozsu & Valduriez [24] state t hat replication transparency is dealing with whether the users should be aware of the existence of copies, or whether the system should handle the man­ agement of copies. In a DBMS with replication transparency, users should act as if there is single copy of the data. They also emphasize t hat replication transparency refers only to the existence of replicas not to the location. Ceri & Pelagatti [6] explain that there are several reasons for considering data redundancy as a desirable characteristic: first, the locality of applications can be increased if the data is replicated at all sites where applications need it; and second, the availability of the system can be increased because one site failure does not stop the execution of applications at other sites if the data stored is replicated. The performance can be increased t hrough replication, as the retrieval can execute on any copy of the data if t here is more than one copy of the 10 data. But on the other hand, replication causes problems in updating databases. Thus, if the user application is not retrieval oriented but update oriented, it might not be a good idea to have too many copies of the data. Whether or not to have copies and how many copies to have is decided to a considerable degree by the nature of user applications [24]. 2.1.4 Fragmentation Transparency With fragmentation transparency, users are not aware of the data separation. Fragmenta­ tion transparency is implied if the database systems have the function of physical data inde­ pendence. Fragmentation is often used to improve performance, availability and reliability. Fragmentation can also reduce the negative effects of replication because by partitioning the data, only a subset of the relation, not the full relation, needs to be stored. Less space is required and fewer data items need to be managed [24]. 2. 2 Key Concepts First, in this section, we will review and explain basic concepts that include heterogeneity, autonomy, and distribution which are used in the classification of distributed database sys­ tems. Then, classification of distributed database systems: tight integration, multidatabase, and semiautonomous database systems will be discussed explicitly. 2.2.1 Heterogeneity Heterogeneity in DDBMSs can arise at different levels in the system, including hardware at different sites, different operating systems, different network protocols, different local DBMSs, and different models that local DBMSs are based on [4, 6, 24]. However, an important distinction is at the level of local DBMSs and the model they are based on, because differences at lower levels are managed by the communication software not by the DBMS. Therefore, only the heterogeneity of DBMS, the model of DBMS and the semantics of the data model need to be considered. Hence, the term "homogeneous DDBMS" refers to a DDBMS with the same DBMS based on the same data model with the same semantics. 11 2.2 .2 Autonomy Autonomy is concerned with the distribution of control, not of data. It indicates the degree to which individual DBMSs can operate independently [6 , 24]. Autonomy refers to a function of a number of factors such as whether the component systems exchange information, whether they can independently execute transactions and whether each site is allowed to modify itself. Requirements of an autonomous system have been specified in a variety of ways. Bell & Crimson [4] specify the dimensions of autonomy as: • Design autonomy: local sites are given the freedom to decide the information content of the DB , to select the data model and to choose storage structures. • Participation autonomy: local sites have the right to decide what data to contribute and when, and the freedom to decide when to come and to go. • Communication autonomy: local sites have the right to decide how and under what terms to communicate with other sites in the network. • Execution autonomy: local sites have the freedom to make decisions whether and how to process local operations to store, retrieve and update local data. 2.2 .3 Distribution In Ozsu & Valduriez [24] distribution refers to the physical distribution of data over multiple sites. There are two alternative classes that DBMSs use to distribute data: client/server dis­ tribution and peer-to-peer distribution. Client/server distribution concentrates data man­ agement duties with servers, while clients focus on providing the application environment that includes the user interface [3]. The client and servers share the communication re­ sponsibility. The sites on a network are distinguished as "clients" and "servers", and their functionality is different [24]. In peer-to-peer systems, there is no distinction between client machines and servers. Each machine has full DBMS functionality and machines can com­ municate with each other to execute queries and transactions. Peer-to-peer systems are also called fully distributed systems [24] . 12 2.2.4 Classification of Distributed DBMS Bell & Grimson [4] divide distributed database management systems into homogeneous DDBMSs and heterogeneous DDBMSs. Homogeneous DDBMSs can be further divided into classes depending on whether or not they are autonomous. Heterogeneous DDBMSs can be further divided into classes based on wether they are integrated or not. The au­ thors also categorize systems according to the degree of heterogeneity, the method of data distribution and the extent of local autonomy. Ozsu & Valduriez [24] and Ceri & Pela­ gatti [6] present a working classification of possible design alternatives along three similar dimensions: autonomy, distribution, and heterogeneity. Ozsu & Valduriez [24] propose a classification of the implementation alternatives of distributed database systems which is based on autonomy and heterogeneity. This is illustrated in Figure 2.1. The different im­ plementation alternatives that cover the important aspects of these features of distributed database design are reviewed below. Distributed Distributed heterogeneous heterogeneous DBMS federated DBMS Distributed heterogeneous multidatabase Heterogeneity • • ~st.em . / ~~~~~~~~·~~-- ~~~~----1~ Distributed homogeneous DBMS ~ ' ' ' Distributed homogeneous federated DBMS Distributed homogeneous multi database system Autonomy Figure 2.1: DDBMS Implementation Alternatives [24] At one extreme, a tightly integrated (or truly distributed) database is a DDBMS with no local autonomy. In other words it has full global control. This kind of DDBMS has a global schema and all applications access the global database. The local DBMSs do not operate independently. In contrast , in a multidatabase there are only local database schema and no global schema. 13 Global applications must access each local database separately. The individual systems are stand-alone DBMSs, which know neither of the existence of other DBMSs nor how to communicate with them. Local autonomy guarantees that local users can access their own local DB independently of the existence of the multidatabase and its global users . Gligor & Popescu-Zeletin [15] list the requirements of multidatabase systems as follows: • The local operations of the individual DBMSs are not affected by their participation in the multidatabase system. • System consistency or operation should not be affected when a DBMS join or leave the multidatabase confederation. • The manner of query processing and optimization of individual DBMS should not be affected by the execution of global queries that access multiple databases. Semiautonomous (or federated) systems consist of a local database schema and a global schema. Local databases have their own schema and can operate independently, but they also have decided to participate in a federation to make their local data sharable. Each of these DBMSs determines what information it wants to share with other users. Applications may be executed locally and globally. They are not fully autonomous systems because the parts of their own databases that they share with other users should be able to be modified by other users in order to change information. In this thesis, the discussion of data fragmentation and allocation is limited to the dis­ tributed homogeneous DBMS, which provides an integrated view of the data to users even though the database is distributed. 2.3 A Framework for Distributed Databases The objective of the design of distribution is discussed by Ceri & Pelagatti [6]. The design of a distributed database system involves making decisions on the architecture of DDBMS, on the process of design. Two major strategies for designing distributed databases that are identified by Ceri & Pelagatti [6] are: top-down approach and bottom-up approach. Ozsu 14 & Valduriez [24] develop a framework for the process of distribution design based on these approaches. The reasons and criteria for fragmentation and allocation will be discussed in this section. 2.3.1 The Objective of the Design of Data Distribution There are several objectives that should be taken into account in the design of distribution [6]: • In a distributed database system one of the major costs is associated with communica­ tion. To minimize communication costs, one goal of DDBMSs is to achieve processing applications locally. The degree of local processing can be maximized by distributing data, therefore minimizing transaction costs. To achieve this goal, the data should be kept as close as possible to the applications which use them. The advantage of processing applications locally is not only the reduction of remote access costs, but also increased simplicity in controlling the execution of the application. • We can improve the availability and fault-tolerance of read-only applications by stor­ ing multiple copies of the same information at different sites. When one site of the database is down or the community link for that site is broken, the system can still execute the applications by accessing the other copies of the information. • Distributing workload over the sites is done in order to take advantage of the different powers of utilization of the computers at each site, and to maximize the degree of parallelism of execution of applications . But the trade-off between processing locally and distributing workloads should be considered in the design of data distribution. • Database distribution should reflect the cost and availability of storage at each site. Even though the storage cost is not relevant when compared with the cost of in­ put or output (I/O), central processing unit (CPU), and transmission costs of the applications, the limitation of available storage at each site should be considered. 15 2.3.2 The Reasons for Fragmentation To simplify the problem, we do not consider replication at the first step of distribution design. The purpose of fragmentation design is to determine non-overlapping fragments which could be the logical unit of allocation [6]. The individual tuple or attribute of a relation cannot be considered as the unit of allocation because the allocation problem would become unmanageable. The fragments are constituted by grouping tuples or attributes that have the same "properties" from the viewpoint of their application [6]. This is based on the idea that two elements in the same fragment that have the same "properties" will be accessed by the applications together. Therefore, the fragments obtained in this way are the appropriate units of allocation [6]. Ozsu & Valduriez [24] mention that with respect to fragmentation , the important issue is the appropriate unit of allocation. The authors explain that there are three reasons for fragment relations: • Applications are usually based on the views of subsets of relations. Thus the applica­ tions often access any subset of an entire relation locally. • If there is a relation on which many application views are defined at different sites , storing a given relation at one site will result in an unnecessarily high volume of remote data accesses. Storing a given relation at different sites will cause problems in executing updates and may not be desirable if storage is limited. • The decomposition of a relation into fragments permits many transactions to be exe­ cuted concurrently and results in the parallel execution of a single query by dividing it into a set of sub queries that operate on fragments. However, on the other hand , fragmentation may cause the following problems [24]: • The applications whose views are defined on more than one fragment may suffer per­ formance degradation when the relations are not partitioned into mutually exclusive fragments. 16 • When the attributes participating in a dependency of a relation are decomposed into different fragments and stored at different sites, the task of checking for dependencies would result in chasing after data in a number of sites. 2.3.3 Alternative Design Strategies Ceri & Pelagatti [6] and Ozsu & Valduriez [24] propose two alternative approaches to the design of data distribution, top-down and bottom-up approaches. In the case of tightly integrated distributed databases, design proceeds top-down from requirements analysis and logical design of the global database to physical design of each local database. In the case of distributed multidatabase systems, the design process is bottom-up and involves the integration of existing databases [24]. But the authors also emphasis the fact that real applications are rarely simple enough to fit nicely in either of these approaches. The two approaches may need to be applied to complement each other. Top-down Approach As shown in Figure 2.2, in the top-down approach, the process starts with a requirements analysis that defines the environment of the system and elicits both the data and processing needs of all potential database users [35]. The requirements analysis also specifies where the final system is expected to stand with respect to the objectives of the DDBMS. The objective is defined with respect to performance, reliability and availability, economics , and expandability ( flcxi bili ty) . The requirements documents are the inputs to two parallel activities: view design and conceptual design. The outputs of view design are the interfaces for the user, and the outputs of conceptual design are entity types and relationship types which are used to construct an external schema. In a distributed relational database environment, the objective of distribution design is to design the local conceptual schemas (LCSs) by distributing the relations and subrelations (fragments). The fundamental issues in top-down design are fragmentation and allocation [24]. 17 The last step in the design process is the physical design, during which local concept ual schemas are mapped to the physical storage devices avai lable at the corresponding local sites . Co nceptual np~nn Requirements Anal is System Requirements (Objectives) User In ut \11 ew I nteqrati on Feedback Observation and ~-------------< Monitorin \11ewDesign External Feedback Figure 2.2: Top-down Design Process [24] Bottom-up Design Process User Input Ceri & Pelagatti [6] and Ozsu & Valduriez [24] state that top-down design is suitable for the systems which are developed from scratch. But when the distributed database is developed as the aggregation of existing databases, it is not easy to follow the top-down approach. The bottom-up approach, which starts with individual local conceptual schemata, is more suitable for this environment [6, 24]. Ceri & Pelagatti [6] explain that the bottom-up approach is based on the integration of existing schemata into a single, global schema. Integration is the process of the merging of common data definitions and the resolution of conflicts among different representations that are given to the same data. The global 18 conceptual schema is the product of the process [24]. Ceri & Pelagatti [6] conclude that there are three requirements for bottom-up design: (1) the selection of a common database model for describing the global schema of the database, (2) the translation of each local schema into the common data model, and (3) the integration of the local schemata into a common global schema. The authors also state that these three requirements are particularly important in heterogeneous distributed systems. Chapter 3 Distribution Design for Relational Databases There are many reasons to study distributed relational databases. Firstly, the mathematical foundation of the relational data model (RDM) makes the theory research problem easy to formulate. Secondly, there are a large number of relational systems on the market, and most distributed database systems are relational [23]. Current distributed database management systems (DBMSs) are mainly available for relational databases (RDBs) [33] . The distribution design of databases involves data acquisition, fragmentation of the database, allocation and replication of the partitions and local optimization [19]. In a distributed relational database environment, fragmentation is the design technique that divides a rela­ tion into a set of partitions such that the combination of the partitions yields the original database without any loss or addition of information [24, 25]. Fragmentation can be done in several ways: horizontal, vertical, and mixed (hybrid) . Hor­ izontal fragmentation splits a relation into a set of disjoint new relations with the same relation schema. Each of the new relations is obtained by applying a selection operation to the original relation [28] . Vertical fragmentation involves dividing the attributes of a rela­ tion into groups and then applying a projection operation to the original relation over each group. The original relation can be reconstructed by union or join operations, respectively, of the new relations resulting from horizontal or vertical fragmentation [28]. The following sections will first set the preliminaries of the discussion of distribution design of RDB. Then we will list the characteristics of the distribution design. Afterwards we will 19 20 review each design technique mentioned above. 3.1 The Relational Data Model The relational data model was first introduced by Codd [9] in 1970. Codd [10] emphasizes that the relational model has three power features. First, its data structures are simple. This feature allows a high degree of independence from the physical data representation. Second, the relational model provides a solid theoretical foundation for data consistency. The consistent states of a database can be uniformly defined and maintained through in­ tegrity rules. Thirdly, the relational model allows the set-oriented manipulat ion of relations. With this feature , powerful nonprocedural languages can be developed based either on set theory (relational algebra) or on logic (relational calculus). To define the relational data model, we suppose that a certain set V of possible domains D is given. For instance, D can be NAT, which is a set of natural numbers, ST RI NG, which is a set of character strings, and BOOL, that is a set of booleans, etc. Every attribute in a relation schema is assigned a domain D out of V. The following definition is given in [28]. Definition 3.1. Let V be a non empty set of domains. (i) A relation schema R consists of a finite set attr(R) of attributes and a domain assign- ment dom: attr(R) -t V. (ii) A tuple over a relation schema R is a mapping t : attr(R) -t LJ with t(A) E dom(A) DE'D for all A E attr(R). (iii) A relation r over a relation schema R is a finite set of tuples over a relation schema R. (iv) A relational database schema is a finite, non-empty set S of relation schemata. ( v) An instance I of a relational database schema S assigns to each R E S a relation I(R) over R. 21 Remark: (i) A relation schema R can be written in the form of R = {A1 : D1 , .. . , An : Dn} with attr(R) = {A1, ... ,An} and dom(Ai) = Di(i = 1, ... ,n) . (ii) We usually use the term 'database' instead of talking about instances. Each tuple in a relation is uniquely identified by its key which is the subset of the set of attributes attr(R) of a relation r. The key is minimal if and only if there is no subset of a relation that can be a key. In order to define the key, we need to define a operation named projection first. Definition 3.2. Let r be a relation over relation schema R, and X ~ R. For a tuple t Er, projection oft, denoted as t[X] is a mapping t' : X --t LJ dom(A) with t' E dom(A) such AEX that t' (A) = t(A) for each A EX. Then we can have the following definition: Definition 3.3. A key on a relation schema Risa subset I< ~ attr(R) restricting relations rover R to satisfy t1 = t2 for all tuples t1, t2 Er with t1[K] = t2[K]. A key K is called minimal if and only if no proper subset of I< is a key. A primary key is simply a distinguished minimal key 3.2 Characteristics For a given relational database schema S = {R1, ... , Rk}, fragmentation applies to each of the relations ~ in the schema. To ensure the consistency of the database, the fragmentation operation should satisfy the following three characteristics [24]: (i) Completeness Informally, completeness refers to the fact that fragmentation does not lead to a loss of information. If a relation ~ is decomposed into fragments R}, .. . , Rfi, each data item that can be found in~ can also be found in one or more of R{ 22 (ii) Reconstruction Informally, reconstruction refers to the fact that it should be easy to reconstitute a non-fragmented relation back from its fragments. For a given relation Ri with frag­ mentation FR; = {R[, ... , Rfi}, it should be possible to define a relational operator that can reconstruct Ri with R{. • For horizontal fragments, reconstruction of a global relation Ri is performed by using the union operator in the horizontal fragmentation. Using relational algebra, k; Ri = LJ R{ j=l • For vertical fragments, reconstruction of the original global relation Ri is made by using the natural join operation: (iii) Disjointness Informally, disjointness refers to the fact that fragmentation should not lead to a replication of information. But if replication comes into play, this requirement has to be relaxed. If a relation Ri is horizontally decomposed into fragments R}, ... Rfi and the data item d'i is in R{, it does not appear in any other fragment Rf(k ¥- j). If a relation is vertically partitioned, disjointness is defined only on the non-primary key attributes of the relation. To simplify the problem, replication is not considered at the first stage of the design of distributed databases. 3.3 Horizontal Fragmentation There are two forms of horizontal fragmentation: primary horizontal fragmentation and de­ rived horizontal fragmentation [6, 24]. Schewe [27] defines primary horizontal fragmentation as: Definition 3.4. Let S = R1, ... , Rn be a relational database schema, primary horizontal 23 fragmentation on relation schema R;, replaces R;, by a set { R}, ... , Rfi} of new relation schemata such that: (i) the attributes in each R{ are the same as in R;,. (ii) Each relation ri over R;, can be split into pairwise disjoint relations r}, ... , rfi over R} , ... , Rfi respectively such that ri = rf U · · · U rfi holds. By using relational algebra, the operation of horizontal fragmentation can be expressed as: with 'Pj as selection predicates defined on R;, that are derived from application information. Of course, the disjointness property implies that we must have 'Pj I\ 'Pk {:::} J alse for all j i= k Similarly, the completeness property implies the requirement that c.p1 V · · · V 'Pki {:::} true We notice the above definition includes the criteria which are also known as the correctness rules of horizontal fragmentation in the above section: completeness, reconstruction, and disjointness. Example 3.1. Take relation schema LECTURER= {name, specialization, location} , a re­ lation r over LECTURER is name specialization location Chris Database Theory Palmy David System Analysis Wellington Peter End User Computing Palmy John Database Theory Auckland Barbara Multimedia Auckland 24 with a set of selection formulae: < acpj( Rx)) j=l be a necessary 'remainder fragment'. 25 Example 3.2. Take relation schemata LECTURER = {name, specialization, location Code} and CAMPUS = {location Code, city}, a relation r over LECTURE and relation r' over CAM- PUS are: name specialization location Co de Chris Database Theory 1 location Code city David System Analysis 3 1 Palmy Peter End User Computing 1 2 Auckland John Database Theory 2 3 Wellington Barbara Multimedia 2 with a set of selection formulae <{Jj defined on relation CAMPUS, relation r over LECTURER can be fragmented as: R1 = LECTURER I>< O"cp 1=city = 'Palmy•(CAMPUS) R2 =LECTURER I>< O"cp2 =city = 'Wellington•(CAMPUS) R3 =LECTURER I>< O"cp3 =city = 'Auckland'(CAMPUS) Then relation r is partitioned into the following three relations: name specialization location Code Chris Database Theory 1 Peter End User Computing 1 name specialization location Code David System Analysis 2 26 name specialization location Code John Database Theory 3 Barbara Multimedia 3 Note that the remainder relation r4 is empty because all tuples of r can match a tuple in one of the fragment rj of relation r'. D 3.4 Vertical Fragmentation Vertical fragmentation exploits relation schemata to be sets of attributes. Vertical fragments result from a projection operation to the original relation. The original relation can be reconstructed by the joining of the new relations . It can be defined as below: Definition 3.6. Let S = R1, . •. , Rn be a relational database schema with relation schemata . { 1 ki } ~ = {Ail, ... , Aini}. Vertical fragmentation replaces Hi by a set Ri , ... , Ri of new relation schemata such that: ki . (i) the attributes are distributed, i.e., I4, = U Rf, j = l (ii) each relation ri over I4, is split into relations r1, = rrR{(ri)(j = 1, ... ,ki) such that Ti = r} IXl · · · IXl rfi holds, (iii) in Relational Algebra, ~ = R} IXl · • · IXl R7', (iii ') in a special case, a distinguishing new attribute dif can be added to relation ~ as a minimal key, then after vertical fragmentation dif E R{ for all j E {1, .. . , ki}, and I4. = rrn;- {dif}(R} 1X1···1X1 R7i). Using Relational Algebra, vertical fragmentation could be written as R{ = 7rattr(R{)(~) for all j E { 1, .. . , ki} Not having the distinguished new attribute dif would require a lossless join-decomposition, which in turn would mean t hat a join-dependency must hold. However, a well designed 27 database schema would exclude such dependencies, except for the case, where R} n · · · n R7; contains a key. Thus, it is normally required that the primary key (i.e. a chosen minimal key) is part of all Ri . Example 3.3. Take relation schema LECTURER = {name, specialization, location, depart­ ment, IRD, salary}, a relation r over LECTURER is name specialization location department IRD salary Chris Database T heory Palmy Information Systems 234569 92000 Richard Accounting history Wellington Accounting 235169 38000 Peter End User Computing Palmy Information Systems 256487 64560 Mary Careers Auckland Human Resources 156426 65900 Barbara Multimedia Auckland Computer Science 352486 56000 Relation r can be vertically fragmented in the following two ways: (i) F irst , alter the relation schema LECTURER to LECTURER' by attaching a distinguish­ ing new attribute 'dif' to LECTURER. Then, by using operation: R1 = 7r dif, specialization, location , department (LECTURER' ) R2 = 1rdif, name, IRD, salary(LECTURER') the above relation r is vertically partitioned into the following two relations: dif specialization location department 1 Database Theory Palmy Information Systems 2 Accounting history Wellington Accounting 3 End User Computing Palmy Information Systems 4 Careers Auckland Human Resources 5 Multimedia Auckland Computer Science 28 dif name IRD salary 1 Chris 234569 92000 2 Richard 235169 38000 3 Peter 256487 64560 4 Mary 156426 65900 5 Barbara 352486 56000 (ii) Alternatively, let {name, department} be the primary key. Using operation: R1 = 1r name, department, specialization, location (LECTURER) R2 ='Tr name, department, IRD, salary(LECTURER) the above relation r is vertically part itioned into the following two relations: name department specialization location Chris Information Systems Database Theory Palmy Richard Accounting Accounting history Wellington Peter Information Systems End User Computing Palmy Mary Human Resources Careers Auckland Barbara Computer Science Multimedia Auckland name department IRD salary Chris Information Systems 234569 92000 Richard Accounting 235169 38000 Peter Information Systems 256487 64560 Mary Human Resources 156426 65900 Barbara Computer Science 352486 56000 D 29 3.5 Mixed Fragmentation Database users usually access subsets of data which are vertical and horizontal fragments of global relations . Therefore, there is a need for mixed fragmentation, which applies a sequence of horizontal and vertical fragmentation on a relation. It can be achieved by successively performing horizontal and vertical fragmentation on a relation. The different sequencing of vertical and horizontal operations generates different fragmentation schema. Even though these operations can be recursively repeated, having more t han two levels of fragmentation is not of practical interest [6]. k Definition 3. 7. For a given relation ~ in the database schema S, mixed fragments Ri are built by successive steps of horizontal and vertical fragmentation on relation ~, such that the correct rules for both vertical and horizontal fragmentation can be met. Using relational algebra, it can be expressed by using alternatively sequences of projection and selection operations: allowing 'Pk = true and R{ = Ri captures all other possibilities for such sequences. 3.6 Allocation Once a fragmentation schema has been decided upon, each fragment must be assigned to one or more nodes in the distributed database management system. The allocation problem involves finding the "optimal" distribution of the fragments to the sites. The discussion of allocation is to find an allocation model that minimizes the total costs of processing and storage while trying to meet certain t ime restrictions [24]. The definition of a cost minimized allocation is as below: Definition 3.8. For a given set of fragments {Fi , ... , Fn} with different sizes s1, .. . , Sn, if the network has nodes N1, ... , Nki fragment allocation is to assign a node Nj to each fragment Fi such that the summary of all the transaction and storage costs from all the 30 sites can be kept to a minimum, where the transaction and storage costs are calculated according to a predefined cost model. The focus of this thesis is on horizontal fragmentation of DOODBs. We will not go into detail on the discussion of vertical fragmentation and allocation of fragments. 3. 7 Related Work A lot of research has contributed to fragmentation and allocation in the area of distributed relational databases. This section will present some approaches for horizontal and vertical fragmentation. 3. 7.1 Horizontal Fragmentation For horizontal fragmentation, Ceri and Pelagatti [6] introduce two types of fragmentation: primary and derived. Derived fragmentation, which is performed to facilitate the join between fragments, is determined in terms of primary fragmentation. • Primary horizontal fragmentation of a relation can be defined by determining a set of disjoint and complete selection predicates. Ceri and Pelagatti propose a procedure which produces a set of disjoint and complete selection predicates. This procedure can be summarized as follows: (i) Derive a set of simple predicates P = {pi, ... ,Pn} from application information. Simple predicates take the form of: Ai= value where Ai is an attribute of the relation schema. The simple predicates Pi E P must satisfy two properties. They must be complete and minimal in order to guarantee that elements in the same fragments share the "same properties" in terms of allocation. The definition of complete and minimal can be found in [6]. 31 (ii) Construct a set of minterm predicates from P by applying arbitrary conjunctions of all predicates Pt, where Pt is either Pi E P or its negation. Some of these conjunctions may be unsatisfiable. A set I of implications among the Pt can be used to determine (and remove) these unsatisfiable minterms. The authors point out that it is not possible to analyse all the transactions that use the database. Only the most important and critical transactions should be taken into account. It is widely accepted that the "20/80" rule should be applied as a guideline to choose user applications to determine simple predicates. It means that only 20% of user queries should be taken into account, because they usually account for 80% of the data access . In particular , no update transactions will be considered. It is also suggested that fragments that have similar properties should not be distinguished. Otherwise the execution of the algorithms for a complete and minimal set of predicates will become very expensive. • Derived fragmentation in [6] is performed by applying semijoin operations. Member relations are partitioned according to the fragmentation of their owners. By using relational algebra, derived fragmentation is expressed as R~ = R2 t>< R{ with R2 indi­ cating the member relations, R1 indicating owner relations that have been fragmented into a set of disjoint fragments { Ri , ... , Rl} with 1 :S j :S i. Oszu and Valduriez [24] follow the lines of Ceri, Pelagatti and Navathe [6 , 20] and develop horizontal partitioning algorithms. Their fragmentation algorithm is based on the following necessary information requirements: (i) Database information. The database information concerns the global conceptual schema. From the global conceptual schema, we can know how the database relations are connected to one another, especially with joins. (ii) Application information. This includes the description of user queries and fre­ quencies with which user applications access and update data. The quantitative and qualitative information acquired from application information can be summarized in the following four categories: 32 • Simple predicates for relation R = {A1 : Di, ... , An : Dn} can be defined in the form of with A as an attribute defined over domain Di , () E { =, <, /:-, :S, >, ~} and Value E Di. A set of all simple predicates defined on relation R is denoted by Pr= {p1 ,p2, ··· ,Pm}· • Minterm predicates are the conjunctions of simple predicates and their nega­ tions. The set of minterm predicates Mi= {mil, mi2, ... , miz} over a set Pri of simple predicates is defined by: Mi = { mij lmij = f\ P7d P;kEPr; where P7k = Pik or P7k = 'Pik· Note that all simple predicates in Pri appear (positively or negatively) in each minterm predicate. • Minterm selectivity presented with sel(mi) is the number of tuples of the relation that would be accessed by a user query specified according to a given minterm predicate mi. • Access frequency with which users access data. For user application qi, it is denoted with acc(qi)· For a minterm predicate mi it is represented as acc(mi)· The input of the algorithm for horizontal fragmentation is a relation R and a set of simple predicates Pr. The output of the algorithm is a set of fragments { R1, ... , Rn} of R. The objective of the algorithm is that Pr should be complete and minimal, which is defined as follows: • Completeness of simple predicates A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to two tuples belonging to the same minterm fragment that is defined according to Pr. For example, if we assume there is a relation PROJECT ={PNumber: NAT, PName: STRING, Budget: NAT, Campus: STRING, Description: STRING, Begin: DATE, 33 End: DATE} , there are two applications defined on it, and there are only three different locat ions for Project. Find the budget of projects at each location. (a) Find projects with budget less than $200000. (b) According to applicat ion (a) Pr={Campus= 'Wel', Campus='P N', Campus='Auk '} is 11ot complete with respect to application (b) because application (b) will access two tuples in a fragment defined by predicate Pr with different probability if the values of the budget of one object is more than or equal to $200000 and that of the other object is less than $200000. A complete set of simple predicates should be P r={Campus='Wcl ', Campus='PN', Campus='Auk', Budget<200000, Budget ~ 200000 }. • Minimality of simple predicates If a predicate influences how fragmentation is performed (e.g. J be partitioned into Ji and Jj), there should be at least, one application that accesses Ji and fj differently. In other words, the simple predicate should be relevant in determining a fragmentation. If all the predicates of a set, Pr are relevant, then Pr is minimal. If Pi E 714 and fragment fi is determined by mi, 'Pi E mj and fragment J1 is deter­ mined by mj, then Pi is relevant iff ace( mi) # ace( mj) card(fi) card(Jj) Where acc(mi) and acc(mj) denote the access frequencies of minterm predicate mi and mj, card(fi) and card(fj) denote t he cardinalities of fragment fi and Jj. For instance, a set of simple predicates Pr = {Campus='Wel', Campus='PN', Cam­ pus='Auk', Budget< 200000, Budget ~ 200000} is minimal in addition to being complete. However , if there is another predicate Pj :=PName='BigNet' in Pr , then Pr ={Campus= 'Wel', Campus= 'PN', Campus= 'Auk', Budget<200000, Budget ~ 200000, PName='BigNet '} is not minimal because there is no application that ac­ cesses fragment Ji (PName is 'BigNet ') and Jj (PName is not 'BigNet') differently. 34 Ozsu and Valduriez [24] first present an iterative algorithm named COM_MIN to generate a complete and minimal set of predicates Pr' from a given set of simple predicates Pr. The algorithm checks each predicate Pi in the given set of simple predicates Pr to see if it can be used to partition the relation R into at least two parts which are accessed differently by at least one application. If Pi satisfies the fundamental rule of completeness and minimality then it should be included in Pr'. If Pi is nonrelevant then it should be removed from Pr'. But this algorithm is not practical because checking Pi can not be defined with machine readable language. Moreover , it does not consider the fragment combination possibility that some of the minterm fragments might be allocated to the same site. A algorithm named PHORIZONTAL is introduced to describe primary horizontal fragmen­ tation. It uses the algorithm COM_MIN and a set of implications I as inputs to produce a set of satisfiable miniterm predicates M . If a mintcrm predicate mi is contradictory to a implication rule in I , then it is removed from M. Minterm fragments are defined according to the set of satisfiable minterm predicates M. But the set I of implications is hard to define. For the derived horizontal fragmentation , semijoin operations are involved. Member re­ lations are fragmented according to the fragments of its owner relation. Details of the definition of member and owner relations can be found in [24]. They emphasize that care should be taken with the relations that have more than one link to the owner relations. Two criteria are suggested [24] . First, choose the fragmentation with better join characteristics. Second, choose the fragmentation used in more applications. In fact, the algorithm is not very practical, as it will always result in a subset Pr' of P r, the set of minterm predicates M' determined by Pr' and the corresponding set of fragments. Simple predicates are omitted from Pr if they do not contribute to the fragmentation , i.e. they only violate the minimality principle. This emerges to considering just the simple predicates AiOVi in the most important queries and to take all satisfiable minterm predicates. This obviously leads to fragments that are accessed different ly by at least two queries. The algorithm further does not give executable rules for eliminating the unsatisfiable minterm predicates. 35 The major problem, however, is that the number of fragments resulting from the algorithm is exponential in the size of Pr. In practice, it would be important to reduce this number significantly, which would mean to re-combine some of the fragments. In fact, this implies giving up the completeness principle and replacing it by optimization criteria based on a cost model. Several other researchers have worked on fragmentation in the relational data model. Na­ vathe et al. [21] define a schema for simultaneously applying the horizontal and vertical fragmentation algorithms on a relation to produce a grid . They use a technique similar to the vertical fragmentation presented in [20, 22] to produce horizontal fragments. The fragmentation schema generated by the algorithm is independent of the sequencing of the horizontal or vertical fragment algorithms. Tamhankar and Ram [33] introduce an inte­ grated methodology for fragmentation and allocation. They make an attempt to combine fragmentation, allocation and replication into a single step of distribution design and apply the combination to a practical problem. 3.7.2 Vertical Fragmentation Vertical fragmentation is more complicated than horizontal fragmentation because of the total number of alternatives. If there are n simple predicates that are used to define horizon­ tal fragmentation , there are at most 2n possible fragments that can be defined. But in the case of vertical fragmentation, if a relation has m nonprimary key attributes, the possible fragments are given by the Bell number which is approximately B ( m) ~ mm. From the value of the possible vertical fragmentation, we find out it is impossible to get the optimal solutions to the vertical partitioning problem. We can only expect to find out a heuristic solution. There are several algorithms of vertical fragmentation that have been proposed in the lit ­ erature. Hoffer and Severance [7] measure the affinity between pairs of attributes and try to cluster attributes according to their pairwise affinity by using the bond energy algo­ rithm(BEA). Navathe et al.[20] extend the BEA approach and proposed a two-phase approach for vertical 36 partitioning. In the first step, they used the given input parameters in the form of an attribute usage matrix(AUM) to construct the attribute affinity matrix (AAM) on which clustering is performed. In the second step, estimated cost factors , which reflect the physical environment of fragment storage, were considered to further refine the partitioning schema. Cornell and Yu [11] apply the work of Navathe et al.[20] to physical design of relational databases. This approach uses specific physical factors such as the number of attributes, their length and selectivity, and cardinality of the relation. Navathe and Ra [22] construct a graph-based algorithm to solve the vertical partitioning problem where the heuristics used include an intuitive objective function which is not ex­ plicitly quantified. With the aim to overcome the complexity of attribute based algorithms, P.-C. Chu [8] proposes a transaction-oriented approach to vertical partitioning, in which no Attribute Utility Matrix but transaction information is used as the decision variable. The algorithm presented in Ozsu and Valduriez [24] takes AAF as input. The Bond Energy algori thm is employed to evaluate the togetherness of a pair of attributes. Shift operation is used to process the attribute clustering. The binary partitioning algorithm should be applied recursively when there is a large set of attributes. Muthuraj et al [19] propose a formal approach to address the problem of an n-array vert ical partitioning problem and derive a partition evaluator function which describes the affinity value for clusters of different sizes. This function can either be used for fragmentation progress or applied to evaluate the fragmentation schemata that are created and therefore to test and evaluate the different algorithms available. They argue that an attribute usage matrix instead of an attribute affini ty matrix (AAM) should be used in the process of fragmentation because AAM can only measure the closeness of a pair of attributes at one time and cannot measure the closeness of the entire cluster which may consist of more than two attributes. Both horizontal and vertical fragmentation problems are complex. Studying only one of them will be a hard task. This thesis will only concentrate on horizontal fragmentat ion and will not handle vertical fragmentation in detail. Chapter 4 Object Oriented Databases Object-oriented databases have brought about a fundamental change in the way a data and the procedures that operate on the data are viewed. Whilst general agreement on the broad features to be supported by an object oriented database system is slowly being reached, as yet there is still no firm agreement on a formal definition of an object-oriented database system [12, 30, 32]. Additionally, there is still much debate on which underlying object oriented data model is appropriate for database systems. However , it is widely accepted that objects are abstract ions of real world entities. Objects are regarded as the basic unit of persistent data [30], and the object oriented databases are composed of independent objects. A unique identifier should be assigned to an object because t he existence of an object should be independent of its value [2, 30, 18]. The using of immutable object identifiers enable sharing, mutabili ty of values and easily representation of cyclic structures. Our research will apply the object oriented data model (OODM) which is presented in [30 , 31]. This model applies an abstract object identifier to capture the fact that an object in the real world always has an unique identity. At the same time, an object in the real world can have different aspects and should not only be coupled with a unique type. In contrast, objects, as well as references to other objects, should be associated with more than one type that can change during the object's lifetime. The section below will present some fundamental concepts of the object oriented data model. It will also depict concepts by giving an object oriented database schema as well as an instance of the database. 37 38 4.1 Fundamentals of the OODM This section will review the object model proposed in Schewe & Thalheim [30] . In this model each object o consists of a unique identifier id, a set of (type-, value-) pairs (7i, vi), a set of (reference-, object-) pairs (refi, 01) and a set of methods methk. Values and objects are different in that values can be identified by themselves while objects can only be identified by applying an external identification mechanism. Types are used to structure values while classes are the groups of objects that have the same structure which uniformly combines aspects of object values and references. Subtyping is used to relate values with different types [31]. In this thesis, in order to simplify the problem of distribution design, the behavior( method) part will be omitted completely. The object oriented data model that we adapt to this project is just a simplified version of the data model introduced in [30, 31]. 4.1.1 Type Definitions The type system presented in [30, 31] consists of some basic types, type constructors and subtyping relation. Definition 4.1. (i) The base types are either BOOL, NAT, INT, FLOAT, STRING, ID, or .l, where ID is an abstract identifier type without any non-trivial supertype and .l is the trivial type that is a supertype for every type. (ii) Let Np and NF denote parameter-names and function-names. Let a i E NF and a,ai E Np(i = 1, ... ,n). A type constructor is either a record constructor (a1 : a1 , ... , an : an), a finite set constructor {a} , a list constructor [a], a bag constructor (a) or a union constructor (a1 : ai) U · · · U (an : an)· (iii) A type t is called proper iff the number of its parameters is 0. If there is no occurrence of ID in t , t is called a value type. If t' is a proper type occurring in a type t , then there exists a corresponding occurrence relation o: t x t'--+ BOOL. New types can be defined by nesting using predefined base types, such as BOOL, NAT, STRING, etc, and predefined constructors for records, finite sets, lists, unions, etc. So we 39 can first define some types in the example below. Example 4.1. First, we define PERSONNAME by using both a set constructor{-} and record constructor ( ·): Type PERSONNAME = (FName: STRING, LName: STRING, Title: {STRING}) End PERSONNAME Then we define PERSON based on type PERSONNAME. Type PERSON = (PersonID: NAT, Name: PERSONNAME, Address: STRING, DOB: DATE) End PERSON We can also define the following types which will be used when we define a university database schema in the next subsection. Type PROJECT = (PNumber: NAT, Pname: STRING, Begin: DATE, End: DATE, Description: STRING 40 Budget: STRING) End PROJECT Type COURSE = (CNumber: NAT, Cname: STRING) End COURSE Type DEPARTMENT = (Dname: STRING, {(TelNumber: NAT, Campus: STRING)}) End DEPARTMENT Type ROOM = (Building: STRING NO: NAT Campus: STRING) End ROOM Type SEMESTER = (Year: NAT, Season: NAT) End SEMESTER Subtypes are also introduced in [30]. They are used to relate values in different types. D Definition 4.2. Let a 1 , . . . , an be parameter-names. A subtype relation::; on types is given by the following rules: 41 (i) Every type t is its own subtype and a subtype of ..l. (ii) NAT:SINT:SFLOAT. (iii) ( ... , ai-l : O'.i-1, ai: ai, ai+l : ai+1, ... ) < ( ... , ai-l : a~_ 1 , ai+l : a~+l> .. . ) whenever aj:Saj. {ai} < {aj} (iv) [ai] < [aj] iff O'.i :S O'.j· (a) < (aj) (v) {a} :S (a) and [a] :S (a) . (vi) · · ·U(ai-1: O'.i-1)U(aH1: ai+1)U · · · :S · · ·U(ai-1: a~_ 1 )U(ai: a~)U(aH1: a~+ 1 )U .. . . whenever ai :S aj for all j = 1, .. . , n. Example 4.2. We can define STUDENT and LECTURER as subtypes of PERSON de­ fined in the above example. Type STUDENT = (PersonID: NAT, StudentID: NAT) End STUDENT Type LECTURER = (PersonID: NAT, Name: PERSONNAME, Specialization: STRING) End LECTURER D 42 4.1.2 Class Definitions In the OODM presented in [30], classes are used to structure objects having the same structure and behavior, while types are used to structure values. Each object in a class has an identifier, a collection of values, references to other objects and methods. Identifiers can be represented by using the unique identifier type ID. An object, which has multiple aspects, can simultaneously belong to different classes. This property guarantees that each object of the abstract object model can be captured by the collection of possible classes. A class structure allows us to uniformly combine aspects of object values and references. Relationships between classes are represented by references and referential constraints on the object identifiers involved [30]. Because we are not looking at the dynamics, the model we apply in this thesis will not consider methods. Definition 4.3. (i) Lett be a value type with parameters a 1 , ... , an such that ID does not occur in t. If the parameters are replaced by pairs ri : Ci with pairwise differ­ ent reference names ri and class names Ci, then the resulting expression is called a structure expression. (ii) A class consists of a class name C, a structure expression expc, a set of class names D1, ... , Dm (called superclasses). The proper type derived from expc by replacing each reference ri : Ci with the type ID is called representation type Tc of the class C. 4.1.3 Schema Definition The database schema is designed by a finite collection of type and class definitions [30]. Let us first review the definition of schema and the way to make it closed. Then we will look at an instance of a schema and some integrity constraints. Definition 4.4. A schema S is a finite set of classes that is closed in that all names appearing in a structure expression or as superclass must be names of classes defined in the schema. 43 Definition 4.5. An instance db of a structure schema S assigns to each class C E S a finite set db( C) of values of type (id : ID, value : Tc) such that the following conditions are satisfied: uniqueness of identifiers: For every class C we have Vid :: ID.Vv,w :: Tc.(id,v) E db(C) /\ (id,w) E db(C)::::} v = w. inclusion integrity: For a subclass C of C' we have Vid :: ID. id E dom(db(C))::::} id E dom(db(C')). Moreover, if Tc is subtype of Tb with subtype function f : Tc___, Tb, then we have Vid :: ID.Vv :: Tc .(id ,v) E db(C)::::} (id,f(v)) E db(C') referential integrity: For each reference from C to C' with corresponding occurrence relation Or we have Vid, id':: ID.Vv :: Tc .( id, v) E db(C) /\ or(v, id')::::} id' E dom(db(C')) where dom(db(C)) ={id: : I Div:: Tc.(id,v) E db(C)}. 4.2 An Example of Object Oriented Database Schema In this section, we show an example of a database schema and its instance, the contents of the database at a given time point. Example 4.3 shows an object oriented database schema transformed from a Higher Order Entity Relationship Schema in [34]. Figure 4.1 is the Higher Order Entity Relationship Model (HERM) diagram of the schema. We use the types defined in Example 4.1 to build the structural part S of the OODM schema. Therefore the structure of a class can be based on a type definition defined above or on a nameless type definition. It may also involve an IsA relation to model objects in more than one class. We use o to indicate concatenation for record types. 44 ID FNamc LNamc Building No Rcqnircct Require• ~ Bq;in ~ Dcscripcion End ~c.,.p.li Number PName Figure 4.1: HERM Diagram of University Database Schema [34] Example 4.3. We design a structural schema as below: Schema University Class PERSONC Structure PERSON End PERSONC Class SEMESTERC Structure SEMESTER End SEMESTERC Class RooMC Structure ROOM End RooMC Class DEPARTMENTC Structure DEPARTMENT End DEPARTMENTC Class COURSEC Struct ure CO URSE o (Requires: { r: CouRSEC}) End CouRSEC Class LECTURERC IsA PERSONC Structure LECTURER o (Department : D EPARTMENTC, Campus: STRING) End L ECTURERC Class P ROJECTC Structure PROJECT o (Primary Investigator : { P ERSONC U LECTURERC } ) 45 46 End PROJECTC Class PAPERC Structure (Course: COURSEC , Semester: SEMESTERC, Campus: STRING, Lecturer: LECTURERC, {( Day: DATE, Hour: NAT, Room: RooMC)}) End PAPERC Class STUDENTC IsA PERSONC Structure STUDENT o (Supervisor: LECTURERC, Major: D EPARTMENTC, Minor: D EPARTMENTC, Enroll: {(PAPERC, Grade: STRING)}) End STUDENTC D After we define the structure of the schema, we are going to describe the instances of the above schema. The instance of a database schema is t he content of the database at a given time point. We need a type ID of object identifiers to uniquely and efficiently identify the objects and to model objects in different classes and references to other objects. We use db as a name of the instance of the database schema. 47 Example 4 .4 . An instance of the above university database schema is presented as follow­ ing: db(P ERSONC) = {(i101 , (PersonID : 1001 , (FName: John,LName: Dever,{Professor ,Dr} ), Address : 28 Victoria Av, DoB:16/ Jan/ 1953)) , (i102, (PersonID: 1002, (FName: Allan, LName: Barry, Ti t le: {Senior Lecturer , Dr} ), Address: 66 Albert St, DoB:23/ Feb/ 1958)), ( i103, (PersonID : 2010, (FN ame: Shirley, Churchill , T it le: {Lecturer} ), Address: 531 Tramine Av. , DoB:lO/ Oct/1960)), (i104 , (PersonID : 3203 , (FName: Jerry, LName:Hubbard ,Title: {HoD, P rofessor, Dr} ), Address : 32 Ada St , DoB:02/ May/ 1945)), (i105, (PersonID : 2618, (FName: LName: James, LName: Hooks,Tit le: {Lecturer} ) , Address :l16 College St, DoB:28/ Jun/ 1960)), (i105, (PersonID : 4322, (FName: LName: Jill , Heslop, Tit le: { } ), Address: 22 Fegerson St, DoB:ll / Jul / 1985)), (i107 , (PersonID : 4198, (FName: LName: Frances,LName: Caban,Title: { } ), Address: 78 Cuba St,DoB:30/ Nov/ 1983)) , (i10s, (PersonID : 4077, (FName: LName: Lindsay,LName:Hamilton ,Title: { } ), Address: 29 Church St, DoB:06/ Dec/1982)), (i1og, (PersonID : 4198, (FName: LName: Jeff, LName: Perera,Title: { } ), Address: 99 Broadway Av, DoB:18/ Aug/1980)) , (i11o, (PersonID : 2396, (FName: Lindsay, LName: Kirton, Title: { } ), Address:195 King St, DoB:03/Sept/1975))} 48 db(SEMESTERC) = {(i201, (Year: 2000,Season: 01)), (i202, (Year: 2000, Season: 02)), (i203, (Year: 2001 , Season: 01)), (i204, (Year: 2001, Season: 02)), (i205, (Year: 2002, Season: 01)), ( i2o6, (Year : 2002, Season : 02)), (i207, (Year: 2003, Season: 01))} db(RooMC) = {(i301 , (Building: SSLB, NO: 2, Campus: PN)), (i302, (Building: SST, NO: 1, Campus: WN)), (i303, (Building: Marsdon, NO : 1, Campus: PN)), (i304, (Building: AH, NO : 2, Campus: ALB)), (i305, (Building: BSC, NO: 203, Campus: ALB))} db(DEPARTMENTC) = {(i401, (Dname: Information Systems, {(TelNumber: 063566199, Campus:PN) , (TelNumber : 045763112, Campus:WN)} )), (i402 , (Dname: Accounting, {(TelNumber: 063563188, Campus:PN), (TelNumber: 098132699, Campus:ALB)})), (i403, (Dname: Marketing, {(TelNumber: 063564132, Campus:PN), (TelNumber: 045663188, Campus:WN)} ))} db( COURSEC) = {(i501, (CNumber: 110.001, CName:Accounting Principle, Requires: { } )), (i502 , (CNumber: 110.105, Taxation, Requires: { i501} )), (i503, (CNumber : 156.100, Principle of Marketing, Requires: {} )) , (i504, (CNumber: 157.221 , Information Systems Analysis, Requires: {i503 })) , (i505, (CNumber: 157.331 , Database Concepts, Requires: {i504 } ))} db(LECTU RERC ) = { ( i1o1, (PersonID : 1001 , (FN ame: John, LN ame:Dever , Title: {Professor , Dr} ), Specialization: Commercial Law, Department:ii402, Campus:ALB)), (i102, (PersonID : 1002, (FName: Allan , LName:Barry, Ti t le: {Senior Lecturer, Dr} ), Specialization: Pricing, Department :ii403 , Campus:PN)), (i103, (PersonID : 2010, (FName: Shirley, LName:Churchill , Ti t le: {Lecturer} ), Specializat ion: Databases, Department:ii40l , Campus: WN)), 49 (i104, (PersonID : 1002, (FName: Jerry, LName:Hubbard , Title: {HoD Professor , Dr} ) , Specialization: Distributed Systems, Department :ii40l , Campus:P N)), (i105 , (PersonID : 1002, (FName: J ames , LName:Hooks, Tit le: {Lecturer} ), Specialization: Culture and Accounting, Department:ii402, Campus:WN)) } db(PROJECTC ) = {(i101 , (PNumber: pOOOl , PName: DIMO, Begin: Jun 2000, End: Dec 2002 , Descript ion: Mult ilevel transaction ... , Budget : 100000,Primarylnvestigator: {i103, i104} )), (i102, (PNumber: p0201 ,PName: Consumer Behavior , Begin: J an 2002, End: Dec 2002, Description: The patterns of ... , Budget: 3000,Primarylnvestigator : { i102})), (i703, (PNumber: p0301 ,PName: Small Business Accounting, Begin: Feb 2003 , End: Dec 2003, Description: Small businesses are ... , Budget: 5000,Primarylnvestigator : { iio1 , i uo}))} 50 db(PAPERC) = {(is01, (Course:: iso1, Semester:i201, Campus:ALB, Lecturer:i101, {(Day:Mon, Hour: lOam, Room:i304), (Day:Thur, Hour: 2pm, Room:i304)} )), (iso2, (Course: : iso3, Semester:i202, Campus:ALB, Lecturer:i102, {(Day:Wed, Hour: 9am, Room:i305), (Day:Fri, Hour: lpm, Room:i305)} )), (iso3, (Course: : iso2, Semester:i203, Campus:WN, Lecturer:i105, {(Day:Tues, Hour: lpm, Room:i302), (Day:Thur, Hour: 3pm, Room:i302)} )), (iso4, (Course: : iso4, Semester:i204, Campus:WN, Lecturer:i103, {(Day: Mon, Hour: 9am, Room:i302), (Day:Wed, Hour: lpm, Room:i302)}))} (isos, (Course: : iso4, Semester:i205, Campus:PN, Lecturer:i104, {(Day:Mon, Hour: 11am, Room:i301), (Day:Wed, Hour: 2pm, Room:i301)} )), (iso6 , (Course: : isos, Semester:i205, Campus:PN, Lecturer:i104, {(Day:Mon, Hour: llam, Room:i301), (Day:Wed, Hour: 2pm, Room:i303)} )), (is07, (Course: : isos, Semester:i207, Campus:PN, Lecturer:i104, {(Day:Tues, Hour: lOam, Room:i301), (Day:Thur, Hour: 3pm, Room:i303)} ))} db(STUDENTC) = {(i105, (PersonID:4322,StudentID: 99368,Supervisor: iio1, Major: i402, Minor: i4o3, {(Paper : iso1, Grade: A+), (Paper: iso2, Grade: B+)})) , (i101, (PersonID:4198, StudentID: 00695, Supervisor: i105, Major: i402, Minor: i1101, {(Paper : iso3 , Grade: B) , (Paper: iso4 , Grade: A)})) , (i1os, (PersonlD:4077, StudentlD: 01396, Supervisor: i105, Major: i401, Minor: NUL,{(Paper :isos, Grade: A-), (Paper: iso1, Grade:A)})) (i109, (PersonlD:4198, StudentlD: 02396, Supervisor: i105, fviajor: i401, lVIinor: NU L , {(Paper : isos, Grade: B+), (Paper: iso1, Grade: NU L)}) )} 4.3 Queries 51 D For object oriented databases we must define some bas ic query algebra that can be used to define queries on database. To define such a query algebra we have to refer to elements of classes and their components. For this I will define path expressions. The discussion in the following subsection will cover the situations of path expressions defined on elements of base types, record types, set types, union types and references. 4.3.1 Path Expressions In the relational data model, only the record type constructor is used and each attribute is defined on a base type. In the case of the object oriented data model, the whole underlying type system includes not only record type constructors but also some other bulk type constructors. A type system can be expressed as [28, p. 9]: 52 In fact, there could be further type constructors, e.g. for lists and multisets, but I will concentrate on this simplified type system here. Details of the above type system have been provided in Section 4.1. In object oriented databases, for a given class C with structure expression expc and representation type Tc, a database instance of a class C satisfies db(C) ~ {(i,v) Ii E dom(ID ),v E dom(Tc)} The complete definition of a database (or instance of a database schema) was given in Definition 4.5. We define the path expressions as below. D efinition 4 .6. Let C be a class with structure expression expc and representation type Tc . Path expressions for class C (or expc) may have the following formats: (i) ident of type ID, (ii) value of type Tc, (iii) if expc uses a record type, i.e. expc = (a1 : exp1 , . . . , an : expn) , then we get path expressions: • value.°'i of type n which is a representative type for expi and 1 ::::; i ::::; n, • value!ai of type To if ai is a reference to class D, i.e. ai : expi = ai : D , • value.ai.pathi where pathi is a path expression for expi, • value!ai.pathi where ai is a reference to class D , i.e. ai : expi = ai : D and pathi is a path expression for exp D. (iv) if expc use a union type, i.e. expc = (a1 : exp1) U · · · U (an: expn), then we get path expression: • value.ai of type n which is the representation type for expi, • value.ai.pathi where pathi is a path expression for eXPi· ( v) if ex pc uses a set type, i.e. expc = {exp} , then we obtain path expression: 53 • value of type Tc= {exp}, • value.path where path is a path expression corresponding to exp. (vi) if expc uses only a reference , i.e. expc = r : D, then we obtain path expression: value!r.path where path is path expression for the class D. Remark: If r : D appears in expc, i. e. expc = ... , ai : r : D , ... , we get Ti = ID . Thus path expressions are: • value!ai.r of type Tv, • If TD has structure expression expD, eg. expD = (ail : expi1, ... , ain : eXPin)· Then this situation is treated as if r : Dis replaced by expv. Therefore the structure expres­ sion of class C can be represented with expc = ... , ai : expv, .... The representat ion type of class C is Tc = ... ai : TD .... Then we get path expression value!ai to refer to the value of type Tv, i.e. if expv = (ai 1 : expi 1 , ••• , ain : expin), We can use to refer to the value of the component in expD and handle it as if there were no references. If path is a path expression, we use Tpath to denote the representation type of the structure expression that is identified by the path. Example 4.5. We choose class LECTURERC from university database schema. Class LECTURERC IsA PERSONC Struct (PersonID: NAT, Name: (FName: STRING,LName: ....__...., '-v-' path 1 path2 STRING, Title: {STRING}) , Specialization: STRING, Department: DEPARTMENTC, '-v-' "--v-"' path3 Campus: STRING) path5 There are some path expressions defined on class LECTURERC as following: 54 (i) ident (ii) value (iii) path1 =value.Name path2 = value.Name.FName path3 =value.Title path4 = value.Specialization (iv) path5 = value!Department path~ = value!Department.Name path5 refers to the identifier of the department being referenced and path~ refers to the name of the referenced department. D 4.3.2 Queries From application information we can get a set of queries accessing databases. For the relational model, we can use relational algebra to model queries and to optimize queries. We must provide a model for querying object oriented databases. Basically, an algebra in the general mathematical sense is given by a set A and a set of operations op on A [29]. In the object oriented model, each query results in a set of pairs (id, v), where id is an identifier and v is a value of some proper type. The value v may contain identifiers, which must appear in the database, to which the query is applied. More generally, it would also be possible that the identifiers appear in the query result , but much more complicated queries are not handled here. We refer to such a set of pairs as a "class instance". A query Q on S consists of a structure expression expQ called answer schema (with all references pointing to classes in S) and an algebra expression q, i.e. a query Q is defined in the form Q = (expQ, q). Every operator in the query algebra accepts (one or two) class instances as arguments and returns another class instance as a result [25]. 55 We start by saying that every class name CE S can be considered as a query q = C. The answer schema is expc itself and database instance db is mapped via query q to db(Q) , which is a new class instance db(Q) extending db over Stoa new database, which contains a set of new objects with new identifiers created for each of them. Also, pairs (v : T) with value v of type T is considered as a query. Definition 4.7. Let S be a database schema, db be a database instance over S , id(db) be the set of all identifiers appearing in db , db( Q) be the resulting class instance of evaluating query algebra Q = (expQ, q). There are two basic query algebra operations: (i) q = C with C is a class name appeared in S with a query expressed as Q = (expc, q), result ing in db(Q) = { (idw,v) J idw :: ID ,idw ~ id(db).3id :: ID .(id,v) E db(C)}. (ii ) q = (v : T) with a type T and a value v of type T and Q = (T,q), resulting in db(Q) = {(idw, v) J idw :: ID f\ idw ~ id(db)} . Example 4.6. the university database instance db in Example 4.4 contains an instance db(RooMC) of class RooMC : db(RooMC) = { ( i 301 , (Building : SSLB, NO : 2, Campus : PN)), (i302, (Building: SST, NO: 1, Campus: WN)), (i303, (Building: Marsdon, NO: 1, Campus: PN)), (i304, (Building: AH, NO : 2, Campus: ALB)) , (i305, (Building: BSC, NO : 203 , Campus: ALB))} A query operation q =ROOMC for a query Q1 (eXPRooMC, RooMC) with a resulting 56 instance of Q as db( Qi) = {(i1001 , (Building: SSLB, NO: 2, Campus: PN)), (i1002, (Building: SST, NO: 1, Campus: WN)), (i1003, (Building: Marsdon, NO: 1, Campus: PN)), (i1004, (Building: AH, NO : 2, Campus: ALB)), (i1005, (Building: BSC, NO : 203, Campus: ALB))} Note the result of the query on db(C) is a new class that contains a set of objects with new identifiers which have not appeared in the database. But the values for all the attributes are the same. Another query Q2 =((Building: STRING, NO : NAT, Campus: STRING),(Building: SSLB, NO : 2, Campus: PN)) results: db(Q2) = {(igsoo1, (Building: SSLB, NO: 2, Campus: PN))} D We can define a selection operation with a query algebra q. To do this we need to use path expressions path for class C to define selection formulae. Path expressions have been defined in the previous subsection. Definition 4.8. Let C be a class, path be path expressions on class C, take either • two path expression path1, path2 of C of some type, • or a path expression path on class C and a value v of the type of path. Selection formulae c.p can be defined in either of the following forms: • c.p = path1 = path2, • or c.p =path = v. 57 Definition 4.9. (selection) Let S be a database schema, db be a database instance over S , id(db) be the set of all identifiers appearing in db. Take any query Q = (expQ, q) and a selection formula c.p for expQ we get: • a query algebra q' = O"cp(Q) for selection operation, • a query Q' = (expQ, q') with expQ' = expQ, • evaluating Q' on db results in a database instance: db(Q') = {(idw,v ) I idw :: ID /\ idw rt id(db).3id :: ID.(id ,v) E db(Q) /\ c.p(v) =true } Example 4. 7. There is a database instance db over the university database schema. With a query algebra q = PERSONC we have a query Q = (expPERsoNC , q) , then db is extended by adding a db( Q) within which all the objects' identifiers have not occurred before. Then we have a selection formula: c.p = value.Name.FName = 'John' . Using query operation O"cp(Q) we get a result as db(Q') = {(i95101 , (PersonID: 1001 , (FName: John ,LName: Dever ,{Professor,Dr} ), Address: 28 Victoria Av, DoB:16/ Jan/ 1953))} D Definition 4.10. (renaming operation) Let Q = (expQ, q) be a query, attr(Q) denote all the attribute names appearing in expQ with ai E attr(Q). A rename operation renames some of the attributes of a class. We have following expressions: • query algebra q' = Pai>--+bi, ... ,an>--+bn(Q), • new query Q' = (expQ, q'), • Evaluating query Q' on db we get: db(Q') = {(idw, v) I idw :: ID /\ idw rt id(db) .3id :: ID.( id, v) E db(Q)} 58 with bi as a new name for attribute ai· In order to define generalized projection, we need to use the definition of super type that is introduced in [26]. We use the form Te2 ::; Te1 to express that Te1 is a super type of type Te2 • This expression indicates a mapping: We can get the following definition for generalized projection. Definition 4.11. (generalized projection) Let Q = (expQ, q) be a query, expQ' be a new structure expression which is a super structure expression of expQ, i.e. TQ ::; TQ' for the representation types. A generalized projection is a mapping: results in a new query: Q' = (eXPQ', 1fQ1(Q)), with db(Q') = { (idw, v') I idw :: ID /\ idw rt id(db).3id :: I D.(id, v) E db(Q).v' = 7fg ( v)} . Definition 4.12. (join) Let Q1 = ( expQ 1 , q1), Q2 = ( expQ2 , q2) be two queries, exp be a common super structure expression with eXPQ; ::; exp (i = 1, 2) , then there exists: • a new structure for join operation exp1 f>