Copyright is owned by the Author of the thesis.  Permission is given for 
a copy to be downloaded by an individual for the purpose of research and 
private study only.  The thesis may not be reproduced elsewhere without 
the permission of the Author. 
 

DISTRIBUTION DESIGN 

IN 

OBJECT ORIENTED DATABASES 

A t hesis p resented in partia l fulfi lment of t he requi rements for the degree of 

MAST ER O F I NFO RMATION S C IENCE 

I N 

I NFORMATION SYSTEMS 

at Massey University, Palmerston North, 

New Zeala nd 

Hui Ma 

2003 


Abstract 

The advanced development of object oriented database systems has attracted much research. 

However, very few of them contribute to the distribution design of object oriented databases. 

The main tasks of distribution design are fragmenting the database schema and allocating 

the fragments to different sites of a network. The aim of fragmentation and allocation is to 

improve the performance and increase the avai lability of a database system. Even though 

much research has been done on distributed databases, the research a lmost always refers to 

the relational data model (RDM). Very few efforts provide distribution design techniques 

for distributed object oriented databases. 

The aim of this work is to generalise distribution design techniques from relational databases 

for object oriented databases. First, the characteristics of distributed databases in general 

and the techniques used for fragmentation and allocation for the RDM are reviewed. Then , 

fragmentation operations for a rather generic object oriented data model (OODM) are de­

veloped. As with the RDM, these operations include horizontal and vertical fragmentation. 

A third operation named splitting is also introduced for OODM. Finally, normal predicates 

are introduced for OODM. A heuristic procedure for horizontal fragmenting of OODBs is 

also presented. The adaption of horizontal fragmentation techniques for relational databases 

to object oriented databases is t he main result of this work . 


Acknowledgements 

I would like to thank Professor Klaus-Dieter Schewe, my supervisor , for his patience, guid­

ance, suggestions and constant support during this research. I am also thankful to Markus 

Kirchberg for his encouragement and guidance through the early stage of chaos and con­

fusion. A special thanks goes to Madre Chrystal for her kindly devoting valuable time to 

proof read my draft. 

The Massey Masterate Scholarship, which was awarded to me for the period February 2002 -

February 2003 for graduate studies, was crucial to t he successful completion of this project. 

Finally, I am grateful to my husband and my parents for t heir patience and love. Without 

them this work would never have come into existence (litera lly). 

Hui Ma 

March 31 , 2003 

iii 


Table of Contents 

1 Introduction 

1.1 What is a Distributed Database? 

1.2 Why Distribution? ...... . . 

1.3 Distribution Design: Fragmentation and Allocation 

1.4 Objective: Fragmentation in Object Oriented Databases 

1.5 The Outline of the Thesis ................ . 

2 Distributed Databases 

2 .1 Characteristics . . . 

2.1.l Data Independence . 

2.1.2 Network Transparency (Distribution Transparency) 

2.1.3 Replication Transparency .. 

2.1.4 Fragmentation Transparency 

2.2 Key Concepts . . . . 

2.2.1 Heterogeneity 

2.2.2 Autonomy .. 

2.2.3 Distribution . 

2.2.4 Classification of Distributed DBMS . 

2.3 A Framework for Distributed Databases . . 

2.3.l The Objective of the Design of Data Distribution . 

2.3.2 The Reasons for Fragmentation . 

2.3 .3 Alternative Design Strategies . 

v 

1 

2 

2 

4 

5 

6 

7 

7 

8 

9 

9 

10 

10 

10 

11 

11 

12 

13 

14 

15 

16 


3 Distribution Design for Relational Databases 

3.1 

3.2 

3.3 

3.4 

3.5 

3.6 

3.7 

The Relational Data Model 

Characteristics ...... 

Horizontal Fragmentation 

Vertical Fragmentation . 

Mixed Fragmentation 

Allocation . . 

Related Work 

3.7.l Horizontal Fragmentation 

3.7.2 Vertical Fragmentation. 

4 Object Oriented Databases 

4.1 Fundamentals of the OODM 

4.1.1 Type Definitions 

4.1.2 Class Definitions 

4.1.3 Schema Definit ion 

4.2 An Example of Object Oriented Database Schema 

4.3 Queries . . . . . . . . . 

4.3.1 Path Expressions 

4.3.2 Queries ..... 

5 Fragmentation Operations in Object Oriented Databases 

5.1 Split Fragmentation ... 

5.2 Horizontal Fragmentation 

5.2.1 Horizontal Fragmentation on Class Level 

5.2.2 Horizontal Fragmentation on Type Level. 

5.3 Vertical Fragmentation for Object Oriented 
Databases .. .... .... ... . . .. . . 

5.3.1 Vertical Fragmentation on Class Level 

5.3.2 Vertical Fragmentation on Type Level 

5.4 Fragmentation Strategies 

5.5 Related Work ...... . 

vi 

19 

20 

21 

22 

26 

29 

29 

30 

30 

35 

37 

38 

38 

42 

42 

43 

51 

51 

54 

61 

62 

64 

64 

66 

68 

68 

70 

71 

72 


6 A M ethod for Horizontal Fragmentation in Object Oriented D at abases 75 

6.1 Simple P redicates . 75 

6.2 Normal Predicates 78 

6.3 The Heuristic Fragmentat ion Process . 80 

6.4 A Cost Model . . . . 82 

6.4. l Query-Trees . 82 

6.5 

6.6 

6.4.2 Calculation of Size for Classes . 

6.4.3 Calculation of Size for Fragments or Intermediate Nodes . 

6.4.4 Allocate Intermediate Nodes to Sites 

6.4.5 Calculation of Query Costs . . . . . 

A Heuristic P rocedure for Horizontal Fragmentation 

Summary . .. .......... . 

7 Conclusion and Possible Ex tens ion 

7.1 Summary . . 

7.2 Future Work 

Bibliography 

83 

88 

89 

90 

92 

97 

99 

99 

100 

103 


Chapter 1 

Introduction 

Relational database systems have been well accepted because they reflect the nature of 

the structures of many organizations and enable the possibility of efficiently and effectively 

sharing the data. As computer-based systems have penetrated all areas, there are increased 

d emands for the non-conventional applications, such ru:; computer-aided design (CAD) , ge­

ographic information systems, image, and graphic database systems, etc. However, the 

complex structure of the data in these applications cannot be adequately modelled by tra­

ditional relational databases. Consequently, object oriented databases are due to take the 

place of relational ones and become more and more popular. At the same time network fa­

cilities enable the distribution of database systems. However, approaches have been limited 

to t he relational data model which lead to the emergence of distributed relational database 

management systems (DRDBMSs). But distributed object oriented database management 

systems (DOODBMSs) are due as well. 

It is desirable to design a distributed object oriented database in such a way that the system 

can perform efficiently and effectively. The techniques of distribution design of relational 

databases have been intensely studied from the 1980s. There is also substantial research 

contributing to the study of the object oriented data model , but there is little research on 

the distribution design of object oriented databases. 

The aim of this thesis is to study the existing (mature) distribution design techniques for 

relational databases and to adopt them into an object oriented approach. The following 

sections of this chapter will first review the definit ion of a distributed database and then 

study the reasons of the emergence of distributed database systems. Two main distribution 

1 


2 

design techniques will also be briefly defined. After reviewing some research work that 

has been done in DRDBMSs we will set up the objective of studying fragmentation in 

DOODBMSs. Finally, the outline of this thesis will be presented. 

1.1 What is a Distributed D atabase? 

Ceri and Pelagatti [6] define a distributed database as a collection of data that logically 

belongs to the same system but is spread over the sites of a computer network. Ozsu 

and Valduriez [24] give a similar definition: a distributed database system is a collection of 

multiple, logically interrelated databases distributed over a computer network. They explain 

that the logically related files, which are individually stored at each site of a computer 

network, are not enough to form a distributed database. There need to be a structure among 

them. Ceri and Pelagatti [6] support this view and state that the data at different sites must 

have propert ies that tie them together, and that access to the files should be via a common 

interface. They explain that physical distribut ion means that data does not reside at the 

same site in the same processor. Ozsu and Valduriez [24] point out that physical distribution 

does not necessarily imply that the computer systems are geographically distributed. The 

sites among the network could even have the same address. They could be in the same room, 

but the communication between them is done over a network instead of shared memory, 

and the communication network is the only shared resource. 

1.2 Why Distribut ion? 

Ceri and Pelagatti [6] and Ozsu and Valduriez [24] describe the motivation for the develop­

ment of distributed databases. Distributed database research is motivated by the reliability, 

performance, and economic concerns of distributed databases in organizations. Reliability 

refers to the ability to tolerant faults. Performance refers to the ability to reduce query 

response time and increase throughput. Economic concerns relate to the reduction of data 

communication and update synchronization costs. 

• Organizational Reasons 


3 

In recent years, the demand for more information by industries, governments and aca­

demic institutes has led to databases that have exceeded the physical limitations of 

centralized systems. The advances of telecommunication techniques make distributed 

database systems more affordable and useful [14, l]. Ceri & Pelagatti [6] state that 

distributed databases are motivated by organizational reasons: many organizations , 

especially global organizations are often decentralized. For such organizations, imple­

menting the information system in a decentralized way might be more suitable. 

• Economic Reasons 

Ceri & Pelagatti [6] state that economic reasons are another motivation for the devel­

opment of distributed databases. They argue that large, centralized computer centres 

are becoming questionable with respect to economies of scale. Ozsu & Valduriez [24] 

support this view and state that it normally costs much less to put together a sys­

tem of smaller computers with the equivalent power of a single big machine due to 

the advance of minicomputers and microcomputers. The authors also state that the 

communication cost can be reduced when distributed databases are implemented. If 

databases are geographically dispersed and the application accessing them are at the 

intersection of dispersed data, it will be much more economical to partition the re­

lations and the applications so that the data processing can be done locally at each 

site. 

• Reliability and Availability 

Improved reliability and availability is one of the potential advantages of distributed 

databases which the centralized databases lack [6, 24]. When replications of data have 

been placed at different sites, the crash of one site or the failure of the communication 

link would not necessarily make the data impossible to reach. When the system 

crashes and the communication link fails, even though some of the data will not be 

accessible, the distributed database system still provides limited services. 

• Interoperability of Existing Databases 

Ceri & Pelagatti [6] also mention that when there are several databases already ex­

isting in an organization and there is the necessity of executing global applications, 


4 

distributed databases are the natural solution. In this case, the distributed database 

is designed bottom-up from existing local databases. These local databases may need 

to be reconstructed to some degree, but it is much cheaper than building a new inte­

grated distributed database. 

• Expandabili ty 

Ozsu & Valduriez [24] and Ceri & Pelagatti [6] state that it is easier to accommodate 

increasing database sizes in a distributed environment. If an organization grows by 

adding new and relatively autonomous branches or warehouses, then the distributed 

database approach supports the information needs of the new sites with the minimum 

degree of impact on the existing system. 

• Local Autonomy 

Local autonomy is emphasized by Ceri & Pelagatti [6] as a major reason that many 

business organizations consider a distributed information system. Since data is dis­

tributed, a group of users that commonly share such data can have this data placed at 

the site they work. Thus, the local controls are allocated to local users to enable them 

to take partial responsibility for information management in the distributed database . 

1.3 Distribution Design: Fragmentation and Allocation 

Distribut ion design is one of the major research problems whose solution will enhance 

performance of the distributed databases. It involves data acquisition, fragmentation of 

databases, allocation and replication of the fragments, and local optimization. Fragmenta­

tion and allocation are the most important elements of a distributed database design phase. 

They play important roles in the development of a cost efficient system [24] . 

Fragmentation is a design technique to divide a single database into two or more partitions 

such that the combinat ion of the partitions yields the original database without any loss 

or addit ion of information [25]. This reduces the amount of irrelevant data accessed by the 

application, thus reducing the number of disk accesses. The result of the fragmentation 

process is a set of fragments defined by a fragmentation schema. Fragmentation can be 

either horizontal, vertical or mixed. 


5 

Horizontal fragmentation partitions a relation or a class into disjoint unions (fragments), 

which will have exactly the same structure but different contents. Thus a horizontal frag­

ment of a relation or class contains a subset of the whole relation or class instance. Vertical 

fragm entation results in attributes and methods being partitioned into different fragments 

and therefore reduces irrelevant data accessed by applications [28]. 

Allocation is the process of assigning a node on the network to each fragment after the 

database has been properly fragmented [24]. When data is allocated, it may either be 

replicated or maintained as a single copy. The replication of fr agments will improve the 

reliability and efficiency of read-only queries. The intention of allocation is to minimize the 

data transfer cost and the number of messages needed to process a given set of applications, 

so that the system functions effectively and efficiently [17, 33, 24]. For the sake of simpleness , 

we will not consider replication of fragments when we discuss fragmentation in this thesis. 

1.4 Objective: Fragmentation in Object Oriented Databases 

The techniques of fragment ing relational databases have been intensely studied since the 

early 1980s. There are many different approaches to fragmentation and allocation for dis­

tribution design in relational databases. Navathe & Ra [22] develop a vertical partitioning 

algorithm using a graphical technique. Navathe, Karlapalem & Ra [21] propose a mixed 

fragmentation methodology for distributed database design. Tamhankar & Ram [33] pro­

pose an integrated methodology for fragmentation and allocation. Chu [8] designs two 

methods for partitioning attribute which treat the transaction as the decision variable. 

Even though much research has been dedicated to the issue of object-oriented databases, 

little has been related to the distribution design of object oriented databases. The fragmen­

tation of object oriented databases is a complex and still open research problem because the 

semantic model of the object oriented data model is much richer and more complicated than 

that of the relational model. The object oriented data model allows not only record con­

structors to be used but also set and other bulk type constructors. These constructions may 

even appear not only on the class level but also in nested structures [28]. The aim of this 

thesis is to generate distribution design techniques from traditional RDM to the OODM. 


6 

This thesis will concentrate on fragmentation, especially on horizontal fragmentation for 

OODM. 

1.5 The Outline of the Thesis 

The thesis will start with a brief review of the issues related to distribution design in 

relational databases in Chapter 2. The general characteristics of distributed databases 

will be reviewed. Some key concepts related to distributed databases and a framework for 

distribution design will also be presented. 

Chapter 3 covers characteristics of distribution design in relational databases. Fragmenta­

tion and allocation in RDM are studied in this chapter. Also, an overview of related works 

is provided in this chapter. 

In Chapter 4, the fundamentals of the object oriented data model will be discussed. At 

the same time, an example of the object oriented database schema will be presented in this 

chapter. 

Chapter 5 will concentrate on some fragmentation techniques that can be used in the distri­

bution design of object oriented databases. Splitting, horizontal fragmentation and vertical 

fragmentation will be presented and analyzed. Some design strategies will also be presented. 

Finally, related works will be reviewed. 

Chapter 6 will introduce normal predicates in the object oriented data model. Then some 

heuristics of horizontal fragmentation in object oriented databases will be presented. 

Finally, Chapter 7 contains the conclusion of this work. Future work is also listed in this 

chapter. 


Chapter 2 

Distributed Databases 

Distributed Database Management Systems (DDBMSs) are characterized by several levels 

of transparencies that they can support. The classification of distributed databases is based 

on autonomy and heterogeneity. Distribution design is performed under a framework that 

sets the objective of the design. There are two approaches to the design of distribution. They 

constitute different approaches to the design process. But both of them might be applied to 

complement one another [24]. This chapter defines the fundamental concepts and sets the 

framework for discussing distributed databases . We start by reviewing the characteristics of 

a distributed database system. Then we will present a architectural model for distributed 

DBMSs. Two alternative design strategies will also be discussed in this chapter. 

2 .1 Characteristics 

Some desirable functions that should be supported by a true distributed DBMS are proposed 

by Ozsu & Valduriez [24] . Data independence is considered to be one of the main motivations 

for introducing databases. In a distributed database, data independence has the same 

importance as in traditional databases. However, a new aspect is added to the usual notion 

of data independence, namely distribution transparency [6]. Atre & Advisor [3] explain that 

data access should be transparent and synchronized to preserve database integrity. Ozsu & 

Valduriez [24] define transparency as separation of the higher-level semantics of a system 

from lower-level implementation issues. In other words, transparency means that when the 

users are accessing the data, they do not need to know where the data is stored, in what 

7 


8 

format it is stored, or how it is to be accessed [3]. 

In a distributed relational database environment, each relation can be partitioned into a set 

of fragments on the basis of relevant informational content. The fragments of the relation are 

stored at different sites. Furthermore, it might be preferable to duplicate some of this data at 

other sites for performance and reliability reasons [24] . The result is a distributed database 

which is fragmented and replicated. The database users would see a logically integrated, 

single image database even though it may be physically distributed. The DDBMS should 

enable users to access the distributed database as if it were a centralized one. The ideal 

form with full transparency would imply that a query language interface to a distributed 

DBMS is not different from a query interface to a centralized DBMS. 

An ideal DBMS should provide a number of different types of transparencies. In the fol­

lowing sections the different levels of transparency will be reviewed. 

2.1.1 Data Independence 

Data independence is a fundamental form of transparency that centralized database systems 

can provide. Data independence refers to the fact that the definition and maintenance of 

data are independent from the applications and are controlled by a server of the DBMS [23]. 

In other words, data independence means that the actual organization of data is transparent 

to the application programmer. Programs are written with a conceptual schema [6]. Ozsu 

& Valduriez [24]propose two types of data independence: 

• Logical data independence refers to the immunity of user applications to changes in 

the logical structure of the database. For example, if a user application operates 

on a subset of the attributes of a relation, it should not be affected later when new 

attributes are added to the same relation. 

• Physical data independence refers to hiding the details of the storage structure from 

user applications. Programmers should not be concerned with the details of physical 

data organization when they design a user application. And the user application 

should not need to be modified when data organizational changes occur with respect 


9 

to what data type the data is assigned and what storage hierarchies t he data is 

distributed across . 

2.1.2 N etwork Transparency (Distribution Transparency) 

Network transparency or distribution transparency means that programs can be written as 

if the database were not distributed [6]. Ozsu & Valduriez [24] state that there should be no 

difference between database applications that would run on a distributed database and those 

that would ruu on a centralized one. They also state that the user should be protected from 

operational details of the network and even the existence of the network. T hey separate 

the distribution transparency into location t ransparency and naming t ransparency. 

• Location t ransparency means that the command used to perform a task is independent 

of both the location of the data and the system on which an operation is carried out. 

• Naming transparency refers to the fact that a unique name is provided for each object 

in the database. If there is no naming transparency, users arc required to embed the 

location name (or an identifier) as part of the object name. 

2.1.3 Replication Transparency 

Ozsu & Valduriez [24] state t hat replication transparency is dealing with whether the users 

should be aware of the existence of copies, or whether the system should handle the man­

agement of copies. In a DBMS with replication transparency, users should act as if there 

is single copy of the data. They also emphasize t hat replication transparency refers only to 

the existence of replicas not to the location. 

Ceri & Pelagatti [6] explain that there are several reasons for considering data redundancy 

as a desirable characteristic: first, the locality of applications can be increased if the data is 

replicated at all sites where applications need it; and second, the availability of the system 

can be increased because one site failure does not stop the execution of applications at other 

sites if the data stored is replicated. The performance can be increased t hrough replication, 

as the retrieval can execute on any copy of the data if t here is more than one copy of the 


10 

data. But on the other hand, replication causes problems in updating databases. Thus, if 

the user application is not retrieval oriented but update oriented, it might not be a good 

idea to have too many copies of the data. Whether or not to have copies and how many 

copies to have is decided to a considerable degree by the nature of user applications [24]. 

2.1.4 Fragmentation Transparency 

With fragmentation transparency, users are not aware of the data separation. Fragmenta­

tion transparency is implied if the database systems have the function of physical data inde­

pendence. Fragmentation is often used to improve performance, availability and reliability. 

Fragmentation can also reduce the negative effects of replication because by partitioning 

the data, only a subset of the relation, not the full relation, needs to be stored. Less space 

is required and fewer data items need to be managed [24]. 

2. 2 Key Concepts 

First, in this section, we will review and explain basic concepts that include heterogeneity, 

autonomy, and distribution which are used in the classification of distributed database sys­

tems. Then, classification of distributed database systems: tight integration, multidatabase, 

and semiautonomous database systems will be discussed explicitly. 

2.2.1 Heterogeneity 

Heterogeneity in DDBMSs can arise at different levels in the system, including hardware 

at different sites, different operating systems, different network protocols, different local 

DBMSs, and different models that local DBMSs are based on [4, 6, 24]. However, an 

important distinction is at the level of local DBMSs and the model they are based on, 

because differences at lower levels are managed by the communication software not by the 

DBMS. Therefore, only the heterogeneity of DBMS, the model of DBMS and the semantics 

of the data model need to be considered. Hence, the term "homogeneous DDBMS" refers 

to a DDBMS with the same DBMS based on the same data model with the same semantics. 


11 

2.2 .2 Autonomy 

Autonomy is concerned with the distribution of control, not of data. It indicates the 

degree to which individual DBMSs can operate independently [6 , 24]. Autonomy refers 

to a function of a number of factors such as whether the component systems exchange 

information, whether they can independently execute transactions and whether each site is 

allowed to modify itself. 

Requirements of an autonomous system have been specified in a variety of ways. Bell & 

Crimson [4] specify the dimensions of autonomy as: 

• Design autonomy: local sites are given the freedom to decide the information content 

of the DB , to select the data model and to choose storage structures. 

• Participation autonomy: local sites have the right to decide what data to contribute 

and when, and the freedom to decide when to come and to go. 

• Communication autonomy: local sites have the right to decide how and under what 

terms to communicate with other sites in the network. 

• Execution autonomy: local sites have the freedom to make decisions whether and how 

to process local operations to store, retrieve and update local data. 

2.2 .3 Distribution 

In Ozsu & Valduriez [24] distribution refers to the physical distribution of data over multiple 

sites. There are two alternative classes that DBMSs use to distribute data: client/server dis­

tribution and peer-to-peer distribution. Client/server distribution concentrates data man­

agement duties with servers, while clients focus on providing the application environment 

that includes the user interface [3]. The client and servers share the communication re­

sponsibility. The sites on a network are distinguished as "clients" and "servers", and their 

functionality is different [24]. In peer-to-peer systems, there is no distinction between client 

machines and servers. Each machine has full DBMS functionality and machines can com­

municate with each other to execute queries and transactions. Peer-to-peer systems are also 

called fully distributed systems [24] . 


12 

2.2.4 Classification of Distributed DBMS 

Bell & Grimson [4] divide distributed database management systems into homogeneous 

DDBMSs and heterogeneous DDBMSs. Homogeneous DDBMSs can be further divided 

into classes depending on whether or not they are autonomous. Heterogeneous DDBMSs 

can be further divided into classes based on wether they are integrated or not. The au­

thors also categorize systems according to the degree of heterogeneity, the method of data 

distribution and the extent of local autonomy. Ozsu & Valduriez [24] and Ceri & Pela­

gatti [6] present a working classification of possible design alternatives along three similar 

dimensions: autonomy, distribution, and heterogeneity. Ozsu & Valduriez [24] propose a 

classification of the implementation alternatives of distributed database systems which is 

based on autonomy and heterogeneity. This is illustrated in Figure 2.1. The different im­

plementation alternatives that cover the important aspects of these features of distributed 

database design are reviewed below. 

Distributed Distributed 
heterogeneous heterogeneous 
DBMS federated DBMS 

Distributed 
heterogeneous 
multidatabase 

Heterogeneity • 
• ~st.em . / 

~~~~~~~~·~~-- ~~~~----1~ 

Distributed 
homogeneous 
DBMS 

~ 
' ' ' 

Distributed 
homogeneous 
federated DBMS 

Distributed 
homogeneous 
multi database 
system 

Autonomy 

Figure 2.1: DDBMS Implementation Alternatives [24] 

At one extreme, a tightly integrated (or truly distributed) database is a DDBMS with no 

local autonomy. In other words it has full global control. This kind of DDBMS has a global 

schema and all applications access the global database. The local DBMSs do not operate 

independently. 

In contrast , in a multidatabase there are only local database schema and no global schema. 


13 

Global applications must access each local database separately. The individual systems 

are stand-alone DBMSs, which know neither of the existence of other DBMSs nor how to 

communicate with them. Local autonomy guarantees that local users can access their own 

local DB independently of the existence of the multidatabase and its global users . 

Gligor & Popescu-Zeletin [15] list the requirements of multidatabase systems as follows: 

• The local operations of the individual DBMSs are not affected by their participation 

in the multidatabase system. 

• System consistency or operation should not be affected when a DBMS join or leave 

the multidatabase confederation. 

• The manner of query processing and optimization of individual DBMS should not be 

affected by the execution of global queries that access multiple databases. 

Semiautonomous (or federated) systems consist of a local database schema and a global 

schema. Local databases have their own schema and can operate independently, but they 

also have decided to participate in a federation to make their local data sharable. Each of 

these DBMSs determines what information it wants to share with other users. Applications 

may be executed locally and globally. They are not fully autonomous systems because the 

parts of their own databases that they share with other users should be able to be modified 

by other users in order to change information. 

In this thesis, the discussion of data fragmentation and allocation is limited to the dis­

tributed homogeneous DBMS, which provides an integrated view of the data to users even 

though the database is distributed. 

2.3 A Framework for Distributed Databases 

The objective of the design of distribution is discussed by Ceri & Pelagatti [6]. The design 

of a distributed database system involves making decisions on the architecture of DDBMS, 

on the process of design. Two major strategies for designing distributed databases that are 

identified by Ceri & Pelagatti [6] are: top-down approach and bottom-up approach. Ozsu 


14 

& Valduriez [24] develop a framework for the process of distribution design based on these 

approaches. The reasons and criteria for fragmentation and allocation will be discussed in 

this section. 

2.3.1 The Objective of the Design of Data Distribution 

There are several objectives that should be taken into account in the design of distribution 

[6]: 

• In a distributed database system one of the major costs is associated with communica­

tion. To minimize communication costs, one goal of DDBMSs is to achieve processing 

applications locally. The degree of local processing can be maximized by distributing 

data, therefore minimizing transaction costs. To achieve this goal, the data should 

be kept as close as possible to the applications which use them. The advantage of 

processing applications locally is not only the reduction of remote access costs, but 

also increased simplicity in controlling the execution of the application. 

• We can improve the availability and fault-tolerance of read-only applications by stor­

ing multiple copies of the same information at different sites. When one site of the 

database is down or the community link for that site is broken, the system can still 

execute the applications by accessing the other copies of the information. 

• Distributing workload over the sites is done in order to take advantage of the different 

powers of utilization of the computers at each site, and to maximize the degree of 

parallelism of execution of applications . But the trade-off between processing locally 

and distributing workloads should be considered in the design of data distribution. 

• Database distribution should reflect the cost and availability of storage at each site. 

Even though the storage cost is not relevant when compared with the cost of in­

put or output (I/O), central processing unit (CPU), and transmission costs of the 

applications, the limitation of available storage at each site should be considered. 


15 

2.3.2 The Reasons for Fragmentation 

To simplify the problem, we do not consider replication at the first step of distribution 

design. The purpose of fragmentation design is to determine non-overlapping fragments 

which could be the logical unit of allocation [6]. The individual tuple or attribute of a 

relation cannot be considered as the unit of allocation because the allocation problem would 

become unmanageable. The fragments are constituted by grouping tuples or attributes that 

have the same "properties" from the viewpoint of their application [6]. This is based on 

the idea that two elements in the same fragment that have the same "properties" will be 

accessed by the applications together. Therefore, the fragments obtained in this way are 

the appropriate units of allocation [6]. 

Ozsu & Valduriez [24] mention that with respect to fragmentation , the important issue is 

the appropriate unit of allocation. The authors explain that there are three reasons for 

fragment relations: 

• Applications are usually based on the views of subsets of relations. Thus the applica­

tions often access any subset of an entire relation locally. 

• If there is a relation on which many application views are defined at different sites , 

storing a given relation at one site will result in an unnecessarily high volume of 

remote data accesses. Storing a given relation at different sites will cause problems in 

executing updates and may not be desirable if storage is limited. 

• The decomposition of a relation into fragments permits many transactions to be exe­

cuted concurrently and results in the parallel execution of a single query by dividing 

it into a set of sub queries that operate on fragments. 

However, on the other hand , fragmentation may cause the following problems [24]: 

• The applications whose views are defined on more than one fragment may suffer per­

formance degradation when the relations are not partitioned into mutually exclusive 

fragments. 


16 

• When the attributes participating in a dependency of a relation are decomposed into 

different fragments and stored at different sites, the task of checking for dependencies 

would result in chasing after data in a number of sites. 

2.3.3 Alternative Design Strategies 

Ceri & Pelagatti [6] and Ozsu & Valduriez [24] propose two alternative approaches to the 

design of data distribution, top-down and bottom-up approaches. In the case of tightly 

integrated distributed databases, design proceeds top-down from requirements analysis and 

logical design of the global database to physical design of each local database. In the case 

of distributed multidatabase systems, the design process is bottom-up and involves the 

integration of existing databases [24]. But the authors also emphasis the fact that real 

applications are rarely simple enough to fit nicely in either of these approaches. The two 

approaches may need to be applied to complement each other. 

Top-down Approach 

As shown in Figure 2.2, in the top-down approach, the process starts with a requirements 

analysis that defines the environment of the system and elicits both the data and processing 

needs of all potential database users [35]. The requirements analysis also specifies where 

the final system is expected to stand with respect to the objectives of the DDBMS. The 

objective is defined with respect to performance, reliability and availability, economics , and 

expandability ( flcxi bili ty) . 

The requirements documents are the inputs to two parallel activities: view design and 

conceptual design. The outputs of view design are the interfaces for the user, and the 

outputs of conceptual design are entity types and relationship types which are used to 

construct an external schema. 

In a distributed relational database environment, the objective of distribution design is to 

design the local conceptual schemas (LCSs) by distributing the relations and subrelations 

(fragments). The fundamental issues in top-down design are fragmentation and allocation 

[24]. 


17 

The last step in the design process is the physical design, during which local concept ual 

schemas are mapped to the physical storage devices avai lable at the corresponding local 

sites . 

Co nceptual 
np~nn 

Requirements 
Anal is 

System Requirements 
(Objectives) 

User 
In ut 

\11 ew I nteqrati on 

Feedback Observation and 
~-------------< Monitorin 

\11ewDesign 

External 

Feedback 

Figure 2.2: Top-down Design Process [24] 

Bottom-up Design Process 

User 

Input 

Ceri & Pelagatti [6] and Ozsu & Valduriez [24] state that top-down design is suitable for the 

systems which are developed from scratch. But when the distributed database is developed 

as the aggregation of existing databases, it is not easy to follow the top-down approach. 

The bottom-up approach, which starts with individual local conceptual schemata, is more 

suitable for this environment [6, 24]. Ceri & Pelagatti [6] explain that the bottom-up 

approach is based on the integration of existing schemata into a single, global schema. 

Integration is the process of the merging of common data definitions and the resolution 

of conflicts among different representations that are given to the same data. The global 


18 

conceptual schema is the product of the process [24]. Ceri & Pelagatti [6] conclude that 

there are three requirements for bottom-up design: 

(1) the selection of a common database model for describing the global schema of the 

database, 

(2) the translation of each local schema into the common data model, and 

(3) the integration of the local schemata into a common global schema. The authors 

also state that these three requirements are particularly important in heterogeneous 

distributed systems. 


Chapter 3 

Distribution Design for Relational 
Databases 

There are many reasons to study distributed relational databases. Firstly, the mathematical 

foundation of the relational data model (RDM) makes the theory research problem easy to 

formulate. Secondly, there are a large number of relational systems on the market, and most 

distributed database systems are relational [23]. Current distributed database management 

systems (DBMSs) are mainly available for relational databases (RDBs) [33] . 

The distribution design of databases involves data acquisition, fragmentation of the database, 

allocation and replication of the partitions and local optimization [19]. In a distributed 

relational database environment, fragmentation is the design technique that divides a rela­

tion into a set of partitions such that the combination of the partitions yields the original 

database without any loss or addition of information [24, 25]. 

Fragmentation can be done in several ways: horizontal, vertical, and mixed (hybrid) . Hor­

izontal fragmentation splits a relation into a set of disjoint new relations with the same 

relation schema. Each of the new relations is obtained by applying a selection operation to 

the original relation [28] . Vertical fragmentation involves dividing the attributes of a rela­

tion into groups and then applying a projection operation to the original relation over each 

group. The original relation can be reconstructed by union or join operations, respectively, 

of the new relations resulting from horizontal or vertical fragmentation [28]. 

The following sections will first set the preliminaries of the discussion of distribution design 

of RDB. Then we will list the characteristics of the distribution design. Afterwards we will 

19 


20 

review each design technique mentioned above. 

3.1 The Relational Data Model 

The relational data model was first introduced by Codd [9] in 1970. Codd [10] emphasizes 

that the relational model has three power features. First, its data structures are simple. 

This feature allows a high degree of independence from the physical data representation. 

Second, the relational model provides a solid theoretical foundation for data consistency. 

The consistent states of a database can be uniformly defined and maintained through in­

tegrity rules. Thirdly, the relational model allows the set-oriented manipulat ion of relations. 

With this feature , powerful nonprocedural languages can be developed based either on set 

theory (relational algebra) or on logic (relational calculus). 

To define the relational data model, we suppose that a certain set V of possible domains D 

is given. For instance, D can be NAT, which is a set of natural numbers, ST RI NG, which 

is a set of character strings, and BOOL, that is a set of booleans, etc. Every attribute in a 

relation schema is assigned a domain D out of V. The following definition is given in [28]. 

Definition 3.1. Let V be a non empty set of domains. 

(i) A relation schema R consists of a finite set attr(R) of attributes and a domain assign-

ment dom: attr(R) -t V. 

(ii) A tuple over a relation schema R is a mapping t : attr(R) -t LJ with t(A) E dom(A) 
DE'D 

for all A E attr(R). 

(iii) A relation r over a relation schema R is a finite set of tuples over a relation schema 

R. 

(iv) A relational database schema is a finite, non-empty set S of relation schemata. 

( v) An instance I of a relational database schema S assigns to each R E S a relation 

I(R) over R. 


21 

Remark: 

(i) A relation schema R can be written in the form of R = {A1 : D1 , .. . , An : Dn} with 

attr(R) = {A1, ... ,An} and dom(Ai) = Di(i = 1, ... ,n) . 

(ii) We usually use the term 'database' instead of talking about instances. 

Each tuple in a relation is uniquely identified by its key which is the subset of the set of 

attributes attr(R) of a relation r. The key is minimal if and only if there is no subset of a 

relation that can be a key. In order to define the key, we need to define a operation named 

projection first. 

Definition 3.2. Let r be a relation over relation schema R, and X ~ R. For a tuple t Er, 

projection oft, denoted as t[X] is a mapping t' : X --t LJ dom(A) with t' E dom(A) such 
AEX 

that t' (A) = t(A) for each A EX. 

Then we can have the following definition: 

Definition 3.3. A key on a relation schema Risa subset I< ~ attr(R) restricting relations 

rover R to satisfy t1 = t2 for all tuples t1, t2 Er with t1[K] = t2[K]. 

A key K is called minimal if and only if no proper subset of I< is a key. 

A primary key is simply a distinguished minimal key 

3.2 Characteristics 

For a given relational database schema S = {R1, ... , Rk}, fragmentation applies to each of 

the relations ~ in the schema. To ensure the consistency of the database, the fragmentation 

operation should satisfy the following three characteristics [24]: 

(i) Completeness 

Informally, completeness refers to the fact that fragmentation does not lead to a loss 

of information. If a relation ~ is decomposed into fragments R}, .. . , Rfi, each data 

item that can be found in~ can also be found in one or more of R{ 


22 

(ii) Reconstruction 

Informally, reconstruction refers to the fact that it should be easy to reconstitute a 

non-fragmented relation back from its fragments. For a given relation Ri with frag­

mentation FR; = {R[, ... , Rfi}, it should be possible to define a relational operator 

that can reconstruct Ri with R{. 

• For horizontal fragments, reconstruction of a global relation Ri is performed 

by using the union operator in the horizontal fragmentation. Using relational 

algebra, 
k; 

Ri = LJ R{ 
j=l 

• For vertical fragments, reconstruction of the original global relation Ri is made 

by using the natural join operation: 

(iii) Disjointness 

Informally, disjointness refers to the fact that fragmentation should not lead to a 

replication of information. But if replication comes into play, this requirement has to 

be relaxed. If a relation Ri is horizontally decomposed into fragments R}, ... Rfi and 

the data item d'i is in R{, it does not appear in any other fragment Rf(k ¥- j). If a 

relation is vertically partitioned, disjointness is defined only on the non-primary key 

attributes of the relation. To simplify the problem, replication is not considered at 

the first stage of the design of distributed databases. 

3.3 Horizontal Fragmentation 

There are two forms of horizontal fragmentation: primary horizontal fragmentation and de­

rived horizontal fragmentation [6, 24]. Schewe [27] defines primary horizontal fragmentation 

as: 

Definition 3.4. Let S = R1, ... , Rn be a relational database schema, primary horizontal 


23 

fragmentation on relation schema R;, replaces R;, by a set { R}, ... , Rfi} of new relation 

schemata such that: 

(i) the attributes in each R{ are the same as in R;,. 

(ii) Each relation ri over R;, can be split into pairwise disjoint relations r}, ... , rfi over 

R} , ... , Rfi respectively such that ri = rf U · · · U rfi holds. 

By using relational algebra, the operation of horizontal fragmentation can be expressed as: 

with 'Pj as selection predicates defined on R;, that are derived from application information. 

Of course, the disjointness property implies that we must have 

'Pj I\ 'Pk {:::} J alse for all j i= k 

Similarly, the completeness property implies the requirement that 

c.p1 V · · · V 'Pki {:::} true 

We notice the above definition includes the criteria which are also known as the correctness 

rules of horizontal fragmentation in the above section: completeness, reconstruction, and 

disjointness. 

Example 3.1. Take relation schema LECTURER= {name, specialization, location} , a re­

lation r over LECTURER is 

name specialization location 

Chris Database Theory Palmy 

David System Analysis Wellington 

Peter End User Computing Palmy 

John Database Theory Auckland 

Barbara Multimedia Auckland 


24 

with a set of selection formulae: 

<p1 = location = 'Palmy' 

<p2 = location = 'Wellington' 

<p3 =location = 'Auckland' 

The above relation is partitioned into the following three relations: 

name specialization location 

Chris Database Theory Palmy 

Peter End User Computing Palmy 

name specialization location 

David System Analysis Wellington 

name specialization location 

John Database Theory Auckland D 

Barbara Multimedia Auckland 

Derived horizontal fragmentation is the partitioning of a relation that results from predicates 

being defined on another relation. For this operation, semi-joins: R1 ~ R2 = 7rR1 (R1 l><l R2) 

will be involved, we define derived horizontal fragmentation as below: 

Definition 3.5. Let S = R1 , ... , ~.Rx , ... , Rn be a relational database schema, De-

rived Horizontal fragmentation on relation schema ~ results that ~ be replaced by a set 

{Rt, ... , Rfi, R~;+l} of new relation schemata such that: 

(i) the attributes in each Rf are the same as in ~. 

(ii) each relation ri over ~ can be split into pairwise disjoint relations rt, . .. , rfi, rfi+1 

over R 1 Rki Rki+l respectively such that r· = r 1 U · · · U r~i U rki+l holds t, ... , t' t t t t t . 

Using relational algebra, the operation of derived horizontal fragmentation on relation ~ 

can be written as: 


with <{Jj be the predicate defined on relation Rx and 

ki 

R~;+l = ~ - LJ (~ t>< acpj( Rx)) 
j=l 

be a necessary 'remainder fragment'. 

25 

Example 3.2. Take relation schemata LECTURER = {name, specialization, location Code} 

and CAMPUS = {location Code, city}, a relation r over LECTURE and relation r' over CAM-

PUS are: 

name specialization location Co de 

Chris Database Theory 1 location Code city 

David System Analysis 3 1 Palmy 

Peter End User Computing 1 2 Auckland 

John Database Theory 2 3 Wellington 

Barbara Multimedia 2 

with a set of selection formulae <{Jj defined on relation CAMPUS, relation r over LECTURER 

can be fragmented as: 

R1 = LECTURER I>< O"cp 1=city = 'Palmy•(CAMPUS) 

R2 =LECTURER I>< O"cp2 =city = 'Wellington•(CAMPUS) 

R3 =LECTURER I>< O"cp3 =city = 'Auckland'(CAMPUS) 

Then relation r is partitioned into the following three relations: 

name specialization location Code 

Chris Database Theory 1 

Peter End User Computing 1 

name specialization location Code 

David System Analysis 2 


26 

name specialization location Code 

John Database Theory 3 

Barbara Multimedia 3 

Note that the remainder relation r4 is empty because all tuples of r can match a tuple in 

one of the fragment rj of relation r'. D 

3.4 Vertical Fragmentation 

Vertical fragmentation exploits relation schemata to be sets of attributes. Vertical fragments 

result from a projection operation to the original relation. The original relation can be 

reconstructed by the joining of the new relations . It can be defined as below: 

Definition 3.6. Let S = R1, . •. , Rn be a relational database schema with relation schemata 

. { 1 ki } ~ = {Ail, ... , Aini}. Vertical fragmentation replaces Hi by a set Ri , ... , Ri of new 

relation schemata such that: 

ki . 
(i) the attributes are distributed, i.e., I4, = U Rf, 

j = l 

(ii) each relation ri over I4, is split into relations r1, = rrR{(ri)(j = 1, ... ,ki) such that 

Ti = r} IXl · · · IXl rfi holds, 

(iii) in Relational Algebra, ~ = R} IXl · • · IXl R7', 

(iii ') in a special case, a distinguishing new attribute dif can be added to relation ~ as 

a minimal key, then after vertical fragmentation dif E R{ for all j E {1, .. . , ki}, and 

I4. = rrn;- {dif}(R} 1X1···1X1 R7i). 

Using Relational Algebra, vertical fragmentation could be written as R{ = 7rattr(R{)(~) for 

all j E { 1, .. . , ki} 

Not having the distinguished new attribute dif would require a lossless join-decomposition, 

which in turn would mean t hat a join-dependency must hold. However, a well designed 


27 

database schema would exclude such dependencies, except for the case, where R} n · · · n R7; 

contains a key. Thus, it is normally required that the primary key (i.e. a chosen minimal 

key) is part of all Ri . 

Example 3.3. Take relation schema LECTURER = {name, specialization, location, depart­

ment, IRD, salary}, a relation r over LECTURER is 

name specialization location department IRD salary 

Chris Database T heory Palmy Information Systems 234569 92000 

Richard Accounting history Wellington Accounting 235169 38000 

Peter End User Computing Palmy Information Systems 256487 64560 

Mary Careers Auckland Human Resources 156426 65900 

Barbara Multimedia Auckland Computer Science 352486 56000 

Relation r can be vertically fragmented in the following two ways: 

(i) F irst , alter the relation schema LECTURER to LECTURER' by attaching a distinguish­

ing new attribute 'dif' to LECTURER. Then, by using operation: 

R1 = 7r dif, specialization, location , department (LECTURER' ) 

R2 = 1rdif, name, IRD, salary(LECTURER') 

the above relation r is vertically partitioned into the following two relations: 

dif specialization location department 

1 Database Theory Palmy Information Systems 

2 Accounting history Wellington Accounting 

3 End User Computing Palmy Information Systems 

4 Careers Auckland Human Resources 

5 Multimedia Auckland Computer Science 


28 

dif name IRD salary 

1 Chris 234569 92000 

2 Richard 235169 38000 

3 Peter 256487 64560 

4 Mary 156426 65900 

5 Barbara 352486 56000 

(ii) Alternatively, let {name, department} be the primary key. Using operation: 

R1 = 1r name, department, specialization, location (LECTURER) 

R2 ='Tr name, department, IRD, salary(LECTURER) 

the above relation r is vertically part itioned into the following two relations: 

name department specialization location 

Chris Information Systems Database Theory Palmy 

Richard Accounting Accounting history Wellington 

Peter Information Systems End User Computing Palmy 

Mary Human Resources Careers Auckland 

Barbara Computer Science Multimedia Auckland 

name department IRD salary 

Chris Information Systems 234569 92000 

Richard Accounting 235169 38000 

Peter Information Systems 256487 64560 

Mary Human Resources 156426 65900 

Barbara Computer Science 352486 56000 

D 


29 

3.5 Mixed Fragmentation 

Database users usually access subsets of data which are vertical and horizontal fragments 

of global relations . Therefore, there is a need for mixed fragmentation, which applies a 

sequence of horizontal and vertical fragmentation on a relation. It can be achieved by 

successively performing horizontal and vertical fragmentation on a relation. The different 

sequencing of vertical and horizontal operations generates different fragmentation schema. 

Even though these operations can be recursively repeated, having more t han two levels of 

fragmentation is not of practical interest [6]. 

k Definition 3. 7. For a given relation ~ in the database schema S, mixed fragments Ri 

are built by successive steps of horizontal and vertical fragmentation on relation ~, such 

that the correct rules for both vertical and horizontal fragmentation can be met. Using 

relational algebra, it can be expressed by using alternatively sequences of projection and 

selection operations: 

allowing 'Pk = true and R{ = Ri captures all other possibilities for such sequences. 

3.6 Allocation 

Once a fragmentation schema has been decided upon, each fragment must be assigned to 

one or more nodes in the distributed database management system. The allocation problem 

involves finding the "optimal" distribution of the fragments to the sites. The discussion of 

allocation is to find an allocation model that minimizes the total costs of processing and 

storage while trying to meet certain t ime restrictions [24]. The definition of a cost minimized 

allocation is as below: 

Definition 3.8. For a given set of fragments {Fi , ... , Fn} with different sizes s1, .. . , Sn, 

if the network has nodes N1, ... , Nki fragment allocation is to assign a node Nj to each 

fragment Fi such that the summary of all the transaction and storage costs from all the 


30 

sites can be kept to a minimum, where the transaction and storage costs are calculated 

according to a predefined cost model. 

The focus of this thesis is on horizontal fragmentation of DOODBs. We will not go into 

detail on the discussion of vertical fragmentation and allocation of fragments. 

3. 7 Related Work 

A lot of research has contributed to fragmentation and allocation in the area of distributed 

relational databases. This section will present some approaches for horizontal and vertical 

fragmentation. 

3. 7.1 Horizontal Fragmentation 

For horizontal fragmentation, Ceri and Pelagatti [6] introduce two types of fragmentation: 

primary and derived. Derived fragmentation, which is performed to facilitate the join 

between fragments, is determined in terms of primary fragmentation. 

• Primary horizontal fragmentation of a relation can be defined by determining a set 

of disjoint and complete selection predicates. Ceri and Pelagatti propose a procedure 

which produces a set of disjoint and complete selection predicates. This procedure 

can be summarized as follows: 

(i) Derive a set of simple predicates P = {pi, ... ,Pn} from application information. 

Simple predicates take the form of: 

Ai= value 

where Ai is an attribute of the relation schema. The simple predicates Pi E P 

must satisfy two properties. They must be complete and minimal in order to 

guarantee that elements in the same fragments share the "same properties" in 

terms of allocation. The definition of complete and minimal can be found in [6]. 


31 

(ii) Construct a set of minterm predicates from P by applying arbitrary conjunctions 

of all predicates Pt, where Pt is either Pi E P or its negation. Some of these 

conjunctions may be unsatisfiable. A set I of implications among the Pt can be 

used to determine (and remove) these unsatisfiable minterms. 

The authors point out that it is not possible to analyse all the transactions that use 

the database. Only the most important and critical transactions should be taken into 

account. It is widely accepted that the "20/80" rule should be applied as a guideline 

to choose user applications to determine simple predicates. It means that only 20% 

of user queries should be taken into account, because they usually account for 80% of 

the data access . In particular , no update transactions will be considered. It is also 

suggested that fragments that have similar properties should not be distinguished. 

Otherwise the execution of the algorithms for a complete and minimal set of predicates 

will become very expensive. 

• Derived fragmentation in [6] is performed by applying semijoin operations. Member 

relations are partitioned according to the fragmentation of their owners. By using 

relational algebra, derived fragmentation is expressed as R~ = R2 t>< R{ with R2 indi­

cating the member relations, R1 indicating owner relations that have been fragmented 

into a set of disjoint fragments { Ri , ... , Rl} with 1 :S j :S i. 

Oszu and Valduriez [24] follow the lines of Ceri, Pelagatti and Navathe [6 , 20] and develop 

horizontal partitioning algorithms. Their fragmentation algorithm is based on the following 

necessary information requirements: 

(i) Database information. The database information concerns the global conceptual 

schema. From the global conceptual schema, we can know how the database relations 

are connected to one another, especially with joins. 

(ii) Application information. This includes the description of user queries and fre­

quencies with which user applications access and update data. The quantitative and 

qualitative information acquired from application information can be summarized in 

the following four categories: 


32 

• Simple predicates for relation R = {A1 : Di, ... , An : Dn} can be defined in 

the form of 

with A as an attribute defined over domain Di , () E { =, <, /:-, :S, >, ~} and 

Value E Di. A set of all simple predicates defined on relation R is denoted by 

Pr= {p1 ,p2, ··· ,Pm}· 

• Minterm predicates are the conjunctions of simple predicates and their nega­

tions. The set of minterm predicates Mi= {mil, mi2, ... , miz} over a set Pri of 

simple predicates is defined by: 

Mi = { mij lmij = f\ P7d 
P;kEPr; 

where P7k = Pik or P7k = 'Pik· Note that all simple predicates in Pri appear 

(positively or negatively) in each minterm predicate. 

• Minterm selectivity presented with sel(mi) is the number of tuples of the 

relation that would be accessed by a user query specified according to a given 

minterm predicate mi. 

• Access frequency with which users access data. For user application qi, it is 

denoted with acc(qi)· For a minterm predicate mi it is represented as acc(mi)· 

The input of the algorithm for horizontal fragmentation is a relation R and a set of simple 

predicates Pr. The output of the algorithm is a set of fragments { R1, ... , Rn} of R. The 

objective of the algorithm is that Pr should be complete and minimal, which is defined as 

follows: 

• Completeness of simple predicates 

A set of simple predicates Pr is said to be complete if and only if there is an equal 

probability of access by every application to two tuples belonging to the same minterm 

fragment that is defined according to Pr. 

For example, if we assume there is a relation PROJECT ={PNumber: NAT, PName: 

STRING, Budget: NAT, Campus: STRING, Description: STRING, Begin: DATE, 


33 

End: DATE} , there are two applications defined on it, and there are only three 

different locat ions for Project. 

Find the budget of projects at each location. (a) 

Find projects with budget less than $200000. (b) 

According to applicat ion (a) Pr={Campus= 'Wel', Campus='P N', Campus='Auk '} 

is 11ot complete with respect to application (b) because application (b) will access 

two tuples in a fragment defined by predicate Pr with different probability if the 

values of the budget of one object is more than or equal to $200000 and that of 

the other object is less than $200000. A complete set of simple predicates should 

be P r={Campus='Wcl ', Campus='PN', Campus='Auk', Budget<200000, Budget ~ 

200000 }. 

• Minimality of simple predicates 

If a predicate influences how fragmentation is performed (e.g. J be partitioned into Ji 

and Jj), there should be at least, one application that accesses Ji and fj differently. In 

other words, the simple predicate should be relevant in determining a fragmentation. 

If all the predicates of a set, Pr are relevant, then Pr is minimal. 

If Pi E 714 and fragment fi is determined by mi, 'Pi E mj and fragment J1 is deter­

mined by mj, then Pi is relevant iff 

ace( mi) # ace( mj) 
card(fi) card(Jj) 

Where acc(mi) and acc(mj) denote the access frequencies of minterm predicate mi 

and mj, card(fi) and card(fj) denote t he cardinalities of fragment fi and Jj. 

For instance, a set of simple predicates Pr = {Campus='Wel', Campus='PN', Cam­

pus='Auk', Budget< 200000, Budget ~ 200000} is minimal in addition to being 

complete. However , if there is another predicate Pj :=PName='BigNet' in Pr , then 

Pr ={Campus= 'Wel', Campus= 'PN', Campus= 'Auk', Budget<200000, Budget ~ 

200000, PName='BigNet '} is not minimal because there is no application that ac­

cesses fragment Ji (PName is 'BigNet ') and Jj (PName is not 'BigNet') differently. 


34 

Ozsu and Valduriez [24] first present an iterative algorithm named COM_MIN to generate 

a complete and minimal set of predicates Pr' from a given set of simple predicates Pr. The 

algorithm checks each predicate Pi in the given set of simple predicates Pr to see if it can 

be used to partition the relation R into at least two parts which are accessed differently by 

at least one application. If Pi satisfies the fundamental rule of completeness and minimality 

then it should be included in Pr'. If Pi is nonrelevant then it should be removed from Pr'. 

But this algorithm is not practical because checking Pi can not be defined with machine 

readable language. Moreover , it does not consider the fragment combination possibility 

that some of the minterm fragments might be allocated to the same site. 

A algorithm named PHORIZONTAL is introduced to describe primary horizontal fragmen­

tation. It uses the algorithm COM_MIN and a set of implications I as inputs to produce 

a set of satisfiable miniterm predicates M . If a mintcrm predicate mi is contradictory to a 

implication rule in I , then it is removed from M. Minterm fragments are defined according 

to the set of satisfiable minterm predicates M. But the set I of implications is hard to 

define. 

For the derived horizontal fragmentation , semijoin operations are involved. Member re­

lations are fragmented according to the fragments of its owner relation. Details of the 

definition of member and owner relations can be found in [24]. They emphasize that care 

should be taken with the relations that have more than one link to the owner relations. Two 

criteria are suggested [24] . First, choose the fragmentation with better join characteristics. 

Second, choose the fragmentation used in more applications. 

In fact, the algorithm is not very practical, as it will always result in a subset Pr' of P r, the 

set of minterm predicates M' determined by Pr' and the corresponding set of fragments. 

Simple predicates are omitted from Pr if they do not contribute to the fragmentation , 

i.e. they only violate the minimality principle. This emerges to considering just the simple 

predicates AiOVi in the most important queries and to take all satisfiable minterm predicates. 

This obviously leads to fragments that are accessed different ly by at least two queries. The 

algorithm further does not give executable rules for eliminating the unsatisfiable minterm 

predicates. 


35 

The major problem, however, is that the number of fragments resulting from the algorithm 

is exponential in the size of Pr. In practice, it would be important to reduce this number 

significantly, which would mean to re-combine some of the fragments. In fact, this implies 

giving up the completeness principle and replacing it by optimization criteria based on a 

cost model. 

Several other researchers have worked on fragmentation in the relational data model. Na­

vathe et al. [21] define a schema for simultaneously applying the horizontal and vertical 

fragmentation algorithms on a relation to produce a grid . They use a technique similar 

to the vertical fragmentation presented in [20, 22] to produce horizontal fragments. The 

fragmentation schema generated by the algorithm is independent of the sequencing of the 

horizontal or vertical fragment algorithms. Tamhankar and Ram [33] introduce an inte­

grated methodology for fragmentation and allocation. They make an attempt to combine 

fragmentation, allocation and replication into a single step of distribution design and apply 

the combination to a practical problem. 

3.7.2 Vertical Fragmentation 

Vertical fragmentation is more complicated than horizontal fragmentation because of the 

total number of alternatives. If there are n simple predicates that are used to define horizon­

tal fragmentation , there are at most 2n possible fragments that can be defined. But in the 

case of vertical fragmentation, if a relation has m nonprimary key attributes, the possible 

fragments are given by the Bell number which is approximately B ( m) ~ mm. From the 

value of the possible vertical fragmentation, we find out it is impossible to get the optimal 

solutions to the vertical partitioning problem. We can only expect to find out a heuristic 

solution. 

There are several algorithms of vertical fragmentation that have been proposed in the lit ­

erature. Hoffer and Severance [7] measure the affinity between pairs of attributes and try 

to cluster attributes according to their pairwise affinity by using the bond energy algo­

rithm(BEA). 

Navathe et al.[20] extend the BEA approach and proposed a two-phase approach for vertical 


36 

partitioning. In the first step, they used the given input parameters in the form of an 

attribute usage matrix(AUM) to construct the attribute affinity matrix (AAM) on which 

clustering is performed. In the second step, estimated cost factors , which reflect the physical 

environment of fragment storage, were considered to further refine the partitioning schema. 

Cornell and Yu [11] apply the work of Navathe et al.[20] to physical design of relational 

databases. This approach uses specific physical factors such as the number of attributes, 

their length and selectivity, and cardinality of the relation. 

Navathe and Ra [22] construct a graph-based algorithm to solve the vertical partitioning 

problem where the heuristics used include an intuitive objective function which is not ex­

plicitly quantified. 

With the aim to overcome the complexity of attribute based algorithms, P.-C. Chu [8] 

proposes a transaction-oriented approach to vertical partitioning, in which no Attribute 

Utility Matrix but transaction information is used as the decision variable. 

The algorithm presented in Ozsu and Valduriez [24] takes AAF as input. The Bond Energy 

algori thm is employed to evaluate the togetherness of a pair of attributes. Shift operation 

is used to process the attribute clustering. The binary partitioning algorithm should be 

applied recursively when there is a large set of attributes. 

Muthuraj et al [19] propose a formal approach to address the problem of an n-array vert ical 

partitioning problem and derive a partition evaluator function which describes the affinity 

value for clusters of different sizes. This function can either be used for fragmentation 

progress or applied to evaluate the fragmentation schemata that are created and therefore 

to test and evaluate the different algorithms available. They argue that an attribute usage 

matrix instead of an attribute affini ty matrix (AAM) should be used in the process of 

fragmentation because AAM can only measure the closeness of a pair of attributes at one 

time and cannot measure the closeness of the entire cluster which may consist of more than 

two attributes. 

Both horizontal and vertical fragmentation problems are complex. Studying only one of 

them will be a hard task. This thesis will only concentrate on horizontal fragmentat ion and 

will not handle vertical fragmentation in detail. 


Chapter 4 

Object Oriented Databases 

Object-oriented databases have brought about a fundamental change in the way a data and 

the procedures that operate on the data are viewed. Whilst general agreement on the broad 

features to be supported by an object oriented database system is slowly being reached, as 

yet there is still no firm agreement on a formal definition of an object-oriented database 

system [12, 30, 32]. Additionally, there is still much debate on which underlying object 

oriented data model is appropriate for database systems. 

However , it is widely accepted that objects are abstract ions of real world entities. Objects 

are regarded as the basic unit of persistent data [30], and the object oriented databases 

are composed of independent objects. A unique identifier should be assigned to an object 

because t he existence of an object should be independent of its value [2, 30, 18]. The using 

of immutable object identifiers enable sharing, mutabili ty of values and easily representation 

of cyclic structures. 

Our research will apply the object oriented data model (OODM) which is presented in 

[30 , 31]. This model applies an abstract object identifier to capture the fact that an object 

in the real world always has an unique identity. At the same time, an object in the real 

world can have different aspects and should not only be coupled with a unique type. In 

contrast, objects, as well as references to other objects, should be associated with more 

than one type that can change during the object's lifetime. The section below will present 

some fundamental concepts of the object oriented data model. It will also depict concepts 

by giving an object oriented database schema as well as an instance of the database. 

37 


38 

4.1 Fundamentals of the OODM 

This section will review the object model proposed in Schewe & Thalheim [30] . In this 

model each object o consists of a unique identifier id, a set of (type-, value-) pairs (7i, 

vi), a set of (reference-, object-) pairs (refi, 01) and a set of methods methk. Values and 

objects are different in that values can be identified by themselves while objects can only 

be identified by applying an external identification mechanism. Types are used to structure 

values while classes are the groups of objects that have the same structure which uniformly 

combines aspects of object values and references. Subtyping is used to relate values with 

different types [31]. In this thesis, in order to simplify the problem of distribution design, 

the behavior( method) part will be omitted completely. The object oriented data model that 

we adapt to this project is just a simplified version of the data model introduced in [30, 31]. 

4.1.1 Type Definitions 

The type system presented in [30, 31] consists of some basic types, type constructors and 

subtyping relation. 

Definition 4.1. (i) The base types are either BOOL, NAT, INT, FLOAT, STRING, ID, 

or .l, where ID is an abstract identifier type without any non-trivial supertype and 

.l is the trivial type that is a supertype for every type. 

(ii) Let Np and NF denote parameter-names and function-names. Let a i E NF and 

a,ai E Np(i = 1, ... ,n). A type constructor is either a record constructor (a1 : 

a1 , ... , an : an), a finite set constructor {a} , a list constructor [a], a bag constructor 

(a) or a union constructor (a1 : ai) U · · · U (an : an)· 

(iii) A type t is called proper iff the number of its parameters is 0. If there is no occurrence 

of ID in t , t is called a value type. If t' is a proper type occurring in a type t , then 

there exists a corresponding occurrence relation o: t x t'--+ BOOL. 

New types can be defined by nesting using predefined base types, such as BOOL, NAT, 

STRING, etc, and predefined constructors for records, finite sets, lists, unions, etc. So we 


39 

can first define some types in the example below. 

Example 4.1. First, we define PERSONNAME by using both a set constructor{-} and 

record constructor ( ·): 

Type PERSONNAME 

= (FName: STRING, 

LName: STRING, 

Title: {STRING}) 

End PERSONNAME 

Then we define PERSON based on type PERSONNAME. 

Type PERSON 

= (PersonID: NAT, 

Name: PERSONNAME, 

Address: STRING, 

DOB: DATE) 

End PERSON 

We can also define the following types which will be used when we define a university 

database schema in the next subsection. 

Type PROJECT 

= (PNumber: NAT, 

Pname: STRING, 

Begin: DATE, 

End: DATE, 

Description: STRING 


40 

Budget: STRING) 

End PROJECT 

Type COURSE 

= (CNumber: NAT, 

Cname: STRING) 

End COURSE 

Type DEPARTMENT 

= (Dname: STRING, 

{(TelNumber: NAT, 

Campus: STRING)}) 

End DEPARTMENT 

Type ROOM 

= (Building: STRING 

NO: NAT 

Campus: STRING) 

End ROOM 

Type SEMESTER 

= (Year: NAT, 

Season: NAT) 

End SEMESTER 

Subtypes are also introduced in [30]. They are used to relate values in different types. 

D 

Definition 4.2. Let a 1 , . . . , an be parameter-names. A subtype relation::; on types is given 

by the following rules: 


41 

(i) Every type t is its own subtype and a subtype of ..l. 

(ii) NAT:SINT:SFLOAT. 

(iii) ( ... , ai-l : O'.i-1, ai: ai, ai+l : ai+1, ... ) < ( ... , ai-l : a~_ 1 , ai+l : a~+l> .. . ) whenever 

aj:Saj. 

{ai} < {aj} 

(iv) [ai] < [aj] iff O'.i :S O'.j· 

(a) < (aj) 

(v) {a} :S (a) and [a] :S (a) . 

(vi) · · ·U(ai-1: O'.i-1)U(aH1: ai+1)U · · · :S · · ·U(ai-1: a~_ 1 )U(ai: a~)U(aH1: a~+ 1 )U .. . . 

whenever ai :S aj for all j = 1, .. . , n. 

Example 4.2. We can define STUDENT and LECTURER as subtypes of PERSON de­

fined in the above example. 

Type STUDENT 

= (PersonID: NAT, 

StudentID: NAT) 

End STUDENT 

Type LECTURER 

= (PersonID: NAT, 

Name: PERSONNAME, 

Specialization: STRING) 

End LECTURER D 


42 

4.1.2 Class Definitions 

In the OODM presented in [30], classes are used to structure objects having the same 

structure and behavior, while types are used to structure values. Each object in a class 

has an identifier, a collection of values, references to other objects and methods. Identifiers 

can be represented by using the unique identifier type ID. An object, which has multiple 

aspects, can simultaneously belong to different classes. This property guarantees that each 

object of the abstract object model can be captured by the collection of possible classes. 

A class structure allows us to uniformly combine aspects of object values and references. 

Relationships between classes are represented by references and referential constraints on 

the object identifiers involved [30]. Because we are not looking at the dynamics, the model 

we apply in this thesis will not consider methods. 

Definition 4.3. (i) Lett be a value type with parameters a 1 , ... , an such that ID does 

not occur in t. If the parameters are replaced by pairs ri : Ci with pairwise differ­

ent reference names ri and class names Ci, then the resulting expression is called a 

structure expression. 

(ii) A class consists of a class name C, a structure expression expc, a set of class names 

D1, ... , Dm (called superclasses). The proper type derived from expc by replacing 

each reference ri : Ci with the type ID is called representation type Tc of the class C. 

4.1.3 Schema Definition 

The database schema is designed by a finite collection of type and class definitions [30]. Let 

us first review the definition of schema and the way to make it closed. Then we will look 

at an instance of a schema and some integrity constraints. 

Definition 4.4. A schema S is a finite set of classes that is closed in that all names 

appearing in a structure expression or as superclass must be names of classes defined in the 

schema. 


43 

Definition 4.5. An instance db of a structure schema S assigns to each class C E S a 

finite set db( C) of values of type (id : ID, value : Tc) such that the following conditions are 

satisfied: 

uniqueness of identifiers: For every class C we have 

Vid :: ID.Vv,w :: Tc.(id,v) E db(C) /\ (id,w) E db(C)::::} v = w. 

inclusion integrity: For a subclass C of C' we have 

Vid :: ID. id E dom(db(C))::::} id E dom(db(C')). 

Moreover, if Tc is subtype of Tb with subtype function f : Tc___, Tb, then we have 

Vid :: ID.Vv :: Tc .(id ,v) E db(C)::::} (id,f(v)) E db(C') 

referential integrity: For each reference from C to C' with corresponding occurrence 

relation Or we have 

Vid, id':: ID.Vv :: Tc .( id, v) E db(C) /\ or(v, id')::::} id' E dom(db(C')) 

where dom(db(C)) ={id: : I Div:: Tc.(id,v) E db(C)}. 

4.2 An Example of Object Oriented Database Schema 

In this section, we show an example of a database schema and its instance, the contents of 

the database at a given time point. Example 4.3 shows an object oriented database schema 

transformed from a Higher Order Entity Relationship Schema in [34]. Figure 4.1 is the 

Higher Order Entity Relationship Model (HERM) diagram of the schema. 

We use the types defined in Example 4.1 to build the structural part S of the OODM 

schema. Therefore the structure of a class can be based on a type definition defined above 

or on a nameless type definition. It may also involve an IsA relation to model objects in 

more than one class. We use o to indicate concatenation for record types. 


44 

ID 

FNamc LNamc 

Building No Rcqnircct Require• 

~ 

Bq;in ~ Dcscripcion 

End ~c.,.p.li 
Number PName 

Figure 4.1: HERM Diagram of University Database Schema [34] 


Example 4.3. We design a structural schema as below: 

Schema University 

Class PERSONC 

Structure PERSON 

End PERSONC 

Class SEMESTERC 

Structure SEMESTER 

End SEMESTERC 

Class RooMC 

Structure ROOM 

End RooMC 

Class DEPARTMENTC 

Structure DEPARTMENT 

End DEPARTMENTC 

Class COURSEC 

Struct ure CO URSE o 

(Requires: { r: CouRSEC}) 

End CouRSEC 

Class LECTURERC 

IsA PERSONC 

Structure LECTURER o 

(Department : D EPARTMENTC, 

Campus: STRING) 

End L ECTURERC 

Class P ROJECTC 

Structure PROJECT o 

(Primary Investigator : { P ERSONC U LECTURERC } ) 

45 


46 

End PROJECTC 

Class PAPERC 

Structure (Course: COURSEC , 

Semester: SEMESTERC, 

Campus: STRING, 

Lecturer: LECTURERC, 

{( Day: DATE, 

Hour: NAT, 

Room: RooMC)}) 

End PAPERC 

Class STUDENTC 

IsA PERSONC 

Structure STUDENT o 

(Supervisor: LECTURERC, 

Major: D EPARTMENTC, 

Minor: D EPARTMENTC, 

Enroll: {(PAPERC, 

Grade: STRING)}) 

End STUDENTC D 

After we define the structure of the schema, we are going to describe the instances of the 

above schema. The instance of a database schema is t he content of the database at a given 

time point. We need a type ID of object identifiers to uniquely and efficiently identify the 

objects and to model objects in different classes and references to other objects. We use db 

as a name of the instance of the database schema. 


47 

Example 4 .4 . An instance of the above university database schema is presented as follow­

ing: 

db(P ERSONC) = 

{(i101 , (PersonID : 1001 , (FName: John,LName: Dever,{Professor ,Dr} ), 

Address : 28 Victoria Av, DoB:16/ Jan/ 1953)) , 

(i102, (PersonID: 1002, (FName: Allan, LName: Barry, Ti t le: {Senior Lecturer , Dr} ), 

Address: 66 Albert St, DoB:23/ Feb/ 1958)), 

( i103, (PersonID : 2010, (FN ame: Shirley, Churchill , T it le: {Lecturer} ), 

Address: 531 Tramine Av. , DoB:lO/ Oct/1960)), 

(i104 , (PersonID : 3203 , (FName: Jerry, LName:Hubbard ,Title: {HoD, P rofessor, Dr} ), 

Address : 32 Ada St , DoB:02/ May/ 1945)), 

(i105, (PersonID : 2618, (FName: LName: James, LName: Hooks,Tit le: {Lecturer} ) , 

Address :l16 College St, DoB:28/ Jun/ 1960)), 

(i105, (PersonID : 4322, (FName: LName: Jill , Heslop, Tit le: { } ), 

Address: 22 Fegerson St, DoB:ll / Jul / 1985)), 

(i107 , (PersonID : 4198, (FName: LName: Frances,LName: Caban,Title: { } ), 

Address: 78 Cuba St,DoB:30/ Nov/ 1983)) , 

(i10s, (PersonID : 4077, (FName: LName: Lindsay,LName:Hamilton ,Title: { } ), 

Address: 29 Church St, DoB:06/ Dec/1982)), 

(i1og, (PersonID : 4198, (FName: LName: Jeff, LName: Perera,Title: { } ), 

Address: 99 Broadway Av, DoB:18/ Aug/1980)) , 

(i11o, (PersonID : 2396, (FName: Lindsay, LName: Kirton, Title: { } ), 

Address:195 King St, DoB:03/Sept/1975))} 


48 

db(SEMESTERC) = 

{(i201, (Year: 2000,Season: 01)), 

(i202, (Year: 2000, Season: 02)), 

(i203, (Year: 2001 , Season: 01)), 

(i204, (Year: 2001, Season: 02)), 

(i205, (Year: 2002, Season: 01)), 

( i2o6, (Year : 2002, Season : 02)), 

(i207, (Year: 2003, Season: 01))} 

db(RooMC) = 

{(i301 , (Building: SSLB, NO: 2, Campus: PN)), 

(i302, (Building: SST, NO: 1, Campus: WN)), 

(i303, (Building: Marsdon, NO : 1, Campus: PN)), 

(i304, (Building: AH, NO : 2, Campus: ALB)), 

(i305, (Building: BSC, NO: 203, Campus: ALB))} 

db(DEPARTMENTC) = 

{(i401, (Dname: Information Systems, {(TelNumber: 063566199, Campus:PN) , 

(TelNumber : 045763112, Campus:WN)} )), 

(i402 , (Dname: Accounting, {(TelNumber: 063563188, Campus:PN), 

(TelNumber: 098132699, Campus:ALB)})), 

(i403, (Dname: Marketing, {(TelNumber: 063564132, Campus:PN), 

(TelNumber: 045663188, Campus:WN)} ))} 

db( COURSEC) = 

{(i501, (CNumber: 110.001, CName:Accounting Principle, Requires: { } )), 

(i502 , (CNumber: 110.105, Taxation, Requires: { i501} )), 


(i503, (CNumber : 156.100, Principle of Marketing, Requires: {} )) , 

(i504, (CNumber: 157.221 , Information Systems Analysis, Requires: {i503 })) , 

(i505, (CNumber: 157.331 , Database Concepts, Requires: {i504 } ))} 

db(LECTU RERC ) = 

{ ( i1o1, (PersonID : 1001 , (FN ame: John, LN ame:Dever , Title: {Professor , Dr} ), 

Specialization: Commercial Law, Department:ii402, Campus:ALB)), 

(i102, (PersonID : 1002, (FName: Allan , LName:Barry, Ti t le: {Senior Lecturer, Dr} ), 

Specialization: Pricing, Department :ii403 , Campus:PN)), 

(i103, (PersonID : 2010, (FName: Shirley, LName:Churchill , Ti t le: {Lecturer} ), 

Specializat ion: Databases, Department:ii40l , Campus: WN)), 

49 

(i104, (PersonID : 1002, (FName: Jerry, LName:Hubbard , Title: {HoD Professor , Dr} ) , 

Specialization: Distributed Systems, Department :ii40l , Campus:P N)), 

(i105 , (PersonID : 1002, (FName: J ames , LName:Hooks, Tit le: {Lecturer} ), 

Specialization: Culture and Accounting, Department:ii402, Campus:WN)) } 

db(PROJECTC ) = 

{(i101 , (PNumber: pOOOl , PName: DIMO, Begin: Jun 2000, 

End: Dec 2002 , Descript ion: Mult ilevel transaction ... , 

Budget : 100000,Primarylnvestigator: {i103, i104} )), 

(i102, (PNumber: p0201 ,PName: Consumer Behavior , Begin: J an 2002, 

End: Dec 2002, Description: The patterns of ... , 

Budget: 3000,Primarylnvestigator : { i102})), 

(i703, (PNumber: p0301 ,PName: Small Business Accounting, Begin: Feb 2003 , 

End: Dec 2003, Description: Small businesses are ... , 

Budget: 5000,Primarylnvestigator : { iio1 , i uo}))} 


50 

db(PAPERC) = 

{(is01, (Course:: iso1, Semester:i201, Campus:ALB, Lecturer:i101, 

{(Day:Mon, Hour: lOam, Room:i304), 

(Day:Thur, Hour: 2pm, Room:i304)} )), 

(iso2, (Course: : iso3, Semester:i202, Campus:ALB, Lecturer:i102, 

{(Day:Wed, Hour: 9am, Room:i305), 

(Day:Fri, Hour: lpm, Room:i305)} )), 

(iso3, (Course: : iso2, Semester:i203, Campus:WN, Lecturer:i105, 

{(Day:Tues, Hour: lpm, Room:i302), 

(Day:Thur, Hour: 3pm, Room:i302)} )), 

(iso4, (Course: : iso4, Semester:i204, Campus:WN, Lecturer:i103, 

{(Day: Mon, Hour: 9am, Room:i302), 

(Day:Wed, Hour: lpm, Room:i302)}))} 

(isos, (Course: : iso4, Semester:i205, Campus:PN, Lecturer:i104, 

{(Day:Mon, Hour: 11am, Room:i301), 

(Day:Wed, Hour: 2pm, Room:i301)} )), 

(iso6 , (Course: : isos, Semester:i205, Campus:PN, Lecturer:i104, 

{(Day:Mon, Hour: llam, Room:i301), 

(Day:Wed, Hour: 2pm, Room:i303)} )), 

(is07, (Course: : isos, Semester:i207, Campus:PN, Lecturer:i104, 

{(Day:Tues, Hour: lOam, Room:i301), 

(Day:Thur, Hour: 3pm, Room:i303)} ))} 

db(STUDENTC) = 

{(i105, (PersonID:4322,StudentID: 99368,Supervisor: iio1, 

Major: i402, Minor: i4o3, {(Paper : iso1, Grade: A+), 


(Paper: iso2, Grade: B+)})) , 

(i101, (PersonID:4198, StudentID: 00695, Supervisor: i105, 

Major: i402, Minor: i1101, {(Paper : iso3 , Grade: B) , 

(Paper: iso4 , Grade: A)})) , 

(i1os, (PersonlD:4077, StudentlD: 01396, Supervisor: i105, 

Major: i401, Minor: NUL,{(Paper :isos, Grade: A-), 

(Paper: iso1, Grade:A)})) 

(i109, (PersonlD:4198, StudentlD: 02396, Supervisor: i105, 

fviajor: i401, lVIinor: NU L , {(Paper : isos, Grade: B+), 

(Paper: iso1, Grade: NU L)}) )} 

4.3 Queries 

51 

D 

For object oriented databases we must define some bas ic query algebra that can be used 

to define queries on database. To define such a query algebra we have to refer to elements 

of classes and their components. For this I will define path expressions. The discussion in 

the following subsection will cover the situations of path expressions defined on elements of 

base types, record types, set types, union types and references. 

4.3.1 Path Expressions 

In the relational data model, only the record type constructor is used and each attribute is 

defined on a base type. In the case of the object oriented data model, the whole underlying 

type system includes not only record type constructors but also some other bulk type 

constructors. A type system can be expressed as [28, p. 9]: 


52 

In fact, there could be further type constructors, e.g. for lists and multisets, but I will 

concentrate on this simplified type system here. Details of the above type system have been 

provided in Section 4.1. In object oriented databases, for a given class C with structure 

expression expc and representation type Tc, a database instance of a class C satisfies 

db(C) ~ {(i,v) Ii E dom(ID ),v E dom(Tc)} 

The complete definition of a database (or instance of a database schema) was given in 

Definition 4.5. We define the path expressions as below. 

D efinition 4 .6. Let C be a class with structure expression expc and representation type 

Tc . Path expressions for class C (or expc) may have the following formats: 

(i) ident of type ID, 

(ii) value of type Tc, 

(iii) if expc uses a record type, i.e. expc = (a1 : exp1 , . . . , an : expn) , then we get path 

expressions: 

• value.°'i of type n which is a representative type for expi and 1 ::::; i ::::; n, 

• value!ai of type To if ai is a reference to class D, i.e. ai : expi = ai : D , 

• value.ai.pathi where pathi is a path expression for expi, 

• value!ai.pathi where ai is a reference to class D , i.e. ai : expi = ai : D and pathi 

is a path expression for exp D. 

(iv) if expc use a union type, i.e. expc = (a1 : exp1) U · · · U (an: expn), then we get path 

expression: 

• value.ai of type n which is the representation type for expi, 

• value.ai.pathi where pathi is a path expression for eXPi· 

( v) if ex pc uses a set type, i.e. expc = {exp} , then we obtain path expression: 


53 

• value of type Tc= {exp}, 

• value.path where path is a path expression corresponding to exp. 

(vi) if expc uses only a reference , i.e. expc = r : D, then we obtain path expression: 

value!r.path where path is path expression for the class D. 

Remark: If r : D appears in expc, i. e. expc = ... , ai : r : D , ... , we get Ti = ID . Thus 

path expressions are: 

• value!ai.r of type Tv, 

• If TD has structure expression expD, eg. expD = (ail : expi1, ... , ain : eXPin)· Then 

this situation is treated as if r : Dis replaced by expv. Therefore the structure expres­

sion of class C can be represented with expc = ... , ai : expv, .... The representat ion 

type of class C is Tc = ... ai : TD .... Then we get path expression 

value!ai 

to refer to the value of type Tv, i.e. if expv = (ai 1 : expi 1 , ••• , ain : expin), We can 

use 

to refer to the value of the component in expD and handle it as if there were no 

references. 

If path is a path expression, we use Tpath to denote the representation type of the structure 

expression that is identified by the path. 

Example 4.5. We choose class LECTURERC from university database schema. Class 

LECTURERC IsA PERSONC Struct (PersonID: NAT, Name: (FName: STRING,LName: ....__...., '-v-' 
path 1 path2 

STRING, Title: {STRING}) , Specialization: STRING, Department: DEPARTMENTC, 
'-v-' "--v-"' 
path3 

Campus: STRING) 
path5 

There are some path expressions defined on class LECTURERC as following: 


54 

(i) ident 

(ii) value 

(iii) path1 =value.Name 

path2 = value.Name.FName 

path3 =value.Title 

path4 = value.Specialization 

(iv) path5 = value!Department 

path~ = value!Department.Name 

path5 refers to the identifier of the department being referenced and path~ refers to 

the name of the referenced department. 

D 

4.3.2 Queries 

From application information we can get a set of queries accessing databases. For the 

relational model, we can use relational algebra to model queries and to optimize queries. 

We must provide a model for querying object oriented databases. Basically, an algebra 

in the general mathematical sense is given by a set A and a set of operations op on A 

[29]. In the object oriented model, each query results in a set of pairs (id, v), where id is 

an identifier and v is a value of some proper type. The value v may contain identifiers, 

which must appear in the database, to which the query is applied. More generally, it would 

also be possible that the identifiers appear in the query result , but much more complicated 

queries are not handled here. We refer to such a set of pairs as a "class instance". A query 

Q on S consists of a structure expression expQ called answer schema (with all references 

pointing to classes in S) and an algebra expression q, i.e. a query Q is defined in the form 

Q = (expQ, q). Every operator in the query algebra accepts (one or two) class instances as 

arguments and returns another class instance as a result [25]. 


55 

We start by saying that every class name CE S can be considered as a query q = C. The 

answer schema is expc itself and database instance db is mapped via query q to db(Q) , 

which is a new class instance db(Q) extending db over Stoa new database, which contains 

a set of new objects with new identifiers created for each of them. Also, pairs (v : T) with 

value v of type T is considered as a query. 

Definition 4.7. Let S be a database schema, db be a database instance over S , id(db) be 

the set of all identifiers appearing in db , db( Q) be the resulting class instance of evaluating 

query algebra Q = (expQ, q). There are two basic query algebra operations: 

(i) q = C with C is a class name appeared in S with a query expressed as Q = (expc, q), 

result ing in db(Q) = { (idw,v) J idw :: ID ,idw ~ id(db).3id :: ID .(id,v) E db(C)}. 

(ii ) q = (v : T) with a type T and a value v of type T and Q = (T,q), resulting in 

db(Q) = {(idw, v) J idw :: ID f\ idw ~ id(db)} . 

Example 4.6. the university database instance db in Example 4.4 contains an instance 

db(RooMC) of class RooMC : 

db(RooMC) = { ( i 301 , (Building : SSLB, NO : 2, Campus : PN)), 

(i302, (Building: SST, NO: 1, Campus: WN)), 

(i303, (Building: Marsdon, NO: 1, Campus: PN)), 

(i304, (Building: AH, NO : 2, Campus: ALB)) , 

(i305, (Building: BSC, NO : 203 , Campus: ALB))} 

A query operation q =ROOMC for a query Q1 (eXPRooMC, RooMC) with a resulting 


56 

instance of Q as 

db( Qi) = {(i1001 , (Building: SSLB, NO: 2, Campus: PN)), 

(i1002, (Building: SST, NO: 1, Campus: WN)), 

(i1003, (Building: Marsdon, NO: 1, Campus: PN)), 

(i1004, (Building: AH, NO : 2, Campus: ALB)), 

(i1005, (Building: BSC, NO : 203, Campus: ALB))} 

Note the result of the query on db(C) is a new class that contains a set of objects with new 

identifiers which have not appeared in the database. But the values for all the attributes 

are the same. 

Another query Q2 =((Building: STRING, NO : NAT, Campus: STRING),(Building: SSLB, 

NO : 2, Campus: PN)) results: 

db(Q2) = {(igsoo1, (Building: SSLB, NO: 2, Campus: PN))} D 

We can define a selection operation with a query algebra q. To do this we need to use 

path expressions path for class C to define selection formulae. Path expressions have been 

defined in the previous subsection. 

Definition 4.8. Let C be a class, path be path expressions on class C, take either 

• two path expression path1, path2 of C of some type, 

• or a path expression path on class C and a value v of the type of path. 

Selection formulae c.p can be defined in either of the following forms: 

• c.p = path1 = path2, 

• or c.p =path = v. 


57 

Definition 4.9. (selection) 

Let S be a database schema, db be a database instance over S , id(db) be the set of all 

identifiers appearing in db. Take any query Q = (expQ, q) and a selection formula c.p for 

expQ we get: 

• a query algebra q' = O"cp(Q) for selection operation, 

• a query Q' = (expQ, q') with expQ' = expQ, 

• evaluating Q' on db results in a database instance: 

db(Q') = {(idw,v ) I idw :: ID /\ idw rt id(db).3id :: ID.(id ,v) E db(Q) /\ c.p(v) =true } 

Example 4. 7. There is a database instance db over the university database schema. With 

a query algebra q = PERSONC we have a query Q = (expPERsoNC , q) , then db is extended 

by adding a db( Q) within which all the objects' identifiers have not occurred before. 

Then we have a selection formula: c.p = value.Name.FName = 'John' . Using query operation 

O"cp(Q) we get a result as 

db(Q') = {(i95101 , (PersonID: 1001 , (FName: John ,LName: Dever ,{Professor,Dr} ), 

Address: 28 Victoria Av, DoB:16/ Jan/ 1953))} D 

Definition 4.10. (renaming operation) 

Let Q = (expQ, q) be a query, attr(Q) denote all the attribute names appearing in expQ 

with ai E attr(Q). A rename operation renames some of the attributes of a class. We have 

following expressions: 

• query algebra q' = Pai>--+bi, ... ,an>--+bn(Q), 

• new query Q' = (expQ, q'), 

• Evaluating query Q' on db we get: 

db(Q') = {(idw, v) I idw :: ID /\ idw rt id(db) .3id :: ID.( id, v) E db(Q)} 


58 

with bi as a new name for attribute ai· 

In order to define generalized projection, we need to use the definition of super type that is 

introduced in [26]. We use the form Te2 ::; Te1 to express that Te1 is a super type of type 

Te2 • This expression indicates a mapping: 

We can get the following definition for generalized projection. 

Definition 4.11. (generalized projection) 

Let Q = (expQ, q) be a query, expQ' be a new structure expression which is a super structure 

expression of expQ, i.e. TQ ::; TQ' for the representation types. A generalized projection is 

a mapping: 

results in a new query: Q' = (eXPQ', 1fQ1(Q)), 

with db(Q') = { (idw, v') I idw :: ID /\ idw rt id(db).3id :: I D.(id, v) E db(Q).v' = 7fg ( v)} . 

Definition 4.12. (join) 

Let Q1 = ( expQ 1 , q1), Q2 = ( expQ2 , q2) be two queries, exp be a common super structure 

expression with eXPQ; ::; exp (i = 1, 2) , then there exists: 

• a new structure for join operation exp1 f><lexp exp2, 

• and a new instance 

db( Q1 f><lexp Q2) = { ( idw, V) I idw :: ID/\ idw rt id(db ).3id :: ID.( id1, vi) E db( Qi). 

( id2, v2) E db( Q2).7r~~( vi) = 7r~~( v2) /\ 7rg~1><JQ2 ( v) = vi /\ 

7rg~rxiQ2 
( v) = v2} 

Definition 4.13. (set operations) 

Let Qi = (expQ, qi) , Q2 = (expQ, q2) be two given queries. 


• union operation on sets is expressed with structure expression: 

with a new query Qi U Q2 = (expQ, qi U q2 )· Evaluating q we get 

db(Qi UQ2) = {(idw ,v) I idw :: ID l\idw f/.id(db).3id:: ID. (idi,v) E db(Qi) 

V (id2,v) E db(Q2)} 

• intersection of two sets is expressed with structure expression: 

with a new query Qin Q2 = (expQ, qi n q2)· Evaluating q we get 

db( Qin Q2) = { (idw, v) I idw :: ID /\ idw ¢ id(db) .~id :: I D.(idi, v) E db( Qi ) 

A (i d2 ,v) E db(Q2)} 

• difference operation on sets is expressed with structure expression: 

with a new query Qi U Q2 = (expQ, qi - q2) · Evaluating q we get 

db( Qi - Q2) = {(idw, v) I idw :: ID /\ idw ¢ id(db). ~id :: I D. (idi, v) E db( Qi ) 

I\ (idi,v) ¢ db(Q2)} 

59 

With all the definitions discussed above we can define queries that access databases. In 

particular, we can define simple predicates for horizontal fragmentation in object oriented 

databases. 


Chapter 5 

Fragmentation Operations in 
Object Oriented Databases 

Distribution design involves making decisions on the fragmentation and placement of data 

across the sites of a computer network. The design process has two phases: fragmentation 

and allocation. Fragmentation of object oriented database systems is a complex problem 

because: 

• it involves set-valued and reference attributes, 

• inheritance (IsA) relationships are employed, 

• complex data types should be considered. 

In the object oriented environment, fragmenting a class may use three techniques: horizontal 

and vertical fragmentation as well as splitt ing techniques. It is also possible to combine all 

these techniques to perform mixed (hybrid) fragmentation of a class. 

Horizontal fragm entation exploits databases to be defined by sets. It partitions a class into 

a set of new classe