Copyright is owned by the Author of the thesis.  Permission is given for 
a copy to be downloaded by an individual for the purpose of research and 
private study only.  The thesis may not be reproduced elsewhere without 
the permission of the Author. 
 

LOGIC BASED QUERIES FOR XML DATABASES 

By 

Qing Wang 

SUBMITTED IN PARTIAL FULFILLMENT OF THE 

REQUIREMENTS FOR THE DEGREE OF 

MASTER OF INFORMATION SYSTEMS 

AT 

MASSEY UNIVERSITY 

PALMERSTON NORTH, NEW ZEALAND 

DECEMBER 2005 

© Copyright by Qing Wang, 2005 


MASSEY UNIVERSITY 

DEPARTMENT OF 

INFORMATION SYSTEMS 

The undersigned hereby certify that they have read and recommend to 

the Department of Information Systems for acceptance a thesis entitled "Logic 

Based Queries for XML Databases" by Qing Wang in partial fulfillment 

of the requirements for the degree of Master of Information Systems. 

Dated: December 2005 

Supervisor: 
Klaus-Dieter Schewe 

Readers: 

ii 


MASSEY UNIVERSITY 

Date: December 2005 

Author: Qing Wang 

Title: Logic Based Queries for XML Databases 

Department: Information Systems 

Degree: M.I.S. Convocation: February Year: 2006 

Permission is herewith granted to Massey University to circulate and to have 
copied for non-commercial purposes, at its discretion, the above title upon the request 
of individuals or institutions. 

Signature of Author 

THE AUTHOR RESERVES OTHER PUBLICATION RIGHTS, AND NEITHER 
T HE THESIS NOR EXTENSIVE EXTRACTS FROM IT MAY BE PRINTED OR 
OTHERWISE REPRODUCED WITHOUT THE AUTHOR'S WRITTEN PERMISSION. 

THE AUTHOR ATTESTS THAT PERMISSION HAS BEEN OBTAINED FOR THE 
USE OF ANY COPYRIGHTED MATERIAL APPEARING IN THIS THESIS (OTHER 
THAN BRIEF EXCERPTS REQUIRING ONLY PROPER ACKNOWLEDGEMENT IN 
SCHOLARLY WRITING) AND THAT ALL SUCH USE IS CLEARLY ACKNOWLEDGED. 

iii 


Table of Contents 

Table of Contents 

Acknowledgements 

1 Introduction 
1.1 XML Data Challenges 
1.2 Objectives . . . . . . . . . . . ...... . 

1.2.1 The Semantic XML Object Model. 
1.2.2 The XML Identity Query Language . 
1.2.3 The XML Calculus Language 

1.3 Principals . . . . . . . 
1.3.1 Object Bases . 
1.3.2 Logic Paradigm 

1.4 Overview of the Thesis 

2 The Semantic XML Object Model 
2.1 Object Types and Objects .. . . 
2.2 XML Databases on Object Bases . 

2.2.1 Database Instances ... .. 
2.2.2 Class Schemata and Tuples 
2.2.3 Database Schemata . 

2.3 The Semantics of Objects 
2.4 Object Connectivity ... 

2.4.1 Dominant Relation 
2.4.2 Class Hierarchy .. 
2.4.3 Object Order . .. 

2.5 Another Representation of Objects 

3 XML Identity Query Language: A Framework 
3.1 Syntax ............... . 
3.2 Semantics . . . . . . . . . . . .. . 

3.2.1 Valuations and Satisfaction 

lV 

iv 

vii 

1 
1 

3 
3 
4 

4 

5 
5 
6 
6 

8 
8 

12 
13 
14 
19 
22 
23 
23 
25 
26 
27 

30 
30 
33 
34 


3.3 
3.4 

3.2.2 
3.2.3 

Semantics of Queries . . . . . . 
Union and Intersection Types . 

3.2.4 Inflationary Fixpoint Operator . 
3.2.5 Safe Queries . . . . . . . . . 
Object Creation Mechanism .... 
Key Characteristics and Examples . 
3.4.1 Nest and Unnest .. 
3.4.2 Structural Recursion 
3.4.3 Path Expressions 
3.4.4 Schema Query . 
3.4.5 Join Operation . 
3.4.6 Sequences .... 
3.4. 7 Function Symbols . 

4 Further Issues on XML Identifier Query Language 
4.1 Duplicate Elimination . . ... . . 

4.1.1 Structured Value Duplicates 
4.1.2 A Main Result 

4.2 Copy Elimination . . . . . . . . . . 
4.3 Database Transformations . . . . . 

4.3.1 
4.3.2 
4.3.3 

Determinate Transformations 
Hybrid-deterministic Transformations . 
Completeness . . . . . . . . . . . . . . 

5 The XML Calculus Language 
5.1 Preliminaries . . ......... . . 

5.1.1 A Fundamental Type System 
5.1. 2 Higher-order Constructs 

5.2 Syntax . ..... ... ... . 
5.2.1 Terms and Variables . . 
5.2.2 Well-Formed Formulae . 
5.2.3 Free and Bound Variables 

5.3 Semantics . . . . . . . 
5. 3 .1 Interpretation . . 
5.3.2 Query Mapping . 
5.3.3 Object Creation . 
5.3.4 Active Domain Semantics 

5.4 Logical Reflection . . . . . . 
5.5 Simulation ......... . 

5.5.1 
5.5.2 
5.5.3 

Recursive structures 
Nest and Unnest 
If-Then-Else . . . . . 

V 

36 
37 
40 
41 
44 
47 
47 
49 
50 
51 
52 
53 
54 

55 
55 
5G 
57 
61 
63 
64 
65 
66 

67 
67 
68 
70 
72 
72 

73 
74 
75 
76 
78 

78 

79 
80 
84 
85 
85 
88 


5.5.4 
5.5.5 
5.5.6 
5.5.7 

Class Names and Atomic Values . 
Queries over Object Order 
Aggregation Functions 
Duplicate Elimination 

6 Related Work 
6.1 Semantic Data Models 
6.2 Object-Creating Languages 
6.3 Pure Declarative Languages 

7 Conclusions 

A Running XML Database 

Bibliography 

Vl 

89 
90 
90 
91 

92 
92 
93 
94 

96 

98 

101 


Acknowledgements 

I would like to express my gratitude to the many people who have made this thesis possible. 

Most of a ll I would like to thank Professor Klaus-Dieter Schewe, my supervisor, who 

recommended this interesting research topic to me when I was in confusion at the early 
stage of Masters study, and introduced me to the world of logic theory. Without his patient 
guidance and stimulating suggestions, I would have been lost somewhere. I am also very 
grateful to Dr. Sven Hartmann for being a constant source of support and encouragement 
at all t imes, which greatly inspired me to overcome many difficulties in my life. 

Special thanks go to Thu Trinh and Madre Chrystall. T hey not only dedicated their time 
to being a reader of my manuscript, but also provided many helpful comments. Furthermore, 

I wish to thank all the many teachers, friends and classmates who helped me get where I am 
today although it is impossible to mention them here individually. 

In particular, I wish to convey my thanks to Mrs Clark, Massey University, and Mr 
Todd and the NZVCC , for being awarded the Lovell & Berys Clark Scholarship, Massey 
University Mastcratc Scholarship and Todd Award for Excellent in 2005. These scholarships 
arc crucial to the successful completion of this research because they not only free me from 
financial worries, but also motivate me towards higher goals for this research. 

Lastly, and most importantly, I am forever indebted to my family for their understanding, 

endless patience and encouragement when it was most required, especially my three-year-old 
son Jiajun, he always seems quite understanding toward my study although he is not really 
aware of that. 

Wang, Qing 

December 1, 2005 

vii 


Chapter 1 

Introduction 

With the significant increase of web-based applications, the eXtensible Markup Language 

(XML) , as a de facto standard for data interchange on the web, has attracted considerable at­

tention in theory and practice. XML deals with irregular and heterogeneous semi-structured 

data that give rise to trees, i.e. the predominant data structure in complex data models. 

This leads to new challenges for database research such as new data structures and models , 

and new database query languages in this area. 

1.1 XML Data Challenges 

In essence, the major challenge of XML lies in its data structure, which is greatly different 

from the traditional data structure treated as relations. A comparison with respect to the 

differences between XML data and relational data has been discussed in [17] . Generally 

speaking, XML data has the following unique characteristics: 

• XML data have irregular and heterogeneous structures leading to optional values which 

are absent in relational data; 

• Metadata that describes the structure of the data is distributed throughout the data 

in the form of markup, instead of being stored separately as in relational data; 

• XML data is nested in a hierarchy, and complex nested structures might need to be 

decomposed and reconstructed on the fly to facilitate data manipulations; 

• XML data are encoded with an intrinsic order that is an important property of data 

themselves. This means that the order property must be taken into consideration for 

modelling. 

1 


CHAPTER 1. INTRODUCTION 2 

These features require more complicated data models in comparison to the relational 

data model, which merely deals with flat and rigid relations. In particular, problems such as 

rational trees, optional cardinality, repeated clements and sequence order demand additional 

functionalities to be provided to handle XML data in an efficient and sufficient manner. 

Since the unique structures of XML data further require that XML query languages 

must have the capability for querying over schema information. In order to support this 

extra functionality at the database level, the techniques used in semantic data models have 

shown to be an appropriate approach, in which t he separate notions of schema and instance 

are combined into an explicit expression. This allows the queries of schema information to 

be handled in the same manner as querying data at the language level. 

Therefore, the first task investigated in this thesis is to design a simple and natural data 

model, which supports a rich description of XML data from a structural point of view. The 

relevant techniques for this arc derived from semantic data models. 

In the theory of relational databases, query languages have been developed mainly fol­

lowing three directions: an algebraic direction e.g. the Relational Algebra, a logical direction 

e.g. the Relational Calculus, and a logic programming direction e.g. deductive languages 

such as DATALOG. Many issues such as recursion, negation and the combination of them 

have been well investigated in relational query languages. However, some new problems have 

emerged with respect to XML query languages. 

• the possible absence of a database schema; 

• the capability to support querying over schema information; 

• restructuring and creating XML data in a flexible approach; 

• recursion under the feature of optionality. 

Not surprisingly, the second task investigated in this thesis aims at developing declarative 

and powerful XML query languages in the context of XML data. My focus in this thesis is 

on a logical direction and a logic programming direction. 


1.2. OBJECTIVES 3 

1.2 Objectives 

The main goal of this thesis is to investigate the logical grounds of query languages over 

XML databases on the basis of a representation by object bases. In general, there are three 

objectives I want to achieve. 

-Define a data model at the conceptual level for XML data aiming at capturing semantic 

capabilities on object bases. 

-Develop the formal syntax and semantics for a logic programming query language 

for XML, which exploits object identifiers as primitives in the context of XML data 

structures. 

-Propose a higher-order predicate calculus language tailored to XML data by study­

ing the mathematical techniques used in relational calculus which is regarded as a 

specification of the first-order predicate calculus. 

To clarify how the objectives are fulfilled in this thesis, we provide a brief introduction 

regarding each objective in this subsection. 

1.2.1 The Semantic XML Object Model 

The semantic XML object model aims at a natural and simple data model for XML-based 

databases that can capture and express XML data from a relatively high abstraction level 

with well-defined semantics. Furthermore, it can also serve as a formal specification for 

further establishing XML data manipulation languages. 

The proposed semantic XML object model mainly has the following key features: 

• It provides considerably rich structuring capability by using objects as basic semantic 

data abstractions that are associated with concise but flexible interpretations. 

• It provides classes as the only construct to accommodate objects, thus enabling richness 

through class tuples and class schemata that link XML database instances and XML 

database schemata together. 


CHAPTER 1. INTRODUCTION 4 

• It provides attributes at the object level, which arc used to incorporate values and 

objects into a unit, and attributes at the class level, which facilitate to capture objects 

having common semantics but variant structures within a specific class schema. 

• It provides object identifiers that are used in a dual role as both structure and order 

primitive. 

1.2.2 The XML Identity Query Language 

The XML Identity Query Language (XIQL) is greatly influenced by the spirit of IQL in [6, 7] 

on a key design point that objects can be manipulated through having object identifiers as a 

powerful programming primitive. New objects can be created in a manner that is essentially 

equivalent to Skolcm function techniques. However, more work needs to be done with respect 

to IQL to reflect the characteristics of XML data. 

Compared with IQL, XIQL provides the following additional functionalities: 

• XIQL incorporates the capability for type creation in queries. On one hand new types 

can be easily added into a database, while on the other hand types can be queried as 

a special part of an instance. 

• XIQL enriches the language with union types, which can capture the optionality fea­

ture of XML data. Further, the underlying intersection types facilitate the flexible 

expressions for objects with class atoms. 

• XIQL treats set and non-set variables in a unified fashion unless a particular interest 

is indicated. In this case, a set operator can be used to handle the transformation 

between set and non-set variables. 

1.2.3 The XML Calculus Language 

The XML calculus language is a pure declarative language incorporating higher-order logics 

on the basis of the SXO model, in which semantic structures can be elegantly encapsulated 

into obj ects. 


1.3. PRINCIPALS 5 

The design goal of the XML calculus language is to obtain highly expressive power, but 

using a relatively simple interpretation. To achieve this goal, the combination of a higher­

order syntax and a first-order semantics within a language becomes a suitable choice for 

the proposed XML query language. Furthermore, through a connection of two semantics of 

objects, an equation between the first-order and higher-order interpretations can be estab­

lished. 

To handle higher-order logics, it is essential to define a fundamental type system in the 

language, on which higher-order notions can be precisely developed. Moreover, for an XML 

query language, the ability to query schema information must be taken into consideration. 

For this reason, type variables are introduced into the type system. 

Another interesting design point is to naturally reflect fixpoint semantics in this language 

by means of universal quantification, which is associated with variables under restricted 

domains. With this capability, the expressive power of the language will be greatly enhanced. 

1.3 Principals 

Throughout this thesis two main principles will be followed: considering XML databases as 

object bases and following a logic paradigm for querying them. 

1.3.1 Object Bases 

The data model developed in my thesis is built upon object bases. Instead of encapsulating 

behavioral aspects of objects in object-oriented databases, this data model concentrates on 

encapsulating structural aspects from a semantic point of view. The reasons for preference 

for an object-based model, rather than a relational model, principally arise from the following 

considerations. 

Firstly, as argued in [63], the relational data model fails to capture much of the semantics 

associated with data because the fundamental modelling construct, i.e. the tuple does not 

constitute an atomic semantic unit. Hence, additional integrity constraints are required to 

establish the intended semantics of the database. In contrast to tuples, the features of objects 

determine themselves to be the natural semantic carriers in XML databases. 

Secondly, to tackle the irregular and heterogenous structure of XML data, objects grouped 


CHAPTER 1. INTRODUCTION 6 

in a class can provide richer structure modelling facilities than tuples in a relation. By spec­

ifying classes, which represent collections of objects encoded with heterogenous structures, 

a uniform framework on object bases can be established. 

Finally, it is customary that XML data has a graphical representation, which is also 

thought of as a representation for objects in the object-based model. Hence, they coincide 

with respect to this intuition. Furthermore, both objects and graphs can be constructed to 

be self-descriptive, which is an important feature identified in XML data. However, a tabular 

representation provided in the relational data model is not suitable in this case due to the 

separation of schema and instance. 

In addition, although many XML query languages have been proposed such as XQuery 

[73], XQL [56], LOREL [8] and XML-QL [29], most of them aim at providing the XML-era 

analogue of SQL. However, I believe that the investigation on XML query languages from 

a perspective of object bases has not been sufficiently conducted yet so far, especially for 

exploiting object identifiers as structure and order primit ives. 

1.3.2 Logic Paradigm 

In the thesis, a study towards logic-based query languages in the framework of an object­

based XML data model will be conducted. The reason for considering the logical ground of 

query languages is based on the following two points: 

• Firstly, logic-based approaches have been recognized as a means to provide remarkable 

simplicity and conciseness of syntax for query languages, as were the case in relational 

calculus and Datalog for the relational data model. 

• Furthermore, logic-based approaches are essential for evaluating the expressive power 

and computable complexity of query languages. 

Most of the current XML query languages focus on path-based query processing with 

pattern matching techniques. Research regarding theory, well-defined semantics and expres­

siveness is still needed. 

1.4 Overview of the Thesis 

The remainder of the thesis is organized according to the following logical sequence. 


1.4. OVERVIEW OF THE THESIS 7 

In Chapter 2, we focus on modelling XML data in a natural way by applying techniques 

of semantic data model. The developed data model, called the SXO model, is built upon 

object bases, in which identifier and value semantics of objects are formalized. Moreover, 

XML database instances and schemata are formalized along with the notions of class schema 

and class tuple. At the end of the chapter, some insights are given into dominant rela­

tion, class hierarchy and object order. Based on this model, XIQL, a simple and powerful 

logic programming language for manipulating XML data, is proposed in Chapter 3. A basic 

framework of this language with the formal syntax and semantics is presented there. Fur­

thermore, we discuss the object creation mechanism adopted in XIQL in details. Chapter 4 

investigates the issues about structured value duplicates and copies. It turns out that XIQL 

is complete with respect to determinate transformation since both structured value dupli­

cates and copies can be eliminated in XIQL queries. To obtain a pure declarative XML query 

language, Chapter 5 incorporates higher order logics into a novel XML query language called 

XML calculus as a counterpart of relational calculus in the relational data model. This lan­

guage is developed with higher-order syntax and first-order semantics. After introducing the 

formal syntax and semantics of the language, a logical reflection regarding the relationship 

between first-order semantics and higher-order semantics is discussed. Chapter 6 reviews 

the literature from three aspects: semantic data models, object-creating languages and pure 

declarative languages, as a complement to some related work having been introduced during 

Chapter 2, 3, 4 and 5. In the end, we summarize the main results exploited in this thesis in 

Chapter 7. Several issues left for future work are also identified at the end of Chapter 7. 

Throughout the thesis we will use a simple XML database as presented in Figure A.1, 

Figure A.2 and Figure A.3 of Appendix A. All the data stem from XML Query Use Case [72] 

provided by the World Wide Web Consortium (W3C) with some minor modifications. The 

purpose of the running XML database is to apply the proposed query languages to specific 

application scenarios, and illustrate the capabilities of our languages.