Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author. On Fast and Space-Efficient Database Normalization A dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University, Palmerston North, New Zealand Henning Koehler 2007 Abstract A common approach in designing relational databases is to start with a relation schema, which is then decomposed into multiple subschemas. A good choice of sub- schemas can often be determined using integrity constraints defined on the schema. Two central questions arise in this context. The first issue is what decompositions should be called “good”, i.e., what normal form should be used. The second issue is how to find a decomposition into the desired form. These question have been the subject of intensive research since relational databases came to life. A large number of normal forms have been proposed, and methods for their computation given. However, some of the most popular proposals still have problems: • algorithms for finding decompositions are inefficient • dependency preserving decompositions do not always exist • decompositions need not be optimal w.r.t. redundancy/space/update anomalies We will address these issues in this work by • designing efficient algorithms for finding dependency preserving decompositions • proposing a new normal form which minimizes overall storage space This new normal form is then characterized syntactically, and shown to extend existing normal forms. iii Acknowledgement I would like to thank my supervisor Sven Hartmann for his constant support, ranging from fruitful discussions and extensive proofreading to help with administrative hurdles and moral support. Furthermore, my thanks go to my co-supervisor Klaus-Dieter Schewe, and to my colleagues Sebastian Link and Markus Kirchberg, who helped and supported me in various forms. I dedicate this thesis to my parents, Klaus and Waltraud Koehler, and to my partner Jane Zhao. v Contents 1 Introduction 3 1.1 Relational Databases and Dependencies . . . . . . . . . . . . . . . . . . . . 3 1.2 Normal Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Contributions and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Linear Resolution and Faithful BCNF Decomposition 10 2.1 Linear Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 The Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.2 Improvements and Complexity Analysis . . . . . . . . . . . . . . . 17 2.1.3 Polynomial Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.1.4 Updating the Atomic Closure . . . . . . . . . . . . . . . . . . . . . 22 2.1.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.1.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Faithful BCNF Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2.1 The Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2.2 Improvements and Complexity Analysis . . . . . . . . . . . . . . . 29 2.2.3 Partial Determination Cycles . . . . . . . . . . . . . . . . . . . . . 31 2.2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.3 Complex-Valued Databases . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.3.2 Representation Basis . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3.3 Linear Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.3.4 Faithful BCNF-Decomposition . . . . . . . . . . . . . . . . . . . . . 46 2.3.5 Lossless Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.3.6 Testing for BCNF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3 Canonical Covers 55 3.1 Hypergraph Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.1.1 Autonomous Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.1.2 Superedges and Partial Superedges . . . . . . . . . . . . . . . . . . 62 3.1.3 Computing Autonomous Sets . . . . . . . . . . . . . . . . . . . . . 63 3.2 Computing all Canonical Covers . . . . . . . . . . . . . . . . . . . . . . . . 65 3.2.1 Partial Covers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.2.2 Relative Covers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.2.3 Implication Dependencies . . . . . . . . . . . . . . . . . . . . . . . 69 3.2.4 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.2.5 Improvements and Complexity Analysis . . . . . . . . . . . . . . . 74 1 3.2.6 LR-reduced Covers . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.3 Size and Number of Non-redundant Covers . . . . . . . . . . . . . . . . . . 76 3.4 Partial Implication Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.5 Essential FDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.5.1 Deriving essential FDs . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.5.2 Testing essentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4 Domination Normal Form 91 4.1 Minimization as Normal Form . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.1.1 Ordering by Size of Instances . . . . . . . . . . . . . . . . . . . . . 92 4.1.2 Ordering by Attribute Count of Instances . . . . . . . . . . . . . . 95 4.1.3 Ordering by Containing Schema Closures . . . . . . . . . . . . . . . 96 4.2 Equivalence of Orderings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.2.1 Size vs. Attribute Count . . . . . . . . . . . . . . . . . . . . . . . . 98 4.2.2 Attribute Count vs. Containing Schema Closures - Part I . . . . . . 99 4.2.3 Subset Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.2.4 Attribute Count vs. Containing Schema Closures - Part II . . . . . 107 4.3 Relationship to other Normal Forms . . . . . . . . . . . . . . . . . . . . . . 110 4.3.1 A Detailed Example . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.4 Computing Domination Normal Form . . . . . . . . . . . . . . . . . . . . . 114 4.4.1 Dependency Preserving DNF . . . . . . . . . . . . . . . . . . . . . 116 4.5 Combining Normal Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 4.5.1 DNF and BCNF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 4.5.2 DNF and EKNF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5 Summary 129 5.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 5.2 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 2 Chapter 1 Introduction In this chapter we will give a brief introduction to relational database theory, outlining challenges and how we will deal with them in this work. 1.1 Relational Databases and Dependencies A number of different database models have been suggested, such as relational, complex- valued or object-oriented databases. In particular, XML-Databases (which are a certain type of complex-valued databases) have received much attention in recent years. While the relational model is one of the oldest, relational databases are still the most common ones. The main advantage of the relational model over more powerful models is its sim- plicity. By approaching a problem, which is relevant in different data models, in the relational model first, we can concentrate on the core of the problem. Once solved for the relational case, the results can often be extended to more complex database models, with the relational model serving as common denominator. This approach seems more suitable than tackling issues in a complex model directly. We therefore will mainly focus on the relational model in this work. We begin by introducing some basic terms. A relational database schema consists of a set of relation schemas. A relation schema R = {A1, A2, . . . , An} is a finite set of attributes . Each attribute Ai has a domain dom(Ai) associated with it. Domains are arbitrary sets, but unless explicitly stated otherwise, we will assume that domains are countably infinite. A relation r over a relation schema R is a finite or infinite set of tuples, and each element ei ∈ dom(Ai) of the tuple corresponds to one attribute Ai ∈ R. Relations over a schema are also commonly referred to as schema instances or tables. Sets of schema instances, one for each relation schema in a database schema, are called database instances. A vital tool for managing data are integrity constraints . They describe, usually in a syntactic manner, what instances of a database schema are acceptable or valid. Formally, an integrity constraint on R is a function mapping relations r on R to {True, False}. We say that a constraint holds on r if it maps r to True. Constraints are usually used to avoid storing instances with inconsistent data. By only accepting updates which lead to valid instances, inconsistencies can be detected immediately. Many different types of integrity constraints have been suggested in the 3 literature. In this work we will focus on the most common type, that is on functional dependencies, and in some occasions also consider multivalued and join dependencies. With each relation schema we associate a set Σ of integrity constraints, in particular functional dependencies (FD), multivalued dependencies (MVD) and join dependencies (JD). These restrict which relations over R we may store. We say that a set Σ of con- straints over R implies a constraint (or set of constraints) c, written Σ ² c, if c holds on every relation r over R for which all constraints in Σ hold. If two sets of constraints Σ and Σ′ imply each other, we call Σ a cover of Σ′ (and vice versa). We say that a FD X → Y ∈ Σ is redundant in Σ, if Σ \ {X → Y } implies X → Y (and thus is still a cover of Σ). At this point we need to consider two options: If we allow relations to be infinite, we get a different notion of implication than we get when considering only finite relations. In the latter case, implication is commonly referred to as finite implication, and these notions of implications can be different [11]. In this work we shall consider only finite relations. However, as implication and finite implication are actually the same for functional and join dependencies [33], all our results from chapters 2 and 3 will hold for infinite relations as well. In chapter 4 we will compare relations w.r.t. their size, which is best done for finite relations only. Alternatively, we could allow infinite relations in principal, but only consider finite ones in our comparison, which would again lead to the same results. A functional dependency on R is an expression of the form X → Y (read “X deter- mines Y ”) where X and Y are subsets of R. For attribute sets X,Y and attribute A we will write XY short for X ∪ Y and A short for {A}. We say that a FD X → Y holds on a relation r over R if every pair of tuples in r that coincides on all attributes in X also coincides on all attributes in Y . We call a FD X → Y trivial if Y ⊆ X. Trivial FDs are the only FDs which hold on every relation. A set X ⊆ R is a key of R w.r.t. a set Σ of integrity constraints on R, if Σ implies X → R. Note that some authors use the term ’key’ only for minimal keys, and call keys which may not be minimal ’superkeys’. Implication of FDs can be characterized syntactically by the following derivation rules, known as the Armstrong Axioms: X → Y Y ⊆ X, X → Y XW → YW , X → Y Y → Z X → Z (1.1) Here the FDs on the top imply the FD at the bottom, and Y ⊆ X is a side condition which needs to hold for the first rule to be applicable. The Armstrong Axioms are known to be sound , meaning that only implied FDs can be derived (which is easy to see), and also complete, i.e., every implied FD can be derived [2]. We write Σ∗ for the set of all FDs on R implied by Σ. Note that it is often necessary to apply these derivation rules multiple times, as implied FDs usually cannot be derived in one step. This leads to a sequence of derivation steps, which can be written as derivation trees . Example 1.1. Consider the set of FDs Σ = {A → B,AB → C,BC → D} The FD A → D is implied by Σ, and can be derived using the Armstrong Axioms as follows: A → B A → AB AB → C AB → ABC ABC → BC BC → D ABC → D AB → D A → D 4 While such derivation trees could in principal be used to show that a FD X → Y is implied by a set Σ of other FDs, this is not very efficient. Instead we compute the closure X∗ of the left hand side X, which is the set of all attributes determined by X: X∗ := {A ∈ R | X → A ∈ Σ∗} In the rare case where the set Σ is not clear from the context, we write X∗Σ. Once computed, we only need to check whether the right hand side Y is a subset of X∗. Computing X∗ can be done quickly using the well-known closure algorithm, which can be implemented to run in linear time, as shown by Beeri and Bernstein in [5]. Algorithm “closure” INPUT: set of FDs Σ, attribute set X OUTPUT: X∗, the closure of X w.r.t. Σ X∗ := X while ∃X ′ → Y ∈ Σ with X ′ ⊆ X∗, Y * X∗ do X∗ := X∗Y end For a set X ⊆ R we denote the projection of r onto the attributes in X by r[X]. The join of two relations r[X] and r[Y ] is a relation on X ∪ Y : r[X] ./ r[Y ] := { t ∣∣∣∣ ∃t1 ∈ r[X], t2 ∈ r[Y ]. t[X] = t1 ∧ t[Y ] = t2 } A join dependency on R is an expression of the form ./ [R1, . . . , Rn] where the Ri are subsets of R with ⋃Ri = R. We say that the JD ./ [R1, . . . , Rn] holds on r if the decomposition {R1, . . . , Rn} is lossless for r, i.e., if r[R1] ./ . . . ./ r[Rn] = r A multivalued dependency on R is a join dependency ./ [R1, R2] with only two sub- schemas. It is usually written as X ³ Y where X = R1 ∩ R2 and Y = R1 \ R2 or Y = R2 \ R1. Note that while these two forms of writing are technically equivalent for a fixed relation R, their “natural interpretations” (i.e., how a designer would understand them) can be quite different. This has lead researchers to consider different notions of implication for MVDs in undetermined universes [8, 30]. These will not be relevant for our work though, and we stick with the standard notion of implication. For a more thorough introduction to relational database theory see e.g. [27, 33, 36]. 1.2 Normal Forms A common approach in designing relational databases is to start with a universal relation schema, which is then decomposed into multiple subschemas. A good choice of subschemas can often be determined using integrity constraints defined on the original schema. 5 To ensure that the schemas of a decomposition D = {R1, . . . , Rn} with Ri ⊆ R can hold the same data as the original schema R, we must ask for a decomposition that has the lossless join property, i.e., the join dependency ./ [R1, . . . , Rn] must be implied by Σ. Furthermore, the decomposition should be dependency preserving or faithful , i.e. the dependencies on the schemas Ri which are implied by Σ should form a cover of Σ. This allows a database management system to check constraints for individual relations only, without having to compute their join. The projection of a set Σ of FDs onto a subschema Ri ⊆ R is Σ[Ri] := {X → Y ∈ Σ | XY ⊆ Ri} Thus a decomposition D = {R1, . . . , Rn} is dependency preserving if (⋃ Σ∗[Ri] )∗ = (⋃ Σi )∗ = Σ∗ where Σi is a cover for Σ∗[Ri] (when describing the decomposition, we usually want to represent Σ∗[Ri] by a smaller cover for it). Note that it is not sufficient to only project Σ onto the Ri, rather than Σ∗. Example 1.2. Let R = ABC with constraints Σ = {A → B,B → C}. Then Σ[AC] = ∅, although A → C is a FD on AC which is implied by Σ. Normal forms are syntactic descriptions of “good” relation or database schemas. A number of normal forms have been proposed, depending on the types of integrity con- straints used. For functional dependencies, the two most important normal forms are Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF). A relation schema R is in Boyce-Codd Normal Form w.r.t. a set Σ of FDs on R if and only if for every non-trivial FD X → Y ∈ Σ the left hand side (LHS) X is a key for R, i.e., X → R ∈ Σ∗, where Σ∗ := {X → Y | X, Y ⊆ R,Σ ² X → Y } This normal form is desirable as it prevents redundancy and update anomalies caused by such redundancy [36]. Third Normal Form is weaker than BCNF, i.e., a relation schema in BCNF is also in 3NF. A relation schema R is in Third Normal Form if and only if for every non-trivial FD X → A ∈ Σ∗ we have that • the left hand side (LHS) X is a key for R, or • attribute A lies in a minimal key of R Attributes which lie in some minimal key are called prime attributes. While 3NF does not prevent redundancy, it does prevent certain types of update anomalies [36]. The advantage of 3NF over the more powerful BCNF is that a schema can always be decomposed into 3NF while preserving dependencies [10]. This is not the case for BCNF [5]. When decomposing a schema into BCNF (or at least 3NF), we are looking for a decomposition D = {R1, . . . , Rn} such that each Ri is in BCNF (3NF) w.r.t. to the projected constraints Σ∗[Ri]. The following decomposition algorithm, originally suggested by Codd in [12] finds a lossless BCNF decomposition: 6 Algorithm “BCNF Decomposition” INPUT: schema R with FDs Σ OUTPUT: lossless BCNF decomposition D D := {R} while D contains schema Ri not in BCNF do pick non-trivial X → A on Ri with Σ ² X → A and Σ 2 X → Ri D := D \ {Ri} ∪ {Ri \ A,XA} end We call Σ LHS-minimal or LHS-reduced if the left hand sides (LHSs) of FDs X → Y ∈ Σ cannot be reduced further, i.e., for every X ′ ( X we have Σ 2 X ′ → Y . It was shown by Biskup, Dayal and Bernstein in [10] that the following algorithm produces a lossless 3NF decomposition. Algorithm “3NF Synthesis” INPUT: schema R with LHS-minimal FDs Σ OUTPUT: dependency preserving 3NF decomposition D D := ∅ for all X → Y ∈ Σ do D := D ∪ {XY } end if D contains no key of R then add a minimal key of R to D A large number of other normal forms have been proposed, mostly for different classes of dependencies, e.g. 4NF for FDs and MVDs, and 5NF, PJ/NF, KCNF for FDs and JDs [33, 44]. However, most of them are not directly relevant to the work presented here. Those that are will be introduced where we need them. 1.3 Contributions and Outline In this work, we will address the problem of finding a “good” database schema through decomposition. Here we face two different issues. The first issue is what decompositions should be called ”good”, i.e., what normal form should be used. The second issue is how to find a decomposition into the desired form. We will first tackle the problem of how to find decompositions into the existing nor- mal form BCNF. For 3NF this has already been solved in a satisfactory manner, as a dependency preserving 3NF decomposition can be found in polynomial time [10]. Find- ing dependency preserving BCNF decompositions, however, is a more challenging issue. The problem of deciding whether a dependency preserving BCNF decomposition exist is known to be co-NP-hard [5, 42]. The algorithm proposed in [5] for finding BCNF de- compositions requires exponential time, but still does not always guarantee preservation of dependencies whenever possible. The approach described in [42] improves on that by finding a BCNF decomposition in polynomial time, but it does not generate dependency 7 preserving decompositions either. Only the algorithm suggested in [38] always finds a de- pendency preserving BCNF decomposition if one exists, but uses a brute force approach which always requires exponential time. We will address this problem in chapter 2 by developing an efficient algorithm called linear resolution for computing the atomic closure, which consists of all functional de- pendencies implied by a given set Σ of functional dependencies, which are minimal in some sense. While the size of the atomic closure can be exponential in the size of Σ, it often turns out to be rather small, and our linear resolution algorithm computes it in output-linear time. Furthermore, we identify polynomial subclasses based on the form of Σ, and describe an efficient method for updating the atomic closure when functional dependencies get added to Σ. Other applications for linear resolution, such as computing all minimal keys or computing covers on subschemas, are considered as well. In section 2.2 the atomic closure is then used to find a suitable cover to be used with the synthesis algorithm, resulting in a BCNF decomposition if one exists. For this we improve the algorithm by Osborn [38]. In addition, we show that for finding faithful BCNF decompositions we only require a subset of the atomic closure. This subset consists of all functional dependencies which participate in cycles (e.g. A → B,B → A), and we show that our linear resolution approach can easily be adapted to compute only this subset. In section 2.3 we then extend the core of our work from sections 2.1 and 2.2 to a complex-valued data model described in [29], which is based on record, list, set and multiset constructor. It turns out that our approach works in this complex-valued model as well, but special care has to be taken to combat an explosion in complexity, which arises due to an exponential number of subattributes. Another complication arises from the fact that a vital theorem for characterizing lossless and dependency-preserving decompositions in the relational model does not extend to the complex-valued model directly. We solve this by placing some suitable restrictions on the set Σ, and show that the theorem holds under these restrictions. For the work of chapter 2, the choice which set Σ of functional dependencies to preserve is central. Instead of Σ we could use a cover Σ′ of functional dependencies equivalent to Σ instead, and the choice of the cover determines the decomposition. This reduces the problem of finding the “right” decomposition to finding the “right” cover. As a result of working with covers instead of decompositions directly, we can ensure the preservation of dependencies more easily. It turns out that for many problems, including that of finding faithful BCNF decompositions, the “right” cover is in canonical form. Roughly speaking, canonical covers are minimal sets of minimal FDs - a precise definition is given in section 2.1. In chapter 3 we approach the problem of finding the “right” cover (be it for BCNF decomposition or other purposes) by developing an algorithm for computing the set of all canonical covers. However, the number of canonical covers can be huge, so computing or representing them directly becomes quickly infeasible. To counter this, we partition canonical covers into partial canonical covers, which can reduce their number dramatically, and at the same time makes them easier to compute as well. Since the set of all canonical covers forms a hypergraph, with functional dependencies as vertices and covers as edges, we develop a general approach for hypergraph decomposition in section 3.1, based on the notion of autonomous set. This is then used in section 3.2, where we characterize some (non-minimal) autonomous 8 sets for the hypergraph of all canonical covers. Furthermore, we make implications be- tween functional dependencies explicit in the form of implication dependencies, which are functional dependencies over other functional dependencies. This allows us to compute (partial) canonical covers as minimal keys w.r.t. a set of implication dependencies. We derive some simple bounds on the size and number of non-redundant covers (which includes canonical covers) in section 3.3, and develop a polynomial-time method for finding finer (but still non-minimal) autonomous sets in section 3.4. As we only require so-called essential functional dependencies which actually appear in some canonical cover, we wish to avoid inessential ones when computing the atomic closure. In section 3.5 we reduce this to the problem of identifying essential functional dependencies. It turns out that this problem, as well as the problem of identifying autonomous sets are NP-complete and co-NP-hard, respectively. Consequently we provide some necessary, but not sufficient, criteria for testing essentiality, which can be checked in polynomial time. Finally, we address the question of what we should call a “good” decomposition in chapter 4. We propose a new normal form (more precisely, a number of similar normal forms) called Domination Normal Form in section 4.1, motivated by semantic properties. Roughly speaking, we say that a decomposition is in Domination Normal Form if it is “minimal” among a set of “suitable” decompositions. The main advantages of this approach, compared to existing normal forms, are the following: • Domination Normal Form has a clear semantic motivation, optimizing the whole decomposition rather than only considering individual schemas. • Requirements for decompositions, such as preservation of dependencies, can easily be integrated into Domination Normal Form by restricting the set of “suitable” decompositions to those which meet the requirements. • A decomposition into Domination Normal Form always exists, provided the set of all “suitable” decompositions is not empty. For this we introduce two semantic notions of minimality orders, based on the size (storage space) and attribute count (how often attribute values appear) of instances. We also present a syntactical characterization in the form of a third ordering, and prove that all three orderings are equivalent in section 4.2. These new normal forms are then compared to other normal forms in section 4.3, and it turns out that they are identical to BCNF and other existing normal forms for a single relation schema, but extend them to multiple schemas in a different, and perhaps (as we will argue) more suitable way. Using the results from chapters 2 and 3, we describe algorithms for computing decompositions into Domination Normal Form efficiently in sections 4.4 and 4.5. We conclude in chapter 5 where we summarize our results briefly and discuss some problems which remain open. Experimental results where we implemented some of our algorithms are presented in an appendix. Some parts of this thesis have already been published. Results from sections 2.1 and 2.2 were presented in [24], while parts of chapter 4 are described in [25]. 9 Chapter 2 Linear Resolution and Faithful BCNF Decomposition It is well known that lossless and faithful (i.e., dependency preserving) decompositions of relational database schemas into Boyce-Codd Normal Form (BCNF) do not always exist, depending on the set of functional dependencies given, and that the corresponding decision problem is NP-hard [5, 42]. As the next lemma shows, finding a lossless and faithful BCNF decomposition is easy once we have found a faithful BCNF decomposition. This allows us to concentrate on the latter problem. Lemma 2.1. [10] A faithful decomposition of R is lossless iff it contains a subschema which forms a key of R. Every minimal key is in BCNF, as the projection of Σ∗ onto it contains only trivial FDs. Thus we can easily make a faithful BCNF decomposition D lossless by adding a minimal key as additional subschema if D does not contain a key of R, such that the new decomposition is still in BCNF (and obviously faithful). To see that there does not always exist a faithful BCNF decomposition consider e.g. the schema R = ABC with FDs Σ = {AB → C,C → B}. The well-known decomposition algorithm “BCNF Decomposition” from section 1.2 pro- duces a lossless BCNF decomposition, which however need not be faithful, even if a faithful decomposition exists (example 2.1). Example 2.1. Consider the schema CLRT containing the attributes C = Course, L = Lecturer, R = Room, T = Time with the functional dependencies Σ = {C → L,CT → R,LT → C,RT → C} The only FD in Σ for which the left hand side is not a key is C → L, so algorithm “BCNF Decomposition” produces the decomposition { (CL, {C → L}), (CRT, {CT → R,RT → C}) } 10 The missing FD LT → C is not implied by {C → L} ∪ {CT → R,RT → C} thus the decomposition is not faithful. On the other hand, the popular synthesis algorithm “3NF-Synthesis” produces a faith- ful decomposition, but the resulting relations Ri need not be in BCNF. And again, there are cases where a faithful BCNF decomposition exists, but the synthesis algorithm does not find one: Example 2.2. Consider again the schema CLRT with the FDs Σ = {C → L,CT → R,LT → C,RT → C} If we synthesize a decomposition by projecting on the attributes involved in each FD in Σ and eliminate non-maximal sets, we get the decomposition { (CLT, {C → L,LT → C}), (CRT, {CT → R,RT → C}) } While this decomposition is clearly faithful, the subschema (CLT, {C → L,LT → C}) is not in BCNF as the left hand side C of C → L is not a key for CLT . Since the schema CLT is not in BCNF, the information who is lecturer of a course is stored multiple times (once for each lecture time). Thus, in order to change the lecturer of a course, multiple tuples need to be updated. These problems are avoided by the BCNF decomposition CL,CRT from example 2.2, but there we lost LT → C which prevented us from creating tables where a lecturer is supposed to give different courses at the same time. The only algorithm to guarantee both faithfulness and BCNF (if possible) proposed so far by Osborn in [38] is a brute-force approach which always requires exponential time. To be useful in practice, e.g. in automated design tools, we require more efficient means. However, given that the problem we are trying to solve is NP-hard, we will not be able to find a polynomial time algorithm (unless P=NP). A more modest, but realistic goal would be an algorithm which requires exponential time in some cases, but performs much better in many of the cases we are interested in. In this chapter we develop such an algorithm (illustrated in Example 2.11 on page 28), which always finds a faithful BCNF decomposition if one exists, and computes a faithful decomposition into 3NF otherwise. The advantage over the approach in [38] is that in most cases our algorithm is much faster. Some experimental results can be found in the appendix. 2.1 Linear Resolution The central idea of our approach is to compute all “minimal” FDs in Σ∗. While the number of such FDs can grow exponentially in the number of attributes and FDs in Σ, it often turns out to be reasonably small. In this section we develop a fast algorithm for 11 computing the set of all “minimal” FDs, more precisely an algorithm who’s runtime is linear in the size of the output and polynomial in the input. In the following R denotes a relation schema with FD set Σ. We will use letters at the end of the alphabet (. . . , X, Y, Z) to denote subsets of R, while letters at the beginning (A,B,C, . . .) denote single attributes. Definition 2.2. We use the following terminology: (i) A FD X → A is called singular . (ii) A non-trivial singular FD X → A ∈ Σ∗ is called atomic, if and only if for all Y ( X we have Y → A /∈ Σ∗. (iii) The atomic closure Σ∗a of Σ is the set of all atomic FDs in Σ∗ (iv) A set G ⊆ Σ∗a of atomic FDs is called canonical cover if it is a cover of Σ which is minimal w.r.t. set inclusion, i.e., for all H ( G the set H is not a cover of Σ. We will focus on computing the atomic closure Σ∗a, since it is needed for finding a dependency preserving BCNF decomposition, based on the approach of Osborn [38]. A detailed description of this will be provided in section 2.2. Note that atomic FDs have also been called “elemental” FDs [46]. Osborn [38] uses the atomic closure as well, but without defining any name for it. Example 2.3. Consider the set of FDs Σ = {A → B,AB → C,BC → AD} (i) The FDs A → B and AB → C are singular but BC → AD is not. (ii) While A → B is atomic, AB → C is not, since A → C ∈ Σ∗. (iii) For the atomic closure of Σ we get Σ∗a = {A → B,A → C,BC → A,BC → D,A → D} (iv) The canonical covers of Σ are {A → B,A → C,BC → A,BC → D}, {A → B,A → C,BC → A,A → D} It is well-known that every set of FDs Σ has a canonical cover, and one can easily compute one by splitting non-singular FDs into singular ones, minimizing their left hand sides (LHS) and removing redundant FDs [33]. Algorithm for computing a canonical cover INPUT: set of FDs Σ OUTPUT: canonical cover Σ′ of Σ 12 Σ′ := ∅ for all X → Y ∈ Σ do Σ′ := Σ′ ∪ {LHS −minimize(X → A,Σ) | A ∈ Y } end for all X → A ∈ Σ′ do if Σ′ \ {X → A} ² X → A then Σ′ := Σ′ \ {X → A} end proc LHS-minimize(X → Y,Σ) for all A ∈ X do if Σ ² (X \ A) → Y then X := X \ A end The existence of a canonical cover immediately implies that Σ∗a is a cover for Σ: Lemma 2.3. For every set Σ of FDs, the atomic closure Σ∗a of Σ is a cover for Σ. Proof. Σ has a canonical cover Σ′, which must be a subset of Σ∗a, thus Σ∗a ² Σ′ ² Σ. 2.1.1 The Basic Algorithm We wish to compute Σ∗a. As we have seen in section 1.1, implication of FDs can be decided in linear time. However, creating all possible singular FDs on R and testing whether they are implied by Σ and atomic is inefficient, as the number of singular FDs on R grows exponentially in the size of R. In order to compute Σ∗a efficiently, we need some method for deriving new FDs. A first candidate for this could be the Armstrong axioms from section 1.1: X → Y Y ⊆ X, X → Y XW → YW , X → Y Y → Z X → Z However, the first two rules rule can only derive trivial or non-minimal FDs, which seems counter-productive to our goal of obtaining minimal, non-trivial FDs. When deriving minimal FDs, they are only used to bring FDs into a form which allows us to apply the third (transitivity) rule. Thus, instead of using all three Armstrong axioms, we will discard the first two rules, but make the transitivity rule more flexible. In our approach we use a single derivation rule, namely the resolution rule X → A AY → B XY → B which has already been used successfully by Gottlob in [18], and is easily checked to be sound. This simplifies the derivation of FDs significantly. Consider e.g. the derivation tree from example 1.1, which derives A → D from the FDs Σ = {A → B,AB → C,BC → D}: A → B A → AB AB → C AB → ABC ABC → BC BC → D ABC → D AB → D A → D 13 Using the resolution rule, the derivation becomes much easier: A → B AB → C BC → DAB → D A → D Note that the derivation tree (when read from bottom to top) above is right-linear, i.e., the left branch always ends in a leaf (a FD in Σ, rather than an arbitrary sub-tree, in this case A → B and AB → C). We refer to such derivation trees as linear resolution trees. Definition 2.4. For any application of the derivation rule X → A AY → B XY → B we call    X → A AY → B XY → B    the    substituting base derived    FD. The following theorem, in which we show that such derivations are possible in general, is central as it allows us to create a fast algorithm for computing Σ∗a. Theorem 2.5. Let Σ be a set of singular FDs. Then every atomic FD in Σ∗a can be derived from Σ using the resolution rule X → A AY → B XY → B (2.1) This result still holds if we restrict ourselves to derivations where the substituting FDs X → A lie in Σ, i.e., for every atomic FD Xi → Ai ∈ Σ∗a there exists a linear resolution tree deriving Xi → Ai from Σ. Proof. Let X → A ∈ Σ∗a, and thus A ∈ X∗\X. We use the known fact that the “closure” algorithm works. For any run of the “closure” algorithm let Xi → Ai ∈ Σ, i = 1 . . . k be the FDs X ′ → Y used to compute X∗ up to the point where A = Ak is added, in that order. We may assume that the set of FDs used is minimal, i.e., for every Xi → Ai with i < k the attribute Ai lies in some Xj (with i < j <= k). We start our derivation with Xk → A(= Ak), and then successively use Xi → Ai for i = k − 1, . . . , 1 in the resolution rule (2.1): Xi → Ai Ui+1 → A Ui → A In this, the derived left hand sides Ui have the form Uk = Xk, Ui B with X ′ → B′ ∈ Σ∗, we can discard X ′ → B. This is because for any FD X ′′ → B derivable from X ′ → B we can derive X ′′ → B′ from X ′ → B′, so X ′′ → B is not maximal atomic. We test condition (ii) by removing attributes from X and checking if the new reduced LHS still determines B. However, unlike in the relational model where the removal of a single attribute was the smallest reduction possible, we now need to replace the removed attribute A by all the unitary true sub-attributes of A. The reduction of X now takes the form Xreduced := (X ∪ A↓) \ A which requires us to add the completion A↓ of A to X. In this we can restrict ourselves to the completion w.r.t. rb(lhs(Σ)), since all maximal atomic FDs are rb(lhs(Σ)) → rb(rhs(Σ))-representable. If the reduced LHS still determines B, we compute its basis and replace X by it, then iterate the process. This gives us the following algorithm: 42 Algorithm “linear NA-resolution” INPUT: set of FDs Σ OUTPUT: maximal atomic closure Σ∗max compute a maximal atomic cover Σ′ of Σ Σ∗max := Σ′ for all Y → B ∈ Σ∗max do for all X → A ∈ Σ′ with A ∈ Y,B /∈ X do // derive ⌈X(Y \ A↓)⌉→ B by rule (NA-Resolution) X ′ := ⌈X(Y \ A↓)⌉ if X ′ → B is maximal then Σ∗max := Σ∗max ∪ {lhs-minimize(X ′ → B,Σ)} end end function lhs-minimize(X → B,Σ) U := X for all A ∈ U do U ′ := ⌈(U ∪ A↓rb(lhs(Σ))) \ A⌉ if U ′ → B ∈ Σ∗ then U := U ′ end return U → B We have just described an algorithm for constructing the set Σ∗max of all maximal atomic FDs in Σ∗a. However, as the following example shows, we cannot always find an uncritical cover of maximal atomic FDs whenever an uncritical cover exists. Example 2.18. Consider the nested attribute A{BC} with the set of FDs Σ = {A → {BC}, {B} → {BC}} Σ already contains all maximal atomic FDs in Σ∗a, but A → {BC} is critical. Σ does however have the uncritical atomic cover: G = {A → {B}, {B} → {BC}} This means that we cannot restrict ourselves to maximal atomic FDs only when trying to find an uncritical cover. It is simple to derive Σ∗a from Σ∗max: for every X → A ∈ Σ∗max and every unitary A′ < A check whether X → A′ is atomic (i.e., whether X is minimal) and if so add it to Σ∗a. The problem with this approach however, is that the number of unitary subattributes A′ of A can be exponential in the size of A: Example 2.19. Consider the nested attribute A{B1 . . . Bn} with Σ = Σ∗max = {A → {B1 . . . Bn}} Then {B1 . . . Bn} has 2n unitary subattributes, and Σ∗a contains 2n atomic FDs. 43 Again we solve this problem by restricting ourselves to subattributes which appear in the representation basis of Σ. This is sufficient, since we will show later that whenever Σ has an uncritical cover, it has an uncritical canonical cover representable in rb(Σ). Lemma 2.61. Let Σ be a set of FDs and A a unitary attribute with A /∈ rb(lhs(Σ)). Then for every set X of unitary attributes we have XA∗ = (XA↓ \ A)∗ ∪ A Proof. Let Y → Z ∈ Σ with Y ≤ (XA↓ \ A)∗ ∪ A Since A does not appear in the LHS of any FD in Σ, we have Y ≤ (XA↓ \ A)∗ As (XA↓ \ A)∗ is closed under Σ we also get Z ≤ (XA↓ \ A)∗ Since this holds for all Y → Z ∈ Σ, the set (XA↓ \ A)∗∪A is closed under Σ, which gives us XA∗ = ((XA↓ \ A)∗ ∪ A )∗ = (XA↓ \ A)∗ ∪ A Lemma 2.62. Let Σ be a set of FDs and X → A ∈ Σ∗a be atomic. Then X is rb(lhs(Σ))- representable. Proof. Let B ∈ X be any unitary attribute in X. If B does not lie in rb(lhs(Σ)) then X∗ = ((XB↓) \B)∗ ∪B by Lemma 2.61. Since X → A is atomic and thus non-trivial, A  B, we have A ∈ ((XB↓) \B)∗ which contradicts the minimality of X. Thus all B ∈ X are rb(lhs(Σ))-representable, and therefore X. It follows in particular that for every canonical cover G of Σ the LHS of G is rb(lhs(Σ))- representable. One might hope that the RHS of G is rb(rhs(Σ))-representable, or at least rb(Σ)-representable. This, however does not hold for all canonical covers. Example 2.20. Consider the set of FDs Σ = {A → {BCD}, {B} → {BCD}} and its canonical cover G = {A → {BC}, {B} → {BCD}} The unitary attribute {BC} ∈ rb(rhs(G)) does not lie in rb(Σ). 44 The problem in the example above comes from the fact that the RHS of A → {BC} was larger than it had to be. We therefore will restrict ourselves to sets of FDs with minimal RHSs. Definition 2.63. We call a set of FDs Σ RHS-reduced if for every FD X → A ∈ Σ the set Σ\{X → A} ∪ {X → A↓ \ A} is not a cover of Σ. Theorem 2.64. Let Σ be a set of FDs and G a RHS-reduced canonical cover of Σ. Then G is rb(lhs(Σ)) → rb(Σ)-representable. Proof. By Lemma 2.62 lhs(G) is rb(lhs(Σ))-representable, thus we only need to show that rhs(G) is rb(Σ)-representable. Assume the contrary and let X → A ∈ G with A /∈ rb(Σ). Let further H := G \ {X → A} and consider the closures X∗G of X w.r.t. G and XA∗H of XA w.r.t H. Clearly X∗G = XA∗G = XA∗H But since A does not lie in rb(lhs(H)) ⊆ rb(Σ) we have by Lemma 2.61: XA∗H = (XA↓ \ A)∗H ∪ A Again using the fact that A /∈ rb(Σ) we find that X → A is not maximal atomic by Corollary 2.60, and thus A < A′ for some unitary A′ with X → A′ ∈ Σ∗. From A′ ∈ X∗G = (XA↓ \ A)∗H ∪ A we conclude that A′ ∈ (XA↓ \ A)∗H and thus X∗G = (XA↓ \ A)∗H But this means that H ∪ {X → A↓ \ A} is a cover of G, which contradicts G being RHS-reduced. Note that in particular, all RHS-reduced canonical covers of Σ have the same repre- sentation basis, as do their LHSs. This is not true for their RHSs though, as the following example shows. Example 2.21. The sets of FDs on the attribute A{BC} Σ = {A → {B}, {B} → {BC}, {C} → {BC}} G = {A → {C}, {B} → {BC}, {C} → {BC}} are RHS-reduced canonical covers of each other, but the RBs of their RHSs differ. 45 2.3.4 Faithful BCNF-Decomposition The definition for Boyce-Codd Normal Form is the same for nested attributes as in the relational model: Definition 2.65. A subattribute S ≤ N (or extended subattribute, see Definition 2.66) is in Boyce-Codd Normal Form w.r.t. a set Σ of FDs on S if every non-trivial FDX → Y ∈ Σ is a key dependency, i.e., Σ ² X → S. Having computed the set of all rb(Σ)-representable atomic FDs of Σ, we now use it to decompose the nested attribute N into normal form while preserving dependencies. When decomposing a nested attribute, we must ask ourselves what to decompose into. A first candidate might be subattributes, but as the following example shows, this can be an undesirable restriction. Example 2.22. Let N = {AB}CD be a nested attribute with FDs Σ = {{A}{B} → C}. Then N has no faithful, lossless BCNF decomposition into subattributes: the smallest subattribute on which {A}{B} → C is defined is {AB}C, which is not in BCNF. It is however possible to decompose N into {A}{B}C and {AB}D which is lossless, faithful and in BCNF. It appears that what we need are not subattributes, but rather tuples of subattributes such as ({A}, {B}, C). Since the order of the attributes in a tuple does not matter for design purposes, we may as well talk about sets of subattributes instead. Definition 2.66. We call a set S of unitary subattributes of N an extended subattribute of N if no attribute in S is a subattribute of any other attribute in S. We say that an extended subattribute S ′ of N is an extended subattribute of S if every attribute in S ′ is a subattribute of some attribute in S. Join and meet between extended subattributes are defined according to this extended subattribute order. The requirement that subattributes in S be unitary identifies intuitively equivalent sets such as {{A}, {B}, C}, {{A}C, {B}}, {{A}, {B}C} and {{A}C, {B}C}. Prohibiting subattributes of another attribute in a tuple prevents obvious redundancy. From now on, when talking about decomposition of nested attributes, we will mean decomposition into extended subattributes. Lemma 2.67. Every faithful decomposition of R containing an extended subattribute which forms a key of R is lossless. Proof. As in the relational case [10]. In the relational case we can use this lemma to make a decomposition lossless “for free”, since every minimal key is automatically in BCNF. This does not hold when we move to nested attributes however. Example 2.23. Let N = {AB}CD with FDs Σ = {{A}C → {B}, {B}C → D, {B}D → C} Then N has the minimal keys {AB}C and {AB}D. The first key violates BCNF, the second does not. 46 Definition 2.68. A set S of attributes is called critical (w.r.t. Σ) if dSe is not in BCNF (w.r.t. Σ∗[dSe]). A FD X → Y ∈ Σ∗ is called critical if XY is critical. A cover G of Σ is called critical if it contains a critical FD. An attribute set, FD or cover that is not critical is called uncritical. In the relational case it has been shown (e.g. [27]) that every lossless decomposi- tion contains a key. However, this result does not translate directly to complex-valued databases. For the moment we shall just state the result, and postpone the proof and discussion of it to the next subsection. The term LHS-restricted will be defined there as well. Lemma 2.69. Let N be a nested attribute with FDs Σ. If Σ is not LHS-restricted, then every lossless decomposition of N contains an extended subattribute which is a key of N . Theorem 2.70. Let Σ not be LHS-restricted. Then the following are equivalent: (i) A schema (R,Σ) has a faithful, lossless decomposition into BCNF (ii) Σ has an uncritical cover and an uncritical key (iii) Σ has an uncritical RHS-reduced canonical cover and an uncritical minimal key Proof. (ii) → (i) : If Σ has an uncritical cover G, then we can synthesize a faithful BCNF decomposition D by creating a schema dXY e for every X → Y ∈ G, and if necessary add an uncritical key to D to make the decomposition lossless (Lemma 2.67). (i) → (ii) Now let D = {R1, . . . , Rn} be a faithful BCNF decomposition of (R,Σ), and Σi be the FD set associated with Ri. Since D is faithful, the union G := ⋃Σi of all FDs on the extended subattributes forms a cover of Σ. For each FD X → Y ∈ G the extended subattribute dXY e is an extended subattribute of some Ri. Since Ri is in BCNF, so are all of its extended subattributes (in particular dXY e), which means X → Y is uncritical. This makes G an uncritical cover of Σ as required. By Lemma 2.69 D contains an uncritical key. (ii) → (iii) We can construct an uncritical RHS-reduced canonical cover G′ of G, which is then a cover of Σ as well, as follows: For each FD X → Y ∈ G and each Ai ∈ Y there exists a minimal Ui ≤ X such that Ui → Ai ∈ Σ∗. Each such FD Ui → Ai is atomic, and since UiAi ≤ XY , it is uncritical. Furthermore, G′X→Y := {Ui → Ai | Ai ∈ Y } implies X → Y , therefore G′ := ⋃ X→Y ∈G G′X→Y is a cover of G, and thus an uncritical atomic cover of Σ. By removing redundant FDs from G′ we obtain an uncritical canonical cover. Reducing the RHS of the FDs remaining in G′ keeps them uncritical and preserves the cover property. Finally, every uncritical key can be reduced to a minimal key which is still uncritical. Thus, in order to find a faithful lossless BCNF decomposition, we now need to do two things: find an uncritical cover, and an uncritical key. To find an uncritical key, we can use linear NA-resolution to compute all minimal keys of N . Both of these computations are necessary. From the relational case it is already clear that the existence of an uncritical key does not imply the existence of an uncritical cover. The following example shows that the existence of an uncritical cover does not imply that every or any minimal key is uncritical, and that we can have critical and uncritical canonical covers as well as critical and uncritical minimal keys. 47 Example 2.24. Let N = {AB}CDE with FDs Σ = {{A}C → {B}, {B}C → D, {B}D → C, {B}C → E,E → C} As in example 2.23, N has the minimal keys {AB}C and {AB}D, where the first key violates BCNF, the second does not. While the FD {B}C → E is critical, it can be replaced by the uncritical FD {B}D → E, which give us the uncritical cover G = {{A}C → {B}, {B}C → D, {B}D → C, {B}D → E,E → C} If we modify Σ to Σ′ = Σ ∪ {{A}D → {B}} we still find an uncritical cover G′ = G ∪ {{A}D → B}, but now both minimal keys violate BCNF. 2.3.5 Lossless Decomposition We will prove Lemma 2.69 and demonstrate why the requirement of having Σ not LHS- restricted is necessary. For that, we first look into the corresponding proof in the relational case (adapting the correctness proof for the chase procedure [27]). Lemma 2.71. Let R be a relational schema and Σ a set of FDs on R. Then every lossless decomposition D of R contains a key of R. Proof. Let D not contain a key of R. We show that D is not lossless by constructing a sample relation r on R as follows. Let t be an arbitrary tuple on R. Then for every schema Ri ∈ D add a tuple ti to r which is identical to t on R∗i , and contains unique values for attributes outside R∗i . Then for every FD X → Y ∈ Σ, two tuples ti, tj are identical on X iff X ⊆ R∗i ∩R∗j . Since R∗i ∩R∗j is closed under Σ, Y also lies in it, so X → Y holds on r. Since D contains no key, the tuple t does not lie in r. It does however lie in r[R1] ./ . . . ./ r[Rn] which makes D not lossless. We first note that the proof requires a sufficient number of distinct attribute values for the construction. In this, we implicitly assume that attribute domains are infinite. This assumption is indeed necessary, since the lemma does not hold if we work with finite domains. For the remainder of this subsection we will allow domains (of base attributes) to be finite, although we still require them to contain at least two elements. Example 2.25. Let R = ABCD with FDs Σ = {A → BC,BC → A} and decomposition D = {ABC,BD,CD}. Let furthermore the domain of A contain only two values, say dom(A) = {0, 1}. Then D contains no key of R, but is lossless. Proof (of example 2.25). Assume D was not lossless, i.e., for some relation r on R there exists a tuple t with t ∈ (./ r[D]) \ r Then for every Ri ∈ D there must be a tuple ti ∈ r with ti[Ri] = t[Ri]. We may assume w.l.o.g. that t = (0, 0, 0, 0). This gives us the following subset of r: r′ = A B C D 0 0 0 d a1 0 c 0 a2 b 0 0 48 for some values a1, a2, b, c, d with a1, a2 ∈ dom(A) = {0, 1}. If a1 = 0 or a2 = 0 then the FD A → BC implies c = 0 or b = 0, respectively, and thus t ∈ r which contradicts the assumption. If a1 = a2 = 1 then A → BC again implies b = 0 and c = 0, which gives us the following subset of r: r′ = A B C D 0 0 0 d 1 0 0 0 But then the FD BC → A would be violated on r′ and thus on r. Note that this example can easily be generalized to work for arbitrary finite domains. For dom(A) = {1, . . . , k} chose R = AB1 . . . BkC Σ = {A → B1 . . . Bk, B1 . . . Bk → A} D = {AB1 . . . Bk, R \ AB1, . . . , R \ ABk} Obviously we need some restrictions on the domains for Lemma 2.71 to work. However, requiring all domains of attributes occurring in R to be infinite is too strong. While finite domains could perhaps be ignored in relational databases, they arise naturally in complex- valued databases for subattributes such as {λ}. In the example above, the attribute A occurred in the LHS of a FD in Σ. It turns out that requiring that these attributes have infinite domains is sufficient. Lemma 2.72. Let R be a relational schema and Σ a set of FDs on R. If all attributes which occur in the LHS of a FD in Σ have infinite domains, then every lossless decompo- sition D of R contains a key of R. Proof. As for Lemma 2.71, except that for attributes not occurring in the LHS of a FD in Σ we do not require the ti to have unique values, but only values different to t. We now extend these results to complex-valued databases. Definition 2.73. Let N be a nested attribute and S a unitary subattribute of N . We call S restricted if dom(S) is finite. A FD X → Y is LHS-restricted if some element in X is restricted, and a set Σ of FDs over N is LHS-restricted if any of its FDs are. To prove Lemma 2.69, we need to be able to construct tuples ti which are identical to some tuple t on an extended subattribute (corresponding to R∗i in the relational case), but unique (or at least different when domains are finite) for subattributes “outside” these extended subattributes. For this we use and adapt some lemmas from [29]. Lemma 2.74. [29, Lemma 5.10] Let N = L{P} ∈ NA, and ∅ 6= X ⊆ Sub(N) an ideal with respect to ≤. Then there are tN , t′N ∈ dom(N) with piNW (tN) = piNW (t′N) if and only if W ∈ X . Lemma 2.75. [29, Lemma 5.13] Let N = L〈P 〉 ∈ NA, and ∅ 6= X ⊆ Sub(N) an ideal with respect to ≤. Then there are tN , t′N ∈ dom(N) with piNW (tN) = piNW (t′N) if and only if W ∈ X . 49 Lemma 2.76. [29, Lemma 5.14] Let N ∈ NA, and ∅ 6= X ⊆ Sub(N) an ideal with respect to ≤ with the property that for reconcilable X,Y ∈ X also X unionsqY ∈ X holds. Then there are tN , t′N ∈ dom(N) with piNW (tN) = piNW (t′N) if and only if W ∈ X . Note that in the last lemma we could also consider only unitary subattributes of N , and thus no longer need to worry about reconcilable subattributes. What will take that view when formulating our next lemma. Lemma 2.77. Let N ∈ NA, and ∅ 6= Xi ⊆ N↓ for i = 1, . . . , k be ideals with respect to ≤. Then there are t, t1, . . . , tn ∈ dom(N) with the following properties (for all W ∈ N↓): (i) piNW (ti) = piNW (t) if W ∈ Xi (ii) piNW (ti) 6= piNW (t) if W /∈ Xi and W unrestricted or a unit of N (iii) piNW (ti) 6= piNW (tj) if W /∈ Xi ∩ Xj and W unrestricted Proof. We distinguish several cases, depending on the form of N . For N = A chose t = a and ti = { a if Xi = {λ,A} ai 6= a if Xi = {λ} and the ai pairwise different if dom(A) is infinite. This obviously meets conditions (i)-(iii). For N = (M1, . . . ,Mk) we construct t, t1, . . . , tn inductively on the structure of N . Let tMj , tMji denote the tuples constructed on the nested attribute Mj w.r.t. piNMj(Xi)1. We chose t = (tM1 , . . . , tMk) and ti = (tM1i , . . . , tMki ). It is easy to see that the tuples t, ti meet conditions (i)-(iii) if the tuples tMj , tMji do. For N = [M ], we have Xi = {[X] | X ∈ Yi} ∪ {λ} for some ideal Yi ⊆ M↓. If λ ∈ Yi then by Lemma 2.76 there exist tuples tM,i, t′M,i ∈ dom(M) which are identical exactly on Yi. Otherwise let tM,i be arbitrary. We then construct t, t1, . . . , tn as follows: t = [tM,1, . . . , tM,n] ti = { [tM,1, . . . , t′M,i, . . . , tM,n] if λ ∈ Yi arbitrary of length n+ i otherwise, i.e., if Xi = {λ} With this definition, tuples ti with Xi = {λ} are of unique length, and thus differ from tuples t, tj with j 6= i on all subattributes except λ. For other tuples ti, tj we have piNW (ti) = piNW (t) iff piNW (t′M,i) = piNW (tM,i) which shows conditions (i) and (ii), as well as piNW (ti) = piNW (tj) iff piNW (t′M,i) = piNW (tM,i) ∧ piNW (t′M,j) = piNW (tM,j) from which (iii) follows. 1Strictly speaking we would have to distinguish between Mj and the subattribute (λ, . . . ,Mj , . . . , λ) ≤ N . We will neglect this subtlety for ease of readability. 50 For N = 〈M〉 let tN,i, t′N,i ∈ dom(N) be tuple pairs with piNW (tN,i) = piNW (t′N,i) if and only if W ∈ Xi, which exist by Lemma 2.75. We then define t = c · tN,1 ∪ . . . ∪ cn · tN,n ti = c · tN,1 ∪ . . . ∪ ci · t′N,i . . . ∪ cn · tN,n with c ∈ N sufficiently large, i.e., larger than the multiplicity of any element in the multi- sets tN,1, . . . , tN,n, t′N,1, . . . , t′N,n. Here c · tN,1 denotes the multiset obtained by multiplying the multiplicity of all elements of tN,1 with c. This allows us to determin the ”origin” of an element e of t or ti from its multiplicity multe(t/ti): For every tuple t, t1, . . . , tn we have multe(t/ti) = multe (c ·M1 ∪ . . . ∪ cn ·Mn) = c ·multe(M1) + . . .+ cn ·multe(Mn) = c · a1 + . . .+ cn · an (2.5) for some multisets M1, . . . ,Mn and some values a1, . . . , an ∈ {0, . . . , c − 1}. Given a multiplicity multe(t/ti) there exists only one set of values a1, . . . , an ∈ {0, . . . , c− 1} such that (2.5) holds. Consequently, we can uniquely determine the multisets M1, . . . ,Mn given t or some ti. Conditions (i)-(iii) can then be shown as in the case of lists. What remains is the case of sets. For N = {M} with N restricted, we use that by lemma 5.10 there exist two tuples tN 6= t′N which are identical on N↓ \ N . We then set t = tN and ti = { tN if Xi = N↓ t′N otherwise which clearly meet condition (i)-(iii). For N = {M} with N unrestricted, we adapt the proof of lemma 5.10 in [29]. For every index i = 1, . . . , n we define different identifying terms: • τ iλ(λ) = ok, • τ iA(λ) = a, τ iA(A) = ai with a, ai ∈ dom(A), a 6= ai and a1, . . . , an pairwise different if dom(A) is infinite • τ iL(N1,...,Nn)(L(M1, . . . ,Mn)) = (τ iN1(M1), . . . , τ iNn(Mn)), • τ iL{N}(L{M}) = {τ iN(M)} and τ iL{N}(λ) = ∅, • τ iL〈N〉(L〈M〉) = 〈τ iN(M), . . . , τ iN(M)〉 of cardinality i and τ iL〈N〉(λ) = 〈 〉, • τ iL[N ](L[M ]) = [τ iN(M), . . . , τ iN(M)] of length i and τ iL[N ](λ) = [ ]. Note that with this definition the identifying terms τ iN(M) of a subattribute M with infinite domain are pairwise different. We then create pairs of tuples tN,i, t′N,i ∈ dom(N) with piNW (tN,i) = piNW (t′N,i) if and only if W ∈ Xi as in [29, Lemma 5.10], but using the corresponding identifying terms τ iN(. . .). That is, we have X = {L{X} : X ∈ Y} ∪ {λ} for some Y ⊆ Sub(M), and define tN,i = {τ iM(X) : X ≤ M} t′N,i = {τ iM(X) : X ∈ Y} 51 The tuple pairs tN,i, t′N,i are used to construct the tuples t, t1, . . . , tk as follows: t = tN,1 ∪ . . . ∪ tN,n ti = tN,1 ∪ . . . ∪ tN,i−1 ∪ t′N,i ∪ tN,i+1 ∪ . . . ∪ tN,n Condition (i) clearly holds, since equivalence of tN,i and t′N,i on W implies equivalence of t and ti on W . Now let W = {V } meet the ’if’ condition of (ii). By construction we have τ iM(V ) ∈ piNW (tN,i) \ piNW (t′N,i) Furthermore, W must be unrestricted, since N is unrestricted and the only unit of N . As the identifying terms of unrestricted subattributes are all different, τ iM(V ) does not lie in any set piNW (tN,j) for j 6= i either. Thus piNW (ti) 6= piNW (t), which shows (ii). Condition (iii) is proven analogous to (ii). The following example illustrates the construction for multisets and sets. Example 2.26. [Multiset] Let N = 〈{{A}}〉 and X1 = {λ, 〈λ〉, 〈{λ}〉, 〈{{λ}}〉} X2 = {λ, 〈λ〉, 〈{λ}〉} X3 = {λ, 〈λ〉} X4 = {λ} Lemma 2.75 might give us the following tuples: tN,1 = 〈{{a1}}〉, t′N,1 = 〈{{a2}}〉 tN,2 = 〈{{a1}}〉, t′N,2 = 〈{∅}〉 tN,3 = 〈{∅}〉, t′N,3 = 〈∅〉 tN,4 = 〈〉, t′N,4 = 〈∅〉 Since all multisets above contain only a single element (or less), it suffices to choose c = 2. Using our construction we obtain t = 2 · tN,1 ∪ 4 · tN,2 ∪ 8 · tN,3 ∪ 16 · tN,4 = 2 · 〈{{a1}}〉 ∪ 4 · 〈{{a1}}〉 ∪ 8 · 〈{∅}〉 ∪ 16 · 〈〉 = 〈{{a1}}6, {∅}8〉 t1 = 〈{{a2}}2, {{a1}}4, {∅}8〉 t2 = 〈{{a1}}2, {∅}12〉 t3 = 〈{{a1}}6, ∅8〉 t4 = 〈{{a1}}6, {∅}8, ∅16〉 where 〈En〉 means that element E occurs n times in the multiset. [Set] Now let N = {AB} with dom(A), dom(B) infinite, and X1 = {λ, {λ}, {A}, {B}} X2 = {λ, {λ}, {A}} X3 = {λ, {λ}, {B}} 52 Using our construction for sets we get tN,1 = {(a, b), (a1, b), (a, b1), (a1, b1)}, t′N,1 = {(a, b), (a1, b), (a, b1)} tN,2 = {(a, b), (a2, b), (a, b2), (a2, b2)}, t′N,2 = {(a, b), (a2, b)} tN,3 = {(a, b), (a3, b), (a, b3), (a3, b3)}, t′N,3 = {(a, b), (a, b3)} and from this t = tN,1 ∪ tN,2 ∪ tN,3 = {(a, b), (a1, b), (a, b1), (a1, b1), (a2, b), (a, b2), (a2, b2), (a3, b), (a, b3), (a3, b3)} t1 = {(a, b), (a1, b), (a, b1), (a2, b), (a, b2), (a2, b2), (a3, b), (a, b3), (a3, b3)} t2 = {(a, b), (a1, b), (a, b1), (a1, b1), (a2, b), (a3, b), (a, b3), (a3, b3)} t3 = {(a, b), (a1, b), (a, b1), (a1, b1), (a2, b), (a, b2), (a2, b2), (a, b3) } In both cases (multiset and set) it is easy to check that the t, ti constructed meet the conditions (i)-(iii) of Lemma 2.77. Comparing the last lemma to Lemma 2.76, one may wonder why we require in condition (ii) that W is unrestricted or a unit, and whether this requirement can be omitted. The following example shows that it is really needed. Example 2.27. Let N = {AB} and X1 = X2 = {λ},X3 = {λ, {λ}, {B}}. Let further t, t1, t2 be tuples on N which meet the conditions of Lemma 2.77. If t = ∅ then by condition (i) it follows that t3 = ∅, which violates condition (ii) for W = {A}. So t 6= ∅. If t1 = t2 = ∅ then condition (iii) is violated for W = {A}. So t1 6= ∅ or t2 6= ∅, say t1 6= ∅. But then we have piN{λ}(t1) = {ok} = piN{λ}(t) which shows that the restriction on W in condition (ii) is necessary. We are now ready to prove Lemma 2.69. Lemma 2.78 (2.69). Let N be a nested attribute with FDs Σ. If Σ is not LHS-restricted, then every lossless decomposition of N contains an extended subattribute which forms a key of N . Proof. We will proceed as in the proof of Lemma 2.71. Let D = {N1, . . . , Nn} be a decomposition of N not containing a key schema, i.e., an extended subattribute which forms a key of N . For Ni ∈ D define Xi := (N∗i )↓. Then there exist tuples t, t1, . . . , tn on N which meet the conditions (i)-(iii) of Lemma 2.77. We define r = {t1, . . . , tn}, and start by showing that Σ holds on r. So let X → Y ∈ Σ and ti, tj ∈ r be two tuples with piNX (ti) = piNX (tj). Then X is unrestricted by assumption, so from condition (iii) it follows that X ⊆ Xi ∩Xj. Thus, by definition of Xi,Xj, we have Y ⊆ Xi ∩ Xj, which gives us piNY (ti) = piNY (tj) by condition (i). It follows that X → Y holds on r. Also by condition (i), the tuple t lies in ./ r[D]. It remains to be shown that t /∈ r. By assumption Ni is not a key of N , so Xi does not contain all units of N . Thus ti 6= t by condition (ii), which shows that D is not lossless. 53 2.3.6 Testing for BCNF We still require an efficient way to test whether an extended subattribute S is in BCNF. In the relational case we projected Σ∗a onto the subschema to be tested. However, as example 2.19 demonstrated, computing Σ∗a is often inefficient. Instead we will show that it is sufficient to project Σ∗max onto S, although we need to adjust our notion of projection. Definition 2.79. Let Σ be a set of FDs on the nested attribute N , and S ≤ N be an extended subattribute of N . The projection of Σ onto S is Σ[S] := {X → Y u S | X → Y ∈ Σ ∧X ≤ S} The meet Y u S on extended subattributes is induced by the extended subattribute order on them (Definition 2.66), and can be computed as Y u S = d{y u s | y ∈ Y ∧ s ∈ S}e Note that with this new notion of projection we do not lose any FDs, and all FDs gained through projection are still implied by Σ. In the relational case the two notions of projection are (almost) identical for singular FDs: If X → Y ∈ Σ is singular then either Y ⊆ S or Y ∩ S = ∅, and in the latter case X → Y ∩ S is trivial. This is not true for nested attributes: Example 2.28. Consider the set of singular FDs Σ = {A → {BC}, {BC} → A} on the nested attribute N = A{BC}. The projection of Σ onto S = A{B} is Σ[S] = {A → {B}} rather than the empty set. We can use this form of projection to construct a cover on an extended subattribute. Lemma 2.80. Let Σ be a set of FDs on N , and S ≤ N . Then Σ∗max[S] is a cover for Σ∗[S]. Proof. Clearly Σ∗[S] ² Σ∗max[S], so we only need to show implication in the opposite direction. For that we use that the maximal atomic FDs (Σ∗[S])∗max ⊆ Σ∗[S] form a cover of Σ∗[S], and show that each of them is implied by a FD in Σ∗max[S]. So let X → A ∈ Σ∗[S] be maximal atomic on S. Then X → A is atomic w.r.t. Σ, and there exists a maximal atomic FD X → A′ ∈ Σ∗max with A ≤ A′. Thus X → A′ u S ∈ Σ∗max[S] and X → A′ u S implies X → A. The last lemma allows us to construct a cover of Σ∗[S] efficiently, as Σ∗max can be computed using linear NA-resolution. This cover can then be used to test for BCNF as in the relational case. 54 Chapter 3 Canonical Covers When given a set Σ of functional dependencies for schema decomposition, instance val- idation or similar tasks, we may choose to use a cover Σ′ of functional dependencies equivalent to Σ instead. The choice of Σ′ is important, as it determines the result of the decomposition (as we have seen in chapter 2), the speed of updates, and generally can have a huge impact on database performance. Optimal results are usually achieved by covers which are in some standard form, typically canonical or LR-reduced, but finding these optimal covers is often NP-hard. We approach the problem by providing an algorithm for computing the set of all canonical covers, using an efficient form of representation. Once computed, finding the best canonical (or LR-reduced) cover w.r.t. some criteria is then a simple matter of comparing covers. Example 3.1. Consider again the schema CLRT from example 2.1 with Σ = {C → L,CT → R,LT → C,RT → C} Instead of using Σ, a designer could instead use any of its canonical covers: {C → L,CT → R,LT → C,RT → C}, {C → L,CT → R,LT → C,RT → L}, {C → L,LT → R,RT → C} The last cover is the best choice, since it is smaller than the others and induced a BCNF decomposition when using it for the synthesis approach. Note though that the number of canonical covers can be exponential in the number of attributes and FDs. It may therefore be more efficient to try computing only the cover which is actually needed. However, for some problems finding the desired cover directly (or testing whether one exists) can be difficult - we will see an example for this in section 4.5.1. But even if more efficient methods can be found, which compute the desired cover without computing all canonical covers first, the tools we develop for computing all canonical covers may prove useful in finding specific covers as well. In order to make computing the set of all canonical covers feasible when the number of such covers is huge, we need to decompose it into smaller “partial” covers, to represent it efficiently. The decomposition we have in mind is the following: Example 3.2. Let Σ consist of the following FDs: Σ = {AB → C,C → A,A → D,D → E,E → A} 55 Then Σ has the following canonical covers: CC(Σ) =    {AB → C,C → A,A → D,D → E,E → A}, {AB → C,C → A,A → E,E → D,D → A}, {AB → C,C → A,A → D,D → A,A → E,E → A}, {AB → C,C → A,A → D,D → A,D → E,E → D}, {AB → C,C → A,A → E,E → A,D → E,E → D}, {DB → C,C → A,A → D,D → E,E → A}, {DB → C,C → A,A → E,E → D,D → A}, {DB → C,C → A,A → D,D → A,A → E,E → A}, {DB → C,C → A,A → D,D → A,D → E,E → D}, {DB → C,C → A,A → E,E → A,D → E,E → D}, {EB → C,C → A,A → D,D → E,E → A}, {EB → C,C → A,A → E,E → D,D → A}, {EB → C,C → A,A → D,D → A,A → E,E → A}, {EB → C,C → A,A → D,D → A,D → E,E → D}, {EB → C,C → A,A → E,E → A,D → E,E → D}, {AB → C,C → D,A → D,D → E,E → A}, {AB → C,C → D,A → E,E → D,D → A}, {AB → C,C → D,A → D,D → A,A → E,E → A}, {AB → C,C → D,A → D,D → A,D → E,E → D}, {AB → C,C → D,A → E,E → A,D → E,E → D}, {DB → C,C → D,A → D,D → E,E → A}, {DB → C,C → D,A → E,E → D,D → A}, {DB → C,C → D,A → D,D → A,A → E,E → A}, {DB → C,C → D,A → D,D → A,D → E,E → D}, {DB → C,C → D,A → E,E → A,D → E,E → D}, {EB → C,C → D,A → D,D → E,E → A}, {EB → C,C → D,A → E,E → D,D → A}, {EB → C,C → D,A → D,D → A,A → E,E → A}, {EB → C,C → D,A → D,D → A,D → E,E → D}, {EB → C,C → D,A → E,E → A,D → E,E → D}, {AB → C,C → E,A → D,D → E,E → A}, {AB → C,C → E,A → E,E → D,D → A}, {AB → C,C → E,A → D,D → A,A → E,E → A}, {AB → C,C → E,A → D,D → A,D → E,E → D}, {AB → C,C → E,A → E,E → A,D → E,E → D}, {DB → C,C → E,A → D,D → E,E → A}, {DB → C,C → E,A → E,E → D,D → A}, {DB → C,C → E,A → D,D → A,A → E,E → A}, {DB → C,C → E,A → D,D → A,D → E,E → D}, {DB → C,C → E,A → E,E → A,D → E,E → D}, {EB → C,C → E,A → D,D → E,E → A}, {EB → C,C → E,A → E,E → D,D → A}, {EB → C,C → E,A → D,D → A,A → E,E → A}, {EB → C,C → E,A → D,D → A,D → E,E → D}, {EB → C,C → E,A → E,E → A,D → E,E → D}    56 Instead of using this bulky “direct” representation, we decompose CC(Σ) into sets of partial covers. We then obtain a full cover by selecting one partial cover from each set and forming their union: CC(Σ) =    {AB → C}, {DB → C}, {EB → C}   ∨    {C → A}, {C → D}, {C → E}   ∨    {A → D,D → E,E → A}, {A → E,E → D,D → A}, {A → D,D → A,A → E,E → A}, {A → D,D → A,D → E,E → D}, {A → E,E → A,D → E,E → D}    Using this decomposed representation, dealing with CC(Σ) becomes much easier. 3.1 Hypergraph Decomposition The set of all canonical covers of a given set of FDs forms a simple hypergraph. We shall therefore establish some results on representing hypergraphs in general, which will be useful in representing (and even computing) the set of all canonical covers in particular. Definition 3.1. A hypergraph H on a vertex set V is a set of subsets of V , i.e., H ⊆ P(V ). The elements of H are called edges. A hypergraph is called simple if none of its edges is included in another. Definition 3.2. The set ϑH of vertices actually appearing in edges of a hypergraph H is called the support of H: ϑH := ⋃ e∈H e Note that we do allow the empty edge in a hypergraph, and that we do not require that V = ϑH . The latter point is not important but simplifies some arguments. Definition 3.3. Let H,G be hypergraphs. We define the cross-union H ∨G of H and G as H ∨G := {eH ∪ eG | eH ∈ H, eG ∈ G} If VH , VG are the vertex sets of H and G then H ∨G is a hypergraph on VH ∪ VG. Definition 3.4. Let H be a hypergraph on vertex set V and S some vertex set. The projection H[S] of H onto S is H[S] := {e ∩ S | e ∈ H}, which is a hypergraph on V ∩ S. 3.1.1 Autonomous Sets We shall introduce the concept of an autonomous vertex set. Note that our definition is not meant to extend any use of the term ”autonomous set” in the context of graphs, where it is better known as “module”, and characterizes vertex sets M in which each vertex v ∈ M has the same neighbors outside M . Using our terminology, autonomous sets are only interesting for hypergraphs but not for graphs. Essentially the only graphs with non-trivial autonomous sets are complete bipartite graphs, with the possibility to add isolated vertices and/or loops to all vertices of one side of the bipartition. 57 Definition 3.5. Let H be a hypergraph on the vertex set V . We call a vertex subset S ⊆ V autonomous if H = H[S] ∨H[S] where S := V \ S denotes the complement of S. Clearly the complement of an autonomous set is itself autonomous, and H ⊆ H[S] ∨H[S] for any S ⊆ V . The sets ∅, V are autonomous for any hypergraph H on V , as are all subsets of V \ ϑH and their complements. Example 3.3. Consider the vertex set V = ABCDE, and on it the hypergraph H = {AC,AD,BC,BD} H is simple, and its support is ϑH = ABCD. The set S = AB is autonomous for H, as is its complement S = CDE, since H[AB] ∨H[CDE] = {A,B} ∨ {C,D} = {AC,AD,BC,BD} = H Lemma 3.6. Let S, T ⊆ V be autonomous. Then S ∩ T is autonomous as well. Proof. We need to show that for every pair of edges e1, e2 ∈ H the edge e′1 ∪ e′2 with e′1 := e1 ∩ (S ∩ T ) and e′2 := e2 ∩ S ∩ T lies in H as well. Since S is autonomous, the edge e′ := (e1 ∩ S) ∪ (e2 ∩ S) lies in H. Thus, since T is autonomous, the edge e′′ := (e′ ∩ T ) ∪ (e2 ∩ T ) lies in H. Clearly e′′ ∩ (S ∩ T ) = e′ ∩ (S ∩ T ) = e1 ∩ (S ∩ T ) = e′1 and similarly e′′ ∩ S ∩ T = (e′ ∩ (T \ S)) ∪ (e2 ∩ T ) = (e2 ∩ (T \ S)) ∪ (e2 ∩ T ) = e′2 which shows e′1 ∪ e′2 = e′′ ∈ H. Corollary 3.7. Let S, T ⊆ V be autonomous. Then S ∪ T is autonomous. Proof. The complements and intersections of autonomous sets are autonomous by defini- tion and by Lemma 3.6, respectively, and we have S ∪ T = S ∩ T Proposition 3.8. Let H,G be hypergraphs and S1, S2 vertex sets. Then we have (i) H[S1][S2] = H[S1 ∩ S2] (ii) (H ∨G)[S1] = H[S1] ∨G[S1] 58 Lemma 3.9. Let H be a hypergraph on V and S ⊆ V autonomous for H. Then for any T ⊆ V the set S ∩ T is autonomous for H[T ]. Proof. Since S is autonomous for H we have H = H[S] ∨H[S]. Thus H[T ] = (H[S] ∨H[S])[T ] = H[S][T ] ∨H[S][T ] = H[T ][S ∩ T ] ∨H[T ][S ∩ T ] Theorem 3.10. Let H be a hypergraph on V and {S1, . . . , Sn} a partition of V into autonomous sets. Then H = H[S1] ∨ . . . ∨H[Sn] Proof. By induction on n. The equation hold trivially for n = 1. Assume now the theorem holds for a fixed value of n. To show the theorem for n+1 we use that Sn∪Sn+1 is autonomous by Corollary 3.7, so that by assumption we have H = H[S1] ∨ . . . ∨H[Sn−1] ∨H[Sn ∪ Sn+1] By Lemma 3.9 Sn is autonomous for H[Sn ∪ Sn+1], i.e., we have H[Sn ∪ Sn+1] = H[Sn] ∨H[Sn+1] which shows the theorem for n+ 1. When talking about minimal autonomous sets , we will always mean minimal w.r.t. inclusion among all non-empty autonomous sets, even though the empty set is always autonomous by definition. While it would be more precise to call them minimal non- empty autonomous sets, this quickly becomes tedious. Theorem 3.11. Every hypergraph H has a finest partition {S1, . . . , Sn} into minimal autonomous sets. The autonomous sets of H are just the unions of these sets. Proof. Let S1, . . . , Sn be the minimal autonomous sets of H. By Lemma 3.6 they are pairwise disjoint. The union of autonomous sets is itself autonomous by Corollary 3.7, in particular S := n⋃ i=1 Si Furthermore the complement of S is autonomous, and since S does not include any min- imal autonomous set it is empty, i.e., S = V . Thus the sets S1, . . . , Sn form a partition of V . Whenever an autonomous set T intersects with some Si it must include it completely, since otherwise T ∩Si would be a smaller non-empty autonomous set by Lemma 3.6. Thus each autonomous set is the union of those Si it intersects with. Example 3.4. Consider again H = {AC,AD,BC,BD} on V = ABCDE. Then H = H[AB] ∨H[CD] ∨H[E] = {A,B} ∨ {C,D} ∨ {∅} so the minimal autonomous sets of H are AB,CD,E. Thus H has a total of 23 au- tonomous sets: ∅, AB,CD,E,ABCD,ABE,CDE,ABCDE 59 We now consider another type of decomposition, which will help us in characterizing the autonomous sets of simple hypergraphs. Definition 3.12. Let H be a hypergraph on vertex set V . The subhypergraph H 〈S〉 of H induced by S ⊆ V is H 〈S〉 := {e ∈ H | e ⊆ S} Definition 3.13. Let H be a hypergraph on the vertex set V . We call a vertex subset S ⊆ V isolated if H = H 〈S〉 ∪H 〈S〉. Clearly S is isolated if and only if every edge it intersects with lies completely in S. As with minimal autonomous sets, we will mean by minimal isolated sets the minimal sets (w.r.t. inclusion) among all non-empty isolated sets. Definition 3.14. As with graphs, we say that two vertices v1, vn in a hypergraph H are connected if there exists a sequence v1, v2, . . . , vn such vi, vi+1 always lie in some common hyperedge ofH. H is connected if all its vertices are connected. The connected components of H are its connected subhypergraphs. It follows immediately that the minimal isolated sets of H are the vertex sets of its maximal connected components, and that the isolated sets of H are the unions of them. Definition 3.15. Let H be a hypergraph on V . A set t ⊆ V is a transversal of H if t intersects with every edge of H. We denote the set of all minimal transversals (w.r.t. inclusion) by Tr(H), and call Tr(H) the transversal hypergraph of H. Clearly Tr(H) is a simple hypergraph on V , even if H is not simple. Theorem 3.16. Let H,G be hypergraphs on disjoint vertex sets VH and VG. Then Tr(H ∨G) = Tr(H) ∪ Tr(G) Tr(H ∪G) = Tr(H) ∨ Tr(G) Proof. (1) We first show that a set t ⊆ V := VH ∪ VG is a transversal of H ∨ G iff it intersects with every edge of H or with every edge of G. For the “if” part, assume w.l.o.g. that t intersects with every edge of H. Since every edge e ∈ H ∨G is of the form e = eH ∪ eG with eH ∈ H, eG ∈ G, t intersects with e because it intersects with eH . We show the “only if” part by contraposition and assume that there be edges eH ∈ H, eG ∈ G such that t intersects with neither of them. But then t does not intersect eH ∪eG ∈ H ∨G either, i.e., t is not a transversal of H ∨G. We thus have that the transversals of H ∨ G are the transversals of H plus the transversals of G. Thus the minimal transversals of H ∨ G are the minimal elements of Tr(H) ∪ Tr(G). Since VH and VG are disjoint, all elements of Tr(H) ∪ Tr(G) are minimal. Thus Tr(H ∨G) = Tr(H) ∪ Tr(G) (2) By definition a set t ⊆ V is a transversal of H ∪ G iff it is a transversal of both H and G. Thus the transversals of H ∪ G are the unions of transversals of H with transversals of G. The minimal transversals of H ∪G are therefore the minimal elements of Tr(H) ∨ Tr(G). Since VH and VG are disjoint, all elements of Tr(H) ∨ Tr(G) are minimal. Thus Tr(H ∪G) = Tr(H) ∨ Tr(G) 60 Lemma 3.17. [7] Let H be a simple hypergraph. Then Tr(Tr(H)) = H. We are now able to characterize the autonomous sets of a simple hypergraph. Theorem 3.18. Let H be a simple hypergraph. Then the autonomous sets of H are the isolated sets of its transversal hypergraph Tr(H). Proof. Let S ⊆ V be autonomous for H, i.e., H = H[S] ∨H[S]. Then Tr(H) = Tr(H[S]) ∪ Tr(H[S]) by Theorem 3.16, so S is isolated for Tr(H). Conversely let S be any isolated set of Tr(H). Then Tr(H) = Tr(H) 〈S〉 ∪ Tr(H) 〈S〉 and by Theorem 3.16 we have Tr(Tr(H)) = Tr(Tr(H) 〈S〉) ∨ Tr(Tr(H) 〈S〉) Thus S is autonomous for Tr(Tr(H)) = H. Example 3.5. The requirement that H be simple in Theorem 3.18 is necessary: Consider the hypergraphs H = {AC,AD,BC,BD} and H ′ = H ∪ {ABC} Both H and H ′ have the same minimal transversals Tr(H) = Tr(H ′) = {AB,CD} Clearly AB,CD are isolated sets of Tr(H ′), but AB and CD are not autonomous for H ′: H ′[AB] ∨H ′[CD] = {A,B,AB} ∨ {C,D} = {AC,AD,BC,BD,ABC,ABD} = H ′ ∪ {ABD} 6= H ′ Since graphs are just special hypergraphs, our theory of autonomous sets applies to them as well. Clearly all complete bipartite graphs have a non-trivial partition into two autonomous sets, but one may wonder whether there are others. Lemma 3.19. A simple graph G without isolated vertices has a non-trivial partition into autonomous sets iff it is complete bipartite. Proof. Let S /∈ {∅, ϑG} be autonomous. Since G contains no isolated vertices, G[S] and G[S] contain non-empty edges. As all edges in a simple graph contain exactly two vertices, the edges of G[S] and G[S] contain exactly one vertex each. Thus G = G[S] ∨ G[S] is complete bipartite. We note that non-simple graphs with non-trivial autonomous partition may also have loops on all vertices of one side of the bipartition, as well as isolated vertices. 61 3.1.2 Superedges and Partial Superedges While canonical covers will form the edges in our hypergraph, we will have to argue about atomic covers, i.e., sets of atomic FDs which form a cover, but may contain more FDs than needed. We call such supersets of edges “superedges”. Definition 3.20. Let H be a hypergraph on V . A set E ⊆ V is called a superedge of H if it includes some edge e ∈ H, i.e. e ⊆ E. We call E ⊆ S ⊆ V a partial (super)edge on S if E is a (super)edge of H[S]. Lemma 3.21. Let H be a hypergraph on V and S ⊆ V . A set S ′ ⊆ S is a partial superedge on S iff S ′ ∪ S is a superedge. Proof. By definition S ′ is a partial superedge on S iff it includes a partial edge eS ∈ H[S], i.e. iff there exists an edge e ∈ H with e ∩ S = eS ⊆ S ′. Since e = eS ∪ (e ∩ S) ⊆ S ′ ∪ S this implies that S ′ ∪S is a superedge. Conversely, if S ′ ∪S is a superedge, it includes an edge e ∈ H, which gives us e ∩ S ⊆ (S ′ ∪ S) ∩ S = S ′ Lemma 3.22. Let H be a hypergraph on V and P = {S1, . . . , Sn} a partition of V into autonomous sets. A set E ⊆ V is a superedge iff Ei := E ∩ Si is a partial superedge on Si for i = 1, . . . , n. Proof. If E is a superedge then it includes an edge e ∈ H. Thus Ei includes e∩Si ∈ H[Si], which makes Ei a partial superedge on Si. Now for each i = 1, . . . , n let Ei be a partial superedge on Si, including the partial edge ei ∈ H[Si]. Thus E includes e := e1 ∪ . . . ∪ en ∈ H[S1] ∨ . . . ∨H[Sn] (thm 3.10)= H which makes E a superedge of H. We can therefore strengthen Lemma 3.21 when S is autonomous: Lemma 3.23. Let H be a hypergraph on V , E ⊆ V a superedge and S ⊆ V autonomous. A set S ′ ⊆ S is a partial superedge on S iff S ′ ∪ (E \ S) is a superedge. Proof. By Lemma 3.22, the set S ′ ∪ (E \ S) is a superedge iff S ′ is a partial superedge on S and E \ S is a partial superedge on S. Since E is a superedge, Lemma 3.22 assures that E \ S = E ∩ S is a partial superedge on S. 62 3.1.3 Computing Autonomous Sets To complete this section, we now address the question of computing the minimal au- tonomous sets of H. While Theorem 3.18 suggests an approach (at least for simple hypergraphs), computing the transversal hypergraph can lead to exponential runtime. Instead, we shall utilize the following observation. Lemma 3.24. Let H be a hypergraph on V and P = {S1, . . . , Sn} the partition of V into minimal autonomous sets. Let further H ′1 ⊆ H[S1] be non-empty, and H ′ ⊆ H be the hypergraph H ′ := H ′1 ∨H[S2] ∨ . . . ∨H[Sn]. Then S2, . . . , Sn are minimal autonomous sets of H ′. Proof. By definition S2, . . . , Sn are autonomous for H ′. If one of those Si were not minimal for H ′, i.e., could be partitioned into smaller autonomous sets T1, . . . , Tk, then H[Si] = H[T1] ∨ . . . ∨H[Tk] and thus the sets Ti would be autonomous for H as well, contradicting the minimality of Si. We use this to compute the partition of V into minimal autonomous sets as follows. We pick some vertex v ∈ V and split H into two hypergraphs, one containing all the edges which contain v, the other one containing all those edges which do not contain v. We will need only one of them, so let Hv be the smaller one of the two, i.e., the one with fewer edges (if both contain exactly the same number of edges we may choose either one): Hv := smaller of { {e ∈ H | v ∈ e} {e ∈ H | v /∈ e} If Hv is empty, then v lies in all or no edges of H, and in both cases the set {v} is autonomous for H. This reduces the problem of finding the minimal autonomous sets of H to finding the minimal autonomous sets of H[v], where v := ϑH \ {v}, as they are also minimal autonomous sets of H. Consider now the case where Hv is not empty. Let S1 be the minimal autonomous set of H containing v. Then Hv has the same form as H ′ in Lemma 3.24, where H ′1 contains either the edges of H[S1] which do or those which do not contain v. We now compute the minimal autonomous sets of Hv, and check for each set whether it is autonomous for H. By Lemma 3.24 the sets autonomous for H are exactly the S2, . . . , Sn, while the sets not autonomous for H partition S1. Taking the union of those non-autonomous sets and keeping the autonomous ones thus gives us the minimal autonomous sets of H. Note that the set {v} is always autonomous for Hv, as v is contained in either all or no edges of Hv. Thus it suffices to compute the minimal autonomous sets of Hv[v]. In either case we have reduced the problem of finding the minimal autonomous set of H to that of finding the minimal autonomous sets of a hypergraph with fewer vertices. This gives us the following recursive algorithm. 63 Algorithm “Recursive Autonomous Partitioning” INPUT: hypergraph H OUTPUT: partition of ϑH into minimal autonomous sets function RAP(H) select vertex v ∈ ϑH Hv := smaller of { {e ∈ H | v ∈ e} {e ∈ H | v /∈ e} if Hv = ∅ then return {{v}} ∪RAP (H[v]) else Aut := ∅, S1 := {v} Autv := RAP (Hv[v]) for all S ∈ Autv do if S autonomous for H then Aut := Aut ∪ {S} else S1 := S1 ∪ S end return {S1} ∪ Aut While the test whether a set S is autonomous for H can be performed by computing H ′ := H[S]∨H[S] and comparing it to H, the resulting set can easily contain up to |H|2 edges if S is not autonomous. We observe that always H ⊆ H ′ and thus H = H ′ iff |H| = |H ′|. Since |H ′| = |H[S]| · |H[S]|, the later condition can be checked faster without actually computing H ′. Theorem 3.25. Let H be a hypergraph with k vertices and n edges. Then the ”Re- cursive Autonomous Partitioning” algorithm computes the partition of ϑH into minimal autonomous sets of H in time O(nk2). Proof. We have already argued that the algorithm computes the minimal autonomous sets of H correctly, so we only need to show the time bound. The depth of recursion is at most k. In each call we compute Hv, which can be done in O(n). If Hv = ∅ we only need to compute H[v], which is possible in O(nk). Thus this part of the algorithm can be performed in O(nk2). If Hv 6= ∅ we need to test each set found to be autonomous for Hv whether it is autonomous for H. The number of such tests is at most k, and each test can be performed in O(nk), by computing H[S] and H[S] and testing whether |H| = |H[S]| · |H[S]|. This leads to a complexity of O(nk2). Since the number of edges of Hv is at most half of the number of edges of H, the number of steps required for performing the tests on Hv (or the next subgraph in the recursion for which tests are required) is at most half as many. This leads to a total complexity of O((n+ n2 + n 4 + . . .)k 2) = O(nk2) 64 3.2 Computing all Canonical Covers Recall that we wish to compute the set of all canonical covers as a general method for finding covers which are best in some sense. Of cause, this approach only works if at least some of the optimal solutions are canonical, and comparing two given covers is easy. We will see later that these conditions are met for a number of hard problems, for which no efficient algorithms are known. In the following we will develop an algorithm for computing the set of all canonical covers. As far as we know, no such algorithm has appeared in the literature so far. Definition 3.26. Let Σ be a set of FDs. We denote the set of all canonical covers of Σ by CC(Σ) := {G ⊆ Σ∗a | G is a canonical cover of Σ} Note that the problem of finding canonical covers is similar to that of finding minimal keys. Instead of seeking all minimal sets of attributes that determine all other attributes, we seek all minimal sets of FDs which imply all other FDs. For this we will use the linear resolution algorithm introduced in chapter 2. 3.2.1 Partial Covers The set of all canonical covers of Σ forms a simple hypergraph on the FDs in Σ∗a. We may thus use the terms defined for hypergraphs for canonical covers as well. In particular, we shall talk about autonomous sets of FDs, and (partial) superedges. Note that in this context the superedges are the atomic covers, while the edges are the canonical covers. Definition 3.27. We call a set of FDs in Σ∗a autonomous if it is autonomous for the hypergraph CC(Σ). When talking about transversals, we always mean transversals of CC(Σ). Lemma 3.28. A set G ⊆ Σ∗a is a cover of Σ iff it intersects with all minimal transversals of CC(Σ). Proof. G is a cover iff it is a superedge of CC(Σ). Furthermore, CC(Σ) is simple, and by Lemma 3.17 the edges of a simple hypergraph are the minimal sets which intersect with all minimal transversals. Thus superedges are simply sets (not necessarily minimal) which intersect with all minimal transversals. As superedges become (atomic) covers for the hypergraph CC(Σ), partial superedges become partial covers. Definition 3.29. Let Σ be a set of FDs and G ⊆ S ⊆ Σ∗a. We call G a partial cover of Σ on S if G is a partial superedge of CC(Σ) on S. When S is autonomous, testing whether a set of FDs is a partial cover on S is easy: Lemma 3.30. Let S ⊆ Σ∗a be autonomous, and let Σ′ ⊆ Σ∗a be an atomic cover of Σ. Then a set G ⊆ S is a partial cover on S iff G ∪ (Σ′ \ S) is a cover of Σ. Proof. Follows directly from Lemma 3.23. 65 Clearly G∪ (Σ′ \S) is a cover of Σ iff G∪ (Σ′ \S) ² Σ′∩S, which allows us to perform this test quickly. We will identify some autonomous (but not necessarily minimal) sets of CC(Σ). The- orem 3.18 relates autonomous sets to the minimal transversals of CC(Σ). The following lemmas establish some results about the form of these minimal transversals. Lemma 3.31. Let S ⊆ Σ∗a be a minimal transversal of CC(Σ) and X → A ∈ S. Then S = Σ∗a \ S is not a cover of Σ, but S ∪ {X → A} is. Proof. By Lemma 3.28, S is not a cover of Σ since it does not intersect with S. If S ∪ {X → A} were not a cover, then every cover would contain a FD in S ∪ {X → A} = S \ {X → A} Thus S \ {X → A} would be a transversal, which contradicts the minimality of S. Definition 3.32. The sets of attributes X and Y are equivalent under a set of FDs Σ, written X ↔ Y , if X → Y and Y → X lie in Σ∗. Lemma 3.33. Let X → A, Y → B be contained in a common minimal transversal S ⊆ Σ∗a of CC(Σ). Then X and Y are equivalent under S = Σ∗a \ S. Proof. By Lemma 3.31 we have S 2 Y → B S ∪ {X → A} ² Y → B (3.1) Let us denote the closure of Y under S by Y ∗S. If X * Y ∗S then Y ∗S = Y ∗S∪{X→A} which contradicts (3.1). Thus S ² Y → X, and by symmetry S ² X → Y . Definition 3.34. Let Σ be a set of FDs on R. We denote the set of FDs in Σ∗a with LHS equivalent to X ⊆ R as EQX := {Y → Z ∈ Σ∗a | Y ↔ X} The partition of Σ∗a into non-empty equivalence sets is denoted as EQ := {EQX | ∃Y.X → Y ∈ Σ∗a} Theorem 3.35. Let Σ be a set of FDs on R. Then every set EQX ∈ EQ is autonomous. Proof. By Lemma 3.33 all FDs in a (maximal) connected component of Tr(CC(Σ)) have equivalent LHSs under Σ. Thus EQX is the union of vertex sets of maximal connected components of Tr(CC(Σ)), and therefore an isolated set of Tr(CC(Σ)). By Theorem 3.18 isolated sets of Tr(CC(Σ)) are autonomous for CC(Σ). We are now ready to prove our main theorem for this section. Theorem 3.36. Let Σ be a set of FDs on R. A set G ⊆ Σ∗a is a cover of Σ iff G∩EQX is a partial cover of Σ on EQX for every EQX ∈ EQ. 66 Proof. By Theorem 3.35 the sets EQX form a partition of Σ∗a into autonomous sets, so the theorem is a special case of Lemma 3.22. Theorem 3.36 allows us to split the task of finding and representing all canonical covers of Σ into several smaller tasks. For every EQX ∈ EQ we find the set CX of all non-redundant partial covers on EQX . By Theorem 3.10 these describe CC(Σ) in decomposed form: CC(Σ) = CX1 ∨ . . . ∨ CXn While we could easily compute CC(Σ) by taking their cross-union, this decomposed de- scription of CC(Σ) is usually much smaller, and thus better suited for most tasks. Note that the equivalence classes EQX need not be minimal autonomous sets of CC(Σ). If we could find smaller autonomous sets we could speed up the computation of CC(Σ) even more. However, we will show later (Theorem 3.81) that finding the min- imal autonomous sets of CC(Σ) is hard. In section 3.4 we will give an algorithm to find finer, but not necessarily minimal autonomous sets. 3.2.2 Relative Covers Partial covers on a set S of atomic FDs are obtained by taking an atomic cover and intersecting it with S. A different concept which will become useful is that of relative covers, which can be obtained through ’relativation’ of covers onto sets of attributes. Definition 3.37. Let Σ be a set of FDs on R, and ΣH be a set of FDs on H ⊆ R,H := R \H. We call ΣH a relative cover of Σ on H if for all X,Y ⊆ H we have ΣH ² X → Y ⇔ Σ ² X ∪H → Y Relative covers have been used previously by Saiedian and Spencer in [39] under the name contraction. Definition 3.38. The relativation of a FD X → Y onto an attribute set H is X → Y ]H[ := X ∩H → Y ∩H The relativation of a set Σ of FDs onto H is Σ ]H[ := {X → Y ]H[ | X → Y ∈ Σ} Note that we do allow FDs with empty LHS. They arise naturally when relativating sets of FDs with non-empty LHSs. We will show that relative covers can be constructed through relativation. Lemma 3.39. Let ΣH be a relative cover of Σ on H. Then Σ∗H = Σ∗ ]H[ Proof. By definition we have Σ∗ ]H[ = {X ∩H → Y ∩H | X → Y ∈ Σ∗} 67 Thus for any X,Y ⊆ H we get X → Y ∈ Σ∗ ]H[ ⇔ ∃X ′ ⊆ H with X ∪X ′ → Y ∈ Σ∗ ⇔ X ∪H → Y ∈ Σ∗ ⇔ X → Y ∈ Σ∗H The last correspondence holds since ΣH is a relative cover. Lemma 3.40. Let Σ be a set of FDs on R and H ⊆ R. (a) If Σ ² X → Y then Σ ]H[ ² X ∩H → Y ∩H (b) If Σ ]H[ ² X → Y then Σ ² X ∪H → Y Proof. (a) If X → Y ∈ Σ then X ∩ H → Y ∩ H ∈ Σ ]H[. Otherwise X → Y can be derived from Σ using the Armstrong Axioms (1.1). We show that X ∩H → Y ∩H can be derived from Σ ]H[ by induction on the length of the derivation tree used to derive X → Y . This is straight forward: derivation from Σ derivation from Σ ]H[ X → Y Y ⊆ X y X ∩H → Y ∩HY ∩H ⊆ X ∩H X → Y XW → YW y X ∩H → Y ∩H XW ∩H → YW ∩H X → Y Y → Z X → Z y X ∩H → Y ∩H Y ∩H → Z ∩H X ∩H → Z ∩H (b) If X → Y ∈ Σ ]H[ then Σ contains a FD X ∪X ′ → Y ∪Y ′ with X ′, Y ′ ⊆ H, which implies X ∪H → Y . The remaining argument proceeds as for (a). Lemma 3.41. Let Σ be a set of FDs on R and H ⊆ R. Then Σ∗ ]H[ = (Σ ]H[)∗ Proof. We can show Σ∗ ]H[ ⊆ (Σ ]H[)∗ as follows: X ′ → Y ′ ∈ Σ∗ ]H[ ⇔ ∃X → Y ∈ Σ∗ with X ′ = X ∩H, Y ′ = Y ∩H ⇔ Σ ² X → Y (Lemma 3.40a) ⇒ Σ ]H[ ² X ∩H → Y ∩H ⇔ X ′ → Y ′ ∈ (Σ ]H[)∗ The proof for (Σ ]H[)∗ ⊆ Σ∗ ]H[ is similar: X → Y ∈ (Σ ]H[)∗ ⇔ Σ ]H[ ² X → Y (Lemma 3.40b) ⇒ Σ ² X ∪H → Y ⇔ X ∪H → Y ∈ Σ∗ ⇒ X → Y ∈ Σ∗ ]H[ 68 Using lemmas 3.39 and 3.41 we get: Theorem 3.42. Let Σ be a set of FDs on R and H ⊆ R. Then the relativation Σ ]H[ of Σ is a relative cover of Σ on H. Proof. Let ΣH be a relative cover of Σ on H. We need to show that Σ∗H = (Σ ]H[)∗. This is clear by lemmas 3.39 and 3.41: Σ∗H = Σ∗ ]H[ = (Σ ]H[)∗ Note that a variant of Theorem 3.42 has been shown previously: It corresponds to lemma 6 in [39]. Note further that the converse of Theorem 3.42 does not hold: not every relative cover is the relativation of a cover. Example 3.6. Consider the set of FDs Σ = {AB → C,C → D,B → D}. Its relativation onto H = BCD is a relative cover of Σ on H: Σ ]H[ = {B → C,C → D,B → D} The FD B → D is redundant in Σ ]H[ so that the set ΣH := {B → C,C → D} is also a relative cover of Σ on H. However, every cover of Σ contains B → D or B → BD, so ΣH cannot be the relativation of a cover of Σ. 3.2.3 Implication Dependencies In section 2.1.5 we have seen how minimal keys can be computed efficiently, and we noted that the problem of finding canonical covers is similar. However, in the case of the key finding problem, a set of FDs was used to describe determination between attribute sets, whereas implication of FDs is given implicitly. To utilize our linear resolution algorithm, we need to make implications explicit. This is done as follows. Definition 3.43. Let Σ be a set of FDs. We call an expression of the form S ⇒ T where S, T ⊆ Σ an implication dependency (ID). An ID S ⇒ T is the equivalent of a FD S → T over Σ, where Σ is regarded as attribute set (i.e., we regard the FDs in Σ as independent attributes without any connection). We thus use the terminology defined for FDs for IDs as well, assuming an equivalent definition. In particular, we say that a set Π of IDs implies an ID S ⇒ T iff S ⇒ T can be derived from Π using the equivalent of the Armstrong axioms (with → replaced by ⇒). Definition 3.44. Let Σ be a set of FDs. We call a set Π of IDs on Σ an implication cover of Σ if for all sets S, T ⊆ Σ we have S ⇒ T ∈ Π∗ iff S ² T We call Π sound if S ⇒ T ∈ Π∗ implies S ² T , and complete for Σ if S ² T implies S ⇒ T ∈ Π∗. Furthermore, we call ΠH a relative implication cover on H ⊆ Σ if for all S, T ⊆ H we have S ⇒ T ∈ Π∗H iff S ∪ (Σ \H) ² T 69 Note that the relationship of implication covers and relative implication covers is the same as for covers and relative covers: For any implication cover Π the condition S ⇒ T ∈ Π∗H iff S ∪ (Σ \H) ² T is equivalent to ΠH ² S ⇒ T iff Π ² S ∪ (Σ \H) ⇒ T which resembles precisely Definition 3.37. In particular this gives us Theorem 3.42 for implication covers: Corollary 3.45. Let Σ be a set of FDs with implication cover Π and H ⊆ Σ. Then Π ]H[ is a relative implication cover of Σ on H. Let us now recall Lemma 3.21. For the hypergraph CC(Σ) it states the following: Corollary 3.46. Let Σ be a set of FDs and H ⊆ Σ∗a. A set S ⊆ H is a partial cover on H iff S ∪ (Σ∗a \H) is a cover of Σ. The last condition of Corollary 3.46 is equivalent to S ∪ (Σ∗a \H) ² Σ∗a, and thus to S ∪ (Σ∗a \H) ² H Comparing this to Definition 3.44, we can rewrite the condition as S ⇒ H ∈ Π∗H for any relative implication cover ΠH of Σ∗a on H. This characterizes S as a key of H w.r.t. the set of IDs ΠH , giving us the following lemma. Lemma 3.47. Let S ⊆ H ⊆ Σ∗a, and ΠH be a relative implication cover of Σ∗a on H. Then S is a partial cover of Σ∗a on H iff S ⇒ H ∈ Π∗H . Proof. See above. The last lemma allows us to find partial canonical covers as follows: We first find a relative implication cover ΠH , then use linear resolution to find all minimal keys w.r.t. ΠH , which are the partial canonical covers needed. The next theorem allows us to compute a relative implication cover. To make the soundness proof for a (relative) implication cover easier, we first show a simple lemma. Lemma 3.48. A set Π of IDs on Σ is sound iff S ² T for all S ⇒ T ∈ Π. A set ΠH of IDs on H ⊆ Σ is sound iff S ∪ (Σ \H) ² T for all S ⇒ T ∈ ΠH . Proof. We only need to show that S ² T for all derived IDs S ⇒ T ∈ Π∗, and S∪(Σ\H) ² T for IDs S ⇒ T ∈ Π∗H . For each application of the armstrong axioms for IDs, it is easy to see that soundness of the premises (i.e., S ² T or S∪ (Σ\H) ² T , respectively) implies soundness of the derived IDs. 70 Theorem 3.49. Let Σ be a set of FDs, EQ the partition of Σ∗a into equivalence classes, and H = EQX for some set EQX ∈ EQ. Construct ΠX as follows: For every pair of (different) FDs Y → A ∈ H,Z → A ∈ Σ∗a with Σ ² Y → Z let ΠX contain the ID {Y → Zi ∈ H | Zi ∈ Z} ∪ {Z → A} ⇒ Y → A (3.2) provided Z → A ∈ H, or {Y → Zi ∈ H | Zi ∈ Z} ⇒ Y → A (3.3) otherwise. Then ΠX is a relative implication cover of Σ∗a on H. Proof. We first show that ΠX is sound. By Lemma 3.48 it suffices to show that for every ID in ΠX of the form (3.2) or (3.3) we have {Y → Zi ∈ H | Zi ∈ Z} ∪ {Z → A} ∪ (Σ∗a \H) ² Y → A (3.4) For every Zi ∈ Z \ Y there is a minimal Yi ⊂ Y such that Yi → Zi ∈ Σ∗a. If Yi 6= Y , then Yi and Y are not equivalent, since Y is a minimal LHS. Thus Yi → Zi ∈ Σ∗a \H, so that all Yi → Zi are contained in the LHS of (3.4). Clearly {Yi → Zi | Zi ∈ Z \ Y } ∪ {Z → A} implies Y → A. To prove that ΠX is complete, let S, T ⊆ H with S ∪ (Σ∗a \H) ² T . We need to show that S ⇒ T ∈ Π∗X . Assume the contrary, so that for U ′ := S ∪ (Σ∗a \ H) there exists a FD Y ′ → A′ ∈ T (and thus Y ′ → A′ ∈ H) with (note that U ′ ∩H = S): U ′ ² Y ′ → A′ and U ′ ∩H ⇒ Y ′ → A′ /∈ Π∗X Now let U ⊆ U ′ be minimal such that there exists a FD Y → A ∈ H for which U ² Y → A and U ∩H ⇒ Y → A /∈ Π∗X Consider closure computation for Y under U : Since we have U ² Y → A there must be a FD Z → A ∈ U such that UA := U \{Z → A} implies Y → Z. Equivalently UA ² Y → Zi for all Zi ∈ Z. Since UA ( U and U was chosen minimal, we get UA ∩H ⇒ Y → Zi ∈ Π∗X for all Y → Zi ∈ H,Zi ∈ Z. Since UA ∩H ⊆ U ′ ∩H = S this gives us S ⇒ {Y → Zi ∈ H | Zi ∈ Z} ∈ Π∗X If Z → A ∈ H then Z → A ∈ U ∩H ⊆ S, and since ΠX contains the ID (3.2) it follows that S ⇒ Y → A ∈ Π∗X which contradict our assumption. For Z → A /∈ H the same follows with the ID (3.3). The size of the relative implication cover of EQX constructed is clearly polynomial in the size of Σ∗a. We note that using Theorem 3.36 to split up the problem of finding canonical covers into finding partial canonical covers for equivalence classes of FDs is helpful in two ways. First it allow us to represent the set of all canonical covers in an efficient manner. At the same time it simplifies the problem of finding implication covers by allowing for small relative implication covers. As example 3.7 demonstrates, it can happen that every implication cover of Σ∗a is exponential in the size of Σ∗a. 71 Example 3.7. Let X1 . . . X2n, Y1 . . . Yn, A be attributes and X = X1 . . . X2n, X i = X \Xi, Y = Y1 . . . Yn, Y i = Y \ Yi be attribute sets. Let further Σ =    X1 → Y1, X2 → Y1, X3 → Y2, X4 → Y2, . . . . . . X2n−1 → Yn, X2n → Yn, Y → A    and thus (note that X i ∪Xj = X for i 6= j) Σ∗a = Σ ∪    X1Y 1 → A, X2Y 1 → A, X3Y 2 → A, X4Y 2 → A, . . . . . . X2n−1Y n → A, X2nY n → A, X → A    Then every implication cover of Σ∗a contains (at least) the 2n atomic IDs {Z1Y 1 → A,Z2 → Y2, Z3 → Y3, . . . , Zn → Yn} ⇒ X → A where each Zi is replaced by X2i−1 or X2i. This is because the FDs in the LHSs do not imply any other FD in Σ∗a \ {X → A}. 3.2.4 The Algorithm We summarize the algorithm developed below. Algorithm “divide and resolve” INPUT: set of FDs Σ OUTPUT: set of all partial canonical covers CCX for every equivalence class EQX of Σ∗a compute Σ∗a partition Σ∗a into equivalence classes EQ for each EQX ∈ EQ do construct relative implication cover ΠX of EQX CCX := {minimal keys of EQX w.r.t. ΠX} end The sets CCX are the partial canonical covers of Σ on EQX by Lemma 3.47, and together they represent CC(Σ) as described in Theorem 3.36. Note that the partition of Σ∗a into autonomous sets EQ might not be minimal. We will see later in section 3.5 that deciding whether a set of FDs is autonomous for CC(Σ) is co- NP-complete, even when given Σ∗a. However, given all minimal autonomous sets, testing whether a set is autonomous can be done in polynomial time by Theorem 3.11. Thus, 72 unless P=NP, finding the minimal autonomous sets will not be possible in polynomial time. We contented ourselves with the non-minimal partition EQ since it was easy to identify and fast to compute. In section 3.4 we will discuss an efficient method for computing a finer partition into autonomous sets, although these sets may not be minimal either. If we want to find the minimal autonomous partition after the sets CCX have been computed, e.g., to store CC(Σ) more efficiently, we can partition the hypergraphs CCX further using the “Recursive Autonomous Partitioning” algorithm from section 3.1. The following example calculation illustrates the algorithm “divide and resolve”. Example 3.8. Our goal is to compute all canonical covers for the set of FDs Σ = {AB → C,AC → B,AD → C,AE → C,BE → A} We start by computing the atomic closure Σ∗a of Σ Σ∗a = Σ ∪ {AE → B,AD → B,BE → C} and partitioning Σ∗a into equivalence classes EQ = {EQAB, EQAD, EQAE} with EQAB = {AB → C,AC → B} EQAD = {AD → C,AD → B} EQAE = {AE → B,AE → C,BE → A,BE → C} We then construct the relative implication cover for each equivalence class: ΠAB = ∅ ΠAD = { {AD → B} ⇒ {AD → C}, {AD → C} ⇒ {AD → B} } ΠAE =    {AE → B} ⇒ {AE → C}, {AE → B,BE → C} ⇒ {AE → C}, {AE → C} ⇒ {AE → B}, {BE → A,AE → C} ⇒ {BE → C}, {BE → A} ⇒ {BE → C}    The partial canonical covers can now be computed as minimal keys w.r.t. the relative implication covers: CCAB = {{AB → C,AC → B}} CCAD = {{AD → B}, {AD → C}} CCAE = {{AE → B,BE → A}, {AE → C,BE → A}} Together this gives us all four canonical covers in CC(Σ) = CCAB ∨ CCAD ∨ CCAE =    {AB → C,AC → B,AD → B,AE → B,BE → A}, {AB → C,AC → B,AD → B,AE → C,BE → A}, {AB → C,AC → B,AD → C,AE → B,BE → A}, {AB → C,AC → B,AD → C,AE → C,BE → A}    73 3.2.5 Improvements and Complexity Analysis When using linear resolution to find CCX , the most time consuming step of eliminating extraneous attributes from the LHS of an ID is actually that of eliminating redundant FDs from a partial atomic cover on EQX . When minimizing the LHS of a FD, we remove one attribute A at a time. We then check whether A is still determined by the remaining attributes by computing their closure, using the FDs in Σ. The corresponding approach for an ID L ⇒ EQX would be to remove one FD Y → A from its LHS L, and compute the closure of the remaining FDs L \ {Y → A} w.r.t. the relative implication cover ΠX . However, since ΠX can be large compared to Σ, it is usually more efficient to check whether L \ {Y → A} is still a partial cover using Lemma 3.30. Since we know that L is a partial cover, we only need to check whether Y → A is implied by (L \ {Y → A}) ∪ (Σ \ EQX) This gives us the following simple procedure: proc minimize-partial-cover(L, EQX ,Σ) for all Y → A ∈ L do L′ := (L \ {Y → A}) ∪ (Σ \ EQX) Y ∗L′ := closure of Y w.r.t. L′ if A ∈ Y ∗L′ then L := L \ {Y → A} end To establish an upper bound for the complexity of the “divide and resolve” approach, we use the variables f = |Σ∗a|, n = |Σ|, k = |R| as defined earlier, as well as c = max {|CCX | | EQX ∈ EQ} i.e., c is the maximum number of partial canonical covers over all EQX . Computing Σ∗a can be done in O(f · k2n2). Partitioning Σ∗a can be done in O(f · kn) by computing the closure of each LHS. Constructing all ΠX takes O(f 2 · k2) using the closures computed before. Computing the sets CCX by linear resolution without the optimization described above would lead to a complexity of O(c · f 6), using that |ΠX | is bounded by f 2. Using the partial cover test instead leads to a worst-time complexity of O(c ·f 4 ·k), which can be argued as follows: The number of LHS minimizations is bounded by c ·∑ |ΠX | ≤ c · f 2, and each minimization requires at most f redundancy tests. Each redundancy test requires one closure computation relative to a subset of Σ∗a, which can be performed in O(f · k). The overall computation time is therefore bounded by the term O(c · f 4 · k + f · k2n2 + f 2 · k2) If we assume that n, k are bounded by f , which holds in all but some “lucky” cases (”lucky” because this means f is very small indeed), this can be simplified to O(c · f 4 · k) 74 3.2.6 LR-reduced Covers The general purpose of finding all canonical covers is to find covers which are best in some sense. This approach only works if we can be sure that the best cover (or at least one of the best if there are multiple optimal solutions) for a given problem is canonical. This is often the case, but not always. Another common type of cover are LR-reduced covers. Definition 3.50. A set Σ of FDs is called LR-reduced if no attribute can be removed from any FD in Σ while maintaining the property of being a cover. While we only computed the set CC(Σ) of all canonical covers, the set of all LR- reduced covers can be constructed from CC(Σ) easily. By “splitting” a FD X → Y into singular FDs, we mean to replace it by {X → A | A ∈ Y }. Lemma 3.51. [32] Splitting FDs into singular FDs turns an LR-reduced cover into a canonical cover. Combining FDs with identical LHSs turns a canonical cover into an LR-reduced cover. As an example for the usefulness of LR-reduced covers, one may consider the problem of finding a cover with minimal number of attributes, which is known to be NP-hard [32]. Definition 3.52. The area of a FD X → Y is |X| + |Y |, i.e., the number of attributes appearing in X → Y . The area of a set of FDs Σ is the sum of the areas of all FDs in Σ. Definition 3.53. A set Σ of FDs is called area optimal if there exists no cover G of Σ with smaller area. Area optimal covers can help in reducing the workload for checking whether depen- dencies hold on a relation, as well as speed up various algorithms [32]. Area optimal covers are rarely canonical, since combining FDs with equal LHS reduced the area. It is easy to see though that they are always LR-reduced. Lemma 3.54. [32] An area optimal set of FDs is LR-reduced. Thus, we may combine FDs with identical LHSs to obtain all LR-reduced covers in which no FDs have identical LHSs. Clearly those include all area optimal covers. We may determine all area optimal partial covers for each EQX separately to get a concise representation for all area optimal covers. In [4] Ausiello, D’Atri and Sacca introduce several similar minimality criteria for covers, and show that some of the corresponding decision problems are NP-hard. While these minimal covers are not always canonical or LR-reduced, it is easy to see that some of them always are. Thus, while our approach may not find all minimal covers (w.r.t. the minimality criteria in [4]), we can always find some. In particular, optimal covers are always minimal w.r.t. the criteria from [4]. 3.2.7 Related Work Maier already noted in [32] that there is a correspondence between the equivalence classes of non-redundant covers. Our work generalizes these results by investigating the projec- tions of (canonical) covers onto arbitrary autonomous sets, and placing them into a more general theoretic framework. 75 A unique representation for a set of unitary FDs is given in [26]. This representation is obtained by factoring the attribute determination graph via the equivalence relation on attributes induced by Σ. From this, partial canonical covers for the equivalence classes could be constructed as minimal strongly connected directed graphs. In [28] the authors describe a unique representation for a set of functional dependencies, which is independent of the choice of cover. Ausiello, D’Atri and Sacca introduce a graph representation to describe sets of FDs in [3], and use hypergraphs for describing them in [4]. However, their approach is completely different from ours, as their vertices are attributes or attribute sets, while each FD is modeled as a (hyper)edge. 3.3 Size and Number of Non-redundant Covers In this section we establish some results concerning how much the number of FDs in two non-redundant covers can differ, and how many non-redundant covers a set of FDs can have. Theorem 3.55. Let Σ, G be equivalent non-redundant sets of FDs over R. Then |G| ≤ |Σ| · |R|. Proof. Since Σ and G are equivalent, every FD X → A in Σ is implied by G. When computing the closure of X using G, we only use FDs which contribute at least one new attribute. Thus G includes a subset GX ⊆ G of cardinality at most |R \X| which implies X → A. Since their union ⋃ X→A∈F GX ⊆ G is already a cover of Σ and G is non-redundant, ⋃ X→A∈Σ GX = G. Clearly the cardinality of ⋃ X→A∈Σ GX is bounded by |Σ| · |R|. In [19] Gottlob shows the bound |G| ≤ |Σ| · (|R| − 1) for FDs with non-empty LHSs, and gives the following example to show that the bound is tight. Example 3.9. Let R = {A,B1, . . . , Bn} and Σ = {A → B1 . . . Bn}. Then G = {A → B1, . . . , A → Bn} is a non-redundant cover of Σ, and |G| = n = |Σ| · (|R| − 1). While the example above was based on splitting non-singular FDs into singular ones, the next example shows that the bound of Theorem 3.55 cannot be improved significantly even if we restrict ourselves to canonical covers. Note though that |Σ| = |G| if we restrict ourselves to non-redundant covers with FDs of the form X → X∗, since these are actually minimal covers [40]. Example 3.10. Consider the relation schema R = {A1, . . . , An, B1, . . . , Bn, C} with 2n+1 attributes. Associate with R the canonical set of FDs Σ =    A1 → C, . . . , An → C, C → B1, . . . , C → Bn, B1 . . . Bn → C    76 of size 2n+ 1. The set G =    A1 → B1, . . . , An → B1,... A1 → Bn, . . . , An → Bn, C → B1, . . . , C → Bn, B1 . . . Bn → C    is a canonical cover of Σ and contains n2 + n+ 1 FDs. Thus |G| > 14 · |Σ| · |R|. Using the bound established in Theorem 3.55, we will argue why it can make sense to try and compute all canonical covers. We first note that the number of arbitrary covers for a set Σ of FDs over a schema R can be (and usually is) hyper-exponential in the number of attributes. This is the case since the number of FDs in Σ∗ can be exponential in the number of attributes, and the number of covers can be exponential in the number of FDs in Σ∗. Even by restricting ourselves to the “most powerful” FDs of the form X → X∗ (with minimal LHS X) and requiring covers to be non-redundant (which together with the restriction on the form of FDs makes them minimal), we cannot avoid this. For brevity, we shall call non-redundant sets of LHS-reduced FDs of the form X → X∗ full. Example 3.11. Consider R = {A1, . . . , A2n} and let Σ consist of all FDs X → R,X ⊆ R with |X| = n. Clearly Σ is non-redundant and contains (2nn ) FDs. However, Σ has no full covers (other than itself). To create a large number of full covers, we add two extra attributes A and B to R, which gives us R′ = R ∪ {A,B}, and change Σ to Σ′ = {AX → R′ | X → R ∈ Σ} ∪ {A → B,B → A}. It is easy to check that Σ′ is full. However, each FD AX → R′ can be replaced by BX → R′, and these replacements can be done independently from one another. Thus Σ has at least 2(2nn ) full covers. We are thus trying to compute a potentially hyper-exponential number of covers, which at first glance seems rather infeasible even for small cases. However, when constructing the example above, we used a set of FDs Σ whose size is exponential in the number of attributes. This is rather unusual, and for small sets Σ we can establish better bounds for the number of non-redundant covers. Theorem 3.56. Let Σ be a set of FDs over R, with cardinalities |Σ| = n|, |R| = k. The number of non-redundant covers of Σ is at most (22knk ) < 22nk2. Proof. The number of FDs on R is 22k, and any non-redundant cover of Σ contains at most nk FDs by Theorem 3.55. While the bound given can be improved, the relevant fact is that the number of covers is “only” exponential in the size of the input, rather than hyper-exponential. When con- sidering arbitrary non-redundant covers, we can obtain different non-redundant covers by changing FDs slightly, e.g. by adding LHS attributes to the RHS. As such changes can be done independently from one another, the number of non-redundant covers is practi- cally always exponential. Many of those variations are avoided by restricting ourselves to canonical covers. Using partial covers to represent them efficiently, we can hope to reduce the size of our representation to a reasonably small number. Experimental results can be found in the appendix. 77 3.4 Partial Implication Cycles We wish to find a partition of Σ∗a into autonomous sets which is finer than EQ. This is motivated by the observation that schemas with multiple minimal keys can easily have a large number of canonical covers, which cannot be represented efficiently using the partition EQ. Autonomous sets of a finer partitions have a smaller number of partial covers on them. Example 3.12. Consider the set Σ = {A → BC1 . . . Cn, B → A} with the atomic closure Σ∗a = { A → B,A → C1, . . . , A → Cn, B → A,B → C1, . . . , B → Cn } . Each canonical cover of Σ contains A → B,B → A and for each i = 1 . . . n either A → Ci or B → Ci, for a total of 2n canonical covers. The LHSs A and B are equivalent, so the partition EQ = {Σ∗a} is trivial. However, CC(Σ) can be represented efficiently using a finer decomposition: CC(Σ) = {{A → B,B → A}}∨ {{A → C1}, {B → C1}}∨ . . . {{A → Cn}, {B → Cn}} The example above is not a rare case - one can expect to frequently find schemas with multiple minimal keys in practice. It is thus vital to find a good partition for cases similar to the last example. For this, we use the same idea as in section 2.2.3. We define adjacency of FDs in Σ∗a w.r.t. the hypergraph Tr(CC(Σ)). Definition 3.57. We call FDs adjacent iff they lie in a common minimal transversal. Note that by Lemma 3.31 adjacent FDs can be used to derive each other, i.e., each lies in a minimal subset of Σ∗a which implies the other FD. If we represent the relation between FDs “partially implies” by a directed graph, we find that adjacent FDs lie on a directed cycle. Definition 3.58. Let X → A, Y → B ∈ Σ∗a. We say that X → A partially implies Y → B, written X → A p⇒ Y → B if X → A lies in a minimal S ⊆ Σ∗a with S ² Y → B. We denote the relation “partially implies” by PIΣ ⊆ Σ∗a × Σ∗a. Lemma 3.59. Let X → A, Y → B ∈ Σ∗a be adjacent. Then X → A and Y → B partially imply each other. Proof. Follows from Lemma 3.31. Since relations can be regarded as directed graphs, we use terminology from graph theory. The last lemma then tells us that adjacent FDs lie on a directed cycle of PIΣ. This gives us the following. 78 Theorem 3.60. The maximal strongly connected components (MSCs) of PIΣ are au- tonomous sets of CC(Σ). Proof. Let S ⊆ Σ∗a be (the vertex set of) a MSC of PIΣ. By Lemma 3.59 every minimal transversal that intersects with S lies fully in S. Thus S is the union of maximal connected components of Tr(CC(Σ)), which are minimal autonomous sets of CC(Σ) by Theorem 3.18. This makes S autonomous by Theorem 3.11. Theorem 3.60 provides us with an alternative proof for Theorem 3.35: By Lemma 2.35 the partial implication graph PIΣ contains only arcs where the LHS of the target FD determines the LHS of the source FD. Thus there are no cycles between FDs in different equivalence classes, which means that every equivalence class is the union of MSCs of PIΣ. The partition of Σ∗a formed by the MSCs of PIΣ is at least as fine as the one by equivalence classes and often finer (example 3.12), but it need not be minimal. This is because PIΣ contains “extra” arcs between non-adjacent FDs, which can lead to larger MSCs. Extra arcs can make the partition found less fine, but it is still a partition into autonomous sets. Example 3.13. Consider the set of FDs Σ∗a = { A → B,C → D,BD → A,BD → C, E → A,E → B,E → C,E → D } The subgraph of PIΣ induced by the equivalence class EQE = {E → A,E → B,E → C,E → D} can be drawn as (E → A) ← (E → D) ↑↓ ↑↓ (E → B) → (E → C) so EQE is a MSC of PIΣ. However, the sets {E → A,E → B} and {E → C,E → D} are smaller autonomous sets: CC(Σ) = {{A → B,C → D,BD → A,BD → C}}∨ {{E → A}, {E → B}}∨ {{E → C}, {E → D}} We still have the problem of constructing PIΣ. We will solve it with the help of implication covers. Definition 3.61. Let Σ be a set of FDs. The atomic implication closure AIC(Σ) of Σ is the atomic closure of any implication cover of Σ: AIC(Σ) := { S ⇒ {Y → B} ∣∣∣∣ S ⊆ Σ∗a, Y → B ∈ Σ∗a \ S, S minimal with S ² Y → B } This allows us to describe partial implication w.r.t. AIC(Σ). 79 Definition 3.62. We say that X → A partially implies Y → B w.r.t. some set of implication dependencies Π iff Π contains an ID with X → A in its left-, and Y → B in its right hand side. With this definition we get: Proposition 3.63. Let Σ be a set of FDs, and X → A, Y → B ∈ Σ∗a. Then X → A partially implies Y → B iff X → A partially implies Y → B w.r.t. AIC(Σ). This means that partial implication is the same as partial determination, as defined in section 2.2.3, but for AIC(Σ). That is, we have PIΣ = RAIC(Σ) Since we are only interested in the MSCs of PIΣ, we may as well use its transitive closure PI+Σ instead, or any other relation with PI+Σ as its transitive closure. By Theorem 2.30 such a relation can be constructed using any atomic implication cover of Σ. We have seen earlier that implication covers of Σ can be large, even if Σ∗a is small. On the other hand, the size of relative implication covers of equivalence classes EQX of Σ∗a is at most quadratic in Σ∗a. We thus want to use relative implication covers instead, which we can construct easily using Theorem 3.49. Lemma 3.64. Let H ⊆ Σ∗a. Then the partial implication graph w.r.t. the relativation of the atomic implication closure AIC(Σ) ]H[ is the subgraph of PIΣ induced by H. Proof. PIΣ is the partial implication graph w.r.t. AIC(Σ) by Lemma 3.63. Relativating AIC(Σ) onto H removes the FDs not in H, and thus removes from PIΣ the arcs adjacent to FDs not in H. Clearly the result is the subgraph of PIΣ induced by H. Example 3.14. Consider the set of FDs Σ∗a = {A → B,A → C,B → C,C → B} The atomic implication closure of Σ is AIC(Σ) = { {A → B,B → C} ⇒ {A → C}, {A → C,C → B} ⇒ {A → B} } This gives us the partial implication graph PIΣ: (A → B) ← (C → B) ↑↓ (A → C) ← (B → C) Relativating AIC(Σ) onto the set H = {A → B,A → C} gives us AIC(Σ) ]H[ = { {A → B} ⇒ {A → C}, {A → C} ⇒ {A → B} } which induces the partial implication graph w.r.t. AIC(Σ) ]H[: (A → B) ↑↓ (A → C) which is also the subgraph of PIΣ induced by H. 80 Theorem 3.65. Let H ⊆ Σ∗a and ΠH an atomic relative implication cover of Σ∗a on H. Let further PIH be the partial implication graph of H w.r.t. ΠH . Then the MSCs of PIH are subsets of the MSCs of PIΣ. Proof. Let PIΣ[H] be the subgraph of PIΣ induced by H. The MSCs of PIΣ[H] are clearly subsets of the MSCs of PIΣ, and we will show that they include the MSCs of PIH . By Lemma 3.64 PIΣ[H] is the partial implication graph w.r.t. AIC(Σ) ]H[. Since AIC(Σ) is an implication cover of Σ∗a, AIC(Σ) ]H[ is a relative implication cover of Σ∗a on H by Corollary 3.45. Note that AIC(Σ) ]H[ need not be atomic, as the relativation of an atomic ID need not be atomic anymore (w.r.t. the relativated implication cover). It could however be transformed into an atomic relative implication cover Π′H by removing IDs with empty right hand side, and LHS-minimizing the remaining IDs. For the corresponding partial implication graph PI ′H w.r.t. Π′H this means that some arcs got removed, but no new arcs are added, so the MSCs of PI ′H are subsets of the MSCs of PIΣ[H]. PI ′H and PIH are both atomic relative (implication) covers on the same set H, so by Lemma 3.39 they are atomic covers of each other. Thus their MSCs are identical by Theorem 2.30. Theorem 3.65 allows us to use atomic relative implication covers to find a partition which is at least as fine and possibly finer than the partition into autonomous sets induced by PIΣ. We will show next that the sets in this finer partition are still autonomous, so that the use of relative implication covers not only makes the construction of an autonomous partition faster, but also leads to a better (i.e., finer) partition. Theorem 3.66. Let H ⊆ Σ∗a be an autonomous set, and ΠH a relative implication cover on H. Let further PIH be the partial implication graph w.r.t. ΠH . Then the MSCs of PIH are autonomous sets. Proof. We may assume ΠH to be atomic, as making it atomic does not add arcs to PIH and thus may only lead to a finer partition. By Theorem 2.30 we may then assume ΠH to be its atomic closure, since this does not affect the MSCs of PIH . We only show Lemma 3.59 for partial implication w.r.t. PIH , the argument then continues as in Theorem 3.60. So let X → A, Y → B ∈ H be adjacent. Then there exists some minimal transversal S with X → A, Y → B ∈ S, and by Lemma 3.31 we have (Σ∗a \ S) ∪ {X → A} ² Y → B Thus U ⇒ {Y → B} ∈ ΠH for some minimal U ⊆ (Σ∗a \ S) ∪ {X → A}, and since X → A ∈ U by Lemma 3.31, X → A partially determines Y → B w.r.t. ΠH . We summarize the steps needed to compute a partition of Σ∗a into autonomous sets below. Algorithm “partial implication partitioning” INPUT: set of FDs Σ OUTPUT: partition of Σ∗a into autonomous sets 81 compute Σ∗a and partition it into equivalence classes EQ for each EQX ∈ EQ do construct relative implication cover ΠX of EQX make ΠX atomic construct PIX from ΠX compute MSCs of PIX end Note again that the MSCs of a directed graph can be computed in linear time [41]. If PIX is not strongly connected, we can try to partition its MSCs msc1, . . . ,mscn again, using a relative implication cover on each msci, which by Theorem 3.42 can easily be obtained by relativating ΠX onto msci. This can be repeated until no finer partition is found. Instead of using partial implication cycles to find autonomous sets of CC(Σ), we can use partial determination cycles to find autonomous sets of the hypergraph of all minimal keys: H = {X ⊆ R | X is a minimal key of R} For this we start we the trivial autonomous set R, and use Σ instead of an implication cover. Saiedian and Spencer describe a similar approach in [39]. Using their algorithm for computing all partial canonical covers as the minimal keys w.r.t. ΠX is equivalent to determining smaller autonomous sets through partial implication cycles first, and then using some other algorithm (e.g. [31]) to find the minimal keys on them. In [20] Gottlob, Pichler and Wei also use partial determination cycles to identify autonomous sets of H. This simplifies the test whether an attribute A ∈ R is prime w.r.t Σ, since one only needs to check whether it is prime w.r.t. Σ ]S[, where S is the autonomous set containing A. 3.5 Essential FDs We are now interested in FDs that appear in some (or all) canonical covers. Definition 3.67. We call an atomic FD X → A ∈ Σ∗a    forced essential inessential    iff it appears in    every some no    canonical cover of Σ. Note that essential FDs correspond to prime attributes, and that a FD is essential iff it lies in some minimal transversal. It would be desirable if we could compute all essential FDs without computing inessen- tial ones as well. This would allow us to speed up computations of all or specific (e.g. uncritical) canonical covers if the number of inessential FDs is large, as e.g. in example 2.12, where we had Σ = {A1 → B1, . . . , An → Bn, B1 . . . Bn → C} 82 This leads to two different problems: “How can we decide whether a FD is essential?” and “How can we avoid computing inessential ones?”. In the following we will develop some answers to these questions. 3.5.1 Deriving essential FDs As with adjacency, we define the neighborhood of FDs in Σ∗a w.r.t. the hypergraph Tr(CC(Σ)). Definition 3.68. For G ⊆ Σ∗a we define the neighborhood N(G) of G as N(G) := {X → A ∈ Σ∗a | ∃Y → B ∈ G.X → A adjacent to Y → B} In section 3.2.1 we already established some results about the minimal transversals of CC(Σ). We will now show that they are cartesian products of LHSs and RHSs. In this, we use the following definition. Definition 3.69. We call a set Σ of FDs cartesian if X → A, Y → B ∈ Σ ⇒ X → B ∈ Σ In the following, transversal will always mean transversal of CC(Σ). Theorem 3.70. All minimal transversals are cartesian. Proof. Let X → A, Y → B be contained in a common minimal transversal S ⊂ Σ∗a. Since X ↔ Y under S = Σ∗a \ S by Lemma 3.33 we have X → B ∈ Σ∗. Furthermore X → B is non-trivial since S 2 Y → B, which means there exists UX ⊂ X with UX → B ∈ Σ∗a. Assume UX → B /∈ S. Then S ² Y → X,UX → B and therefore S ² Y → B, which contradicts that S is a minimal transversal. Thus UX → B ∈ S. Again by Lemma 3.33 we have UX ↔ X. Since X is a minimal LHS, this is only possible if UX = X. Definition 3.71. We denote the essential closure of Σ by Σ∗e := {X → A ∈ Σ∗a | X → A is essential} We call a set G ⊆ Σ∗a essentially cartesian (w.r.t. Σ) iff X → A, Y → B ∈ G,X → B ∈ Σ∗e ⇒ X → B ∈ G The essentially cartesian closure G∗ec of G ⊆ Σ∗a is the smallest essentially cartesian superset of G: G∗ec := G ∪ {X → B ∈ Σ∗e | ∃X → A, Y → B ∈ G} Definition 3.72. Let G,H ⊆ Σ∗a. We say that G dominates H if every minimal transver- sal which intersects with H intersects with G as well1. An atomic FD X → A dominates (is dominated) iff {X → A} dominates (is dominated). If two FDs dominate each other, we call them equally dominating. 1The property defined is more closely related to the vertex covering property than to the dominating set property. However, as the term ”covers” already has a distinct meaning for sets of FDs, we use ”dominates” to avoid confusion. 83 Lemma 3.73. Let G,H ⊆ Σ∗a. If G ² H then G ∩N(H) dominates H. Proof. As only FDs in N(H) can contribute to dominating H, it suffices to show that G dominates H. Assume the contrary, and let S be a minimal transversal containing a FD X → A ∈ H which does not intersect with G. As G ² X → A this contradicts Lemma 3.31. Lemma 3.74. Let X → A, Y → A ∈ Σ∗e be essential. Then X → A dominates Y → A iff Σ∗a \N(Y → A) ² Y → X. Proof. If X → A dominates Y → A then Σ∗a \ N(Y → A) ∪ {X → A} intersects with every minimal transversal and thus is a cover of Σ. Argue as in Lemma 3.33. On the other hand, let Σ∗a \N(Y → A) ² Y → X. Then Σ∗a \N(Y → A)∪{X → A} implies Y → A and X → A dominates Y → A by Lemma 3.73. When using linear resolution to compute the atomic closure, we would like to avoid intermediate inessential FDs, just as we avoided intermediate non-atomic ones by mini- mizing LHSs. This is possible, provided we can recognize inessential FDs. Theorem 3.75. Let Σ be essentially cartesian. Then every (essential) FD X → A ∈ Σ∗a can be obtained from Σ by linear resolution (with LHS-minimization of derived FDs) using only base FDs which dominate X → A. The choices for LHS-minimization2 are irrelevant. Proof. Trivial if X → A ∈ Σ or X → A inessential. Otherwise let M ⊆ Σ be of minimal cardinality with M ² X → A, and let X1 → A1, . . . , Xn → A ∈ M be the FDs in M in the order as they are used in computing the closure of X. Then there exists a linear resolution tree with LHS-minimized intermediate FDs - note that we make no assumption about which LHS-minimization is chosen - which uses only FDs in M \ {Xn → A} as substituting FDs in (1). Let Y → A ∈ Σ∗a be any (intermediate) base FD used in the resolution. We need to show that Y → A dominates X → A. We have T := {Y → A} ∪M \ {Xn → A} ² X → A and thus T ∩N(X → A) dominates X → A by Lemma 3.73. Assume X → A were adjacent to Xi → Ai ∈ M \ {Xn → A}. Then Xi → A ∈ Σ∗e by Theorem 3.70, and thus Xi → A ∈ Σ since Σ is essentially cartesian. Since X1 → A1, . . . , Xi−1 → Ai−1 imply X → Xi, the set M ′ := {X1 → A1, . . . , Xi−1 → Ai−1, Xi → A} ⊆ Σ implies X → A and is of smaller cardinality then M . Thus X → A is not adjacent to any FD in M \ {Xn → A}, which means T ∩N(X → A) = {Y → A}. Definition 3.76. We call an essential FD essentially derivable from Σ, iff it can be derived from Σ using linear resolution (with LHS-minimization) without intermediate inessential FDs. Corollary 3.77. Let Σ be essentially cartesian. Then all essential FDs in Σ∗e are essen- tially derivable from Σ. Proof. Every FD that dominates an essential FD is essential itself. 2E.g. it might be possible to LHS-minimize AB → C to A → C or B → C. 84 We may thus discard any inessential FD we obtain during the linear resolution process. The prerequisite that Σ be essentially cartesian can easily be met by computing the essential cartesian closure of a canonical cover of Σ (or, if testing essential is too hard, adding all atomic FDs X → B with X → A, Y → B ∈ Σ). Furthermore, Theorem 3.75 assures us that the base FDs used in the linear resolution process all dominate the final FD X → A. Thus they are all pairwise adjacent, and we only need to consider derivations in which base and derived FD are adjacent (after LHS-minimization). It is of interest to know whether the requirement in Theorem 3.75 that Σ be essentially cartesian is really necessary. The following example show that it is. Example 3.15. Consider the canonical cover Σ = {B → A,CD → B,DE → C,AE → C} It is easily checked that DE → A ∈ Σ∗a is essential: Σ′ = {B → A,CD → B,DE → A,AE → C} forms a canonical cover of Σ. The only linear derivation tree which derives DE → A from Σ is DE → C CD→B B→ACD→A DE → A The FDs CD → B,B → A are forced, as they are not implied by the remaining FDs in Σ∗a = { B → A,CD → B,DE → C,AE → C, CD → A,DE → B,BE → C,DE → A } This shows that CD → A is inessential, as it is implied by a set of forced FDs3. Thus we cannot derive DE → A without intermediate inessential FDs. In the example above, the essential FD DE → A was contained in the essentially cartesian closure of Σ. To show that this is not always the case, we could simply add the FDs F → E,E → F to Σ in example 3.15, and try to derive the essential FD DF → A. However, the essential FD DF → C could still be derived directly, without intermediate inessential FDs: F → E DE → C DF → C so that we still could get DF → A by computing the essentially cartesian closure of the set of all essentially derivable FDs. This is not by accident. Theorem 3.78. Let Σ be atomic and E be the set of all FDs essentially derivable from Σ. Then the essentially cartesian closure E∗ec of E includes Σ∗e, i.e. all essential FDs. Proof. Same as proof for Theorem 3.75, except that we change the RHS attribute of the final derived FD, not of the substituting FD. The last theorem provides an alternative for computing Σ∗e by computing the essen- tially cartesian closure for the final set of FDs rather than the initial one. While computing the essentially cartesian closure can be more time-consuming, due to the larger size of the final set, we start with a smaller set of base FDs which may speed up our linear resolution computation. 3Note that this is a sufficient but not necessary criteria. 85 3.5.2 Testing essentiality Theorems 3.75 and 3.78 show that all essential FDs can be derived while avoiding inessen- tial ones. However, to do this, we require a test to recognize inessential FDs. We will develop such a test next, and start by providing an efficient criteria for testing whether a FD is forced. Theorem 3.79. An atomic FD X → A ∈ Σ∗a is not forced iff there exists a B ∈ X with X∗ \ AB → A ∈ Σ∗. Proof. Let X → A be not forced, i.e., Σ∗a \ {X → A} ² X → A. By looking at the closure algorithm we see that there exists some Y → A ∈ Σ∗a \ {X → A} with Y ⊆ X∗. As Y is a minimal LHS for A it cannot include X (else it were equal to X), so there must be some B ∈ X \ Y . Thus Y ⊆ X∗ \ AB which shows X∗ \ AB → A ∈ Σ∗. Now let X∗ \ AB → A ∈ Σ∗ for some B ∈ X. For every Ai ∈ X∗ \ XA there exists a minimal set Ui ⊆ X with Ui → Ai ∈ Σ∗a \ {X → A}. As well, there exists a minimal U ⊆ X∗ \AB with U → A ∈ Σ∗a \ {X → A}. Together they imply X → A, thus X → A is not forced. As all forced FDs appear in every canonical cover of Σ, computing the set of all forced FDs in Σ∗a is easy. What we really need though is a criterion for testing whether a FD is essential. This, however, is an NP-complete problem, which we will show by reducing the following problem to it, which is known to be NP-complete [31]: Problem “prime attribute” Given a set Σ of FDs on schema R and an attribute A ∈ R, is A a prime attribute, i.e., does A lie in a minimal key of R? Theorem 3.80. Given a set Σ of FDs, the problem of deciding whether a FD is essential is NP-complete. Proof. To verify that a FD is essential, we only need to guess the canonical cover con- taining it. By Theorem 3.55 the size of any canonical cover is polynomial in Σ, so the problem lies in NP. We prove completeness by reducing the “prime attribute” problem to it. Let G be a set of FDs on R, and A /∈ R an additional attribute. We construct Σ as Σ := G ∪ {A → Ai | Ai ∈ R} We claim that Ai is a prime attribute w.r.t. G iff A → Ai is essential w.r.t. Σ. Since A does not appear in any FD in G we have Σ∗a = G∗a ∪ {A → Ai | Ai ∈ R} Now let Σ′ be any canonical cover of Σ, and let Σ′A := {A → Ai ∈ Σ′}. Clearly the FD A → R ∈ Σ∗ is implied by Σ′ iff A → K is implied by Σ′A for some minimal key K of R (w.r.t. G). This is the case iff Σ′A consists of exactly (since Σ′ is non-redundant) those FDs A → Ai for which Ai ∈ K. Thus A → Ai is essential iff Ai is prime. From this, we can deduce that identifying the (minimal) autonomous sets of CC(Σ) is difficult as well. 86 Theorem 3.81. Given a set Σ of FDs and an atomic FD X → A ∈ Σ∗a, the problem of deciding whether the set {X → A} is autonomous for CC(Σ) is co-NP-complete. Proof. {X → A} is autonomous iff X → A appears in all or no canonical covers of Σ. Thus, if {X → A} is not autonomous, we only need to guess one canonical cover which contains X → A, and one which does not. By Theorem 3.55 the size of these canonical covers is polynomial in Σ. This shows that the problem lies in co-NP. To show co-NP-hardness, we use that it is NP-hard to decide whether a FD X → A is essential. Let Σ′ be any canonical cover of Σ. Such a canonical cover Σ′ can be computed in polynomial time. By definition, X → A is essential iff it appears in some, but not necessarily all canonical covers of Σ. We distinguish two cases. (1) If X → A ∈ Σ′ then X → A is essential. (2) If X → A /∈ Σ′ then X → A is essential iff it appears in some but not all canonical covers of Σ, i.e., iff {X → A} is not autonomous. We thus reduced the NP-hard problem of deciding whether X → A is essential to the problem of deciding whether {X → A} is not autonomous. The last theorem shows that the autonomous set problem is hard when given Σ. However, when we try to find autonomous sets of CC(Σ), we first compute Σ∗a, which can be exponential in the size of Σ. Thus it might be possible to decide whether a given set is autonomous in time polynomial in the size of Σ∗a. We show next that this is not the case. Theorem 3.82. Given a set Σ of FDs, its atomic closure Σ∗a and an atomic FD X → A ∈ Σ∗a, the problem of deciding whether the set {X → A} is autonomous for CC(Σ) is co-NP-complete. Proof (Sketch). In [31] the NP-hardness of the “prime attribute” problem is shown by first reducing the “vertex cover” problem to the “key of cardinality m” problem, and then in turn to “prime attribute”. We will describe a slightly modified version of this reduction. Let G be the graph for the vertex cover, Σcard the set of FDs for the “key of cardinality m” problem (denoted D[0]′ in [31]), and Σprime the set of FDs for the “prime attribute” problem (denoted D[0] in [31]). For the first reduction, Σcard is constructed as follows: Σcard := {N(v) → v | v is vertex in G} where N(v) is the neighborhood of v in G. The second reduction is more complicated. We give a modified version next (the modification occurs in condition (ii) below), which can be shown to be correct as in [31]. Let A′ be the set of attributes occurring in Σcard, A′′ of cardinality m < |A′| and b a new attribute. The attribute set for the “prime attribute” problem is then A = A′∪ [A′′×A′], and Σprime is constructed as follows: (i) for E → F ∈ Σcard add {b} ∪ E → F to Σprime (ii) for e ∈ A′ add {e} → A′′ × {e} to Σprime (iii) for i ∈ A′′ and e ∈ A′ add {b, (i, e)} → {e} to Σprime 87 (iv) for i ∈ A′′ and e, f ∈ A′ distinct add {(i, e), (i, f)} → {b} to Σprime (v) for e ∈ A′ add {e} → {b} to Σprime Using these constructions, we want to establish some bounds on the size of the LHSs of FDs. This can be done by verifying the following statements. • The size of the LHS of a FD in Σcard is the degree of the corresponding vertex in G • Σ∗acard = Σcard • The LHS of a FD in Σ∗aprime is at most one larger that the LHS of a FD in Σ∗acard The reductions from the “prime attribute” problem to the problems “essential FD” and then “autonomous set” do not increase the size of the LHSs of atomic FDs. Thus the maximal size of the LHSs of FDs in Σ∗a is at most one larger that the maximal degree of vertices in G. It has been shown in [17] that the vertex cover problem is still NP-hard for graphs of maximal degree 3. These cases reduce to instances of the “autonomous set” problem for which all FDs in Σ∗a contain at most 4 attributes in their LHS. But there exist less than k5 such FDs, where k is the number of different attributes occuring in Σ, so the size of Σ∗a is polynomial in the size of Σ, and can therefore be computed in polynomial time by Lemma 2.13. This shows the theorem. In the following we will construct some sufficient (but not necessary) criteria that a FD is inessential. Recall that N(X → A) denotes the neighborhood of X → A in Tr(CC(Σ)). Lemma 3.83. X → A ∈ Σ∗a is essential iff Σ∗a \N(X → A) 2 X → A. Proof. If X → A is essential then Σ∗a \ N(X → A) ∪ {X → A} intersects with every minimal transversal and thus is a cover of Σ. As Σ∗a \N(X → A) is not a cover of Σ by Lemma 3.31, Σ∗a \N(X → A) cannot imply X → A. If X → A is inessential then N(X → A) = {X → A}, and Σ∗a \ {X → A} implies X → A. Thus, if we can find a set S ⊆ Σ∗a \N(X → A) with S ² X → A then X → A must be inessential. Instead of Σ∗a we use only Σ (making Σ canonical if needed). Determining which FDs in Σ are adjacent to X → A is no easier than testing essentiality (since every essential FD has an adjacent FD in Σ). However, we do have some criteria that assure that a FD Y → B is not adjacent to X → A: Clearly all forced FDs are adjacent only to themselves, and adjacent FDs have equivalent LHSs by Lemma 3.33. Thus we construct S from all forced FDs and all FDs in Σ with LHS smaller4 than X, as FDs with LHS not implied by X cannot be used in deriving X → A. Note that we do not lose anything by using Σ instead of Σ∗a: The FDs in Σ with LHS smaller than X form a cover for the set of all FDs in Σ∗a with LHS smaller than X by Lemma 2.35. We can adapt the basic “linear resolution” algorithm to compute all essential FDs (and possibly more due to the inaccuracy of our essentiality test) as follows: 4w.r.t. the determination pre-order induced by Σ 88 Algorithm “essential linear resolution” INPUT: set of FDs Σ OUTPUT: superset of Σ∗e compute a canonical cover Σ′ of Σ FΣ := ∅ for all X → A ∈ Σ′ do if is forced(X → A) then add X → A to FΣ end Σ∗e := Σ′ for all Y → B ∈ Σ∗e do for all X → A ∈ Σ′ with A ∈ Y,B /∈ X do // derive (XY \ A) → B by rule (2.1) find U ⊆ (XY \ A) with U → B atomic if U → B /∈ Σ∗e and is essential(U → B) then add U → B to Σ∗e end end // compute essentially cartesian closure for all X ∈ LHS(Σ∗e) do RHSX := X∗ \X for all A ∈ X do remove (X \ A)∗ from RHSX end for all B ∈ RHSX do if X → B /∈ Σ∗e and is essential(X → B) then add X → B to Σ∗e end end function is essential(X → A) smaller(X → A) := {Y → B ∈ Σ | Y ⊆ X∗ ∧X * Y ∗} if FΣ ∪ smaller(X → A) implies X → A then return false else return true function is forced(X → A) for all B ∈ X do if Σ ² X∗ \ AB → A then return false end return true 89 Other optimizations such as those used in the revised linear resolution algorithm can be used in computing the essential closure (or a superset thereof) as well. Also, the test is essential(X → B) for computing the essentially cartesian closure can be performed for all values B at once, by computing the closure of X under FΣ ∪ smaller(X → B). 90 Chapter 4 Domination Normal Form A common approach in designing relational databases is to start with a relation schema, which is then decomposed into multiple subschemas [10, 34, 38, 42]. A good choice of subschemas can be determined using integrity constraints defined on the schema, such as functional, multivalued or join dependencies. Many normal forms proposed so far, such as BCNF, 4NF or KCNF, characterize the absence of redundancy [1, 44]. This is desirable for several reasons, foremost the avoidance of update anomalies [33] and minimization of storage space [9]. However, these normal forms have significant drawbacks. First, they only consider a single relation schema, in- stead of considering all schemas in a decomposition together. While this is not strictly true for the normal form proposed by Topor and Wang in [45], their approach only consid- ers pairs of schemas at a time, and thus cannot capture redundancy across larger schema sets. The common generalization to multiple schemas is that the whole schema collection is in that normal form, if every schema taken individually is. But this means that those normal forms cannot capture redundancy which exists across multiple relations. As a trivial example, we can duplicate a schema. Clearly the extra schema is then superfluous in the whole schema collection, but each schema taken individually may still be redundancy free. But even if no schemas or attributes in a schema collection are superfluous, the design may not be desirable. Example 4.1. Let R = ABCD and Σ = {AB → CD,CD → B}. Then R is not in BCNF, but has a dependency preserving BCNF decomposition into the subschemas ABC,ABD,BCD. For any instance r of R, the projections of r onto the schemas ABC and ABD together already take up more space than the original relation r: no tuples are lost in the projection since AB is a key, and the attributes A and B are stored twice. While we have not defined what redundancy means for multiple schemas, it seems intuitively clear that this decomposition should not be called “redundancy free”. From a storage space point-of- view, it is clearly less desirable than the original schema R. The second big problem is that dependency preserving decompositions into these nor- mal forms do not always exist [5]. Thus, when faced with such a case, a designer must either accept the loss of some dependencies, or cannot achieve the normal form in question. We believe that what a normal form should do, is the following: Characterize “good” representations (i.e., decompositions) of a schema, in such a way that a “good” representation does always exist. 91 In the following we will propose a normal form which meets this criterion. 4.1 Minimization as Normal Form The approach we suggest is the following: among a set of suitable decompositions (e.g. the set of all lossless, or lossless and dependency preserving decompositions), characterize the “best” ones. We do so by defining an order on the decompositions, such that the “best” decompositions are the minimal ones with respect to that order. This leaves the question of when to call one decomposition better than another one. The motivation for many normal forms proposed so far has been the elimination of re- dundancy (and with it, the absence of update-anomalies). This may suggest to define a quantitative measure of redundancy over multiple schemas, as has been done by Arenas and Libkin in [1]. We take a different approach here: instead of trying to minimize redundancy, we try to minimize the size of instances. Intuitively this should lead to similar results, an assumption which is supported by the findings of Biskup in [9], but measures for size appear easier to construct than measures for redundancy (cf. [1]). In the following we will define and motivate three different orders on decompositions, all of which intuitively compare decompositions by the size of instances for them. By proving them to be equivalent, we will establish a syntactic characterization for a semantically motivated definition. 4.1.1 Ordering by Size of Instances Our first approach measures the space required to store an instance. For that we need to know for each element of a domain how much storage space it requires. We represent this knowledge by associating with each domain Dom a size function size : Dom → N Consider e.g. the following domains: • STRING containing strings of arbitrary length • STRING[40] containing strings of length up to 40 • INT containing arbitrarily large integers • INT (64) containing all 64-bit integers • BOOLEAN containing the values TRUE and FALSE A realistic measure for the size of a string might be its length, the size of an integer i might be defined as log(i), and the size of TRUE and FALSE might be one. In this work we shall assume that all domains are infinite, and that the size functions on them are positive and unbounded, i.e., can grow arbitrarily large. This can be justified as follows: For any infinite domain a bounded size function is not realistic, since we can store only a finite number of elements when restricted to a fixed amount of space. While not all domains are truly infinite, they often contain far more elements than the number of subschemas in a typical decomposition (e.g. 25640 for STRING[40] or 264 for INT (64)). 92 Treating these domains as infinite will allow us to draw a sharp boundary between small (bounded) increases in size from duplicated attributes on one hand, and potentially large (unbounded) increases in size from instances with large numbers of tuples on the other. We note that this argument fails for domains such as BOOLEAN , and that a con- stant size function may be more realistic for domains such as STRING[40] or INT (64). However, in this work we will not concern ourselves with such considerations. As it will turn out, the assumptions about infinite domains and unbounded size func- tions, which are common assumptions in database theory [27], are all we need to charac- terize our new normal form, i.e., we do not require detailed knowledge about the actual size functions. Definition 4.1. For a relation r over R and a decomposition D = {R1, . . . , Rn} of R =⋃Rj we denote the decomposition of r by D as r[D] := {r[R1], . . . , r[Rn]} where r[Rj] is the projection of r onto the attributes in Rj. When talking about the tuples in r[D] containing an attribute A, we will mean the tuples from relations Rj ∈ D with A ∈ Rj. Definition 4.2. Let R = {A1, . . . , Ak} be a schema. For a finite relation r over R we define the size of r as size(r) := ∑ t∈r k∑ i=1 size(piAi(t)) where piAi(t) denotes the projection of tuple t onto the attribute Ai. We then define the size of the decomposition of r by D = {R1, . . . , Rn} of R as size(r[D]) := n∑ j=1 size(r[Rj]) While this gives us a suitable definition of size for any instance, we wish to compare decompositions w.r.t. the size of all valid instances. If for every valid instance r on R a decomposition D1 requires no more storage space than a decomposition D2, then D1 ≤ D2 should certainly hold, indicating that D1 is “at most as big” and thus “at least as good” as D2. Recall that we only consider suitable decompositions, in particular lossless or lossless and dependency preserving ones. This alone, however, is not sufficient to characterize good decompositions: for an instance r containing only a single element, the trivial decomposition {R} requires less storage space than any other lossless decomposition, as those typically need to duplicate some attributes. It would be hard to argue though that decomposition is never necessary. So how can we distinguish decompositions finer, based on the size of instances? Example 4.2. Let R = ABC and Σ = {B → C}. R can be faithfully decomposed into D = {AB,BC}. Clearly every relation r decomposed by D (which is a set of relations) is at most twice as large as r. On the other hand, for every natural number k we can construct a relation r = A B C 1 1 ”a very long string” 2 1 ”a very long string” ... ... ... k + 1 1 ”a very long string” 93 which is more than k times larger than in decomposed form: r[AB] = A B 1 1 2 1 ... ... k + 1 1 r[BC] = B C1 ”a very long string” This observation motivates the following definitions, which compare decompositions similarly to the “big-O” comparison (e.g. 3x2 + x ∈ O(x2)) from complexity theory. Definition 4.3. Let R be a schema with constraints Σ and D1,D2 be decompositions of R. We say that D1 c-dominates D2 (where ”c” stands for complexity) if there exists a constant k such that for all finite relations r over R that satisfy Σ we have size(r[D1]) ≤ k · size(r[D2]) We further say that D1 dominates1D2 if the above relationship holds for k = 1. We abbreviate c-domination and domination as D1 ≤c D2 and D1 ≤ D2, respectively. We say that D1 strictly (c-)dominates D2, written D1 <(c) D2, if D1 (c-)dominates D2 but not vice-versa. It is easy to see that both domination and c-domination are reflexive and transitive, and thus are pre-orders. Clearly domination implies c-domination. Proposition 4.4. Let D1,D2 be decompositions of R. If D1 dominates D2 then D1 c- dominates D2. Note however that strict domination does not imply strict c-domination. In example 4.1 the original schemaABCD strictly dominates the decomposition {ABC,ABD,BCD}, but both decompositions are equivalent w.r.t. c-domination. Sometimes both crite- ria, domination and c-domination, are used to characterize the best decomposition for a schema, as the following example shows. Example 4.3. Let R = ABCDE and Σ = {AB → CD,B → E}. Then the decomposition D1 = {ABC,ABD,BE} is minimal w.r.t. c-domination but strictly dominated by D2 = {ABCD,BE}. The trivial decomposition {R} is minimal w.r.t. domination but strictly c-dominated by both D1 and D2. We are now ready to define our normal form based on the idea of minimizing storage space. Definition 4.5. Let R be a schema with constraints Σ and D be a decomposition of R. We say that D is in domination normal form (DNF) if (i) D is minimal w.r.t. domination, and (ii) D is minimal w.r.t. c-domination 1This is not related to domination as defined in section 3.5 94 with minimal meaning that no strictly smaller decomposition exists among a given set of ’suitable’ decompositions. Note that this definition depends on the choice which decompositions we consider ’suitable’. We will investigate two different cases (though other choices might be of interest as well): the set of all lossless, and the set of all lossless and dependency preserving decompositions. In each case, we effectively obtain a different DNF. As the number of suitable decompositions of a given schema is finite, there must exist a decomposition among them which is minimal w.r.t. domination, as well as a (possibly different) schema which is minimal w.r.t. c-domination. It is however not clear yet wether a decomposition into DNF always exists, i.e., one which is minimal w.r.t. both criteria at once. We will show this next. Theorem 4.6. Every schema has a decomposition into DNF. Proof. We use the fact that domination implies c-domination. The c-domination pre-order induces a partition of all the (suitable) decompositions of R into equivalence classes and defines a partial order on these equivalence classes. Let EQ be a minimal equivalence class w.r.t. that order. Choose D to be minimal w.r.t. domination among the decompositions in EQ. We claim that D is minimal w.r.t. domination among all decompositions of R, and thus in DNF. Let D′ be any decomposition with D′ ≤ D. Then D′ ≤c D, and since D is minimal w.r.t. c-domination, D′ ∈ EQ. But D is also minimal w.r.t. domination in EQ, and thus D ≤ D′. Thus no decomposition D′ strictly dominates D. 4.1.2 Ordering by Attribute Count of Instances We now introduce an order which does not rely on a size function, but is otherwise similar to the order we used in the last section. Instead of measuring the total size of instances, we count the number of tuples an attribute appears in. This gives us domination and c-domination pre-orders for each attribute, and we can then combine these to get another pair of orderings for decompositions. Definition 4.7. Let R = {A1, . . . , Ak} be a schema. For a finite relation r over R and an attribute A we define the count of A on r as countA(r) := { |r| if A ∈ R 0 if A /∈ R where |r| denotes the number of tuples in r. We then define the count of A on r decom- posed by a decomposition D = {R1, . . . , Rn} of R as countA(r[D]) := n∑ j=1 countA(r[Rj]) Definition 4.8. LetR be a schema with FDs Σ, A an attribute andD1,D2 decompositions of R. We say that D1 c-dominates D2 w.r.t. A if there exists a constant k such that for all finite relations r over R that satisfy Σ we have countA(r[D1]) ≤ k · countA(r[D2]) We further say that D1 dominates D2 w.r.t. A if the above relation holds for k = 1. 95 Thus, for each attribute A, we get a c-domination and domination pre-orders. We combine those pre-orders by intersection. Definition 4.9. Let R be a schema and D1,D2 decomposition of R. We say that D1 { dominates c-dominates } D2 w.r.t. attribute count if for every attribute A ∈ R we have D1 { dominates c-dominates } D2 w.r.t. A. We will prove later on that domination and c-domination w.r.t. attribute count and w.r.t. size are exactly the same orders. 4.1.3 Ordering by Containing Schema Closures The previous two order pairs introduced are defined by considering all valid instances. This is not very practical if we wish to actually decide for two given decompositions whether one (c-) dominates the other. We therefore present a third pair of orders, which is defined by considering only the decompositions, rather than instances on them. The approach we use is similar to attribute counting. The count of an attribute depends on the set of schemas it lies in. While it also depends on the instance r, it is easy to show that the number of tuples in r[Rj] is determined by the closure R∗j of Rj (and by r). Recall that the closure R∗j of Rj under Σ is R∗j := {A ∈ R | Σ ² Rj → A} where Σ is a given set of constraints on R. In this work, we shall mainly be interested in functional, multi-valued and join dependencies. Lemma 4.10. Let R be a schema with constraints Σ, and X ⊆ R. Then for all relations r on R we have |r[X]| = |r[X∗]|. Proof. We can obtain r[X] by projecting from r[X∗]. The only way for the number of tuples to decrease, is for r[X∗] to contain different tuples which are identical on X. But this cannot happen since X functionally determines X∗. This motivates the following definitions. Definition 4.11. Let R be a schema with constraints Σ, and D a decomposition of R. Then for any attribute A ∈ R we define the containing schema closures (CSC) of A in D as the multiset CSCA(D) = {R∗j | A ∈ Rj ∈ D} The idea behind this definition is that the containing schema closure of an attribute A represents the attribute count for A in a way which does not rely on particular instances, but still allows comparison of different decompositions. It is necessary to use multisets rather than sets to represent the correct attribute count. 96 Example 4.4. Consider again the schema R = ABCDE with constraints Σ = {AB → CD,B → E} from example 4.3, and the decompositions D1 = {ABC,ABD,BE} D2 = {ABCD,BE} They produce the multisets CSCA(D1) = {ABCDE,ABCDE} CSCA(D2) = {ABCDE} which indicate that, for any relation r on R, the attribute A appears in twice as many tuples in r[D1] as in r[D2]. Using sets would hide this difference. We will now compare decompositions using the respective CSCs of all attributes. For that we need mappings between multi-sets. We allow different instances of the same value in the source domain to map to different values, and call a mapping injective if a value in the target domain is mapped to at most as often as it occurs in the target domain. Definition 4.12. Let M1,M2 be two multisets2 of (attribute) sets. We say that (i) M1 weakly inclusion-dominates M2 if there exists a mapping f : M1 → M2 with e ⊆ f(e) for all e ∈ M1. (ii) M1 strongly inclusion-dominates M2 if there exists an injective mapping f : M1 → M2 with e ⊆ f(e) for all e ∈ M1. Definition 4.13. Let D1,D2 be two decompositions of R. We say that D1 weakly/strongly csc-dominates D2 if for all attributes A ∈ R we have that CSCA(D1) weakly/strongly inclusion-dominates CSCA(D2). We will prove in the next section that weak csc-domination implies c-domination, and that strong csc-domination implies domination. While the opposite does not hold for arbitrary types of constraints, we will be able to show that it holds for sets of functional dependencies, and in the case of weak csc-domination/c-domination also for multi-valued and join dependencies. Thus we obtain a syntactic characterization for c-domination and, at least in the case of functional dependencies, for domination. For multi-valued dependencies, domination does not imply strong csc-domination, as will become evident in example 4.10. Finding a syntactic characterization for domination in this case is an open problem. 4.2 Equivalence of Orderings We will show that, if the only integrity constraints on R are functional dependencies, then all three order pairs defined in section 4.1 are identical. If multi-valued and join dependencies are also allowed, then domination w.r.t. size or attribute count need not imply weak csc-domination, but we will show that the other equivalences between the orders in question still hold. As multi-valued dependencies are just a special case of join dependencies, it suffices to consider only functional and join dependencies. We will assume throughout this section that functional and join dependencies are the only types of integrity constraints occurring. 2Note that for definition (i) we do not really need multi-sets. 97 4.2.1 Size vs. Attribute Count We start by showing that the orders defined by size and attribute count are identical. Recall that we assume that all domains are infinite, and that the size functions associated with them are positive and unbounded. Lemma 4.14. Let R be a schema with functional and join dependencies Σ, and D1,D2 be decompositions of R. If D1 does not (c-)dominate D2 w.r.t. attribute count, then D1 does not (c-)dominate D2 w.r.t. size. Proof. If D1 does not c-dominate D2 w.r.t. attribute count, then for every integer k there exists a relation r and an attribute A with countA(r[D1]) > k · countA(r[D2]) For each k and associated r and A, we will construct a relation r′ for which size(r′[D1]) > k · size(r′[D2]) holds. This shows that D1 does not c-dominate D2 w.r.t. size. For domination we only need to consider the case k = 1. The construction works as follows. Since the relations in r[D1] with attribute A contain more than k times as many tuples as those in r[D2], there must be an attribute value vA for A which appears more than k times as often in r[D1] as in r[D2]. We construct r′ from r by substituting every occurrence of vA by a new value v′A which does not appear in r. As Σ contains only functional and join dependencies, these constraints still hold for r′. Let o1, o2 be the number of occurrences of vA in r[D1], r[D2]. We choose v′A sufficiently large, i.e., such that size(v′A) > k · size(r[D2])− size(r[D1]) o1 − k · o2 + size(vA) This gives us (note that o1 − k · o2 > 0): (o1 − k · o2) · (size(v′A)− size(vA)) > k · size(r[D2])− size(r[D1]) size(r[D1]) + o1 · (size(v′A)− size(vA)) > k · size(r[D2]) + k · o2 · (size(v′A)− size(vA)) size(r′[D1]) > k · size(r′[D2]) This concludes the proof. Lemma 4.15. Let R be a schema with functional and join dependencies Σ, and D1,D2 be decompositions of R. If D1 does not (c-)dominate D2 w.r.t. size, then D1 does not (c-)dominate D2 w.r.t. attribute count. Proof. The proof is analogous to the last one. For every k, r (k = 1 for domination) with size(r[D1]) > k · size(r[D2]) we need to construct a relation r′ such that for some attribute A we get countA(r′[D1]) > k · countA(r′[D2]). 98 Again we use that the attribute values occurring in r[D1] and r[D2] are the same. Since size(r[D1]) > k · size(r[D2]), there must exist some attribute value vA of an attribute A which occurs more than k times as often in r[D1] than in r[D2]. We construct r′ from r by selecting exactly those tuples which have the value vA on attribute A. As Σ contains only functional and join dependencies, these constraints still hold for r′. And clearly we now have countA(r′[D1]) > k · countA(r′[D2]). We can combine the last two lemmas. Theorem 4.16. Let R be a schema with functional and join dependencies Σ, and D1,D2 be decompositions of R. Then D1 (c-)dominates D2 w.r.t. size iff D1 (c-)dominates D2 w.r.t. attribute count. Proof. Follows immediately from the lemmas shown. 4.2.2 Attribute Count vs. Containing Schema Closures - Part I We will now show that the orders defined by attribute count and containing schema closures are actually the same. One direction of implication is easy to show. Theorem 4.17. Let R be a schema with constraints3 Σ, and D1,D2 be decompositions of R. If D1 { weakly strongly } csc-dominates D2 then D1 { c-dominates dominates } D2. Proof. Let r be any relation on R and A some attribute in R. If D1 weakly csc-dominates D2, then for every schema R1 ∈ D1 with A ∈ R1 there exists a schema R2 ∈ D2 with A ∈ R2 and R∗1 ⊆ R∗2. By Lemma 4.10 we have |R1| ≤ |R2|, and each such schema R2 is mapped to at most |D1| times. Therefore the number of tuples containing attribute A in r[D1] is at most |D1| times larger than the number of tuples with A in r[D2]. Thus D1 c-dominates D2 with k = |D1|. If D1 strongly csc-dominates D2, then by Lemma 4.10 and due to the injectivity of the mapping f in Definition 4.12, the number of tuples with attribute A in r[D1] is no larger than the number of those in r[D2]. Thus D1 dominates D2. To show implication in the other direction, we will assume that D1 does not weakly or strongly csc-dominate D2, and construct example relations which show that D1 does not c-dominate or dominate D2. These constructions will require some work, and we devote the next subsection to them. 4.2.3 Subset Construction Our goal is to construct relations over a schema D with functional and join dependencies Σ, for which the number of tuples in their projection onto non-key subschemas varies by an arbitrarily large factor. Definition 4.18. Let R be a schema with constraints Σ and D = {R1, . . . , Rn} a decom- position of R. We say that a non-empty relation r over R is k-reducing w.r.t. D for an integer k if 3Theorem 4.17 holds for arbitrary constraints, not just functional and join dependencies. 99 (i) all constraints in Σ hold on r, and (ii) for every subschema Rj ∈ D which is not a key of R, the projection of r onto Rj contains at most 1k times as many tuples as r. The following example illustrates the basic idea for constructing k-reducing relations. Example 4.5. Consider the schema R = ABCD with constraint set Σ = {AB → C,C → AB,A → D,B → D} and the (faithful and lossless) decomposition D = {ABC,AD,BD}. We can construct a 2-reducing instance of R w.r.t. D as follows: r = A B C D 1 1 (1, 1) 1 1 2 (1, 2) 1 2 1 (2, 1) 1 2 2 (2, 2) 1 The FDs in Σ clearly hold, and the projections of r onto AD and BD contain only 2 tuples, compared to 4 tuples in r. We constructed the example relation above by creating two variables vA, vB with do- main {1, 2}, and for each value pair (vA = a, vB = b) creating a tuple in r, making the value of A dependent on a, the value of B dependent on b, the value of C dependent on both and the value of D dependent on neither. We formalize this idea as follows. Definition 4.19. Let Ω be a finite set and R be a set of attributes, where each Ai ∈ R has a subset Si of Ω associated with it. For a positive integer k we define the k-mappings of Ω as the total functions from Ω into {1, . . . , k}. For each k-mapping f we define its lifting onto R as the tuple tf on R, in which the attribute Ai has as value the partial function f |Si : Ω → {1, . . . , k} := {x → y ∈ f | x ∈ Si} We get the k-lifting of R by taking all k-mappings of Ω and lifting them all onto R. Note that the k-lifting is a set of tuples on R, and thus a relation on R. While the attribute values constructed by lifting k-mappings are partial functions rather than elements of the attribute’s domain, it is easy to see that one could always substitute those values with values from the proper domains (recall that we assumed do- mains to be infinite) to get an isomorphic relation on R. Note that while this substitution may affect the size of relations, it has no affect on attribute count. As we are only try- ing to relate the containing schema closure measures to attribute count, rather than size measures directly, we will not worry about size or domains any further. When giving examples, we will use attributes “A,B, . . .” rather than “A1, A2, . . .”, and denote their associated subsets by “SA, SB, . . .” instead of ”S1, S2, . . .”. Example 4.6. The given definitions can be related to example 4.5 as follows. We have Ω = {vA, vB}, SA = {vA}, SB = {vB}, SC = {vA, vB}, SD = ∅ 100 Writing the total function {vA 7→ a, vB 7→ b} short as (a, b), we obtain the set of all 2-mappings as {1, 2}Ω =    (1, 1), (1, 2), (2, 1), (2, 2)    . These k-mapping get lifted onto R as follows, writing the partial function {vA 7→ a} as (a,−): A B C D (1, 1) y (1,−) (−, 1) (1, 1) (−,−) (1, 2) y (1,−) (−, 2) (1, 2) (−,−) (2, 1) y (2,−) (−, 1) (2, 1) (−,−) (2, 2) y (2,−) (−, 2) (2, 2) (−,−) Ignoring blanks and substituting (−,−) by 1, we obtain relation r from example 4.5. The lifting of a k-mapping depends on the sets Si associated with the attributes Ai, and these sets Si are the only free choices we have in our construction. Note that elements of Ω which do not appear in any Si do not affect the construction, thus we may as well assume that Ω = ⋃Si. Given a schema R with constraints Σ and a decomposition D of R, we want to choose the sets Si in such a way that the constructed relation is k-reducing. We will first consider the case where Σ contains only FDs. Definition 4.20. For a subschema X ⊆ R we call the set SX := ⋃ Ai∈X Si the subset associated with X. For v ∈ Ω we call Rv := {Ai ∈ R | v ∈ Si} the set of attributes depending on v. Note that {Si | Ai ∈ R} and {Rv | v ∈ Ω} are dual hypergraphs. Lemma 4.21. Let Ω and R be as in Definition 4.19, X a subschema of R and SX its associated subset of cardinality s. Let rk be the k-lifting of R and xk the k-lifting of X. Then xk = rk[X] and xk contains exactly ks tuples. Proof. Let f be a k-mapping of Ω, and tf,R and tf,X its liftings onto R and X, respectively. By definition tf,X = tf,R[X], and since this holds for all f we get xk = rk[X]. Clearly there are ks k-mappings of SX . We show that xk contains ks tuples by giving a one-to-one mapping between tuples of xk and k-mappings of SX . The components of a tuple tf,X ∈ xk are the partial functions f |Si . By taking their union we obtain f |SX , which is a k-mapping of SX . Conversely, we can obtain tf,X from f |SX by restricting f |SX to the associated sets Si ⊆ SX with Ai ∈ X. This gives us the one-to-one mapping we wanted. 101 Lemma 4.22. Let Ω and R be as in Definition 4.19, k ≥ 2 and rk the k-lifting of R. Let further X, Y ⊆ R, and SX , SY be their associated subsets. Then the FD X → Y holds on rk iff SX ⊇ SY . Proof. (1) Let SX ⊇ SY , and let t1, t2 ∈ rk be tuples in rk. If t1[X] = t2[X] then the k-mappings f1, f2 which were lifted onto R to obtain t1 = tf1 and t2 = tf2 have the same restriction to SX , that is f1|SX = f2|SX . Since SX ⊇ SY this implies that f1|SY = f2|SY , and therefore t1[Y ] = t2[Y ]. Thus X → Y holds on rk. (2) Let SX + SY , i.e., there exists some v ∈ SY \ SX . Let f1, f2 be k-mappings of Ω which differ only on v. Then for their liftings tf1 , tf2 ∈ rk onto R we have tf1 [X] = tf2 [X] but tf1 [Y ] 6= tf2 [Y ]. Thus X → Y does not hold on rk. Definition 4.23. Let R be a schema with constraints Σ. We say that a subschema X ⊆ R is open if its complement R \X is closed under Σ, that is, (R \X)∗ = R \X Lemma 4.24. Let Ω, R, Σ and Si be as in Definition 4.19, k ≥ 2 and rk the k-lifting of R. Let Σ contain only FDs. Then Σ holds on rk iff for every v ∈ Ω the subschema Rv of attributes depending on v (Definition 4.20) is open. Proof. (1) Let all Rv be open. Assume that some FD X → Y ∈ Σ does not hold on rk. Then by Lemma 4.22 there exists some v ∈ SY \ SX . This means that X contains no attributes which depend on v, i.e., no attributes in Rv, so X ⊆ R \ Rv. Since R \ Rv is closed under Σ this implies Y ⊆ R \ Rv, and thus v /∈ SY . This contradicts v ∈ SY \ SX and disproves our assumption. (2) Let Rv not be open for some v ∈ Ω. Then there exists an attribute A ∈ Rv with Σ ² R \ Rv → A. Due to A ∈ Rv we have v ∈ SA, and clearly v /∈ SR\Rv by Definition 4.20. Therefore the FD R \Rv → A does not hold on rk by Lemma 4.22. Lemmas 4.21 and 4.24 indicate how we should construct the Si to obtain a relation on R which is k-reducing w.r.t. some decomposition D. For every subschema Rj ∈ D which is not a key of R, there should be an element v ∈ Ω which does not lie in the subset SRj associated with Rj. At the same time, the set Rv should be open. This leads to the following construction. Subset Construction for FDs: Let R be a schema with FDs Σ, and D a decomposition of R. For every subschema Rj ∈ D we form its closure R∗j , and add a unique element vj to every set Si for which Ai /∈ R∗j . The set Ω from Definition 4.19 is then Ω := ⋃ Si = {vj | Rj ∈ D and R∗j 6= R} Note that if D contains multiple subschemas with the same closure, it would suffice to consider only one of them when constructing the sets Si. Example 4.7. Consider again the schema R and decomposition D from examples 4.5 and 4.6. Applying the subset construction we get ABC∗ = ABCD y do nothing BD∗ = BD y add vBD to SA, SC AD∗ = AD y add vAD to SB, SC 102 This gives us the associated subsets SA = {vBD}, SB = {vAD}, SC = {vBD, vAD}, SD = {} which (except for element names) are the same as in example 4.6. Theorem 4.25. Let R be a schema with FDs Σ, and D a decomposition of R. Let the Si be constructed using the subset construction for FDs. Then for every k the k-lifting rk of R is k-reducing w.r.t. D. Proof. (i) Let v = vj be added to Ω when considering Rj during the subset construction. By construction Rv = R \R∗j is open, so Σ holds on rk by Lemma 4.24. (ii) For every subschema Rj ∈ D which is not a key of R, SR = Ω contains at least one more element than SRj , namely the element vj which was associated with all attributes in R \R∗j . Thus by Lemma 4.21, the projection of rk onto Rj contains at most 1k times as many tuples as rk. We will now generalize the subset construction to work with functional and join de- pendencies. We first need to establish when a join dependency holds on a k-lifting. Definition 4.26. Let Ω and R be as in Definition 4.19 and ./ [R1, . . . , Rn] a join- dependency on R. For every v ∈ Ω the synchronization hypergraph of v w.r.t. ./ [R1, . . . , Rn] is the projection of the hypergraph {R1, . . . , Rn} onto Rv (the set of at- tributes depending on v), i.e., sync(v) = {R1 ∩Rv, . . . , Rn ∩Rv} Lemma 4.27. Let Ω and R be as in Definition 4.19, k ≥ 2 and rk the k-lifting of R. Let further ./ [R1, . . . Rn] be a join-dependency on R. Then ./ [R1, . . . , Rn] holds on rk iff for all v ∈ Ω the synchronization hypergraph Hv of v w.r.t. ./ [R1, . . . , Rn] is connected. Proof. Let SR be the subset of Ω associated with R. By definition, the join dependency ./ [R1, . . . Rn] holds on rk iff rk[R1] ./ . . . ./ rk[Rn] =: jk ⊆ rk (1) Consider a single tuple t ∈ jk, and recall that all its attribute values are partial functions Ω → {1, . . . , k}. Let us denote the union of these partial functions by ⊔ t, which is a relation ft ⊆ Ω × {1, . . . , k}. If ft is a function, then t is the lifting of ft onto R, and thus t ∈ rk. On the other hand, every t′ ∈ rk is the lifting of some function f : Ω → {1, . . . , k}, and thus ⊔ t′ = f . Together this gives us that a tuple t ∈ jk lies in rk iff ⊔ t is a function. (2) Let t ∈ jk, v ∈ Ω. For every subschema Rj from ./ [R1, . . . , Rn] we have that t[Rj] = t′[Rj] for some t′ ∈ rk, and therefore that ⊔ t[Rj] is a partial function. Furthermore we have ⊔ t[Ri ∪Rj] = ⊔ t[Ri] ∪ ⊔ t[Rj] If there exists an attribute A ∈ Ri ∩ Rj, then every v associated with A has the same image4 under ⊔ t[Ri] as it has under ⊔ t[Rj], so ⊔ t[Ri ∪Rj] is a function for v, i.e., it 4One could say that the mapping of v is “synchronized”, hence the name “synchronization hyper- graph”. 103 maps v to only a single value. For a given v ∈ Ω let Rv be the set of attributes depending on v. Then such an attribute A exists for v iff Ri ∩Rv and Rj ∩Rv are not disjoint, i.e., iff the partial synchronization hypergraph with edges Ri ∩Rv and Rj ∩Rv is connected. (3) If Hv is connected, we can use the argument of (2) multiple times to show that for any t ∈ jk ⊔ t is a function for v. If all Hv are connected, ⊔ t is a function, which by (1) shows t ∈ rk, and thus jk ⊆ rk. This proves the “if” direction of the lemma. (4) If Hv is disconnected for some v (which implies v ∈ SR), then we can partition {R1, . . . , Rn} into two sets P1, P2 such that the attribute sets ⋃P1 ∩ Rv and ⋃P2 ∩ Rv are disjoint and non-empty. Since k ≥ 2 we can find two functions f1, f2 : Ω → {1, . . . , k} which differ exactly for v. Let t1, t2 ∈ rk be their liftings onto R. Then by definition of jk there exists a tuple t ∈ jk with t[P1] = t1[P1] and t[P2] = t2[P2] But this gives us ⊔ t = ⊔ t1[P1] ∪ ⊔ t2[P2] which is not a function since ⊔ t1[P1] and ⊔ t2[P2] map v to different values. As ⊔ t is not a function we get t /∈ rk by (1), and thus jk * rk. This shows the “only if” direction and completes the proof. Definition 4.28. Let X ⊆ R be an attribute set and Σ a set of functional and join dependencies on R. The dependency basis of X w.r.t. Σ (and R) is the finest partition DBΣ(X) of R\X∗ into non-empty sets, such that for every Y ∈ DBΣ(X) the multivalued dependency X ³ Y is implied by Σ. Where Σ is clear from the context we will just write DB(X) for DBΣ(X). It is well known that such a unique finest partition always exists [36]. Note that some texts define the dependency basis of X as the finest partition of R rather than R \ X∗. As Σ implies X ³ A for all A ∈ X∗, this partitions X∗ into sets each consisting of only a single attribute. Thus knowing the partition of R \X∗ immediately gives us the partition of R as well. We chose the given definition as it makes some formulations easier. The following inference rules for functional and multivalued dependencies are well- known to be correct [33, 36], and will be useful in what follows: X ³ Y X ³ XY (Augmentation) X ³ Y Y ³ Z X ³ Z \ Y (Pseudo-Transitivity) X ³ Y Y → Z X → Z \ Y (Mixed Pseudo-Transitivity) Lemma 4.29. Let Σ be a set of functional and join dependencies on R. Then for every X ⊆ R, every Y ∈ DB(X) is open, i.e., R \ Y is closed under Σ. Proof. Assume that R \ Y were not closed under Σ. Then for some attribute A ∈ Y we have Σ ² R \ Y → A. Since Y ∈ DB(X) we have Σ ² X ³ R \ Y . Using these two facts together we can derive X ³ R \ Y R \ Y → A X → A (Mixed Pseudo-Transitivity) 104 Thus A ∈ X∗ ∩ Y , which is a contradiction to Y ∈ DB(X), since DB(X) partitions R \X∗. Our intermediate goal is to construct (for given R,Σ and D) the sets Si, such that for every k the resulting k-lifting rk is k-reducing. We have found such a construction for the case where Σ contains only functional dependencies. When Σ contains join dependencies as well, we need to adapt our construction, since otherwise the join dependencies in Σ need not hold on rk. Example 4.8. Consider the schema R = ABCD with constraints Σ = {./ [AB,AC,AD]} ≡ {A ³ B|C|D} and decomposition D = {AB,AC,AD}. Using the subset construction for functional dependencies, we would get the sets SA = {}, SB = {vAC , vAD}, SC = {vAB, vAD}, SD = {vAB, vAC} which in turn lead to the 2-lifting (again writing partial functions as tuples): r2 = A B C D (−,−,−) (−, 1, 1) (1,−, 1) (1, 1,−) (−,−,−) (−, 1, 2) (1,−, 2) (1, 1,−) (−,−,−) (−, 2, 1) (1,−, 1) (1, 2,−) (−,−,−) (−, 2, 2) (1,−, 2) (1, 2,−) (−,−,−) (−, 1, 1) (2,−, 1) (2, 1,−) (−,−,−) (−, 1, 2) (2,−, 2) (2, 1,−) (−,−,−) (−, 2, 1) (2,−, 1) (2, 2,−) (−,−,−) (−, 2, 2) (2,−, 2) (2, 2,−) It is easy to check that ./ [AB,AC,AD] does not hold on r2. We observe that the join dependency ./ [AB,AC,AD] in the example above does not hold because the sets SB, SC , SD share common elements, i.e., they are not pairwise disjoint. This establishes a connection between the values of B,C and D for every tuple in rk. To avoid such connections, we could associate vAB, vAC and vAD only with the attributes of one set Y in DB(AB), DB(AC) and DB(AD), respectively. While in general it is critical that for every v ∈ Ω the set Rv of attributes associated with v is open, in order to ensure that the FDs in Σ hold on rk (Lemma 4.24), Lemma 4.29 ensures us that every Y ∈ DB(X) is open, for any X ⊆ R. This motivates the following construction: Generalized Subset Construction: Let R be a schema with functional and join de- pendencies Σ, and D a decomposition of R. For every subschema Rj ∈ D with R∗j 6= R we form its dependency basis DB(Rj). Then for some (arbitrary) set Y ∈ DB(Rj), we add a unique element vj to every set Si for which Ai ∈ Y . Note that when Σ contains only functional dependencies, we have DB(Rj) = {R \R∗j} Thus the generalized subset construction is identical to the subset construction for FDs in such cases, which justifies its name. 105 Example 4.9. Consider again the schema R = ABCD with constraints Σ = {./ [AB,AC,AD]} and decomposition D = {AB,AC,AD}. Using the generalized subset construction we might get the sets (depending on the choices for Y ) SA = {}, SB = {vAC}, SC = {vAD}, SD = {vAB} which lead to the 2-lifting: r2 = A B C D (−,−,−) (−, 1,−) (−,−, 1) (1,−,−) (−,−,−) (−, 1,−) (−,−, 2) (1,−,−) (−,−,−) (−, 2,−) (−,−, 1) (1,−,−) (−,−,−) (−, 2,−) (−,−, 2) (1,−,−) (−,−,−) (−, 1,−) (−,−, 1) (2,−,−) (−,−,−) (−, 1,−) (−,−, 2) (2,−,−) (−,−,−) (−, 2,−) (−,−, 1) (2,−,−) (−,−,−) (−, 2,−) (−,−, 2) (2,−,−) It is easy to verify that ./ [AB,AC,AD] now holds on r2. Proposition 4.30. For every join dependency ./ [R1, R2, . . .] we have: ./ [R1, R2, . . .] ² ./ [R1 ∪R2, . . .] i.e., if we replace any two (or more) subschemas in a join dependency by their union, we obtain an implied join dependency. Proof. Clear by definition of join dependency. Theorem 4.31. Let R be a schema with functional and join dependencies Σ, and D a decomposition of R. Let the Si be constructed using the generalized subset construction. Then the k-lifting rk of R is k-reducing, for every k ∈ N. Proof. Part (ii) of the k-reducing property can be shown as in Theorem 4.25. The dif- ficulty lies in showing part (i), namely that all constraints in Σ hold. This is clear for functional dependencies by lemmas 4.29 and 4.24. Assume that some join dependency ./ [R1, . . . , Rn] ∈ Σ does not hold for rk, k ≥ 2 (k = 1 is trivial). Then by Lemma 4.27 there must be some v ∈ Ω for which the synchronization hypergraph Hv = {R1 ∩Rv, . . . , Rn ∩Rv} of v w.r.t. ./ [R1, . . . , Rn] is disconnected. Then we can partition Rv, the set of attributes depending on v, into non-empty sets Rv = H1 ∪H2 106 such that every Rj from ./ [R1, . . . , Rn] is disjoint to H1 or H2. Let U1 be the union of Rj disjoint to H2, and U2 of those disjoint to H1. Then by Proposition 4.30 we have: Σ ² ./ [R1, . . . , Rn] ² ./ [U1, U2] Now let X be the closure of the subschema in D for which v was added. By construction we have Rv ∈ DB(X), so that we can partition R into X,Rv and the remaining attributes Z := R \ (X ∪Rv), so that R = X ∪ Z ∪Rv = X ∪ Z ∪H1 ∪H2 Then U1 ⊆ X ∪ Z ∪H1 and U2 ⊆ X ∪ Z ∪H2, and thus Σ ² ./ [U1, U2] ² ./ [X ∪ Z ∪H1, X ∪ Z ∪H2] ≡ XZ ³ H1 Since Z is the union of elements of the dependency basis of X, we have X ³ Z, and thus X ³ Z XZ ³ H1 X ³ H1 (Pseudo-Transitivity) This is a contradiction, since ∅ 6= H1 ( Rv, and Rv ∈ DB(X). 4.2.4 Attribute Count vs Containing Schema Closures - Part II We are now ready to complete the equivalence proof for attribute count and csc-domination orders. Lemma 4.32. Let Σ be a set of functional and join dependencies on R, and X, Y ⊆ R. Then Σ ∪ {Y → R} ² X → Y iff Σ ² X → Y . Proof. If Σ implies X → Y then clearly Σ ∪ {Y → R} implies X → Y as well. Now let Σ 2 X → Y , so that there exists a relation r on R for which all dependencies in Σ hold, but not X → Y . Then there exist at least two tuples t1, t2 ∈ r with t1[X] = t2[X], t1[Y ] 6= t2[Y ] Among all such pairs of tuples, let (t1, t2) be one for which the set of attributes D ⊆ R on which t1 and t2 differ is minimal. We claim that r′ := {t1, t2} is a relation for which the dependencies in Σ∪{Y → R} hold but not X → Y , thus showing Σ∪{Y → R} 2 X → Y . Clearly Y → R holds for r′, but not X → Y . Furthermore all FDs in Σ hold for r′ since r′ ⊆ r. Now for any join dependency ./ [R1, . . . , Rn] ∈ Σ let r′′ := r′[R1] ./ . . . ./ r′[Rn] be the result of the corresponding project-join mapping of r′. By definition ./ [R1, . . . , Rn] holds for r′ iff r′′ ⊆ r′. Let t3 ∈ r′′ be arbitrary. Since t1[Y ] 6= t2[Y ] at least one of the inequalities t3[Y ] 6= t1[Y ] and t3[Y ] 6= t2[Y ] holds, say t3[Y ] 6= t1[Y ]. For every attribute A ∈ R we have 107 t3[A] ∈ {t1[A], t2[A]} by construction of r′′. Since t1[R \D] = t2[R \D], t3 differs from t1 at most on the attributes in D. But t1, t2 were chosen to make D minimal, and due to r′′ ⊆ r[R1] ./ . . . ./ r[Rn] = r we have t3 ∈ r. Thus t3 differs from t1 (and therefore equals t2) on all attributes in D. Consequently t3 = t2 ∈ r′, which shows r′′ ⊆ r′ and completes the proof. Theorem 4.33. Let R be a schema with functional and join dependencies Σ, and D1,D2 be decompositions of R. If D1 does not weakly csc-dominate D2, then D1 does not c- dominate D2. Proof. Recall that by definitions 4.12 and 4.13 D1 weakly csc-dominates D2 iff for every attribute A and every schema R1 with A ∈ R1 ∈ D1 there exists a schema R2 ∈ D2 with A ∈ R2 and R∗1 ⊆ R∗2. As D1 does not weakly csc-dominate D2, there must exist A,R1 with A ∈ R1 ∈ D1, such that for every schema R2 ∈ D2 with A ∈ R2 we have R∗1 * R∗2, i.e., Σ 2 R2 → R1. To show that D1 does not c-dominate D2, we construct a counterexample for any value k. We construct the counterexample r on R by using the generalized subset construction for CSCA(D2), but for an extended set of constraints Σ′ := Σ ∪ {R1 → R} This makes R1 a key of R w.r.t. Σ′, and by Lemma 4.32 we still have Σ′ 2 R2 → R1 for all R2 in question. Then by Lemma 4.10 the number of tuples in r[R1] equals the number of tuples in r, which by Theorem 4.31 is at least k times larger than the number of tuples in r[R2], for any R2 ∈ D2 with A ∈ R2. Thus D1 does not c-dominate D2. The following theorem is well-known [21]. Theorem 4.34 (Hall’s Theorem). Let M1,M2 be finite sets and pi : M1 → P(M2) asso- ciate a set of permitted values with each element in M1. Then there exists an injective mapping f : M1 → M2 with f(e) ∈ pi(e) for all e ∈ M1 iff for all m1 ⊆ M1 we have |m1| ≤ |pi(m1)|, where pi(m1) := ⋃ {pi(e) | e ∈ m1} Lemma 4.35. Let Σ be a set of functional dependencies on R, and further X, Y1, . . . , Yn ⊆ R. Then Σ ∪ {Y1 → R, . . . , Yn → R} ² X → R holds iff Σ ² X → Yi for some i ∈ {1, . . . , n}. Proof. (1) If Σ ² X → Yi then Σ ∪ {Yi → R} ² X → R by transitivity (rule 1.1). (2) Otherwise let X∗ denote the closure of X under Σ, and let r = {t1, t2} be a relation on R containing two tuples with t1[X∗] = t2[X∗] and t1[A] 6= t2[A] for all A /∈ X∗. Then Σ holds on r, and since for all Yi we have Yi * X∗, the functional dependencies Yi → R hold as well. It is clear though that X → R does not hold on r, and thus is not implied by Σ ∪ {Y1 → R, . . . , Yn → R}. 108 Theorem 4.36. Let R be a schema with functional dependencies Σ, and D1,D2 be de- compositions of R. If D1 does not strongly csc-dominate D2, then D1 does not dominate D2. Proof. Let D1 not strongly csc-dominate D2. This means that for some A ∈ R there exists no injective mapping f : M1 := CSCA(D1) → M2 := CSCA(D2) with e ⊆ f(e) for all e ∈ M1. The permitted values for e ∈ M1 in such a mapping would be pi(e) := {e′ ∈ M2 | e ⊆ e′} Note that M1,M2 are multisets, rather than sets. However, to make the formulation of the following arguments easier, we shall regard them as sets by treating multiple oc- currences of elements in M1,M2 as different. This does not change whether an injective mapping f exists. By Theorem 4.34 there exists a set m1 ⊆ CSCA(D1) with |m1| > |m2|, where m2 := pi(m1). We now construct a counterexample r on R by using the subset construction for M2 \m2, but again using an extended set of constraints Σ′ := Σ ∪ {Y → R | Y ∈ m1} This makes all elements in m1 keys of R w.r.t. Σ′, and by Lemma 4.35 we still have Σ′ 2 X → R for all X ∈ M2 \m2. Then by Lemma 4.10 we have |r[m1]| = |m1| · |r| and |r[m2]| = |m2| · |r|. By Theorem 4.25 the number |r| of tuples in r is at least k times larger than the number of tuples in r[X], for any X ∈ M2 \m2. By choosing k large enough we get |r[M2 \m2]| < |r| This gives us |r[M2]| = |r[m2]|+ |r[M2 \m2]| < |m2| · |r|+ |r| ≤ |m1| · |r| = |r[m1]| ≤ |r[M1]| which shows that D1 does not dominate D2. In the last theorem we restricted ourselves to functional dependencies. This is because it does not hold in the presence of multi-valued or join dependencies, as the following example shows. Example 4.10. Let R,Σ,D1,D2 be as follows: R = ABC,Σ = {A ³ B}, D1 = {AB,AC},D2 = {ABC,A} It is easy to see that D1 does not strongly csc-dominate D2 w.r.t. the attribute A: CSCA(D1) = {AB,AC} CSCA(D2) = {ABC,A} 109 It is clear that D1 dominates D2 w.r.t. B and C. To show domination w.r.t. attribute count, it thus suffices to prove that for every relation r on R we have countA(r[D1]) ≤ countA(r[D2]) To do so, we partition r into disjoint relations ri with |ri[A]| = 1 and ri[A] 6= rj[A] for i 6= j. Then for each subschema R′ of R containing A (i.e., all schemas in D1,D2), r[R′] is the disjoint union of the ri[R′], and thus countA(r[D1/2]) = ∑ countA(ri[D1/2]) Therefore we only need to show countA(ri[D1]) ≤ countA(ri[D2]) for all relations ri. Note that A ³ B still holds for all ri, and thus ri[BC] = ri[B] ./ ri[C] Abbreviating the cardinalities of ri[B], ri[C] with CB, CC we obtain |ri[AB]| = CB |ri[AC]| = CC |ri[ABC]| = CB · CC |ri[A]| = 1 This gives us countA(ri[D2])− countA(ri[D1]) = CB · CC − (CB + CC) + 1 = (CB − 1) · (CC − 1) ≥ 0 which shows that D1 dominates D2 w.r.t. attribute count, even though D1 does not strongly csc-dominate D2. It is an open question how domination w.r.t. size or attribute count can be character- ized syntactically in the presence of functional and join dependencies. 4.3 Relationship to other Normal Forms A number of normal forms for characterizing well designed relational databases have been proposed, depending on the types of integrity constraints given. For functional dependencies, BCNF and 3NF are the most popular ones. 4NF [15] is an extension of BCNF for functional and multivalued dependencies. For functional and join dependencies 4NF has been extended to PJ/NF [16], 5NF [33, 43] and KCNF [44]. The last normal form, KCNF, will be of particular interest to us. Definition 4.37. [44] Let R be a schema with functional and join dependencies Σ. Then R is in Key-Complete Normal Form (KCNF), if for every join dependency ./ [R1, . . . , Rn] implied by Σ, the keys among R1, . . . , Rn contain all attributes in R. That is, we have R = ⋃ {Ri ∈ ./ [R1, . . . , Rn] | R∗i = R} 110 Note that in the case where Σ contains only functional and multivalued dependencies, KCNF is equivalent to 4NF, and when Σ contains only functional dependencies KCNF is equivalent to BCNF [44]. Given a single schema with constraints Σ, the absence of redundancy, as defined in [1, 44], is precisely characterized by BCNF, 4NF and KCNF. That is, a schema R is free of redundancy iff it is in BCNF, 4NF or KCNF [1, 44]. While our normal form has been designed to minimize size rather than redundancy, the intuition is that minimizing one minimizes the other as well. We will show that this intuition holds in so far, as that a single schema is in KCNF (and thus free of redundancy) iff it is in DNF among all lossless decompositions. Thus lossless DNF can be seen as an extension of BCNF, 4NF and KCNF. Recall that a decomposition is in DNF if it is minimal among a given set of ’suitable’ decompositions, and that for each such set we may obtain a different DNF. Theorem 4.38. Let R be a schema with functional and join dependencies Σ. Then R is in KCNF iff {R} is in DNF w.r.t. all lossless decompositions of R. Proof. (1) Let R be in KCNF. Let D = {R1, . . . , Rn} be any lossless decomposition of R. Then Σ implies the join dependency ./ [R1, . . . , Rn], and since R is in KCNF, every attribute A ∈ R lies in some Ri which forms a key of R. Thus R ∈ CSCA(D) for all A, so {R} strongly csc-dominates D. Therefore {R} dominates D by Theorem 4.17. As this holds for all lossless decompositions D, {R} is in DNF. (2) Let R not be in KCNF. Then Σ implies a join dependency ./ [R1, . . . , Rn] such that ⋃ {Ri | Σ ² Ri → R} 6= R The decomposition D = {R1, . . . , Rn} is lossless, and there exists an attribute A ∈ R which does not lie in any Ri which forms a key of R. Clearly D weakly csc-dominates {R}, and since R /∈ CSCA(D) we have that {R} does not weakly csc-dominate D. Thus D strictly c-dominates {R} by theorems 4.17 and 4.33, showing that {R} is not in DNF. Note that, while lossless DNF and BCNF are the same for a single schema, they differ significantly when applied to multiple schemas. We will demonstrate those differences using a detailed example. 4.3.1 A Detailed Example A university has oral examinations at the end of each semester, and wants to manage related data using a relational database. The relevant attributes to be stored are R = {Student, Course, Chapter, T ime,Room} Here Chapter denotes a chapter from the course textbook the student will be examined about. Every student can get examined about multiple chapters, and chapters may vary for each student. Multiple students can get examined at the same time in the same room, but the course must be the same. Further constraints are that a student gets examined for a course only once, and cannot be in multiple rooms at the same time. Those conditions can be expressed through functional dependencies as follows: Σ =    {Student, Course} → Time, {Student, T ime} → Room, {Time,Room} → Course    111 We are now presented with the task of decomposing R. If it is deemed necessary to preserve dependencies, a reasonable Boyce-Codd Normal Form decomposition could be synthesized as follows: DDP−BCNF =    {Student, Course, T ime}, {Student, T ime,Room}, {Course, T ime,Room}, {Student, Course, Chapter}    This decomposition, however, is not in dependency preserving Domination Normal Form - it is strictly dominated by the decomposition DDP−DNF = { {Student, Course, T ime,Room}, {Student, Course, Chapter} } Note that the latter decomposition is in dependency preserving DNF (although we will not show this here), but not in BCNF. If dependencies need not be preserved, we could use the well-known BCNF decompo- sition algorithm [27, 33, 36] to obtain the following BCNF decomposition (decomposing first by {Student, Course} → Time, then by {Student, Course} → Room): DBCNF =    {Student, Course, T ime}, {Student, Course,Room}, {Student, Course, Chapter}    Again, this is strictly dominated by the decomposition DDP−DNF , which is in DNF even among all lossless decompositions. At first look it might seem that DDP−DNF is strictly c-dominated by DDNF =    {Student, T ime,Room}, {Course, T ime,Room}, {Student, Course, Chapter}    A closer look reveals though that both c-dominate each other, since the attribute Course already appears in the key schema {Student, Course, Chapter} in both cases. The latter decomposition is both in DNF and BCNF. The decomposition D′DP−DNF = { {Student, Course, T ime,Room}, {Student, T ime,Chapter} } however, while in dependency preserving DNF, is not in DNF w.r.t. all lossless decompo- sition, since it is strictly c-dominated by D′DNF =    {Student, T ime,Room}, {Course, T ime,Room}, {Student, T ime, Chapter}    Which decomposition to choose is ultimately up to the designer, as storage space and re- dundancy are not the only design criteria to consider. The schema {Student, Course, Chapter} appears to be a more intuitive choice than {Student, T ime, Chapter}, although deciding this requires domain knowledge which is not encoded in the constraints. 112 We conclude this example by providing an instance of R and its projection onto the schemas appearing in the first four decompositions given. While DNF considers all valid instances rather than just a given one, we chose an instance which visualizes the relation- ship between a schema’s closure and the size of relations projected onto it. It is easy to see that DDP−DNF and DDNF require less storage space than DDP−BCNF and DBCNF . Student Course Ch. T ime Ro. J.C. Denton Networks 2 3/10, 1pm 101 J.C. Denton Networks 6 3/10, 1pm 101 J.C. Denton Security 1 4/10, 1pm 104 J.C. Denton Security 5 4/10, 1pm 104 L. Nasher Networks 3 3/10, 1pm 101 L. Nasher Networks 4 3/10, 1pm 101 L. Nasher Security 4 4/10, 1pm 104 L. Nasher Security 7 4/10, 1pm 104 O. Shrek Networks 2 3/10, 1pm 101 O. Shrek Networks 8 3/10, 1pm 101 O. Shrek Security 5 4/10, 1pm 104 O. Shrek Security 2 4/10, 1pm 104 M. Smith Security 4 4/10, 2pm 104 M. Smith Security 6 4/10, 2pm 104 M. Anderson Networks 3 3/10, 1pm 101 M. Anderson Networks 5 3/10, 1pm 101 A. Cheng Networks 2 3/10, 1pm 103 A. Cheng Networks 4 3/10, 1pm 103 A. Cheng Security 4 4/10, 2pm 104 A. Cheng Security 5 4/10, 2pm 104 N. Cheng Networks 1 3/10, 1pm 103 N. Cheng Networks 7 3/10, 1pm 103 N. Cheng Security 5 4/10, 2pm 104 N. Cheng Security 6 4/10, 2pm 104 J.Zhao Networks 2 3/10, 1pm 103 J.Zhao Networks 5 3/10, 1pm 103 Student Course Ch. J.C. Denton Networks 2 J.C. Denton Networks 6 J.C. Denton Security 1 J.C. Denton Security 5 L. Nasher Networks 3 L. Nasher Networks 4 L. Nasher Security 4 L. Nasher Security 7 O. Shrek Networks 2 O. Shrek Networks 8 O. Shrek Security 5 O. Shrek Security 2 M. Smith Security 4 M. Smith Security 6 M. Anderson Networks 3 M. Anderson Networks 5 A. Cheng Networks 2 A. Cheng Networks 4 A. Cheng Security 4 A. Cheng Security 5 N. Cheng Networks 1 N. Cheng Networks 7 N. Cheng Security 5 N. Cheng Security 6 J.Zhao Networks 2 J.Zhao Networks 5 113 Student Course T ime Ro. J.C. Denton Networks 3/10, 1pm 101 J.C. Denton Security 4/10, 1pm 104 L. Nasher Networks 3/10, 1pm 101 L. Nasher Security 4/10, 1pm 104 O. Shrek Networks 3/10, 1pm 101 O. Shrek Security 4/10, 1pm 104 M. Smith Security 4/10, 2pm 104 M. Anderson Networks 3/10, 1pm 101 A. Cheng Networks 3/10, 1pm 103 A. Cheng Security 4/10, 2pm 104 N. Cheng Networks 3/10, 1pm 103 N. Cheng Security 4/10, 2pm 104 J.Zhao Networks 3/10, 1pm 103 Student Course T ime J.C. Denton Networks 3/10, 1pm J.C. Denton Security 4/10, 1pm L. Nasher Networks 3/10, 1pm L. Nasher Security 4/10, 1pm O. Shrek Networks 3/10, 1pm O. Shrek Security 4/10, 1pm M. Smith Security 4/10, 2pm M. Anderson Networks 3/10, 1pm A. Cheng Networks 3/10, 1pm A. Cheng Security 4/10, 2pm N. Cheng Networks 3/10, 1pm N. Cheng Security 4/10, 2pm J.Zhao Networks 3/10, 1pm Student T ime Ro. J.C. Denton 3/10, 1pm 101 J.C. Denton 4/10, 1pm 104 L. Nasher 3/10, 1pm 101 L. Nasher 4/10, 1pm 104 O. Shrek 3/10, 1pm 101 O. Shrek 4/10, 1pm 104 M. Smith 4/10, 2pm 104 M. Anderson 3/10, 1pm 101 A. Cheng 3/10, 1pm 103 A. Cheng 4/10, 2pm 104 N. Cheng 3/10, 1pm 103 N. Cheng 4/10, 2pm 104 J.Zhao 3/10, 1pm 103 Student Course Ro. J.C. Denton Networks 101 J.C. Denton Security 104 L. Nasher Networks 101 L. Nasher Security 104 O. Shrek Networks 101 O. Shrek Security 104 M. Smith Security 104 M. Anderson Networks 101 A. Cheng Networks 103 A. Cheng Security 104 N. Cheng Networks 103 N. Cheng Security 104 J.Zhao Networks 103 Course T ime Ro. Networks 3/10, 1pm 101 Security 4/10, 1pm 104 Security 4/10, 2pm 104 Networks 3/10, 1pm 103 4.4 Computing Domination Normal Form Having defined when schemas are in DNF, we are now looking for an algorithm which, given a schema R with constraints Σ, computes a decomposition of R which is in DNF. For this we will restrict ourselves to the case where the only constraints given are functional dependencies. 114 Lemma 4.39. Let R be a schema with constraints Σ, and D be a decomposition of R. If D contains two different R1, R2 ∈ D with R∗1 = R∗2, then D′ := D \ {R1, R2} ∪ {R1 ∪R2} dominates D. If R1 ∩R2 6= ∅, then D′ strictly dominates D. Proof. For every relation r on R the projections r[R1], r[R2] can be obtained by projecting from r[R1 ∪R2]. As R1 an R2 are keys of R1 ∪R2 ⊆ R∗1 = R∗2 no tuples are lost in this projection. Thus the size of D and D′ varies by the size for the values of attributes in R1 ∩R2, as those are stored twice in D. In chapter 3 we defined equivalence classes for functional dependencies, or equivalently for schemas. We shall use them again here, and recall the definition. Definition 4.40. Let R be a schema with constraints Σ and X, Y ⊆ R. We call X and Y equivalent if X∗ = Y ∗. We say that X is of higher order than Y if X∗ ⊇ Y ∗. We call two functional dependencies X1 → Y1, X2 → Y2 implied by Σ equivalent if their left hand sides X1 and X2 are equivalent (and similarly for higher order). When partitioning a set Σ of FDs into equivalence classes, we denote them by EQX := {Y → Z ∈ Σ | Y ∗ = X∗} and call X∗ the closure of EQX . This groups schemas and functional dependencies into equivalence classes. As there is an obvious correspondence between equivalence classes of schemas and those of functional dependencies, we will not always distinguish between them, and will compare equivalence classes w.r.t. higher order a well. When searching for a DNF decomposition, Lemma 4.39 tells us that it suffices to only consider decompositions which contain at most one schema for each equivalence class of schemas. Definition 4.41. Let D be a decomposition of R and X ⊆ R. We define the higher order schemas and higher order attributes of X in D as HOSX(D) := {Rj ∈ D | X∗ ⊆ R∗j} HOAX(D) := ⋃ HOSX(D) Similarly, the strictly higher order schemas and strictly higher order attributes of X in D are SHOSX(D) := {Rj ∈ D | X∗ ( R∗j} SHOAX(D) := ⋃ SHOSX(D) 115 Lemma 4.42. Let D1,D2 be two decompositions of R. Then D1 weakly csc-dominates D2 iff for every schema X ∈ D1 we have X ⊆ HOAX(D2) Proof. (1) By definition D1 weakly csc-dominates D2 iff for every attribute A ∈ R and every R∗1 ∈ CSCA(D1) there exists R∗2 ∈ CSCA(D2) with R∗1 ⊆ R∗2. In other words, for every pair (A,R1) with A ∈ R1 ∈ D1 there exists R2 with A ∈ R2 ∈ D2 and R∗1 ⊆ R∗2. (2) Furthermore we have X ⊆ HOAX(D2) for all X ∈ D1 iff for every pair (A,X) with A ∈ X ∈ D1 there exists Y ∈ HOSX(D2) with A ∈ Y , that is Y with A ∈ Y ∈ D2 and X∗ ⊆ Y ∗. Using (1) and (2) together (with X = R1 and Y = R2) we obtain the claim. Definition 4.43. Let R be a schema with FDs Σ, and D a set of subschemas of R. We denote the set of all FDs in Σ which lie in schemas in D by Σ[D] := ⋃ Rj∈D Σ[Rj] 4.4.1 Dependency Preserving DNF We are now able to construct an algorithm to compute a DNF decomposition for the case where Σ contains only functional dependencies, and where we consider only lossless and dependency preserving decompositions. For this, we recall the definition and properties of partial covers from sections 3.1 and 3.2.1, rephrased slightly to suit our needs. Definition 4.44. Let Σ be a set of FDs, and EX ⊆ Σ be an equivalence class of Σ. A set C ⊆ Σ∗ is a partial cover of EX (w.r.t. Σ) if C[X∗] ∪ (Σ \ EX) is a cover of Σ. Lemma 4.45. Let Σ be a set of FDs, and let EQ be the partition of Σ into equivalence classes. Then a set C of FDs is a cover of Σ iff C is a partial cover (w.r.t. Σ) for all equivalence classes EQj ∈ EQ. Lemma 4.46. Let R be a schema with FDs Σ, and D a dependency preserving decompo- sition of R. Let further EQX be an equivalence class of Σ. Then Σ∗a[HOSX(D)] ⊆ Σ∗a[HOAX(D)] each form a partial cover of EQX . Proof. Since D is dependency preserving, Σ∗a[D] must form a cover of Σ, and thus a partial cover of EQX . By Definition 4.44 the only FDs of interest for forming a partial cover of EQX are those LHS-equivalent to the FDs in EQX . All of them lie in schemas Rj with X∗ ⊆ R∗j , so Σ∗a[{Rj ∈ D | X∗ ⊆ R∗j}] already forms a partial cover of EQX . We use this to synthesize a dependency preserving decomposition as follows. 116 Algorithm “dependency preserving DNF decomposition” INPUT: schema R, canonical cover Σ OUTPUT: decomposition D in DNF D := ∅ if Σ contains no key dependencies then add minimal key Rkey of R to D partition Σ into equivalence classes EQ while EQ 6= ∅ do pick maximal EQj ∈ EQ and remove it from EQ Rj := closure of FDs in EQj AD := SHOARj(D) first for all A ∈ Rj \ AD and then for all A ∈ Rj ∩ AD do if Σ∗a[D ∪ {Rj \ A}] is a partial cover of EQj then Rj := Rj \ A end if Rj 6= ∅ then D := D ∪ {Rj} end Theorem 4.47. Algorithm “dependency preserving DNF decomposition” returns a loss- less, dependency preserving DNF decomposition of R. Proof. (1) We first show that D is dependency preserving and lossless. This is the case iff Σ∗a[D] is a cover of Σ, and D contains a key of R. If Σ contains no key dependency, then a minimal key Rkey of R is added to D. Otherwise it suffices to show that Σ∗[D] is a cover of Σ, since this implies that D contains a key of R. In each iteration of the while loop, it is ensured that for the decomposition D computed so far, Σ∗[D] forms a partial cover for the equivalence classes removed from EQ. After processing all equivalence classes, we therefore obtain a decomposition for which Σ∗[D] covers Σ. It remains to show that D is not strictly dominated or c-dominated by any other lossless and dependency preserving decomposition D′ of R. (2) We start by proving that D′ does not strictly dominate D. Consider the schemas R1, . . . , Rn ∈ D (including Rkey if it was added) in the order as they were added to D (with indices describing this order). Let Rk be the first such schema which is not contained in D′ (if all Rj are contained in D′ then clearly D dominates D′), and C := R∗k its closure. By Lemma 4.46 the set Σ∗[HOSC(D′)] forms a partial cover of EQk. Let further H ′k := ⋃ (HOSC(D′) \ {R1, . . . , Rk−1}) so that Σ∗[{R1, . . . , Rk−1} ∪H ′k] forms a partial cover of EQk. Since Rk was constructed as minimal such that Σ∗[R1, . . . , Rk] forms a partial cover of EQk, H ′k ( Rk cannot hold. If Hk = Rk then all schemas in HOSC(D′) \ D must be of order EQk. By Lemma 4.39 117 we may assume that this does not happen, so Hk * Rk. Thus there exists at least one attribute A which lies in some schema R′A ∈ HOSC(D′) \ {R1, . . . , Rk−1} but not in Rk. Consider the containing schema closures CSCA(D) and CSCA(D′). If D′ were to dominate D, then CSCA(D′) would strongly inclusion dominate CSCA(D). Equiva- lently, CSCA(D′ \ {R1, . . . , Rk−1}) would have to strongly inclusion dominate CSCA(D \ {R1, . . . , Rk−1}). However, we have R′∗A ∈ CSCA(D′ \ {R1, . . . , Rk−1}) but no schema in CSCA(D \ {R1, . . . , Rk−1}) includes R′∗A. This is a contradiction, which shows that D′ does not dominate D. (3) Finally, we need to show that D′ does not strictly c-dominate D. Assume the contrary, so that by Lemma 4.42 we have for all R′ ∈ D′ that R′ ⊆ HOAR′(D), and there exist a schema Rw ∈ D for which Rw * HOARw(D′) Let Rw be the first such schema in the sequence of schemas R1, . . . , Rn ∈ D, and let C := R∗w be its closure. Then for every Rj ∈ SHOSC(D) we get Rj ⊆ HOARj(D′) ⊆ SHOAC(D′) and thus SHOAC(D) ⊆ SHOAC(D′) Inclusion in the opposite direction holds by similar argument, showing equivalence: SHOAC(D) = SHOAC(D′) This attribute set is computed as the set AD during the construction of Rw: we have AD = SHOAC({R1, . . . , Rw−1}) = SHOAC(D) = SHOAC(D′) Since Rw * HOARw(D′) ⊇ AD we have Rw * AD. As we tried removing attributes outside AD first when constructing Rw, the set Σ∗[AD] = Σ∗[SHOAC(D′)] cannot form a partial cover of EQw. By Lemma 4.46 the set Σ∗[HOAC(D′)] does form a partial cover of EQk, so D′ must contain at least one schema R′w with R′∗w = C. By Lemma 4.39 we may assume that R′w is the only such schema in D′. We have by assumption that R′w ⊆ HOAC(D) = SHOAC(D) ∪Rw = AD ∪Rw 118 and thus R′w \ AD ⊆ Rw \ AD Furthermore, we can split HOSC(D′) into SHOSC(D′) and R′w, and get (using Lemma 4.46 once more) that Σ∗[AD ∪R′w] ⊇ Σ∗[SHOSC(D′) ∪ {R′w}] = Σ∗[HOSC(D′)] must be a partial cover of EQw. However, by trying to remove attributes not in AD first, we constructed Rw such that Rw \AD is minimal with Σ∗[AD ∪Rw] being a partial cover for EQw (note that Σ∗[{R1, . . . , Rw−1}] ⊆ Σ∗[AD]) . Thus R′w\AD cannot be a true subset of Rw \ AD, which gives us R′w \ AD = Rw \ AD But this means that Rw ⊆ R′w ∪ AD = R′w ∪ SHOAC(D′) = HOAC(D′) which contradicts our initial assumption for Rw. 4.5 Combining Normal Forms While a decomposition into DNF minimizes storage space, other classical normal forms such as BCNF (for multiple schemas) offer other benefits, such as e.g. fast checking whether the given FDs hold [27]. We are therefore interested in the possibility of achieving both normal forms at the same time, and thus benefitting from both. 4.5.1 DNF and BCNF Dependency preserving DNF and BCNF cannot always be combined - dependency pre- serving BCNF decompositions are known to not always exist, and deciding whether one exists is NP-hard [5]. We are now looking for an algorithm to decide whether a schema R with set of FDs Σ has a dependency preserving DNF decomposition D, for which every schema Rj ∈ D is in BCNF. Furthermore, the algorithm should find such a decomposition D if it exists. Note that algorithm “dependency preserving DNF decomposition” only produces de- compositions which contain at most one schema for each equivalence class. This was motivated by Lemma 4.39. However, when looking for BCNF as well, we must con- sider decompositions which contain multiple schemas per equivalence class (i.e., multiple schemas with the same closure). Example 4.11. Let R = ABCDEFG with constraints Σ =    AB → C,DE → F, DF → AG,AG → DF, EF → BG,BG → EF, AC → DG,DG → AC, BC → EG,EG → BC    Then R has a dependency preserving DNF and BCNF decomposition: D = {ABC,DEF,ADFG,BEFG,ACDG,BCEG} 119 Note though that D has two key schemas: ABC and DEF . Joining these two schemas leads to a decomposition D′ = {ABCDEF,ADFG,BEFG,ACDG,BCEG} which dominates D (and vice versa) and thus is also in dependency preserving DNF. However, D′ is no longer in BCNF. The general idea for finding a dependency preserving DNF decomposition which is also in BCNF, is based on the results of chapter 3. We compute all canonical covers, and for each canonical cover C we generate all minimal (w.r.t. domination/c-domination) decompositions which preserve C. By “preserve” we mean that each FD X → A ∈ C lies in some schema Rj ⊇ XA. If a dependency preserving DNF and BCNF decomposition exists, it will be among them. While this general idea is not very efficient, we will optimize it in several ways. First, we will show that for a given canonical cover C, it suffices to consider only one particular decomposition which preserves C. Definition 4.48. For a FD X → Y we call XY the schema induced by X → Y . For a set of FDs Σ we call H = {XY | X → Y ∈ Σ} the hypergraph induced by Σ. The maximal connected components of H partition its support (cf. section 3.1) ϑH = ⋃ H We call ϑH the schema induced by Σ, and the maximal connected components ofH (which we regard as a set of schemas) the schema partition induced by Σ. We will generate a decomposition which preserves a particular canonical cover Σ as follows. We first partition Σ into equivalence classes.We then preserve each equivalence class ΣX by adding the schema partition induced by ΣX to the decomposition. Example 4.12. Consider again R = ABCDEFG from example 4.11 with constraints Σ =    AB → C,DE → F, DF → AG,AG → DF, EF → BG,BG → EF, AC → DG,DG → AC, BC → EG,EG → BC    To preserve Σ we need to preserve (in particular) the partial cover of EQR ΣR = {AB → C,DE → F} The hypergraph induced by ΣAB is H = {ABC,DEF} which is also the schema partition induced by ΣR, while ϑH = ABCDEF is the schema induced by ΣR. We add the schemas ABC,DEF to our decomposition, rather than the schema ABCDEF , because ABC and DEF are in BCNF. 120 Definition 4.49. Let R be a schema with FDs Σ, and EQ the partition of Σ into equiv- alence classes. For each eq ∈ EQ let ϑeq be the schema induced by eq. Then we call D1 := {ϑeq | eq ∈ EQ} the principal decomposition induced by Σ, and D2 := ⋃ eq∈EQ {schema partition induced by eq} the connected component decomposition induced by Σ. We will now show that it suffices to consider only the decompositions described above. For the following lemmas it is important to note the subtle difference between a depen- dency preserving decomposition of (R,Σ), and a decomposition of R which preserves Σ: A dependency preserving decomposition only needs to preserve a cover of Σ, not Σ itself. Lemma 4.50. Let R be a schema with FDs Σ. Then every BCNF decomposition D′ which preserves Σ is dominated by the principal decomposition D induced by Σ. Proof. Let D′ be any BCNF decomposition which preserves Σ, and A an attribute in R. By Theorem 4.17 strong csc-domination implies domination. Thus it suffices to show that there exists an injective function f , which maps every schema ϑ ∈ D with A ∈ ϑ to a schema ϑ′ ∈ D′ with A ∈ ϑ′ and ϑ∗ ⊆ ϑ′∗. Now let ϑ ∈ D with A ∈ ϑ. By construction of D there must exists a FD X → Y ∈ Σ with A ∈ XY and XY ∗ = ϑ∗. Since D′ preserves Σ, there exists a schema ϑ′ ∈ D′ with XY ⊆ ϑ′. Thus we have A ∈ ϑ′ and ϑ∗ = XY ∗ ⊆ ϑ′∗, so we let f map ϑ to ϑ′. It remains to show that the function f constructed above is injective. As D is in BCNF, and X → Y is contained in ϑ′, we have XY ∗ = ϑ′∗, and thus ϑ∗ = ϑ′∗. Since D contains only one schema per equivalence class, f is injective. Note that the principal and connected component decompositions are not lossless if Σ contains no key FD. In order to make them lossless, we need to add a minimal key. We consider this situation next. Corollary 4.51. Let R be a schema with FDs Σ, such that Σ contains no key FDs. Let D be the principal decomposition induced by Σ. Then for every lossless BCNF decomposition D′ which preserves Σ, there exists a minimal key schema Rkey such that D′ is dominated by D ∪ {Rkey}. Proof. Since D′ is lossless, it contains a key schema R′key by Lemma 2.1. Since D′ is in BCNF, R′key preserves no FDs in Σ, so D′ \ {R′key} still preserves Σ. Thus D dominates D′ \ {R′key} by Lemma 4.50. It follows that for every minimal key schema Rkey ⊆ R′key the decomposition D ∪ {Rkey} dominates D′. We are now ready to show that for finding a dependency preserving DNF and BCNF decomposition, is suffices to consider only connected component decompositions, plus minimal keys, if needed. 121 Theorem 4.52. Let R be a schema with FDs Σ, such that R has a decomposition in dependency preserving DNF and BCNF. Then there exists a canonical cover Σ′ of Σ and a minimal key Rkey of R, such that the following statements hold: (i) If Σ contains a non-trivial key FD then DΣ′ is a dependency preserving DNF and BCNF decomposition of R. (ii) If Σ contains no non-trivial key FD then DΣ′ ∪ {Rkey} is a dependency preserving DNF and BCNF decomposition of R. where DΣ′ is the connected component decomposition induced by Σ′. Proof. Let D be a decomposition of R in dependency preserving DNF and BCNF, which exists by assumption. Since D is dependency preserving there exists a cover Σ′ of Σ such that D preserves Σ′. Let DpΣ′ be the principal decomposition induced by Σ′. (i) If Σ contains a non-trivial key FD, then so does Σ′, which makes DpΣ′ lossless (and de- pendency preserving). By Lemma 4.50 DpΣ′ dominates D. Since D is in dependency preserving DNF, the opposite must hold as well, i.e., D dominates DpΣ′ . (ii) If Σ contains no non-trivial key FD, then Σ′ contains no key FD. Then by Corollary 4.51 there exists a minimal key Rkey of R such that DpΣ′ ∪{Rkey} dominates D. Now DpΣ′ ∪ {Rkey} is lossless by Lemma 2.1, and dependency preserving. Thus, since D is in dependency preserving DNF, D dominates DpΣ′ ∪ {Rkey}. It is clear by definition that the principal decomposition DpΣ′ , and the connected compo- nent decomposition DΣ′ induced by Σ′ dominate each other. Thus D dominates DΣ′ and vice versa, or DΣ′ ∪ {Rkey} and vice versa, for case (i) and case (ii) respectively. Since D is in dependency preserving DNF, and DΣ′ (or DΣ′ ∪ {Rkey}) is lossless and dependency preserving and dominates D, the decomposition DΣ′ (or DΣ′ ∪{Rkey}) is also in dependency preserving DNF. It remains to show that DΣ′ (or DΣ′∪{Rkey}) is in BCNF. This is trivial for the minimal key Rkey, so we only need to show it for DΣ′ in either case. We have established that D,DΣ′ and DpΣ′ dominate each other. By Theorem 4.36 all attributes have the same containing schema closures w.r.t. D,DΣ′ and DpΣ′ . Thus for every schema Rj ∈ DpΣ′ the schemas in D with the same closure as Rj partition Rj, and similarly for the schemas in DΣ′ . By construction DΣ′ uses the finest such partitions possible while preserving Σ′, in particular as least as fine as those of D. Thus every schema in DΣ′ is a subset of some schema in D. Since D is in BCNF, DΣ′ must be in BCNF as well. The last theorem allows us to restrict our search for dependency preserving DNF and BCNF decompositions to connected component decompositions induced by canonical covers, plus minimal key schemas if needed. When generating the connected component decomposition DΣ induced by some canon- ical cover Σ we apply another optimization. DΣ is the disjoint union of several connected component decompositions, each induced by the partial canonical cover Σ[eq] for some equivalence class eq of Σ. We thus compute the connected component decompositions in- duced by partial canonical covers for each equivalence class of Σ∗a individually. These can then be combined to obtain all connected component decompositions induced by canonical covers. 122 We now describe an algorithm which is based on our constructions so far. Definition 4.53. Let ISP be a partial function mapping schemas to sets of schemas. Then for any set of schemas D we define ISP (D) := {s ∈ D | ISP (s) undefined} ∪⋃ {ISP (s) | s ∈ D and ISP (s) defined} We will use this to replace all schemas by their partitions, except for the minimal key schema Rkey which is added to make the decomposition lossless, if needed, and has no partition. Algorithm “dependency preserving DNF and BCNF decomposition” INPUT: schema R, set of FDs Σ OUTPUT: decomposition of R in DNF and BCNF partition Σ∗a into equivalence classes EQ for each EQj ∈ EQ do compute CCj, the set of all partial canonical covers on EQj ISj := ∅ (the induced schemas for EQj) ISP := ∅ (associated partitions in BCNF) for each partial canonical cover pc ∈ CCj do compute the schema partition sppc induced by pc Rpc := ⋃ sppc (the schema induced by pc) ISj := ISj ∪ {Rpc} if (all schemas in sppc are in BCNF and ISP does not map Rpc to anything) then ISP := ISP ∪ {Rpc 7→ sppc} end remove all non-minimal (w.r.t. inclusion) schemas from ISj remove all schemas Rpc from ISj with ISP (Rpc) undefined end if EQ contains no key class then for all minimal keys Rkey of R do decompose(EQ, {Rkey}) end else decompose(EQ, ∅) 123 proc decompose(EQ,D) if EQ = ∅ then output ISP (D) exit pick maximal EQj ∈ EQ compute AD := SHOAEQj(D) for all Rj ∈ ISj do if (Rj \ AD minimal with (y c-domination optimal) EQj[D ∪ {(Rj \ AD) ∪ AD}] partial cover for EQj and Rj minimal with (y domination optimal) EQj[D ∪ {Rj}] partial cover for EQj) then decompose(EQ \ {EQj},D ∪ {Rj}) end Theorem 4.54. Algorithm “dependency preserving DNF and BCNF decomposition” out- puts such a decomposition if one exists. Proof. (1) Only schema partitions with schemas in BCNF get added to the ISPj, and only those get added to ISP (D). Thus ISP (D) is in BCNF. For each essential equivalence class EQj, a set of schemas sp gets added to ISP (D) (first ⋃ sp is added to D and then replaced), which holds a partial cover for EQj. These partial covers together form a cover for Σ by Lemma 4.45, thus ISP (D) is dependency preserving. Furthermore, a key schema is added to ISP (D), either initially or to hold a partial cover for EQkey. Together with the fact that ISP (D) preserves dependencies, we obtain that ISP (D) is lossless. It is easy to see that the algorithm’s output ISP (D) and D are equivalent w.r.t. domination. It thus suffices to show that D is in DNF, which then implies the same for ISP (D) by definition of DNF. To do this, we will argue that D could have been pro- duced using algorithm “dependency preserving DNF decomposition”, for which we have already proven correctness. To keep things readable, we abbreviate algorithm “depen- dency preserving DNF and BCNF decomposition” by A-DP-DNF-BCNF and algorithm “dependency preserving DNF decomposition” by A-DP-DNF. In A-DP-DNF-BCNF a minimal key Rkey is added to D iff EQ contains no key class. Since EQ (at this time) contains all essential equivalence classes, this condition is equiv- alent to the corresponding condition in A-DP-DNF that the canonical cover Σ contains no key FDs. After that, both algorithms add one schema Rj to D for each essential equivalence class EQj ∈ EQ, picking maximal classes first. In A-DP-DNF attributes are removed from Rj as long as Σ∗a[D ∪ {Rj}] is partial cover for EQj, removing attributes in AD first. Only FDs with LHSs equivalent to FDs in EQj contribute towards a partial cover of EQj, and since in A-DP-DNF-BCNF the equivalence classes EQj contain all such (atomic) FDs, we have for every decomposition E : Σ∗a[E ] is a partial cover of EQj iff EQj[E ] is 124 As there are no other restrictions on the order in which attributes A are checked for removal, this can generate any schema Rj for which Rj \ AD minimal with EQj[D ∪ {(Rj \ AD) ∪ AD}] partial cover for EQj and Rj minimal with EQj[D ∪ {Rj}] partial cover for EQj, In particular this includes any schema Rj selected by A-DP-DNF-BCNF, as the conditions are checked in procedure ”decompose”. (2) It remains to show that our algorithm always finds a dependency preserving DNF and BCNF decomposition if such a decomposition exists. Our argument is the following: Algorithm “dependency preserving DNF and BCNF decomposition” computes all partial canonical covers pc for each essential equivalence class and their induced schema parti- tions sppc. If we were to check all combinations of them, we would obtain all connected component decompositions induced by canonical covers of Σ. By Theorem 4.52 some DP-DNF-BCNF decomposition E would have to be amongst them (we name it E to avoid confusion with the variable D in the algorithm). Our algorithm does essentially that, although it applies some optimizations, and we need to argue that none of them prevents us from finding E (or at least some decomposition in DP-DNF-BCNF). The first optimization happens when multiple partial canonical covers for an equiva- lence class EQj induce the same schema but different schema partitions. Here we check which of those schema partitions are in BCNF, and keep only one of them (if any BCNF partition exists at all). Neglecting schema partitions not in BCNF has no impact on the result, since they cannot be part of a BCNF decomposition E . Different partitions of the same schema dominate each other, since all subschemas in the partition have the same closure. Thus, while we may not find the decomposition E , we will find a decomposition E ′ which is equivalent to E w.r.t. domination and thus also in dependency preserving DNF and BCNF. The second optimization comes with the line remove all non-minimal (w.r.t. inclusion) schemas from IS Let s ∈ IS be non-minimal, i.e., s′ ( s for some s′ ∈ IS. Any decomposition containing s is strictly dominated by a decomposition using s′ instead of s, and thus not in DNF. Thus removing s from IS does not hinder us in finding E . The last optimization comes in the form of pruning the search tree with the check if (Rj \ AD minimal with (y c-domination optimal) EQj[D ∪ {(Rj \ AD) ∪ AD}] partial cover for EQj and Rj minimal with (y domination optimal) EQj[D ∪ {Rj}] partial cover for EQj) Here we will show that any Rj which do not pass the ”if-condition” cannot lead to a DP-DNF-BCNF decomposition. This can be argued as follows. If Rj \ AD is not minimal then we have R′j \ AD ( Rj \ AD 125 for some R′j such that EQj[D ∪ {(R′j \ AD) ∪ AD}] forms a partial cover for EQj. But then D ∪ {Rj} is strictly c-dominated by D ∪ {R′j} by Theorem 4.33, and this still holds after substituting schemas by their partitions using ISP or adding schemas (none of those being of higher order than Rj) to obtain a complete dependency preserving BCNF decomposition. Thus ISP (D∪{Rj}) cannot be completed to a DP-DNF-BCNF decomposition. If Rj is not minimal we argue similarly with domination and Theorem 4.36. We finish by discussing further improvements, which can help to speed up the algo- rithm. First, we want to stress that the check and Rj minimal with (y domination optimal) EQj[D ∪ {Rj}] partial cover for EQj) in procedure “decompose” is not redundant, despite the removal of all non-minimal Rj in ISj. The remaining Rj are minimal such that EQj[Rj] forms a partial cover of EQj, but this is a different condition: While ISP (D) is in BCNF, D need not be, so the FDs in EQj[D] could contribute towards a partial cover of EQj. However, the check can be skipped if EQj[D] = ∅, which can be tested quickly. The partial covers for a minimal equivalence class EQX of FDs all induce the same schema, namely the closure X∗ for EQX . This can be argued as follows: Only the FDs in EQX can help in computing the closure of a minimal key of EQX , and every attribute in X∗ must be part of some FD. Furthermore, X∗ is in BCNF since no FDs Y → A ∈ Σ∗a with closure Y ∗ ( X∗ exist. Thus we do not need to compute the partial canonical covers on minimal equivalence classes. Computing the set of all partial canonical covers on an equivalence class EQX directly can be inefficient. To improve this, we try to partition EQX into smaller autonomous sets S1, . . . , Sn ⊆ EQX using partial implication cycles as described in section 3.4, and compute all partial canonical covers on them. We could now obtain the partial canonical covers on EQX by forming the cross-union of the partial canonical cover sets on S1, . . . , Sn, and then proceed with computing the induced schema partitions. Instead, we first compute the sets ISP1, . . . , ISPn of schema partitions induced by the partial canonical covers on S1, . . . , Sn. Note that different partial canonical covers can (and often do, as our experiments show) induce the same schema partitions. We then compute the cross-union HX := ISP1 ∨ . . . ∨ ISPn which is a set of hypergraphs on R, with each hypergraph corresponding naturally to one or more partial canonical covers on EQX . It is easy to see that each hypergraph in HX induces the same schema partitions as the partial canonical covers it corresponds to. Since the number of hypergraphs in HX can be much smaller than the number of partial canonical covers on EQX , computing the induced schema partitions from HX can be much faster. Finally, we may abort immediately if some ISP |ISj is empty, since then no partial dependency preserving BCNF decomposition without duplicate attributes exists for EQj. 126 4.5.2 DNF and EKNF As dependency preserving BCNF decompositions do not always exist, combining DNF and BCNF is not always a feasible option. To avoid this problem, or as an alternative should a dependency preserving BCNF decomposition not exist, one could try to ensure other normal forms which always allow dependency preserving decompositions. The most well-known normal form with this property is 3NF. However, as was pointed out in [46], 3NF does not always enforce beneficial decomposition, even though they may not cause any loss of dependencies. The following example, taken from [46], illustrates this. Example 4.13. Let R = ABC and Σ = {A → B,B → A}. Then AC and BC are minimal keys of R, and thus all attributes are prime. Therefore R is already in 3NF, even though dependency preserving decompositions exist, such as {AB,BC} or {AB,AC}. As an improvement, the authors suggest a new normal form which is stronger than 3NF but still allows dependency preserving decompositions. They strengthen 3NF by allowing as RHS of a non-key FD only those prime attributes, which appear in the LHS of an atomic key dependency. Note that atomic FDs are called elementary in [46]. Definition 4.55. [46] Let R be a schema with FDs Σ. A FD X → A is called elementary if Σ∗ contains no FD X ′ → A with X ′ ( X. A key is elementary if it forms the LHS of an elementary FD. An attribute is an elementary key attribute if it lies in an elementary key of R. Definition 4.56. [46] Let R be a schema with FDs Σ. Then R is in elemental key normal form (EKNF) if for every non-trivial FD X → A on R (a) X is a key of R, or (b) A is an elementary key attribute for R. Note that the schema R from example 4.13 is not in EKNF, since neither A nor B are elementary key attributes. Thus EKNF may enforce useful decomposition which 3NF does not. As it turns out, algorithm “dependency preserving DNF decomposition” already pro- duces a decomposition into EKNF. Lemma 4.57. Let R be a single schema with FDs Σ. If R is in dependency preserving DNF, then it is also in EKNF. Proof. Assume that R is not in EKNF. Then there exists a FD X → A ∈ Σ∗a such that X is not a key of R, and A does not lie in the LHS of any key FD in Σ∗a. Furthermore, R is strictly c-dominated by the decomposition D := {R \ A} ∪ {S ( R | S is not a key of R} since A does not lie in R \ A, which is the only key schema in D. It remains to show that D is dependency preserving. Clearly the only FDs in Σ∗a which do not lie in Σ∗[D] are key FDs containing A. They must be of the form Y → A, since A does not lie in the LHS of any key FD. However, the FDs Y → X and X → A both lie in Σ∗[D], and together they imply Y → A. 127 Theorem 4.58. Let D be a decomposition produced by algorithm “dependency preserving DNF decomposition”. Then every schema RX ∈ D is in dependency preserving DNF w.r.t. Σ∗[RX ]. Proof. Let DX be any dependency preserving decomposition of RX . Then DX is domi- nated by the single schema R′X := ⋃ {Rj ∈ DX | Rj is a key of RX} Clearly R′X preserves all key FDs in Σ∗[D]. Thus EQX [DX ] ⊆ EQX [R′X ], so EQX [R′X ] implies EQX [RX ]. However, RX has been constructed minimal such that EQX [RX ] has some partial cover property. Thus R′X = RX , which shows that RX dominates every dependency preserving decomposition DX . It follows that RX is in dependency preserving DNF. Corollary 4.59. Algorithm “dependency preserving DNF decomposition” produces a de- composition into EKNF. Proof. Follows immediately from the last lemma and theorem. We note that this result is only due to our construction method, i.e., dependency preserving DNF does not imply EKNF in general. Example 4.14. Let R = ABCD and Σ = {A → B,B → C,CD → A}. Then the decomposition D = {ABC,ACD} is in dependency preserving DNF (note that it is not strictly c-dominated by {AB,BC,ACD} since C already appears in the key schema ACD). However, D is not in EKNF, since ABC contains B → C which violates EKNF. One could say that the benefit of EKNF is that it enforces ”locally” well-designed schemas, something which may not be forced by DNF if this “local optimization” does not provide a significant benefit for the overall size of the decomposition. The same holds true for BCNF or other “local” normal forms, i.e., normal forms which consider only a single schema. 128 Chapter 5 Summary We will briefly summarize the main results we obtained, and related problems which still remain open. 5.1 Main Results In chapter 2 we developed algorithms for computing a dependency preserving BCNF decomposition. The main result was an “linear resolution” algorithm for computing the atomic closure Σ∗a for a set of functional dependencies Σ. While Σ∗a can be exponential in Σ, we identified polynomial cases and showed that for finding a dependency preserving BCNF decomposition, it suffices to compute a subset of Σ∗a. Finally, we demonstrated how the results can be extended to a complex-valued data model. The “linear resolution” algorithm was then used in chapter 3 to compute the set of all canonical covers CC(Σ). For that we showed how hypergraphs can be decomposed using autonomous sets, which led to an efficient representation of CC(Σ). Perhaps more im- portant than the actual algorithm, we obtained insights into how functional dependencies interact. In particular, Theorem 3.36 allows us to split the task of creating a canonical cover into smaller, independent tasks of creating partial covers. Our theory of autonomous sets may well have applications in other disciplines. In chapter 4 we returned to database normalization by defining a new normal form “DNF” and providing algorithms for computing decompositions into DNF. This new nor- mal form was characterized both semantically and syntactically, and one of the main difficulties was in showing that the characterizations match. We established that in some sense, DNF is the proper generalization of existing normal forms, in particular BCNF, onto multiple schemas. Finally, we showed how dependency preserving DNF decomposi- tions can be computed, and how dependency preserving DNF and BCNF can be obtained simultaneously. Overall, this work focused on computing good schema decompositions. We provided characterizations of such “good” decompositions and practical algorithms to obtain them. The results offer new insights, in particular into the interaction of functional dependencies, and are of immediate practical use in creating automated design tools. 129 5.2 Open Problems We will point out some open problems related to our work, in the order as they appear. The results we obtained are based on the relational data model. In section 2.3 we discussed how the main results of chapter 2 can be extended to complex-valued data models. We believe that most (if not all) of the work done here can be extended to complex-valued data models, including XML-databases, in a similar fashion, though we did not investigate this. Other extensions to the classic relational model may also be of interest for further research. In particular, FDs behave differently in the presence of null values, which may lead to quite different results for derivation and normalization problems. In section 3.5 we discussed how we could restrict ourselves to deriving essential FDs when computing all canonical covers. This could be of use when trying to find dependency preserving decompositions, as we did in chapter 2. The problem here lies in testing whether a FD is critical, and it is not clear how this can be done efficiently, having computed only essential FDs. In some cases, the work done by Gottlob [18] might offer an efficient solution. The connection between the semantic and syntactic characterizations of DNF has been established using theorems 4.33 and 4.36. However, Theorem 4.36 is restricted to functional dependencies, and does not hold in the presence of multivalued or join dependencies. It would be interesting to know what the correct syntactic characterization of DNF would be in such cases. Also, our assumption that all domains are infinite and unbounded is not always real- istic. It is not clear yet how DNF behaves for finite or bounded domains. Our definition of DNF focuses on minimizing size, with the intuition that this could minimize redundancy and update anomalies as well. Alternatively one could attempt to minimize redundancy or update anomalies directly, and it would be interesting to see whether the results are the same. Additionally, one could try to achieve other desirable properties for a decomposition, such as e.g. acyclicity [6]. While we have an algorithm for computing a decomposition into dependency preserving DNF, we currently lack such an algorithm for lossless DNF. Finally, it is unclear how to test whether a given decomposition is in DNF, and how difficult such a test is. We suspect that it is at least co-NP hard. 130 Index 3NF, 6 algorithm 3NF Synthesis, 7 BCNF Decomposition, 7 closure, 5 d.p. DNF and BCNF decomposition, 120 d.p. DNF decomposition, 114 divide and resolve, 70 essential linear resolution, 86 least critical cover synthesis, 28 revised, 31 linear NA-resolution, 43 linear resolution, 16 revised, 18 partial implication partitioning, 79 recursive autonomous partitioning, 61 substitute FD, 30 update atomic closure, 23 armstrong axioms, 4 associated set, 98 atomic closure, 12 atomic implication closure, 77 attribute, 3 cyclic, 33 depending, 98 extraneous, 15 flat, 36 nested, 36 basis, 41 completion, 41 representable, 41 unitary, 40 prime, 6, 83 autonomous set, 55, 62 BCNF, 6 c-domination, 91–93 cartesian, 80 complete, 4 connected components, 57 containing schema closures, 93 cover, 4 canonical, 12 LR-reduced, 72 partial, 63, 113 relative, 64 csc-domination, 94 cycle, 32 cyclic closure, 33 decomposition connected component, 118 lossless, 5 principal, 118 dependency functional, 4 adjacent, 76 atomic, 12, 41 base, 14 critical, 26, 46 derived, 14 essential, 80 essentially derivable, 82 forced, 80 maximal atomic, 41 neighborhood, 80 redundant, 4 singular, 12, 41 standard form, 41 substituting, 14 trivial, 4 unitary, 20 join, 5 multivalued, 5 dependency basis, 101 derivation trees, 4 domain, 3 domination, 81, 91–93 131 domination normal form, 91 edge partial, 59 elemental key normal form, 124 elementary, 124 equivalence, 63, 64, 112 essential closure, 81 essentially cartesian closure, 81 faithful, 6 finite implication, 4 fixed parameter tractable, 22 hypergraph, 54 simple, 54 implication cover, 67 relative, 67 implication dependency, 67 inclusion-domination, 94 integrity constraints, 3 isolated set, 57 join, 5, 38 lossless, 6 k-mapping, 97 k-reducing, 96 KCNF, 107 LHS-minimal, 7 LHS-reduced, 7 LHS-restricted, 49 lifting, 97 LM-resolution, 15 meet, 38 minimal autonomous sets, 56 minimal cover, 18 minimal isolated set, 57 partial determination, 32 partial implication, 76, 77 projection, 6, 53 relation, 3 relativation, 65 representation basis, 40 resolution:linear, 15 restricted, 49 schema, 3 closed, 99 induced, 117 induced partition, 117 open, 99 sound, 4 subset construction for functional dependencies, 99 generalized, 102 superedge, 59 partial, 59 support, 54 synchronization hypergraph, 100 transversal, 57, 62 width, 21 132 Bibliography [1] M. Arenas and L. Libkin. An information-theoretic approach to normal forms for relational and XML data. In PODS, pages 15–26, 2003. [2] W. W. Armstrong. Dependency structures of data base relationships. In IFIP Congress, pages 580–583, 1974. [3] G. Ausiello, A. D’Atri, and D. Sacca`. Graph algorithms for functional dependency manipulation. J. ACM, 30(4):752–766, 1983. [4] G. Ausiello, A. D’Atri, and D. Sacca`. Minimal representation of directed hypergraphs. SIAM J. Comput., 15(2):418–431, 1986. [5] C. Beeri and P. A. Bernstein. Computational problems related to the design of normal form relational schemas. ACM Transactions on Database Systems, 4(1):30–59, 1979. [6] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. J. ACM, 30(3):479–513, 1983. [7] C. Berge. Hypergraphs: Combinatorics of Finite Sets. Elsevier Science Pub. Co., 1989. [8] J. Biskup. Inferences of multivalued dependencies in fixed and undetermined uni- verses. Theor. Comput. Sci., 10:93–105, 1980. [9] J. Biskup. Achievements of relational database schema design theory revisited. In L. Libkina and B. Thalheim, editors, Semantics in Databases, volume 1358 of Lecture Notes in Computer Science, pages 29–54. Springer, 1995. [10] J. Biskup, U. Dayal, and P. A. Bernstein. Synthesizing independent database schemas. In SIGMOD Conference, pages 143–151, 1979. [11] M. A. Casanova, R. Fagin, and C. H. Papadimitriou. Inclusion dependencies and their interaction with functional dependencies. In PODS ’82, pages 171–176, New York, NY, USA, 1982. ACM Press. [12] E. F. Codd. Further normalization of the data base relational model. IBM Research Report, San Jose, California, RJ909, 1971. [13] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer, 1999. 133 [14] R. Fadous and J. Forsyth. Finding candidate keys for relational data bases. In W. F. King, editor, Proceedings of the 1975 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 14-16, 1975, pages 203–210. ACM, 1975. [15] R. Fagin. Multivalued dependencies and a new normal form for relational databases. ACM Trans. Database Syst., 2(3):262–278, 1977. [16] R. Fagin. Normal forms and relational database operators. In P. A. Bernstein, editor, SIGMOD Conference, pages 153–160. ACM, 1979. [17] M. R. Garey, D. S. Johnson, and L. J. Stockmeyer. Some simplified NP-complete graph problems. Theor. Comput. Sci., 1(3):237–267, 1976. [18] G. Gottlob. Computing covers for embedded functional dependencies. In PODS, pages 58–69, 1987. [19] G. Gottlob. On the size of nonredundant FD-covers. Inf. Process. Lett., 24(6):355– 360, 1987. [20] G. Gottlob, R. Pichler, and F. Wei. Tractable database design through bounded treewidth. In PODS, pages 124–133, 2006. [21] F. Harary. Graph Theory. Addison-Wesley, 1995. [22] J. Ja¨rvinen. Dense families and key functions of database relation instances. In R. Freivalds, editor, FCT, volume 2138 of Lecture Notes in Computer Science, pages 184–192. Springer, 2001. [23] P. Kandzia and M. Mangelmann. On covering Boyce-Codd normal forms. Inf. Pro- cess. Lett., 11(4/5):218–223, 1980. [24] H. Koehler. Finding faithful Boyce-Codd normal form decompositions. In S.-W. Cheng and C. K. Poon, editors, AAIM, volume 4041 of Lecture Notes in Computer Science, pages 102–113. Springer, 2006. [25] H. Koehler. Domination normal form: decomposing relational database schemas. In ACSC ’07: Proceedings of the thirtieth Australasian conference on Computer science, pages 79–85, Ballarat, Australia, 2007. Australian Computer Society, Inc. [26] J. Lechtenbo¨rger. Computing unique canonical covers for simple FDs via transitive reduction. Inf. Process. Lett., 92(4):169–174, 2004. [27] M. Levene and G. Loizou. A Guided Tour of Relational Databases and Beyond. Springer, 1999. [28] E. A. Lewis, L. C. Sekino, and P. D. Ting. A canonical representation for the relational schema and logical data independence. In IEEE COMPSAC, pages 276– 280, 1977. [29] S. Link. Dependencies in Complex-valued Databases. PhD Thesis, 2004. 134 [30] S. Link. On multivalued dependencies in fixed and undetermined universes. In FoIKS, pages 258–277, 2006. [31] C. L. Lucchesi and S. L. Osborn. Candidate keys for relations. Journal of Computer and System Sciences, 17(2):270–279, 1978. [32] D. Maier. Minimum covers in the relational database model. Journal of the ACM, 27(4):664–674, 1980. [33] D. Maier. The Theory of Relational Databases. Computer Science Press, 1983. [34] M. E. Majster-Cederbaum. Ensuring the existence of a BCNF-decomposition that preserves functional dependencies in O(n) time. Information Processing Letters, 43(2):95–100, 1992. [35] H. Mannila and K.-J. Ra¨iha¨. Design by example: An application of armstrong rela- tions. J. Comput. Syst. Sci., 33(2):126–141, 1986. [36] H. Mannila and K.-J. Ra¨iha¨. The Design of Relational Databases. Addison-Wesley, 1987. [37] H. Mannila and K.-J. Ra¨iha¨. Practical algorithms for finding prime attributes and testing normal forms. In Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, March 29-31, 1989, Philadelphia, Pennsylvania, pages 128–133. ACM Press, 1989. [38] S. L. Osborn. Testing for existence of a covering Boyce-Codd normal form. Informa- tion Processing Letters, 8(1):11–14, 1979. [39] H. Saiedian and T. Spencer. An efficient algorithm to compute the candidate keys of a relational database schema. Comput. J., 39(2):124–132, 1996. [40] R. C. Shock. Computing the minimum cover of functional dependencies. Inf. Process. Lett., 22(3):157–159, 1986. [41] R. E. Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146–160, 1972. [42] D.-M. Tsou and P. C. Fischer. Decomposition of a relation scheme into Boyce-Codd normal form. In ACM annual conference, pages 411–417, 1980. [43] M. W. Vincent. A corrected 5NF definition for relational database design. Theor. Comput. Sci., 185(2):379–391, 1997. [44] M. W. Vincent. Redundancy elimination and a new normal form for relational database design. In Semantics in Databases, volume 1358 of Lecture Notes in Com- puter Science, pages 247–264. Springer, 1998. [45] J. Wang and R. W. Topor. Removing XML data redundancies using functional and equality-generating dependencies. In H. E. Williams and G. Dobbie, editors, ADC, volume 39 of CRPIT, pages 65–74. Australian Computer Society, 2005. [46] C. Zaniolo. A new normal form for the design of relational database schemata. ACM Trans. Database Syst., 7(3):489–499, 1982. 135 Appendix Some of the algorithms described in this work have been implemented, and we present experimental results using randomly generated test data. Test Data One issue that always arises in this context is that of using appropriate test data. In our case, we require relations R and sets of FDs Σ on them. These were created as follows: For a fixed number k of attributes in R and n FDs in Σ, we generate n FDs on R independently at random. In that, a FD X → Y is constructed by independently choosing the attribute sets X, Y ⊆ R. We create an attribute subset by selecting m attributes uniformly at random. In this we allow duplicates, so that the resulting set could have cardinality less than m. The probability for m to have value v is 1 2 v, i.e., there is a chance of 12 that m = 1, a chance of 14 that m = 2 a.s.o. Note that this approach favors FDs with small left- and right hand sides, but every FD X → Y on R with non-empty sets X, Y can potentially be generated. Algorithms The following algorithms have been implemented and tested: • Linear Resolution (for computing atomic closure) We implemented the revised version from section 2.1.2. This improved performance slightly, compared to the basic version of the algorithm. • Least Critical Cover Synthesis (for finding faithful BCNF decomposition) We implemented the revised version from section 2.2.2. This had a huge impact on performance in all cases, again compared to the basic algorithm. • Divide and Resolve (for computing all partial canonical covers) We used partial implication cycles to find smaller autonomous sets, as described in section 3.4. This had a huge impact on performance in many cases. • dependency preserving DNF and BCNF Decomposition We implemented all the improvements described in section 4.5.1. Computing the induced schema partitions on the smaller autonomous sets before forming the cross- union had a huge impact on performance in some cases. 136 Results The tests are organized by number of attributes k in the relation R and number of FDs n in the original cover Σ. For each case we ran our algorithms on 1000 test sets of FDs and recorded the following data: • size of the atomic closure Σ∗a, measured in number of FDs • number of canonical covers in CC(Σ) • number of partial canonical covers per autonomous set (average) • number of disjoint non-trivial1 autonomous sets found • time taken for computing: – atomic closure Σ∗a – canonical covers CC(Σ) in decomposed form (after computing Σ∗a) – dependency preserving BCNF decomposition (after computing Σ∗a) – dependency preserving DNF and BCNF decomposition (after computing CC(Σ)) For purpose of judging runtime, we note that the implementation was in java, and tests were performed on a 1.8 GHz PC. We summarize the test results by reporting the mean value for each parameter mea- sured, i.e., the minimum value x such that at least 50% of the test results are less or equal to x. We report the mean instead of e.g. the average, since the parameters measured can grow exponentially, and do so in some cases. As a result, only the maximal value has a significant impact on the average. Furthermore, we aborted tests in some cases to avoid excessive computation time and/or memory problems. Such invalid tests do no pose much of a problem for measuring the mean values, since we can safely count them as results larger than the mean (as long as less than 50% of all tests are invalid, which was always the case). #Att #FDs #AFDs #CCs #PCCs #Aut tAC tCCs tBCNF tDNF 5 5 6 1 1 4 < 1ms 2ms 1ms 1ms 10 5 10 1 1 6 < 1ms 3ms 1ms 1ms 10 10 26 12 1.4 9 1ms 24ms 1ms 3ms 15 10 37 2 1 13 3ms 20ms 2ms 4ms 15 15 71 284 1.9 17 7ms 144ms 4ms 10ms 20 15 88 18 1.2 21 48ms 98ms 5ms 11ms 20 20 156 10935 3.4 25 62ms 450ms 7ms 36ms 30 15 105 2 1 24 59ms 32ms 6ms 12ms 30 20 222 24 1.1 31 100ms 177ms 9ms 21ms 50 20 198 1 1 36 113ms 88ms 10ms 68ms 50 30 1026 24 1.1 52 826ms 445ms 373ms 664ms 1By trivial we mean autonomous sets for which the empty set is a partial cover. 137 Observations We shall conclude by noting some observations we made when examining the test results. • The algorithm “Least Critical Cover Synthesis” performs much faster than suggested by the complexity bound which we established in section 2.2.2. This appears to be due to the improvements we made over the basic algorithm. It is not clear whether the complexity bound established can be improved. • Algorithm “dependency preserving DNF and BCNF Decomposition” also performed much faster than we expected. The main reason for this appears to be that back- tracking after a recursive call of procedure “decompose” is hardly ever necessary. • In cases where the ratio of FDs over attributes is large, we very often obtained huge numbers of canonical covers. The reason for this becomes clear when we observe the asymptotic case, where Σ contains all non-trivial FDs with non-empty LHS. Then every attribute in R is a key, and Σ∗a is small (quadratic in the number of attributes k). However, the number of canonical covers becomes hyper-exponential in the number of attributes. This can be seen as follows: Every canonical cover corresponds to a minimal strongly connected graph on k vertices. This includes all directed cycles (among others), and there are (k − 1)! such cycles. In all examples we observed, where the number of canonical covers was huge, we found a maximal equivalence class which had a structure very similar to the asymp- totic case. It thus might be worth investigating how computing all partial canonical covers on such equivalence classes could be avoided. 138