Current Microsoft MVP Awardees

MS Access™ MVP Awardee from July, 2007 through June, 2019


Active Topics

Screen shot of Blog

3. Answering the Big Questions with (Relatively) Simple Answers

Publish Date -- Mar 13, 2021


Mathematical and Linguistic Logic — The Rules of Normalization and the Rules of Rhetoric.

Relational database design is based on predicate logic mathematics, more specifically, the mathematical theory of sets. Unfortunately for many folks, the mathematics involved tend to be dauntingly complex. And I'm speaking from personal experience here as well as from years of participating in online Access forums.

At least partly for that reason, people have codified a set of design principles, or rules, to make it more accessible to a wider audience. We know these principles as the Rules of Normalization. There are at least five Rules of Normalization, although only the first three are considered crucial to sound database design in many, if not most, cases. To make matters more challenging, different writers have come up with different formulations of the rules! Here's a fairly typical overview of one of those formulations.


Mathematical and Linguistic Logic — The Rules of Normalization and the Rules of Rhetoric.

Relational database design is based on predicate logic mathematics, more specifically, the mathematical theory of sets. Unfortunately for many folks, the mathematics involved tend to be dauntingly complex. And I'm speaking from personal experience here as well as from years of participating in online Access forums.

At least partly for that reason, people have codified a set of design principles, or rules, to make it more accessible to a wider audience. We know these principles as the Rules of Normalization, as first elaborated by Edgar R Codd*. There are at least five Rules of Normalization, although only the first three are considered crucial to sound database design in many, if not most, cases. To make matters more challenging, different writers have come up with different formulations of the rules! Here's a fairly typical overview of one of those formulations.

For a table to be in the First Normal Form, it should follow these 4 rules:

  1. It should only have single (atomic) valued attributes/columns.
  2. Values stored in a column should be of the same domain
  3. All the columns in a table should have unique names.
  4. The order in which data is stored does not matter.

For a table to be in the Second Normal Form

  1. It should be in the First Normal form.
  2. It should not have Partial Dependency.

A table is said to be in the Third Normal Form when

  1. It is in the Second Normal form.
  2. It doesn't have Transitive Dependency.

Do WHAT Now?

See what I mean about accessibility? What are atomic valued attributes? What is a partial dependency and how does it differ from a transitive dependency and how do you know either of them when you see them? It's enough to make a grown person cry, right?

Well, maybe not THAT bad, but seriously, while it is does take dedication and study, nearly anyone can understand and implement the Rules of Normalization. As a matter of fact, if you intend to work with MS Access to build relational database applications for yourself, your organization or clients and customers, you must do so. What I'm going to try to do here in this post is explain these concepts in a more accessible set of metaphors, which I hope will make more sense to new-comers to the field.

As self-identified MS Access trainer and advisor, one of the ways I've tried to penetrate that layer of complexity on behalf of new developers is to offer analogies between the natural languages we all speak and the "language of the database" elaborated in the Rules of Normalization. In a recent post, for example, I compared Tables, Fields and Relationships between tables in a database to the Nouns, Adjectives and Verbs of a natural language.

I would like to expand on that further in this post and in some of the posts that follow it.

Natural Language and Rhetoric, Database Language and Normalization

While the range of potential topics for discussion within a natural language is open-ended, we rely on a commonly understand set of rhetorical patterns for all of them. In other words, we use the same language structures to talk about an endless variety of content: poetry, politics, business, science, wisdom, war, disease, culture, family, religion, history, architecture, and on and on and on. Content varies, but the rhetorial patterns we employ do not.

The important concept at work here is that there is considerable overlap in the rhetorical structures that organize our discussions of each of those subjects, regardless of the specific vocabulary used. It is, therefore, my contention that we can also extend the concepts of rhetorical patterns from Natural Languages to the Language of Relational Database Applications. That is why I propose the analogy of Tables to Nouns, Fields to Adjectives, and Relationships to Verbs. My task here is to explain how that works.

One central thesis of Will Pitkin's PhD Dissertation, A Pedagogical Model of Discourse** was his argument that virtually all "discourse blocs" follow the same clearly identifiable rhetorical patterns. That's important because understanding how discourse is built up from those building blocs is an essential skill in correctly interpreting the intent of any expository writing. As laypersons, we seldom pay explicit attention to that fact, but it is crucial to our ability to communicate complex subject matter. They are the equivalent of the "Rules of Normalization", but applied to rhetoric and we all intuitively rely on them to communicate in writing. I think that, because we are universally familiar with discourse patterns, they can serve as a bridge to understanding database normaliation which follows similar rules.

Databases, in my opinion, are very much like Pitkin's Discourse Blocs.

Any given database uses the same building blocks, or more accurately, the same strategies for assembling building blocks, regardless of the subject matter it contains.

In many discussions of database design, the components I am talking about are commonly referred to as the MetaData of the database. You may have heard that term. I think there is considerable congruence between that and my concept of the Language of the Database.

A college professor, for example, might have a database to organize her library of poetry and literature, but it would probably not have a component devoted to rare mammals of the tropics. That means tables in her database would include things like "Author", "Book" or "Publisher". The scientist studying rare mammals, on the other hand, would have tables devoted to "Species" and "Habitats", but probably nothing about rhyme and meter. Yet poet and scientist both rely on the components of a database—tables, attributes and relationships between tables—in exactly the same ways.

The Disparate Role of Ambiguity

One problem for a database developer is that natural languages tend to be a lot more forgiving of ambiguity than the language of the database application. We can contrast natural languages and database languages is to recognize that ambiguity is an inherent, and sometimes even a desirable, element of natural language discourse. Poetry relies on it, as do politics and humor. The interplay between two or more possible interpretations of a phrase or sentence delights us, confuses us, and even enrages or mislead us. All of us have employed ambiguity at one time or another to achieve one or more of those goals. And the more skillful the writer, the more enjoyable it can be.

Technical discourse—which is the closest natural language analogue to the relational database applications of interest to us—is the most likely branch of rhetoric to try to limit ambiguity and for the same reasons.

To be useful, the language of a database must exclude ambiguity to the greatest extent possible. Reducing ambuiguity, or conversely, delineating differences is, after all, one of the primary reasons the scientist goes to the trouble of classifying the specimens collected for study in any given field. If you can't differentiate accurately and precisely between specimens, what's the point? It matters a great deal that the attributes of a small furry mammal are clearly and unambiguously stored in fields in a table.

So how does this relate to our theory of database design? Let's take a look at the three components needed to "describe" any subject in the language of the database. Natural languages actually consist of several additional components, but in my view, these three are enough to adquately convey the concepts.

The Language of the Database

  • Tables are the Nouns of the Database Language, i.e. the names of things
  • Fields are the Adjectives of the Database Language, i.e. the attributes, or distinguishing properties of things
  • Relationships are the Verbs of the Database Language, i.e. the actions or statuses which connect one thing to another
  • Tables—The "Nouns" of the Database Language

    Tables are the objects in which we store data about the subjects of concern to that database. A table is the equivalent of a noun, in other words. In a database of poetry, for example, we'll need to keep track of things like "poets" and "poems", at the very least. If the database is intended to support analysis of poems, then it also needs to track (or talk about) things like "rhyme" and "meter" and so on. If it's intended to support biographical history of poets, it will need still other tables capturing the identities of those poets. Each of those topics—or entities to use a more database-friendly term— is a candidate to become one table. The names of those things will eventually correspond to the names of the tables. Such a database, in other words, will have one table called "Poet", one table called "Poem", it may have one Lookup table called "Rhyme", and one Lookup table called "Meter". More importantly, only those things identified as "poems" can be stored in the poem table. On the surface, this seems obvious when we are in the natural language realm. A poem is a poem, not a helicopter. But it's not always that simple, unfortunately.

    Is James Joyce's Ulysses, for example, intended to be a novel or a poem? It's certainly much longer than most other poems, akin to a novel, but the language and style of it set it apart from most novels. It's been compared to Homer's Odyssey, which is itself an extended poetic tour de force hundreds of pages long. So is it a poem or a novel, or a bit of both? That's where ambiguity fits nicely in a natural language. It leads to intriguing commentary and analysis by scholars. But it is much less desirable in a database. As a scholar, you can talk about Ulysses without ever classifying it definitively. However, as a data management task, you can't store it in a database both as a poem and as a novel. It's one or the other so far as a relational database application is concerned. So you have to decide, poem or novel.

    And that means we have to identify and define the characteristics by which we classify "Poems" in this database. By identifying the appropriate "poetic" attributes, we can safely store data about poems without ambiguity. And that leads to the second element.

    Attributes—The "Adjectives" of the Database Language

    To make it clear what is, and what isn't, a Poem, the Poem table needs to store attributes, or characteristics, that pertain to Poems and ONLY to Poems. Poems may or may not have rhyme schemes, for example. And if they do, that rhyme scheme will be one of those previously defined for that attribute, iambic pentameter, for example, or blank verse. That means one of the fields needed for the Poem table is the attribute field called "RhymeScheme". There are many different possible values for that attribute, but there's only one category, or field, for the RhymeScheme attribute. I'll provide more information on this in a later blog.

    Relationships Between Tables—The "Verbs" of the Database Language

      Poets WRITE poetry.
      Poems ARE WRITTEN BY poets.

    Within the context of this relational database, no other relationship exists between poems and poets. Poems are not BAKED BY poets. Poets don't DRIVE poems. Poets can, I suppose HATE some poems or LOVE other poems. But unless the database is intended to track that kind of information (i.e. a database of Wordsworth's favorite poetry), that's not relevant. In order for the database of poetry to make sense, then, there is one and only one possible relationship between the tables for Poets and Poems: writing them. If other relationships must be tracked, then additional tables are needed to support those relationships.

    The verb "WRITE" in this natural language formulation of the relationship corresponds to the logical database relationship between the Poet table and the Poem table.

    Let's foreshadow a future discussion by pointing out that Poets seldom limit themselves to a single poem in their entire career, at least they don't do so by design. So, while the relationship itself is between "poet" and "poem", the reality is that we have a table of poets and a table of poems, there will be one OR MORE poems in the poem table for each of the poets. In other words, we can identify two ways to classify this relationship. One poem is written by a single poet, but a single poet can write many poems. I think that's enough for now. I'll continue my thoughts on the Language of Databases in future installments. I'll also try to explain the Rules of Normalization, as outlined above, in the context of those discussions.

    =============================================

    * Edgar F. Codd, inventor of the relational model for database management

    **A Pedagogical Model of Discourse, Willis J. Pitkin, 1973

    Elements of English Grammar by Kollakolla from Pixabay


    Feedback is welcome and appreciated. ghepworth@gpcdata.com

    ©Copyright 2010- Grover Park Consulting. All Rights Reserved.
    Free Resources and Full Working Demos for Microsoft Access Database Design and Development