A book in a class of its own. I mean, how often is an IT book from 1978 still relevant decades later?
Speaking of years, what I read was actually the 2nd edition from 2000. Somebody wrote on the internet (so it must be right) that the editions after that were heavily edited after the author’s death, and if you want the real thing, you get to try and find none other than the 2nd edition. I tried hard, but in the end there was a .pdf, some printing, some sewing, some gluing, and the result can be seen in the photo in this blog post.
So, what is it and was it worth it? Yes, and yes.
If you think that “yes” is an unsuitable answer to the question “what is it”, you should definitely read this book. It’s hard to describe this book, it’s a bit of everything. But first and foremost it’s philosophy. Yeah, philosophy applied to data engineering, this is what I’m going with. You won’t learn anything directly quotable for your next FAANG interview from this one. You might on the other hand step into a few existential crises if you haven’t had enough already. And you might laugh a few times.
This is the last paragraph of the preface to the second edition:
I repeat the invitation, made in the book’s original preface, to discover for yourself what you might think the book is about. It just might be about you. But if that’s too much pop psychology for your comfort, if that’s too invasive of your personal space, then just read it for its insights into data processing and reality.
Back when I was reading this book (almost a year ago), I was training the skill of summarizing, so here are my notes.
chapter 1. The map is not the territory, and what we’re doing with data modeling is not even making a map of reality. We store data about past events (how real are they?), fictional characters, and unborn heirs; we store fuzzily defined concepts (what’s “one street”?) separated into ever-custom categories, which are pretty much always incompatible between systems. Instead we’re making a map of (certain) people’s perceptions of some ever-ambiguous information.
chapter 2. There are different layers in information systems: application/user views, physical layout, and logical (concerned with entities and relationships). All of them change, and that better be handled without having to touch all of them at once. A role of Database Administrator is emerging as a response to that need, and they (as a person or a group) are the “certain people” from the chapter 1 summary. One can separate an information system into a repository (storage), data processor, and an interface between them, but the distinction between the repository and the processor is not that clear: the system contains not only the data inserted into it, but also the data deduced from what was inserted (e.g. through calculation or even more implicitly by the mere fact of existence of some record).
chapter 3. Names are ambiguous, and sometimes even contradictory (how many blackboards are black?). This is handled by adding context and thus narrowing the scope, aiming for uniqueness. Most names contain additional information: they are not just identifiers, they imply things like the ethnic background of a person, or (in a particular form) their profession or degree. Names and objects can be in many-to-many relationships. Names can change over time. Usually the information system either disallows this (pretending change doesn’t exist) or treats names as attributes (at the cost of increased storage space). Artificial identifiers (“surrogates”) are easier to deal with. You can have a one-to-one relationship to the things they represent; such identifiers don’t imply anything about related entities; etc. The author argues that non-artificial identifiers (words) are inherently ambiguous, because there is an infinity of concepts to be named and a limited number of reasonably long words one can make up of a finite alphabet (and pronounce, I would add). The fuzziness is also a feature: if the word “chair” is stretchy enough, you don’t need new words every time designers come up with something slightly different.
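To make the surrogate idea a bit more tangible, here is a minimal Python sketch. It is my own illustration, not something from the book: `Person`, `register`, and the counter-based id generator are made-up names, and a real system would hand out identifiers differently.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical id generator: the numbers mean nothing beyond "this one thing".
_surrogates = itertools.count(1)


@dataclass
class Person:
    # The name is just an attribute; identity lives in the surrogate.
    name: str
    surrogate: int = field(default_factory=lambda: next(_surrogates))


people = {}  # surrogate -> Person


def register(name):
    """Store a new person and return the surrogate, not the name."""
    person = Person(name)
    people[person.surrogate] = person
    return person.surrogate


first = register("Alex Smith")
second = register("Alex Smith")    # same name, different person
people[first].name = "Alex Jones"  # a rename leaves identity untouched
```

Two people can share a name without colliding, and a rename is an ordinary attribute update, because nothing else in the system refers to the person by name.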
chapter 4. Relationships can have different features. Complexity: from one-to-one to many-to-many. Categories: either side of the relationship might be constrained by zero, one or several categories. Self-relation: to themselves and/or to things in the same category. Optionality: on either side. Multiplicity: more than one relationship between the same pair of things. Transitivity. Symmetry. Anti-symmetry. Implication: the occurrence of two relationships implies the third. Subset: one relationship is a subset of another. Restrictions: maximum number of things to be related to, or any other condition. Naming conventions: no name for relationship, one, or two (for each direction). And this is just for binary relationships. The challenge to express a relationship in the system usually comes from it having various combinations of these features. Relationships can also be related to each other, or their instances might be. They can be established and broken, and if at that moment they aren’t represented explicitly by some link, bugs are likely. There are plenty of relationship behaviors for an information system to support.
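A few of these features are easy to state as checks over a plain set of pairs. The sketch below is my own toy illustration in Python (the relationship names and the people in them are invented), and it covers only symmetry, anti-symmetry, and transitivity of a binary relationship.

```python
def is_symmetric(pairs):
    """Every (a, b) comes with its mirror (b, a)."""
    return all((b, a) in pairs for a, b in pairs)


def is_antisymmetric(pairs):
    """No pair of distinct things is related in both directions."""
    return all(a == b or (b, a) not in pairs for a, b in pairs)


def is_transitive(pairs):
    """Whenever a-b and b-d hold, a-d holds too."""
    return all(
        (a, d) in pairs
        for a, b in pairs
        for c, d in pairs
        if b == c
    )


# Invented example data.
married_to = {("ann", "bob"), ("bob", "ann")}
ancestor_of = {("ann", "bob"), ("bob", "cy"), ("ann", "cy")}
```

Here `married_to` comes out symmetric but not transitive, while `ancestor_of` is anti-symmetric and transitive. A system that wants to support both has to know which rules apply to which relationship, and that is before optionality, categories, and the rest of the list.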
chapter 5. The author says he can’t strictly define the difference between attributes, relationships, and categories, therefore they’re all the same thing.
chapter 6. Types and sets can be expressed through entities and relationships, and since those are more general, you get closer to representing the real world (and its complexity unfortunately — but the other option is to get inconsistency by pretending that complexity doesn’t exist). Set theory is not directly applicable to any data processing system since set theory only deals with sets that are determined by their population — there is no notion of a set with changing population. Also there is only one empty set, while an empty table of employees is not the same as an empty table of unicorns.
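The empty-set point can be shown in a few lines of Python. This is only a sketch under my own assumptions: `Table` is a made-up stand-in for whatever a real system uses, holding a name plus a mutable population.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical minimal "table": a name and a changing population.
    name: str
    rows: set = field(default_factory=set)


employees = Table("employees")
unicorns = Table("unicorns")

# As mathematical sets, the two empty populations are one and the same.
same_population = employees.rows == unicorns.rows

# As tables, they are still different things.
same_table = employees == unicorns

# The population changes, yet we keep calling it the same table.
before = frozenset(employees.rows)
employees.rows.add("Ann")
after = frozenset(employees.rows)
```

In set-theory terms `before` and `after` are simply two different sets. The notion of "the employees table, which used to be empty" lives outside of set theory, in the `Table` wrapper.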
chapter 7. Putting business things into a model is a big investment, so it’s important to get it right, preferably on the first attempt. But before the system with a well-architected model is in place, it’s hard to justify the effort, thus everyone ends up with an easier but worse model instead. Usually the model describes data processing activities rather than whatever the business is dealing with. The prevalent data model is the record-based one.
chapter 8. The record-based data model is better suited for data processing than for reflecting the natural structure of information. For example, this model assumes or prefers that any thing has exactly one type and is always referred to by the same name in the same format. That all entities are always distinguishable from each other. That if they share a type, they share the set of their attributes in the same format. It’s not at all obvious for someone trying to use the information in the system why some queries are possible/easy and others aren’t. Relationships are often represented differently even if they are fundamentally not that different (implicitly as a field in a record for a one-to-many kind, explicitly as a separate table for the many-to-many, etc). It can make them hard or impossible to traverse in a query or to apply changes properly. An example of a simple but not easy relationship: an employee or a department can own a vehicle or a piece of furniture. One approach is to create another level of abstraction: super-types for owners and owned things. Another is to define 4 kinds of “owns” relationships, which would complicate queries and validation a lot, and that’s just with 2x2 types.
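Here is roughly what the super-type approach to the ownership example could look like, sketched in Python. All class and function names are my own invention for illustration, and a class hierarchy is just one way to express super-types.

```python
class Named:
    def __init__(self, name):
        self.name = name


# The extra level of abstraction: one super-type per side of "owns".
class Owner(Named):
    pass


class Ownable(Named):
    pass


class Employee(Owner):
    pass


class Department(Owner):
    pass


class Vehicle(Ownable):
    pass


class Furniture(Ownable):
    pass


owns = set()  # a single relationship instead of four


def add_ownership(owner, thing):
    """Validate once against the super-types, whatever the concrete kinds."""
    if not isinstance(owner, Owner) or not isinstance(thing, Ownable):
        raise TypeError("owns() relates an Owner to an Ownable")
    owns.add((owner, thing))


def owned_by(owner):
    """One query covers vehicles and furniture alike."""
    return sorted(thing.name for o, thing in owns if o is owner)


ann = Employee("Ann")
sales = Department("Sales")
add_ownership(ann, Vehicle("car"))
add_ownership(sales, Furniture("desk"))
add_ownership(sales, Vehicle("van"))
```

With four separate relationships, `owned_by` would have to look in two places per owner kind and the validation would be repeated four times. The price of the single relationship is that every thing now needs to fit somewhere in the `Owner`/`Ownable` hierarchy.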
chapter 9. Three other models share most of the issues with the record-based one: relational, hierarchical, and network. Relational is pretty equivalent to record-based. One of the distinguishing features is the set-oriented nature of operations: deal with sets of records, not process them one by one. The concept of a domain is present, but domains are mostly defined as a class of strings (characters, integers, etc), so it doesn’t change much in terms of ensuring the validity of relationships. Hierarchies (e.g. the IMS system): a family of tree-like sets of one-to-many relationships; one kind of thing can only appear on one level. With a workaround of so-called logical relationships one can make many-to-many relationships kinda possible, but programs won’t be able to process those easily. Networks (in particular DBTG) don’t really support many-to-many relationships.
chapter 10. 1:n relationships are often modeled differently than m:n for reasons like physical storage. This affects the ways available for the users to get the information. n-ary relationships can be decomposed into binary ones. One approach is: instead of A relates to B relates to C, make "A relates to B" a thing that relates to C. Choosing one of all the possible combinations is sad though. Another approach is: treat the whole relationship between A, B, and C as its own thing X. Now we have three binary relationships: AX, BX, CX. There’s a third approach, involving a hypergraph, which is isomorphic to the second approach. But with a hypergraph, the thing itself can’t be connected to something else. Binary relationships are simple. Simplicity can mean either a small vocabulary or concise descriptions. A reducible record is one that can be split into shorter records so that the original can be reconstructed without losing information or, more importantly, generating spurious false data. Binary relationships can be irreducible/"elementary facts", which might be simpler to handle (add, delete, update). For example by making the implicit coexistence constraint of things in one long record into an explicit one. Without explicitly named relationships, a user can’t pose queries like “what relationships exist between X and Y”. They can be implemented as macros, hiding the details unnecessary for the user. Another way is giving the user (materialized) views. Of course this makes querying easier but updating harder. Also it limits the user to what the database administrator has defined. Ordering is a relationship that is hard to model as a binary one. You either add a field by which to sort or a relationship like “precedes”. Each is meh. Existence of entities is usually not modeled by itself but rather implied by the fact of their participation in various relationships. But sometimes it’s nice to have a list of all the currently known things of a particular kind. The author says the relational model has a few things that partially cover this need.
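The “spurious false data” part of chapter 10 is easiest to see on a tiny example. The Python sketch below is my own, with made-up supplier/part/project facts: it first splits a ternary record naively into two binary ones, then uses the “own thing X” approach with three binary relationships.

```python
# Invented ternary facts: (supplier, part, project).
facts = {("s1", "p1", "j2"), ("s1", "p2", "j1")}

# Naive split into two binary records that share the supplier.
supplier_part = {(s, p) for s, p, _ in facts}
supplier_project = {(s, j) for s, _, j in facts}

# Joining them back on the supplier invents combinations nobody recorded.
rejoined = {
    (s, p, j)
    for s, p in supplier_part
    for s2, j in supplier_project
    if s == s2
}
spurious = rejoined - facts

# The "thing X" approach: each fact becomes its own thing, linked to
# A, B and C by three binary relationships (AX, BX, CX).
things = {f"x{i}": fact for i, fact in enumerate(sorted(facts))}
ax = {(x, s) for x, (s, _, _) in things.items()}
bx = {(x, p) for x, (_, p, _) in things.items()}
cx = {(x, j) for x, (_, _, j) in things.items()}

# Each X has exactly one partner per relationship, so dict() is safe here.
a_of, b_of, c_of = dict(ax), dict(bx), dict(cx)
rebuilt = {(a_of[x], b_of[x], c_of[x]) for x in things}
```

The naive split turns two recorded facts into four after the join, so this particular record is not reducible that way. The reified version reconstructs exactly the original facts, at the cost of introducing the extra `x0`, `x1` things.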
Chapter 11 starts with “Thus far we have been largely critical, and negative” and suggests some ways of dealing with, well, everything outlined in the 10 chapters before. And Chapter 12 is titled “Philosophy”. This is its last paragraph:
So, at bottom, we come to this duality. In an absolute sense, there is no singular objective reality. But we can share a common enough view of it for most of our working purposes, so that reality does appear to be objective and stable. But the chances of achieving such a shared view become poorer when we try to encompass broader purposes, and to involve more people. This is precisely why the question is becoming more relevant today: the thrust of technology is to foster interaction among greater numbers of people, and to integrate processes into monoliths serving wider and wider purposes. It is in this environment that discrepancies in fundamental assumptions will become increasingly exposed.
This book is as meta as it gets, and still it is very humane. A couple of chapters out of the first ten have not aged that well, but the rest is pure gold if you want to tickle your brain and to take this whole data modeling thing a little bit more light-heartedly but also a little bit more seriously.