Event Report: Sham Navathe on E.F. Codd

(This is a liveblog of Sham Navathe’s lecture on E.F. Codd as part of the Turing 100 @ Persistent lecture series.)

Sham Navathe does not really need an introduction – since he is famous for his book “Fundamentals of Database Systems” written by with Ramez Elmasri, which is prescribed for undergraduate database courses all over the world. His full background can be looked up in his Wikipedia page, but it is worth mentioning that Navathe is a Punekar, being a board topper from Pune in his HSC. He regularly visits Pune since he has family here.

Today’s lecture is about E.F. Codd, a British mathematicians, logician, and analyst, who was given the Turing award in 1981 for his invention of the relational databases model. He is one of the 3 people to have received a Turing award for work in databases (the other two being Charlie Bachman in 1973 for introducing the concept of data structures and the network database model, and Jim Gray in 1998 for his work on database transaction processing research and technical leadership in system implementation.)

Ted Codd studied in Oxford, initially studying Chemistry, before doing a stint with the Royal Air Force and then getting degree in Maths. He later emigrated to US, worked in IBM, did a PhD from University of Michigan, and finally went back to IBM. At that time, he led the development of the world’s first “multiprogramming system” – sort of an operating system.

Codd quit IBM in 1984 because he was not happy with the “incomplete implementation of the relational model.” He believed that SQL is a “convenient” and “informal” representation of the relational model. He published rules that any system must follow before it could be called a relational database management system, and complained that most commercial systems were not really relational in that sense – and some were simply thin pseudo-relational layer on top of older technology.

Invention of the Relational Model

In 1963-64, IBM developed the IMS database management system based on the hierarchical model. In 1964-65 Honeywell developed IDS, based on the network model. In 1968, Dave Childs of Michigan first proposed a set-oriented database management system. In 1969 Codd published “The derivability, redundancy, and consistency of relations stored in large databases” (IBM research report, RJ-599, 1969). This was the work that led to the seminal paper, “A Relational Model for Large Shared Data Banks” (CACM, 13:6, June 1970). Other classic papers are: “Extending the Relational Model to capture more meaning” (ACM TODS, 4:4, Dec 1979), which is called the RM/T model. He is also the inventor of the term OLAP (Online Analytical Processing).

After Codd’s proposal of the relational model, IBM was initially reluctant to commercialize the idea. Instead, Michael Stonebraker of UC-Berkeley along with PhD students created INGRES, the first fully relational system. INGRES ultimately became Postres database which is one of the leading open source databases in the world today. In the meantime, Relational Software Inc. brought another relational database product to the market. This ultimately became Oracle. After this, IBM heavily invested in System R that developed the relational DBMS ideas fully. Codd was involved in the development of System R – and most of the fundamental ideas and algorithms underlying most modern RDBMS today are heavily influenced by System R.

Interesting RDBMS developments after Codd’s paper:

1975: PhD students in Berkeley develop an RDBMS
1976: System R first relational prototype system from IBM
1979: First proposal for a cost based optimizer for database queries
1981: Transactions (by Jim Gray)
1981: SQL/DS First commercial RDBMS

Two main motivations for the relational model:

- Ordering dependence
- Indexing dependence
- Access path dependence
In DBMS before RDBMS, there was a heavy dependence of the program (and programmer) on the way the data is modeled, stored and navigated:

All of this was hardcoded in the program. And Codd wanted to simplify database programming by removing these dependencies.
– Loss of programmer productivity due to manual optimization.

Codd’s fundamental insight was that freeing up the application programmer from knowing about the layout of the data on disk would lead to huge improvements in productivity. For example, in the network or hierarchical models, a data model in which a Student has a link to the Department that he is enrolled in, is very different from a model in which each Department links to all the students that are enrolled there. Depending upon which model is used, all application programs would be different, and switching from one model to another would be difficult later on. Instead, Codd proposed the relational model which would store this as the Student relation, the Department relation, and finally the Enrolment relation that connects Student and Department.

The Relational Data Model

A relation is simply an unordered set of tuples. On these relations, the following operations are permitted:

Permutation
Projection
Join
Composition

Of course, other operators from set theory can be applied to relations, but then the result will not be a relation. However, the operations given above take relations and the results are also relations. Thus, all the relational operators can again be applied to the results of this operation.

He defined 3 types of relations:

Expressible: is any relation that can be expressed in the data model, using the data access language
Named: is any relation that has been given a specific name (i.e. is listed in the schema)
Stored: is a relation that is physically stored on disk

He also talked about 3 important properties of relations:

Derivability: A relation is derivable if it can be expressed in terms of the data access language (i.e. can be expressed as a sequence of relational operations)
Redundancy: A set of relations is called strongly redundant if one of the relations can be derived from the other relations. i.e. it is possible to write a relational operation on some of the relations of the set whose result is the same as one of the other relations. A set of relations is weakly redundant if there is a relation in that set which has a projection that is derivable from the other relations. Good database design entails that strongly redundant sets of relations should not be used because of problems with inconsistency. However, weakly redundant relations are OK, and used for performance purposes. (Materialized views.)
Consistency / Inconsistency: Codd allowed the definition of constraints governing the data in a set of relations, and a database is said to be consistent if all the data in the database satisfies those constraints, and is said to be inconsistent if not.

In the years that followed, a bunch of IBM research reports on normalization of databases followed.

Turing Award Lecture

His talk is titled: “Relational Databases: A Practical Foundation for Productivity”. His thoughts at that time:

Put users in direct touch with databases
Increase productivity of DP professionals in developing applications
Concerned that the term “relational” was being misued

He points out that in relational data model, data can be addressed symbolically, as “relation name, primary key value, attribute name”. This is much better than embedding links, or positional addressing (X(i, j)).

The relational data model encompasses structure, manipulation and integrity of data. Hence, it is a complete model, because all 3 aspects are important for data management.

Characteristics of relational systems:

MUST have a data sub-language that allows users to query the data using SELECT, PROJECT and JOIN operators
MUST NOT have user visible navigation links between relations
MUST NOT convey any information in the way tuples are ordered

He was worried that relational system might not be able to give performance as good as the performance of non-relational systems. He talked about:

performance oriented data structures
efficient algorithms for converting user requests into optimal code

In future work, he mentioned the following

Domains and primary keys
Updating join-type views
Outer-joins
Catalogs
Design aids at logical and physical level
Location and replication transparency in distributed databases
Capture meaning of data
Improved treatment of missing, null and inapplicable values
Heterogeneous data

This was a remarkably prescient list. In the 30 years since this talk, most of this has actually happened either in commercial databases or in research labs. We have pretty much achieved #1 to #6, while #7 to #9 have seen a lot of research work but not wide commercial acceptance yet.

Concluding Thoughts

Relational model is a firm foundation for data management. Nothing else compares.
On this foundation we were able to tackle difficult problems in the areas of design, derivability, redundancy, consistency, replication as well as language issues. All of these would have been very difficult otherwise
Proponents of NoSQL databases as well as map-reduce/hadoop type of systems need to keep in mind that large data management cannot really be done in an ad hoc manner.
Codd’s RM/T model was an attempt to capture metadata management, but fell short of what was needed.

Audience Questions

Q: Why did object-oriented databases not catch on?

A: There was a lack of understanding amongst the wider community as to the best way of using object-oriented ideas for data management. OODBMS companies were not really able to really educate the wider community, and hence failed. Another problem is that object-oriented DBMS systems made the data model complex but there were not corresponding developments in the query language and optimization tools.

Q: When the relational model was developed, did they constrain themselves due to the hardware limitations of those days?

A: Codd did mention that when deciding on a set of operations for the relational model, one consideration was ‘Which of these operations can be implemented on today’s hardware’. On the other hand, there were lots of companies in the 80s which tried to implement specialized hardware for relational/DBMS. However, none of those companies really succeeded.

In the future, it is very unlikely that anyone will develop a new data model with improvements in hardware and processing power. However, new operators and new ways of parallelizing them will certainly be developed.

Q: There are areas of data management like incomplete, inexact data; common sense understanding of data; deduction and inferencing capabilities. These are not really handled by today’s DBMS systems. How will this be handled in the future.

A: There have been many interesting and elegant systems proposed for these areas, but non have seen commercial success. So we’ll have to wait a while for any of these to happen.

Will be updated every 15 minutes. Please refresh regularly.

punetech.com

Connecting together Pune's Technologists