While the original database design is usually able to handle the stored data volume without major performance problems, newly added clients and their data volume change that. Once the size of a database grows close to 1 GB, the number of concurrent users exceeds 20, or the number of new a-records goes beyond 200 an hour, the need for speed improvements arises.
This document is not in search of reasons why a database is designed the way it is. In general there are two types of improvements involved: direct and indirect.
It has to be understood that direct changes always force changes in the way client applications access the data. This document only considers direct changes.
The best way to get performance is to pay up-front for tuning. It is not a hidden cost if included in the initial design effort and is exponentially cheaper than performing tuning as an add-on feature.
Put simply, tuning involves three basic activities:
Usually you start off with the existing diagram of your database.
Oops - you do not have one? What about reverse engineering? Most CASE tools offer that option.
Unfortunately, reverse-engineered diagrams of existing databases are too often an unreadable mess of 60 pages or more. This shows how easily databases grow into systems far too complex for convenient CASE tool use. It is recommended to take the proper steps not only to simplify the database but also to enforce the existence of an easily readable, up-to-date diagram.
"Performance Design Review" should be part of both the "Database Design Review" and "Application Design Review" processes.
These experiences were gathered while working with many different databases sized from 2 GB up to 100 GB. Some of the information in this document can also be found on the Internet, as others have faced the same challenges.
To simplify things, blob fields and large varchar fields (> 254 characters, but mainly > 2048 characters) will both be called blobs and treated as synonyms.
There are two major points at which the behavior of a database changes: once it hits 2 GB in size, and again when it grows over 10 GB. Once a table holds more than 10 million rows its behavior changes too, but less dramatically than at the 2 GB mark; usually both happen around the same time, especially in non-normalized databases. Some of the speed problems appear earlier, but they are typically masked by the use of expensive hardware or by spending a lot of time fine-tuning the hardware and/or software in use.
Once a database exceeds 2 GB, a few other issues, natural ones so to speak, arise:
Some of the described speed improvements will lead to more redundancy in the data. While this is an unwanted side effect, the improved speed makes up for it.
There are quite a few major areas of improvement where more speed can be gained with less work:
Some RAID controllers can cause database corruption. It is not clear under which circumstances or with which vendors this happens. Reports about this behavior can be found on the Internet.
Please avoid the mistake of throwing good hardware at bad design.
Generally, the use of Unix-based systems will speed things up.
Experience shows that changing database parameters is a time-intensive game that does not necessarily lead to consistent results. What works with one database might fail with another. Having multiple databases on the same server makes a perfect set of settings for any specific database almost impossible. It will be very hard to measure the gain or to estimate the time needed to find the best settings.
Blob fields should be moved out of the main database, especially since different database vendors handle blobs differently. On top of that, some databases need more temporary space for sorting or retrieving data when one of the involved tables contains a blob field. Queries then have to be optimized much more aggressively to force the RDBMS to use the proper indices and to minimize disk I/O.
Most of the time blobs should not be stored in the database at all. Access restrictions can be handled via the OS.
In case blobs are stored in the database, there should be only one blob table in the whole system. Access to the proper data is done via the unique primary key. See the primary key discussion below.
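As a sketch of this idea (all table and column names here are invented for illustration, and SQLite stands in for the production RDBMS), a single system-wide blob table keyed by the unique primary key might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One blob table for the whole system; every other table stores only
# the integer primary key of the blob it refers to.
cur.execute("""
    CREATE TABLE blob_store (
        pk   INTEGER PRIMARY KEY,   -- system-wide unique key
        data BLOB NOT NULL
    )
""")
cur.execute("""
    CREATE TABLE document (
        pk      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        blob_pk INTEGER REFERENCES blob_store(pk)  -- indirection, no inline blob
    )
""")

cur.execute("INSERT INTO blob_store (pk, data) VALUES (?, ?)",
            (1, b"\x00\x01binary payload"))
cur.execute("INSERT INTO document (pk, title, blob_pk) VALUES (?, ?, ?)",
            (10, "Spec", 1))
conn.commit()

# The blob is fetched only when it is actually needed.
row = cur.execute("""
    SELECT d.title, b.data
    FROM document d
    JOIN blob_store b ON b.pk = d.blob_pk
    WHERE d.pk = 10
""").fetchone()
print(row[0])  # Spec
```

Because every blob lives in one table, queries over the regular tables never touch blob pages unless they explicitly join on `blob_store`.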
As described earlier huge databases containing many tables and blob fields tend to corrupt faster.
The main condition for proper key design is the understanding that two unique keys have to be considered:
The primary key is nothing other than a physical unique identifier for every single row. Most RDBMSs have such a key implemented for internal use, but it is not available for the programmer's use.
The primary key must be stable. In other words, it can NEVER be changed.
The primary key must never be the same as the logical key, because the logical key is made up of entities whose values can change.
Way too often databases contain tables where an additional field was added to force the uniqueness of the logical key: e.g. A_ID got a SEQ_NUM added for that sole purpose. This means the keys represent a hierarchy among the data, which is inflexible, limits the number of levels, and is plain wrong.
If a logical key is used as the primary key and this key is changed, all foreign key references to this table have to be updated. Migrating logical keys used as primary keys has terrible implications for database flexibility. A primary key independent of all attributes and relationships allows changes to the logical key without cascading changes, and it allows the data to be maintained easily and efficiently.
Another major advantage is the minimal impact when adding attributes to the logical key. Since entities are defined by relationships, programmers are tempted to use foreign key columns in the primary key. As stated above, this is bad practice because changing the primary key will lead to major changes in all client applications and stored procedures. With a real primary key, the only changes would be to the logical key in the involved tables and anywhere inserts and updates happen; all references to primary keys stay the way they are.
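A minimal sketch of this point (schema names are invented, SQLite stands in for the production system): with a surrogate primary key, a change to the logical key touches exactly one row, and every foreign key reference stays valid without any cascading update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Surrogate primary key (pk) is independent of the logical key (cust_code).
cur.execute("""
    CREATE TABLE customer (
        pk        INTEGER PRIMARY KEY,
        cust_code TEXT UNIQUE NOT NULL   -- logical key, may change over time
    )
""")
cur.execute("""
    CREATE TABLE orders (
        pk          INTEGER PRIMARY KEY,
        customer_pk INTEGER NOT NULL REFERENCES customer(pk)
    )
""")
cur.execute("INSERT INTO customer VALUES (1, 'ACME-OLD')")
cur.execute("INSERT INTO orders VALUES (100, 1)")

# The logical key changes; no foreign key anywhere needs an update.
cur.execute("UPDATE customer SET cust_code = 'ACME-NEW' WHERE pk = 1")

row = cur.execute("""
    SELECT c.cust_code
    FROM orders o
    JOIN customer c ON c.pk = o.customer_pk
    WHERE o.pk = 100
""").fetchone()
print(row[0])  # ACME-NEW
```

Had `cust_code` been the primary key, the same rename would have required updating every referencing row in `orders`.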
Yes. Just do it!
When a table contains rows which have to be referenced from other tables in the database, that table absolutely needs a primary key. In all other cases it depends on how programmers use the data. In some cases it will make sense to add a primary key just to simplify the physical view of the data. From past experience we know that systems and their uses change.
To be future-compatible, always adding the primary key is best practice. Primary keys should be integers filled from one generator per database.
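The "one generator per database" idea can be sketched as follows. SQLite has no sequence objects, so this sketch (names invented) simulates the generator with a single counter table; a real system would use the vendor's sequence or identity feature, wrapped in a transaction for concurrency safety.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pk_generator (next_pk INTEGER NOT NULL)")
conn.execute("INSERT INTO pk_generator VALUES (1)")

def next_pk(conn):
    """Hand out the next integer key, shared by every table in the DB.

    Not concurrency-safe as written; a production system would rely on
    the RDBMS's own sequence mechanism instead of a read-then-update.
    """
    value = conn.execute("SELECT next_pk FROM pk_generator").fetchone()[0]
    conn.execute("UPDATE pk_generator SET next_pk = next_pk + 1")
    return value

a = next_pk(conn)
b = next_pk(conn)
print(a, b)  # 1 2
```

Because every table draws from the same generator, a primary key value is unique across the whole database, which is exactly what a single system-wide blob table (see above) relies on.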
There is only one major disadvantage: it takes a lot more time to retrofit the use of primary keys into an existing system than to start with primary keys right from the beginning. And it has to be done with a lot of discipline.
Most systems are designed in a way which allows the easy retrieval of data in specific predefined ways. Adding new ways is complicated and time-intensive.
More often than not tables need a re-design to better reflect the data stored in them.
A perfect example would be a table containing one varchar field to store numeric, alphanumeric, and date values. This table should be split into more than one table to reflect the different types: one logical table with one primary key, but consisting of multiple physical tables, at least one for each data type stored.
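As a sketch of such a split (all names invented, SQLite standing in for the production RDBMS): the mixed-value table becomes one table carrying the identity plus one typed physical table per value type, all sharing the same primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Before: one varchar column holding numbers, text and dates alike.
# After: one logical table (attribute) with typed physical side tables.
cur.executescript("""
    CREATE TABLE attribute (
        pk   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE attribute_numeric (
        pk    INTEGER PRIMARY KEY REFERENCES attribute(pk),
        value REAL NOT NULL
    );
    CREATE TABLE attribute_text (
        pk    INTEGER PRIMARY KEY REFERENCES attribute(pk),
        value TEXT NOT NULL
    );
    CREATE TABLE attribute_date (
        pk    INTEGER PRIMARY KEY REFERENCES attribute(pk),
        value TEXT NOT NULL        -- ISO-8601 date string
    );
""")
cur.execute("INSERT INTO attribute VALUES (1, 'weight')")
cur.execute("INSERT INTO attribute_numeric VALUES (1, 72.5)")

row = cur.execute("""
    SELECT a.name, n.value
    FROM attribute a
    JOIN attribute_numeric n ON n.pk = a.pk
""").fetchone()
print(row)  # ('weight', 72.5)
```

Each typed table can now carry a proper column type and proper indices, instead of forcing every query to parse and convert a generic varchar.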
All columns contain only atomic values
All columns not containing the primary key depend on the primary key
All columns which contain neither the primary nor the logical key are independent of each other
These have been proven not relevant to actual RDBMSs and will therefore be ignored.
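A minimal illustration of the second and third rule above (schema invented for this sketch): a column that depends on another non-key column rather than on the primary key is moved into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized, city_name would sit in person and depend on zip_code,
# not on person's primary key. Normalized, the zip -> city dependency
# gets its own table.
cur.executescript("""
    CREATE TABLE zip (
        zip_code  TEXT PRIMARY KEY,
        city_name TEXT NOT NULL
    );
    CREATE TABLE person (
        pk       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        zip_code TEXT REFERENCES zip(zip_code)
    );
""")
cur.execute("INSERT INTO zip VALUES ('10115', 'Berlin')")
cur.execute("INSERT INTO person VALUES (1, 'Meier', '10115')")

row = cur.execute("""
    SELECT p.name, z.city_name
    FROM person p
    JOIN zip z ON z.zip_code = p.zip_code
""").fetchone()
print(row)  # ('Meier', 'Berlin')
```

Renaming a city now means updating one row in `zip` instead of every matching row in `person`, which is the redundancy reduction normalization is after.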
Normalizing the tables will lead to speed improvements and a simpler ERWin diagram.
Query optimization depends heavily on the database in use and will be discussed in another white paper.
This Genius One Article is published here for the purpose of helping to improve your personal and organizational performance; it is designed to provide accurate and authoritative information in regard to the subject matter covered. If legal advice or other expert assistance is required, the services of a competent professional should be sought.
Please contact us directly if you need a Word or PDF version of this article.