I started using Microsoft Access 97 in late 1997 and through to late 1999 when it was replaced by Microsoft Access 2000. It was in these two versions of Access that Microsoft made it possible to develop serious relational database applications, in conjunction with an implementation of Visual Basic for Applications (VBA) based on Visual Basic 6.0, with its use of implementation classes and a degree of object-orientation.
The method I use to develop relational database applications, based on rapid application techniques and the setting of short term development milestones, each with a specific deliverable, is intended to establish a third normal form representation of the source data set in the shortest possible time.
How is this done? I shall use the US Zip Code database application as an example to illustrate the method. In this post we shall concentrate on writing queries to identify our various look ups and their relationship with columns in the primary table, and writing queries to populate the tables designed to accommodate the look up values.
Once these tables are designed and populated, we move on to queries that resolve the relationships implied in the source data set with the explicit relationships and tables in the relational model when populating the primary table. And finally to establishing relationships with existing tables and relationships.
Initially, we find ourselves writing queries – or more conveniently, using the query designer to create queries semi-graphically using query-by-value and query-by-example – and running them manually. But we shall soon move to writing code to automate the procedure, which prompts further questions to be answered in the next post.
Create canonical source data set.
Import the source data, an Excel workbook with a single worksheet, into a table for further analysis, using the Import Wizard.
This is the table Zip_code_database with 42,522 rows and 17 columns with non-standard names. We can use the original column names to associate data in the source data set with correctly named columns in tables in our relational model.
Can we dispense with any columns? Yes, we decided in an earlier post to ignore world_region, estimated_population and notes.
Identify 1NF and 2NF attributes.
Can we identify columns that contain 0, 1 or more discrete values? We observe:
acceptable_cities, unacceptable_cities and area_codes contain comma-delimited lists of 0 or more elements (1st);
county and timezone contain 0 or 1 elements (1st);
The point about these columns is that they prevent us from seeing the row as 1st normal form, an equal number of fields, where there are zero or more instances of a repeating entry, such as the acceptable_cities and unacceptable_cities columns.
type, primary_city. state and country contain exactly 1 element of repeating values (2nd).
Finally, latitude, longitude and decommissioned are single-valued non-integer or boolean values.
Identify the primary table and its candidate key.
In the US Zip Code database we call our primary table tblZipCode, to capture facts about a zip code: zip, type, primary_city, state, country, latitude, longitude and decommissioned; and we call our candidate key txtCode, the zip column.
Construct queries for 2NF look up tables and populate primary table.
By listing the fields in the primary table, we can immediately identify our look up columns and write queries to capture the requirement:
SELECT Zip_code_database.type AS txtZipType, Len([type]) AS numSize
GROUP BY Zip_code_database.type, Len([type])
ORDER BY Zip_code_database.type;
SELECT Zip_code_database.primary_city AS txtCityName, Len([primary_city]) AS numLength
GROUP BY Zip_code_database.primary_city, Len([primary_city])
ORDER BY Zip_code_database.primary_city;
SELECT Zip_code_database.state AS txtName
GROUP BY Zip_code_database.state
ORDER BY Zip_code_database.state;
We except the country column because that is dealt with in another table. But we want to deal with the county column and the timezone column:
SELECT Zip_code_database.county AS txtCountyName, Len([county]) AS numLength
GROUP BY Zip_code_database.county, Len([county])
HAVING (((Zip_code_database.county) Is Not Null))
ORDER BY Zip_code_database.county;
SELECT Zip_code_database.timezone AS txtName, Len([timezone]) AS numLength
GROUP BY Zip_code_database.timezone, Len([timezone])
HAVING (((Zip_code_database.timezone) Is Not Null))
ORDER BY Zip_code_database.timezone;
We can immediate create the tables tblZipType, tblZipCity, tblZipState, tblCounty and populate them as follows.
INSERT INTO tblZipType ( txtName )
INSERT INTO tblZipCity ( txtName )
SELECT Us_states_extended.Field1 AS txtCode, Us_states_extended.Field2 AS txtName, Len([Field2]) AS numLength, Zip_code_database.state
FROM Us_states_extended INNER JOIN Zip_code_database ON Us_states_extended.Field1 = Zip_code_database.state
GROUP BY Us_states_extended.Field1, Us_states_extended.Field2, Len([Field2]), Zip_code_database.state
ORDER BY Us_states_extended.Field1;
using the imported Us_states_extended table, and populating the table tblState (999appZipState)
INSERT INTO tblState ( txtCode, txtName )
SELECT [999selUniqueStateExtended].txtCode, [999selUniqueStateExtended].txtName
INSERT INTO tblZipCounty ( txtName )
INSERT INTO tblZipTimezone ( txtName )
Finally, we can populate tblZipCode using the above tables. Firstly, by selecting the data set using joins on the original source data set with the now populated look up tables to identify the foreign keys in the target primary table to establish and enforce referential integrity (999selZipCode)
SELECT Zip_code_database.zip, tblZipType.lngPKZipTypeID, Zip_code_database.type, tblZipCity.lngPKZipCityID, Zip_code_database.primary_city, tblState.lngPKStateID, Zip_code_database.state, tblCountry.lngPKCountryID, Zip_code_database.country, Zip_code_database.latitude, Zip_code_database.longitude, Zip_code_database.decommissioned
FROM tblCountry INNER JOIN (tblState INNER JOIN (tblZipCity INNER JOIN (tblZipType INNER JOIN Zip_code_database ON tblZipType.txtName = Zip_code_database.type) ON tblZipCity.txtName = Zip_code_database.primary_city) ON tblState.txtCode = Zip_code_database.state) ON tblCountry.txtCode = Zip_code_database.country;
and populating tblZipCode (999appZipCode)
INSERT INTO tblZipCode ( txtCode, lngFKZipTypeID, lngFKZipCityID, lngFKStateID, lngFKCountryID, dblLatitude, dblLongitude, blnDecommissioned )
SELECT [999selZipCode].zip, [999selZipCode].lngPKZipTypeID, [999selZipCode].lngPKZipCityID, [999selZipCode].lngPKStateID, [999selZipCode].lngPKCountryID, [999selZipCode].latitude, [999selZipCode].longitude, [999selZipCode].decommissioned
We assert that the latitude, longitude and decommissioned columns are facts about the zip column.
Write code to establish 1NF in 0-to-many relationships.
We assert that the cities in primary_city, acceptable_cities and unacceptable_cities refer to possibly unique city values of one of three types: primary, acceptable and unacceptable, giving rise to the following relationships with examples:
tblZipCityCityType (lngPKZipCityCityTypeID, lngFKZipCityID, lngFKZipCityTypeID)
(1, Aaronsburg, Primary)(18672, Mercedita, Acceptable)(28512, Comunidad Las Flores, Unacceptable)
tblZipCodeCityCityType (lngPKZipCodeCityCityTypeID, lngFKZipCodeID, lngFKZipCityCityTypeID)
(25158, 34240, (Unacceptable, Lakewood Ranch))(5821,34240, (Acceptable, Lakewood Rch))
And we can establish the relationships for area_codes, county and timezone:
tblZipCodeAreaCode (lngPKZipCodeAreaCodeID, lngFKZipCodeID, lngFKAreaCodeID)
(1, 00501, 631) (2, 00544, 632) (3, 00601, 787) (4, 00601, 939) (5, 00602, 787) (6, 00603, 787)
tblZipCodeCounty (lngPKZipCodeCountyID, lngFKZipCodeID, lngFKZipCountyID)
(1, 00501, Suffolk County) (2, 00544, Suffolk County) (3, 00601, Adjuntas) (4, 00602, Adjuntas)
tblZipCodeTimezone (lngPKZipCodeTimezoneID, lngFKZipCodeID, lngFKZipTimezoneID)
(1, 00501, America/New York) (2, 00544, America/New York) (3, 00601, America/Puerto Rico)
We design tables to implement this, as well as write code to create the tables and establish relationships and enforce referential integrity.
Establish relationships with existing relational tables.
When we have tables and relationships we have derived from other authoritative sources, which for the US Zip Code database are the state and country values, we need to ensure these are incorporated correctly in our primary table, or other tables where we need look ups.
For the state column, we have a foreign key and relationship:
tblZipCode.lngFKStateID : tblState (lngPKStateID, lngFKStateTypeID, txtCode, txtName)
tblState.lngFKStateTypeID : tblStateType (lngPKStateTypeID, txtName)
for the country column, we have a foreign key and relationship:
tblZipCode.lngFKCountryID : tblCountry (lngPKCountryID, lngFKContinentID, txtCode, txtLongCode, txtNumber, txtName)
tblCountry.lngFKContinentID : tblContinent (lngPKContinentID, txtName)
As long as these tables are populated at the right point in the database population procedure, results from the table can be incorporated through joins to ensure the correct population of the primary table, as seen in the queries (999selZipCode), (999appZipCode) above.
In this post we have seen there are a discrete number of steps to move from a flat file representation of the source data set to the relational model, but we should note that this is an incremental process of stepwise refinement, where one or more of the steps have to be revisited and reapplied to move from a less complete to a more complete relational model. Indeed the steps and their names and functions are, up to a point, interchangeable and may blend one with another. But we may be confident that our heuristic, which is what we have, will work for a very broad class of data set we may want to normalize.
Now on to how we put this together and make it easy to use.