What Are The Implications Of The Same Data Being Stored In Many Different Databases

4 Affiliate 4: Data and Databases

Dave Conservative and David T. Bourgeois

Learning Objectives

Upon successful completion of this affiliate, you volition be able to:

depict the differences betwixt information, information, and knowledge;
define the term database and identify the steps to creating 1;
depict the role of a database management system;
depict the characteristics of a data warehouse; and
define data mining and describe its part in an organization.

Please note, at that place is an updated edition of this volume bachelor at https://opentextbook.site. If you are not required to use this edition for a class, you lot may want to check it out.

Introduction

You have already been introduced to the first 2 components of information systems: hardware and software. However, those two components by themselves practise not make a computer useful. Imagine if you turned on a estimator, started the word processor, but could not save a document. Imagine if y'all opened a music player merely at that place was no music to play. Imagine opening a web browser but there were no web pages. Without information, hardware and software are not very useful! Data is the third component of an data system.

Data, Information, and Knowledge

Data are the raw $.25 and pieces of information with no context. If I told y'all, "15, 23, xiv, 85," y'all would not have learned anything. But I would have given you lot data.

Data tin can be quantitative or qualitative. Quantitative data is numeric, the result of a measurement, count, or some other mathematical calculation. Qualitative data is descriptive."Ruby Red," t he color of a 2013 Ford Focus, is a north example of qualitative information. A number can be qualitative as well: if I tell y'all my favorite number is 5, that is qualitative data considering it is descriptive, not the result of a measurement or mathematical calculation.

By itself, data is not that useful. To be useful, information technology needs to be given context. Returning to the instance above, if I told y'all that "15, 23, 14, and 85″ are the numbers of students that had registered for upcoming classes, that would be information . By adding the context – that the numbers represent the count of students registering for specific classes – I have converted data into information.

Once we have put our information into context, aggregated and analyzed it, we can use it to make decisions for our organisation. We can say that this consumption of information produces noesis. This cognition tin can be used to make decisions, set policies, and fifty-fifty spark innovation.

The final step upwardly the data ladder is the step from knowledge (knowing a lot about a topic) towisdom. We can say that someone has wisdom when they tin combine their knowledge and experience to produce a deeper understanding of a topic. It often takes many years to develop wisdom on a particular topic, and requires patience.

Examples of Data

Near all software programs require data to do anything useful. For example, if you are editing a document in a word processor such equally Microsoft Word, the document you are working on is the data. The discussion-processing software can dispense the data: create a new document, duplicate a document, or modify a document. Some other examples of data are: an MP3 music file, a video file, a spreadsheet, a web page, and an east-book. In some cases, such equally with an e-book, you lot may only have the ability to read the data.

Databases

The goal of many data systems is to transform data into data in order to generate knowledge that can be used for conclusion making. In guild to do this, the organization must exist able to have data, put the data into context, and provide tools for assemblage and analysis. A database is designed for just such a purpose.

A database is an organized collection of related information. Information technology is anorganized collection, because in a database, all data is described and associated with other data. All information in a database should be related as well; carve up databases should be created to manage unrelated information. For example, a database that contains information about students should not likewise hold data nigh company stock prices. Databases are not ever digital – a filing cabinet, for example, might be considered a grade of database. For the purposes of this text, we will only consider digital databases.

Relational Databases

Databases can be organized in many different ways, and thus take many forms. The nearly pop form of database today is the relational database. Popular examples of relational databases are Microsoft Access, MySQL, and Oracle. A relational database is 1 in which information is organized into one or more tables. Each table has a set up of fields, which define the nature of the data stored in the table. A tape is one case of a fix of fields in a table. To visualize this, think of the records every bit the rows of the table and the fields as the columns of the table. In the example beneath, we have a table of student data, with each row representing a student and each column representing one piece of data near the educatee.

In a relational database, all the tables are related by one or more fields, so that it is possible to connect all the tables in the database through the field(s) they have in common. For each table, 1 of the fields is identified as a primary key. This key is the unique identifier for each record in the table. To help you understand these terms further, let's walk through the process of designing a database.

Designing a Database

Suppose a university wants to create an information system to track participation in educatee clubs. After interviewing several people, the pattern team learns that the goal of implementing the system is to requite better insight into how the academy funds clubs. This volition be accomplished by tracking how many members each club has and how active the clubs are. From this, the team decides that the organisation must go along track of the clubs, their members, and their events. Using this information, the pattern team determines that the following tables need to be created:

Clubs: this will rails the order name, the club president, and a curt description of the guild.
Students: student name, email, and yr of birth.
Memberships: this table will correlate students with clubs, assuasive united states to have any given pupil bring together multiple clubs.
Events: this table will track when the clubs meet and how many students showed up.

At present that the pattern team has determined which tables to create, they need to define the specific information that each table volition hold. This requires identifying the fields that will exist in each table. For example, Club Proper name would exist i of the fields in the Clubs table. First Name and Last Proper name would be fields in the Students tabular array. Finally, since this volition be a relational database, every table should have a field in mutual with at to the lowest degree one other table (in other words: they should have a relationship with each other).

In social club to properly create this relationship, a primary fundamental must be selected for each table. This cardinal is a unique identifier for each record in the table. For case, in the Students table, it might be possible to apply students' last proper name as a manner to uniquely identify them. Even so, it is more than than likely that some students volition share a last proper name (similar Rodriguez, Smith, or Lee), and then a different field should exist selected. A educatee's east-mail address might be a good choice for a chief central, since e-postal service addresses are unique. However, a primary key cannot modify, and so this would mean that if students changed their east-mail accost we would have to remove them from the database and so re-insert them – not an attractive proposition. Our solution is to create a value for each student — a user ID — that will human activity as a primary key. We will as well exercise this for each of the pupil clubs. This solution is quite mutual and is the reason you lot have so many user IDs!

You lot tin can see the final database blueprint in the figure beneath:

Student Clubs database diagram — Pupil Clubs database diagram

With this design, not but do nosotros have a mode to organize all of the information we need to run into the requirements, but we accept also successfully related all the tables together. Hither's what the database tables might await like with some sample data. Note that the Memberships table has the sole purpose of allowing us to relate multiple students to multiple clubs.

Student clubs table with sample data

Student table with sample data

Normalization

When designing a database, 1 important concept to understand is normalization. In uncomplicated terms, to normalize a database means to pattern it in a way that: 1) reduces duplication of information between tables and 2) gives the tabular array equally much flexibility every bit possible.

In the Pupil Clubs database design, the pattern team worked to accomplish these objectives. For instance, to track memberships, a simple solution might accept been to create a Members field in the Clubs table and then just list the names of all of the members there. Yet, this design would mean that if a student joined two clubs, then his or her information would have to be entered a 2d fourth dimension. Instead, the designers solved this trouble by using 2 tables: Students and Memberships.

In this design, when a pupil joins their start order, nosotros first must add the educatee to the Students table, where their starting time proper noun, concluding name, e-mail address, and birth yr are entered. This addition to the Students tabular array will generate a student ID. At present nosotros will add a new entry to announce that the student is a fellow member of a specific club. This is achieved by adding a record with the pupil ID and the club ID in the Memberships tabular array. If this student joins a second club, nosotros practice non take to indistinguishable the entry of the educatee's proper name, e-mail, and birth yr; instead, we just need to brand another entry in the Memberships table of the 2d club's ID and the student's ID.

The design of the Student Clubs database also makes it simple to change the blueprint without major modifications to the existing structure. For example, if the blueprint team were asked to add functionality to the system to track faculty advisors to the clubs, we could easily reach this by adding a Faculty Advisors table (similar to the Students table) and and so adding a new field to the Clubs table to hold the Faculty Counselor ID.

Data Types

When defining the fields in a database table, we must give each field a data type. For example, the field Birth Year is a yr, and then information technology will be a number, while Offset Proper name will be text. Well-nigh modern databases allow for several different data types to be stored. Some of the more than common information types are listed hither:

Text: for storing non-numeric information that is brief, generally under 256 characters. The database designer can identify the maximum length of the text.
Number: for storing numbers. At that place are usually a few different number types that can exist selected, depending on how large the largest number will be.
Yes/No: a special form of the number data blazon that is (ordinarily) one byte long, with a 0 for "No" or "False" and a 1 for "Yes" or "True".
Appointment/Time: a special form of the number data type that can be interpreted as a number or a fourth dimension.
Currency: a special form of the number information type that formats all values with a currency indicator and two decimal places.
Paragraph Text: this data blazon allows for text longer than 256 characters.
Object: this data type allows for the storage of data that cannot exist entered via keyboard, such as an paradigm or a music file.

There are 2 important reasons that we must properly define the data type of a field. First, a data blazon tells the database what functions can be performed with the information. For example, if we wish to perform mathematical functions with 1 of the fields, we must be sure to tell the database that the field is a number information type. And then if we have, say, a field storing birth year, we tin can subtract the number stored in that field from the electric current year to become historic period.

The second important reason to define information blazon is so that the proper amount of storage space is allocated for our information. For instance, if the First Name field is defined as a text(50) data type, this means fifty characters are allocated for each get-go name we desire to shop. Nevertheless, even if the beginning name is simply five characters long, l characters (bytes) will be allocated. While this may not seem like a big deal, if our tabular array ends up property l,000 names, we are allocating 50 * l,000 = 2,500,000 bytes for storage of these values. Information technology may be prudent to reduce the size of the field so we practise not waste material storage space.

Sidebar: The Difference between a Database and a Spreadsheet

Many times, when introducing the concept of databases to students, they quickly determine that a database is pretty much the same as a spreadsheet. Later all, a spreadsheet stores data in an organized fashion, using rows and columns, and looks very similar to a database table. This misunderstanding extends beyond the classroom: spreadsheets are used as a substitute for databases in all types of situations every day, all over the world.

To be fair, for simple uses, a spreadsheet can substitute for a database quite well. If a simple list of rows and columns (a unmarried table) is all that is needed, then creating a database is probably overkill. In our Student Clubs example, if we simply needed to track a listing of clubs, the number of members, and the contact information for the president, we could go away with a single spreadsheet. Even so, the demand to include a listing of events and the names of members would exist problematic if tracked with a spreadsheet.

When several types of data must exist mixed together, or when the relationships between these types of data are complex, then a spreadsheet is not the all-time solution. A database allows data from several entities (such as students, clubs, memberships, and events) to all be related together into one whole. While a spreadsheet does allow you to define what kinds of values tin can be entered into its cells, a database provides more intuitive and powerful ways to ascertain the types of data that go into each field, reducing possible errors and allowing for easier analysis.

Though not adept for replacing databases, spreadsheets can exist ideal tools for analyzing the data stored in a database. A spreadsheet package can be connected to a specific tabular array or query in a database and used to create charts or perform assay on that information.

Structured Query Language

One time you take a database designed and loaded with data, how will yous do something useful with information technology? The primary way to work with a relational database is to employ Structured Query Language, SQL (pronounced "sequel," or simply stated as S-Q-L). Almost all applications that piece of work with databases (such equally database direction systems, discussed below) brand employ of SQL as a way to clarify and manipulate relational data. As its name implies, SQL is a language that can be used to piece of work with a relational database. From a simple request for information to a complex update performance, SQL is a mainstay of programmers and database administrators. To give you a taste of what SQL might look like, here are a couple of examples using our Student Clubs database.

The post-obit query will recall a list of the first and last names of the club presidents:

SELECT "Beginning Name", "Last Proper noun" FROM "Students" WHERE "Students.ID" = "Clubs.President"

The post-obit query will create a list of the number of students in each club, listing the club name and so the number of members:

SELECT "Clubs.Lodge Name", COUNT("Memberships.Educatee ID") FROM "Clubs" LEFT Bring together "Memberships" ON "Clubs.Club ID" = "Memberships.Guild ID"

An in-depth clarification of how SQL works is beyond the scope of this introductory text, only these examples should give y'all an idea of the power of using SQL to manipulate relational data. Many database packages, such equally Microsoft Admission, permit you to visually create the query you want to construct and and then generate the SQL query for you.

Other Types of Databases

The relational database model is the most used database model today. Still, many other database models exist that provide different strengths than the relational model. The hierarchical database model, pop in the 1960s and 1970s, connected data together in a hierarchy, allowing for a parent/child human relationship betwixt data. The document-centric model immune for a more unstructured data storage by placing information into "documents" that could and so be manipulated.

Mayhap the most interesting new evolution is the concept of NoSQL (from the phrase "not only SQL"). NoSQL arose from the need to solve the problem of large-calibration databases spread over several servers or fifty-fifty across the world. For a relational database to piece of work properly, information technology is of import that only 1 person be able to manipulate a piece of information at a time, a concept known as record-locking. But with today's large-scale databases (think Google and Amazon), this is just not possible. A NoSQL database can work with information in a looser way, allowing for a more unstructured surroundings, communicating changes to the data over time to all the servers that are part of the database.

Database Direction Systems

Screen shot of the Open up Function database management system

To the estimator, a database looks like one or more files. In social club for the data in the database to exist read, changed, added, or removed, a software programme must access it. Many software applications have this ability: iTunes can read its database to give you a listing of its songs (and play the songs); your mobile-phone software can collaborate with your list of contacts. But what nigh applications to create or manage a database? What software can you use to create a database, modify a database'southward structure, or simply practice analysis? That is the purpose of a category of software applications called database management systems (DBMS).

DBMS packages generally provide an interface to view and alter the pattern of the database, create queries, and develop reports. About of these packages are designed to work with a specific type of database, but mostly are compatible with a wide range of databases.

For instance, Apache OpenOffice.org Base (see screen shot) can exist used to create, modify, and analyze databases in open-database (ODB) format. Microsoft'south Admission DBMS is used to work with databases in its ain Microsoft Access Database format. Both Admission and Base have the power to read and write to other database formats as well.

Microsoft Admission and Open up Office Base are examples of personal database-management systems. These systems are primarily used to develop and analyze unmarried-user databases. These databases are non meant to be shared across a network or the Cyberspace, but are instead installed on a particular device and work with a single user at a time.

Enterprise Databases

A database that tin only be used by a single user at a time is not going to meet the needs of most organizations. Every bit computers take get networked and are at present joined worldwide via the Internet, a course of database has emerged that can be accessed by ii, ten, or fifty-fifty a million people. These databases are sometimes installed on a unmarried computer to be accessed by a group of people at a unmarried location. Other times, they are installed over several servers worldwide, meant to be accessed past millions. These relational enterprise database packages are built and supported past companies such equally Oracle, Microsoft, and IBM. The open-source MySQL is also an enterprise database.

As stated before, the relational database model does not scale well. The term scale here refers to a database getting larger and larger, being distributed on a larger number of computers continued via a network. Some companies are looking to provide large-calibration database solutions by moving abroad from the relational model to other, more flexible models. For example, Google now offers the App Engine Datastore, which is based on NoSQL. Developers can use the App Engine Datastore to develop applications that access data from anywhere in the world. Amazon.com offers several database services for enterprise use, including Amazon RDS, which is a relational database service, and Amazon DynamoDB, a NoSQL enterprise solution.

Big Data

A new buzzword that has been capturing the attention of businesses lately is big data. The term refers to such massively large data sets that conventional database tools practise not have the processing power to analyze them. For instance, Walmart must process over one million customer transactions every hour. Storing and analyzing that much data is beyond the power of traditional database-management tools. Understanding the all-time tools and techniques to manage and analyze these large data sets is a problem that governments and businesses akin are trying to solve.

Sidebar: What Is Metadata?

The term metadata can exist understood as "data most data." For instance, when looking at one of the values of Year of Nascency in the Students tabular array, the data itself may be "1992". The metadata about that value would be the field name Year of Nativity, the fourth dimension it was last updated, and the information type (integer). Some other example of metadata could exist for an MP3 music file, like the one shown in the image below; information such as the length of the song, the creative person, the album, the file size, and even the album cover fine art, are classified equally metadata. When a database is beingness designed, a "data dictionary" is created to hold the metadata, defining the fields and structure of the database.

Metadata most a camera image (Public Domain)

Data Warehouse

As organizations accept begun to utilize databases equally the centerpiece of their operations, the demand to fully empathize and leverage the data they are collecting has go more than and more apparent. However, straight analyzing the data that is needed for day-to-day operations is not a good idea; we do not want to taxation the operations of the company more than we need to. Farther, organizations too want to clarify data in a historical sense: How does the data we have today compare with the same set of data this fourth dimension terminal calendar month, or last yr? From these needs arose the concept of the information warehouse.

The concept of the data warehouse is simple: extract data from one or more of the organization's databases and load it into the data warehouse (which is itself another database) for storage and analysis. Notwithstanding, the execution of this concept is not that simple. A information warehouse should be designed and then that it meets the post-obit criteria:

Information technology uses not-operational data. This ways that the information warehouse is using a copy of data from the active databases that the company uses in its day-to-mean solar day operations, and then the information warehouse must pull data from the existing databases on a regular, scheduled footing.
The data is fourth dimension-variant. This means that whenever data is loaded into the data warehouse, information technology receives a time postage stamp, which allows for comparisons between different fourth dimension periods.
The data is standardized. Because the data in a data warehouse commonly comes from several different sources, it is possible that the data does not use the aforementioned definitions or units. For example, our Events table in our Educatee Clubs database lists the event dates using the mm/dd/yyyy format (e.g., 01/x/2013). A table in another database might use the format yy/mm/dd (e.thousand., 13/01/10) for dates. In society for the data warehouse to friction match upward dates, a standard date format would have to exist agreed upon and all data loaded into the data warehouse would have to be converted to employ this standard format. This process is called extraction-transformation-load (ETL).

There are two primary schools of idea when designing a data warehouse: bottom-upwardly and top-downwards. The bottom-upwards approach starts by creating small information warehouses, called data marts, to solve specific business problems. As these data marts are created, they can be combined into a larger information warehouse. The top-down arroyo suggests that we should start by creating an enterprise-wide information warehouse and and so, as specific business needs are identified, create smaller data marts from the data warehouse.

Data warehouse process (top-down) — Data warehouse process (top-downwards)

Benefits of Data Warehouses

Organizations find information warehouses quite beneficial for a number of reasons:

The process of developing a data warehouse forces an organization to better empathize the data that information technology is currently collecting and, equally important, what information is not being collected.
A data warehouse provides a centralized view of all data being collected across the enterprise and provides a means for determining data that is inconsistent.
One time all data is identified as consequent, an organization can generate 1 version of the truth. This is important when the company wants to report consistent statistics about itself, such equally acquirement or number of employees.
By having a information warehouse, snapshots of information can exist taken over fourth dimension. This creates a historical record of data, which allows for an analysis of trends.
A information warehouse provides tools to combine information, which can provide new information and analysis.

Information Mining

Data mining is the process of analyzing information to observe previously unknown trends, patterns, and associations in order to make decisions. Generally, data mining is accomplished through automated means against extremely large data sets, such as a information warehouse. Some examples of data mining include:

An assay of sales from a large grocery chain might determine that milk is purchased more than oft the mean solar day afterwards it rains in cities with a population of less than fifty,000.
A banking concern may find that loan applicants whose bank accounts show item deposit and withdrawal patterns are not practiced credit risks.
A baseball game team may find that collegiate baseball players with specific statistics in hitting, pitching, and fielding make for more than successful major league players.

In some cases, a data-mining project is begun with a hypothetical result in heed. For example, a grocery chain may already accept some idea that buying patterns change after it rains and want to get a deeper understanding of exactly what is happening. In other cases, at that place are no presuppositions and a information-mining plan is run confronting large data sets in lodge to discover patterns and associations.

Privacy Concerns

The increasing ability of data mining has caused concerns for many, especially in the area of privacy. In today's digital globe, information technology is becoming easier than e'er to take data from disparate sources and combine them to practice new forms of analysis. In fact, a whole industry has sprung upwards effectually this applied science: information brokers. These firms combine publicly attainable data with information obtained from the government and other sources to create vast warehouses of data about people and companies that they can then sell. This subject will be covered in much more detail in chapter 12 – the chapter on the ethical concerns of information systems.

Business concern Intelligence and Business organisation Analytics

With tools such as data warehousing and information mining at their disposal, businesses are learning how to use data to their advantage. The term business organisation intelligence is used to describe the process that organizations employ to take data they are collecting and analyze it in the hopes of obtaining a competitive advantage. Also using data from their internal databases, firms ofttimes purchase information from data brokers to get a big-picture agreement of their industries. Concern analytics is the term used to describe the utilize of internal company data to improve business processes and practices.

Noesis Direction

We end the chapter with a discussion on the concept of noesis direction (KM). All companies accumulate knowledge over the course of their existence. Some of this knowledge is written down or saved, but not in an organized fashion. Much of this knowledge is not written down; instead, it is stored within the heads of its employees. Knowledge management is the process of formalizing the capture, indexing, and storing of the company's noesis in order to benefit from the experiences and insights that the company has captured during its existence.

Summary

In this chapter, nosotros learned about the role that data and databases play in the context of data systems. Data is fabricated up of small facts and information without context. If yous give data context, and so you lot have information. Knowledge is gained when data is consumed and used for conclusion making. A database is an organized drove of related information. Relational databases are the most widely used type of database, where data is structured into tables and all tables must exist related to each other through unique identifiers. A database management organization (DBMS) is a software application that is used to create and manage databases, and can have the form of a personal DBMS, used by i person, or an enterprise DBMS that can be used by multiple users. A data warehouse is a special class of database that takes data from other databases in an enterprise and organizes it for analysis. Data mining is the process of looking for patterns and relationships in large data sets. Many businesses use databases, information warehouses, and data-mining techniques in order to produce business intelligence and gain a competitive advantage.

Study Questions

What is the difference between information, information, and knowledge?
Explain in your ain words how the data component relates to the hardware and software components of information systems.
What is the divergence between quantitative data and qualitative information? In what situations could the number 42 be considered qualitative information?
What are the characteristics of a relational database?
When would using a personal DBMS make sense?
What is the difference between a spreadsheet and a database? List three differences betwixt them.
Depict what the term normalization ways.
Why is it important to ascertain the data type of a field when designing a relational database?
Name a database y'all interact with oft. What would some of the field names be?
What is metadata?
Proper name three advantages of using a data warehouse.
What is data mining?

Exercises

Review the design of the Student Clubs database earlier in this chapter. Reviewing the lists of data types given, what data types would you assign to each of the fields in each of the tables. What lengths would you assign to the text fields?
Download Apache OpenOffice.org and use the database tool to open up the "Educatee Clubs.odb" file available here. Take some fourth dimension to learn how to modify the database structure and then run into if you tin can add the required items to support the tracking of faculty advisors, every bit described at the end of the Normalization section in the chapter. Hither is a link to the Getting Started documentation.
Using Microsoft Access, download the database file of comprehensive baseball game statistics from the website SeanLahman.com. (If you don't have Microsoft Access, you tin can download an abridged version of the file here that is compatible with Apache Open up Office). Review the construction of the tables included in the database. Come with three different information-mining experiments you would like to attempt, and explain which fields in which tables would have to be analyzed.
Do some original research and discover two examples of data mining. Summarize each example and then write well-nigh what the two examples have in common.
Acquit some independent research on the process of business organisation intelligence. Using at to the lowest degree 2 scholarly or practitioner sources, write a two-page newspaper giving examples of how business organization intelligence is being used.
Behave some contained research on the latest technologies being used for noesis management. Using at least ii scholarly or practitioner sources, write a two-page paper giving examples of software applications or new technologies being used in this field.