3.3: Metadata


Looking at the contents of the file, we can see that it contains data about the cities of Los Angeles, London, and Singapore. A comma separates each field or attribute, and the file also contains a header row that tells us about the data contained in each column. Or does it? What does the column “sun” refer to? Is it the number of sunny days this year, last year, annually, or when? What about “temp”? Does this refer to the average daytime, evening, or annual temperature? For that matter, how is temperature measured? In Celsius? Fahrenheit? Kelvin? The column “precip” refers to precipitation, but again, what are the units or time frames for such measures and data? Finally, where did these data come from? Who collected them, when were they collected, and for what purpose?

Incredibly, such a small text file can lead to so many questions. Now imagine extending the example to a file with one hundred records on ten variables, one thousand records on one hundred variables, or ten thousand records on one thousand variables. Through this simple example, several general but central issues related to data emerge. Such issues range from the relatively mundane naming conventions used to identify individual records (i.e., rows) and distinguish one field (i.e., column) from another, to the issue of documenting what data are included in a given file, when the data were collected, for what purpose they are to be used, who collected them, and, of course, where they came from.

The previous simple text file illustrates how we cannot and should not take data and information for granted. It also highlights two important concepts regarding the source of data and the contents of data files. First, data can be put into two distinct categories regarding data sources. The first category is called primary data. Primary data refer to data collected directly or on a firsthand basis. For example, if you wanted to examine the variability of local temperatures in May, and you recorded the temperature at noon every day in May, you would be constructing a primary data set. Conversely, secondary data refer to data collected by someone else or another party. For instance, we use secondary data when working with census or economic data collected and distributed by the government.

Several factors influence the decision to construct and use a primary data set rather than a secondary data set. The cost of acquiring data, measured in money, availability, and time, is an essential factor. The data acquisition and integration phase of most geographic information system (GIS) projects is often the most time-consuming. In other words, locating, obtaining, and putting together the data for a GIS project, whether you collect the data yourself or use secondary data, may take up most of your time. Of course, depending on the purpose, availability, and need, it may not be necessary to construct an entirely new data set (i.e., a primary data set). Indeed, considering the vast amounts of data and information publicly available, for example, via the Internet, the cost and time savings of using secondary data often offset any benefits associated with primary data collection.

Now that we understand the difference between primary and secondary data and the rationale for choosing one over the other, how do we find the data and information we need? As noted earlier, there is an incredibly vast and growing amount of data and information available to us, and performing an online search for “deforestation data” will return hundreds, if not thousands, of results. To overcome this data and information overload, we need to turn to even more data. We are looking for a special kind of data called metadata. Simply defined, metadata is data about data. At one level, a header row in a simple text file like those discussed in the previous section is analogous to metadata. The header row provides data (e.g., names and labels) about the subsequent rows of data.

However, header rows may themselves need additional explanation, as previously illustrated. Furthermore, when working with or searching through several data sets, it can be tedious or even impossible to open every file to determine its contents and usability. Enter metadata. Many files, particularly secondary data sets, come with a metadata file. These metadata files contain items such as general descriptions of the contents of the file, definitions for the various terms used to identify records (rows) and fields (columns), the range of values for fields, the quality or reliability of the data and measurements, how the data were collected, when they were collected, and who collected them. Though not all data are accompanied by metadata, it is easy to see why metadata is essential and valuable when searching for secondary data and when constructing primary data that may be shared in the future.
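As a rough illustration of how a metadata file complements a header row, the following Python sketch checks whether every column named in a header has a corresponding metadata entry. The column name "city", the dictionary layout, and all descriptions are assumptions made up for this example; the placeholder text stands in for documentation that only the data provider could supply.

```python
# Hypothetical metadata for a small weather file. Every description and unit
# below is a placeholder, not real documentation of any actual data set.
metadata = {
    "city": {"description": "Name of the city", "units": None},
    "temp": {"description": "PLACEHOLDER: which temperature, over what period",
             "units": "PLACEHOLDER: Celsius, Fahrenheit, or Kelvin"},
    "precip": {"description": "PLACEHOLDER: precipitation over what period",
               "units": "PLACEHOLDER"},
}


def undocumented_fields(header, metadata):
    """Return the header fields that have no entry in the metadata."""
    return [name for name in header if name not in metadata]


# The header row of the (assumed) weather file.
header = ["city", "sun", "temp", "precip"]
missing = undocumented_fields(header, metadata)
print(missing)  # ['sun']
```

Here the "sun" column is flagged because nothing in the metadata explains it, which is exactly the kind of gap that makes a data set hard to reuse.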

Just as simple files come in all shapes, sizes, and formats, so do metadata. As the amount and availability of data and information increase each day, metadata plays a critical role in making sense of it all. The class of metadata we are most concerned with when working with a GIS is called geospatial metadata. As the name suggests, geospatial metadata is data about geographical and spatial data. According to the Federal Geographic Data Committee (FGDC) in the United States (see http://www.fgdc.gov), “Geospatial metadata are used to document digital geographic resources such as GIS files, geospatial databases, and earth imagery. A geospatial metadata record includes core library catalog elements such as Title, Abstract, and Publication Data; geographic elements such as Geographic Extent and Projection Information; and database elements such as Attribute Label Definitions and Attribute Domain Values.” This definition of geospatial metadata is about improving transparency regarding data and promoting standards. Take a few moments to explore and examine the contents of a geospatial metadata file that conforms to the FGDC standard.
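To make the idea of a geospatial metadata record more concrete, here is a minimal Python sketch that reads a title, an abstract, and a geographic extent from a small XML record. The record itself is invented and heavily abbreviated. The element names (idinfo, citeinfo, descript, spdom, bounding, and so on) are assumed to follow the short tag names of the FGDC metadata standard and should be checked against the standard before being relied on.

```python
import xml.etree.ElementTree as ET

# A made-up, heavily abbreviated record. Real records contain many more
# elements; the title, abstract, and coordinates are placeholders only.
record = """
<metadata>
  <idinfo>
    <citation><citeinfo>
      <title>Hypothetical example data set</title>
    </citeinfo></citation>
    <descript>
      <abstract>Placeholder abstract, for illustration only.</abstract>
    </descript>
    <spdom><bounding>
      <westbc>-10.0</westbc>
      <eastbc>10.0</eastbc>
      <northbc>5.0</northbc>
      <southbc>-5.0</southbc>
    </bounding></spdom>
  </idinfo>
</metadata>
"""

root = ET.fromstring(record)

# Library catalog elements: title and abstract.
title = root.findtext("idinfo/citation/citeinfo/title")
abstract = root.findtext("idinfo/descript/abstract")

# Geographic element: the bounding coordinates, converted to numbers.
bounding = root.find("idinfo/spdom/bounding")
extent = {edge.tag: float(edge.text) for edge in bounding}

print(title)
print(abstract)
print(extent)
```

Because the record is structured the same way from one data set to the next, a program can search many metadata files for a title or an extent without opening the data files themselves.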

Standards refer to widely promoted, accepted, and followed rules and practices. Given the range and variability of data and data sources, identifying a common thread to locate and understand the contents of any given file can be a challenge. However, just as the rules of grammar and mathematics provide the foundations for communication and numeric calculations, metadata standards provide similar frameworks for working with and sharing data and information from various sources.

The central point behind metadata is that it facilitates data and information sharing. Within the context of large organizations such as governments, data and information sharing can eliminate redundancies and increase efficiencies. Moreover, access to data and information promotes the integration of different data to improve analyses, inform decisions, and shape policy. The role that metadata, and geospatial metadata in particular, plays in GIS is critical and offers enormous benefits in terms of cost and time savings. The sharing, widespread distribution, and integration of various geographic and nongeographic data and information enabled by metadata drive some of the most exciting and compelling innovations in GIS and the broader geospatial information technology community. More important, widespread access to, distribution of, and sharing of geographic data and information have essential social costs and benefits and yield better analyses and more informed decisions.

Files and Formats

When we collect data about our homes, rainforests, or anything else, we usually need to put them somewhere. Though we may scribble numbers and measures on the back of an envelope or write them down on a pad of paper, if we want to update, share, analyze, or map them in the future, it is often helpful to record them in digital form so a computer can read them. Though we will not bother ourselves with the bits and bytes of computing, it is necessary to discuss some fundamental elements of computing that are both relevant and required when learning and working with a GIS.

One of the most common elements of working with computers and computing is the file. Files in a computer can contain any number of things, from a complex set of instructions (e.g., a computer program) to a list of numbers and letters (e.g., an address book). Furthermore, computer files come in many sizes and types. One of the clues we can use to distinguish one file from another is the file extension. A file extension refers to the letters that follow the period (“.”) after the file’s name. The table below contains some of the most common file extensions and the types of files with which they are associated.

filename.txt Simple text file

filename.doc Microsoft Word document

filename.pdf Adobe portable document format

filename.jpg Compressed image file

filename.tif Tagged image format

filename.html Hypertext markup language (used to create websites)

filename.xml Extensible markup language

filename.zip Zipped/compressed archive
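As a small illustration, the following Python sketch looks up a file's type from its extension using the entries in the table above. The function name describe_file and the dictionary FILE_TYPES are made up for this example. Note that an extension is only a clue: the sketch does not inspect the contents of any file.

```python
from pathlib import Path

# Descriptions taken from the table of common extensions, keyed by
# lowercase extension (including the leading period).
FILE_TYPES = {
    ".txt": "Simple text file",
    ".doc": "Microsoft Word document",
    ".pdf": "Adobe portable document format",
    ".jpg": "Compressed image file",
    ".tif": "Tagged image format",
    ".html": "Hypertext markup language",
    ".xml": "Extensible markup language",
    ".zip": "Zipped/compressed archive",
}


def describe_file(filename):
    """Look up a file's type from its extension, ignoring letter case."""
    extension = Path(filename).suffix.lower()
    return FILE_TYPES.get(extension, "Unknown file type")


print(describe_file("filename.txt"))  # Simple text file
print(describe_file("cities.TIF"))    # Tagged image format
print(describe_file("notes.xyz"))     # Unknown file type
```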

Some computer programs may be able to read or work with only specific file types, while others are more adept at reading multiple file formats. As you begin to work more with information technology and GIS, you will realize that familiarity with different file types is essential. In addition, learning how to convert or export one file type to another is also a beneficial and valuable skill to obtain. In this regard, recognizing and knowing how to identify different and unfamiliar file types will undoubtedly increase your proficiency with computers and GIS.

Of the numerous file types, one of the most common and widely accessed is the simple text file, also called a plain text file or just a text file. Simple text files can be read by a wide range of programs, including word processors, spreadsheet and database programs, and web browsers. Often ending with the extension “.txt” (i.e., filename.txt), text files contain no special formatting (e.g., bold, italic, underlining) and contain only alphanumeric characters. In other words, text files are not suited to storing images or sophisticated graphics. Text files, however, are ideal for recording, sharing, and exchanging data because most computers and operating systems can recognize and read simple text files with programs called text editors.

When a text file contains data that are organized or structured in some fashion, it is sometimes called a flat file (but the file extension remains the same, i.e., .txt). Flat files are organized in a tabular format, line by line. In other words, each line or row of the file contains one and only one record. So, if we collected height measurements on three people, Tim, Sarah, and Maria, the file might look something like this:

Name Height

Tim 6’1″

Sarah 5’7″

Maria 5’5″

Each row corresponds to one and only one record, observation, or case. There are two other essential elements to know about this file. First, note that the first row does not contain any data; instead, it describes the data contained in each column. When the first row of a file contains such descriptors, it is referred to as a header row or just a header. Columns in a flat file are also called fields, variables, or attributes. For example, “Height” is the attribute, field, or variable that we are interested in, and the observations or cases in our data set are “Tim,” “Sarah,” and “Maria.” In short, rows are for records; columns are for fields.
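The distinction between the header row and the records can be sketched in Python with the standard csv module. The example below assumes the height file is tab-delimited and, to stay self-contained, reads it from a string rather than from a .txt file on disk.

```python
import csv
import io

# The height file written as a tab-delimited string ("\t" is the tab
# character). In practice the same text would be read from a .txt file.
flat_file = "Name\tHeight\nTim\t6’1″\nSarah\t5’7″\nMaria\t5’5″\n"

reader = csv.DictReader(io.StringIO(flat_file), delimiter="\t")
fields = reader.fieldnames   # column names, taken from the header row
records = list(reader)       # one dictionary per remaining row

print(fields)                # ['Name', 'Height']
print(len(records))          # 3
print(records[0]["Height"])  # 6’1″
```

The header row supplies the field names, and each remaining row becomes one record whose values are looked up by field name.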

The second, unseen but critical, element of the file is the spacing between columns or fields. In the example, the “Name” column appears to be separated from the “Height” column by a space. Upon closer inspection, however, note how the initial values of the “Height” column are aligned. If a single space were used to separate each column, the values in the “Height” column would not be aligned, because the names differ in length. In this case, a tab is being used to separate the columns of each row. The delimiter or separator is the character used to separate columns within a flat file. Though any character can be used as a delimiter, the most common delimiters are the tab, the comma, and a single space. The following are examples of each.

Tab-Delimited

Name	Height

Tim	6’1″

Sarah	5’7″

Maria	5’5″

Single-Space-Delimited

Name Height

Tim 6’1″

Sarah 5’7″

Maria 5’5″

Comma-Delimited

Name, Height

Tim, 6’1″

Sarah, 5’7″

Maria, 5’5″

Knowing the delimiter of a flat file is essential because it enables us to distinguish and separate the columns efficiently and without error. Sometimes such files are referred to by their delimiters, such as a “comma-separated values” file or a “tab-delimited” file.
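The following Python sketch shows why the delimiter matters. It parses the same two lines written with a tab, a single space, and a comma, and then shows what happens when the wrong delimiter is assumed. The helper name parse_flat_file is made up for this example; the skipinitialspace option drops the space that follows each comma in the comma-delimited version.

```python
import csv
import io


def parse_flat_file(text, delimiter):
    """Split each line of a flat file into columns on the given delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        skipinitialspace=True)
    return [row for row in reader]


# The same header and record written with three different delimiters.
tab_text = "Name\tHeight\nTim\t6’1″\n"
space_text = "Name Height\nTim 6’1″\n"
comma_text = "Name, Height\nTim, 6’1″\n"

rows_tab = parse_flat_file(tab_text, "\t")
rows_space = parse_flat_file(space_text, " ")
rows_comma = parse_flat_file(comma_text, ",")

# With the correct delimiter, all three files yield the same columns.
print(rows_tab == rows_space == rows_comma)  # True

# With the wrong delimiter, each line collapses into a single column.
print(parse_flat_file(comma_text, "\t"))
# [['Name, Height'], ['Tim, 6’1″']]
```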

The same general format is applied when recording and working with geographic data. Rows are reserved for records (in the case of geographic data, locations), and columns or fields are used for the attributes or variables associated with each location. For example, the following tab-delimited flat file contains data for three places (i.e., countries) and three attributes or characteristics of each country (i.e., population, language, continent), as noted by the header.

Country Population Languages Continent

France 65,000,000 French Europe

Brazil 192,000,000 Portuguese South America

Jordan 9,531,712 Arabic Southwest Asia
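As a sketch of how such a file might be used, the Python example below reads the country table as tab-delimited text and converts the population values to numbers. Because the tab is the delimiter, the commas inside the population figures do not split the column; they are removed only when the text is converted to an integer. The data are those of the table above, read from a string to keep the example self-contained.

```python
import csv
import io

# The country file as tab-delimited text ("\t" is the tab character).
country_file = (
    "Country\tPopulation\tLanguages\tContinent\n"
    "France\t65,000,000\tFrench\tEurope\n"
    "Brazil\t192,000,000\tPortuguese\tSouth America\n"
    "Jordan\t9,531,712\tArabic\tSouthwest Asia\n"
)

records = list(csv.DictReader(io.StringIO(country_file), delimiter="\t"))

# Convert the population text (with thousands separators) to integers.
for record in records:
    record["Population"] = int(record["Population"].replace(",", ""))

# Each record is one location; each key is one attribute of that location.
largest = max(records, key=lambda r: r["Population"])
print(largest["Country"])        # Brazil
print(records[2]["Population"])  # 9531712
```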

Files like those presented here are the building blocks of the various tables, charts, reports, graphs, and other visualizations that we see online, in print, and on television every day. They are also vital components of GIS maps and geographic representations. Rarely, if ever, will you work with one and only one file or file type. Often, especially when working with GIS, you will work with multiple files. Such a grouping of multiple files is called a database. Since the files within a database may be of different sizes, shapes, and even formats, we need a system that allows us to work with, update, edit, integrate, share, and display the various data within the database. Such a system is referred to as a database management system (DBMS). Databases and DBMSs are crucial to GIS, and a later chapter is dedicated to them. A geodatabase is a collection of geographic data contained within a standard file system.