2.3: Phase 2 - Data Acquisition
In the data acquisition phase, you obtain the data for your GIS. Getting all the data together (and in a suitable format) is the most costly and time-consuming task of any GIS project. Most estimates suggest that 75 to 80 percent of your time is spent collecting, entering, cleaning, and converting data (phases 2 and 3 of this chapter). Before proceeding to data sources, we define the term “data” and look at various data issues, including accuracy, precision, and metadata. Figure 2.7 features a list of data acquisition tasks.

Figure 2.7: Key tasks of the data acquisition phase.
Data
Data are frequently called facts. This definition suggests that they are pure, but all data are selected for a particular purpose and are shaped by that purpose. The term “raw data” implies an even greater purity—an objective truth—but even the most objective scientist has beliefs and knowledge that underlie data collection. Data obtained for any project are preconceived.
You can classify data as either primary or secondary and as observable or non-observable. The geographer Frank Aldrich devised a simple data model (Figure 2.8) to depict these categories. Primary data (the light-colored inner circle) are measurements that you or your team collect, usually through experiments or fieldwork. Secondary data (the doughnut-shaped darker ring) are datasets collected by someone else, from experiments or fieldwork, for a purpose other than your own. Most researchers prefer primary data because they have not been previously conceived and shaped. Still, secondary datasets are tremendously valuable if you determine how and why they were collected and if your project can accept those preconceptions.

Figure 2.8: Aldrich’s data model.
You can also classify data as either observable or non-observable. Notice in the figure above that a vertical line bisects the primary and secondary circles. Observable data (to the left of the bisecting axis) are collected when someone or something directly observes the characteristic or behavior. Non-observable data (to the right of the bisecting axis) are collected by asking respondents questions in an interview or on a questionnaire; the data gatherer does not physically observe the characteristic or behavior.
These two classifications combine into four categories. Primary observable data are datasets you both collect and observe. Secondary observable data are datasets someone else collects and observes; satellite imagery is an example. Primary non-observable data are datasets you collect without directly observing the characteristic or behavior; surveys and interviews that you administer are examples. Secondary non-observable data are datasets that someone else collects without direct observation; census data fall into this category.
Data Evaluation
What is the quality of your data? Good datasets are both accurate and precise. There could be both spatial location problems and attribute errors. It is critical that you evaluate your data, especially your secondary datasets, because errors (if not corrected) can make a project’s results worthless. Do not be lulled into a false sense of security when you receive or construct a dataset. Instead, evaluate your data. To better understand the types of errors that occur in GIS datasets, we turn to the term “error” and two parts of its definition.
Errors are mistakes, and in the context of GIS databases they result from two things: inaccuracy and imprecision. Accuracy is the degree to which the data match true and accepted values. In a layer, there can be inaccuracies in the location of features and in their attributes. For example, features may be placed in the wrong location, omitted when they should be present, or entered where none exists.
Precision is the exactness of a measurement or value; it refers to the degree to which the value can be reproduced by similar data collection techniques. In other words, precision is a measure of exactness and repeatability. A feature's location might be recorded to within a few inches, or perhaps to an acceptable precision of 10 meters. Ten meters may not seem precise, but it is good enough for many projects.
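The distinction can be made concrete with a small numeric sketch. The readings below are invented for illustration: repeated measurements of a point are accurate if their average lands near the true value, and precise if they cluster tightly together, regardless of where that cluster sits.

```python
import statistics

# Hypothetical repeated GPS readings (meters east of a survey benchmark);
# the true offset is known to be 100.0 m.
true_value = 100.0
readings = [98.8, 99.1, 98.9, 99.0, 98.8]

# Accuracy: how close the measurements are, on average, to the true value.
mean_reading = statistics.mean(readings)
accuracy_error = abs(mean_reading - true_value)

# Precision: how repeatable the measurements are, regardless of the truth.
precision = statistics.stdev(readings)

print(f"accuracy error: {accuracy_error:.2f} m, precision (sd): {precision:.2f} m")
```

These readings are precise (they repeat to within about 0.13 m) but not accurate (their average sits about a meter from the true value); the reverse combination is equally possible.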
Many types of errors affect a GIS dataset’s quality. Some are obvious, but others are difficult to uncover. Learn to recognize error and decide on an acceptable degree of precision for your project. The following discussion looks at the types of errors one might uncover. Chapters 3 and 4 address how to identify and fix these errors.
Accuracy Errors:
- How old is the dataset? This is one of the most important questions to have answered because some datasets may be too old to be useful. If it exists, a metadata file, a file that describes a feature layer, is a good place to start. Metadata files are discussed later in this chapter.
- The dataset may have locational inaccuracies. As described above, features may be sloppily positioned, omitted, or placed where they do not exist. This is usually the result of faulty fieldwork and observations, input error, or problems associated with converting data. On this last point, the accurate conversion of hard-copy maps to digital form is a major challenge because processing errors occur during digitizing and scanning (defined later in this chapter).
- The attributes may be faulty due to keyboard error, faulty observation techniques, defective or non-calibrated instruments, or researcher bias.
- If you are using secondary data, discover how the digital dataset was created and whether it underwent any digital conversion. Conversion problems occur when processing data from different formats, projections, data models, and resolutions. There can be inconsistencies (even slight ones) during the translation that alter geographic position and (in the case of raster resampling) cell values.
Precision Errors:
- The dataset may have positional imprecision due to the scale of the original maps. A map at a scale of 1:24,000 shows finer detail than a smaller-scale map at 1:500,000. In a vector-based GIS, as you zoom into your map, the feature locations may look precise, but their precision is limited by the scale of the original map; they may not be as precise as they look. In a raster GIS, you can create the illusion of precision by resampling a low-resolution layer (large pixel size) to a finer resolution. The smaller pixels are capable of holding more detailed data, but the precision does not improve.
- Pixel resolution can also contribute to a dataset’s positional imprecision. As discussed in Chapter 1, if a layer has a large pixel size, it may poorly represent the positional precision of its features. Small and narrow features will swim within large pixels. Their precise placement is unknown.
- Positional imprecision can also occur if feature position is difficult to determine. Some features like streets and parcels are fairly easy to place, but there are features like soils, vegetation, and climate regimes that have fuzzy or less discrete borders. Some lines across a map are value judgments. The data may be accurate; they are just not precise.
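The scale effect in the first bullet can be quantified: a line on a map represents a ground width equal to its drawn width multiplied by the scale denominator. The 0.5 mm pen width below is an assumed value for illustration.

```python
def ground_width_m(scale_denominator, line_width_mm=0.5):
    """Ground distance (m) covered by a line of the given drawn width."""
    return line_width_mm / 1000.0 * scale_denominator

# The same pen stroke covers very different ground distances at each scale:
for denom in (24_000, 100_000, 500_000):
    print(f"1:{denom:>7,} -> {ground_width_m(denom):6.0f} m on the ground")
```

At 1:24,000 the stroke spans 12 m of ground; at 1:500,000 it spans 250 m, which is why positions digitized from small-scale maps cannot be more precise than their source.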
Metadata
Given these possible errors, you can understand the danger and uncertainty of using undocumented data. Metadata is a data quality document; its frequently repeated definition is “data about data” (although it is perhaps more accurate to say “information about data”). It describes the attributes and the locations of the features in the layer, giving you an impression of the dataset's accuracy and precision. Metadata also includes basic information about the dataset, such as a description and any restrictions on its use. A good metadata file should answer the following questions:
- What is the dataset’s age?
- What is the area covered by the dataset?
- Who created it?
- How was it constructed (digitizing, scanning, overlay, etc.)?
- What projection, coordinate system, and datum are used?
- What was the original map’s scale (if applicable)?
- How accurate and precise are the locations and attributes?
- What data model (vector or raster) does the layer use?
- How were the data checked (both location and attributes)?
- Why were the data compiled? What was their need or motivation?
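The checklist above lends itself to automation. The sketch below, with invented field names standing in for the questions, flags whichever entries a metadata record leaves unanswered.

```python
# Invented field names, one per question in the checklist above.
REQUIRED_FIELDS = [
    "date", "extent", "originator", "construction_method", "projection",
    "source_scale", "accuracy_report", "data_model", "qa_process", "purpose",
]

def missing_metadata(record):
    """Return the checklist fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# A partially documented (hypothetical) layer:
record = {
    "date": "1998-06-01",
    "extent": "California",
    "originator": "CaSIL",
    "projection": "UTM Zone 10N, NAD83",
    "data_model": "vector",
}
print("Unanswered:", missing_metadata(record))
```

Here five of the ten questions go unanswered, which would prompt a call to the data provider before the layer is trusted.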
After looking over the metadata file, you should ask yourself and your colleagues an additional question: is the data provider reliable? The presence and condition of the metadata might help answer this question.
Also within the metadata are data dictionaries that describe all of the features' attributes. As you saw in Figure 2.4, attributes frequently have odd, short field names, like Zblack_00, that are difficult to decode. Metadata describes these attributes more fully. In addition, the values that go into attribute fields are often stored as short codes or abbreviations rather than longer words, which are more prone to keyboard error. Data dictionaries decipher these codes and abbreviations. The metadata file in Figure 2.9 is an example. It documents a California weather station GIS layer from the California Spatial Information Library (CaSIL). It has been modified (shortened) for the purposes of this e-text.
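In code, a data dictionary behaves like a lookup table. The field names and value codes below are invented for illustration of how cryptic fields and coded values might be expanded for display:

```python
# Hypothetical data dictionary for a weather-station layer: cryptic field
# names and coded values mapped to their full meanings.
FIELD_NAMES = {
    "STN_ELEV": "Station elevation above sea level (feet)",
    "PRCP_AN":  "Mean annual precipitation (inches)",
    "CLIM_RGM": "Climate regime",
}
VALUE_CODES = {
    "CLIM_RGM": {"M": "Mediterranean", "D": "Desert", "A": "Alpine"},
}

def decode(field, value):
    """Expand a coded field name and value using the data dictionary."""
    name = FIELD_NAMES.get(field, field)
    value = VALUE_CODES.get(field, {}).get(value, value)
    return f"{name}: {value}"

print(decode("CLIM_RGM", "M"))
print(decode("PRCP_AN", 22.4))
```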
Metadata is important. Years ago, when GIS was in its infancy, there were no standards for data documentation and very little metadata existed. GIS layers were created with only a particular project's specifications in mind, and many details were lost when the personnel who created a dataset moved on. As GIS datasets became numerous and agencies began sharing data, a common set of specifications arose for describing GIS layers. A metadata file is now generally attached to, or closely associated with, a GIS layer.

Figure 2.9: A typical metadata file that provides information on both spatial and attribute data.
Despite its importance, metadata is still neglected. Creating it is time-consuming, but it is essential in an environment where datasets are shared and the background of the data must be known.
On-Line GIS Data
The Internet is a great place to start looking for data. Finding existing GIS datasets that serve your purpose and pass your specifications saves you time and money. A search may reveal multiple copies of what seems to be the same data, but check the details (examine the metadata) because minor differences might make one dataset better than another. Much base map data (countries, states, counties, major roads, rivers, township and range) exists on the Internet.
It would be convenient to retrieve all of your GIS datasets from the Internet, and although more and more data are available, the Internet will not provide you with everything you need. When searching the Internet, don’t just look for GIS files. Search for data that might be GIS compatible. Perhaps a spreadsheet can be modified and linked to an existing geographic layer. CAD files, aerial photographs, and satellite imagery might also be relevant. Finding good digital datasets frees you from collecting and entering the data yourself.
Other Sources of GIS Data
Most likely, you will contact GIS personnel at various government agencies, commercial data sources, and organizations about their data holdings. Ask questions about each dataset's accuracy and completeness. If you think their datasets might be helpful, ask for permission to obtain and use the data. Frequently, a short agreement is written and signed that allows you to use the data under certain conditions.
Many data companies modify “public” data to create a “value-added” product that you can purchase and load directly into your GIS. “Value-added” datasets usually originate from a government agency or an organization that creates the basic GIS dataset, but a commercial company obtains the data and “improves” it by adding attributes or improving its spatial precision. The commercial company can then sell the “value-added” portion of the data. Many of these datasets can also be obtained over the Internet.
Converting from One GIS Format to Another
GIS data obtained from the Internet or other sources often require extensive preprocessing before they work with your other GIS datasets. Initially, a newly acquired dataset may require “extracting” (unzipping). Datasets are often compressed to make them smaller for storage on a CD or quicker to download. If the acquired file's type is ZIP, TAR, or GZ, it must be extracted, which can be done with free or inexpensive software (such as WinZip). Extraction can be more complicated when compressed files are nested within a compressed file; this requires extracting the outer file and then extracting the nested files.
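This extract-then-extract-again workflow can be scripted rather than done by hand. The sketch below, using only Python's standard library, unpacks a ZIP or TAR archive and then recurses into any compressed files found inside it (the suffix list and example paths are illustrative):

```python
import pathlib
import tarfile
import zipfile

ARCHIVE_SUFFIXES = {".zip", ".tar", ".tgz", ".gz"}

def extract_all(archive, dest):
    """Extract an archive into dest, then recursively extract any
    compressed files nested inside it."""
    archive = pathlib.Path(archive)
    dest = pathlib.Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    else:
        return  # not a format this sketch handles
    # Each nested compressed file is unpacked into its own subdirectory.
    for nested in list(dest.rglob("*")):
        if nested.is_file() and nested.suffix.lower() in ARCHIVE_SUFFIXES:
            extract_all(nested, nested.with_suffix(""))

# e.g. extract_all("counties.zip", "counties")
```

Real archives vary (a bare .gz that is not a tar, for instance, needs separate gzip handling that this sketch skips), so treat it as a starting point rather than a complete tool.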
GIS datasets are typically stored in one of the leading GIS software formats (frequently as a shapefile) or in a format specified by the U.S. federal government. Fortunately, most GIS packages read or convert many of the most common GIS formats. At times, however, you may need access to “third party” software to read the data and export it into a format that your GIS program reads.
Even after the datasets are extracted and converted into your GIS format, they will require further processing. Many types of data preprocessing, manipulation, and conversion are discussed in Chapters 3 and 4.


