6.4: Data Classification

The process of data classification groups raw data into predefined classes, or bins. These classes may be represented on a map by unique symbols or, in the case of choropleth maps, by a unique color or hue (for more on color and hue, see Chapter 8 “Geospatial Analysis II: Raster Data,” Section 8.1 “Basic Geoprocessing with Rasters”). Choropleth maps are thematic maps shaded with graduated colors to represent some statistical variable of interest. Although classification seems straightforward, a cartographer can choose among several different classification methodologies, each of which breaks the attribute values down along a different interval pattern. Monmonier (1991) noted that the choice of classification methodology can significantly affect the interpretability of a given map, because the visual pattern presented is easily distorted by manipulating the interval breaks of the classification. In addition to the methodology employed, the number of classes chosen to represent the feature of interest also significantly affects the viewer's ability to interpret the mapped information. Including too many classes can make a map look overly complex and confusing. On the other hand, too few classes can oversimplify the map and hide important data trends. Most effective classifications use approximately four to six distinct classes.

While problems potentially exist with any classification technique, a well-constructed choropleth increases the interpretability of any given map. The following discussion outlines the classification methods commonly available in geographic information system (GIS) software packages. In these examples, we will use the US Census Bureau’s population statistics for US counties in 1997. These data are freely available on the US Census website (http://www.census.gov).

Equal Interval Classification Method

The equal interval (or equal step) classification method divides the range of attribute values into equally sized classes. The user determines the number of classes. The equal interval classification method is best used for continuous datasets such as precipitation or temperature. For example, in the case of the 1997 Census Bureau data, county population values across the United States range from 40 (Yellowstone National Park County, MT) to 9,184,770 (Los Angeles County, CA), for a total range of 9,184,770 − 40 = 9,184,730. If we decide to classify these data into five equal interval classes, each class will cover a population spread of 9,184,730 / 5 = 1,836,946. The advantage of the equal interval classification method is that it creates a legend that is easy to interpret and present to a non-technical audience. The primary disadvantage is that some datasets will end up with most data values falling into only one or two classes, while few or no values occupy the other classes. For example, as shown in the figure “Equal Interval Classification for 1997 US County Population Data,” almost all the counties are assigned to the first (yellow) bin.
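The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the routine used by any particular GIS package; the function names `equal_interval_breaks` and `classify` are hypothetical, and only the minimum and maximum populations come from the text.

```python
def equal_interval_breaks(minimum, maximum, n_classes):
    """Return the n_classes + 1 boundaries of an equal interval scheme."""
    width = (maximum - minimum) / n_classes
    return [minimum + i * width for i in range(n_classes + 1)]


def classify(value, breaks):
    """Return the 0-based class index of value.

    Each class includes its lower boundary; the overall maximum is
    assigned to the last class.
    """
    n_classes = len(breaks) - 1
    for i in range(1, n_classes):
        if value < breaks[i]:
            return i - 1
    return n_classes - 1


# Minimum and maximum 1997 county populations quoted in the text.
breaks = equal_interval_breaks(40, 9_184_770, 5)
print(breaks)
# [40.0, 1836986.0, 3673932.0, 5510878.0, 7347824.0, 9184770.0]

print(classify(85_108, breaks))     # 0: a county of 85,108 people lands in the first class
print(classify(9_184_770, breaks))  # 4: Los Angeles County lands in the last class
```

Because the first class alone spans populations from 40 to roughly 1.8 million, it is easy to see why nearly every county ends up in the same bin.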

Quantile Classification Method

The quantile classification method places equal numbers of observations into each class. This method is best for data that are evenly distributed across their range. Figure “Quantiles” shows the quantile classification method with five total classes. As there are 3,140 counties in the United States, each class in the quantile classification methodology will contain 3,140 / 5 = 628 different counties. The advantage of this method is that it often excels at emphasizing the relative position of the data values (e.g., which counties rank in the top 20 percent of all counties by population). The primary disadvantage of the quantile classification methodology is that features placed within the same class can have wildly differing values, particularly if the data are not evenly distributed across their range. The opposite can also happen: values that differ only slightly can be placed into different classes, suggesting a larger difference than actually exists in the dataset.
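A rank-based sketch of this idea follows. It is one simple way to form quantile classes, assuming values are assigned purely by their sorted position; the function name `quantile_classes` and the sample populations are hypothetical (only the smallest and largest values are taken from the text).

```python
from collections import Counter


def quantile_classes(values, n_classes):
    """Assign each value a 0-based class so that classes hold (nearly) equal counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, idx in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 evenly onto classes 0..n_classes-1.
        classes[idx] = rank * n_classes // len(values)
    return classes


# Hypothetical county populations, for illustration only.
populations = [40, 1_200, 5_300, 9_800, 15_000,
               22_000, 41_000, 87_000, 350_000, 9_184_770]
classes = quantile_classes(populations, 5)
print(classes)           # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(Counter(classes))  # every class holds exactly two values
```

The last class pairs 350,000 with 9,184,770, which illustrates the disadvantage described above: equal counts per class can hide very large differences in value. Note that this sketch may split tied values across two classes; software packages handle ties in different ways.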

Natural Breaks (or Jenks) Classification Method

The natural breaks (or Jenks) classification method uses an algorithm to group values into classes that are separated by distinct breakpoints. This method is best used with data that are unevenly distributed but not skewed toward either end of the distribution. The “Natural Breaks” figure shows the natural breaks classification for the 1997 US county population data. One potential disadvantage is that this method can create classes whose numeric ranges vary widely.

In this example, class 1 spans a range of just over 150,000, while class 5 spans a range of over 6,000,000. It is often helpful to “tweak” the classes following the classification effort or to change the labels to an ordinal scale such as “small, medium, or large.” The latter, in particular, can result in a map that is more comprehensible to the viewer. A second disadvantage is that comparing two or more maps created with the natural breaks classification method can be challenging because the class ranges are particular to each dataset. In these cases, datasets that are not overly disparate may appear so in the output graphic.
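The text does not spell out the algorithm, so the following is only a sketch under a stated assumption: that natural breaks are chosen to minimize the total within-class sum of squared deviations from each class mean, the criterion commonly associated with Jenks optimization. It uses a straightforward dynamic program that is practical only for small datasets; the function name `natural_breaks` and the sample values are hypothetical.

```python
def natural_breaks(values, n_classes):
    """Split values into n_classes contiguous groups (after sorting) that
    minimize the total within-class sum of squared deviations."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums let us compute the squared deviation of any slice in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: lowest cost of splitting data[:j] into k classes.
    # split[k][j]: start index of the last class in that best split.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk the split table backwards to recover the groups.
    groups, j = [], n
    for k in range(n_classes, 0, -1):
        i = split[k][j]
        groups.append(data[i:j])
        j = i
    return groups[::-1]


# Hypothetical values with three visible clusters.
groups = natural_breaks([12, 1, 50, 3, 10, 52, 2, 11], 3)
print(groups)  # [[1, 2, 3], [10, 11, 12], [50, 52]]
```

The resulting classes have different widths and different counts, which matches the behavior described above: the breaks follow gaps in the data rather than fixed intervals.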

Standard Deviation Classification Method

Finally, the standard deviation classification method forms each class by adding and subtracting the standard deviation from the mean of the dataset. The method is best suited for data that conform to a normal distribution. In the county population example, the mean is 85,108, and the standard deviation is 277,080. Therefore, as shown in the “Standard Deviation” figure, the central class contains values within 0.5 standard deviations of the mean, while the upper and lower classes contain values 0.5 or more standard deviations above or below the mean.
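As a rough sketch, the class boundaries for this example can be computed from the mean and standard deviation quoted above. The assumption here is that boundaries sit at the mean plus or minus 0.5, 1.5, and so on standard deviations, so that the central class is centered on the mean; the function name `std_dev_breaks` and its `n_steps` parameter are hypothetical.

```python
import bisect


def std_dev_breaks(mean, std_dev, n_steps=2):
    """Boundaries at mean +/- 0.5, 1.5, ... standard deviations.

    n_steps boundaries are placed on each side of the mean, which
    yields 2 * n_steps + 1 classes.
    """
    offsets = [k + 0.5 for k in range(-n_steps, n_steps)]
    return [mean + offset * std_dev for offset in offsets]


def classify(value, breaks):
    """Return the 0-based class index of value given sorted boundaries."""
    return bisect.bisect_right(breaks, value)


# Mean and standard deviation of the county populations quoted in the text.
MEAN, STD_DEV = 85_108, 277_080
breaks = std_dev_breaks(MEAN, STD_DEV)
print(breaks)  # [-330512.0, -53432.0, 223648.0, 500728.0]

print(classify(40, breaks))         # 2: the central class
print(classify(9_184_770, breaks))  # 4: more than 1.5 standard deviations above the mean
```

Both lower boundaries are negative, so under these assumptions no county can fall in the two lowest classes. This is one sign that the county population data do not follow a normal distribution.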

In conclusion, several viable data classification methodologies can be applied to choropleth maps. Although other methods are available (e.g., equal area, optimal), those outlined here represent the most commonly used and widely available. Each of these methods presents the data differently and highlights different aspects of the trends in the dataset. Indeed, the classification methodology and the number of classes utilized can result in wildly varying interpretations of the dataset. Therefore, it is incumbent upon you, the cartographer, to select the method that best suits the needs of the study and presents the data in as meaningful and transparent a way as possible.
