17.4: Datasets
In pandas, a DataFrame can be thought of as a collection of Series objects that share a common index, which keeps the data aligned across rows. Each Series is an efficient container for a one-dimensional sequence of values, typically a time series or other sequentially indexed data, as is common in finance, economics, and simple observational studies.
Similarly, an Xarray Dataset can be thought of as a collection of DataArrays that share common coordinates, just as a pandas DataFrame is a collection of Series. A Dataset allows multiple DataArrays to be aligned and operated on jointly across their shared coordinates, which makes it well suited to multidimensional data such as climate model output or satellite observations.
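The analogy can be made concrete with a minimal sketch. The small table below and the names `df`, `column`, `ds_small`, and `variable` are illustrative assumptions, not part of the original example; `DataFrame.to_xarray` is used only to show how the two structures correspond.

```python
import pandas as pd
import xarray as xr

# A DataFrame: each column is a Series, and all columns share one index.
df = pd.DataFrame(
    {"temperature": [280.0, 285.0, 290.0], "humidity": [0.40, 0.55, 0.70]},
    index=pd.Index([10, 20, 30], name="latitude"),
)
column = df["temperature"]
print(isinstance(column, pd.Series))        # True

# Converting to xarray turns the index into a coordinate and each
# column into a data variable of the resulting Dataset.
ds_small = df.to_xarray()
variable = ds_small["temperature"]
print(isinstance(variable, xr.DataArray))   # True
print(sorted(ds_small.data_vars))           # ['humidity', 'temperature']
print(list(ds_small.coords))                # ['latitude']
```

The shared index of the DataFrame plays the same role as the shared `latitude` coordinate of the Dataset.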
Here is one example of how you would create a Dataset:
import numpy as np
import pandas as pd
import xarray as xr
# Define the coordinates
latitudes = np.linspace(-90, 90, 181) # 181 points from South to North Pole
longitudes = np.linspace(-180, 180, 361) # 361 points for full longitude range
pressure_levels = np.array([1000, 850, 700, 500, 300, 200, 100]) # in hPa
times = pd.date_range('2005-01-01', '2005-12-01', freq='MS') # Monthly intervals for 2005
# Generate random data
temperature_data = np.random.rand(len(times), len(pressure_levels), len(latitudes), len(longitudes))
precipitation_data = np.random.rand(len(times), len(pressure_levels), len(latitudes), len(longitudes))
# Create the xarray Dataset
ds = xr.Dataset({
    'temperature': (['time', 'pressure', 'latitude', 'longitude'], temperature_data),
    'precipitation': (['time', 'pressure', 'latitude', 'longitude'], precipitation_data)
}, coords={
    'time': times,
    'pressure': pressure_levels,
    'latitude': latitudes,
    'longitude': longitudes
})
In this example, the `xr.Dataset` constructor is used to create a new Dataset. The first argument is a dictionary that maps each variable name to a tuple of dimension names and a data array. The `coords` argument is a dictionary that assigns coordinate labels to the dimensions. The `times` array generated with `pd.date_range` provides monthly timestamps for 2005 (the first day of each month), while `latitudes` and `longitudes` are evenly spaced at one-degree intervals across the globe, and `pressure_levels` holds typical atmospheric pressure levels in hPa.
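Once a Dataset has been built, its structure can be inspected directly. The following is a minimal sketch on a deliberately tiny grid. The name `demo` and the reduced coordinate arrays are assumptions made so the example runs instantly, and the pressure dimension is omitted for brevity.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A deliberately small grid (hypothetical values, not the full global grid).
times = pd.date_range("2005-01-01", "2005-03-01", freq="MS")  # 3 month starts
latitudes = np.array([-30.0, 0.0, 30.0])
longitudes = np.array([0.0, 90.0, 180.0, 270.0])

rng = np.random.default_rng(0)
shape = (len(times), len(latitudes), len(longitudes))
dims = ["time", "latitude", "longitude"]

demo = xr.Dataset(
    {
        "temperature": (dims, rng.random(shape)),
        "precipitation": (dims, rng.random(shape)),
    },
    coords={"time": times, "latitude": latitudes, "longitude": longitudes},
)

# Dimension lengths, data variables, and coordinates.
print(demo.sizes["time"])        # 3
print(demo.sizes["longitude"])   # 4
print(sorted(demo.data_vars))    # ['precipitation', 'temperature']
print(sorted(demo.coords))       # ['latitude', 'longitude', 'time']
```

Printing `demo` itself gives a summary of the same information in one view.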
Accessing the variables within an Xarray Dataset is straightforward and works much like a Python dictionary. Each variable in the dataset is an Xarray DataArray, which can be retrieved using the variable name as the key. For example, to access the 'temperature' variable from the dataset `ds` created in the previous example, use `ds['temperature']` or `ds.temperature`. Either form returns the DataArray for temperature, complete with its dimensions, coordinates, and attributes. You can similarly access the 'precipitation' DataArray using `ds['precipitation']` or `ds.precipitation`. These variables can be handled independently or together with other variables in the dataset, depending on the analysis required. This simple variable access is part of what makes Xarray well suited to the high-dimensional datasets typical of fields like atmospheric science.
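The two access styles can be compared side by side. This is a minimal sketch in which `mini` is a hypothetical miniature dataset standing in for `ds`, so the example is self-contained.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature dataset standing in for `ds`.
rng = np.random.default_rng(0)
dims = ["latitude", "longitude"]
mini = xr.Dataset(
    {
        "temperature": (dims, rng.random((2, 3))),
        "precipitation": (dims, rng.random((2, 3))),
    },
    coords={"latitude": [0.0, 10.0], "longitude": [0.0, 120.0, 240.0]},
)

by_key = mini["temperature"]      # dictionary-style access
by_attr = mini.temperature        # attribute-style access

print(type(by_key).__name__)      # DataArray
print(by_key.identical(by_attr))  # True
print(by_key.dims)                # ('latitude', 'longitude')
print(by_key.shape)               # (2, 3)

# Passing a list of names returns a smaller Dataset instead of a DataArray.
subset = mini[["temperature"]]
print(type(subset).__name__)      # Dataset
```

Dictionary-style access is the safer choice when a variable name is not a valid Python identifier or collides with an existing Dataset method or attribute.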
When you perform a calculation on an Xarray Dataset, the operation propagates to every data variable it contains. This is particularly convenient for dataset-wide adjustments or analyses. Suppose we want to add one to every data point in the dataset. The command `ds = ds + 1` does this by adding one to each data variable in `ds`, while the coordinates are left unchanged.
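A small sketch of this propagation, using a hypothetical dataset `mini` with constant values so the effect is easy to verify by eye:

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with constant values in each variable.
dims = ["latitude", "longitude"]
mini = xr.Dataset(
    {
        "temperature": (dims, np.zeros((2, 3))),
        "precipitation": (dims, np.full((2, 3), 5.0)),
    },
    coords={"latitude": [0.0, 10.0], "longitude": [0.0, 120.0, 240.0]},
)

shifted = mini + 1  # applied to every data variable

print(float(shifted["temperature"].max()))    # 1.0
print(float(shifted["precipitation"].min()))  # 6.0
print(shifted["latitude"].values.tolist())    # [0.0, 10.0] (coordinates unchanged)

# Reductions propagate the same way: one call reduces every variable.
zonal_mean = mini.mean(dim="longitude")
print(zonal_mean["precipitation"].values.tolist())  # [5.0, 5.0]
```

The original `mini` is not modified by `mini + 1`; the result is a new Dataset, which is why the text reassigns it with `ds = ds + 1`.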
Coordinate-based selections are likewise applied to every variable. If you select a subset of data along a specific coordinate, say a particular range of latitudes, a call such as `ds.sel(latitude=slice(10, 20))` extracts that slice from all variables in `ds` that have a latitude dimension. It returns a new Dataset with the same variables, restricted to the requested range of latitudes (both endpoints included). Keep in mind that a slice-based selection may share memory with the original data rather than copy it, so modifying the result in place can also change `ds`. If you need independent data, call `.copy(deep=True)` on the selection first.
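The selection behaviour can be sketched as follows. The dataset `mini` and its coarse latitude grid are hypothetical, and the memory-sharing check is printed rather than assumed, since the outcome can depend on how the data are stored.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset on a coarse latitude grid.
latitudes = np.arange(0.0, 31.0, 5.0)  # 0, 5, ..., 30
dims = ["latitude", "longitude"]
mini = xr.Dataset(
    {
        "temperature": (dims, np.zeros((len(latitudes), 2))),
        "precipitation": (dims, np.ones((len(latitudes), 2))),
    },
    coords={"latitude": latitudes, "longitude": [0.0, 180.0]},
)

# Label-based slices include both endpoints and apply to every variable.
band = mini.sel(latitude=slice(10, 20))
print(band["latitude"].values.tolist())  # [10.0, 15.0, 20.0]
print(band["temperature"].shape)         # (3, 2)
print(band["precipitation"].shape)       # (3, 2)

# Check, rather than assume, whether the selection shares memory with `mini`.
print(np.shares_memory(band["temperature"].values, mini["temperature"].values))

# A deep copy is independent of the original, so it is safe to modify in place.
safe = band.copy(deep=True)
safe["temperature"].values[:] = 99.0
print(float(mini["temperature"].max()))  # 0.0
```

Working on a deep copy costs extra memory, but it removes any doubt about whether an in-place change will leak back into the original dataset.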