Geosciences LibreTexts

15.3: Groupby()


    The `.groupby()` method in pandas is a cornerstone of data aggregation and summarization. Think of it as the Swiss Army knife for slicing, dicing, and summarizing datasets. It lets you split your data into meaningful groups, apply a function to each group independently, and then combine the results, all in a few lines of code. This split-apply-combine pattern is a fundamental operation in data analytics, and it lets you extract summaries from raw data efficiently.
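To make the three steps concrete, here is a minimal sketch that performs the split, apply, and combine stages by hand and then repeats them with a single `.groupby()` call. The toy DataFrame and the names `df`, `pieces`, `manual`, and `auto` are made up for illustration and are not part of this section's dataset.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in this section
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
pieces = {key: sub for key, sub in df.groupby("group")}

# Apply: summarize each piece independently
sums = {key: sub["value"].sum() for key, sub in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name="value")

# The same three steps in a single call
auto = df.groupby("group")["value"].sum()

print(manual)
print(auto)
```

Both printed Series should hold the same per-group totals; the one-line version simply hides the intermediate steps.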

    Let's generate a DataFrame with some random data about 8th graders:
     
    import numpy as np
    import pandas as pd

    # Names used as the index (a fixed list)
    random_names = ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Grace', 'Helen', 'Isaac', 'Jill']

    # Gender label matching each name
    gender = ['Girl', 'Boy', 'Boy', 'Boy', 'Girl', 'Boy', 'Girl', 'Girl', 'Boy', 'Girl']

    # Generate random weights (120-150) and heights (40-60); the upper bound is exclusive
    weights = np.random.randint(120, 151, size=10)
    heights = np.random.randint(40, 61, size=10)

    # Generate random eye colors
    eye_colors = np.random.choice(['Blue', 'Brown'], size=10)

    # Create the DataFrame
    df_demo = pd.DataFrame({
        'Gender': gender,
        'Weight': weights,
        'Height': heights,
        'Eye Color': eye_colors
    }, index=random_names)

     
    It generates this DataFrame:

    [Image: the generated df_demo DataFrame]

    The "Gender" column specifies whether the individual is a boy or a girl, the "Weight" column contains random weights between 120 and 150 pounds, the "Height" column contains random heights between 40 and 60 inches, and the "Eye Color" column is randomly set to "Blue" or "Brown". Because the values are random, your numbers will differ from run to run.
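If you want the same table on every run, one option is to draw the random values from a seeded generator. The sketch below is an assumption-laden variant rather than the code used in this section: the helper name `make_demo` and the seed value 42 are hypothetical, and it uses NumPy's `default_rng` generator instead of the `np.random.randint` and `np.random.choice` calls shown above.

```python
import numpy as np
import pandas as pd

def make_demo(seed=42):
    """Build a DataFrame shaped like df_demo from a seeded generator.

    The function name and the default seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    names = ['Alice', 'Bob', 'Charlie', 'David', 'Emily',
             'Frank', 'Grace', 'Helen', 'Isaac', 'Jill']
    gender = ['Girl', 'Boy', 'Boy', 'Boy', 'Girl',
              'Boy', 'Girl', 'Girl', 'Boy', 'Girl']
    return pd.DataFrame({
        'Gender': gender,
        # integers() excludes the upper bound, so 151 and 61 give 120-150 and 40-60
        'Weight': rng.integers(120, 151, size=10),
        'Height': rng.integers(40, 61, size=10),
        'Eye Color': rng.choice(['Blue', 'Brown'], size=10),
    }, index=names)

# Two calls with the same seed produce identical tables
df_a = make_demo()
df_b = make_demo()
print(df_a.equals(df_b))
```

Changing the seed gives a different but equally repeatable table.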

    Now, let's explore how to use `.groupby()` to obtain some insights from this DataFrame. We'll start by grouping the data by gender and calculating the average weight and height for each group.

    grouped_by_gender = df_demo.groupby('Gender').mean(numeric_only=True)

    The line of code `grouped_by_gender = df_demo.groupby('Gender').mean(numeric_only=True)` performs two key operations on the `df_demo` DataFrame. First, it groups the DataFrame by the "Gender" column, separating the rows into subsets based on the unique gender labels ('Boy' and 'Girl'). This is the "splitting" phase of the `groupby` operation. Second, it calculates the mean of each numerical column within these gender-based groups. In our example, it computes the average "Weight" and "Height" for boys and for girls. The result is stored in a new DataFrame called `grouped_by_gender`, where the index consists of the unique gender labels and the columns contain the corresponding average weights and heights. This one-liner condenses the original DataFrame into a summary table, giving a quick snapshot of how the numerical attributes differ by gender:

    [Image: the grouped_by_gender DataFrame, with mean Weight and Height for each gender]

    Note that the "Eye Color" column is excluded from the mean calculation because it is non-numerical. Versions of pandas before 2.0 dropped such columns automatically. In pandas 2.0 and later, calling `.mean()` on this grouped DataFrame without `numeric_only=True` raises a `TypeError`, which is why the argument is passed explicitly here.
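Being explicit about which columns are averaged avoids depending on how a particular pandas version treats text columns. The sketch below shows two ways to do that on a small hand-written stand-in for `df_demo`; the values and the names `df`, `by_flag`, and `by_cols` are made up for illustration.

```python
import pandas as pd

# Small hand-written stand-in for df_demo (values are made up)
df = pd.DataFrame({
    "Gender": ["Girl", "Boy", "Boy", "Girl"],
    "Weight": [120, 130, 150, 140],
    "Height": [50, 52, 60, 58],
    "Eye Color": ["Blue", "Brown", "Brown", "Blue"],
})

# Option 1: ask pandas to skip non-numeric columns
by_flag = df.groupby("Gender").mean(numeric_only=True)

# Option 2: name the columns to aggregate before calling mean()
by_cols = df.groupby("Gender")[["Weight", "Height"]].mean()

print(by_flag)
print(by_cols)
```

Both options give a table with one row per gender and only the "Weight" and "Height" columns. Selecting columns by name has the side benefit of documenting exactly what is being summarized.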

    We can, of course, group by eye color:

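    # Group the df_demo DataFrame by 'Eye Color' and calculate the mean of the numerical columns
    # (numeric_only=True skips the text column 'Gender', which pandas 2.0+ would otherwise reject)
    grouped_by_eye_color = df_demo.groupby('Eye Color').mean(numeric_only=True)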

    This produces the DataFrame:

    [Image: the grouped_by_eye_color DataFrame, with mean Weight and Height for each eye color]

    This example illustrates the power of `.groupby()` for quickly summarizing data by category. With a single line of code for each grouping, we obtained summaries of how weight and height differ by gender and by eye color in this dataset. This kind of summarization is useful in a wide range of applications, from scientific research to business analytics.
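The two groupings can also be combined. As a hedged sketch, the code below groups by both columns at once and computes several statistics in one `.agg()` call. The data are a hand-written stand-in, and the output column names `n`, `mean_weight`, and `max_height` are illustrative choices, not names used elsewhere in this section.

```python
import pandas as pd

# Hand-written stand-in for df_demo (values are made up)
df = pd.DataFrame({
    "Gender": ["Girl", "Boy", "Boy", "Girl", "Boy", "Girl"],
    "Weight": [120, 130, 150, 140, 125, 135],
    "Height": [50, 52, 60, 58, 48, 54],
    "Eye Color": ["Blue", "Brown", "Brown", "Blue", "Blue", "Brown"],
})

# Group by two keys; each output column is (source column, function)
summary = df.groupby(["Gender", "Eye Color"]).agg(
    n=("Weight", "count"),
    mean_weight=("Weight", "mean"),
    max_height=("Height", "max"),
)

print(summary)
```

The result is indexed by every (Gender, Eye Color) pair that occurs in the data, with one column per requested statistic.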


    15.3: Groupby() is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
