Analysis of Lake Water Quality Measurements#

This notebook parses and visualizes water quality data collected by past Environmental Science classes at Lakeshore High School.

Parse Data#

We start by parsing the data into Python.

# We will use the Pandas library to read the data from the Google Sheet
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/1Y_7_AertX2dd86mXmVtc5ju-sbccSAGZYByXZrDT9sw/export?gid=0&format=csv'
data = pd.read_csv(url)
# Next, let's print the first 20 rows of the data
data.head(20)
Year Lake Class Hour Temperature (°C) Conductivity (μS) pH Alkalinity (ppm) Dissolved Oxygen (ppm) Copper (mg/L) Iron (mg/L) Total Hardness (ppm) Nitrate (mg/L) Nitrite (mg/L) Total Chlorine (ppm) Phosphate (ppm) Turbidity (NTUs)
0 2020 Lake Michigan 2nd 19.5 0.25 6.5 80 4.0 0.10 0 120 0 0 0.1 0.1 NaN
1 2021 Lake Michigan 2nd 16.9 0.32 7.0 120 8.5 0.10 0 180 0 0 0.0 0.2 NaN
2 2021 Lake Michigan 3rd 18.0 0.32 7.0 120 8.5 0.10 0 120 0 0 0.1 0.2 NaN
3 2022 Lake Michigan 3rd 22.8 0.00 7.5 120 8.5 0.00 0 120 0 0 0.1 NaN 13.0
4 2022 Lake Michigan 6th 23.9 0.33 7.5 120 8.5 0.00 0 50 0 0 0.0 NaN 10.0
5 2023 Lake Michigan 2nd 21.4 0.30 7.0 240 7.0 0.00 0 120 0 0 0.3 0.0 6.0
6 2023 Lake Michigan 6th 22.7 0.27 7.5 240 10.0 0.00 0 120 0 0 0.1 0.0 6.0
7 2020 North Lake 2nd 17.9 0.50 6.5 120 6.0 0.10 0 120 0 0 0.0 0.1 NaN
8 2021 North Lake 2nd 17.6 0.46 8.0 180 10.0 0.10 0 250 0 0 0.1 0.2 NaN
9 2021 North Lake 3rd 19.0 0.45 7.5 180 10.0 0.10 0 180 0 0 0.1 0.2 NaN
10 2022 North Lake 3rd 21.0 0.40 7.5 180 10.0 0.05 0 180 0 0 0.0 NaN 24.0
11 2022 North Lake 6th 22.2 0.61 7.5 180 10.0 0.00 0 120 0 0 0.0 NaN 21.0
12 2023 North Lake 2nd 17.4 0.55 7.0 240 5.0 0.00 0 250 0 0 0.1 0.0 8.0
13 2023 North Lake 6th 20.8 0.51 7.5 120 4.5 0.00 0 120 0 0 0.1 0.0 8.0

Visualize Data#

Next, we will plot the data using different symbols for the two lakes. We will make one plot per quantity.

# We will use the Matplotlib library to plot the data
import matplotlib.pyplot as plt

# The following code sets default font sizes and styles for the plots
# Modified from https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot
SMALL_SIZE = 14
MEDIUM_SIZE = 16
BIGGER_SIZE = 18

plt.rc('font', size=SMALL_SIZE)  # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)  # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)  # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)  # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)  # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title
plt.rc('lines', linewidth=3)
# Define the two lakes we are interested in
lakes = ['Lake Michigan', 'North Lake']

symbols = ['o', 's']
colors = ['blue', 'red']

# Get the names of the columns we want to plot
columns = data.columns[3:]

# Loop over the columns
for i, c in enumerate(columns):

    # Loop over the lakes
    for j, l in enumerate(lakes):
        # Get the data for the current lake
        df = data[data['Lake'] == l]

        # Plot the data as a scatter plot
        plt.scatter(df['Year'], df[c], label=l, marker=symbols[j], color=colors[j])

    # Add an extra year to the x-axis to allow students to plot their new data
    years = list(df['Year'].unique())
    years.append(max(years) + 1)
    plt.xticks(years)

    # Add labels and legend
    plt.xlabel('Year')
    plt.ylabel(c)
    plt.legend(loc='lower center', bbox_to_anchor=(0.5, 1.0), ncol=2)
    plt.grid()
    plt.show()
_images/d6e635118d5969f12622e2caf38cefb6692199e44b55f30e667fe9ba63d11a41.png _images/c1ce9c2fbaea75339a5423e1f4f037f1ec032c1b7cf3cdd48b545ce3c32f1aec.png _images/45be9097183a1564e7bc0ba38792d1b935edf043a0197326f6c3e38bcb70c204.png _images/8bbfd7987b08ce935532edba570dc0dc81e988b54b387e721e8ca46d133c5189.png _images/ede8a76b5e5c2452a8a3614c22b889e16b6cc00d5c2837b0a4981db7a25231b6.png _images/2111ccdc51683f8dbcde0b9d3a4b0748c4dfbd9bc21c2c6ab94367f9bb6c90f5.png _images/66be3cbba4a732a08328d100198f175ab69fc5a0895c040e71638f6828e4054e.png _images/8c8882a695b9399c01710fe64657ab3e3906e1c6c5a65c268b79f20da0b19d11.png _images/aee4cbd7f47b937f4ff1dd977f51e0cdc303224e7b3a0bd210fa8c650fa48840.png _images/7876f219e6e2449769b7e72a49944b76be95d1d244cad26e4cf3f1e08dd66706.png _images/0dd36aee330ba03882d3250ff8c6d654fdd9b1741ffc67cd5dfbf28eb353df45.png _images/1a523818da60ae1aad48af2c403bd56bb8e1d4f565eae8e6208edb1c2bf55818.png _images/4ff856bc63df652847be6b05f6225efe60c043464c9cccc66a18e945111ed2b6.png

Linear Relationships#

Next, we will look for linear relationships in the data:

  • Positive correlation between conductivity and disolved oxygen

  • Possible positive correlation between conductivity and alkalinity

  • Visualize these (possible) correlations with linear regression

First we will start by plotting the data. You always want to start any data analysis with visual inspections.

def scatter_plot(x, y, seperate_lakes=False):
    """ Create a scatter plot for two columns in the dataset

    Arguments:
        x: the name of the column to plot on the x-axis
        y: the name of the column to plot on the y-axis
    """

    def finalize_plot():
        plt.xlabel(x)
        plt.ylabel(y)
        plt.grid()
        plt.legend(loc='lower center', bbox_to_anchor=(0.5, 1.0), ncol=2)
        plt.show()

    for j,l in enumerate(lakes):
        df = data[data['Lake'] == l]
        plt.scatter(df[x], df[y], label=l, marker=symbols[j], color=colors[j])
        if seperate_lakes:
            finalize_plot()

    if not seperate_lakes:
        finalize_plot()

Dissolved Oxygen and Conductivity#

scatter_plot('Dissolved Oxygen (ppm)', 'Conductivity (μS)')
_images/88137ea6ff68b8ccf9a85994f099b1985aba84e3b9157196a8f82e7ebbfc530a.png

We can also make seperate plots for each lake.

scatter_plot('Dissolved Oxygen (ppm)', 'Conductivity (μS)', seperate_lakes=True)
_images/102868ee61964818ab17326895534faef38b31725ac6a4efa51462574abb28c2.png _images/72af589de0e976e33270df2ab10a79e9cb31990c4b428553db355799c0e400dd.png

Alkalinity and Conductivity#

scatter_plot('Alkalinity (ppm)', 'Conductivity (μS)')
_images/f1079fdf411d24c596cd82749241aabe03f7bf2616cedad1f064f424bf088dd3.png
scatter_plot('Alkalinity (ppm)', 'Conductivity (μS)', seperate_lakes=True)
_images/d4462d1d4dda7260d9e0999ee8af29040f9ec81e8cc5ec62047835460aecf9a0.png _images/f35635c19fd85ce3e0fedc2bfcd619fdc37e72fe13a713c6c591d00402c50a14.png

Alkalinity and Total Hardness#

scatter_plot('Alkalinity (ppm)', 'Total Hardness (ppm)')
_images/6c925d0cae8c9ece37c17aaa71a0344d8521feab0adafe2b08ff12f86d483127.png
scatter_plot('Alkalinity (ppm)', 'Total Hardness (ppm)', seperate_lakes=True)
_images/3643a04ab1d6c79e9ac0880ce1a922ecc25da9c1935df37092450ceebdd82837.png _images/ee6d245534225a8d870d4e2c0f23fd441917d400ae11d8d3954d446c1022c659.png

Linear Regression#

Next, we will calculate the best fit line and correlation coefficient.

# We will use scipy.stats to calculate the linear regression
import scipy.stats as stats

def scatter_plot_with_linear_regression(x, y):
    """ Create a scatter plot with linear regression
    for two columns in the dataset

    Arguments:
        x: the name of the column to plot on the x-axis
        y: the name of the column to plot on the y-axis
    """

    # Select the data for the x and y columns
    x_data = data[x].values
    y_data = data[y].values

    # Perform linear regression
    b1, b0, r_value, p_value, std_err = stats.linregress(x_data, y_data)

    print("slope =", round(b1,3), x, "/", y)
    print("intercept =", round(b0,2), y)
    print("r =", round(r_value,2))
    
    # Plot the raw data
    for j,l in enumerate(lakes):
        df = data[data['Lake'] == l]
        plt.scatter(df[x], df[y], label=l, marker=symbols[j], color=colors[j])

    # Plot the linear regression line
    plt.plot(x_data, b0 + b1 * x_data, color='black', label='Best Fit Line')

    # Plot the mean values of x and y
    # By definition, the regression line passes through the mean values of x and y
    x_mean = x_data.mean()
    y_mean = y_data.mean()
    plt.plot(x_mean, y_mean, marker='d', color='black', linestyle='', label='Mean (Average) Values')

    plt.xlabel(x)
    plt.ylabel(y)
    plt.grid()
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.title("Both Lakes")
    plt.show()

def scatter_plot_with_linear_regression_by_lake(x, y):
    """ Create a scatter plot with linear regression
    for two columns in the dataset by lake

    Arguments:
        x: the name of the column to plot on the x-axis
        y: the name of the column to plot on the y-axis
    """

    # Loop over the lakes
    for j,l in enumerate(lakes):
        # Get the data for the current lake
        df = data[data['Lake'] == l]

        # Select the data for the x and y columns
        x_data = df[x].values
        y_data = df[y].values

        # Plot the data
        plt.scatter(x_data, y_data, label=l, marker=symbols[j], color=colors[j])

        # Perform linear regression
        b1, b0, r_value, p_value, std_err = stats.linregress(x_data, y_data)

        print(" *** ", l, " ***")
        print("slope =", round(b1,3), x, "/", y)
        print("intercept =", round(b0,2), y)
        print("r =", round(r_value,2))
        print(" ")
    
        # Plot the linear regression line
        plt.plot(x_data, b0 + b1 * x_data, color='black', label='Best Fit Line')

        # Plot the mean values of x and y
        # By definition, the regression line passes through the mean values of x and y
        x_mean = x_data.mean()
        y_mean = y_data.mean()
        plt.plot(x_mean, y_mean, marker='d', color='black', linestyle='', label='Mean (Average) Values')

        plt.xlabel(x)
        plt.ylabel(y)
        plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
        plt.grid()
        plt.title(l)
        plt.show()

Dissolved Oxygen and Conductivity#

scatter_plot_with_linear_regression('Dissolved Oxygen (ppm)', 'Conductivity (μS)')
slope = -0.005 Dissolved Oxygen (ppm) / Conductivity (μS)
intercept = 0.41 Conductivity (μS)
r = -0.07
_images/4004eb3ded08a241610025bbcc41553a7300878e15c0eba6e354f3543e536b6d.png
scatter_plot_with_linear_regression_by_lake('Dissolved Oxygen (ppm)', 'Conductivity (μS)')
 ***  Lake Michigan  ***
slope = -0.001 Dissolved Oxygen (ppm) / Conductivity (μS)
intercept = 0.26 Conductivity (μS)
r = -0.01
 
_images/2126ff7a7712b1c4410538fc288717125970a9cc16361afa4c1bd80046728bce.png
 ***  North Lake  ***
slope = -0.008 Dissolved Oxygen (ppm) / Conductivity (μS)
intercept = 0.56 Conductivity (μS)
r = -0.32
 
_images/e2f2006f2abbeee82f577adc47b8be5ab594a0144decad15f4c5ffb9f5bcd0a8.png

Alkalinity and Conductivity#

scatter_plot_with_linear_regression('Alkalinity (ppm)', 'Conductivity (μS)')
slope = 0.001 Alkalinity (ppm) / Conductivity (μS)
intercept = 0.25 Conductivity (μS)
r = 0.27
_images/c1677082d0e05dcc3e9b2ec2acbaf39a6d5a8deaadd63abd4a8f779f5d5344f4.png
scatter_plot_with_linear_regression_by_lake('Alkalinity (ppm)', 'Conductivity (μS)')
 ***  Lake Michigan  ***
slope = 0.0 Alkalinity (ppm) / Conductivity (μS)
intercept = 0.21 Conductivity (μS)
r = 0.16
 
_images/def07c53520f43f7cb2a0183ac3dbd6090b20cc922f351ec30fa61476cc3c3fd.png
 ***  North Lake  ***
slope = 0.0 Alkalinity (ppm) / Conductivity (μS)
intercept = 0.46 Conductivity (μS)
r = 0.13
 
_images/35f0a97b1c6888c72a0ddfc854f3662e275e33305b03f6f2aff3ffb4fafbdf86.png

Alkalinity and Total Hardness#

scatter_plot_with_linear_regression('Alkalinity (ppm)', 'Total Hardness (ppm)')
slope = 0.429 Alkalinity (ppm) / Total Hardness (ppm)
intercept = 77.73 Total Hardness (ppm)
r = 0.41
_images/e521436d61219336e82f617de3b3eca7aa894a20a54165fdd73eef24b7943fb5.png
scatter_plot_with_linear_regression_by_lake('Alkalinity (ppm)', 'Total Hardness (ppm)')
 ***  Lake Michigan  ***
slope = 0.012 Alkalinity (ppm) / Total Hardness (ppm)
intercept = 116.85 Total Hardness (ppm)
r = 0.02
 
_images/01f61e11d36fe880fa4ffb1ce018ed7e34d2b617dace3e963efcd09d254ec53b.png
 ***  North Lake  ***
slope = 1.075 Alkalinity (ppm) / Total Hardness (ppm)
intercept = -10.0 Total Hardness (ppm)
r = 0.76
 
_images/27c9397160b917348854aefb7218224cc1f9fac3f9d6a8ae3eebb625ab52fa60.png