4. Cruise data analysis in Python#

4.1. WCOA 2013 data set#

To download the data used in this tutorial, use the following command in the Terminal (Mac) or Git Bash (Windows).

git clone https://github.com/mlmldata24/wcoa_cruise.git

The data comes from the West Coast Ocean Acidification (WCOA) cruise in 2013. The goal of this NOAA-supported research cruise is to collect data to help understand the effects of coastal upwelling on ocean acidification, and the impacts of ocean acidification on organisms and ecosystems. This video gives an idea of life aboard the ship and the type of science operations conducted.

In this part of the tutorial, we will go over the basics of working with dates in Pandas and Numpy, make some exploratory plots and start a regression analysis. The data exploration will be largely guided by student interest.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats

Introduction to Pandas dataframes#

We use Pandas to import the csv data file.

Here, there is an optional parse_dates argument. The numbers in double brackets [[8,9]] indicate which columns to interpret as dates.

filename = 'data/wcoa_cruise/WCOA2013_hy1.csv'
df = pd.read_csv(filename,header=31,na_values=-999,
                 parse_dates=[[8,9]])
df.head()
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA
0 2013-08-05 02:12:20 317W20130803 WCOA2013 1 2 11 1 1 2 48.2 ... 2370.2 2 2369.0 2 7.294 2 25.0 NaN 9 NaN
1 2013-08-05 02:12:53 317W20130803 WCOA2013 1 2 11 1 2 2 48.2 ... NaN 9 NaN 9 7.295 2 25.0 NaN 9 NaN
2 2013-08-05 02:19:58 317W20130803 WCOA2013 1 2 11 1 3 2 48.2 ... 2349.6 2 2343.7 2 7.282 2 25.0 43.521 3 NaN
3 2013-08-05 02:27:01 317W20130803 WCOA2013 1 2 11 1 4 2 48.2 ... 2318.7 2 2311.9 2 7.287 2 25.0 45.641 2 NaN
4 2013-08-05 02:30:53 317W20130803 WCOA2013 1 2 11 1 5 2 48.2 ... 2300.0 2 2299.7 2 7.308 2 25.0 47.741 2 NaN

5 rows × 42 columns

df.columns
Index(['DATE_TIME', 'EXPOCODE', 'SECT_ID', 'LEG', 'LINE', 'STNNBR', 'CASTNO',
       'BTLNBR', 'BTLNBR_FLAG_W', 'LATITUDE', 'LONGITUDE', 'DEPTH', 'CTDPRS',
       'CTDTMP', 'CTDSAL', 'CTDSAL_FLAG_W', 'CTDOXY', 'CTDOXY_FLAG_W',
       'SALNTY', 'SALNTY_FLAG_W', 'OXYGEN', 'OXYGEN_FLAG_W', 'SILCAT',
       'SILCAT_FLAG_W', 'NITRAT', 'NITRAT_FLAG_W', 'NITRIT', 'NITRIT_FLAG_W',
       'PHSPHT', 'PHSPHT_FLAG_W', 'AMMONI', 'AMMONI_FLAG_W', 'TCARBN',
       'TCARBN_FLAG_W', 'ALKALI', 'ALKALI_FLAG_W', 'PH_TOT', 'PH_TOT_FLAG_W',
       'PH_TMP', 'CO32', 'CO32__FLAG_W', 'CHLORA'],
      dtype='object')

Instead of strings, the dates are now in a special datetime64 format. This means that, instead of treating the dates in the same way as any other collection of characters, pandas and NumPy can understand how this variable represents time.

df['DATE_TIME'].head()
0   2013-08-05 02:12:20
1   2013-08-05 02:12:53
2   2013-08-05 02:19:58
3   2013-08-05 02:27:01
4   2013-08-05 02:30:53
Name: DATE_TIME, dtype: datetime64[ns]

For example, subtracting datetime64 objects with pandas gives a Timedelta object, which is specifically used to represent differences between times. The first two samples in the cruise data are separated by 33 seconds (the time between firing of bottles).

df['DATE_TIME'][1]-df['DATE_TIME'][0]
Timedelta('0 days 00:00:33')
pd.unique(df['LATITUDE'])
array([48.2 , 48.3 , 48.37, 48.44, 48.5 , 48.53, 48.61, 48.66, 48.71,
       48.78, 48.81, 48.84, 47.97, 48.14, 47.96, 47.68, 47.13, 47.11,
       47.12, 47.34, 46.13, 46.17, 46.19, 46.25, 46.24, 46.12, 44.65,
       44.66, 44.2 , 41.99, 41.97, 41.96, 41.94, 41.9 , 40.25, 40.23,
       40.22, 40.21, 40.1 , 37.67, 37.94, 37.91, 37.87, 37.76, 37.75,
       36.8 , 36.78, 36.76, 36.73, 36.71, 36.69, 36.52, 36.7 ])

Exercise#

Create a list of unique station ID’s (“STNNBR”) found in the survey data. Call it stns. How many unique stations are there in the data?

Summary statistics#

A summary of the dataframe is given by the .describe() method.

df['CTDTMP'].describe()
count    969.000000
mean       8.993954
std        3.055917
min        1.738200
25%        7.292000
50%        8.328200
75%       10.752200
max       20.747400
Name: CTDTMP, dtype: float64

These summary statistics can also be accessed individually with similar syntax.

df['CTDTMP'].mean()
8.993954179566563
df['CTDTMP'].min()
1.7382

Alternate method using Numpy functions.

np.min(df['CTDTMP'])
1.7382

Mathematical operations#

Converting Celcius to Fahrenheit

df['CTDTMP_F'] = 9/5*df['CTDTMP'] + 32 
df['CTDTMP_F'].head()
0    38.63894
1    38.64236
2    39.85916
3    41.05328
4    41.82404
Name: CTDTMP_F, dtype: float64
df['CTDTMP'].head()
0    3.6883
1    3.6902
2    4.3662
3    5.0296
4    5.4578
Name: CTDTMP, dtype: float64

Plotting#

Plot latitude as a function of time.

plt.figure()
plt.plot(df['DATE_TIME'],df['LATITUDE'],'-o')
plt.xticks(rotation=15)
(array([15922., 15926., 15930., 15934., 15938., 15942., 15946.]),
 [Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, '')])
_images/1-04-cruise-data-analysis_28_1.png

The pyplot library automatically understands datetime64 objects so it is easy to see how the ship moved between stations from north to south as weeks passed.

plt.figure()
plt.plot(df['LONGITUDE'], df['LATITUDE'], 'ro')
[<matplotlib.lines.Line2D at 0x166dcb190>]
_images/1-04-cruise-data-analysis_30_1.png

The scatter() function allows points to be colored according to the value of a variable. In the case of dates, later dates are shown as warmer colors.

plt.figure()
plt.scatter(df['LONGITUDE'],df['LATITUDE'],c=df['DATE_TIME'])
<matplotlib.collections.PathCollection at 0x166e42440>
_images/1-04-cruise-data-analysis_32_1.png

Note that the vertical coordinate is pressure (not depth, which indicates the bottom depth rather than the depth of the sample). To plot dissolved oxygen with depth:

plt.figure()
plt.plot(df['OXYGEN'],df['CTDPRS'],'.')
plt.gca().invert_yaxis()

plt.xlabel('$O_2 [\mu M]$')
plt.ylabel('pressure[dbar]')
Text(0, 0.5, 'pressure[dbar]')
_images/1-04-cruise-data-analysis_34_1.png
plt.figure()
df['CTDTMP'].hist()
<AxesSubplot:>
_images/1-04-cruise-data-analysis_35_1.png
plt.figure()
df['CTDTMP'].hist(bins=50)
<AxesSubplot:>
_images/1-04-cruise-data-analysis_36_1.png
df.keys()
Index(['DATE_TIME', 'EXPOCODE', 'SECT_ID', 'LEG', 'LINE', 'STNNBR', 'CASTNO',
       'BTLNBR', 'BTLNBR_FLAG_W', 'LATITUDE', 'LONGITUDE', 'DEPTH', 'CTDPRS',
       'CTDTMP', 'CTDSAL', 'CTDSAL_FLAG_W', 'CTDOXY', 'CTDOXY_FLAG_W',
       'SALNTY', 'SALNTY_FLAG_W', 'OXYGEN', 'OXYGEN_FLAG_W', 'SILCAT',
       'SILCAT_FLAG_W', 'NITRAT', 'NITRAT_FLAG_W', 'NITRIT', 'NITRIT_FLAG_W',
       'PHSPHT', 'PHSPHT_FLAG_W', 'AMMONI', 'AMMONI_FLAG_W', 'TCARBN',
       'TCARBN_FLAG_W', 'ALKALI', 'ALKALI_FLAG_W', 'PH_TOT', 'PH_TOT_FLAG_W',
       'PH_TMP', 'CO32', 'CO32__FLAG_W', 'CHLORA', 'CTDTMP_F'],
      dtype='object')

Exercises#

  • What scientific questions can be addressed with this data set?

  • What relationships might occur between different variables?

  • What differences might occur within the same variables, but at different locations or times?

  • Create exploratory plots (one PDF, one scatter plot)

4.2. Slicing and subsetting data#

df[0:3]
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA CTDTMP_F
0 2013-08-05 02:12:20 317W20130803 WCOA2013 1 2 11 1 1 2 48.2 ... 2 2369.0 2 7.294 2 25.0 NaN 9 NaN 38.63894
1 2013-08-05 02:12:53 317W20130803 WCOA2013 1 2 11 1 2 2 48.2 ... 9 NaN 9 7.295 2 25.0 NaN 9 NaN 38.64236
2 2013-08-05 02:19:58 317W20130803 WCOA2013 1 2 11 1 3 2 48.2 ... 2 2343.7 2 7.282 2 25.0 43.521 3 NaN 39.85916

3 rows × 43 columns

df.iloc[0:3]
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA CTDTMP_F
0 2013-08-05 02:12:20 317W20130803 WCOA2013 1 2 11 1 1 2 48.2 ... 2 2369.0 2 7.294 2 25.0 NaN 9 NaN 38.63894
1 2013-08-05 02:12:53 317W20130803 WCOA2013 1 2 11 1 2 2 48.2 ... 9 NaN 9 7.295 2 25.0 NaN 9 NaN 38.64236
2 2013-08-05 02:19:58 317W20130803 WCOA2013 1 2 11 1 3 2 48.2 ... 2 2343.7 2 7.282 2 25.0 43.521 3 NaN 39.85916

3 rows × 43 columns

df.loc[0:3]
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA CTDTMP_F
0 2013-08-05 02:12:20 317W20130803 WCOA2013 1 2 11 1 1 2 48.2 ... 2 2369.0 2 7.294 2 25.0 NaN 9 NaN 38.63894
1 2013-08-05 02:12:53 317W20130803 WCOA2013 1 2 11 1 2 2 48.2 ... 9 NaN 9 7.295 2 25.0 NaN 9 NaN 38.64236
2 2013-08-05 02:19:58 317W20130803 WCOA2013 1 2 11 1 3 2 48.2 ... 2 2343.7 2 7.282 2 25.0 43.521 3 NaN 39.85916
3 2013-08-05 02:27:01 317W20130803 WCOA2013 1 2 11 1 4 2 48.2 ... 2 2311.9 2 7.287 2 25.0 45.641 2 NaN 41.05328

4 rows × 43 columns

df.loc[0:3,['CTDTMP','CTDPRS']]
CTDTMP CTDPRS
0 3.6883 999.5
1 3.6902 1000.8
2 4.3662 749.0
3 5.0296 503.9

Exercise#

What do you expect to happen when you execute:

df[0:1]
df[:4]
df[:-1]

What do you expect to happen when you call:

df.iloc[0:4, 1:4]
df.loc[0:4, 1:4]

How are the two commands different?

Adapted from: https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset/index.html

Subsetting Data using Criteria#

df[df.LATITUDE > 40].head()
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA CTDTMP_F
0 2013-08-05 02:12:20 317W20130803 WCOA2013 1 2 11 1 1 2 48.2 ... 2 2369.0 2 7.294 2 25.0 NaN 9 NaN 38.63894
1 2013-08-05 02:12:53 317W20130803 WCOA2013 1 2 11 1 2 2 48.2 ... 9 NaN 9 7.295 2 25.0 NaN 9 NaN 38.64236
2 2013-08-05 02:19:58 317W20130803 WCOA2013 1 2 11 1 3 2 48.2 ... 2 2343.7 2 7.282 2 25.0 43.521 3 NaN 39.85916
3 2013-08-05 02:27:01 317W20130803 WCOA2013 1 2 11 1 4 2 48.2 ... 2 2311.9 2 7.287 2 25.0 45.641 2 NaN 41.05328
4 2013-08-05 02:30:53 317W20130803 WCOA2013 1 2 11 1 5 2 48.2 ... 2 2299.7 2 7.308 2 25.0 47.741 2 NaN 41.82404

5 rows × 43 columns

df[df.LATITUDE <= 40].head()
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA CTDTMP_F
770 2013-08-25 21:23:35 32P020130821 WCOA2013 2 10 133 1 1 2 37.67 ... 6 2430.4 6 7.493 2 25.0 73.346 2 NaN 35.12876
771 2013-08-25 21:32:13 32P020130821 WCOA2013 2 10 133 1 2 2 37.67 ... 2 2427.6 2 7.467 2 25.0 69.812 2 NaN 35.30336
772 2013-08-25 21:40:44 32P020130821 WCOA2013 2 10 133 1 3 2 37.67 ... 2 2424.3 2 7.442 2 25.0 65.502 2 NaN 35.55914
773 2013-08-25 21:49:45 32P020130821 WCOA2013 2 10 133 1 4 2 37.67 ... 2 2413.7 2 7.400 2 25.0 60.170 2 NaN 36.10526
774 2013-08-25 21:57:55 32P020130821 WCOA2013 2 10 133 1 5 2 37.67 ... 2 2401.4 2 7.374 2 25.0 57.739 2 NaN 36.92894

5 rows × 43 columns

dfsub = df[(df.CTDPRS <= 10) & (df.LATITUDE > 40)]
dfsub.head()
DATE_TIME EXPOCODE SECT_ID LEG LINE STNNBR CASTNO BTLNBR BTLNBR_FLAG_W LATITUDE ... TCARBN_FLAG_W ALKALI ALKALI_FLAG_W PH_TOT PH_TOT_FLAG_W PH_TMP CO32 CO32__FLAG_W CHLORA CTDTMP_F
18 2013-08-05 03:00:52 317W20130803 WCOA2013 1 2 11 1 19 4 48.20 ... 9 NaN 9 NaN 9 NaN NaN 9 NaN 56.54660
19 2013-08-05 03:01:10 317W20130803 WCOA2013 1 2 11 1 20 2 48.20 ... 2 2189.1 6 7.983 3 25.0 155.043 2 NaN 56.54570
38 2013-08-05 06:37:22 317W20130803 WCOA2013 1 2 12 1 19 2 48.30 ... 9 2180.0 6 7.980 3 25.0 152.868 2 NaN 57.72218
39 2013-08-05 06:37:42 317W20130803 WCOA2013 1 2 12 1 20 2 48.30 ... 2 NaN 9 7.981 2 25.0 NaN 5 NaN 57.72290
58 2013-08-05 10:41:19 317W20130803 WCOA2013 1 2 13 1 19 2 48.37 ... 2 2178.7 2 7.931 2 25.0 142.629 2 NaN 55.69844

5 rows × 43 columns

Exercises#

  • Select a subset of rows in the df DataFrame that contains data from a pressure range between 500 and 1000 dbar. How many rows did you end up with? What did your neighbor get?

  • You can use the isin command in Python to query a DataFrame based upon a list of values as follows:

df[df['STNNBR'].isin([listGoesHere])]

Use the isin function to find all samples from station numbers 11 and 12.

Exercises#

pH values are contained in the df['PH_TOT'] DataArray

Quality control flags for pH are contained in the df['PH_TOT_FLAG_W'] DataArray.

World Ocean Circulation Experiment (WOCE) quality control flags are used:

  • 2 = good value

  • 3 = questionable value

  • 4 = bad value

  • 5 = value not reported

  • 6 = mean of replicate measurements

  • 9 = sample not drawn.

    1. Create a new DataFrame called dfsub1 that includes only data where pressure is less than 10 dbar and the latitude is farher north than \(40^{\circ} \text{N}\).

dfsub = df[(df['CTDPRS'] > 10) & (df['LATITUDE'] > 40)]
  1. Create a new DataFrame called dfsub2 that excludes all bad, questionable and missing pH and CTD oxygen values. Plot oxygen vs. pH.

dfsub2 = df[(df['CTDOXY_FLAG_W'] == 2) & (df['PH_TOT_FLAG_W'] == 2)]

Fit a linear model in Python#

plt.plot(df['CTDOXY'],df['PH_TOT'],'.')
plt.xlabel('dissolved oxygen [$\mu$mol/kg]')
plt.ylabel('pH$_{Tot}$')
Text(0, 0.5, 'pH$_{Tot}$')
_images/1-04-cruise-data-analysis_56_1.png
result = stats.linregress(dfsub2['CTDOXY'],dfsub2['PH_TOT'])
result
LinregressResult(slope=0.002305076546103237, intercept=7.202036145940105, rvalue=0.9510631019969654, pvalue=0.0, stderr=2.7737424376777354e-05, intercept_stderr=0.004977802641597415)

Exercise#

Plot the linear model with the data.