5. Implementing linear regression in Python and matrix review#
As with many data analysis problems, there are a number of different ways to perform a linear regression in Python. This notebook shows a few different methods. The final method motivates a review of matrix multiplication, which will be helpful in the next section on multivariate regression.
5.1. Example: 2007 West Coast Ocean Acidification Cruise#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.formula.api as smf
import pingouin as pg
import PyCO2SYS as pyco2
5.2. Load data#
Load 2007 data#
# skip the metadata rows above the header (row 29), treat -999 as missing,
# and combine the date and time columns (6 and 7) into a single DATE_TIME column
filename07 = 'data/wcoa_cruise_2007/32WC20070511.exc.csv'
df07 = pd.read_csv(filename07, header=29, na_values=-999, parse_dates=[[6,7]])
df07.keys()
Index(['DATE_TIME', 'EXPOCODE', 'SECT_ID', 'SAMPNO', 'LINE', 'STNNBR',
'CASTNO', 'LATITUDE', 'LONGITUDE', 'BOT_DEPTH', 'BTLNBR',
'BTLNBR_FLAG_W', 'CTDPRS', 'CTDTMP', 'CTDSAL', 'CTDSAL_FLAG_W',
'SALNTY', 'SALNTY_FLAG_W', 'CTDOXY', 'CTDOXY_FLAG_W', 'OXYGEN',
'OXYGEN_FLAG_W', 'SILCAT', 'SILCAT_FLAG_W', 'NITRAT', 'NITRAT_FLAG_W',
'NITRIT', 'NITRIT_FLAG_W', 'PHSPHT', 'PHSPHT_FLAG_W', 'TCARBN',
'TCARBN_FLAG_W', 'ALKALI', 'ALKALI_FLAG_W'],
dtype='object')
5.3. Linear regression: five methods in Python#
Create \(x\) and \(y\) variables.
x = df07['PHSPHT']
y = df07['NITRAT']
Plot data.
plt.figure()
plt.plot(x,y,'.')
plt.xlabel('phosphate')
plt.ylabel('nitrate')
Text(0, 0.5, 'nitrate')
Create a subset where both variables have finite values.
ii = np.isfinite(x+y)
ii
0 True
1 True
2 True
3 True
4 True
...
2343 True
2344 True
2345 True
2346 True
2347 True
Length: 2348, dtype: bool
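As a quick check (this count is an addition for context), we can see how many rows survive the finite-value mask; it should match the 2210 observations reported by the fits below.
# number of rows where both phosphate and nitrate are finite
print(ii.sum())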
Method 1: Scipy#
result = stats.linregress(x[ii],y[ii])
result
LinregressResult(slope=14.740034517902119, intercept=-3.9325720551998167, rvalue=0.9860645445968044, pvalue=0.0, stderr=0.052923783569700955, intercept_stderr=0.10258209230911229)
result.slope
14.740034517902119
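The result object also contains enough information to compute \(R^2\) and an approximate 95% confidence interval for the slope. A minimal sketch (the helper names n and tcrit are our own):
# coefficient of determination from the correlation coefficient
r_squared = result.rvalue**2
# approximate 95% confidence interval for the slope:
# slope +/- critical t value * standard error, with N-2 degrees of freedom
n = ii.sum()
tcrit = stats.t.ppf(0.975, df=n-2)
slope_ci = (result.slope - tcrit*result.stderr,
            result.slope + tcrit*result.stderr)
print(r_squared)
print(slope_ci)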
Exercise: Draw the regression line with the data
plt.figure()
plt.plot(x,y,'k.')
plt.plot(x,result.slope*x+result.intercept,'r-')
[<matplotlib.lines.Line2D at 0x14a1f0e90>]
Method 2: statsmodels#
Ordinary least squares fit using statsmodels.
smres = smf.ols('NITRAT ~ PHSPHT',df07).fit()
smres.summary()
| Dep. Variable: | NITRAT | R-squared: | 0.972 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.972 |
| Method: | Least Squares | F-statistic: | 7.757e+04 |
| Date: | Wed, 28 Feb 2024 | Prob (F-statistic): | 0.00 |
| Time: | 12:05:54 | Log-Likelihood: | -4993.9 |
| No. Observations: | 2210 | AIC: | 9992. |
| Df Residuals: | 2208 | BIC: | 1.000e+04 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.9326 | 0.103 | -38.336 | 0.000 | -4.134 | -3.731 |
| PHSPHT | 14.7400 | 0.053 | 278.514 | 0.000 | 14.636 | 14.844 |

| Omnibus: | 874.728 | Durbin-Watson: | 0.269 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 5147.310 |
| Skew: | -1.766 | Prob(JB): | 0.00 |
| Kurtosis: | 9.589 | Cond. No. | 4.90 |

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
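The fitted intercept and slope can also be extracted programmatically from the fitted results object, for example:
# extract the fitted coefficients as a pandas Series (Intercept, PHSPHT)
print(smres.params)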
Method 3: pingouin#
pg.linear_regression(x[ii],y[ii])
| | names | coef | se | T | pval | r2 | adj_r2 | CI[2.5%] | CI[97.5%] |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Intercept | -3.932572 | 0.102582 | -38.335853 | 6.540950e-247 | 0.972323 | 0.972311 | -4.133740 | -3.731405 |
| 1 | PHSPHT | 14.740035 | 0.052924 | 278.514375 | 0.000000e+00 | 0.972323 | 0.972311 | 14.636249 | 14.843820 |
Method 4: Regression coefficients using numpy’s polyfit function#
We can also use the polyfit
function from numpy to calculate the coefficients of the line of best fit (minimizing the sum of square errors):
coefficients = np.polyfit(x[ii],y[ii],1)
print(coefficients)
[14.74003452 -3.93257206]
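The coefficients returned by polyfit can be used to evaluate the fitted line with numpy's polyval function, for example:
# evaluate the fitted line at the observed phosphate values
y_fit = np.polyval(coefficients, x[ii])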
5.4. Matrix multiplication and linear algebra#
In the next section, we’ll consider solving for the coefficients of the linear fit using matrices. But first, let’s do a quick review of matrix multiplication:
Review: Matrix Multiplication#
Suppose we have the following matrices:

$$
\textbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \qquad
\textbf{B} = \begin{bmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{bmatrix}
$$
The matrix product \(\textbf{AB}\) is defined by dot products: the entry in row \(i\), column \(j\) of \(\textbf{AB}\) is the dot product of row \(i\) of \(\textbf{A}\) with column \(j\) of \(\textbf{B}\).
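For the example matrices above, writing out those dot products element by element gives

$$
\textbf{AB} = \begin{bmatrix}
1(7)+2(9)+3(11) & 1(8)+2(10)+3(12) \\
4(7)+5(9)+6(11) & 4(8)+5(10)+6(12)
\end{bmatrix}
= \begin{bmatrix} 58 & 64 \\ 139 & 154 \end{bmatrix},
$$

which matches the numerical results below.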
To define matrices in Python, we define 2-d arrays as lists of lists wrapped in numpy’s array
function, for example:
# define matrix A
A = np.array([[1, 2, 3],
[4, 5, 6]])
# define matrix B
B = np.array([[7, 8],
[9, 10],
[11, 12]])
We can check the dimensions of these arrays using numpy's shape
function:
print('shape of A: ', np.shape(A))
print('shape of B: ', np.shape(B))
shape of A: (2, 3)
shape of B: (3, 2)
Finally, we can multiply two arrays using numpy’s dot
function:
np.dot(A,B)
array([[ 58, 64],
[139, 154]])
Alternatively, we could use np.matmul
or the @
operator:
np.matmul(A,B)
array([[ 58, 64],
[139, 154]])
A@B
array([[ 58, 64],
[139, 154]])
It is important to remember that matrix multiplication is not commutative, meaning \(\textbf{AB}\) is generally not the same as \(\textbf{BA}\). In this example, \(\textbf{BA}\) gives us a different size matrix.
np.dot(B,A)
array([[ 39, 54, 69],
[ 49, 68, 87],
[ 59, 82, 105]])
Review: Matrix Transpose#
The transpose of a matrix \(\textbf{A}^T\) has the same values as \(\textbf{A}\), but the rows are converted to columns. One way to do this is with the np.transpose
function.
print(A)
[[1 2 3]
[4 5 6]]
print(np.transpose(A))
[[1 4]
[2 5]
[3 6]]
Another way is to use the .T
attribute of a Numpy array.
print(A.T)
[[1 4]
[2 5]
[3 6]]
Note that the product \(\textbf{A}^T\textbf{A}\) is a square matrix, which has the same number of rows and columns.
np.dot(A.T, A)
array([[17, 22, 27],
[22, 29, 36],
[27, 36, 45]])
Review: Matrix Inverse#
The concept of a matrix inverse is similar to the inverse of a single number. If we have a single value \(b\), its inverse can be represented as \(b^{-1} = 1/b\). A value times its inverse is equal to 1.
The inverse of a single number can be used to solve for \(x\) in a linear equation \(bx = c\). For example:

$$
x = b^{-1}c = \frac{c}{b}
$$
Let’s say we have a square matrix \(\textbf{B}\), where the number of rows and columns are equal. The inverse \(\textbf{B}^{-1}\) is the matrix that gives

$$
\textbf{B}^{-1}\textbf{B} = \textbf{B}\textbf{B}^{-1} = \textbf{I},
$$
where the identity matrix \(\textbf{I}\) is a matrix that has all 0 values, except for 1 values along the diagonal from the upper left to the lower right. For example, a 3x3 identity matrix would be

$$
\textbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$
This is called an identity matrix because \(\textbf{B}\textbf{I} = \textbf{I}\textbf{B} = \textbf{B}\) for square matrices. This is analogous to \(1b = b\) for single values.
In a linear algebra class, you might calculate \(\textbf{B}^{-1}\) by hand, but in this class we will rely on Numpy to do it for us. Let’s set up a \(3 \times 3\) \(\textbf{B}\) matrix.
B = np.array([[1, 2, 1],
[3, 2, 1],
[1, 2, 2]])
print(B)
[[1 2 1]
[3 2 1]
[1 2 2]]
The inverse \(\textbf{B}^{-1}\) is
np.linalg.inv(B)
array([[-0.5 , 0.5 , 0. ],
[ 1.25, -0.25, -0.5 ],
[-1. , 0. , 1. ]])
The product \(\textbf{B}^{-1}\textbf{B}\) can be calculated as
BinvB = np.dot(np.linalg.inv(B), B)
print(BinvB)
[[ 1.00000000e+00 3.33066907e-16 1.66533454e-16]
[-5.55111512e-17 1.00000000e+00 -2.22044605e-16]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00]]
If we round these values, we can see more clearly that this is nearly identical to the identity matrix \(\textbf{I}\), with some very small round-off error.
print(np.round(BinvB))
[[ 1. 0. 0.]
[-0. 1. -0.]
[ 0. 0. 1.]]
For reference, an identity matrix can be created with the np.eye
function
print(np.eye(3))
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
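We can also use this to verify numerically that multiplying \(\textbf{B}\) by the identity matrix leaves it unchanged:
# multiplying B by the identity matrix returns B (as floating-point values)
print(B @ np.eye(3))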
Formulating linear regression in matrix form#
We can formulate the regression model in terms of matrices as

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} =
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}
$$

or

$$
\vec{y} = \textbf{X}\vec{c} + \vec{\varepsilon}.
$$

Here, \(\vec{\varepsilon}\) represents the vector of errors, i.e. the differences between the model and data values.
To solve for the parameters \(\vec{c}\) using matrix multiplication, we first need to formulate the \(\vec{y}\) and \(\textbf{X}\) matrices.
# check to see that y[ii] is only 1-d (and won't work directly for matrix multiplication)
# print(np.shape(y[ii]))
# define an (N, 1) matrix of the y values
y_matrix = np.reshape(y[ii].ravel(),(len(y[ii]),1))
# print the matrix
print(y_matrix)
# print the shape of the matrix
print('shape of y: ', np.shape(y_matrix))
[[39.14]
[40.36]
[42.36]
...
[26.46]
[25.29]
[23.98]]
shape of y: (2210, 1)
# define a matrix X with a column of ones and a column of the x values
X = np.column_stack([np.ones((len(x[ii]),1)),
np.ravel(x[ii])])
# print the X matrix
print(X)
# print the shape of the X matrix
print(np.shape(X))
[[1. 2.73]
[1. 2.83]
[1. 2.94]
...
[1. 2.35]
[1. 2.26]
[1. 2.1 ]]
(2210, 2)
Now that we have our matrices set up, let’s return to our model equation in matrix form,

$$
\hat{y} = \textbf{X}\vec{c}.
$$

We can multiply each side of the equation by the transpose \(\textbf{X}^T\),

$$
\textbf{X}^T\hat{y} = \textbf{X}^T\textbf{X}\vec{c},
$$

then multiply each side by the inverse of \(\textbf{X}^T\textbf{X}\),

$$
\left(\textbf{X}^T\textbf{X}\right)^{-1}\textbf{X}^T\hat{y} = \left(\textbf{X}^T\textbf{X}\right)^{-1}\textbf{X}^T\textbf{X}\vec{c}.
$$

Since \(\left(\textbf{X}^T\textbf{X}\right)^{-1}\textbf{X}^T\textbf{X} = \textbf{I}\), this reduces to

$$
\vec{c} = \left(\textbf{X}^T\textbf{X}\right)^{-1}\textbf{X}^T\hat{y},
$$

giving us an expression for the vector of coefficients \(\vec{c}\).

Here \(\hat{y}\) represents the vector of model values. With the vector of data values \(\vec{y}\), we can solve for the coefficients that minimize the sum of square errors using the same equation:

$$
\vec{c} = \left(\textbf{X}^T\textbf{X}\right)^{-1}\textbf{X}^T\vec{y}.
$$
Using numpy, we can define the matrix components of the above equation and then run the calculation to find the coefficients:
# calculate the transpose of the matrix X
XT = np.transpose(X)
# calculate the product of XT and X
XTX = np.dot(XT,X)
# calculate the inverse of XTX
XTX_inverse = np.linalg.inv(XTX)
# then calculate the coefficients as the product of XTX_inverse, XT, and y_matrix using two dot products
c = np.dot(XTX_inverse,np.dot(XT,y_matrix))
# print the coefficients
print(c)
[[-3.93257206]
[14.74003452]]
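With the coefficients in hand, we can also compute the model values and the error vector \(\vec{\varepsilon}\) directly from the matrix formulation (the names y_hat and residuals below are our own):
# model values from the matrix formulation, shape (2210, 1)
y_hat = X @ c
# error vector: differences between data and model values
residuals = y_matrix - y_hat
# sum of squared errors minimized by the least-squares fit
print(np.sum(residuals**2))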
As a sanity check, we can verify that these coefficients match those from numpy’s polyfit
function (which lists the slope first):
coefficients = np.polyfit(x[ii],y[ii],1)
print(coefficients)
[14.74003452 -3.93257206]
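As an aside, the same least-squares coefficients can be obtained without explicitly forming the matrix inverse, which is generally more numerically stable. A minimal sketch using numpy’s built-in solver:
# solve the least-squares problem y = Xc directly, without inverting X^T X
c_lstsq, sum_sq_resid, rank, sing_vals = np.linalg.lstsq(X, y_matrix, rcond=None)
print(c_lstsq)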