| | Location | Loc | Population | Marriage | Divorce | WaffleHouses | South |
|---|---|---|---|---|---|---|---|
| 0 | Alabama | AL | 4.78 | 20.2 | 12.7 | 128 | 1 |
| 1 | Alaska | AK | 0.71 | 26.0 | 12.5 | 0 | 0 |
| 2 | Arizona | AZ | 6.33 | 20.3 | 10.8 | 18 | 0 |
| 3 | Arkansas | AR | 2.92 | 26.4 | 13.5 | 41 | 1 |
| 4 | California | CA | 37.25 | 19.1 | 8.0 | 0 | 0 |
2026-02-09
statsmodels

\[ g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + c = \sum_{i=1}^n w_i x_i + c = \mathbf{w}^T\mathbf{x} + c \]
\[ \mathbf{w} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{y} \]
Here \(\mathbf{X}\) is the design matrix \[ \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1n} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{m1} & \cdots & x_{mn} & 1 \end{pmatrix} \]
The design matrix stacks the observations in its rows, one observation per row, with a final column of ones for the intercept.
Linear algebra is math worth learning
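The normal equations above can be checked directly with NumPy. This is a minimal sketch on made-up data (the coefficients 3, −1 and intercept 0.5 are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical data: m = 50 observations, n = 2 features
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))
y = 3.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + 0.5 + rng.normal(scale=0.1, size=50)

# Design matrix: one observation per row, plus a column of ones
# so the intercept c becomes the last entry of w
X = np.column_stack([X_raw, np.ones(len(X_raw))])

# Normal equations: solve (X^T X) w = X^T y
# (np.linalg.solve is more stable than forming the inverse explicitly)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```

The recovered `w` is close to the coefficients used to generate the data, with the intercept in the last slot.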
ISLP
It’s rarely a mistake to start with linear regression, even if you only end up using it as a benchmark for comparison.






\[ \mathbf{w} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} \]
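In practice statsmodels solves this for you and adds the intercept automatically. A sketch using just the five WaffleDivorce rows shown above (the full dataset has all 50 states, so these estimates are illustrative only):

```python
import pandas as pd
import statsmodels.formula.api as smf

# First five rows of the WaffleDivorce table above
df = pd.DataFrame({
    "Loc": ["AL", "AK", "AZ", "AR", "CA"],
    "Population": [4.78, 0.71, 6.33, 2.92, 37.25],
    "Marriage": [20.2, 26.0, 20.3, 26.4, 19.1],
    "Divorce": [12.7, 12.5, 10.8, 13.5, 8.0],
})

# Formula interface: intercept is included by default; the normal
# equations are solved internally
fit = smf.ols("Divorce ~ Marriage", data=df).fit()
print(fit.params)
```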

This problem is ill-posed when \(\mathbf{X}^T\mathbf{X}\) is not invertible, but we will show some tricks in a few weeks.
Confounding describes a situation where the predictions of a model differ from the result of experimentally manipulating the variables of that model.
Can you tell me an example of this?
If there is a variable you suspect is a common cause of both your dependent and independent variables, you must include it in your regression for the coefficients to be interpretable.
Collider bias describes a situation where the independent and dependent variables both influence a third variable that you have conditioned on in your analysis.

Hint: Smokers were about twice as likely to have babies lighter than 2500 grams, which is considered “low birthweight”.
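The birthweight paradox is a collider story. This hypothetical simulation (all variables synthetic and standardized) shows how selecting on a collider manufactures an association between two independent variables:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)           # e.g. an exposure such as smoking
y = rng.normal(size=n)           # independent of x by construction
z = x + y + rng.normal(size=n)   # collider: both x and y influence z

# Marginally, x and y are uncorrelated
r_all = np.corrcoef(x, y)[0, 1]
# Conditioning on the collider (selecting low-z cases, e.g. low
# birthweight) induces a spurious negative association
sel = z < np.quantile(z, 0.25)
r_sel = np.corrcoef(x[sel], y[sel])[0, 1]
print(round(r_all, 3), round(r_sel, 3))
```

Within the selected group, knowing `x` is high tells you `y` is probably low (something had to push `z` down), even though they are independent overall.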

The AIC for a statistical model is: \[ AIC = 2n - 2\log(\hat{L}) \]
\(\hat{L}\) is the likelihood of the model
\(n\) is the number of parameters
This balances model complexity (\(n\)) with in-sample accuracy
AIC approximates the loss of information caused by using the model out of sample: \[ KL(p,q) = \int_{X} p(x)\left(\log(p(x)) - \log(q(x))\right)\,dx \]

```
<class 'pandas.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   AtBat      263 non-null    int64
 1   Hits       263 non-null    int64
 2   HmRun      263 non-null    int64
 3   Runs       263 non-null    int64
 4   RBI        263 non-null    int64
 5   Walks      263 non-null    int64
 6   Years      263 non-null    int64
 7   CAtBat     263 non-null    int64
 8   CHits      263 non-null    int64
 9   CHmRun     263 non-null    int64
 10  CRuns      263 non-null    int64
 11  CRBI       263 non-null    int64
 12  CWalks     263 non-null    int64
 13  League     263 non-null    category
 14  Division   263 non-null    category
 15  PutOuts    263 non-null    int64
 16  Assists    263 non-null    int64
 17  Errors     263 non-null    int64
 18  Salary     263 non-null    float64
 19  NewLeague  263 non-null    category
dtypes: category(3), float64(1), int64(16)
memory usage: 37.9 KB
```
```
CRBI       0.566966
CRuns      0.562678
CHits      0.548910
CAtBat     0.526135
CHmRun     0.524931
CWalks     0.489822
RBI        0.449457
Walks      0.443867
Hits       0.438675
Runs       0.419859
Years      0.400657
AtBat      0.394771
HmRun      0.343028
PutOuts    0.300480
Assists    0.025436
Errors     0.005401
Name: Salary, dtype: float64
```
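A ranking like the one above is a one-liner in pandas. Here is a sketch on a tiny synthetic stand-in (column names borrowed from the Hitters data, values made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few Hitters columns
rng = np.random.default_rng(4)
n = 2000
df = pd.DataFrame({
    "CRBI": rng.normal(size=n),
    "Hits": rng.normal(size=n),
    "Errors": rng.normal(size=n),
})
df["Salary"] = 0.6 * df["CRBI"] + 0.4 * df["Hits"] + rng.normal(scale=0.5, size=n)

# Correlation of every other column with Salary, highest first
ranking = df.corr()["Salary"].drop("Salary").sort_values(ascending=False)
print(ranking)
```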
| | model | AIC |
|---|---|---|
| 1 | intelligent | 3780.465325 |
| 2 | everything | 3790.489087 |
| 0 | basic | 3859.826181 |
The intelligent model dropped several features that were irrelevant or redundant: HmRun, RBI, CHmRun.
DATA 622