VIF is one of those diagnostics that looks simple, but it is easy to use mechanically without really thinking about what it is measuring.
Variance Inflation Factor tells us how much the variance of a coefficient estimate is inflated because that predictor overlaps with the other predictors.
In other words, it is not asking whether a variable is useful. It is asking whether the model can estimate that variable’s coefficient cleanly.
The basic idea
To calculate VIF for a predictor, we temporarily make that predictor the target.
If I want the VIF for X_j, I regress X_j on all the other predictors:
X_j = alpha + X_{-j} gamma + error
Then I take the R-squared from that auxiliary regression and calculate:
VIF_j = 1 / (1 - R_j^2)
If the other predictors cannot explain X_j, then R_j^2 is low and VIF stays close to 1.
If R_j^2 = 0:
VIF_j = 1 / (1 - 0) = 1
If the other predictors explain X_j really well, then R_j^2 gets close to 1 and VIF gets large.
If R_j^2 = 0.90:
VIF_j = 1 / (1 - 0.90) = 10
That is the whole intuition: if a predictor can be reconstructed from the other predictors, its coefficient will be harder to estimate independently.
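To make the auxiliary-regression recipe concrete, here is a minimal from-scratch sketch with numpy. The data is simulated for illustration: x2 is deliberately built as almost a copy of x1, while x3 is independent, so the first two columns should get large VIFs and the third should stay near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly reconstructable from x1
x3 = rng.normal(size=n)                   # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: regress X[:, j] on the other columns plus a constant."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # constant included
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

print([vif(X, j) for j in range(3)])  # x1 and x2 large, x3 close to 1
```

The helper name `vif` and the simulated columns are my own; the point is just that each VIF is one auxiliary regression plus the 1 / (1 - R^2) transform.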
Why VIF shows up through standard errors
High VIF does not automatically mean the coefficient is wrong. It means the coefficient is estimated with more uncertainty.
That uncertainty shows up through the standard error:
SE_with_collinearity(beta_j) = SE_without_collinearity(beta_j) * sqrt(VIF_j)
So if VIF = 9, the standard error is inflated by:
sqrt(9) = 3
Larger standard errors make confidence intervals wider and p-values larger. A variable can have a real relationship with the target and still look statistically weak if its information overlaps heavily with other variables.
The constant is not optional
One detail I want to remember: when calculating VIF, the auxiliary regression should include a constant.
This matters because the usual R-squared compares the fitted model against a baseline model that predicts the mean.
R^2 = 1 - SSE / SST
SST = sum_i (y_i - y_bar)^2
The intercept is what lets the fitted values align with that mean baseline. Without an intercept, the auxiliary regression is effectively forced through zero. That can distort R^2, and because VIF is just a transformation of R^2, it can distort VIF too.
This is especially easy to miss in statsmodels, because you often need to explicitly add the constant yourself before calculating VIF.
X_with_constant = add_constant(X)
The point is not that the constant is interesting as a variable. The point is that the auxiliary regression needs the right baseline.
Controls can change VIF
Another detail from my notes: VIF should be calculated on the design matrix that matches the model I am actually fitting.
If I add controls, the auxiliary regression for X_j now has more variables available to explain X_j. R-squared can only stay the same or increase when predictors are added. So VIF can only stay the same or increase too.
More variables in auxiliary regression
-> R_j^2 same or higher
-> VIF_j same or higher
That means a predictor may look fine without controls and become collinear after controls are added. The reverse cannot happen: if VIF is acceptable with the controls included, it cannot be worse in the smaller model, because the auxiliary regression there has fewer predictors to work with.
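A small simulated check of that monotonicity, again with statsmodels (the variable names and the 0.85 coefficient are illustrative): with only the constant alongside x, the auxiliary regression has nothing to work with and VIF is exactly 1; adding a correlated control pushes it up.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=n)
control = 0.85 * x + 0.53 * rng.normal(size=n)  # control correlated with x

# Without the control: only the constant is available, so VIF = 1
X_small = add_constant(np.column_stack([x]))
vif_without = variance_inflation_factor(X_small, 1)

# With the control: the auxiliary regression can now explain x
X_big = add_constant(np.column_stack([x, control]))
vif_with = variance_inflation_factor(X_big, 1)

print(vif_without, vif_with)  # vif_with >= vif_without
```

This is the code-level version of the chain above: more variables in the auxiliary regression means R_j^2 can only stay the same or rise, so VIF_j can only stay the same or rise.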
VIF and p-values are related, but not the same
I think of VIF and p-values as connected through standard error.
A high VIF can push a p-value up by inflating the standard error. But a high VIF with a low p-value can still happen if the effect is strong enough. And a low VIF with a high p-value may simply mean the variable does not add much signal.
So VIF tells me about redundancy. It does not tell me whether the variable matters.
How I would use it
I would pay most attention to VIF when the coefficient itself is going to be interpreted.
For a pure prediction model, correlated variables may not be a serious issue if validation performance is stable. But for a model where coefficients are used to explain effects, support business decisions, or build rating factors, VIF becomes more important.
My practical takeaway: add the constant, calculate VIF in the same design matrix the model is trained on, and treat high VIF as a reason to inspect stability rather than as an automatic reason to drop a variable.