Quant Finance Interview Notes: Linear Regression


Assumptions

  • Linearity: the conditional expectation of Y given X is linear in the parameters
  • Strict exogeneity: errors are uncorrelated with the independent variables (X)
    • If violated, the problem is called endogeneity
  • No multicollinearity: all regressor variables are linearly independent
  • The variance of the errors is constant: this is called homoscedasticity
    • If violated, it is called heteroscedasticity
  • Errors have no serial correlation/autocorrelation
  • Errors are normally distributed
  • Errors are independent and identically distributed

Estimation Model

  • Coefficients: $$\hat{\beta} = (X^TX)^{-1}X^TY$$
  • Variance of the slope coefficient (single regressor): $$\mathrm{Var}(\hat{\beta} \mid X) = \frac{\sigma_\varepsilon^2}{(n - 1)s_x^2}$$
    • More variance in the noise makes \hat{\beta} more variable
    • A larger sample variance of x means a smaller variance of the coefficient, because the slope is easier to estimate when x is more spread out
    • A higher sampling frequency (more observations) reduces the variance
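The closed-form estimator and coefficient variance above can be sketched in NumPy. This is a minimal illustration on simulated data; the true coefficients (2 and 3) and noise scale are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: Y = 2 + 3*X + noise (illustrative values)
n = 500
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# beta_hat = (X^T X)^{-1} X^T Y  (solve is numerically safer than an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Var(beta_hat | X) = sigma^2 (X^T X)^{-1}, with sigma^2 estimated from the residuals
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)          # n - p - 1 degrees of freedom
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se_beta = np.sqrt(np.diag(var_beta))

print(beta_hat)   # close to [2, 3]
print(se_beta)
```

Shrinking the noise scale or increasing n in this sketch shows the variance effects listed above directly.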

Variance, Sum of Squares and R^2

  • TSS: total sum of squares
    $$TSS = \sum_i (Y_i - \overline{Y})^2$$
    This is the total variation in the observed dependent variable

  • Regression SS:
    $$RSS = \sum_i (\hat{Y}_i - \overline{Y})^2$$
    The variation in the fitted values, i.e. the part explained by the regression

  • Residual error SS:
    $$RESS = \sum_i (Y_i - \hat{Y}_i)^2$$

  • R^2
    $$R^2 = 1 - \frac{RESS}{TSS}$$
    R^2 equals the squared sample correlation between Y and \hat{Y}

    • Special case: single X variable
      R^2 equals the squared sample correlation between Y and X
  • Adjusted R^2
    R^2 mechanically increases with the number of parameters
    Adjusted R^2 corrects for the degrees of freedom:
    $$\text{adj-}R^2 = 1 - \frac{RESS/(n - p - 1)}{TSS/(n - 1)}$$

  • Durbin-Watson Test
    Tests for serial correlation (autocorrelation) in the residuals
    A statistic near 2 indicates no autocorrelation; a low p-value indicates there is probably autocorrelation in the noise

  • ACF
    An ACF plot is used to look for potential serial correlation across a range of lags
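The sum-of-squares decomposition, R^2, adjusted R^2, and the Durbin-Watson statistic above can all be computed in a few lines. A minimal sketch on simulated data (coefficients and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fit a simple regression on simulated data
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat

# Sum-of-squares decomposition: with an intercept, TSS = RSS + RESS
tss = np.sum((y - y.mean()) ** 2)          # total
rss = np.sum((y_fit - y.mean()) ** 2)      # explained by the regression
ress = np.sum((y - y_fit) ** 2)            # residual

r2 = 1 - ress / tss
p = 1                                       # number of slope parameters
adj_r2 = 1 - (ress / (n - p - 1)) / (tss / (n - 1))

# Durbin-Watson statistic on the residuals (close to 2 when no autocorrelation)
resid = y - y_fit
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print(r2, adj_r2, dw)
```

Since the simulated errors are i.i.d., the Durbin-Watson statistic comes out near 2 here; feeding in AR(1) errors instead would pull it away from 2.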

Testing

  • Test whether multiple coefficients are jointly significant (not all zero)
    F-test

    • This can be used to compare two nested models, where one model's variables are a subset of the other's
  • Model Selection Criteria

    • AIC & BIC
      The smaller the error variance, the smaller the AIC/BIC, but both criteria penalize the number of variables
    • R^2
  • Variance inflation factor (VIF)

    • Measures how much the variance of a coefficient increases because other predictor variables are included (a test for multicollinearity)
    • Calculated by running a regression of X_j on the remaining predictors and taking its R_j^2:
      $$VIF_j = \frac{1}{1 - R_j^2}$$
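The VIF recipe above translates directly into code: regress each predictor on the rest and invert 1 - R^2. A minimal NumPy sketch (the correlated-predictor setup is made up for illustration):

```python
import numpy as np

def vif(X):
    """VIF for each column of X, by regressing it on the remaining columns.

    X: (n, p) array of predictors, without an intercept column.
    """
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta = np.linalg.lstsq(A, y, rcond=None)[0]
        resid = y - A @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                     # independent of the others
X = np.column_stack([x1, x2, x3])
print(vif(X))   # large for x1 and x2, close to 1 for x3
```

A common rule of thumb is to flag VIF values above 5 or 10 as a sign of problematic multicollinearity.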

Violation of Assumptions

  • Multicollinearity
    If two or more variables are strongly correlated, multicollinearity arises
    • The standard errors of the coefficients increase
    • It is harder to separate the effects of the correlated variables
    • Estimated coefficients are highly sensitive to whether the correlated variables are included
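The standard-error inflation described above can be demonstrated by fitting the same kind of model on two simulated designs, one with independent regressors and one with highly correlated ones (all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

def ols_se(X, y):
    """Standard errors of OLS coefficients (an intercept is prepended)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.solve(A.T @ A, A.T @ y)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

# Two designs sharing x1: one adds an independent regressor,
# the other adds a near-duplicate of x1
x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)
x_corr = 0.98 * x1 + 0.05 * rng.normal(size=n)
eps = rng.normal(size=n)

y_indep = 1.0 + x1 + x_indep + eps
y_corr = 1.0 + x1 + x_corr + eps

se_indep = ols_se(np.column_stack([x1, x_indep]), y_indep)
se_corr = ols_se(np.column_stack([x1, x_corr]), y_corr)
print(se_indep[1:], se_corr[1:])   # slope SEs are far larger in the correlated design
```

The noise level is identical in both fits; only the correlation between the regressors changes, which is exactly the mechanism the bullet points above describe.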