量化金融面试知识点: 线性回归(Linear Regression)

发表于 2016-07-16 | 分类于量化分析知识点

现在公式暂时还不能正常显示，由于是这个issue
等作者修好了我再来更新吧

Assumptions

linearity of the conditional probability
Strict exogeneity: Erros are uncorrelated with indepedent variables (X)
- If violated, it is called endogeneity
No multicollinearity: All regressor variables are linearly independent
Variance of erros should be constant: it is called homoscedasticity.
- If violated, it is called heteroscedasticity
Errors have No serial correlation/autocorrelation
Errors are normally distributed
Errors are independent and identically distributed

Estimation Model

coefficients: $$\beta = (X^TX)^(-1)(X^TY)$$
variance of coefficients: Var(\beta|X) = ^(\sigmaerr^2)/((n - 1)s_x^2)
- More variance in the noise means \beta is more variable
- Larger sample variance means smaller variance of coefficients. It is because it’s much easier to estimate the coefficients
- Higher sampling frequency reduce variance

Variance, Sum of Squares and R^2

TSS: total sum of squares
TSS = SUM of (Y_i - \overline{Y})^2
It is the total variance in oberserved dependent variable
Regression SS:
RSS = SUM of (Y_fit - \overline{Y})^2
total variance in fitted observed dependent variables
Residual error SS:
RESS = SUM of (Y_i - Y_fit)^2
R^2
R^2 = 1 - ^(RESS/_TSS)
R^2 is the sample covariance between Y and Y_fit
- Special case: Single X variable
  R^2 measures the sample covariance between Y and X
Adjusted R^2
R^2 increases with number of parameters
Adjusted R^2 is adjusted by the degree of freedom
adj-R^2 = 1 - ^RSS(n - p - 1)^(-1)/_(TSS(n - 1)^(-1))
Durbin-Watson Test
Test if there is serial correlation in residuals/autocorrelation
If the p-value from the test is low, it indicates they are probably autocorrelation in noise
ACF
ACF graph is used to look for potential serial correlation at a number of lags

Testing

Test if multiple coefficients are significant (not zero)
F-test
- This can be used to compare two models that one of the model has a subset of variables
Model Selection Criteria
- AIC & BIC
  The smaller the error variance, the smaller AIC/BIC but it is penalized by number of variables
- R^2
Variance inflation factor (VIF)
- Measures how much the variance increases by including other predictor variables (test for multicollinearity)
- Calculate by runnning regression of X_j on X_1 … Xn
  and get R^2: VIF = ^1/(1 - R^2)

Violation of Assumptions

Multicollinearity
If two or more variables are strongly correlated, it brings in multicollinearity problem
- the standard error of coefficients increases
- It’s harder to seperate effects for correlated variables
- Estimated coefficients are highly sensitive to whether the correlated variables exists

leetcode解题: Edit Distance (72)

发表于 2016-07-13 | 分类于刷题总结

Given two words word1 and word2, find the minimum number of steps required to convert word1 to word2. (each operation is counted as 1 step.)

You have the following 3 operations permitted on a word:

a) Insert a character
b) Delete a character
c) Replace a character

解法：DP with O(N^2) Time and O(N^2) Space

双字符串统计min的问题，容易想到用DP解决。
假设dp[i][j] 表示前i个字符和前j个字符的子串，convert substr1 to substr2的minimum steps.
那么对于初始条件我们容易得到：

dp[0][j] = j
dp[i][0] = i
这里表示的是，如果其中一个是空字符，那么你要么插入j个字符或者删除j个字符，任何一种情况都是j个操作
当考虑状态方程的时候，要考虑i和j两位置上的字符是否相等，分两种情况:
1 charAt(i) == charAt(j)
当最后比较的字符相同时，有三种情况/操作可能出现
1 可能最小值来自于i - 1和j - 1的匹配，
2 或者是i - 1转为j的匹配，那么就要加上1表示从i字符串删除一位变为i - 1字符串的操作。
3 也可能是把i匹配为j - 1的字符串，然后最后加上第j个字符(insert操作）
dp[i][j] = Min(dp[i - 1][j - 1], dp[i - 1][j] + 1, dp[i][j - 1] + 1)

2 charAt(i) != charAt(j)
相类似于上面一种情况，唯一不同的是，dp[i - 1][j - 1]要加上1表示把第i位转化为第j位的操作
dp[i][j] = Min(dp[i - 1][j - 1] + 1, dp[i - 1][j] + 1, dp[i][j - 1] + 1）

public int minDistance(String word1, String word2) {
		
	// input validation
	if (word1 == null && word2 == null) {
		return 0;
	}
	
	if (word1 == null) {
		return word2.length();
	}
	
	if (word2 == null) {
		return word1.length();
	}
	
	// DP
	
	int[][] dp = new int[word1.length() + 1][word2.length() + 1];
	
	// initialize
	for (int i = 0; i <= word1.length(); i++) {
		dp[i][0] = i;
	}
	
	for (int j = 0; j <= word2.length(); j++) {
		dp[0][j] = j;
	}
	
	for (int i = 1; i <= word1.length(); i++) {
		for (int j = 1; j <= word2.length(); j++) {
			if (word1.charAt(i - 1) == word2.charAt(j - 1)) {
				// make sure you get the index right, since we added 1 to the array, we need to subtract 1 
				// to get the index positions
				dp[i][j] = Math.min(dp[i - 1][j - 1], Math.min(dp[i - 1][j], dp[i][j - 1]) + 1);
			} else {
				dp[i][j] = Math.min(dp[i - 1][j - 1] + 1, Math.min(dp[i - 1][j], dp[i][j - 1]) + 1);
			}
		}
	}
	
	return dp[word1.length()][word2.length()];
}

leetcode解题: Unique Binary Search Trees (96)

发表于 2016-07-13 | 分类于刷题总结

Given n, how many structurally unique BST’s (binary search trees) that store values 1…n?

For example,
Given n = 3, there are a total of 5 unique BST’s.

解法：DP with O(N^2) time and O(N) space

通常count的问题也可以考虑用dp的方法求解。由于是binary search trees,那么首先要搞清楚的是binary search tree的结构：
1 root左边的所有节点一定比root小
2 root右边的所有节点一定比root大
如果我们有N个数字可以选择，那么如果选择了一个i为root,则1 ~ （i -1)都必须在左子数，(i + 1) ~ N 都必须在右子数
具体每一个子数有多少unique的排列则完全取决于node的个数。
那么我们可以用memorization的思想，从1个node开始，存下每个node可能的排列个数，然后对每一个可能作为root的计算相应的子数的个数。对应的算法为O(N^2)的Time complexity
举例，如果N = 3
那么可以成为root的有1，2，3三个数
当1作为root的时候，由于他是最小值，则剩下的2个数只能排放右边，故排列数为num(2)
当2作为root的时候，比他小的数有一个，比他大的数有1个，分列左右两边，故排列数为num(1) * num(1)
当3作为root的时候，同理1，他是最大值，排列数为num(1) * num(1)
所以总的排列数是每个情况的相加,只要知道了num(1)和num(2)就可以求出num(3)

同时，这个问题也是Catalan Number的一个应用，具体可以看Wiki

public int numTrees(int n) {
	
	if (n <= 0) {
		return 1; // If no nodes, then just an empty tree, so 1
	}
	
	int[] dp = new int[n + 1];
	dp[0] = 1;	// empty tree
	dp[1] = 1;	// single node tree
	
	for (int i = 2; i <= n; i++) {
		int sum = 0;
		for (int j = 1; j <= i; j++) {
			int left = j - 1;	// number of left node
			int right = i -  j;	// number of right node
			sum += dp[left] * dp[right];	// multiply 
		}
		dp[i] = sum;
	}
	
	return dp[n];
}

leetcode解题: Maximum Product Subarray (152)

发表于 2016-07-13 | 分类于刷题总结

Find the contiguous subarray within an array (containing at least one number) which has the largest product.

For example, given the array [2,3,-2,4],
the contiguous subarray [2,3] has the largest product = 6.

解法：DP with O(N) time and O(1) space

由于有负数的存在，有可能负负得正的情况出现，所以需要维护当前的最大值和最小值。最大值是max(A[i], min A[i], max A[i])
同时对每一个新元素要update最大值和最小值。

public int maxProduct(int[] nums) {
	
	if (nums == null || nums.length == 0) {
		return 0;
	}
	
	int min = nums[0];
	int max = nums[0];
	int res = nums[0];
	
	for (int i = 1; i < nums.length; i++) {
		int current_max = Math.max(max * nums[i], Math.max(min * nums[i], nums[i]));	
		int current_min = Math.min(max * nums[i], Math.max(min * nums[i], nums[i]));
		res = Math.max(res, current_max);
		// 要注意的是max和min需要同时update,不能先update一个再update另一个，因为计算max和min的公式中用到了对方
		max = current_max;
		min = current_min;
	}
	return res;
}

leetcode解题: Range Sum Query - immutable (303)

发表于 2016-07-13 | 分类于刷题总结

Given an integer array nums, find the sum of the elements between indices i and j (i ≤ j), inclusive.

Example:
Given nums = [-2, 0, 3, -5, 2, -1]

sumRange(0, 2) -> 1
sumRange(2, 5) -> -1
sumRange(0, 5) -> -3
Note:
You may assume that the array does not change.
There are many calls to sumRange function.

解法: Memorization/DP with O(1) time and O(N) space

就是Subarray类题目的变形，求subarray sum首先要想到前缀和数组。维护一个cumulative sum,然后每次计算i到j的和的时候就是sum[j] - sum[i-1],花费时间是O(1)要注意的是下边界越界的处理。
dp[i] = (k - 1) * (dp[i -1] + dp[i - 2])

public class NumArray {
	
	int[] cumsum  = null;
    public NumArray(int[] nums) {
       // calculate the cumulative sum O(N)
	   
	   // exception handling
	   if (nums == null || nums.length == 0) {
			return;
		}
		
	   cumsum = new int[nums.length];
	   cumsum[0] = nums[0];
	   for (int i = 1; i < nums.length; i++) {
			cumsum[i] = cumsum[i - 1] + nums[i];
		}
    }
    public int sumRange(int i, int j) {
        // Array access: O(1)
		if (cumsum == null) {
			return 0;
		}
		
		if (i == 0) {
			return cumsum[j];
		}
		
		return cumsum[j] - cumsum[i - 1];
    }
}
// Your NumArray object will be instantiated and called as such:
// NumArray numArray = new NumArray(nums);
// numArray.sumRange(0, 1);
// numArray.sumRange(1, 2);

leetcode解题: Paint Fence (276)

发表于 2016-07-13 | 分类于刷题总结

There is a fence with n posts, each post can be painted with one of the k colors.
You have to paint all the posts such that no more than two adjacent fence posts have the same color.
Return the total number of ways you can paint the fence.

解法1: DP with O(N) time and O(1) space

经典DP, dp[i]表示第i根柱子可能paint的方法数量。考虑三根柱子，如果第i根柱子和第i - 1根一样，那么一定和第i-2根不一样，所以这种情况有k - 1个颜色可选。同理，如果i和第i -2根一样，
那必然和第i - 1根不一样，也是k - 1种颜色可选。
dp[i] = (k - 1) * (dp[i -1] + dp[i - 2])

public int numWays(int n, int k) {
	
	if (n <= 0) {
		return 0; // If no fence exists, there is no way to paint it.
	}
	
	// Corner case 
	if (n == 1) {
		return k;
	}
	int[] dp = new int[]{k, k * k, k * k}; // initial value of dp[2] is set to be equal to dp[1]. 
	for (int i = 3; i <= n; i++) {
		dp[2] = (k - 1) * (dp[0] + dp[1]);
		dp[0] = dp[1];
		dp[1] = dp[2];
	}
	
	return dp[2];
        
}

leetcode解题: House Robber(198)

发表于 2016-07-13 | 分类于刷题总结

You are a professional robber planning to rob houses along a street.
Each house has a certain amount of money stashed, the only constraint stopping you from robbing each of them is that adjacent houses have security system connected and it will automatically contact the police if two adjacent houses were broken into on the same night.

Given a list of non-negative integers representing the amount of money of each house, determine the maximum amount of money you can rob tonight without alerting the police.

解法1: DP with O(N) time and O(N) space

应该考虑到是经典的DP问题，要求一个最大的解。那么首先dp数组的含义是dp[i]代表前i个房子能取得的最大值，
考虑第i个房子，如果抢第i个房子，那么第i-1个房子一定不能取，所以最大的值是dp[i - 2] + A[i]
如果不抢第i个房子，那么最大的值就是前一个房子能取得的最大值，状态方程dp[i] = max(dp[i - 1], dp[i - 2] + A[i])。
这种考虑第i个数字取或者不取得思路在其他dp问题中也经常见到。

public int rob(int[] nums) {
	if (nums == null || nums.length == 0) {
		return 0;
	}
	
	if (nums.length == 1) {
		return nums[0];
	}
	
	// initialize
	int[] dp = new int[nums.length];
	dp[0] = nums[0];
	dp[1] = Math.max(dp[0], nums[1]);
	
	// dp transition
	for (int i = 2; i < nums.length; i++) {
		dp[i] = Math.max(dp[i - 1], dp[i - 2] + nums[i]);
	}
	
	return dp[nums.length - 1];
}

解法1优化: DP with O(N) time and O(1) space

实际上观察状态方程可以发现，我们只需要记录dp[i - 2], dp[i -1]两个数值，所以可以维护一个数组{A,B,C},前两个数字作为buffer，不停更新三个数字。

public int rob(int[] nums) {
	if (nums == null || nums.length == 0) {
		return 0;
	}
	
	if (nums.length == 1) {
		return nums[0];
	}
	
	// initialize
	int[] dp = new int[]{nums[0], Math.max(nums[0], nums[1]), Math.max(nums[0], nums[1])};
	
	// dp transition
	for (int i = 2; i < nums.length; i++) {
		dp[2] = Math.max(dp[1], dp[0] + nums[i]);
		dp[0] = dp[1];
		dp[1] = dp[2];
	}
	
	return dp[2];
}

leetcode解题: Best time to buy and sell stock IV (188)

发表于 2016-07-08 | 分类于刷题总结

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete at most k transactions.

Note:
You may not engage in multiple transactions at the same time (ie, you must sell the stock before you buy again).

解法:

比较难想清楚的一题。
需要用到两个dp matrix, 一个叫local, 一个叫global
local[i][j] 表示的是前i天做j次操作，最后一次sell正好是i天时的最大利润
global[i][j]表示的是前i天做j次操作，最大的利润
那么答案就是global[N][k]
local的递推关系是：
local[i][j]可以是i-1天已经完成了j次交易，那么只要把最后一天的交易顺延，就得到了i天完成j次交易。另外一个可能是到i - 1天完成了j - 1次交易，那么只要再加上新加的一天的交易即可。
global的递推关系是：
global[i][j]是要么新一天的local maxi, 或者是前一天的global max

public class Solution {
    public int maxProfit(int k, int[] prices) {
        if (prices == null || prices.length == 0) {
            return 0;
        }
        int n = prices.length;
        if (k > n) {
            return maxProfit_naive(prices);
        }
        int[][] global = new int[n][k + 1];
        int[][] local = new int[n][k + 1];
        for (int i = 1; i < n; i++) {
            int diff = prices[i] - prices[i - 1];
            for (int j = 1; j <= k; j++) {
                local[i][j] = Math.max(global[i - 1][j - 1] + diff, local[i - 1][j] + diff);
                global[i][j] = Math.max(local[i][j], global[i -1][j]);
            }
        }
        return global[n - 1][k];
    }
    private int maxProfit_naive(int[] prices) {
        int sum = 0;
        for (int i = 1; i < prices.length; i++) {
            if (prices[i] > prices[i - 1]) {
                sum += prices[i] - prices[i - 1];
            }
        }
        return sum;
    }
}

leetcode解题: best time to buy and sell stock III (123)

发表于 2016-07-08 | 分类于刷题总结

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete at most two transactions.

Note:
You may not engage in multiple transactions at the same time (ie, you must sell the stock before you buy again).

解法1： Divide & Conquer O(N^2) Time + O(N) Space

思路是利用I的解题方法，对于一个(i,j)范围的数组，可以用O(N)的时间和O(1)的空间算出最大的利润
那么可以将原数组分割为(0,i) + (i,length - 1)的两个数组，分别计算每个范围内的最大利润再相加。
一共有N种分割方法，所以总的时间复杂度是O(N^2)
这个解法lintcode可以AC,leetcode会TLE，所以需要改进这个解法，最好的是O(N) Time

public int maxProfit(int[] prices) {
	if (prices == null || prices.length == 0) {
		return 0;
	}
	int res = 0;
	for (int i = 0; i < prices.length - 1; i++) {
		int left = maxSingleTransition(prices, 0, i);
		int right = maxSingleTransition(prices, i, prices.length - 1);
		int sum = left + right;
		res = Math.max(res, sum);
	}
	return res;
}
private int maxSingleTransition(int[] prices, int start, int end) {
	int min = prices[start];
	int res = 0;
	for (int i = start; i <= end; i++) {
		res = Math.max(res, prices[i] - min);
		min = Math.min(min, prices[i]);
	}
	return res;
}

解法2：两次DP，O(N) Time + O(N) Space

思路还是类似于第一种解法，但是做了优化。计算(0,i)的时候实际可以运用(0,i-1)的结果。同样，计算(i,length - 1)的时候可以
运用(i+1,length - 1)的结果。这就引出了用两个dp数组先记录下每一个区间的最大收益，然后再扫描一遍得到左右两个区间和的最大值。

public int maxProfit(int[] prices) {
	if (prices == null || prices.length == 0) {
		return 0;
	}
	int[] left = new int[prices.length];
	int[] right = new int[prices.length];
	// Scan from left
	int min = prices[0];
	for (int i = 1; i < prices.length; i++) {
		left[i] = Math.max(left[i - 1], prices[i] - min);
		min = Math.min(min, prices[i]);
	}
	// Scan from right
	int max = prices[prices.length - 1];
	for (int i = prices.length - 2; i >= 0; i--) {
		right[i] = Math.max(right[i + 1], max - prices[i]);
		max = Math.max(max, prices[i]);
	}
	// search for the largest sum
	int res = 0;
	for (int i = 0; i < prices.length; i++) {
		res = Math.max(res, left[i] + right[i]);
	}
	return res;
}

解法３: 用交替的dp方程组

class Solution {
    public int maxProfit(int[] prices) {
        if (prices == null || prices.length == 0) {
            return 0;
        }
        int t_i10 = 0, t_i11 = Integer.MIN_VALUE, t_i20 = 0, t_i21 = Integer.MIN_VALUE;
        for (int price : prices) {
            t_i20 = Math.max(t_i20, t_i21 + price);
            t_i21 = Math.max(t_i21, t_i10 - price);
            t_i10 = Math.max(t_i10, t_i11 + price);
            t_i11 = Math.max(t_i11, -price);
        }
        return t_i20;
    }
}

leetcode解题: Best time to buy and sell stock II （122）

发表于 2016-07-08 | 分类于刷题总结

Say you have an array for which the ith element is the price of a given stock on day i.

Design an algorithm to find the maximum profit. You may complete as many transactions as you like
(ie, buy one and sell one share of the stock multiple times). However, you may not engage in multiple transactions at the same time
(ie, you must sell the stock before you buy again).

解法1：O(N)

既然是可以交易任意多次，只要price[i] > price[i - 1]我们就认为是一次合格的交易，
那么最大的profit一定是所有positive profit的和。只要扫描一遍求出所有正差值的和即为答案
这是一种greedy的思路。

public int maxProfit(int[] prices) {
	
	if (prices == null || prices.length == 0) {
		return 0;
	}
	
	int profit = 0;
	for (int i = 1; i < prices.length; i++) {
		if (prices[i] > prices[i - 1]) {
			profit += prices[i] - prices[i - 1];
		}
	}
	
	return profit;
}