Multiple Linear Regression
Tips
Assumptions
For a successful regression analysis, it is essential to validate these assumptions.
Linearity: The relationship between the dependent and independent variables should be linear.
Homoscedasticity: The errors should have constant variance, where variance is a measure of dispersion.
Multivariate Normality: Multiple regression assumes that the residuals are normally distributed.
Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are not independent of each other; an exact or high correlation among them distorts the model's estimates or makes them hard to estimate accurately.
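One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R^2). The sketch below is a minimal NumPy-only illustration, not a reference implementation. The function name variance_inflation_factors and the synthetic columns x1, x2, x3 are made up for this example, and the often-quoted warning threshold of roughly 5 to 10 is a rule of thumb rather than a hard limit.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]                                # feature treated as the target
        others = np.delete(X, j, axis=1)           # all remaining features
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical data: x3 is almost a copy of x1, x2 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
vif = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vif)  # x1 and x3 should show very large values, x2 a value near 1
```

A large VIF for a feature suggests that it is largely explained by the other features, so dropping or combining such features is worth considering.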
Dummy Variables
Using categorical data in multiple regression models is a powerful way to include non-numeric data types in a regression model.
Categorical data refers to data values that represent categories: values drawn from a fixed, unordered set, for instance gender (male/female). In a regression model, these values can be represented by dummy variables, which contain values such as 1 or 0 to indicate the presence or absence of a category.
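As a rough illustration of the encoding described above, the sketch below turns a categorical column into 0/1 dummy columns with plain Python and NumPy. The helper dummy_encode and its drop_first flag are hypothetical names chosen for this example. Dropping one category is an assumption added here: with an intercept in the model, keeping a dummy for every category makes the columns perfectly collinear, which would violate the multicollinearity assumption.

```python
import numpy as np

def dummy_encode(values, drop_first=True):
    """Encode a list of category labels as 0/1 dummy columns.

    Returns (matrix, kept), where kept lists the category behind each column.
    """
    categories = sorted(set(values))
    # Optionally drop the first category; it becomes the all-zeros baseline.
    kept = categories[1:] if drop_first else categories
    matrix = np.array([[1 if v == c else 0 for c in kept] for v in values])
    return matrix, kept

gender = ["male", "female", "female", "male"]
dummies, columns = dummy_encode(gender)
print(columns)           # ['male']
print(dummies.tolist())  # [[1], [0], [0], [1]]
```

Here "female" is the baseline category, so a single column is enough: 1 means male, 0 means female.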
NOTE
Having too many variables could make our model less accurate, especially if certain variables have no effect on the outcome or have a significant effect on other variables. There are various methods to select the appropriate variables, such as:
- Forward Selection
- Backward Elimination
- Bi-directional Elimination
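To make one of these methods concrete, here is a minimal sketch of backward elimination. Backward elimination is usually driven by p-values against a significance level; to stay NumPy-only, this sketch swaps in adjusted R^2 as the criterion instead, which is an assumption of the example and not part of the notes. The names adjusted_r2 and backward_elimination and the synthetic data are hypothetical.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least squares fit of y on X plus intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def backward_elimination(X, y):
    """Start with all features; drop one at a time while adjusted R^2 does not fall."""
    selected = list(range(X.shape[1]))
    best = adjusted_r2(X[:, selected], y)
    while len(selected) > 1:
        # Score every model obtained by removing exactly one remaining feature.
        scores = [
            (adjusted_r2(X[:, [k for k in selected if k != j]], y), j)
            for j in selected
        ]
        score, worst = max(scores)
        if score < best:          # every removal hurts the fit: stop
            break
        best = score
        selected.remove(worst)
    return selected, best

# Hypothetical data: only columns 0 and 1 influence y; columns 2 and 3 are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=300)
selected, score = backward_elimination(X, y)
print(selected, round(score, 3))
```

Forward selection runs the same loop in the opposite direction, starting from no features and adding the one that helps most, and the bi-directional variant allows both adding and removing a feature at each step.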