scikit-learn LinearRegressionのp値（重要度）を見つける

28

のscikit-学ぶ線形回帰は、この情報を計算していませんが、あなたは簡単にそれを行うには、クラスを拡張することができます

hereから盗ま

from sklearn import linear_model 
from scipy import stats 
import numpy as np 


class LinearRegression(linear_model.LinearRegression): 
    """ 
    LinearRegression class after sklearn's, but calculate t-statistics 
    and p-values for model coefficients (betas). 
    Additional attributes available after .fit() 
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1]) 
    which is (n_features, n_coefs) 
    This class sets the intercept to 0 by default, since usually we include it 
    in X. 
    """ 

    def __init__(self, *args, **kwargs): 
     if not "fit_intercept" in kwargs: 
      kwargs['fit_intercept'] = False 
     super(LinearRegression, self)\ 
       .__init__(*args, **kwargs) 

    def fit(self, X, y, n_jobs=1): 
     self = super(LinearRegression, self).fit(X, y, n_jobs) 

     sse = np.sum((self.predict(X) - y) ** 2, axis=0)/float(X.shape[0] - X.shape[1]) 
     se = np.array([ 
      np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X)))) 
                for i in range(sse.shape[0]) 
        ]) 

     self.t = self.coef_/se 
     self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1])) 
     return self

。

Pythonでのこの種の統計解析については、statsmodelsをご覧ください。

出典

2015-01-13 17:54:43 elyase

+0

このバックに貢献しない任意の理由は？まだ含まれていないようだし、たくさんの覗き込みがそれに興味を持っているようだ。 – Sammaron

4

上記のコードは実際には機能しません。 sseはスカラーであることに注意してください。それから、それを反復しようとします。次のコードは変更されたバージョンです。驚くほどクリーンではありませんが、それは多かれ少なかれ動作すると思います。

class LinearRegression(linear_model.LinearRegression): 

    def __init__(self,*args,**kwargs): 
     # *args is the list of arguments that might go into the LinearRegression object 
     # that we don't know about and don't want to have to deal with. Similarly, **kwargs 
     # is a dictionary of key words and values that might also need to go into the orginal 
     # LinearRegression object. We put *args and **kwargs so that we don't have to look 
     # these up and write them down explicitly here. Nice and easy. 

     if not "fit_intercept" in kwargs: 
      kwargs['fit_intercept'] = False 

     super(LinearRegression,self).__init__(*args,**kwargs) 

    # Adding in t-statistics for the coefficients. 
    def fit(self,x,y): 
     # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column 
     # of constants. 

     # Not totally sure what 'super' does here and why you redefine self... 
     self = super(LinearRegression, self).fit(x,y) 
     n, k = x.shape 
     yHat = np.matrix(self.predict(x)).T 

     # Change X and Y into numpy matricies. x also has a column of ones added to it. 
     x = np.hstack((np.ones((n,1)),np.matrix(x))) 
     y = np.matrix(y).T 

     # Degrees of freedom. 
     df = float(n-k-1) 

     # Sample variance.  
     sse = np.sum(np.square(yHat - y),axis=0) 
     self.sampleVariance = sse/df 

     # Sample variance for x. 
     self.sampleVarianceX = x.T*x 

     # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root. ugly) 
     self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I) 

     # Standard erros for the difference coefficients: the diagonal elements of the covariance matrix. 
     self.se = self.covarianceMatrix.diagonal()[1:] 

     # T statistic for each beta. 
     self.betasTStat = np.zeros(len(self.se)) 
     for i in xrange(len(self.se)): 
      self.betasTStat[i] = self.coef_[0,i]/self.se[i] 

     # P-value for each beta. This is a two sided t-test, since the betas can be 
     # positive or negative. 
     self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

出典

2015-01-16 00:46:25 Alex

6

sklearn.feature_selection.f_regressionを使用できます。

click here for the scikit-learn page

出典

2016-01-24 23:44:19

+0

これはFテストなのだろうか？私は、線形回帰のためのp値は、典型的には個々の退行者ごとのものであると考えていました。関数の詳細な説明は、良い答えに必要です。 – wordsforthewise

29

これはやり過ぎのようなものですが、さんはそれを行くを与えてみましょう。まず、p値が

import pandas as pd 
import numpy as np 
from sklearn import datasets, linear_model 
from sklearn.linear_model import LinearRegression 
import statsmodels.api as sm 
from scipy import stats 

diabetes = datasets.load_diabetes() 
X = diabetes.data 
y = diabetes.target 

X2 = sm.add_constant(X) 
est = sm.OLS(y, X2) 
est2 = est.fit() 
print(est2.summary())

どうあるべきかを調べるために使用statsmodelをすることができますし、我々は[OK]を、のは、これを再現してみましょう

      OLS Regression Results        
============================================================================== 
Dep. Variable:      y R-squared:      0.518 
Model:       OLS Adj. R-squared:     0.507 
Method:     Least Squares F-statistic:      46.27 
Date:    Wed, 08 Mar 2017 Prob (F-statistic):   3.83e-62 
Time:      10:08:24 Log-Likelihood:    -2386.0 
No. Observations:     442 AIC:        4794. 
Df Residuals:      431 BIC:        4839. 
Df Model:       10           
Covariance Type:   nonrobust           
============================================================================== 
       coef std err   t  P>|t|  [0.025  0.975] 
------------------------------------------------------------------------------ 
const  152.1335  2.576  59.061  0.000  147.071  157.196 
x1   -10.0122  59.749  -0.168  0.867 -127.448  107.424 
x2   -239.8191  61.222  -3.917  0.000 -360.151 -119.488 
x3   519.8398  66.534  7.813  0.000  389.069  650.610 
x4   324.3904  65.422  4.958  0.000  195.805  452.976 
x5   -792.1842 416.684  -1.901  0.058 -1611.169  26.801 
x6   476.7458 339.035  1.406  0.160 -189.621 1143.113 
x7   101.0446 212.533  0.475  0.635 -316.685  518.774 
x8   177.0642 161.476  1.097  0.273 -140.313  494.442 
x9   751.2793 171.902  4.370  0.000  413.409 1089.150 
x10   67.6254  65.984  1.025  0.306  -62.065  197.316 
============================================================================== 
Omnibus:      1.506 Durbin-Watson:     2.029 
Prob(Omnibus):     0.471 Jarque-Bera (JB):    1.404 
Skew:       0.017 Prob(JB):      0.496 
Kurtosis:      2.726 Cond. No.       227. 
==============================================================================

を取得します。 Matrix Algebraを使って線形回帰分析をほぼ再現しているので、これは非常に残酷です。しかし、一体何？

これは私たちに与えます。

Coefficients Standard Errors t values Probabilites 
0  152.1335   2.576 59.061   0.000 
1  -10.0122   59.749 -0.168   0.867 
2  -239.8191   61.222 -3.917   0.000 
3  519.8398   66.534  7.813   0.000 
4  324.3904   65.422  4.958   0.000 
5  -792.1842   416.684 -1.901   0.058 
6  476.7458   339.035  1.406   0.160 
7  101.0446   212.533  0.475   0.635 
8  177.0642   161.476  1.097   0.273 
9  751.2793   171.902  4.370   0.000 
10  67.6254   65.984  1.025   0.306

したがって、statsmodelから値を再現できます。

出典

2017-03-08 17:17:57 JARH

+1

私のvar_bがすべてNansだとはどういう意味ですか？線形代数部が失敗する根本的な理由はありますか？ – famargar

+0

それはなぜそうかもしれないかについて本当に推測するのは難しいです。私はあなたのデータの構造を見て、それを例と比較します。それは手がかりを与えるかもしれない。 – JARH

+0

'code'のように見えます。np.linalg.invは、行列が反転不可能な場合でも結果を返すことがあります。それは問題かもしれません。 – JARH

3

scipyをp値として使用できます。このコードはscipyのドキュメントにあります。

>>> from scipy import stats 
>>> import numpy as np 
>>> x = np.random.random(10) 
>>> y = np.random.random(10) 
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

出典

2017-10-24 13:57:03

scikit-learn LinearRegressionのp値（重要度）を見つける

答えて

関連する問題