{"id":334,"date":"2015-08-03T09:05:43","date_gmt":"2015-08-03T09:05:43","guid":{"rendered":"http:\/\/blog.tiran.info\/?p=334"},"modified":"2017-12-05T14:18:17","modified_gmt":"2017-12-05T13:18:17","slug":"regression-lineaire-multiple-avec-r","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=334","title":{"rendered":"Multiple linear regression with R"},"content":{"rendered":"<p>Following on from the <a href=\"http:\/\/blog.tiran.info\/regression-lineaire-multiple-avec-oracle\">previous post<\/a>, this time I perform the multiple regression with R. The dataset is the same.<\/p>\n<p><strong>Retrieving the data from the DBMS<\/strong><\/p>\n<pre>&gt; library(ROracle)\r\n&gt; ora = Oracle()\r\n&gt; cnx = dbConnect(ora, username=&quot;rafa&quot;, password=&quot;rafa&quot;, dbname=&quot;S1401037:1521\/STATPDB&quot;)\r\n&gt; ImmoParis &lt;- dbGetQuery(cnx, &quot;select * from immo_paris_v&quot;)\r\n&gt; dbDisconnect(cnx)\r\n[1] TRUE\r\n&gt;<\/pre>\n<p><strong>Data preparation<\/strong><\/p>\n<p>The categorical variables are converted to factors:<\/p>\n<pre>&gt; ImmoParis$ARRONDISSEMENT &lt;- factor(ImmoParis$ARRONDISSEMENT)\r\n&gt; ImmoParis$BALCON &lt;- factor(ImmoParis$BALCON)\r\n&gt; ImmoParis$CAVE &lt;- factor(ImmoParis$CAVE)\r\n&gt; ImmoParis$PARKING &lt;- factor(ImmoParis$PARKING)\r\n&gt; ImmoParis$GARDIEN &lt;- factor(ImmoParis$GARDIEN)\r\n&gt; summary(ImmoParis)\r\n      PRIX           NBPIECES       NBCHAMBRES    PARKING BALCON  CAVE    GARDIEN   SUPERFICIE    ARRONDISSEMENT\r\n Min.   : 46000   Min.   :1.000   Min.   :0.000   0:300   0:274   0:162   0:275   Min.   
: 4.80   75018  : 50   \r\n 1st Qu.:209750   1st Qu.:1.000   1st Qu.:1.000   1: 32   1: 58   1:170   1: 57   1st Qu.:25.00   75020  : 40   \r\n Median :294500   Median :2.000   Median :1.000                                   Median :35.48   75014  : 35   \r\n Mean   :298556   Mean   :1.913   Mean   :1.006                                   Mean   :36.91   75019  : 34   \r\n 3rd Qu.:394625   3rd Qu.:2.000   3rd Qu.:1.000                                   3rd Qu.:47.95   75011  : 31   \r\n Max.   :499000   Max.   :5.000   Max.   :5.000                                   Max.   :77.00   75013  : 27   \r\n                                                                                                  (Other):115   \r\n&gt;<\/pre>\n<p><strong>Collinearity analysis<\/strong><\/p>\n<p style=\"text-align: justify;\">The variance inflation factor (VIF) of each variable can be obtained with the vif function from the car package. The figures match those from the study carried out with Oracle. NBPIECES has the highest value (GVIF of 6.54), so once again this variable is excluded:<\/p>\n<pre>&gt; res &lt;- lm(PRIX~., data=ImmoParis)\r\n&gt; library(car)\r\n&gt; ### Collinearity analysis\r\n&gt; vif(res)\r\n                   GVIF Df GVIF^(1\/(2*Df))\r\nNBPIECES       6.539100  1        2.557166\r\nNBCHAMBRES     3.912744  1        1.978066\r\nPARKING        1.231601  1        1.109775\r\nBALCON         1.238437  1        1.112851\r\nCAVE           1.304018  1        1.141936\r\nGARDIEN        1.189156  1        1.090484\r\nSUPERFICIE     4.502493  1        2.121908\r\nARRONDISSEMENT 2.428668 18        1.024955\r\n&gt;\r\n&gt; res &lt;- lm(PRIX~. 
-NBPIECES, data=ImmoParis)\r\n&gt; vif(res)\r\n                   GVIF Df GVIF^(1\/(2*Df))\r\nNBCHAMBRES     2.504415  1        1.582534\r\nPARKING        1.215899  1        1.102678\r\nBALCON         1.231309  1        1.109643\r\nCAVE           1.303490  1        1.141705\r\nGARDIEN        1.182347  1        1.087358\r\nSUPERFICIE     2.505841  1        1.582985\r\nARRONDISSEMENT 2.168864 18        1.021739\r\n&gt;<\/pre>\n<p><strong>Selecting the predictors with the stepwise method<\/strong><\/p>\n<p style=\"text-align: justify;\">Like the \u201cfeature selection\u201d functionality of ODM, the step function can be used to perform a stepwise, AIC-based search for the best combination of predictors.<\/p>\n<pre>&gt; stepw &lt;- step(res, direction=&quot;both&quot;)\r\nStart: AIC=6942.85\r\nPRIX ~ (NBPIECES + NBCHAMBRES + PARKING + BALCON + CAVE + GARDIEN + \r\n    SUPERFICIE + ARRONDISSEMENT) - NBPIECES\r\n\r\n                 Df  Sum of Sq        RSS    AIC\r\n- GARDIEN         1 5.8483e+07 3.4503e+11 6940.9\r\n- BALCON          1 6.7118e+07 3.4504e+11 6940.9\r\n- NBCHAMBRES      1 1.2915e+09 3.4627e+11 6942.1\r\n- PARKING         1 1.4300e+09 3.4641e+11 6942.2\r\n&lt;none&gt;                         3.4498e+11 6942.9\r\n- CAVE            1 7.3252e+09 3.5230e+11 6947.8\r\n- ARRONDISSEMENT 18 3.0846e+11 6.5343e+11 7118.9\r\n- SUPERFICIE      1 1.6239e+12 1.9689e+12 7519.1\r\n\r\nStep:  AIC=6940.91\r\nPRIX ~ NBCHAMBRES + PARKING + BALCON + CAVE + SUPERFICIE + ARRONDISSEMENT\r\n\r\n                 Df  Sum of Sq        RSS    AIC\r\n- BALCON          1 6.0791e+07 3.4510e+11 6939.0\r\n- PARKING         1 1.4011e+09 3.4644e+11 6940.3\r\n- NBCHAMBRES      1 1.4506e+09 3.4648e+11 6940.3\r\n&lt;none&gt;                         3.4503e+11 6940.9\r\n+ GARDIEN         1 5.8483e+07 3.4498e+11 6942.9\r\n- CAVE            1 7.5937e+09 3.5263e+11 6946.1\r\n- 
ARRONDISSEMENT 18 3.1054e+11 6.5557e+11 7118.0\r\n- SUPERFICIE      1 1.6832e+12 2.0282e+12 7527.0\r\n\r\nStep:  AIC=6938.97\r\nPRIX ~ NBCHAMBRES + PARKING + CAVE + SUPERFICIE + ARRONDISSEMENT\r\n\r\n                 Df  Sum of Sq        RSS    AIC\r\n- NBCHAMBRES      1 1.4338e+09 3.4653e+11 6938.3\r\n- PARKING         1 1.6356e+09 3.4673e+11 6938.5\r\n&lt;none&gt;                         3.4510e+11 6939.0\r\n+ BALCON          1 6.0791e+07 3.4503e+11 6940.9\r\n+ GARDIEN         1 5.2156e+07 3.4504e+11 6940.9\r\n- CAVE            1 7.5740e+09 3.5267e+11 6944.2\r\n- ARRONDISSEMENT 18 3.1534e+11 6.6044e+11 7118.5\r\n- SUPERFICIE      1 1.6925e+12 2.0376e+12 7526.5\r\n\r\nStep:  AIC=6938.34\r\nPRIX ~ PARKING + CAVE + SUPERFICIE + ARRONDISSEMENT\r\n\r\n                 Df  Sum of Sq        RSS    AIC\r\n- PARKING         1 1.4649e+09 3.4799e+11 6937.7\r\n&lt;none&gt;                         3.4653e+11 6938.3\r\n+ NBCHAMBRES      1 1.4338e+09 3.4510e+11 6939.0\r\n+ GARDIEN         1 2.0568e+08 3.4632e+11 6940.1\r\n+ BALCON          1 4.3968e+07 3.4648e+11 6940.3\r\n- CAVE            1 7.5455e+09 3.5407e+11 6943.5\r\n- ARRONDISSEMENT 18 3.2207e+11 6.6859e+11 7120.5\r\n- SUPERFICIE      1 2.9126e+12 3.2591e+12 7680.4\r\n\r\nStep:  AIC=6937.74\r\nPRIX ~ CAVE + SUPERFICIE + ARRONDISSEMENT\r\n\r\n                 Df  Sum of Sq        RSS    AIC\r\n&lt;none&gt;                         3.4799e+11 6937.7\r\n+ PARKING         1 1.4649e+09 3.4653e+11 6938.3\r\n+ NBCHAMBRES      1 1.2631e+09 3.4673e+11 6938.5\r\n+ BALCON          1 2.4356e+08 3.4775e+11 6939.5\r\n+ GARDIEN         1 1.2345e+08 3.4787e+11 6939.6\r\n- CAVE            1 7.2543e+09 3.5525e+11 6942.6\r\n- ARRONDISSEMENT 18 3.2847e+11 6.7646e+11 7122.4\r\n- SUPERFICIE      1 3.0115e+12 3.3595e+12 7688.5\r\n&gt;<\/pre>\n<p style=\"text-align: justify;\">The successive iterations finally lead to keeping the triplet CAVE, SUPERFICIE and ARRONDISSEMENT. 
It is interesting to note that these are not the same predictors as those selected by the \u201cfeature selection\u201d functionality of ODM.<\/p>\n<p><strong>Model analysis<\/strong><\/p>\n<p style=\"text-align: justify;\">Even though the two approaches end up with different predictors, the R\u00b2 remains very close and high (about 0.92, or 0.91 adjusted):<\/p>\n<pre>&gt; res &lt;- lm(PRIX ~ CAVE + SUPERFICIE + ARRONDISSEMENT, data=ImmoParis)\r\n&gt; summary(res)\r\n\r\nCall:\r\nlm(formula = PRIX ~ CAVE + SUPERFICIE + ARRONDISSEMENT, data = ImmoParis)\r\n\r\nResiduals:\r\n   Min     1Q Median     3Q    Max \r\n-92193 -20335   -798  18608 105249 \r\n\r\nCoefficients:\r\n                     Estimate Std. Error t value Pr(&gt;|t|)    \r\n(Intercept)          118225.5    33619.2   3.517 0.000502 ***\r\nCAVE1                 10586.1     4157.6   2.546 0.011372 *  \r\nSUPERFICIE             6706.7      129.3  51.878  &lt; 2e-16 ***\r\nARRONDISSEMENT75002  -44128.0    36135.2  -1.221 0.222938    \r\nARRONDISSEMENT75004    1787.2    38661.5   0.046 0.963158    \r\nARRONDISSEMENT75005   38996.6    40969.1   0.952 0.341910    \r\nARRONDISSEMENT75006  -57852.7    41011.5  -1.411 0.159348    \r\nARRONDISSEMENT75007   29061.0    35778.5   0.812 0.417271    \r\nARRONDISSEMENT75008   -4547.0    47315.1  -0.096 0.923503    \r\nARRONDISSEMENT75009  -50682.7    37425.9  -1.354 0.176651    \r\nARRONDISSEMENT75010  -64342.7    34137.5  -1.885 0.060387 .  \r\nARRONDISSEMENT75011  -52656.4    34035.0  -1.547 0.122850    \r\nARRONDISSEMENT75012  -66473.5    34441.3  -1.930 0.054510 .  \r\nARRONDISSEMENT75013  -55508.2    34155.5  -1.625 0.105142    \r\nARRONDISSEMENT75014  -49996.2    34010.4  -1.470 0.142565    \r\nARRONDISSEMENT75015  -58493.1    34198.9  -1.710 0.088193 .  \r\nARRONDISSEMENT75016  -60975.7    36163.2  -1.686 0.092775 .  
\r\nARRONDISSEMENT75017  -56077.3    34734.0  -1.614 0.107437    \r\nARRONDISSEMENT75018 -114471.5    33834.7  -3.383 0.000808 ***\r\nARRONDISSEMENT75019 -106194.4    34078.0  -3.116 0.002003 ** \r\nARRONDISSEMENT75020 -103040.5    34021.1  -3.029 0.002662 ** \r\n---\r\nSignif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\r\n\r\nResidual standard error: 33450 on 311 degrees of freedom\r\nMultiple R-squared:  0.9183,\tAdjusted R-squared:  0.9131 \r\nF-statistic: 174.8 on 20 and 311 DF,  p-value: &lt; 2.2e-16\r\n\r\n&gt;<\/pre>\n<p><strong>Checking the normality of the residuals<\/strong><\/p>\n<p style=\"text-align: justify;\">The normality of the residuals can be checked in several ways: inspecting the normal Q-Q plot (Henry line), looking at the shape of the histogram, or more rigorously with a Shapiro-Wilk test:<\/p>\n<pre>&gt; qqnorm(resid(res));qqline(resid(res));<\/pre>\n<p><a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/07\/qq_resid.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-339\" src=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/07\/qq_resid.png\" alt=\"qq_resid\" width=\"450\" height=\"377\" \/><\/a><\/p>\n<pre>&gt; hist(resid(res), freq = F, col =&quot;grey&quot;, main=&quot;&quot;, xlab=&quot;r\u00e9sidus&quot;, ylab=&quot;fr\u00e9quences&quot;, nclass=200)<\/pre>\n<p><a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/07\/hist_resid.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-338\" src=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/07\/hist_resid.png\" alt=\"hist_resid\" width=\"450\" height=\"377\" \/><\/a><\/p>\n<pre>&gt; shapiro.test(resid(res))\r\n\tShapiro-Wilk normality test\r\n\r\ndata:  resid(res)\r\nW = 0.9951, p-value = 0.3684\r\n\r\n&gt;<\/pre>\n<p style=\"text-align: justify;\">The p-value (0.3684) is above the usual 5% threshold, so we 
cannot reject the null hypothesis H0 of normality. The residuals are therefore compatible with a normal distribution.<\/p>\n<p style=\"text-align: justify;\">Similarly, the homoscedasticity of the residuals can be checked visually by making sure that the point cloud is scattered symmetrically around 0:<\/p>\n<pre>&gt; plot(res$fitted.values, res$residuals,\r\n + xlab=&quot;Valeurs pr\u00e9dites par le mod\u00e8le&quot;,\r\n + ylab=&quot;R\u00e9sidus&quot;, pch=16, cex=0.75, col=&quot;blue&quot;)\r\n&gt;<\/pre>\n<p><a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/07\/var_resid.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-340\" src=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/07\/var_resid.png\" alt=\"var_resid\" width=\"450\" height=\"377\" \/><\/a><\/p>\n<p>A <a href=\"https:\/\/en.wikipedia.org\/wiki\/Breusch%E2%80%93Pagan_test\" target=\"_blank\" rel=\"noopener\">Breusch-Pagan test<\/a> can also be used for this check:<\/p>\n<pre>&gt; library(lmtest)\r\n&gt; bptest(res)\r\n\r\n\tstudentized Breusch-Pagan test\r\n\r\ndata:  res\r\nBP = 28.3765, df = 20, p-value = 0.1008\r\n\r\n&gt;<\/pre>\n<p>Here, the null hypothesis H0 of homoscedasticity cannot be rejected at the 5% level.<\/p>\n<p><strong>Simulation<\/strong><\/p>\n<p style=\"text-align: justify;\">The predict function simulates the price of an apartment using the model built above. For example, we can estimate the price of a 23 m\u00b2 two-room apartment with a cellar in the 20th arrondissement:<\/p>\n<pre>&gt; predict(res,data.frame(SUPERFICIE=23,CAVE=&quot;1&quot;,ARRONDISSEMENT=&quot;75020&quot;))\r\n       1\r\n180025.4\r\n&gt;<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Following on from the previous post, this time I perform the 
multiple regression with R. The dataset used<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[12,13,14],"tags":[],"class_list":["post-334","post","type-post","status-publish","format-standard","hentry","category-r","category-regression","category-statistique"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/334","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=334"}],"version-history":[{"count":1,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/334\/revisions"}],"predecessor-version":[{"id":1159,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/334\/revisions\/1159"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=334"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=334"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=334"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}