{"id":528,"date":"2016-03-26T16:02:40","date_gmt":"2016-03-26T16:02:40","guid":{"rendered":"http:\/\/blog.tiran.info\/?p=528"},"modified":"2016-03-26T16:02:40","modified_gmt":"2016-03-26T16:02:40","slug":"analyse-de-correlation-avec-oracle","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=528","title":{"rendered":"Analyse de corr\u00e9lation avec Oracle"},"content":{"rendered":"<p style=\"text-align: justify;\">Avec l&rsquo;organisation ce mois-ci d&rsquo;un exercice d&rsquo;envergure de <a href=\"http:\/\/www.prefecturedepolice.interieur.gouv.fr\/Sequana\/EU-Sequana-2016\" target=\"_blank\">simulation d&rsquo;inondation \u00e0 Paris<\/a>, on a beaucoup parl\u00e9 du risque de <a href=\"http:\/\/www.franceinfo.fr\/emission\/transportez-moi\/2015-2016\/la-crue-centennale-paris-12-03-2016-14-12\" target=\"_blank\">crue centennale de la Seine<\/a>. Cela m&rsquo;a amen\u00e9 \u00e0 m&rsquo;int\u00e9resser aux outils de suivi de l&rsquo;\u00e9tat des cours d&rsquo;eau en France. J&rsquo;ai ainsi d\u00e9couvert deux sites tr\u00e8s bien faits: <a href=\"http:\/\/www.vigicrues.gouv.fr\/\" target=\"_blank\">vigicrues<\/a> et la <a href=\"http:\/\/www.hydro.eaufrance.fr\/selection.php\" target=\"_blank\">banque hydro<\/a>.<\/p>\n<p style=\"text-align: justify;\">Vivant en bordure de la Loire, je me suis demand\u00e9 comment les \u00e9volutions de d\u00e9bits se propageaient en diff\u00e9rents points du fleuve. Pour cela, \u00e0 partir de la banque Hydro, j&rsquo;ai extrait pour la p\u00e9riode [14\/01 \u2013 06\/03] l&rsquo;ensemble des relev\u00e9s de d\u00e9bit (p\u00e9riodicit\u00e9 de 1h) sur 5 points de son lit:<\/p>\n<table style=\"border: 1px solid #ffffff;\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td style=\"border: 1px solid #ffffff;\" width=\"250\">\n<ul style=\"text-align: justify;\">\n<li>\u00a0Nevers<\/li>\n<li>\u00a0Saint-Satur<\/li>\n<li>\u00a0Orleans<\/li>\n<li>\u00a0Blois<\/li>\n<li>\u00a0Tours<\/li>\n<\/ul>\n<\/td>\n<td style=\"border: 1px solid #ffffff;\"><iframe loading=\"lazy\" src=\"https:\/\/www.google.com\/maps\/d\/embed?mid=zlN6GHK9WQ30.kI4tivwjpK2k\" width=\"480\" height=\"360\"><\/iframe><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: justify;\">Les donn\u00e9es sont accessibles ici :\u00a0<a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2016\/03\/Debits_Loire.xlsx\" rel=\"\">Debits_Loire.xlsx<\/a><\/p>\n<p style=\"text-align: justify;\">Mon id\u00e9e \u00e9tait de voir comment l&rsquo;\u00e9volution du d\u00e9bit de la Loire \u00e0 Nevers se manifestait dans les villes en aval. En effet, de nombreux param\u00e8tres entrent en ligne de compte : effet des pr\u00e9cipitations locales, impact des affluents, \u00e9largissement\/r\u00e9tr\u00e9cissements du lit etc\u2026<\/p>\n<p style=\"text-align: justify;\">Pour ce faire, j\u2019ai eu recours \u00e0 des mesures de <a href=\"https:\/\/fr.wikipedia.org\/wiki\/Corr%C3%A9lation_(statistiques)\" target=\"_blank\">corr\u00e9lation<\/a> entre les divers \u00e9chantillons. A cet effet, Oracle propose diverses fonctions <a href=\"http:\/\/docs.oracle.com\/database\/121\/SQLRF\/functions042.htm#SQLRF00621\" target=\"_blank\">CORR<\/a>, <a href=\"http:\/\/docs.oracle.com\/database\/121\/SQLRF\/functions043.htm#SQLRF06314\" target=\"_blank\">CORR_S<\/a>, <a href=\"http:\/\/docs.oracle.com\/database\/121\/SQLRF\/functions043.htm#SQLRF06314\" target=\"_blank\">CORR_K<\/a>.<\/p>\n<p style=\"text-align: justify;\">La fonction CORR calcule le <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pearson_product-moment_correlation_coefficient\" target=\"_blank\">coefficient de corr\u00e9lation de Pearson<\/a>. Il est adapt\u00e9 lorsque la relation entre les \u00e9chantillons est de type lin\u00e9aire.<br \/>\nLes fonctions CORR_S et CORR_K calculent respectivement les coefficients de<a href=\"https:\/\/en.wikipedia.org\/wiki\/Spearman%27s_rank_correlation_coefficient\" target=\"_blank\"> corr\u00e9lation de rang de Spearman<\/a> et de <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kendall_rank_correlation_coefficient\" target=\"_blank\">Kendall<\/a>. Ces coefficients permettent de mettre en \u00e9vidence une corr\u00e9lation monotone (\u00e9volution simultan\u00e9e des \u00e9chantillons) mais pas forc\u00e9ment lin\u00e9aire. Leur calcul est bas\u00e9 sur le rang des observations plut\u00f4t que sur les valeurs.<br \/>\nL\u2019impl\u00e9mentation d\u2019Oracle permet en outre d\u2019obtenir une p-value pour ces deux fonctions (mais pas pour la fonction CORR).<\/p>\n<h2>Chargement des donn\u00e9es<\/h2>\n<p>Le chargement des donn\u00e9es a \u00e9t\u00e9 effectu\u00e9 \u00e0 l\u2019aide du module d\u2019import de SQLDeveloper. Cf billet\u00a0<a href=\"http:\/\/blog.tiran.info\/chargement-de-donnees-au-format-excel-dans-une-table-oracle\">http:\/\/blog.tiran.info\/chargement-de-donnees-au-format-excel-dans-une-table-oracle<\/a><\/p>\n<h2>Pr\u00e9paration des donn\u00e9es<\/h2>\n<p>Le format des dates ne correspondant pas \u00e0 un masque type propos\u00e9 par Oracle, ces derni\u00e8res ont \u00e9t\u00e9 charg\u00e9es dans un champ d\u00e9fini en VARCHAR2. La conversion est op\u00e9r\u00e9e dans un second temps \u00e0 l\u2019aide d\u2019une <a href=\"https:\/\/docs.oracle.com\/database\/121\/SQLRF\/statements_7002.htm#BABIJABG\" target=\"_blank\">colonne virtuelle<\/a>. Le champ original est ensuite masqu\u00e9 via la fonctionnalit\u00e9 12c de colonne <a href=\"https:\/\/docs.oracle.com\/database\/121\/ADMIN\/tables.htm#ADMIN13866\" target=\"_blank\">invisible<\/a> :<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; ALTER TABLE debits_loire\n  2     ADD DT_MES AS (TO_DATE (REPLACE (REPLACE (REPLACE (dt, &#039;JAN.&#039;, &#039;01&#039;), &#039;FEV.&#039;, &#039;02&#039;), &#039;MAR.&#039;, &#039;03&#039;), &#039;DD MM YYYY HH24:MI&#039;));\n\nTable altered.\n\nSQL&gt;\nSQL&gt; ALTER TABLE debits_loire\n  2     MODIFY dt INVISIBLE;\n\nTable altered.\n\nSQL&gt;\nSQL&gt; ALTER SESSION SET nls_date_format=&#039;DD\/MM\/YYYY HH24:MI&#039;;\n\nSession altered.\n\nSQL&gt;\nSQL&gt;      SELECT *\n  2         FROM debits_loire\n  3     ORDER BY dt_mes\n  4  FETCH FIRST 10 ROWS ONLY;\n\nNEVERS SAINT_SATUR ORLEANS BLOIS TOURS DT_MES\n------ ----------- ------- ----- ----- ----------------\n   437         655     526   525   484 14\/01\/2016 00:00\n   439         656     530   529   486 14\/01\/2016 01:00\n   440         658     534   532   487 14\/01\/2016 02:00\n   441         658     538   538   489 14\/01\/2016 03:00\n   442         658     542   542   491 14\/01\/2016 04:00\n   443         658     545   547   493 14\/01\/2016 05:00\n   442         658     549   551   496 14\/01\/2016 06:00\n   441         658     553   557   498 14\/01\/2016 07:00\n   440         657     556   563   501 14\/01\/2016 08:00\n   439         656     560   569   504 14\/01\/2016 09:00\n\n10 rows selected.\n\nSQL&gt;\n\n<\/pre>\n<h2>Matrices de corr\u00e9lation<\/h2>\n<p style=\"text-align: justify;\">Une matrice de corr\u00e9lation peut-\u00eatre constitu\u00e9e en calculant le coefficient pour chaque combinaison d\u2019\u00e9chantillon. Le bloc PLSQL ci-dessous r\u00e9alise les permutations et stocke les r\u00e9sultats (coef. de Pearson et de Spearman) dans une table res_corr :<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; CREATE TABLE res_corr\n  2  (\n  3     v1              VARCHAR2 (30),\n  4     v2              VARCHAR2 (30),\n  5     corr_pearson    NUMBER,\n  6     corr_spearman   NUMBER\n  7  );\n\nTable created.\n\nSQL&gt;\nSQL&gt; BEGIN\n  2     FOR rec\n  3        IN (SELECT a.column_name ville1, b.column_name ville2\n  4              FROM user_tab_cols a, user_tab_cols b\n  5             WHERE     a.table_name = &#039;DEBITS_LOIRE&#039;\n  6                   AND a.table_name = b.table_name\n  7                   AND a.column_name != &#039;DT_MES&#039;\n  8                   AND b.column_name != &#039;DT_MES&#039;\n  9                   AND a.hidden_column = &#039;NO&#039;\n 10                   AND b.hidden_column = &#039;NO&#039;)\n 11     LOOP\n 12        EXECUTE IMMEDIATE\n 13              &#039;INSERT INTO res_corr\n 14     SELECT :v1,\n 15            :v2,\n 16            CORR (&#039;\n 17           || rec.ville1\n 18           || &#039;, &#039;\n 19           || rec.ville2\n 20           || &#039;),\n 21            CORR_S (&#039;\n 22           || rec.ville1\n 23           || &#039;, &#039;\n 24           || rec.ville2\n 25           || &#039;)\n 26       FROM debits_loire&#039;\n 27           USING rec.ville1, rec.ville2;\n 28     END LOOP;\n 29\n 30     COMMIT;\n 31  END;\n 32  \/\n\nPL\/SQL procedure successfully completed.\n\nSQL&gt; \n<\/pre>\n<p>En <a href=\"https:\/\/docs.oracle.com\/database\/121\/DWHSG\/analysis.htm#DWHSG0209\" target=\"_blank\">pivotant <\/a>les donn\u00e9es de res_corr on peut alors obtenir la pr\u00e9sentation matricielle qui nous int\u00e9resse:<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; set numw 4\nSQL&gt;   SELECT *\n  2      FROM (SELECT v1 pearson, v2, corr_pearson FROM res_corr) PIVOT (MAX (corr_pearson)\n  3                                                          FOR v2\n  4                                                          IN (&#039;NEVERS&#039; AS NEVERS,\n  5                                                             &#039;SAINT_SATUR&#039; AS SAINT_SATUR,\n  6                                                             &#039;ORLEANS&#039; AS ORLEANS,\n  7                                                             &#039;BLOIS&#039; AS BLOIS,\n  8                                                             &#039;TOURS&#039; AS TOURS))\n  9  ORDER BY 2 DESC;\n\nPEARSON                        NEVERS SAINT_SATUR ORLEANS BLOIS TOURS\n------------------------------ ------ ----------- ------- ----- -----\nNEVERS                              1        .958     .84  .766  .721\nSAINT_SATUR                      .958           1    .943  .888  .847\nORLEANS                           .84        .943       1  .985  .963\nBLOIS                            .766        .888    .985     1  .989\nTOURS                            .721        .847    .963  .989     1\n\nSQL&gt; \nSQL&gt; list 2\n  2*     FROM (SELECT v1 pearson, v2, corr_pearson FROM res_corr) PIVOT (MAX (corr_pearson)\nSQL&gt; c\/pearson\/spearman\/\n  2*     FROM (SELECT v1 spearman, v2, corr_pearson FROM res_corr) PIVOT (MAX (corr_pearson)\nSQL&gt; c\/pearson\/spearman\/\n  2*     FROM (SELECT v1 spearman, v2, corr_spearman FROM res_corr) PIVOT (MAX (corr_pearson)\nSQL&gt; c\/pearson\/spearman\/\n  2*     FROM (SELECT v1 spearman, v2, corr_spearman FROM res_corr) PIVOT (MAX (corr_spearman)\nSQL&gt; \/\n\nSPEARMAN                       NEVERS SAINT_SATUR ORLEANS BLOIS TOURS\n------------------------------ ------ ----------- ------- ----- -----\nNEVERS                              1        .917    .804  .729  .689\nSAINT_SATUR                      .917           1    .951  .902  .868\nORLEANS                          .804        .951       1  .985  .944\nBLOIS                            .729        .902    .985     1  .975\nTOURS                            .689        .868    .944  .975     1\n\nSQL&gt;\n\n<\/pre>\n<p style=\"text-align: justify;\">On constate que plus les points sont \u00e9loign\u00e9s, plus le coefficient de corr\u00e9lation diminue. C\u2019est somme toute assez logique.<br \/>\nLes coefficients restent n\u00e9anmoins relativement \u00e9lev\u00e9s (&gt;0.72 pour Pearson et &gt;0.68 pour Spearman). Le fait que le coefficient de corr\u00e9lation de Pearson soit \u00e9lev\u00e9 et syst\u00e9matiquement sup\u00e9rieur \u00e0 celui de Spearman est un bon indicateur de l\u2019existence d\u2019une liaison lin\u00e9aire entre les d\u00e9bits aux divers points de mesure.<\/p>\n<h2>Etude de d\u00e9calage<\/h2>\n<p style=\"text-align: justify;\">On peut raisonnablement penser qu\u2019une \u00e9volution (augmentation ou diminution) de d\u00e9bit en amont met un certain temps \u00e0 se propager en aval. Ainsi, une forte pluie \u00e0 Nevers y fera augmenter le niveau de la Loire, mais cette crue ne sera visible \u00e0 Tours qu\u2019au bout d\u2019une certaine dur\u00e9e.<\/p>\n<p style=\"text-align: justify;\">Afin de d\u00e9terminer cette dur\u00e9e, on peut essayer de \u00ab d\u00e9caler temporellement \u00bb les \u00e9chantillons afin de trouver pour quel d\u00e9calage leur corr\u00e9lation est la plus forte.<\/p>\n<p style=\"text-align: justify;\">Dans le principe, on mesure le coefficient de corr\u00e9lation de l\u2019\u00e9chantillon de Nevers avec celui de Tours dont on a supprim\u00e9 la premi\u00e8re ligne (ce qui correspond \u00e0 un d\u00e9calage de 1 heure), puis on r\u00e9it\u00e8re le processus en supprimant une seconde ligne et ainsi de suite. A chaque \u00e9tape on compare le coefficient de corr\u00e9lation avec celui obtenu \u00e0 la phase pr\u00e9c\u00e9dente. Le d\u00e9calage pour lequel le coefficient de corr\u00e9lation est maximal correspond \u00e0 la dur\u00e9e de propagation des \u00e9volutions de d\u00e9bit entre les deux villes.<\/p>\n<p style=\"text-align: justify;\">Le bloc PL\/SQL suivant permet de r\u00e9aliser cette analyse. Le d\u00e9calage est op\u00e9r\u00e9 gr\u00e2ce \u00e0 la <a href=\"http:\/\/docs.oracle.com\/database\/121\/SQLRF\/functions098.htm#SQLRF00656\" target=\"_blank\">fonction analytique LEAD<\/a> qui accepte un param\u00e8tre d\u2019offset que l\u2019on fait progressivement augmenter :<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; set serveroutput on;\nSQL&gt;\nSQL&gt; DECLARE\n  2     l_max_corr_p           NUMBER := 0;\n  3     l_corr_p               NUMBER;\n  4     l_max_decal_p          NUMBER;\n  5     --\n  6     l_max_corr_s           NUMBER := 0;\n  7     l_corr_s               NUMBER;\n  8     l_max_decal_s          NUMBER;\n  9     --\n 10     l_max_decal   CONSTANT NUMBER := 120;\n 11  BEGIN\n 12     FOR rec\n 13        IN (SELECT column_name ville\n 14              FROM user_tab_cols\n 15             WHERE     table_name = &#039;DEBITS_LOIRE&#039;\n 16                   AND column_name != &#039;DT_MES&#039;\n 17                   AND hidden_column = &#039;NO&#039;\n 18                   AND column_name != &#039;NEVERS&#039;)\n 19     LOOP\n 20        l_max_corr_p := 0;\n 21        l_max_corr_s := 0;\n 22\n 23        FOR i IN 1 .. l_max_decal\n 24        LOOP\n 25           EXECUTE IMMEDIATE\n 26                 &#039;SELECT CORR ( NEVERS, &#039;\n 27              || rec.ville\n 28              || &#039;), CORR_S ( NEVERS, &#039;\n 29              || rec.ville\n 30              || &#039;)\n 31          FROM (  SELECT NEVERS, LEAD (&#039;\n 32              || rec.ville\n 33              || &#039;, :i) OVER (ORDER BY dt_mes) &#039;\n 34              || rec.ville\n 35              || &#039;  FROM debits_loire\n 36                ORDER BY dt_mes)&#039;\n 37              INTO l_corr_p, l_corr_s\n 38              USING i;\n 39\n 40           IF (l_max_corr_p &lt; l_corr_p)\n 41           THEN\n 42              l_max_decal_p := i;\n 43              l_max_corr_p := l_corr_p;\n 44           END IF;\n 45\n 46           IF (l_max_corr_s &lt; l_corr_s)\n 47           THEN\n 48              l_max_decal_s := i;\n 49              l_max_corr_s := l_corr_s;\n 50           END IF;\n 51        END LOOP;\n 52\n 53        DBMS_OUTPUT.put_line (CHR (10) || CHR (13) || rec.ville);\n 54        DBMS_OUTPUT.put_line (\n 55              &#039;.    Pearson - D\u00e9calage optimal &#039;\n 56           || l_max_decal_p\n 57           || &#039; heures (Coef: &#039;\n 58           || ROUND (l_max_corr_p, 3)\n 59           || &#039;)&#039;);\n 60        DBMS_OUTPUT.put_line (\n 61              &#039;.    Spearman - D\u00e9calage optimal &#039;\n 62           || l_max_decal_s\n 63           || &#039; heures (Coef: &#039;\n 64           || ROUND (l_max_corr_s, 3)\n 65           || &#039;)&#039;);\n 66     END LOOP;\n 67  END;\n 68  \/\n\nSAINT_SATUR\n.    Pearson - D\u00e9calage optimal 14 heures (Coef: .986)\n.    Spearman - D\u00e9calage optimal 16 heures (Coef: .954)\n\nORLEANS\n.    Pearson - D\u00e9calage optimal 36 heures (Coef: .982)\n.    Spearman - D\u00e9calage optimal 37 heures (Coef: .94)\n\nBLOIS\n.    Pearson - D\u00e9calage optimal 47 heures (Coef: .973)\n.    Spearman - D\u00e9calage optimal 47 heures (Coef: .924)\n\nTOURS\n.    Pearson - D\u00e9calage optimal 54 heures (Coef: .964)\n.    Spearman - D\u00e9calage optimal 52 heures (Coef: .878)\n\nPL\/SQL procedure successfully completed.\n\nSQL&gt;\n\n<\/pre>\n<p style=\"text-align: justify;\">On constate &#8211; tr\u00e8s logiquement &#8211; que le d\u00e9calage auquel le coefficient de corr\u00e9lation est maximal augmente avec l\u2019\u00e9loignement. Ce d\u00e9calage avoisine 2 jours et 6 heures entre Nevers et Tours.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Avec l&rsquo;organisation ce mois-ci d&rsquo;un exercice d&rsquo;envergure de simulation d&rsquo;inondation \u00e0 Paris, on a beaucoup parl\u00e9 du risque de crue<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[6,14],"tags":[],"class_list":["post-528","post","type-post","status-publish","format-standard","hentry","category-oracle","category-statistique"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/528","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=528"}],"version-history":[{"count":0,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/528\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=528"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=528"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=528"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}