{"id":143,"date":"2015-05-18T13:27:24","date_gmt":"2015-05-18T13:27:24","guid":{"rendered":"http:\/\/cms04397.apps-1and1.net\/?p=143"},"modified":"2015-05-18T13:27:24","modified_gmt":"2015-05-18T13:27:24","slug":"khi-2-dindependance-avec-oracle","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=143","title":{"rendered":"Khi-2 d&rsquo;ind\u00e9pendance avec Oracle"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">Je voyage quotidiennement entre Tours et Paris en utilisant le TGV. Il y a peu, je discutais avec un ancien coll\u00e8gue qui lui aussi \u00ab\u00a0navette\u00a0\u00bb chaque jour entre Orl\u00e9ans et Paris. De son cot\u00e9, il utilise le r\u00e9seau Intercit\u00e9.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">La question \u00e9tait de savoir s&rsquo;il y avait une diff\u00e9rence significative dans la ponctualit\u00e9 des trains entre sa ligne (Corail Orl\u00e9ans\/Paris) et la mienne (TGV Tours\/Paris)&#8230;<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">Suite \u00e0 cette discussion, j&rsquo;ai d\u00e9couvert que la SNCF mettait \u00e0 disposition un grand nombre de\u00a0donn\u00e9es relatives \u00e0 la ponctualit\u00e9 des trains sur leur portail open-data (<a href=\"https:\/\/data.sncf.com\/\" target=\"_blank\">https:\/\/data.sncf.com\/<\/a>).<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">V\u00e9rifions donc sur pi\u00e8ce ce qu\u2019il en est!<\/span><\/p>\n<p style=\"text-align: justify;\"><strong><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">1) T\u00e9l\u00e9chargement des fichiers de donn\u00e9es au format CSV<\/span><\/strong><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\"><a href=\"https:\/\/ressources.data.sncf.com\/explore\/dataset\/regularite-mensuelle-intercites\/download\/?format=csv&amp;timezone=Europe\/Berlin&amp;use_labels_for_header=true\" target=\"_blank\">https:\/\/ressources.data.sncf.com\/explore\/dataset\/regularite-mensuelle-intercites\/download\/?format=csv&amp;timezone=Europe\/Berlin&amp;use_labels_for_header=true<\/a><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\"><a href=\"https:\/\/ressources.data.sncf.com\/explore\/dataset\/regularite-mensuelle-tgv\/download\/?format=csv&amp;timezone=Europe\/Berlin&amp;use_labels_for_header=true\" target=\"_blank\">https:\/\/ressources.data.sncf.com\/explore\/dataset\/regularite-mensuelle-tgv\/download\/?format=csv&amp;timezone=Europe\/Berlin&amp;use_labels_for_header=true<\/a><\/span><\/p>\n<p style=\"text-align: justify;\">Les fichiers utilis\u00e9s dans cet article sont disponible ici:\u00a0<a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/05\/regularite-mensuelle-intercites.csv\">regularite-mensuelle-intercites<\/a>\u00a0&amp;\u00a0<a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2015\/05\/regularite-mensuelle-tgv.csv\">regularite-mensuelle-tgv<\/a><\/p>\n<p style=\"text-align: justify;\"><strong><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">2) Cr\u00e9ation des tables de destination dans une base Oracle 12c<\/span><\/strong><\/p>\n<pre>SQL&gt; CREATE TABLE ponctualite_tgv\n  2  (\n  3     dt                 VARCHAR2 (7),\n  4     axe                VARCHAR2 (30),\n  5     depart             VARCHAR2 (30),\n  6     arrivee            VARCHAR2 (30),\n  7     nb_trains_prog     NUMBER (5),\n  8     nb_trains_circ     NUMBER (5),\n  9     nb_trains_annul    NUMBER (5),\n 10     nb_trains_retard   NUMBER (5),\n 11     taux_regul         NUMBER (4)\n 12  );\n\nTable created.\n\nSQL&gt;\nSQL&gt;\nSQL&gt; CREATE TABLE ponctualite_corail\n  2  AS\n  3     SELECT * FROM ponctualite_tgv;\n\nTable created.\n\nSQL&gt; \n<\/pre>\n<p style=\"text-align: justify;\"><strong><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">3) Chargement des donn\u00e9es<\/span><\/strong><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">J&rsquo;utilise ici la fonctionnalit\u00e9 SQL*Loader Express.<\/span><\/p>\n<pre>C:\\RTI\\Stats\\SNCF&gt;sqlldr rafa\/rafa@localhost\/STATPDB data=regularite-mensuelle-intercites.csv table=ponctualite_corail terminated_by=';' field_names=all_ignore\n\nSQL*Loader: Release 12.1.0.2.0 - Production on Fri May 15 15:13:47 2015\n\nCopyright (c) 1982, 2014, Oracle and\/or its affiliates.  All rights reserved.\n\nExpress Mode Load, Table: PONCTUALITE_CORAIL\nPath used:      External Table, DEGREE_OF_PARALLELISM=AUTO\n\nTable PONCTUALITE_CORAIL:\n  750 Rows successfully loaded.\n\nCheck the log files:\n  ponctualite_corail.log\n  ponctualite_corail_%p.log_xt\nfor more information about the load.\n\nC:\\RTI\\Stats\\SNCF&gt;sqlldr rafa\/rafa@localhost\/STATPDB data=regularite-mensuelle-tgv.csv table=ponctualite_tgv terminated_by=';' field_names=all_ignore\n\nSQL*Loader: Release 12.1.0.2.0 - Production on Fri May 15 15:14:16 2015\n\nCopyright (c) 1982, 2014, Oracle and\/or its affiliates.  All rights reserved.\n\nExpress Mode Load, Table: PONCTUALITE_TGV\nPath used:      External Table, DEGREE_OF_PARALLELISM=AUTO\n\nTable PONCTUALITE_TGV:\n  4100 Rows successfully loaded.\n\nCheck the log files:\n  ponctualite_tgv.log\n  ponctualite_tgv_%p.log_xt\nfor more information about the load.\n\nC:\\RTI\\Stats\\SNCF&gt;\n<\/pre>\n<p style=\"text-align: justify;\"><strong><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">4) Analyse des donn\u00e9es<\/span><\/strong><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">On va utiliser un <a href=\"http:\/\/fr.wikipedia.org\/wiki\/Test_du_%CF%87%C2%B2\" target=\"_blank\">test du Khi-2<\/a> pour v\u00e9rifier s&rsquo;il y a ind\u00e9pendance entre la distribution des retards et le type de train (TGV\/Corail).<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">La fonction <a href=\"http:\/\/docs.oracle.com\/database\/121\/SQLRF\/functions184.htm#SQLRF06317\" target=\"_blank\">STATS_CROSSTAB<\/a> permet d&rsquo;effectuer un tel test. N\u00e9anmoins, son impl\u00e9mentation par Oracle ne permet pas de travailler \u00e0 partir d&rsquo;une table de contingence mais seulement \u00e0 partir d&rsquo;un \u00e9chantillon d&rsquo;observations unitaires.<\/span><br \/>\n<span style=\"font-family: tahoma, arial, helvetica, sans-serif;\"> Or, ici, nous disposons de donn\u00e9es agr\u00e9g\u00e9es mois par mois. Il va donc falloir les convertir en une s\u00e9rie d&rsquo;enregistrements \u2013 un par train ayant circul\u00e9 avec un indicateur (O\/N) informant de son \u00e9ventuel retard!<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">Pour les deux lignes (Corail Orl\u00e9ans\/Paris &amp; TGV Tours\/Paris), on d\u00e9termine d&rsquo;abord le nombre de trains ayant circul\u00e9s et ceux ayant \u00e9t\u00e9 retard\u00e9s (\u00e0 noter que les donn\u00e9es sont filtr\u00e9es pour correspondre aux trajets d&rsquo;int\u00e9r\u00eat au cours d\u2019une m\u00eame p\u00e9riode):<\/span><\/p>\n<pre>SQL&gt; SELECT SUM (nb_trains_circ) circ, SUM (nb_trains_retard) retard\n  2    FROM ponctualite_tgv\n  3   WHERE     TO_DATE (dt, 'YYYY-MM') &gt;= TO_DATE ('2014-01', 'YYYY-MM')\n  4         AND (depart = 'ST PIERRE DES CORPS' OR arrivee = 'ST PIERRE DES CORPS');\n\n      CIRC     RETARD\n---------- ----------\n     11190       1325\n\nSQL&gt; SELECT SUM (nb_trains_circ) circ, SUM (nb_trains_retard) retard\n  2    FROM ponctualite_corail\n  3   WHERE depart LIKE 'ORL%' OR arrivee LIKE 'ORL%';\n\n      CIRC     RETARD\n---------- ----------\n     10131       1735\n\nSQL&gt;\n<\/pre>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">On \u00ab\u00a0\u00e9clate\u00a0\u00bb ensuite ces chiffres sous la forme d&rsquo;enregistrements individuels \u00e0 l&rsquo;aide d&rsquo;un g\u00e9n\u00e9rateur de lignes (op\u00e9rateur hi\u00e9rarchique CONNECT BY appliqu\u00e9 \u00e0 DUAL).<\/span><br \/>\n<span style=\"font-family: tahoma, arial, helvetica, sans-serif;\"> Les enregistrements g\u00e9n\u00e9r\u00e9s vont ensuite \u00eatre pass\u00e9es \u00e0 la fonction STATS_CROSSTAB:<\/span><\/p>\n<pre>SQL&gt; SELECT STATS_CROSSTAB (type_train, retard, 'CHISQ_SIG') p_value,\n  2         STATS_CROSSTAB (type_train, retard, 'CHISQ_OBS') khi_2\n  3    FROM (    SELECT 'TGV' type_train,\n  4                     CASE WHEN LEVEL &lt;= 1325 THEN 'O' ELSE 'N' END retard\n  5                FROM DUAL\n  6          CONNECT BY LEVEL &lt;= 11190\n  7          UNION ALL\n  8              SELECT 'CORAIL' type_train,\n  9                     CASE WHEN LEVEL &lt;= 1735 THEN 'O' ELSE 'N' END retard\n 10                FROM DUAL\n 11          CONNECT BY LEVEL &lt;= 10131);\n\n   P_VALUE      KHI_2\n---------- ----------\n4.2138E-28 120.806054\n\nSQL&gt;\n<\/pre>\n<p style=\"text-align: justify;\"><span style=\"font-family: tahoma, arial, helvetica, sans-serif;\">La p-value obtenue est tr\u00e8s faible, on peut donc rejeter l&rsquo;hypoth\u00e8se d&rsquo;ind\u00e9pendance (H0) et consid\u00e9rer qu&rsquo;il y a bien un lien statistiquement significatif entre les retards \u00e0 l&rsquo;arriv\u00e9e et la ligne de train emprunt\u00e9e.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Je voyage quotidiennement entre Tours et Paris en utilisant le TGV. Il y a peu, je discutais avec un ancien<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[6,14,16],"tags":[],"class_list":["post-143","post","type-post","status-publish","format-standard","hentry","category-oracle","category-statistique","category-tests-dhypotheses"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/143","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=143"}],"version-history":[{"count":0,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/143\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=143"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=143"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=143"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}