{"id":692,"date":"2016-09-17T07:30:46","date_gmt":"2016-09-17T07:30:46","guid":{"rendered":"http:\/\/blog.tiran.info\/?p=692"},"modified":"2016-09-17T07:30:46","modified_gmt":"2016-09-17T07:30:46","slug":"regression-logistique-avec-oracle","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=692","title":{"rendered":"R\u00e9gression logistique avec Oracle"},"content":{"rendered":"<p style=\"text-align: justify;\">Le JO de Rio ont \u00e9t\u00e9 l&rsquo;occasion de mettre \u00e0 l&rsquo;affiche un certain nombre de sports dont la couverture m\u00e9diatique est faible le reste du temps. C&rsquo;est le cas du <a href=\"https:\/\/fr.wikipedia.org\/wiki\/Cano%C3%AB-kayak\" target=\"_blank\">Cano\u00eb-Kayak<\/a>, discipline qui a permis \u00e0 la France de remporter 3 m\u00e9dailles.<\/p>\n<p style=\"text-align: justify;\">On a pu voir <a href=\"https:\/\/www.youtube.com\/watch?v=_xGj94Sg05M\" target=\"_blank\">deux types d&rsquo;\u00e9preuves<\/a>:<\/p>\n<ul style=\"text-align: justify;\">\n<li>le\u00a0slalom qui consiste dans un parcours en eau vive ou il faut passer entre des portes<\/li>\n<li>la course en ligne qui consiste dans un sprint sur un plan d&rsquo;eau calme<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">De mon exp\u00e9rience, le profil physique des sportifs de ces deux famille est sensiblement diff\u00e9rent. Les champions de course en ligne sont souvent plus massifs ; le sprint n\u00e9cessitant une grande force physique. En eau vive, tout en \u00e9tant athl\u00e9tiques, les sportifs sont moins musculeux, la discipline privil\u00e9giant la \u00ab glisse \u00bb.<\/p>\n<p style=\"text-align: justify;\">Pour objectiver mes observations, j\u2019ai collect\u00e9 les informations de tous les participants \u00e0 des \u00e9preuves individuelles de Cano\u00eb-Kayak aux JO depuis 2000 (Jeux Olympiques de Sydney, Ath\u00e8nes, P\u00e9kin, Londres) via le site <a href=\"http:\/\/www.sports-reference.com\/olympics\/summer\/\" target=\"_blank\">www.sports-reference.com<\/a>.<\/p>\n<p style=\"text-align: justify;\">Pour les sportifs ayant particip\u00e9 \u00e0 plusieurs olympiades, j\u2019ai conserv\u00e9 l\u2019\u00e2ge moyen et le dernier pays qu&rsquo;ils ont repr\u00e9sent\u00e9s.<\/p>\n<p style=\"text-align: justify;\">Ces donn\u00e9es ont servi \u00e0 b\u00e2tir un mod\u00e8le de <a href=\"https:\/\/fr.wikipedia.org\/wiki\/R%C3%A9gression_logistique\" target=\"_blank\">r\u00e9gression logistique binaire<\/a>.<\/p>\n<p style=\"text-align: justify;\">Dans un second temps, les caract\u00e9ristiques des comp\u00e9titeurs individuels des JO de Rio ont \u00e9t\u00e9 r\u00e9cup\u00e9r\u00e9es depuis le site <a href=\"https:\/\/www.rio2016.com\/en\/olympics\" target=\"_blank\">www.rio2016.com<\/a>. Ces derni\u00e8res ont permis le scoring du mod\u00e8le.<\/p>\n<p style=\"text-align: justify;\">Pour chaque dataset\u00a0(JO Rio et JO pr\u00e9-Rio), les comp\u00e9titeurs pour lesquels des informations \u00e9taient manquantes ont \u00e9t\u00e9 supprim\u00e9es.<\/p>\n<p style=\"text-align: justify;\">La variable cible est le type d&rsquo;\u00e9preuve : \u00ab slalom \u00bb ou \u00ab sprint \u00bb &#8211; donc de type binaire. Les pr\u00e9dicteurs sont la taille, le poids, le sexe, l\u2019\u00e2ge et le pays du sportif.<\/p>\n<h3>Chargement des donn\u00e9es<\/h3>\n<p style=\"text-align: justify;\">Les donn\u00e9es sont accessibles ici au format Table Externe DataPump:\u00a0<a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2016\/09\/JO_CK_PRERIO.zip\">JO_CK_PRERIO<\/a>\u00a0et\u00a0<a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2016\/09\/JO_CK_RIO.zip\">JO_CK_RIO<\/a>.<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt;\nSQL&gt; CREATE TABLE jo_ck_prerio\n  2  (\n  3     nom            VARCHAR2 (50),\n  4     type_epreuve   VARCHAR2 (6),\n  5     age            NUMBER,\n  6     sexe           VARCHAR2 (1),\n  7     taille         NUMBER,\n  8     poids          NUMBER,\n  9     pays           VARCHAR2 (80)\n 10  )\n 11  ORGANIZATION EXTERNAL\n 12     (TYPE oracle_datapump\n 13           DEFAULT DIRECTORY datadir\n 14           LOCATION (&#039;JO_CK_PRERIO.dp&#039;));\n\nTable created.\n\nSQL&gt;\nSQL&gt; CREATE TABLE jo_ck_rio\n  2  (\n  3     nom            VARCHAR2 (50),\n  4     type_epreuve   VARCHAR2 (6),\n  5     age            NUMBER,\n  6     sexe           VARCHAR2 (1),\n  7     taille         NUMBER,\n  8     poids          NUMBER,\n  9     pays           VARCHAR2 (80)\n 10  )\n 11  ORGANIZATION EXTERNAL\n 12     (TYPE oracle_datapump\n 13           DEFAULT DIRECTORY datadir\n 14           LOCATION (&#039;JO_CK_RIO.dp&#039;));\n\nTable created.\n\nSQL&gt;\nSQL&gt; set pages 30\nSQL&gt; column nom format a20\nSQL&gt; column pays format a15\nSQL&gt;\nSQL&gt;      SELECT *\n  2         FROM jo_ck_prerio\n  3     ORDER BY DBMS_RANDOM.VALUE\n  4  FETCH FIRST 10 ROWS ONLY;\n\nNOM                  TYPE_E        AGE S     TAILLE      POIDS PAYS\n-------------------- ------ ---------- - ---------- ---------- ---------------\nAnne Rikala          Sprint         33 F        170         67 Finland\nMichal Gajownik      Sprint         18 M        169         70 Poland\nEmir Mujcinovic      Slalom         30 M        187         76 Croatia\n\u00f0\u2017\u00d3n Th\u2510 C\u00dfch        Sprint         20 F        161         54 Vietnam\nThomas Schmidt       Slalom         26 M        172         69 Germany\nFloris Braat         Slalom         24 M        180         74 Netherlands\nMoe Kaifuchi         Slalom         26 F        159         55 Japan\nTony Lespoir         Sprint         29 M        182         85 Seychelles\nCaroline Queen       Slalom         20 F        157         66 United States\nKrzysztof Bieryt     Slalom         30 M        178         80 Poland\n\n10 rows selected.\n\nSQL&gt;\nSQL&gt;\nSQL&gt;      SELECT *\n  2         FROM jo_ck_rio\n  3     ORDER BY DBMS_RANDOM.VALUE\n  4  FETCH FIRST 10 ROWS ONLY;\n\nNOM                  TYPE_E        AGE S     TAILLE      POIDS PAYS\n-------------------- ------ ---------- - ---------- ---------- ---------------\nMiroslav Kirchev     Sprint         26 M        186         78 Bulgaria\nRichard Merjan       Slalom         27 M        160         57 Lebanon\nOlivera Moldovan     Sprint         27 F        158         62 Serbia\nYvonne Schuring      Sprint         38 F        175         68 Austria\nAngel Kodinov        Sprint         18 M        188         88 Bulgaria\nPavlo Altukhov       Sprint         20 M        185         90 Ukraine\nStefanie Horn        Slalom         25 F        168         59 Italy\nJoaquim Lobo         Sprint         21 M        172         66 Mozambique\nGiovanni de Gennaro  Slalom         24 M        185         80 Italy\nVincent Farkas       Sprint         23 M        193         90 Slovakia\n\n10 rows selected.\n\nSQL&gt;\n<\/pre>\n<h3>Param\u00e9trage du mod\u00e8le<\/h3>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; CREATE TABLE ck_settings\n  2  (\n  3     setting_name    VARCHAR2 (30),\n  4     setting_value   VARCHAR2 (30)\n  5  );\n\nTable created.\n\nSQL&gt;\nSQL&gt; BEGIN\n  2     INSERT INTO ck_settings (setting_name, setting_value)\n  3             VALUES (\n  4                       DBMS_DATA_MINING.algo_name,\n  5                       DBMS_DATA_MINING.algo_generalized_linear_model);\n  6\n  7     INSERT INTO ck_settings (setting_name, setting_value)\n  8          VALUES (DBMS_DATA_MINING.prep_auto, DBMS_DATA_MINING.prep_auto_on);\n  9\n 10     INSERT INTO ck_settings (setting_name, setting_value)\n 11             VALUES (\n 12                       DBMS_DATA_MINING.glms_ftr_selection,\n 13                       DBMS_DATA_MINING.glms_ftr_selection_enable);\n 14\n 15     INSERT INTO ck_settings (setting_name, setting_value)\n 16             VALUES (\n 17                       DBMS_DATA_MINING.glms_ftr_identification,\n 18                       DBMS_DATA_MINING.glms_ftr_ident_complete);\n 19\n 20     INSERT INTO ck_settings (setting_name, setting_value)\n 21             VALUES (\n 22                       DBMS_DATA_MINING.glms_select_block,\n 23                       DBMS_DATA_MINING.glms_select_block_enable);\n 24     INSERT INTO ck_settings (setting_name, setting_value)\n 25             VALUES (\n 26                       DBMS_DATA_MINING.glms_prune_model,\n 27                       DBMS_DATA_MINING.glms_prune_model_enable);\n 28\n 29     COMMIT;\n 30  END;\n 31  \/\n\nPL\/SQL procedure successfully completed.\n\nSQL&gt;\n<\/pre>\n<p>Le param\u00e9trage indique :<\/p>\n<ul>\n<li>le type d\u2019algorithme \u00e0 utiliser DBMS_DATA_MINING.algo_generalized_linear_model. \u00a0Une r\u00e9gression logistique sera r\u00e9alis\u00e9e dans la mesure ou la variable cible est cat\u00e9gorielle.<\/li>\n<li>la d\u00e9sactivation de la fonctionnalit\u00e9 de pr\u00e9paration automatique dans la mesure o\u00f9 ces derni\u00e8res ont \u00e9t\u00e9 pr\u00e9alablement nettoy\u00e9es<\/li>\n<li>le traitement en bloc des variables cat\u00e9gorielles (DBMS_DATA_MINING.glms_select_block_enable) \u2013 c\u2019est-\u00e0-dire qu\u2019on n\u2019ajoute pas s\u00e9lectivement certaines modalit\u00e9s de la variable au mod\u00e8le (soit on ajoute toutes les modalit\u00e9s, soit on retire la variable)<\/li>\n<li>la d\u00e9termination automatique des pr\u00e9dicteurs avec suppression de ceux non significatifs (DBMS_DATA_MINING.glms_ftr_selection_enable, DBMS_DATA_MINING.glms_ftr_ident_complete, DBMS_DATA_MINING.glms_prune_model_enable). Il s\u2019agit de supprimer it\u00e9rativement les \u00e9l\u00e9ments dont les effets apparaissent le moins significatif en se basant sur un test du rapport de vraisemblance.<\/li>\n<\/ul>\n<h3>Cr\u00e9ation du mod\u00e8le<\/h3>\n<p>On pr\u00e9cise qu\u2019il s\u2019agit d\u2019un mod\u00e8le de classification dont la cible est la variable TYPE_EPREUVE. Le dataset d\u2019apprentissage se trouve dans JO_CK_PRERIO et la variable NOM est ignor\u00e9e:<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; DECLARE\n  2     l_xform   DBMS_DATA_MINING_TRANSFORM.transform_list;\n  3  BEGIN\n  4     DBMS_DATA_MINING_TRANSFORM.set_transform (l_xform,\n  5                                               &#039;NOM&#039;,\n  6                                               NULL,\n  7                                               NULL,\n  8                                               NULL);\n  9\n 10     DBMS_DATA_MINING.create_model (\n 11        model_name            =&gt; &#039;LOGREG_CK&#039;,\n 12        mining_function       =&gt; DBMS_DATA_MINING.classification,\n 13        data_table_name       =&gt; &#039;JO_CK_PRERIO&#039;,\n 14        case_id_column_name   =&gt; NULL,\n 15        target_column_name    =&gt; &#039;TYPE_EPREUVE&#039;,\n 16        settings_table_name   =&gt; &#039;CK_SETTINGS&#039;,\n 17        xform_list            =&gt; l_xform);\n 18  END;\n 19  \/\n\nPL\/SQL procedure successfully completed.\n\nSQL&gt;\n<\/pre>\n<h3>Scoring du mod\u00e8le<\/h3>\n<p>Le mod\u00e8le est appliqu\u00e9e aux donn\u00e9es des JO de Rio afin de v\u00e9rifier sa qualit\u00e9 pr\u00e9dictive.<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt;   SELECT *\n  2      FROM (SELECT type_epreuve AS epreuve_reelle,\n  3                   PREDICTION (logreg_ck USING *) AS epreuve_predict\n  4              FROM jo_ck_rio)\n  5           PIVOT\n  6              (COUNT (*)\n  7              FOR epreuve_predict\n  8              IN (&#039;Slalom&#039; slalom_pred, &#039;Sprint&#039; sprint_pred))\n  9  ORDER BY 1;\n\nEPREUV SLALOM_PRED SPRINT_PRED\n------ ----------- -----------\nSlalom          40          21\nSprint          13         110\n\nSQL&gt;\n<\/pre>\n<p>La matrice de confusion ci-dessus permet d\u2019\u00e9valuer le taux de r\u00e9ussite \u00e0 <strong>81.5%<\/strong> (40+110)\/(40+21+13+110).<\/p>\n<h3>Analyse du mod\u00e8le<\/h3>\n<p>La p-valeur est tr\u00e8s faible, le mod\u00e8le est donc significatif.<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; SELECT *\n  2    FROM TABLE (DBMS_DATA_MINING.get_model_details_global (&#039;LOGREG_CK&#039;))\n  3   WHERE global_detail_name LIKE &#039;%P_VALUE&#039;;\n\nGLOBAL_DETAIL_NAME             GLOBAL_DETAIL_VALUE\n------------------------------ -------------------\nLR_CHI_SQ_P_VALUE                       6.8959E-37\n\nSQL&gt;\n<\/pre>\n<p>D&rsquo;autre part, on peut constater\u00a0que seules les variables POIDS, SEXE et TAILLE ont \u00e9t\u00e9 maintenues dans le mod\u00e8le final.<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; column class format a15\nSQL&gt; column attribute_name format a20\nSQL&gt; column attribute_value format a20\nSQL&gt;\nSQL&gt;   SELECT class,\n  2           attribute_name,\n  3           attribute_value,\n  4           coefficient,\n  5           p_value\n  6      FROM TABLE (DBMS_DATA_MINING.get_model_details_glm (&#039;LOGREG_CK&#039;))\n  7  ORDER BY attribute_name, attribute_value;\n\nCLASS           ATTRIBUTE_NAME       ATTRIBUTE_VALUE      COEFFICIENT    P_VALUE\n--------------- -------------------- -------------------- ----------- ----------\nSlalom          POIDS                                      -.29005196 7.3193E-20\nSlalom          SEXE                 F                     -3.1906121 5.0268E-13\nSlalom          TAILLE                                     .101887325 .000237583\nSlalom                                                     3.79165831 .308946949\n\nSQL&gt;\n<\/pre>\n<p>La cat\u00e9gorie de r\u00e9f\u00e9rence \u00e9tant Slalom, le mod\u00e8le logit se traduit par l\u2019\u00e9quation suivante :<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/www.sciweavers.org\/tex2img.php?eq=P%20%5Cbig%28slalom%20%7C%20X%5Cbig%29%20%3D%20%5Cfrac%7B1%7D%7B1%20%2B%20e%5E%7B%20%20%28-%203.791%20%2B%200.290%20%5Ctimes%20POIDS%20%20%20-%200.101%20%5Ctimes%20TAILLE%20%2B%203.190%20%5Ctimes%20SEXE.F%29%20%7D%7D%20%20&amp;bc=White&amp;fc=Black&amp;im=jpg&amp;fs=12&amp;ff=arev&amp;edit=0\" alt=\"P \\big(slalom | X\\big) = \\frac{1}{1 + e^{ (- 3.791 + 0.290 \\times POIDS - 0.101 \\times TAILLE + 3.190 \\times SEXE.F) }} \" width=\"529\" height=\"43\" align=\"center\" border=\"0\" \/><\/p>\n<p style=\"text-align: justify;\">On peut v\u00e9rifier que la fonction <a href=\"https:\/\/docs.oracle.com\/database\/121\/SQLRF\/functions150.htm#SQLRF06212\" target=\"_blank\">PREDICTION_PROBABILITY<\/a> utilise bien cette formule\u00a0de calcul en comparant sa sortie avec une version cod\u00e9e manuellement:<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; set numformat 9.999999\nSQL&gt; WITH FUNCTION calc_prob (p_poids NUMBER, p_taille NUMBER, p_sexe VARCHAR2)\n  2          RETURN NUMBER\n  3       IS\n  4          l_prob   NUMBER;\n  5       BEGIN\n  6          l_prob :=\n  7               1\n  8             \/ (  1\n  9                + EXP (\n 10                       -3.79165831302755\n 11                     + 0.290051964566268 * p_poids\n 12                     +   3.19061208475395\n 13                       * (CASE WHEN p_sexe = &#039;F&#039; THEN 1 ELSE 0 END)\n 14                     - 0.101887324838025 * p_taille));\n 15          IF l_prob &lt; 0.5\n 16          THEN\n 17             RETURN 1 - l_prob;\n 18          ELSE\n 19             RETURN l_prob;\n 20          END IF;\n 21       END;\n 22  SELECT calc_prob (poids, taille, sexe),\n 23         PREDICTION_PROBABILITY (logreg_ck USING *)\n 24    FROM jo_ck_rio\n 25   WHERE ROWNUM &lt; 6\n 26  \/\n\nCALC_PROB(POIDS,TAILLE,SEXE) PREDICTION_PROBABILITY(LOGREG_CKUSING*)\n---------------------------- ---------------------------------------\n                     .880047                                 .880047\n                     .880047                                 .880047\n                     .824317                                 .824317\n                     .875819                                 .875819\n                     .721147                                 .721147\n\nSQL&gt;\n<\/pre>\n<p>A noter l&rsquo;utilisation d&rsquo;<a href=\"http:\/\/docs.oracle.com\/database\/121\/SQLRF\/statements_10002.htm#BABJFIDC\" target=\"_blank\">un bloc PL\/SQL dans la clause WITH<\/a>\u00a0(fonctionnalit\u00e9 12c).<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Le JO de Rio ont \u00e9t\u00e9 l&rsquo;occasion de mettre \u00e0 l&rsquo;affiche un certain nombre de sports dont la couverture m\u00e9diatique<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[3,5,7,13],"tags":[],"class_list":["post-692","post","type-post","status-publish","format-standard","hentry","category-classification","category-odm","category-oracle-advanced-analytics","category-regression"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/692","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=692"}],"version-history":[{"count":0,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/692\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=692"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}