{"id":670,"date":"2016-08-03T07:11:08","date_gmt":"2016-08-03T06:11:08","guid":{"rendered":"http:\/\/blog.tiran.info\/?p=670"},"modified":"2017-12-22T14:44:05","modified_gmt":"2017-12-22T13:44:05","slug":"visualisation-en-nuage-de-mots-avec-oracle","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=670","title":{"rendered":"Visualisation en nuage de mots avec Oracle"},"content":{"rendered":"<p style=\"text-align: justify;\">La repr\u00e9sentation en <a href=\"https:\/\/fr.wikipedia.org\/wiki\/Nuage_de_mots-cl%C3%A9s\" target=\"_blank\" rel=\"noopener\">nuage de mots-cl\u00e9s<\/a>\u00a0est tr\u00e8s r\u00e9pandue pour donner une impression de l&rsquo;importance relative &#8211; en terme de fr\u00e9quence d&rsquo;apparition &#8211; des termes au sein de donn\u00e9es textuelles.\u00a0Ici, j&rsquo;exploite les informations du mod\u00e8le de classification Bay\u00e9sienne construit dans le billet \u00ab\u00a0<a href=\"http:\/\/blog.tiran.info\/classification-bayesienne-naive-avec-oracle\" target=\"_blank\" rel=\"noopener\">Classification Bay\u00e9sienne Na\u00efve avec Oracle<\/a>\u00a0\u00bb\u00a0pour extraire les mots les plus utilis\u00e9s par cat\u00e9gorie.<\/p>\n<p style=\"text-align: justify;\">Le d\u00e9tail du mod\u00e8le est accessible via la fonction <a href=\"http:\/\/docs.oracle.com\/database\/121\/ARPLS\/d_datmin.htm#ARPLS65870\" target=\"_blank\" rel=\"noopener\">DBMS_DATA_MINING.GET_MODEL_DETAILS_NB<\/a>. Il s&rsquo;agit d&rsquo;une fonction <a href=\"https:\/\/docs.oracle.com\/database\/121\/ADDCI\/pipe_paral_tbl.htm#ADDCI4692\" target=\"_blank\" rel=\"noopener\">PIPELINE<\/a> dont le champ (nested table) CONDITIONALS permet d&rsquo;obtenir les <a href=\"https:\/\/fr.wikipedia.org\/wiki\/Probabilit%C3%A9_conditionnelle\" target=\"_blank\" rel=\"noopener\">probabilit\u00e9s conditionnelles<\/a> des diff\u00e9rents termes. Cette information peut ensuite \u00eatre utilis\u00e9e comme indicateur de popularit\u00e9 des mots au sein d&rsquo;une cat\u00e9gorie:<\/p>\n<p style=\"text-align: justify;\">C&rsquo;est le principe des la vue suivante qui extrait les 40 mots les plus \u00ab\u00a0populaires\u00a0\u00bb de la cat\u00e9gorie Polars:<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; CREATE OR REPLACE VIEW mots_freq_polar\r\n  2  AS\r\n  3          SELECT attribute_subname mot,\r\n  4                 ROUND (100 * conditional_probability, 1) freq\r\n  5            FROM TABLE (DBMS_DATA_MINING.GET_MODEL_DETAILS_NB (&#039;BOOKS_NB&#039;)) a,\r\n  6                 TABLE (a.conditionals) b\r\n  7           WHERE     INSTR (attribute_str_value, &#039;[&#039;) = 0\r\n  8                 AND target_attribute_str_value = &#039;polar&#039;\r\n  9        ORDER BY conditional_probability DESC\r\n 10     FETCH FIRST 40 ROWS ONLY;\r\n\r\nView created.\r\n\r\nSQL&gt; \r\nSQL&gt; column mot format a30\r\nSQL&gt; SELECT *\r\n  2    FROM mots_freq_polar\r\n  3  FETCH FIRST 10 ROWS ONLY;\r\n\r\nMOT                                  FREQ\r\n------------------------------ ----------\r\nENQU\u00caTE                              24.3\r\nBIEN                                 21.9\r\nFEMME                                21.4\r\nVIE                                  18.8\r\nMORT                                 18.1\r\nANS                                    18\r\nDEUX                                 15.9\r\nROMAN                                15.8\r\nJEUNE                                14.6\r\nPOLICE                               14.5\r\n\r\n10 rows selected.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\">Pour g\u00e9n\u00e9rer le graphique, j&rsquo;utilise ici <a href=\"https:\/\/docs.oracle.com\/middleware\/bidv1221\/desktop\/index.html\" target=\"_blank\" rel=\"noopener\">Oracle Data Visualisation Desktop<\/a> qui dispose d&rsquo;une visualisation \u00ab\u00a0Nuage de mots-cl\u00e9s\u00a0\u00bb:<\/p>\n<p style=\"text-align: justify;\"><a href=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2016\/08\/NuageMotsDVD.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-673 size-large\" src=\"https:\/\/blog.tiran.stream\/wp-content\/uploads\/2016\/08\/NuageMotsDVD.png\" alt=\"NuageMotsDVD\" width=\"648\" height=\"351\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\">Dans un cadre plus g\u00e9n\u00e9ral, il est possible d&rsquo;obtenir un d\u00e9compte des termes \u00e0 partir d&rsquo;une source textuelle \u00e0 l&rsquo;aide de la proc\u00e9dure <a href=\"http:\/\/docs.oracle.com\/database\/121\/CCREF\/cdocpkg.htm#CCREF0718\" target=\"_blank\" rel=\"noopener\">CTX_DOC.TOKENS<\/a> appliqu\u00e9e \u00e0 un <a href=\"http:\/\/docs.oracle.com\/database\/121\/CCREF\/csql.htm#CCREF0105\" target=\"_blank\" rel=\"noopener\">index textuel<\/a> (ctxsys.context).<\/p>\n<p style=\"text-align: justify;\">Exemple:<\/p>\n<pre class=\"brush: sql; ruler: true;\">SQL&gt; CREATE TABLE polar_tab (book_id PRIMARY KEY, extrait)\r\n  2  AS\r\n  3     SELECT ROWNUM book_id, a.extrait\r\n  4       FROM train_set_books a\r\n  5      WHERE categorie = &#039;polar&#039;;\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; CREATE INDEX POLAR_IDXTXT\r\n  2     ON polar_tab (extrait)\r\n  3     INDEXTYPE IS ctxsys.context\r\n  4        PARAMETERS (&#039;LEXER BOOKS_NB_LEXER WORDLIST BOOKS_NB_WORDLIST STOPLIST BOOKS_STOPLIST SYNC(MANUAL)&#039;);\r\n\r\nIndex created.\r\n\r\nSQL&gt;\r\nSQL&gt; CREATE TABLE polar_tokens\r\n  2  (\r\n  3     QUERY_ID   NUMBER,\r\n  4     TOKEN      VARCHAR2 (64),\r\n  5     OFFSET     NUMBER,\r\n  6     LENGTH     NUMBER\r\n  7  );\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; BEGIN\r\n  2     FOR rec IN (SELECT *\r\n  3                   FROM polar_tab)\r\n  4     LOOP\r\n  5        ctx_doc.tokens (&#039;POLAR_IDXTXT&#039;,\r\n  6                        rec.book_id,\r\n  7                        &#039;POLAR_TOKENS&#039;,\r\n  8                        rec.book_id);\r\n  9     END LOOP;\r\n 10  END;\r\n 11  \/\r\n\r\nPL\/SQL procedure successfully completed.\r\n\r\nSQL&gt;\r\nSQL&gt; COMMIT;\r\n\r\nCommit complete.\r\n\r\nSQL&gt;\r\nSQL&gt; set pages 30\r\nSQL&gt;      SELECT token, COUNT (*) cnt\r\n  2         FROM polar_tokens\r\n  3     GROUP BY token\r\n  4     ORDER BY 2 DESC\r\n  5  FETCH FIRST 15 ROWS ONLY;\r\n\r\nTOKEN                                                                   CNT\r\n---------------------------------------------------------------- ----------\r\nBIEN                                                                   1765\r\nJEUNE                                                                  1744\r\nENQU\u00caTE                                                                1715\r\nDEUX                                                                   1710\r\nFEMME                                                                  1637\r\nVIE                                                                    1587\r\nHOMME                                                                  1537\r\nANS                                                                    1404\r\nMORT                                                                   1318\r\nROMAN                                                                  1200\r\nMONDE                                                                  1088\r\nPOLICE                                                                 1004\r\nVILLE                                                                   930\r\nAFFAIRE                                                                 914\r\nJOUR                                                                    908\r\n\r\n15 rows selected.\r\n\r\nSQL&gt;\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>La repr\u00e9sentation en nuage de mots-cl\u00e9s\u00a0est tr\u00e8s r\u00e9pandue pour donner une impression de l&rsquo;importance relative &#8211; en terme de fr\u00e9quence<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[7,8,15,19,17],"tags":[],"class_list":["post-670","post","type-post","status-publish","format-standard","hentry","category-oracle-advanced-analytics","category-oracle-data-visualisation","category-statistique-exploratoire","category-donnees-non-structurees","category-visualisation"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/670","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=670"}],"version-history":[{"count":1,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/670\/revisions"}],"predecessor-version":[{"id":1164,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/670\/revisions\/1164"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=670"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=670"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=670"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}