{"id":1254,"date":"2018-05-26T15:17:03","date_gmt":"2018-05-26T14:17:03","guid":{"rendered":"http:\/\/130.61.50.57\/?p=1254"},"modified":"2018-07-27T15:23:35","modified_gmt":"2018-07-27T14:23:35","slug":"classification-textuelle-avec-oracle-text-1","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=1254","title":{"rendered":"Classification Textuelle avec Oracle Text #1"},"content":{"rendered":"<p>Dans un <a href=\"https:\/\/blog.tiran.stream\/clustering-textuel-avec-oracle-text\/\" target=\"_blank\" rel=\"noopener\">article pr\u00e9c\u00e9dent<\/a>, Oracle Text a \u00e9t\u00e9 utilis\u00e9 pour r\u00e9aliser un clustering. Ici, c&rsquo;est une classification supervis\u00e9e reposant sur l&rsquo;<a href=\"https:\/\/fr.wikipedia.org\/wiki\/Machine_%C3%A0_vecteurs_de_support\" target=\"_blank\" rel=\"noopener\">algorithme de SVM<\/a> que nous allons mettre en oeuvre. La encore, l&rsquo;avantage de l&rsquo;utilisation d&rsquo;Oracle Text est que cela ne n\u00e9cessite pas de disposer de la licence Oracle Advanced Analytics.<\/p>\n<p>Pour cela, je vais utiliser un dataset issu de kaggle relatif \u00e0 un catalogue de <a href=\"https:\/\/www.kaggle.com\/zynicide\/wine-reviews\" target=\"_blank\" rel=\"noopener\">revues de d\u00e9gustation de vins<\/a>.<\/p>\n<p>Celui-ci est au format csv et j&rsquo;utilise un table externe pour acc\u00e9der aux donn\u00e9es:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE OR REPLACE DIRECTORY d1 AS &#039;\/tmp&#039;;\r\n\r\nDirectory created.\r\n\r\nSQL&gt;\r\nSQL&gt; CREATE TABLE wine_dataset_ext\r\n  2  (\r\n  3      id NUMBER,\r\n  4      country VARCHAR2 (50),\r\n  5      description VARCHAR2 (1000),\r\n  6      designation VARCHAR2 (100),\r\n  7      points VARCHAR2 (50),\r\n  8      price VARCHAR2 (50),\r\n  9      province VARCHAR2 (50),\r\n 10      region_1 VARCHAR2 (50),\r\n 11      region_2 VARCHAR2 (50),\r\n 12      taster_name VARCHAR2 (50),\r\n 13      taster_twitter_handle VARCHAR2 (50),\r\n 14      title VARCHAR2 (200),\r\n 15      variety VARCHAR2 (50),\r\n 16      winery VARCHAR2 (100)\r\n 17  )\r\n 18  ORGANIZATION EXTERNAL\r\n 19      (TYPE oracle_loader\r\n 20       DEFAULT DIRECTORY d1\r\n 21       ACCESS PARAMETERS (\r\n 22           RECORDS DELIMITED BY NEWLINE\r\n 23           SKIP 1\r\n 24           FIELDS TERMINATED BY &#039;,&#039; OPTIONALLY ENCLOSED BY &#039;&quot;&#039; LRTRIM MISSING FIELD VALUES ARE NULL\r\n 25       )\r\n 26       LOCATION (d1:&#039;winemag-data-130k-v2.csv&#039;))\r\n 27      REJECT LIMIT 10;\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>Les enregistrements sont filtr\u00e9s pour ne conserver que ceux relatifs aux 4 cat\u00e9gories de vin les plus repr\u00e9sent\u00e9es:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE TABLE wine_dataset\r\n  2  AS\r\n  3      SELECT variety, description\r\n  4        FROM wine_dataset_ext\r\n  5       WHERE variety IN (     SELECT variety\r\n  6                                FROM wine_dataset_ext\r\n  7                            GROUP BY variety\r\n  8                            ORDER BY COUNT (*) DESC\r\n  9                         FETCH FIRST 4 ROWS ONLY);\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; ALTER TABLE wine_dataset\r\n  2      ADD wine# NUMBER GENERATED AS IDENTITY;\r\n\r\nTable altered.\r\n\r\nSQL&gt;\r\nSQL&gt; ALTER TABLE wine_dataset\r\n  2      ADD CONSTRAINT pk_wine_dataset PRIMARY KEY (wine#);\r\n\r\nTable altered.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>Les quatre cat\u00e9gories sont approximativement distribu\u00e9es de mani\u00e8re \u00e9quitable (entre 20 et 30%):<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt;   SELECT variety,\r\n  2           ROUND (100 * ratio_to_report (cnt) OVER (PARTITION BY NULL)) pct\r\n  3      FROM (  SELECT variety, COUNT (*) cnt\r\n  4                FROM wine_dataset\r\n  5            GROUP BY variety)\r\n  6  ORDER BY 2 DESC;\r\n\r\nVARIETY                                                   PCT\r\n-------------------------------------------------- ----------\r\nPinot Noir                                                 31\r\nChardonnay                                                 27\r\nCabernet Sauvignon                                         22\r\nRed Blend                                                  21\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>On cr\u00e9e une table de lookup pour ces cat\u00e9gories:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE TABLE wine_variety\r\n  2  AS\r\n  3      SELECT ROWNUM cat#, variety\r\n  4        FROM (SELECT DISTINCT variety\r\n  5                FROM wine_dataset);\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; ALTER TABLE wine_variety\r\n  2      ADD CONSTRAINT pk_wine_variety PRIMARY KEY (cat#);\r\n\r\nTable altered.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>On divise (80\/20) le dataset en un ensemble d&rsquo;apprentissage et un ensemble de test :<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE TABLE train_set_wine#\r\n  2  AS\r\n  3      SELECT wine#\r\n  4        FROM wine_dataset SAMPLE (80);\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; CREATE TABLE train_set_wines\r\n  2  AS\r\n  3      SELECT a.wine#, description\r\n  4        FROM wine_dataset a, train_set_wine# b\r\n  5       WHERE a.wine# = b.wine#;\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; CREATE TABLE train_set_variety\r\n  2  AS\r\n  3      SELECT a.wine#, cat#\r\n  4        FROM wine_dataset a, train_set_wine# b, wine_variety c\r\n  5       WHERE c.variety = a.variety AND a.wine# = b.wine#;\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\nSQL&gt; CREATE TABLE test_set_wines\r\n  2  AS\r\n  3      SELECT wine#,\r\n  4             (SELECT cat#\r\n  5                FROM wine_variety\r\n  6               WHERE variety = a.variety)\r\n  7                 cat#,\r\n  8             description\r\n  9        FROM wine_dataset a\r\n 10       WHERE wine# NOT IN (SELECT wine# FROM train_set_wine#);\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>Une stoplist contenant les mots vides (de la langue anglaise) ainsi que le nom des cat\u00e9gories de vin est ensuite cr\u00e9\u00e9e:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; BEGIN\r\n  2      ctx_ddl.create_stoplist (stoplist_name   =&gt; &#039;WINE_STOPLIST&#039;,\r\n  3                               stoplist_type   =&gt; &#039;BASIC_STOPLIST&#039;);\r\n  4\r\n  5      FOR rec\r\n  6          IN (SELECT spw_word mot\r\n  7                FROM ctx_stopwords\r\n  8               WHERE spw_stoplist = &#039;DEFAULT_STOPLIST&#039;)\r\n  9      LOOP\r\n 10          ctx_ddl.add_stopword (stoplist_name   =&gt; &#039;WINE_STOPLIST&#039;,\r\n 11                                stopword        =&gt; rec.mot);\r\n 12      END LOOP;\r\n 13\r\n 14  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Cabernet&#039;);\r\n 15  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Sauvignon&#039;);\r\n 16  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Chardonnay&#039;);\r\n 17  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Pinot&#039;);\r\n 18  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Noir&#039;);\r\n 19  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Red&#039;);\r\n 20  ctx_ddl.add_stopword (stoplist_name =&gt; &#039;WINE_STOPLIST&#039;, stopword =&gt; &#039;Blend&#039;);\r\n 21\r\n 22  END;\r\n 23  \/\r\n\r\nPL\/SQL procedure successfully completed.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>On cr\u00e9e ensuite un index textuel exploitant la stoplist pr\u00e9c\u00e9demment cr\u00e9\u00e9e et utilisant la tokenisation par d\u00e9faut:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE INDEX train_set_wines_ctxidx\r\n  2      ON train_set_wines (description)\r\n  3      INDEXTYPE IS ctxsys.context\r\n  4          PARAMETERS (&#039;\r\n  5          stoplist        WINE_STOPLIST\r\n  6        &#039;)\r\n  7  \/\r\n\r\nIndex created.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>On sp\u00e9cifie l&rsquo;algorithme de classification &#8211; ici, Support Vector Machine &#8211; et la table de stockage du param\u00e9trage du classifieur :<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; EXEC ctx_ddl.create_preference(&#039;WINE_CLASSIFIER&#039;,&#039;SVM_CLASSIFIER&#039;);\r\n\r\nPL\/SQL procedure successfully completed.\r\n\r\nSQL&gt; CREATE TABLE restab\r\n  2  (\r\n  3      cat_id  NUMBER,\r\n  4      TYPE    NUMBER (3) NOT NULL,\r\n  5      rule    BLOB\r\n  6  );\r\n\r\nTable created.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>La classification est alors d\u00e9clench\u00e9e via CTX_CLS.TRAIN:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; BEGIN\r\n  2      ctx_cls.train (index_name   =&gt; &#039;train_set_wines_ctxidx&#039;,\r\n  3                     docid        =&gt; &#039;wine#&#039;,\r\n  4                     cattab       =&gt; &#039;train_set_variety&#039;,\r\n  5                     catdocid     =&gt; &#039;wine#&#039;,\r\n  6                     catid        =&gt; &#039;cat#&#039;,\r\n  7                     restab       =&gt; &#039;restab&#039;,\r\n  8                     pref_name    =&gt; &#039;wine_classifier&#039;);\r\n  9  END;\r\n 10  \/\r\n\r\nPL\/SQL procedure successfully completed.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>On peut alors cr\u00e9er <a href=\"https:\/\/docs.oracle.com\/en\/database\/oracle\/oracle-database\/12.2\/ccapp\/indexing-with-oracle-text.html#GUID-32E664C4-0379-4CC0-8040-67B2CFED020C\" target=\"_blank\" rel=\"noopener\">un index de \u00ab\u00a0routage\u00a0\u00bb (CTXRULE)<\/a> exploitant les r\u00e8gles produites par la classification:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE INDEX restabx\r\n  2      ON restab (rule)\r\n  3      INDEXTYPE IS ctxsys.ctxrule\r\n  4          PARAMETERS (&#039;filter ctxsys.null_filter classifier wine_classifier&#039;);\r\n\r\nIndex created.\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>La fonction <a href=\"https:\/\/docs.oracle.com\/en\/database\/oracle\/oracle-database\/12.2\/ccref\/oracle-text-SQL-statements-and-operators.html#GUID-24B74948-E18F-4B71-8F74-032F933CF53A\" target=\"_blank\" rel=\"noopener\">MATCH_SCORE<\/a> de l&rsquo;op\u00e9rateur MATCHES peut alors \u00eatre utilis\u00e9 pour quantifier le degr\u00e9 de correspondance du texte descriptif vis \u00e0 vis des r\u00e8gles produites par le classifieur.<\/p>\n<p>Pour le vin 6 par exemple, c&rsquo;est la cat\u00e9gorie 4 qui pr\u00e9sente le score le plus \u00e9lev\u00e9 (31):<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt;   SELECT b.wine#,\r\n  2           a.cat_id      pred_cat#,\r\n  3           match_score (1) scr\r\n  4      FROM restab a, test_set_wines b\r\n  5     WHERE matches (rule, description, 1) &gt; 0 AND b.wine# = 6\r\n  6  ORDER BY 3 DESC;\r\n\r\n     WINE#  PRED_CAT#        SCR\r\n---------- ---------- ----------\r\n         6          4         31\r\n         6          2         24\r\n         6          3         17\r\n         6          1         15\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>On stocke le r\u00e9sultat du scoring pour le dataset de test dans une table \u00ab\u00a0res\u00a0\u00bb. On peut alors comparer la cat\u00e9gorie r\u00e9elle avec celle pr\u00e9dite:<\/p>\n<pre class=\"brush: sql; ruler: true;\">\u00a0\r\nSQL&gt; CREATE TABLE res\r\n  2  AS\r\n  3      SELECT *\r\n  4        FROM (SELECT wine#,\r\n  5                     pred_cat#,\r\n  6                     cat#,\r\n  7                     ROW_NUMBER () OVER (PARTITION BY wine# ORDER BY scr DESC)\r\n  8                         rn\r\n  9                FROM (SELECT b.wine#,\r\n 10                             a.cat_id        pred_cat#,\r\n 11                             b.cat#,\r\n 12                             match_score (1) scr\r\n 13                        FROM restab a, test_set_wines b\r\n 14                       WHERE matches (rule, description, 1) &gt; 0))\r\n 15       WHERE rn = 1;\r\n\r\nTable created.\r\n\r\nSQL&gt; WITH\r\n  2      badpred\r\n  3      AS\r\n  4          (SELECT COUNT (*) c1\r\n  5             FROM res\r\n  6            WHERE pred_cat# != cat#),\r\n  7      totpop AS (SELECT COUNT (*) c2 FROM res)\r\n  8  SELECT ROUND (100 * (c2 - c1) \/ c2, 2) pct_good\r\n  9    FROM badpred, totpop;\r\n\r\n  PCT_GOOD\r\n----------\r\n     84.82\r\n\r\nSQL&gt;\r\n<\/pre>\n<p>Le taux de pr\u00e9dictions correcte est de l&rsquo;ordre de <strong>85%<\/strong> &#8211; ce qui est tr\u00e8s correct quand on consid\u00e8re que l&rsquo;on n&rsquo;a pas particuli\u00e8rement fait d&rsquo;effort de pr\u00e9paration des donn\u00e9es ni de tuning du mod\u00e8le.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Dans un article pr\u00e9c\u00e9dent, Oracle Text a \u00e9t\u00e9 utilis\u00e9 pour r\u00e9aliser un clustering. Ici, c&rsquo;est une classification supervis\u00e9e reposant sur<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[3,6,19],"tags":[],"class_list":["post-1254","post","type-post","status-publish","format-standard","hentry","category-classification","category-oracle","category-donnees-non-structurees"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/1254","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1254"}],"version-history":[{"count":3,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/1254\/revisions"}],"predecessor-version":[{"id":1257,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/1254\/revisions\/1257"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1254"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1254"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1254"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}