{"id":1205,"date":"2018-01-02T10:00:04","date_gmt":"2018-01-02T09:00:04","guid":{"rendered":"http:\/\/130.61.50.57\/?p=1205"},"modified":"2018-01-03T12:37:31","modified_gmt":"2018-01-03T11:37:31","slug":"clustering-textuel-avec-r","status":"publish","type":"post","link":"https:\/\/blog.tiran.stream\/?p=1205","title":{"rendered":"Text clustering with R"},"content":{"rendered":"<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">In the previous article, Oracle Text was used to cluster a set of cooking recipes. In this post, the same operation is carried out with R. <\/span><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">The data is identical (it is fetched with ROracle from the Oracle database used in the previous article):<\/span><\/p>\n<pre class=\"brush: js; ruler: true;\">&gt; library(ROracle)\r\nLoading required package: DBI\r\n&gt; ora = Oracle()\r\n&gt; cnx = dbConnect(ora, username=&quot;c##rafa&quot;, password=&quot;Password1#&quot;, dbname=&quot;\/\/clorai2-scan:1521\/pdb_hodba08&quot;)\r\n&gt; data_set &lt;- dbGetQuery(cnx, &quot;select * from RECETTES_CLEAN&quot;)\r\n&gt;\r\n<\/pre>\n<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">We will use the functions of the <a href=\"https:\/\/cran.r-project.org\/web\/packages\/tm\/\" target=\"_blank\" rel=\"noopener\">tm package<\/a> to perform the text-manipulation operations. 
We start by creating a Corpus of terms from the ingredient lists:<\/span><\/p>\n<pre class=\"brush: js; ruler: true;\">&gt; library(tm)\r\n&gt; IngredientsCorpus &lt;- Corpus(VectorSource(data_set$INGREDIENTS), readerControl = list(language = &quot;fr&quot;))\r\n&gt;\r\n<\/pre>\n<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">This Corpus is then reworked by applying the following operations:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Case normalization<\/span><\/li>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Removal of punctuation symbols<\/span><\/li>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Removal of digits<\/span><\/li>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Conversion of accented characters to unaccented characters<\/span><\/li>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Removal of stop words (the stop-word list generated in the previous post is retrieved from <a href=\"https:\/\/docs.oracle.com\/en\/database\/oracle\/oracle-database\/12.2\/ccref\/oracle-text-views.html#GUID-C9CDF4AF-0155-4705-A9DC-6A4A848DDF0E\" target=\"_blank\" rel=\"noopener\">CTX_STOPWORDS<\/a>)<\/span><\/li>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Collapsing of redundant whitespace<\/span><\/li>\n<li><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">Stemming of the terms<\/span><\/li>\n<\/ul>\n<pre class=\"brush: js; ruler: true;\">&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, content_transformer(tolower))\r\n&gt;\r\n&gt; replacePunctuation &lt;- function(x) {\r\n+   gsub(&quot;[[:punct:]]+&quot;, &quot; &quot;, 
x)\r\n+ }\r\n&gt;\r\n&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, content_transformer(replacePunctuation))\r\n&gt; \r\n&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, removeNumbers)\r\n&gt;\r\n&gt; replaceAccent &lt;- function(x) {\r\n+   iconv(x, to=&quot;ASCII\/\/TRANSLIT\/\/IGNORE&quot;)\r\n+ }\r\n&gt;\r\n&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, content_transformer(replaceAccent))\r\n&gt; \r\n&gt; mots_vides &lt;- dbGetQuery(cnx, &quot;select upper(spw_word) mot from CTX_STOPWORDS where spw_stoplist=&#039;RECETTE_STOPLIST&#039;&quot;)\r\n&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, removeWords, tolower(mots_vides$MOT))\r\n&gt;\r\n&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, stripWhitespace)\r\n&gt;\r\n&gt; IngredientsCorpus &lt;- tm_map(IngredientsCorpus, stemDocument, &quot;fr&quot;)\r\n&gt;\r\n<\/pre>\n<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">From the resulting Corpus, a Document\/Term matrix is built (words shorter than 3 letters are excluded along the way). 
The weightTfIdf function is then applied to it to compute the <a href=\"https:\/\/fr.wikipedia.org\/wiki\/TF-IDF\" target=\"_blank\" rel=\"noopener\">Tf-Idf weight<\/a> of each term:<\/span><\/p>\n<pre class=\"brush: js; ruler: true;\">&gt; IngredientsDTM &lt;- DocumentTermMatrix(IngredientsCorpus, control=list(wordLengths=c(3, Inf)))\r\n&gt;\r\n&gt; IngredientsDTM_TfIdf &lt;- weightTfIdf(IngredientsDTM)\r\n&gt;\r\n<\/pre>\n<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">For each document, the weights are normalized by the <a href=\"https:\/\/fr.wikipedia.org\/wiki\/Norme_(math%C3%A9matiques)\" target=\"_blank\" rel=\"noopener\">Euclidean (L2) norm<\/a> of the document:<\/span><\/p>\n<pre class=\"brush: js; ruler: true;\">&gt; normalisation_L2 &lt;- function(x) {\r\n+   x \/ apply(x, MARGIN=1, \r\n+             FUN=function(y) \r\n+             {\r\n+               norm(y, type=&quot;2&quot;)\r\n+             })\r\n+ }\r\n&gt;\r\n&gt; IngredientsDTM_TfIdf_norm &lt;- normalisation_L2(as.matrix(IngredientsDTM_TfIdf))\r\n&gt; \r\n<\/pre>\n<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">The matrix can then be passed to the kmeans function. 
We request 2 clusters:<\/span><\/p>\n<pre class=\"brush: js; ruler: true;\">&gt; recettes_cluster &lt;- kmeans(IngredientsDTM_TfIdf_norm, 2)\r\n&gt; table(recettes_cluster$cluster, data_set$CATEGORIE_PLAT)\r\n   \r\n    Sal\u00e9 Sucr\u00e9\r\n  1    2   169\r\n  2  354    19\r\n&gt; \r\n<\/pre>\n<p><span style=\"font-family: verdana, geneva, sans-serif; font-size: 10pt;\">The contingency table shows that the clustering result is very similar to the one obtained in the previous post.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous article, Oracle Text was used to cluster a set of cooking recipes. In this<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"colormag_page_container_layout":"default_layout","colormag_page_sidebar_layout":"default_layout","footnotes":""},"categories":[4,10,12,19],"tags":[],"class_list":["post-1205","post","type-post","status-publish","format-standard","hentry","category-clustering","category-preparation-des-donnees","category-r","category-donnees-non-structurees"],"_links":{"self":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/1205","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1205"}],"version-history":[{"count":17,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/1205\/revisions"}
],"predecessor-version":[{"id":1236,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=\/wp\/v2\/posts\/1205\/revisions\/1236"}],"wp:attachment":[{"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1205"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1205"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.tiran.stream\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1205"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}