Sample code for using ArabicTokenizer #18

alismart · 2015-05-14T08:16:46Z

Sergey, I did my best to understand how to use the ArabicTokenizer; you can see my attempt in the following code. I hope you can check it and tell me whether this is the best way to use it.
I am also trying to set the parameters through the main method, but it doesn't seem to work at all. For example, it removes neither the diacritics nor the tatweel.

        ArabicTokenizer.main(new string[] { "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker", "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping" });
        string s = textBox2.Text;
        java.io.StringReader sr = new StringReader(s);
        ArabicTokenizer tokenizer = new ArabicTokenizer(sr, new edu.stanford.nlp.process.WordTokenFactory(), new java.util.Properties());

        java.util.List al = tokenizer.tokenize();
        int size = al.size();
        string container = "";
        for (int i = 0; i < size; i++)
        {
            Word w = (Word)al.get(i);
            container = container + " ^ " + w.word();
        }
        textBox1.Text = container;

sergey-tihon · 2015-05-18T12:54:52Z

As I understand it, you need the Stanford Word Segmenter, which is designed for tokenizing Arabic and Chinese.
Here you can find C#/F# samples for Stanford Word Segmenter for .NET
Is that what you are trying to do?

alismart · 2015-05-18T19:28:58Z

I tried the Stanford Word Segmenter. As its name implies, it divides raw text into segments (sentences) depending on a training set, which, as I noticed, needs a lot of RAM and CPU because it relies on machine learning.

Later on, I noticed that Stanford also includes another tool called the Arabic Tokenizer,
which is a straightforward algorithm for dividing raw Arabic text into tokens (words). So I think what I need is the Tokenizer (to words) rather than the Segmenter (to sentences).

My question is: do you have any idea how to use the Tokenizer, and especially how to set the parameters that vary its behavior?

sergey-tihon · 2015-05-22T11:29:19Z

Please try the following sample:

using System;
using edu.stanford.nlp.international.arabic.process;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.util;
using java.io;

namespace CoreNLPArabic
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

            var parameters =
                new[]
                {
                    "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                    "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                };
            var tokenizerOptions = StringUtils.argsToProperties(parameters);
            var tf = tokenizerOptions.containsKey("atb")
                ? ArabicTokenizer.atbFactory()
                : ArabicTokenizer.factory();

            foreach (String option in tokenizerOptions.stringPropertyNames().toArray())
            {tf.setOptions(option);}
            tf.setOptions("tokenizeNLs");

            int nLines = 0;
            int nTokens = 0;
            var tokenizer = tf.getTokenizer(new StringReader(s));
            var printSpace = false;
            const string NEWLINE_TOKEN = "*NL*";
            while (tokenizer.hasNext()) {
              ++nTokens;
              var next = tokenizer.next() as CoreLabel;
              String word = next.word();
              if (word.Equals(NEWLINE_TOKEN)) {
                ++nLines;
                printSpace = false;
                System.Console.WriteLine();
              } else {
                if (printSpace) System.Console.Write(" ");
                System.Console.Write(word);
                printSpace = true;
              }
            }
            // C# composite formatting uses {0}/{1}; the Java-style %d and %n would be printed literally.
            System.Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);

        }
    }
}

alismart · 2015-07-02T11:49:13Z

Unfortunately, it didn't work.
The tokenizer does not seem to recognize any of the options you provided in the code.
Here is my attempt; the red words show what is expected once the options are taken into account.

Maybe something else is still required to get everything working properly.

Given the following input:
جامعةُ الدُوَلِ العـــــــــربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا
The expected output is:
جامعة الدول العربية هي منظمة تضم دولا في الشرق الاوسط وافريقيا

I hope you can fix the code as soon as possible, because I need it for my graduation project.
Thanks in advance.
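
For reference, the three changes visible in the expected output above (diacritics removed, tatweel removed, hamza-carrying alif forms mapped to bare alif) can be sketched without the library. This is a minimal illustration of what the removeDiacritics, removeTatweel and normAlif options are expected to do in this example, not a reimplementation of ArabicTokenizer; the class and method names are hypothetical, and the exact character sets the library uses are an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper: illustrates the expected normalization only.
// It is NOT part of Stanford NLP and covers just three of the options.
static class ArabicNormalizationSketch
{
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Assumed diacritic range: U+064B (fathatan) through U+0652 (sukun).
            if (c >= '\u064B' && c <= '\u0652') continue;

            // Tatweel (kashida), U+0640, is purely decorative elongation.
            if (c == '\u0640') continue;

            // Alif with madda (U+0622), hamza above (U+0623) or hamza below (U+0625)
            // is mapped to bare alif (U+0627).
            if (c == '\u0622' || c == '\u0623' || c == '\u0625')
            {
                sb.Append('\u0627');
                continue;
            }

            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling ArabicNormalizationSketch.Normalize on the input sentence gives a baseline to compare against whatever the tokenizer returns, which helps tell whether an option was ignored or simply behaves differently from what was expected.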

sergey-tihon · 2015-07-07T05:08:27Z

@alismart Sorry, I have no idea.
Could you try the Java version? It would be nice to know whether it works as you expect.

saidMoulay · 2015-07-12T09:49:18Z

In your program, replace the line "foreach (String option in tokenizerOptions.stringPropertyNames().toArray())" with "foreach (String option in parameters)". The output text will then be as you expect, without the tatweel.
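
A possible reason the replacement works, offered as an assumption rather than something confirmed in this thread: argument parsers of this kind usually treat only dash-prefixed strings such as "-removeTatweel" as option names and collect bare strings as positional arguments under a single key. If StringUtils.argsToProperties behaves that way, iterating over its property names would never pass "removeTatweel" to setOptions. The sketch below is a simplified, hypothetical model of that parsing convention (FlagParsingSketch and Parse are illustrative names, not part of Stanford NLP, and the model ignores option values):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified model of dash-style flag parsing.
// It is NOT the source of StringUtils.argsToProperties.
static class FlagParsingSketch
{
    public static Dictionary<string, string> Parse(string[] args)
    {
        var result = new Dictionary<string, string>();
        var positional = new List<string>();

        foreach (string arg in args)
        {
            if (arg.StartsWith("-", StringComparison.Ordinal))
            {
                // "-name" becomes the key "name" with the value "true".
                result[arg.Substring(1)] = "true";
            }
            else
            {
                // Bare strings are not option names in this model.
                positional.Add(arg);
            }
        }

        // All positional arguments end up under one empty key.
        if (positional.Count > 0)
        {
            result[""] = string.Join(" ", positional);
        }
        return result;
    }
}
```

Under this model, Parse(new[] { "normAlif", "removeTatweel" }) yields a single entry with an empty key, while Parse(new[] { "-normAlif", "-removeTatweel" }) yields the keys normAlif and removeTatweel. Printing tokenizerOptions.stringPropertyNames() in the real program would show whether the library behaves the same way.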

saidMoulay · 2017-06-06T12:26:05Z

Fixed code, try it:

using System;
using edu.stanford.nlp.international.arabic.process;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.util;
using java.io;

namespace CoreNLPArabic
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

            var parameters =
                new[]
                {
                    "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                    "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                };
            var tokenizerOptions = StringUtils.argsToProperties(parameters);
            var tf = tokenizerOptions.containsKey("atb")
                ? ArabicTokenizer.atbFactory()
                : ArabicTokenizer.factory();

            // Iterate over the option names themselves rather than the keys of tokenizerOptions.
            foreach (String option in parameters)
            {
                tf.setOptions(option);
            }
            tf.setOptions("tokenizeNLs");

            int nLines = 0;
            int nTokens = 0;
            var tokenizer = tf.getTokenizer(new StringReader(s));
            var printSpace = false;
            const string NEWLINE_TOKEN = "*NL*";
            while (tokenizer.hasNext()) {
              ++nTokens;
              var next = tokenizer.next() as CoreLabel;
              String word = next.word();
              if (word.Equals(NEWLINE_TOKEN)) {
                ++nLines;
                printSpace = false;
                System.Console.WriteLine();
              } else {
                if (printSpace) System.Console.Write(" ");
                System.Console.Write(word);
                printSpace = true;
              }
            }
            // C# composite formatting uses {0}/{1}; the Java-style %d and %n would be printed literally.
            System.Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);

        }
    }
}

sergey-tihon mentioned this issue Mar 10, 2017

Need a very basic C# example to anaylze arabic words #59

Closed

sergey-tihon closed this as completed Feb 28, 2021
