Sample code for using ArabicTokenizer #18

Closed
alismart opened this issue May 14, 2015 · 7 comments

Comments

@alismart

Sergey, I did my best to understand how to use the ArabicTokenizer; you can see my attempt in the following code. I hope you can check it and tell me whether this is the best way to use it.
I am also trying to set the parameters through the main method, but that does not seem to work at all. For example, it removes neither the diacritics nor the tatweel.

        ArabicTokenizer.main(new string[] { "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker", "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping" });
        string s = textBox2.Text;
        java.io.StringReader sr = new StringReader(s);
        ArabicTokenizer tokenizer = new ArabicTokenizer(sr, new edu.stanford.nlp.process.WordTokenFactory(), new java.util.Properties());

        java.util.List al = tokenizer.tokenize();
        int size = al.size();
        string container = "";
        for (int i = 0; i < size; i++)
        {
            Word w = (Word)al.get(i);
            container = container + " ^ " + w.word();
        }
        textBox1.Text = container;


@sergey-tihon
Owner

As I understand it, you need the Stanford Word Segmenter, which is designed for tokenizing Arabic and Chinese.
Here you can find C#/F# samples for Stanford Word Segmenter for .NET
Is this what you are trying to do?

@alismart
Author

I tried the Stanford Word Segmenter. As its name implies, it divides raw text into segments (sentences) based on a trained model, and since it uses machine learning it needs a lot of RAM and CPU, as far as I noticed.

Later on, I noticed that Stanford includes another tool called Arabic Tokenizer,
which is a straightforward algorithm for dividing raw Arabic text into tokens (words). So I think what I need is the Tokenizer (to words) rather than the Segmenter (to sentences).

My question is: do you have any idea how to use the Tokenizer, and especially how to set the parameters that change its behavior?

@sergey-tihon
Owner

Please try the following sample:

using System;
using edu.stanford.nlp.international.arabic.process;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.util;
using java.io;

namespace CoreNLPArabic
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

            var parameters =
                new[]
                {
                    "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                    "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                };
            var tokenizerOptions = StringUtils.argsToProperties(parameters);
            var tf = tokenizerOptions.containsKey("atb")
                ? ArabicTokenizer.atbFactory()
                : ArabicTokenizer.factory();

            foreach (String option in tokenizerOptions.stringPropertyNames().toArray())
            {tf.setOptions(option);}
            tf.setOptions("tokenizeNLs");

            int nLines = 0;
            int nTokens = 0;
            var tokenizer = tf.getTokenizer(new StringReader(s));
            var printSpace = false;
            const string NEWLINE_TOKEN = "*NL*";
            while (tokenizer.hasNext()) {
              ++nTokens;
              var next = tokenizer.next() as CoreLabel;
              String word = next.word();
              if (word.Equals(NEWLINE_TOKEN)) {
                ++nLines;
                printSpace = false;
                System.Console.WriteLine();
              } else {
                if (printSpace) System.Console.Write(" ");
                System.Console.Write(word);
                printSpace = true;
              }
            }
            System.Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);

        }
    }
}

@alismart
Author

alismart commented Jul 2, 2015

Unfortunately, it didn't work.
The tokenizer does not seem to recognize any of the options provided in the code.
This is my attempt; the words in red are what I expect once the options are taken into account.
image

Maybe something else is still required to get everything working properly.

Given the following input:
جامعةُ الدُوَلِ العـــــــــربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا
the expected output is:
جامعة الدول العربية هي منظمة تضم دولا في الشرق الاوسط وافريقيا
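
For reference, the input/output pair above corresponds roughly to three of the requested options: removeDiacritics, removeTatweel and normAlif. Below is a minimal hand-written sketch of those three transformations in plain C#, handy for checking what the tokenizer output should look like. It is not the Stanford lexer's implementation; the class name and the exact character ranges are assumptions chosen for illustration.

```csharp
using System;
using System.Text;

static class ArabicNormalizationSketch
{
    // Hand-written approximation of three options discussed in this thread.
    // NOT the Stanford tokenizer's actual rule set.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // removeTatweel: drop the elongation character U+0640.
            if (c == '\u0640') continue;

            // removeDiacritics: drop fathatan..sukun (U+064B..U+0652).
            if (c >= '\u064B' && c <= '\u0652') continue;

            // normAlif: map alif with madda or hamza (U+0622, U+0623, U+0625)
            // to bare alif (U+0627).
            if (c == '\u0622' || c == '\u0623' || c == '\u0625')
            {
                sb.Append('\u0627');
                continue;
            }

            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling `ArabicNormalizationSketch.Normalize(...)` on the input line and comparing the result with the tokenizer output is a quick way to see which options were actually applied.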

I hope you can fix the code as soon as possible, because I need it for my graduation project.
Thanks in advance.

@sergey-tihon
Owner

@alismart Sorry, I have no idea.
Could you try the Java version? It would be good to know whether it works as you expect.

@saidMoulay

In your program, replace the line "foreach (String option in tokenizerOptions.stringPropertyNames().toArray())" with "foreach (String option in parameters)". The output text will then be as you expect, without tatweel.
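
A likely reason the original loop did nothing: `StringUtils.argsToProperties` appears to treat only dash-prefixed arguments (for example `-normAlif`) as property keys, so bare names never show up in `stringPropertyNames()`. That is an assumption about the library, not something verified in this thread. The sketch below avoids the properties round trip and builds a single option string instead. `TokenizerOptionHelper` is a hypothetical helper, and the sketch also assumes that `setOptions` accepts a comma-separated list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TokenizerOptionHelper
{
    // Hypothetical helper (not part of Stanford NLP): joins option names into
    // one comma-separated string and strips any leading '-' so that both
    // "normAlif" and "-normAlif" are accepted.
    public static string ToOptionString(IEnumerable<string> names)
    {
        return string.Join(",", names.Select(n => n.TrimStart('-')));
    }
}

// Intended use with the factory from the sample (assumes setOptions accepts
// a comma-separated list of option names):
//   tf.setOptions(TokenizerOptionHelper.ToOptionString(parameters));
```

If the comma-separated form turns out not to be supported, looping over `parameters` and calling `tf.setOptions(option)` once per name, as suggested above, remains the safe route.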

@saidMoulay

saidMoulay commented Jun 6, 2017

Fixed code, try it:

using System;
using edu.stanford.nlp.international.arabic.process;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.util;
using java.io;

namespace CoreNLPArabic
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

            var parameters =
                new[]
                {
                    "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                    "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                };
            var tokenizerOptions = StringUtils.argsToProperties(parameters);
            var tf = tokenizerOptions.containsKey("atb")
                ? ArabicTokenizer.atbFactory()
                : ArabicTokenizer.factory();

            foreach (String option in parameters)
            {tf.setOptions(option);}
            tf.setOptions("tokenizeNLs");

            int nLines = 0;
            int nTokens = 0;
            var tokenizer = tf.getTokenizer(new StringReader(s));
            var printSpace = false;
            const string NEWLINE_TOKEN = "*NL*";
            while (tokenizer.hasNext()) {
              ++nTokens;
              var next = tokenizer.next() as CoreLabel;
              String word = next.word();
              if (word.Equals(NEWLINE_TOKEN)) {
                ++nLines;
                printSpace = false;
                System.Console.WriteLine();
              } else {
                if (printSpace) System.Console.Write(" ");
                System.Console.Write(word);
                printSpace = true;
              }
            }
            System.Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);

        }
    }
}
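
A note on the final summary line: `Console.WriteLine` in C# uses composite format placeholders such as `{0}` and `{1}`, while Java-style `%d` and `%n` specifiers are printed literally. A minimal sketch of the summary as a small helper follows; `TokenizerSummary` is a hypothetical name used only for illustration.

```csharp
using System;

static class TokenizerSummary
{
    // Hypothetical helper: builds the summary line with C# composite
    // formatting instead of Java printf-style specifiers.
    public static string Format(int nLines, int nTokens)
    {
        return string.Format("Done! Tokenized {0} lines ({1} tokens)", nLines, nTokens);
    }
}
```

In the sample this would be called as `System.Console.WriteLine(TokenizerSummary.Format(nLines, nTokens));`.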
