Sample code for using ArabicTokenizer #18
As I understand it, you need the Stanford Word Segmenter, which is designed for tokenizing Arabic and Chinese.
I tried the Stanford Word Segmenter. As its name implies, it divides raw text into segments (sentences) based on a training set, and because it uses machine learning it needs a lot of RAM and CPU, as far as I could tell. Later I noticed that Stanford also includes another tool called ArabicTokenizer. My question is: do you have any idea how to use the tokenizer, and especially how to set its parameters to change its behavior?
Please try the following sample:

    using System;
    using edu.stanford.nlp.international.arabic.process;
    using edu.stanford.nlp.ling;
    using edu.stanford.nlp.util;
    using java.io;

    namespace CoreNLPArabic
    {
        class Program
        {
            static void Main(string[] args)
            {
                string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

                // Normalization options to pass to the tokenizer factory.
                var parameters =
                    new[]
                    {
                        "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                        "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                    };
                var tokenizerOptions = StringUtils.argsToProperties(parameters);
                var tf = tokenizerOptions.containsKey("atb")
                    ? ArabicTokenizer.atbFactory()
                    : ArabicTokenizer.factory();
                foreach (String option in tokenizerOptions.stringPropertyNames().toArray())
                {
                    tf.setOptions(option);
                }
                tf.setOptions("tokenizeNLs");

                int nLines = 0;
                int nTokens = 0;
                var tokenizer = tf.getTokenizer(new StringReader(s));
                var printSpace = false;
                const string NEWLINE_TOKEN = "*NL*";
                while (tokenizer.hasNext())
                {
                    ++nTokens;
                    var next = tokenizer.next() as CoreLabel;
                    String word = next.word();
                    if (word.Equals(NEWLINE_TOKEN))
                    {
                        ++nLines;
                        printSpace = false;
                        Console.WriteLine();
                    }
                    else
                    {
                        if (printSpace) Console.Write(" ");
                        Console.Write(word);
                        printSpace = true;
                    }
                }
                // .NET composite formatting uses {0}, {1} rather than Java's %d.
                Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);
            }
        }
    }
@alismart Sorry, I have no idea.
In your program, replace the line "foreach (String option in tokenizerOptions.stringPropertyNames().toArray())" with "foreach (String option in parameters)". The output text will then be as you hoped, without tatweel.
Fixed code, try it:

    using System;
    using edu.stanford.nlp.international.arabic.process;
    using edu.stanford.nlp.ling;
    using edu.stanford.nlp.util;
    using java.io;

    namespace CoreNLPArabic
    {
        class Program
        {
            static void Main(string[] args)
            {
                string s = "جامعة الدول العربية هي منظمة تضم دولا في الشرق الأوسط وأفريقيا";

                // Normalization options to pass to the tokenizer factory.
                var parameters =
                    new[]
                    {
                        "normArDigits", "normAlif", "normYa", "removeDiacritics", "removeTatweel", "removeProMarker",
                        "removeSegMarker", "removeMorphMarker", "removeLengthening", "atbEscaping"
                    };
                var tokenizerOptions = StringUtils.argsToProperties(parameters);
                var tf = tokenizerOptions.containsKey("atb")
                    ? ArabicTokenizer.atbFactory()
                    : ArabicTokenizer.factory();
                // Set each option directly from the parameters array.
                foreach (String option in parameters)
                {
                    tf.setOptions(option);
                }
                tf.setOptions("tokenizeNLs");

                int nLines = 0;
                int nTokens = 0;
                var tokenizer = tf.getTokenizer(new StringReader(s));
                var printSpace = false;
                const string NEWLINE_TOKEN = "*NL*";
                while (tokenizer.hasNext())
                {
                    ++nTokens;
                    var next = tokenizer.next() as CoreLabel;
                    String word = next.word();
                    if (word.Equals(NEWLINE_TOKEN))
                    {
                        ++nLines;
                        printSpace = false;
                        Console.WriteLine();
                    }
                    else
                    {
                        if (printSpace) Console.Write(" ");
                        Console.Write(word);
                        printSpace = true;
                    }
                }
                // .NET composite formatting uses {0}, {1} rather than Java's %d.
                Console.WriteLine("\nDone! Tokenized {0} lines ({1} tokens)", nLines, nTokens);
            }
        }
    }
Sergey, I did my best to understand how to use the ArabicTokenizer; you can see my attempt in the following code. I hope you can check it and tell me whether this is the best way to use it.
I am also trying to set the parameters in the Main method, but it doesn't seem to work at all. For example, it removes neither the diacritics nor the tatweel.
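One way to check whether options such as removeDiacritics and removeTatweel are taking effect is to compare the tokenizer output against a string cleaned by hand. The sketch below is only a plain .NET approximation of that cleanup and is not the Stanford implementation: the class name ArabicCleanupSketch, the method Clean, and the assumed diacritic range U+064B to U+0652 are all illustrative choices, and the library may treat more (or fewer) characters as diacritics.

```csharp
using System;
using System.Text;

// Illustrative only: approximates what "removeDiacritics" and "removeTatweel"
// are expected to do. Not the Stanford code; the ranges are assumptions.
static class ArabicCleanupSketch
{
    // Tatweel (kashida), the elongation character.
    const char Tatweel = '\u0640';

    // Assumed diacritic set: tanween, short vowels, shadda, sukun (U+064B..U+0652).
    static bool IsDiacritic(char c) => c >= '\u064B' && c <= '\u0652';

    // Returns a copy of the text with tatweel and the assumed diacritics dropped.
    public static string Clean(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c == Tatweel || IsDiacritic(c)) continue;
            sb.Append(c);
        }
        return sb.ToString();
    }
}

class Demo
{
    static void Main()
    {
        // Kaf, ta, ba, each followed by a fatha, with one tatweel after the kaf.
        string input = "\u0643\u064E\u0640\u062A\u064E\u0628\u064E";
        string cleaned = ArabicCleanupSketch.Clean(input);

        // Only the three base letters should remain.
        Console.WriteLine(cleaned.Length);
    }
}
```

If the tokens printed by the tokenizer still contain characters that Clean would drop, the options were probably never applied to the factory.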