Hello World from CORB
To illustrate how easy it is to build a CORB job, consider the following trivial example. Suppose we want to find every document that contains the phrase "Hello World" and print out each document's title, author, and URI. If you didn't have many documents, I would propose that you NOT use CORB for such an easy task; it could instead be run in MarkLogic's Query Console. But let's say that a million documents out of several million, for some inexplicable reason, contain the phrase "Hello World". In that case, I would suggest using CORB. CORB runs a transform on multiple threads against thousands or millions of documents, which pays off when each document has to be opened (filtered) to perform the task at hand. In other words, the transform can't be answered from indexes alone, so processing each document is a little slower.
To tackle this task, you'll need the CORB jar, MarkLogic's XCC connection jar, and a MarkLogic XDBC server attached to a MarkLogic database. You'll also need a selector module and a transform module. In this example, we use XQuery for both, although you may use JavaScript modules if you prefer. Finally, we'll put a few properties in a properties file to configure the job and the report that we want to generate.
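Before launching a job it can help to confirm that all of these pieces are in place. The following is a minimal sketch, not part of CORB itself: `missing_files` is a hypothetical helper, and the file names are the ones this example uses (`selector.xqy`, `transform.xqy`, `my.properties`, and the two jars under an assumed `$LIB` directory).

```shell
#!/bin/sh
# Print every argument that is not a readable regular file.
# Returns 0 when nothing is missing, 1 otherwise.
missing_files() {
  rc=0
  for f in "$@"; do
    if [ ! -f "$f" ] || [ ! -r "$f" ]; then
      echo "missing: $f"
      rc=1
    fi
  done
  return $rc
}

# Assumed layout: jars under $LIB, modules and properties in the current directory.
LIB=${LIB:-/path/to/where/your/jars/are/located}
if missing_files "$LIB/marklogic-xcc-6.0.2.jar" "$LIB/corb2.jar" \
     selector.xqy transform.xqy my.properties; then
  echo "all CORB inputs found"
fi
```

The check only covers local files; it does not verify that the XDBC server is reachable.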
Assume the following document is stored in the database under the URI /document/corbnewworld.xml:
<document>
<title>It's a CORB New World</title>
<author>Corbyn Corbado</author>
<synopsis>Like many bands, CORB has struggled for years in relative obscurity before finally having the overnight success that it is now experiencing. This book serves to aid CORB in finally saying Hello World where have you been all my life?</synopsis>
</document>
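The selector module, selector.xqy, returns the number of matching URIs followed by the URIs themselves:
xquery version "1.0-ml";
(: CORB expects the count first, then the URIs to process :)
let $uris := cts:uris((),(),cts:word-query("Hello World"))
return (fn:count($uris),$uris)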
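The transform module, transform.xqy, receives each selected URI in the external variable $URI and returns one comma-separated line for the report:
xquery version "1.0-ml";
declare variable $URI as xs:string external;
let $document := fn:doc($URI)/document
let $author := $document/author/string()
let $title := $document/title/string()
(: one report row: title, author, URI :)
return fn:string-join(($title,$author,$URI),",")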
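The properties file, my.properties, ties the modules together and configures the report:
# Number of worker threads
THREAD-COUNT=8
# |ADHOC means the module is read from the local file system rather than the modules database
URIS-MODULE=selector.xqy|ADHOC
PROCESS-MODULE=transform.xqy|ADHOC
# Write every transform result to a single export file
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=HelloWorldReport.csv
# Write the header line at the top of the export file before the batch starts
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
EXPORT-FILE-TOP-CONTENT=Title,Author,URI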
Assume there is a database called FFE on a MarkLogic instance running locally, with a user/password of admin/admin and a MarkLogic XDBC server listening on port 9000. The following shell script runs the job:
LIB=/path/to/where/your/jars/are/located
java -cp "$LIB/marklogic-xcc-6.0.2.jar:$LIB/corb2.jar" \
-DOPTIONS-FILE=my.properties \
com.marklogic.developer.corb.Manager \
xcc://admin:admin@localhost:9000/FFE
Note: On a Windows system, use ; instead of : as the delimiter between the jars in the classpath, e.g. java -cp "$LIB/marklogic-xcc-6.0.2.jar;$LIB/corb2.jar"
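The separator choice can also be folded into the script itself. This is a minimal sketch, assuming the script runs under a POSIX shell; on Windows that means something like Git Bash, MSYS, or Cygwin, where `uname -s` reports a name beginning with MINGW, MSYS, or CYGWIN. `classpath_separator` is a hypothetical helper, not part of CORB, and the sketch does not convert paths to Windows form.

```shell
#!/bin/sh
# Print the Java classpath separator for a platform name.
# With no argument, the name reported by `uname -s` is used.
classpath_separator() {
  case "${1:-$(uname -s)}" in
    CYGWIN*|MINGW*|MSYS*) printf '%s' ';' ;;  # Windows JVM
    *)                    printf '%s' ':' ;;  # Linux, macOS, other Unix
  esac
}

LIB=/path/to/where/your/jars/are/located
SEP=$(classpath_separator)
CLASSPATH_ARG="$LIB/marklogic-xcc-6.0.2.jar${SEP}$LIB/corb2.jar"
echo "$CLASSPATH_ARG"
```

The resulting value can then be passed to `java -cp "$CLASSPATH_ARG"` in place of the hard-coded classpath.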
Executing the shell script should produce a file called HelloWorldReport.csv in the directory the script was run from. Assuming the sample document above is the only match, opening the file should reveal the following two lines of text:
Title,Author,URI
It's a CORB New World,Corbyn Corbado,/document/corbnewworld.xml
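For a larger run, opening the file by hand is less practical. The following is a rough sketch of a sanity check on the report, assuming the file name and header configured above; `count_report_rows` is a hypothetical helper, not something CORB provides. It confirms the header line and prints the number of non-empty data rows, which can be compared with the count returned by the selector.

```shell
#!/bin/sh
# Print the number of non-empty data rows in a CSV export,
# after confirming that its first line equals the expected header.
count_report_rows() {
  file=$1
  expected_header=$2
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi
  header=$(head -n 1 "$file")
  if [ "$header" != "$expected_header" ]; then
    echo "unexpected header: $header" >&2
    return 1
  fi
  # Skip the header line and count lines that contain at least one field.
  awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$file"
}

# Usage with the names from this example
if [ -f HelloWorldReport.csv ]; then
  count_report_rows HelloWorldReport.csv "Title,Author,URI"
fi
```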