forked from ganramkr/apache-nutch-2.3
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Documentation.txt
35 lines (16 loc) · 1.67 KB
/
Documentation.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
CHANGES COMITTED IN CODE FOR SANDHAN 1.0 WHICH USES A COMBINATION OF NUTCH 2.3 + SOLR 4.3.10 + HBASE 0.94
1.) conf/parse-plugins.xml
a.) add lines just below the comment --> <!-- Included in Sandhan upgraded version -->
2.) src/plugin/horizontal/offline/parse-cml/src/java/org/iitkgp/cel/parse/html/CmlParser.java -:
a.) change in boilerplate code (line 264 , original code commented)
b.) change in outlinks url, convert https to http(line 351 , original code commented)
c.) change in set metadata, Utf8 encoding of string must be used in place of simple string(line 522 , original code commented)
3.) src/plugin/horizontal/offline/index-cml/src/java/org/apache/nutch/parse/cml/CmlIndexer.java
a.) change in definition of override filter method due to difference in functionality of nutch 2.3 and nutch 1.4 ->
NUTCH 2.3 - public NutchDocument filter(NutchDocument doc,String url,WebPage page) throws IndexingException
NUTCH 1.4 - public NutchDocument filter(NutchDocument doc, Parse parse, Text url,CrawlDatum datum, Inlinks inlinks) throws IndexingException
b.) change in get metadata, Utf8 encoding of string must be used in place of simple string analogous to change(c) in parse-cml(line 74 , original code commented)
4.) src/plugin/horizontal/offline/index-unl/src/java/org/apache/nutch/unl/UNLIndexer.java
a.) change in definition of override filter method due to difference in functionality of nutch 2.3 and nutch 1.4 ->
NUTCH 2.3 - public NutchDocument filter(NutchDocument doc,String url,WebPage page) throws IndexingException
NUTCH 1.4 - public NutchDocument filter(NutchDocument doc, Parse parse, Text url,CrawlDatum datum, Inlinks inlinks) throws IndexingException