This repository has been archived by the owner on Nov 21, 2018. It is now read-only.

Commit

Added code boilerplate, tweaked documentation.
jimmy0017 committed Jul 14, 2013
1 parent f790690 commit d1503b5
Showing 31 changed files with 461 additions and 18 deletions.
7 changes: 7 additions & 0 deletions HISTORY.md
@@ -1,3 +1,10 @@
Version 0.2
===========
July 14, 2013

+ Refactored package layout, separating ClueWeb12-specific classes
+ Added PFor-compressed DocVector, in addition to the previous VByte-compressed version

Version 0.1
===========
July 10, 2013
25 changes: 17 additions & 8 deletions README.md
@@ -3,7 +3,6 @@ ClueWeb Tools

This is a collection of tools for manipulating the [ClueWeb12 collection](http://lemurproject.org/clueweb12/).


Getting Started
--------------

@@ -108,24 +107,34 @@ Building Document Vectors

With the dictionary, we can now convert the entire collection into a sequence of document vectors, where each document vector is represented by a sequence of termids; the termids map to the sequence of terms that comprise the document. These document vectors are much more compact and much faster to scan for processing purposes.
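
As a rough illustration of the mapping described above, the following sketch turns a tokenized document into a sequence of termids. It is not the project's implementation: the class `DocVectorSketch`, the method `toTermids`, the use of a plain `HashMap` as the dictionary, and the choice to skip out-of-vocabulary terms are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DocVectorSketch {
  // Map each term to its termid, preserving document order.
  // Terms missing from the dictionary are skipped (an assumption for this sketch).
  static int[] toTermids(String[] terms, Map<String, Integer> dictionary) {
    int[] ids = new int[terms.length];
    int n = 0;
    for (String term : terms) {
      Integer id = dictionary.get(term);
      if (id != null) {
        ids[n++] = id;
      }
    }
    return Arrays.copyOf(ids, n);
  }

  public static void main(String[] args) {
    // Hypothetical three-term dictionary.
    Map<String, Integer> dictionary = new HashMap<>();
    dictionary.put("clueweb", 0);
    dictionary.put("hadoop", 1);
    dictionary.put("tools", 2);

    String[] document = {"hadoop", "tools", "for", "clueweb", "tools"};
    System.out.println(Arrays.toString(toTermids(document, dictionary)));
  }
}
```

Because the termids follow document order, the original term sequence can be recovered by looking each id up in the reverse mapping.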

To build document vectors, issue the following command:
The document vector is represented by the interface `org.clueweb.data.DocVector`. Currently, there are two concrete implementations:

+ `VByteDocVector`, which uses Hadoop's built-in utilities for writing variable-length integers (what Hadoop calls VInt).
+ `PForDocVector`, which uses PFor compression from Daniel Lemire's [JavaFastPFOR](https://github.com/lemire/JavaFastPFOR/) package.
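
To give a feel for why variable-length integers shrink a termid sequence, here is a minimal sketch of the textbook variable-byte scheme (7 payload bits per byte, high bit set when more bytes follow). It is an assumption-laden illustration only: Hadoop's VInt uses a different wire format, and the class `VByteSketch` and its methods are hypothetical names, not part of this repository.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class VByteSketch {
  // Encode each int in 1 to 5 bytes: low 7 bits first, high bit marks continuation.
  static byte[] encode(int[] termids) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int id : termids) {
      int v = id;
      while ((v & ~0x7F) != 0) {
        out.write((v & 0x7F) | 0x80);
        v >>>= 7;
      }
      out.write(v);
    }
    return out.toByteArray();
  }

  // Decode exactly `count` integers from the byte stream produced by encode().
  static int[] decode(byte[] bytes, int count) {
    int[] result = new int[count];
    int pos = 0;
    for (int i = 0; i < count; i++) {
      int value = 0;
      int shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      result[i] = value;
    }
    return result;
  }

  public static void main(String[] args) {
    int[] termids = {3, 130, 70000, 3};
    byte[] packed = encode(termids);
    // Four ints would take 16 bytes uncompressed.
    System.out.println(packed.length + " bytes for " + termids.length + " termids");
    System.out.println(Arrays.toString(decode(packed, termids.length)));
  }
}
```

Small termids cost a single byte, so assigning low ids to frequent terms makes the encoding more compact.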

To build document vectors, use either `BuildVByteDocVectors` or `BuildPForDocVectors`:

```
hadoop jar target/clueweb-tools-0.X-SNAPSHOT-fatjar.jar \
org.clueweb.clueweb12.app.BuildDocVectors \
org.clueweb.clueweb12.app.Build{VByte,PFor}DocVectors \
-input /data/private/clueweb12/Disk1/ClueWeb12_00/*/*.warc.gz \
-output /data/private/clueweb12/derived/docvectors.20130710/segment00 \
-dictionary /data/private/clueweb12/derived/dictionary.20130710 \
-output /data/private/clueweb12/derived/docvectors/segment00 \
-dictionary /data/private/clueweb12/derived/dictionary \
-reducers 100
```

Once again, it's advisable to run on a segment at a time in order to keep the Hadoop job sizes manageable. Note that the program runs identity reducers to repartition the document vectors into 100 parts (to avoid the small files problem).

The output directory will contain `SequenceFile`s, with a `Text` containing the WARC-TREC-ID as the key, and a `BytesWritable` object as the value. The value contains a stream of integers written using Hadoop's VInt util (for writing variable-length integers).
The output directory will contain `SequenceFile`s, with a `Text` containing the WARC-TREC-ID as the key. For VByte, the value will be a `BytesWritable` object; for PFor, the value will be an `IntArrayWritable` object.

To process these document vectors, either use `ProcessVByteDocVectors` or `ProcessPForDocVectors` in the `org.clueweb.clueweb12.app` package, which provides sample code for consuming these document vectors and converting the termids back into terms.

To read back these document vectors, check out `org.clueweb.clueweb12.app.ProcessDocVectors` for some demo code.
Size comparisons, on the entire ClueWeb12 collection:

The entire ClueWeb12 collection, converted into document vectors, occupies roughly 1.08 TB (compared to 5.54 TB compressed in the original WARC files).
+ 5.54 TB: original compressed WARC files
+ 1.08 TB: repackaged as `VByteDocVector`s
+ 0.86 TB: repackaged as `PForDocVector`s
+ ~1.6 TB: uncompressed termids (the collection contains roughly 400 billion terms)
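
The figures above can be sanity-checked with a little arithmetic. The sketch below assumes each uncompressed termid is a 4-byte `int`, that 1 TB means 10^12 bytes, and that the collection holds exactly 400 billion terms (the README only says "around"); the class and method names are hypothetical.

```java
public class SizeEstimate {
  static final double TB = 1e12; // decimal terabyte (assumption)

  // Size in TB of `terms` termids stored at a fixed width.
  static double rawTermidTerabytes(double terms, int bytesPerTermid) {
    return terms * bytesPerTermid / TB;
  }

  // Average bytes spent per term for a representation of the given size.
  static double bytesPerTerm(double terabytes, double terms) {
    return terabytes * TB / terms;
  }

  public static void main(String[] args) {
    double terms = 400e9; // approximate collection size
    System.out.printf("uncompressed: %.2f TB%n", rawTermidTerabytes(terms, 4));
    System.out.printf("VByte: %.2f bytes/term%n", bytesPerTerm(1.08, terms));
    System.out.printf("PFor:  %.2f bytes/term%n", bytesPerTerm(0.86, terms));
  }
}
```

Under these assumptions the uncompressed estimate comes out to 1.6 TB, and the two compressed forms average roughly 2.7 and 2.15 bytes per term.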

License
-------
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/BuildDictionary.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import it.unimi.dsi.sux4j.mph.TwoStepsLcpMonotoneMinimalPerfectHashFunction;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/BuildPForDocVectors.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/BuildVByteDocVectors.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.BufferedReader;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/ComputeTermStatistics.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/CountClueWarcRecords.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import org.apache.commons.cli.CommandLine;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/MergeTermStatistics.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/clueweb12/app/ProcessPForDocVectors.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.app;

import java.io.IOException;
@@ -1,7 +1,5 @@
package org.clueweb.clueweb12.mapred;

/*
* Cloud9: A Hadoop toolkit for working with big data
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
@@ -16,6 +14,8 @@
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.mapred;

/**
* Hadoop FileInputFormat for reading WARC files
*
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.clueweb12.mapreduce;

import java.io.DataInputStream;
9 changes: 2 additions & 7 deletions src/main/java/org/clueweb/data/ClueWarcRecord.java
@@ -1,7 +1,5 @@
package org.clueweb.data;

/*
* Cloud9: A Hadoop toolkit for working with big data
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
@@ -50,10 +48,7 @@
* @author [email protected] (Mark J. Hoy)
*/

/**
* @author samar
* Modified for WARC 1.0 version used in clueweb12
*/
package org.clueweb.data;

import java.io.DataInput;
import java.io.DataInputStream;
16 changes: 16 additions & 0 deletions src/main/java/org/clueweb/data/DocVector.java
@@ -1,3 +1,19 @@
/*
* ClueWeb Tools: Hadoop tools for manipulating ClueWeb collections
*
* Licensed under the Apache License, Version 2.0 (the "License"); you
* may not use this file except in compliance with the License. You may
* obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied. See the License for the specific language governing
* permissions and limitations under the License.
*/

package org.clueweb.data;

public interface DocVector {