I am reading the script for reproducing FineWeb, and I noticed that in the first pipeline you use Trafilatura to extract text out of the WARC records:
main_processing_executor = SlurmPipelineExecutor(
    job_name=f"cc_{DUMP_TO_PROCESS}",
    pipeline=[
        WarcReader(
            f"s3://commoncrawl/crawl-data/{DUMP_TO_PROCESS}/segments/",
            glob_pattern="*/warc/*",  # we want the warc files
            default_metadata={"dump": DUMP_TO_PROCESS},
        ),
        URLFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/1_url/{DUMP_TO_PROCESS}")),
        Trafilatura(favour_precision=True),
        LanguageFilter(
            ...
and the `deduplicate` flag of the `Trafilatura` class defaults to `True`; see datatrove/src/datatrove/pipeline/extractors/trafilatura.py, line 27 at commit c7f6f51:
class Trafilatura(BaseExtractor):
    """Trafilatura extractor, it uses https://trafilatura.readthedocs.io/en/latest/index.html

    We're actually only using the main entry point of trafilatura: the `extract` function.
    No specific data structure is exchanged with Trafilatura, only the text is passed and the extracted text is returned.
    Alternatively and identically, `trafilatura` could be used through its command line main interface.

    Args:
        favour_precision: prefer less text but correct extraction.
        include_images: not implemented currently
        timeout: the timeout for extraction, per document, in seconds
        deduplicate: trafilatura's deduplicate option
        **kwargs: any other option will be passed to trafilatura
    """

    name = "⛏ Trafilatura"
    _requires_dependencies = ["trafilatura"]

    def __init__(
        self,
        favour_precision: bool = True,
        include_images: bool = False,
        timeout: float = 0.1,
        deduplicate: bool = True,
        **kwargs,
    ):
        super().__init__(timeout)
        self.favour_precision = favour_precision
        self.include_images = include_images
        self.deduplicate = deduplicate
        self.kwargs = kwargs
        if self.include_images:
            raise NotImplementedError
    ...
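For my own understanding, here is a minimal standalone sketch (not from the repo; the HTML sample is made up) of what I believe this flag controls at the trafilatura level, namely dropping repeated text segments during extraction:

```python
# Sketch of trafilatura's `deduplicate` option in isolation; I assume this is the
# option that the datatrove extractor forwards. The HTML sample is made up, and
# whether a repeated segment actually gets dropped also depends on trafilatura's
# internal thresholds (e.g. minimum segment length), so treat this as illustrative.
from trafilatura import extract

html = """
<html><body>
  <p>This exact boilerplate sentence is repeated several times on the page and is long enough to be considered for duplicate filtering by the extractor.</p>
  <p>Some unique article content that should always be kept in the output.</p>
  <p>This exact boilerplate sentence is repeated several times on the page and is long enough to be considered for duplicate filtering by the extractor.</p>
  <p>This exact boilerplate sentence is repeated several times on the page and is long enough to be considered for duplicate filtering by the extractor.</p>
</body></html>
"""

# With deduplicate=True, repeated elements/paragraphs can be filtered out.
print(extract(html, favor_precision=True, deduplicate=True))

# With deduplicate=False, repeated segments are kept as-is.
print(extract(html, favor_precision=True, deduplicate=False))
```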
Does that mean FineWeb uses the element- and paragraph-level dedup feature provided by Trafilatura by default? I am also wondering how this flag affects the final dataset, i.e., what happens if I set `deduplicate=False` here?
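To be concrete, the change I have in mind only touches the Trafilatura step of the pipeline above, i.e. something like:

```python
from datatrove.pipeline.extractors.trafilatura import Trafilatura

# Same step as in the pipeline above, but with trafilatura's internal
# deduplication explicitly turned off (instead of the default True).
extractor = Trafilatura(favour_precision=True, deduplicate=False)
```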