Load text files (PDF, DOC, etc.), transform and chunk them, and upload the results to the Clarifai Platform.
- File Partitioning
- Cleaning Chunks
- Metadata Extraction
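
To make the three stages above concrete, here is a minimal, standard-library sketch of what partitioning, cleaning, and metadata extraction could look like. It is an illustration only, not how `clarifai_datautils` implements them: the function names (`partition`, `clean_extra_whitespace`, `extract_metadata`, `run_stages`), the paragraph-packing strategy, and the metadata fields are all assumptions made for this example.

```python
import re
from typing import Dict, List


def partition(text: str, max_characters: int = 1024) -> List[str]:
    """Hypothetical partitioner: pack blank-line-separated paragraphs
    into chunks of at most `max_characters` characters."""
    chunks: List[str] = []
    current = ""
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_characters:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split a paragraph that is longer than the limit on its own.
        while len(paragraph) > max_characters:
            chunks.append(paragraph[:max_characters])
            paragraph = paragraph[max_characters:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


def clean_extra_whitespace(chunk: str) -> str:
    """Hypothetical cleaner: collapse whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", chunk).strip()


def extract_metadata(chunk: str, source: str, index: int) -> Dict[str, object]:
    """Hypothetical metadata: where the chunk came from and how large it is."""
    return {"source": source, "chunk_index": index, "num_characters": len(chunk)}


def run_stages(text: str, source: str, max_characters: int = 1024) -> List[dict]:
    """Apply partition -> clean -> metadata extraction, in that order."""
    records = []
    for index, chunk in enumerate(partition(text, max_characters)):
        cleaned = clean_extra_whitespace(chunk)
        records.append({"text": cleaned,
                        "metadata": extract_metadata(cleaned, source, index)})
    return records
```

Cleaning runs after partitioning in this sketch so that paragraph boundaries are still visible when the text is split; a different ordering may suit other inputs.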
To use the Data Ingestion Pipeline, first install the dependencies:
pip install -r requirements-dev.txt
from clarifai_datautils.text import Pipeline, PDFPartition
from clarifai_datautils.text.pipeline.cleaners import Clean_extra_whitespace
# Define the pipeline
pipeline = Pipeline(
    name='pipeline-1',
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_extra_whitespace()
    ]
)
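# Upload the processed chunks with the Clarifai SDK
from clarifai.client import Dataset

# dataset_url and file_path are placeholders: set them to your
# dataset's URL and the path of the file to ingest.
dataset = Dataset(dataset_url)
dataset.upload_dataset(pipeline.run(files=file_path, loader=True))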
- Text(.txt)
- Docx
- Markdown
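
Assuming the list above names file types the pipeline accepts alongside PDF, a small helper like the following could gather candidate files from a directory before ingestion. The extension set and the `collect_supported_files` name are assumptions for illustration and are not part of `clarifai_datautils`.

```python
from pathlib import Path
from typing import List

# Assumed mapping from the listed formats to file extensions.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx", ".md"}


def collect_supported_files(root: str) -> List[str]:
    """Recursively return sorted paths under `root` with a supported extension."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Each returned path could then stand in for the `file_path` placeholder used in the upload snippet.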