NeMo Curator

DEPRECATED: not in use.

Installation

Further installation instructions are available in the NeMo Curator repository.

Installing NeMo Curator on Docker Linux Image

Requirements

Python 3.10

Base image

The python:3.10-slim base image keeps the container lightweight and meets NeMo Curator's Python 3.10 requirement.

FROM python:3.10-slim
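To confirm from inside the container that the interpreter matches the Python 3.10 requirement, a small check along these lines could be run. This is a minimal sketch: the helper name `is_python_310` is hypothetical, and it assumes the requirement means exactly the 3.10 series.

```python
import sys


def is_python_310(version_info=sys.version_info):
    """Return True when the interpreter is in the Python 3.10 series.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}")
    print("Meets the 3.10 requirement:", is_python_310())
```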

Run the essential packages

This step installs the essential system packages for NeMo Curator.
- RUN apt-get update refreshes the package index
- && runs the next command only if the previous one succeeded
- apt-get install installs packages with the Advanced Package Tool (APT)
- -y answers "yes" to prompts automatically
- \ continues the command on the next line
- build-essential is a meta-package that pulls in the compiler toolchain (gcc, g++, make)
- python3-dev provides the header files and static libraries needed to build Python extensions
- rm -rf /var/lib/apt/lists/* removes the downloaded package lists to keep the image small

RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*
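After the image is built, one way to sanity-check that the build toolchain is present is to look for the tools on PATH. The sketch below uses only the standard library; the helper name `missing_tools` and the default tool list (gcc, g++, make, which build-essential pulls in on Debian-based images) are illustrative assumptions.

```python
import shutil


def missing_tools(tools=("gcc", "g++", "make")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("Build toolchain found.")
```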

Run the Cython installer

Install Cython with pip
- -q reduces output during installation
- -U upgrades the package to the latest version

RUN pip install -qU cython

Run the NeMo Curator installer

Install NeMo Curator with pip
- -q reduces output during installation
- -U upgrades the package to the latest version

RUN pip install -qU nemo-curator
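Because -q suppresses most installer output, it can help to confirm afterwards which versions were actually installed. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution names are taken from the pip commands above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("cython", "nemo-curator"):
        print(name, "->", installed_version(name) or "not installed")
```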

Configuration

NeMo Curator Dockerfile configuration

Use ENV to set the CURATOR_CONFIG environment variable to the path of the configuration file inside the container

ENV CURATOR_CONFIG=/app/curator_config.yaml
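This page does not say how CURATOR_CONFIG is consumed. Assuming application code reads it, a lookup with a fallback might look like the sketch below; the helper name `config_path` and the fallback behaviour are assumptions for illustration.

```python
import os
from pathlib import PurePosixPath

# Same path as the ENV instruction above; used when the variable is unset.
DEFAULT_CONFIG = "/app/curator_config.yaml"


def config_path(environ=os.environ):
    """Return the configuration path from CURATOR_CONFIG, or the default."""
    return PurePosixPath(environ.get("CURATOR_CONFIG", DEFAULT_CONFIG))


if __name__ == "__main__":
    print("Config path:", config_path())
```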

Set up the configuration file (curator_config.yaml)

preprocess:
  steps:
    - "custom_cleaner"
  custom_cleaner:
    path: "/path/to/custom_cleaner.py"
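In the example above, each name listed under steps has a matching section with a path. Assuming that layout is required (it is inferred from the example, not from a documented schema), a quick consistency check on the parsed configuration could look like this sketch. The dictionary stands in for the parsed YAML so that the example needs no third-party parser, and `undefined_steps` is a hypothetical helper.

```python
def undefined_steps(config):
    """Return step names under preprocess.steps that lack a `path` entry."""
    preprocess = config.get("preprocess", {})
    problems = []
    for step in preprocess.get("steps", []):
        section = preprocess.get(step)
        if not isinstance(section, dict) or "path" not in section:
            problems.append(step)
    return problems


# Dictionary equivalent of the YAML example above.
config = {
    "preprocess": {
        "steps": ["custom_cleaner"],
        "custom_cleaner": {"path": "/path/to/custom_cleaner.py"},
    }
}

if __name__ == "__main__":
    print("Steps without a definition:", undefined_steps(config))
```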

Implementation

NeMo Curator Dockerfile implementation

Copy the existing configuration file into the image

COPY ./curator_config.yaml /app/curator_config.yaml

Usage

NeMo Curator data cleaning and preprocessing

Import the necessary class
- Import the TextCleaningProcessor class
- From the nemo.collections.nlp.data.preprocessors module

from nemo.collections.nlp.data.preprocessors import TextCleaningProcessor

Clean the imported web documents
- Create an instance of the TextCleaningProcessor class
  - processor = TextCleaningProcessor()
- Clean the documents and store the results in a list
  - cleaned_docs = [processor.clean(document.page_content) for document in documents]
- Return the cleaned documents
  - return cleaned_docs

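# clean_documents is an illustrative wrapper name so that the return statement is valid
def clean_documents(documents):
    processor = TextCleaningProcessor()
    cleaned_docs = [processor.clean(document.page_content) for document in documents]
    return cleaned_docs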
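The cleaning pattern above can be tried without NeMo installed by swapping in a stand-in processor. In the sketch below, `Document` and `WhitespaceCleaner` are hypothetical stand-ins (not NeMo Curator classes); the only assumptions carried over from this page are that each document exposes `page_content` and that the processor exposes a `clean` method.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    """Stand-in for a loaded web document exposing `page_content`."""
    page_content: str


class WhitespaceCleaner:
    """Hypothetical stand-in cleaner; not the NeMo Curator implementation."""

    def clean(self, text):
        # Collapse runs of whitespace into single spaces and trim both ends.
        return re.sub(r"\s+", " ", text).strip()


def clean_documents(documents, processor):
    """Apply processor.clean to the page_content of every document."""
    return [processor.clean(document.page_content) for document in documents]


if __name__ == "__main__":
    docs = [Document("  Hello \n\n world  "), Document("NeMo\tCurator")]
    print(clean_documents(docs, WhitespaceCleaner()))
```

Passing the processor in as an argument keeps the loop independent of any particular cleaner, so a real processor with the same `clean` method could be substituted later.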

Troubleshooting

Maximum recursion depth exceeded error

Check whether the pydantic package conflicts with MistralAI.
Try installing an earlier version of Pydantic via requirements.txt:
- Add pydantic==1.0 to requirements.txt
- Build the Docker image
If the conflict persists, install a newer version of Pydantic:
- e.g. add pydantic==2.5.2 to requirements.txt
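Before changing the pin, it can be useful to see which Pydantic major version is currently installed in the image. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper names `major_version` and `pydantic_major` are hypothetical.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '2.5.2'."""
    return int(version.split(".")[0])


def pydantic_major():
    """Return the installed pydantic major version, or None if absent."""
    try:
        return major_version(metadata.version("pydantic"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = pydantic_major()
    print("pydantic major version:", major if major is not None else "not installed")
```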
