This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

NeMo Curator

ACraig7 edited this page Nov 25, 2024 · 12 revisions

!!!!!!!!!DEPRECATED-Not in use!!!!!!!!!

Contents

  1. Installation
  2. Configuration
  3. Implementation
  4. Usage
  5. Troubleshooting

Installation

Further installation instructions can be found in the NeMo Curator repository.

Installing NeMo Curator in a Linux Docker image

Requirements

  • Python 3.10

Base image

  • Use the python:3.10-slim base image for a lightweight container that meets NeMo Curator's Python 3.10 requirement
FROM python:3.10-slim

Install the essential packages

  • Prepares the image by installing the system packages NeMo Curator needs to build
    • RUN apt-get update refreshes the package index
    • && runs the next command only if the previous one succeeded
    • apt-get install installs packages with the Advanced Package Tool (APT)
    • -y answers "yes" automatically to prompts
    • \ continues the command on the next line
    • build-essential is a metapackage that pulls in the compiler toolchain (gcc, g++, make)
    • python3-dev provides the header files and static libraries needed to build Python extensions
    • rm -rf /var/lib/apt/lists/* removes the downloaded package lists to keep the image small
RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*

Run the Cython installer

  • Install Cython with pip
    • -q reduces output during installation
    • -U upgrades the package to the latest version
RUN pip install -qU cython

Run the NeMo Curator installer

  • Install NeMo Curator with pip
    • -q reduces output during installation
    • -U upgrades the package to the latest version
RUN pip install -qU nemo-curator
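
Once the image builds, it can be useful to confirm that both pip installs actually landed. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution names are taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as used in the pip install commands above.
for name in ("cython", "nemo-curator"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this inside the container (for example with docker run and python) prints one line per package, which makes a failed or skipped install easy to spot.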

Configuration

NeMo Curator Dockerfile configuration

  • Set the environment variable that points to the configuration file, using ENV
ENV CURATOR_CONFIG=/app/curator_config.yaml

  • Set up the configuration file
preprocess:
  steps:
    - "custom_cleaner"
  custom_cleaner:
    path: "/path/to/custom_cleaner.py"
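
The page does not show how CURATOR_CONFIG is consumed, so the following is only a sketch of how application code could locate the file from that variable and check that it is readable before use. The helper names resolve_config_path and config_is_readable are hypothetical; the default path mirrors the ENV line above.

```python
import os
from pathlib import Path

# Default mirrors the ENV line in the Dockerfile.
DEFAULT_CONFIG = "/app/curator_config.yaml"


def resolve_config_path(env=None) -> Path:
    """Return the config path from CURATOR_CONFIG, falling back to the default."""
    env = os.environ if env is None else env
    return Path(env.get("CURATOR_CONFIG", DEFAULT_CONFIG))


def config_is_readable(path: Path) -> bool:
    """True only if the path is an existing file the current user can read."""
    return path.is_file() and os.access(path, os.R_OK)


path = resolve_config_path()
print(f"{path}: {'ok' if config_is_readable(path) else 'missing or unreadable'}")
```

A check like this at container start-up surfaces a missing COPY step or a mistyped path early, rather than at the first preprocessing call.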

Implementation

NeMo Curator Dockerfile implementation

  • Copy the existing configuration file into the image
COPY ./curator_config.yaml /app/curator_config.yaml

Usage

NeMo Curator data cleaning and preprocessing

  • Import the required class
    • Import the TextCleaningProcessor class
    • From the nemo.collections.nlp.data.preprocessors module
from nemo.collections.nlp.data.preprocessors import TextCleaningProcessor

  • Clean the imported web documents
    • Create an instance of the TextCleaningProcessor class
      • processor = TextCleaningProcessor()
    • Clean the documents and store the results in a new list
      • cleaned_docs = [processor.clean(document.page_content) for document in documents]
    • Return the cleaned documents
      • return cleaned_docs
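def clean_documents(documents):
    # Clean the page content of each loaded document and return the results
    processor = TextCleaningProcessor()
    cleaned_docs = [processor.clean(document.page_content) for document in documents]
    return cleaned_docs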
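
The cleaning pattern above can be tried in isolation, without the NeMo import, by swapping in stand-ins. In this sketch, Document, WhitespaceCleaner, and clean_all are all hypothetical: Document only mimics an object with a page_content attribute, and WhitespaceCleaner is not the TextCleaningProcessor, just a placeholder that exposes the same clean() method.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a loaded web document."""
    page_content: str


class WhitespaceCleaner:
    """Hypothetical stand-in that exposes the same clean() method as above."""

    def clean(self, text: str) -> str:
        # Collapse runs of whitespace into single spaces and trim the ends.
        return re.sub(r"\s+", " ", text).strip()


def clean_all(documents, processor):
    """Apply processor.clean to the page content of every document."""
    return [processor.clean(doc.page_content) for doc in documents]


docs = [Document("  Hello\n\nworld  "), Document("NeMo\tCurator")]
print(clean_all(docs, WhitespaceCleaner()))  # ['Hello world', 'NeMo Curator']
```

Passing the processor in as an argument keeps the list comprehension unchanged while letting the real cleaner be substituted once it is available in the image.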

Troubleshooting

Maximum recursion depth exceeded error

  1. Check whether the pydantic package conflicts with MistralAI
  2. Try installing an earlier version of Pydantic via requirements.txt
    • Add pydantic==1.0 to requirements.txt
    • Rebuild the Docker image
  3. If the conflict persists, install a newer version of Pydantic
    • e.g. add pydantic==2.5.2 to requirements.txt
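
For step 1, one way to look for a conflict is to list which installed packages declare a requirement on pydantic and with which version constraints. This is a sketch using only the standard library; the helper name requirements_on is hypothetical, and the name matching is a simple approximation rather than a full requirement parser.

```python
import re
from importlib.metadata import distributions


def requirements_on(target: str) -> dict:
    """Map each installed distribution to its declared requirements on target."""
    target = target.lower()
    found = {}
    for dist in distributions():
        for req in dist.requires or []:
            # The requirement string starts with the package name,
            # followed by extras, version specifiers, or markers.
            match = re.match(r"[A-Za-z0-9._-]+", req)
            if match and match.group(0).lower() == target:
                found.setdefault(dist.metadata["Name"], []).append(req)
    return found


for package, reqs in sorted(requirements_on("pydantic").items(), key=lambda kv: str(kv[0])):
    print(f"{package}: {', '.join(reqs)}")
```

If two packages list incompatible constraints (for example one requiring a 1.x release and another a 2.x release), that points to which pin in requirements.txt is worth trying first.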