NeMo Curator
**DEPRECATED: not in use**
Further installation instructions can be found in the NeMo Curator repository.
- Python 3.10
- Use the `python:3.10-slim` base image for a lightweight container that meets NeMo Curator's requirements

```dockerfile
FROM python:3.10-slim
```
- Prepare the image by installing the essential packages NeMo Curator needs:
  - `RUN apt-get update` updates the package index
  - `&&` runs the next command only if the previous one succeeded
  - `apt-get install` installs packages using the Advanced Package Tool
  - `-y` answers "yes" automatically
  - `\` allows the command to continue on the next line
  - `build-essential` is a meta-package that installs common build tools (compiler, make, etc.)
  - `python3-dev` provides the necessary header files and static libraries
  - `rm -rf /var/lib/apt/lists/*` removes the cached package files

```dockerfile
RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*
```
- Run the pip installer for Cython:
  - `-q` reduces output during installation
  - `-U` upgrades the package to its latest version

```dockerfile
RUN pip install -qU cython
```
- Run the pip installer for NeMo Curator:
  - `-q` reduces output during installation
  - `-U` upgrades the package to its latest version

```dockerfile
RUN pip install -qU nemo-curator
```
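Optionally, an import check can follow the install so the build fails early if the package is broken. This line is not on the original page, and it assumes the distribution's import name is `nemo_curator`:

```dockerfile
# Optional sanity check (assumption: the package imports as nemo_curator)
RUN python -c "import nemo_curator"
```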
- Set environment variables with `ENV`:

```dockerfile
ENV CURATOR_CONFIG=/app/curator_config.yaml
```
- Set up the configuration file (`curator_config.yaml`):

```yaml
preprocess:
  steps:
    - "custom_cleaner"
  custom_cleaner:
    path: "/path/to/custom_cleaner.py"
```
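As a quick local sanity check (not part of the Dockerfile, and not NeMo Curator's own config loader), the file can be parsed with PyYAML to confirm the structure before building the image:

```python
import os
import yaml  # PyYAML

# Use the same variable the Dockerfile sets; fall back to the local file
config_path = os.environ.get("CURATOR_CONFIG", "curator_config.yaml")

with open(config_path) as f:
    config = yaml.safe_load(f)

# Values taken from the example configuration above
print(config["preprocess"]["steps"])                   # ['custom_cleaner']
print(config["preprocess"]["custom_cleaner"]["path"])  # /path/to/custom_cleaner.py
```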
- Copy the existing configuration file into the image:

```dockerfile
COPY ./curator_config.yaml /app/curator_config.yaml
```
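Putting the steps above together, the Dockerfile described on this page looks roughly like the following sketch (no entrypoint is defined here, so none is shown):

```dockerfile
# Lightweight Python 3.10 base image that meets NeMo Curator's requirements
FROM python:3.10-slim

# Build tools and Python headers needed to compile dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Cython first, then NeMo Curator
RUN pip install -qU cython
RUN pip install -qU nemo-curator

# Point at the configuration file and copy it into the image
ENV CURATOR_CONFIG=/app/curator_config.yaml
COPY ./curator_config.yaml /app/curator_config.yaml
```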
- Import the necessary module and class:
  - Import the `TextCleaningProcessor` class from the `nemo.collections.nlp.data.preprocessors` module

```python
from nemo.collections.nlp.data.preprocessors import TextCleaningProcessor
```
- Clean the imported web documents:
  - Create an instance of the `TextCleaningProcessor` class: `processor = TextCleaningProcessor()`
  - Clean each document and store the results in a list: `cleaned_docs = [processor.clean(document.page_content) for document in documents]`
  - Return the cleaned documents: `return cleaned_docs`
- Combined:

```python
processor = TextCleaningProcessor()
cleaned_docs = [processor.clean(document.page_content) for document in documents]
return cleaned_docs
```
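A minimal runnable sketch of this step, assuming the import shown above resolves and that `documents` is a list of objects with a `page_content` attribute (e.g. LangChain `Document`s); the wrapper name `clean_documents` is illustrative, not from this page:

```python
from nemo.collections.nlp.data.preprocessors import TextCleaningProcessor


def clean_documents(documents):
    """Clean the raw text of loaded web documents."""
    # One processor instance is reused for every document
    processor = TextCleaningProcessor()

    # Clean each document's text and collect the results
    cleaned_docs = [processor.clean(document.page_content) for document in documents]

    # Hand the cleaned text back to the next stage of the pipeline
    return cleaned_docs
```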
- Check for conflicts between the `pydantic` package and MistralAI
- Try installing an earlier version of Pydantic via `requirements.txt`:
  - Add `pydantic==1.0` to `requirements.txt`
  - Build the Docker image
- If the conflict still exists, install a newer version of Pydantic instead:
  - i.e. add `pydantic==2.5.2` to `requirements.txt` and rebuild
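A sketch of how the pin could be wired in, assuming the project installs its dependencies from a `requirements.txt` copied into the image; the file layout and the image tag are illustrative, not taken from this page:

```dockerfile
# requirements.txt (illustrative) would contain the pin, e.g.:
#   pydantic==1.0        # or pydantic==2.5.2 if the conflict persists

# Install the pinned requirements inside the image
COPY ./requirements.txt /app/requirements.txt
RUN pip install -q -r /app/requirements.txt

# Rebuild the image after changing the pin, e.g.:
#   docker build -t nemo-curator-app .
```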