moving current wip to main (#6)
* wip: putting together the scaffolding

* demo working

* wip: examples

* added examples command

* wip

* setting interface output colors here results in ascii chars sent to redis

* added gems

* added example

* added provider placeholder

* wip: flowise api

* added helper module from monadic-chat, wip: flowise api working

* added setup instructions for python libs

* wip ToT example, workflow architecture

* added original example

* moved cartridges to nano-bot registry

* wip

* added singleton class for spacy tasks

* wip ERROR -- : No valid words found in the provided documents

* Moved the require statement for text_processing_workflow to after other component requires.

* Changed logging level from DEBUG to INFO.
Commented out most of the binding.pry breakpoints.
Updated the AdvancedAnalysisTask:
   - Modified the file path for the advanced_analysis_cartridge.yml.
   - Changed the prompt for analysis to generate a short narrative.

* Added more detailed logging during document processing.
Modified the training process:
   - Now trains in iterations, printing progress.
   - Outputs more detailed model statistics.
Updated the infer_topics method:
   - Now uses make_doc method.
   - Handles case where topic inference fails.
   - Identifies and returns the most probable topic.
   - Prints full topic distribution.

* Removed unused imports and dependencies,
Reorganized require statements in flowbots.rb,
Deleted topic_modeler.rb file,
Simplified TextProcessor and TextSegmenter classes,
Updated TextProcessingWorkflow to use get_topics,
Removed Redis initialization from WorkflowOrchestrator

* - Modularized topic modeling functionality
- Improved error handling and logging
- Updated Docker configuration
- Removed unused segmentation code
- Enhanced configuration management
- Adjusted file paths and dependencies
- Updated nano-bots submodule

* - Extracted train_model and infer_topics methods
- Improved error handling and logging throughout
- Removed redundant code and improved readability
- Added logger initialization in the constructor

* adding tty-box functions

* moved workflows, renamed components

* future utils

* wip: error handler

* wip: ui

* added error handling cartridge

* separated cli module from main

* a nice and accurate exceptionhandler :)

* snapshot

* wip: almost back together

* working in ohm

* adjusted to output exception reports in markdown

* 1. ExceptionAgent improvements:
   - Removed the "Relevant Files" section from exception reports, simplifying the output.

2. TopicModelProcessor enhancements:
   - Improved model loading and creation process with a new `load_or_create_model` method.
   - Enhanced `process` method to handle empty documents and ensure model existence.
   - Improved `train_model` method with better handling of empty documents and word counting.
   - Added more robust error handling and logging throughout.
   - Improved `save_model` method with checks for directory existence, write permissions, and disk space.
   - Enhanced `store_topics` method with better error handling and logging.

3. Task structure changes:
   - Modified the base `Task` class to no longer inherit from `Jongleur::WorkerTask`.
   - Updated specific task classes (LlmAnalysisTask, NlpAnalysisTask, TopicModelingTask) to inherit directly from `Jongleur::WorkerTask`.

4. UI improvements:
   - Simplified the `info` method in the UI module.

5. TextProcessingWorkflow updates:
   - Commented out some workflow steps (process_input, run_nlp_analysis, run_topic_modeling) in the `execute` method.
   - Changed logging to use UI.info instead of logger.info in the `run_workflow` method.
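
The `save_model` checks listed above (directory existence and write permissions) could look roughly like the following. This is a minimal sketch, not the repository's code: the method name `ready_to_save?` and the paths are hypothetical, and the disk-space check mentioned in the message is left out.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical pre-save check; the name and messages are illustrative only.
# Returns true when the target directory exists (creating it if needed)
# and the current user can write to it. The disk-space check is omitted.
def ready_to_save?(model_path)
  dir = File.dirname(model_path)
  FileUtils.mkdir_p(dir) unless File.directory?(dir)
  File.writable?(dir)
rescue SystemCallError => e
  # Raised, for example, when a parent directory is read-only.
  warn "cannot prepare #{dir}: #{e.message}"
  false
end

# Usage: check a path inside a throwaway directory.
Dir.mktmpdir do |tmp|
  target = File.join(tmp, "models", "topic_model.bin")
  puts "ready to save: #{ready_to_save?(target)}"
end
```

Running the check before serializing lets the caller log a clear reason instead of failing halfway through a write.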

* set messages to print

* added cartridges

* snapshot: working

* wip: created task to display results

* removed redundant includes

* assets

* added cartridges

* set text segmentation as its own task

* added Fileloader task, added tokenizer, adjusted ohm models

* wip: working, set batch_size or else large datasets overflow mem

* Refactor topic modeling workflow and improve text processing pipeline

This commit significantly updates the topic modeling workflow and text processing pipeline, improving efficiency and adding new features:

1. TopicModelTrainerWorkflow:
   - Implement batch processing with BATCH_SIZE constant
   - Add flush_redis_cache method for clean slate processing
   - Refactor process_files method to handle batches
   - Implement train_topic_model method with cleaning and filtering
   - Add clean_segments_for_modeling method to improve data quality

2. Task Updates:
   - Modify LoadTextFilesTask to process single files
   - Update TextSegmentTask, TokenizeSegmentsTask, and NlpAnalysisTask for single file processing
   - Refactor FilterSegmentsTask with improved logging and error handling
   - Add AccumulateFilteredSegmentsTask for batch accumulation
   - Update TrainTopicModelTask to handle accumulated segments

3. LLM Analysis:
   - Refactor LlmAnalysisTask to use preprocessed content and file metadata
   - Implement generate_analysis_prompt method for better LLM input

4. UI Improvements:
   - Add BoxUI module with side_by_side_boxes method for improved result display
   - Update DisplayResultsTask to use new BoxUI for better visualization

5. NLP Processing:
   - Refactor NLPProcessor to return more detailed token information
   - Update NlpAnalysisTask to handle new NLP processor output

6. Miscellaneous:
   - Remove unused code and comments
   - Update error handling and logging across multiple files
   - Improve code organization and readability

This refactoring enhances the workflow's ability to handle large datasets efficiently, improves the quality of topic modeling input, and provides better visualization of results.
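
The batch handling described in this commit message can be sketched as follows. It is an illustration under stated assumptions, not the workflow's actual code: `process_in_batches` is a hypothetical name, the `BATCH_SIZE` value is invented (the message does not give it), and the cleaning step only drops blank segments.

```ruby
# Hypothetical sketch; names and the batch size are assumptions.
BATCH_SIZE = 3

# Walk the file list in fixed-size batches so raw file contents are
# handled one batch at a time. The block receives a path and returns
# that file's segments.
def process_in_batches(paths, batch_size: BATCH_SIZE)
  accumulated = []
  paths.each_slice(batch_size).with_index(1) do |batch, index|
    segments = batch.flat_map { |path| yield(path) }
    # Drop empty or whitespace-only segments before they reach the model.
    cleaned = segments.map(&:strip).reject(&:empty?)
    accumulated.concat(cleaned)
    puts "batch #{index}: #{batch.size} files, #{cleaned.size} segments kept"
  end
  accumulated
end

# Usage with stand-in file contents instead of real files.
paths = %w[a.md b.md c.md d.md]
segments = process_in_batches(paths) { |path| ["text from #{path}", "  "] }
puts "#{segments.size} segments accumulated for training"
```

Accumulating only the cleaned segments keeps the training input small relative to the raw files, which is the point of filtering before the model sees the data.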

* added treetop grammar, working on clean interrupt

* wip: grammar parser

* Refactor text processing workflow and improve YAML front matter parsing

- Update GrammarProcessor to use Treetop grammar file
- Simplify markdown_yaml.treetop grammar for better YAML parsing
- Enhance PreprocessTextFileTask with improved error handling and logging
- Modify TextSegmentTask to use preprocessed content
- Add parallel processing support to flowbots.rb
- Update CLI to use TopicModelTrainerWorkflow instead of test version
- Improve error logging and context in GrammarProcessor
- Enhance WorkflowOrchestrator cleanup process

This commit significantly improves the text processing pipeline, 
particularly in handling YAML front matter in Markdown files. It also 
adds better error handling and logging throughout the workflow.
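
For readers unfamiliar with YAML front matter, the split between metadata and body can be sketched as below. This sketch uses a regular expression and Ruby's standard YAML library rather than the Treetop grammar the commit refers to; `split_front_matter` and `FRONT_MATTER` are hypothetical names.

```ruby
require "yaml"

# Matches a leading "---" block and captures the YAML text and the body.
FRONT_MATTER = /\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\z)(.*)\z/m

# Hypothetical helper: returns [metadata_hash, body]. Falls back to an
# empty hash and the untouched input when no valid front matter is found.
def split_front_matter(markdown)
  match = FRONT_MATTER.match(markdown)
  return [{}, markdown] unless match

  metadata = YAML.safe_load(match[1])
  metadata = {} unless metadata.is_a?(Hash)
  [metadata, match[2]]
rescue Psych::SyntaxError => e
  warn "front matter could not be parsed: #{e.message}"
  [{}, markdown]
end

# Usage with an inline document.
doc = "---\ntitle: Notes\ntags:\n  - ruby\n---\n# Heading\nBody text.\n"
metadata, body = split_front_matter(doc)
puts metadata.inspect
puts body
```

Returning the original text on failure means later steps such as segmentation still have content to work with when a file has malformed metadata.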

* this works at least

* update readme

* set preprocess task to get the current_textfile_id in the workflow

* add engtagger task wip: text compressor

* added rdocs

* documentation

* extras

* fix: linear logic for detecting file type

* wip

* Refactor tasks and implement uniform input retrieval (Epics 1 & 2)

* added lemmas ohm model

* ui improvements

* UI improvements

* cartridge updates

* ui improvements

* adjusted readme

* updated readme, results

* wip

* adjusted nano-bots

* Key changes include:

Renaming the Textfile model to FileObject.
Updating all references to Textfile to FileObject.
Modifying the FileLoader class to use the FileObject model.
Updating the InputRetrieval module to retrieve FileObject instances.
Adjusting the RedisKeys module to use keys related to FileObject.
Updating tasks and workflows to use the FileObject model.

* Adjust logging settings and enhance TextProcessingWorkflow

- Reduce log file max size to 2,145,728 bytes
- Increase max number of log files to 100
- Comment out flush_redis_cache in unified_file_processing
- Add batch mode to TextProcessingWorkflow
- Implement separate processing for batch and single file modes
- Add methods for fetching unprocessed file IDs and creating/fetching file objects
- Update perform_additional_tasks to work with specific file IDs

* added SI chars

* doc updates

* snapshot

* snapshot

* readme edits

* snapshot
b08x authored Sep 25, 2024
1 parent 94377d1 commit 844ca60
Showing 434 changed files with 41,348 additions and 133 deletions.
2 changes: 2 additions & 0 deletions .env.example
@@ -0,0 +1,2 @@
COHERE_API_KEY=
GEMINI_API_KEY=
22 changes: 21 additions & 1 deletion .gitignore
@@ -2,9 +2,29 @@
/.yardoc
/_yardoc/
/coverage/
/doc/
/pkg/
/site/
/spec/reports/
/tmp/
/Gemfile.lock
/docker/data/
.env
/docker/.env/
/log/
/vendor/
/docker/*.pem
/docs/agenta.json
.venv/
/docker/data.tar.gz
flowbots.json
/docker/docker-compose.yml
/models/
/examples/test/
/data/
/.bold/
/.examples/
/.vendor/
/.docs/
/exception_reports/
*.text.json
.gh_pages
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "nano-bots"]
path = nano-bots
url = git@github.com:b08x/nano-bots.git
131 changes: 131 additions & 0 deletions Dockerfile
@@ -0,0 +1,131 @@
# Use Ruby 3.3 as the base image
FROM ruby:3.3-slim

# Create a non-root user to run the app
RUN useradd -s /bin/bash -m flowbots

# Install system dependencies
RUN apt-get update && apt-get install -y \
--no-install-recommends \
apt-transport-https \
apt-utils \
build-essential \
ca-certificates \
cmake \
curl \
dialog \
exiftool \
git \
gnupg \
gnuplot \
gpg-agent \
graphviz \
libcairo2-dev \
libczmq-dev \
libffi-dev \
libfftw3-dev \
libgdbm-dev \
libgmp-dev \
libgsl-dev \
liblink-grammar-dev \
libmagick++-dev \
libmariadb-dev-compat \
libmariadb-dev \
libncurses5-dev \
libopenblas-dev \
libplot2c2 \
libpoppler-glib-dev \
libpq-dev \
libreadline-dev \
libreoffice \
libsqlite3-dev \
libssl-dev \
libtamuanova-0.2 \
libxml2-dev \
libxslt1-dev \
libyaml-dev \
libzmq3-dev \
link-grammar \
lsb-release \
minisat \
neovim \
openssl \
pandoc \
pdftk \
pkg-config \
plotutils \
poppler-utils \
postgresql-client \
python3 \
python3-link-grammar \
python3-pip \
python3.11-venv \
rsync \
ruby-psych \
software-properties-common \
sqlite3 \
tesseract-ocr \
tidy \
tzdata \
wget \
zip \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*

# Set the working directory in the container
WORKDIR /app

ARG USE_TRF=False
ARG USE_BOOKNLP=False

RUN python3 -m venv .venv && \
. /app/.venv/bin/activate && \
echo "[[ -d /app/.venv ]] && cd /app && . /app/.venv/bin/activate" >> /home/flowbots/.bashrc && \
echo "gem: --user-install --no-document" >> /home/flowbots/.gemrc && \
pip3 install -U setuptools wheel && \
pip3 install -U spacy 'pdfminer.six[image]' && \
python3 -m spacy download en_core_web_lg && \
python -c "import sys, importlib.util as util; 1 if util.find_spec('nltk') else sys.exit(); import nltk; nltk.download('punkt')"

RUN if [ "${USE_TRF}" = "True" ]; then \
. /app/.venv/bin/activate && \
python3 -m spacy download en_core_web_trf \
; fi

RUN if [ "${USE_BOOKNLP}" = "True" ]; then \
. /app/.venv/bin/activate && \
pip3 install -U transformers booknlp \
; fi

# Copy only the Gemfile
COPY Gemfile ./

# Copy only the specified application directories and files
COPY bin/ ./bin/
COPY examples/ ./examples/
COPY exe/ ./exe/
COPY lib/ ./lib/
COPY nano-bots/ ./nano-bots/
COPY flowbots.json .

# Set environment variables
ENV LANG=C.UTF-8 \
LC_ALL=C.UTF-8

# Create necessary directories
RUN mkdir -p log models workspace

RUN chown -R flowbots:flowbots /app

USER flowbots

ENV PATH="/home/flowbots/.local/share/gem/ruby/3.3.0/bin:$PATH"
ENV PATH="/app/.venv/bin:$PATH"

RUN bundle lock --add-platform x86_64-linux && \
bundle config build.redic --with-cxx="clang++" --with-cflags="-std=c++0x" && \
bundle install

# Set the default command (can be overridden)
CMD . .venv/bin/activate && exec bash
68 changes: 64 additions & 4 deletions Gemfile
@@ -1,16 +1,76 @@
source "https://rubygems.org"
gemspec

# gemspec

gem "algorithms", "~> 1.0"
gem "ansi_palette", "~> 0.0.1"
gem "chroma-db", "~> 0.7.0"
gem "cli-ui", "~> 2.2"
gem "dotenv", "~> 3.1"
gem "groq", "~> 0.3.1"
gem "highline", "~> 3.0"
gem "jongleur", "~> 1.1"
gem "json", "~> 2.7"
gem "jsonl", "~> 0.1.5"
gem "kramdown", "~> 2.4"
gem "langchainrb", "~> 0.13.5"
gem "lingua", "~> 0.6.2"
gem "mimemagic", "~> 0.4.3"
gem "minitest", "~> 5.11"
gem "minitest-rg", "~> 5.3"
gem "nano-bots", "~> 3.4"
gem "natty-ui", "~> 0.10.0"
gem "ohm", "~> 3.1"
gem "ohm-contrib", "~> 3.0"
gem "open3", "~> 0.2.1"
gem "open4", "~> 1.3"
gem "parallel", "~> 1.25"
gem "pastel", "~> 0.8.0"
gem "pdf-reader", "~> 2.12"
gem "pg", "~> 1.5"
gem "pgvector", "~> 0.3.1"
gem "pragmatic_segmenter", "~> 0.3.23"
gem "pragmatic_tokenizer", "~> 3.2"
gem "pry", "~> 0.14.2"
gem "pry-doc", "~> 1.5"
gem "pry-stack_explorer", "~> 0.6.1"
gem "rake", "~> 13.0"
gem "rb-readline", "~> 0.5.5"
gem "rubocop", "1.64.1"
gem "redis", "~> 5.2"
gem "rubocop", "~> 1.64"
gem "rubocop-minitest", "0.35.0"
gem "rubocop-packaging", "0.5.2"
gem "rubocop-performance", "1.21.1"
gem "rubocop-rake", "0.6.0"
gem "ruby-lsp", "~> 0.17.4"
gem "ruby-lsp", "~> 0.17.4"
gem "ruby-spacy", "~> 0.2.2"
gem "sequel", "~> 5.82"
gem "solargraph", "~> 0.48.0"
gem "stream_lines", "~> 0.4.1"
gem "thor", "~> 1.2"
gem "timeout", "~> 0.4.1"
gem "tomoto", "~> 0.4.0"
gem "tool_tailor", "~> 0.2.1"
gem "treetop", "~> 1.6"
gem "tty-box", "~> 0.7.0"
gem "tty-markdown", "~> 0.7.2"
gem "tty-prompt"
gem "tty-screen", "~> 0.8.2"
gem "tty-spinner", "~> 0.9.3"
gem "tty-table", "~> 0.12.0"
gem "wordnet", "~> 1.2"
gem "wordnet-defaultdb", "~> 2.0"
gem "yaml", "~> 0.3.0"

gem "polyglot", "~> 0.3.5"

gem "engtagger", "~> 0.4.1"

gem "scalpel", "~> 0.2.1"

#gem "gokdok", "~> 0.4.2"

gem "shale", "~> 1.1"

gem "paint", "~> 2.3"

gem "color", "~> 1.8"
File renamed without changes.