moving current wip to main (#6)

* wip: putting together the scaffolding
* demo working
* wip: examples
* added examples command
* wip
* setting interface output colors here results in ascii chars sent to redis
* added gems
* added example
* added provider placeholder
* wip: flowise api
* added helper module from monadic-chat, wip: flowise api working
* added setup instructions for python libs
* wip ToT example, workflow architecture
* added original example
* moved cartridges to nano-bot registry
* wip
* added singleton class for spacy tasks
* wip ERROR -- : No valid words found in the provided documents
* Moved the require statement for text_processing_workflow to after other component requires.
* Changed logging level from DEBUG to INFO.
  Commented out most of the binding.pry breakpoints.
  Updated the AdvancedAnalysisTask:
  - Modified the file path for the advanced_analysis_cartridge.yml.
  - Changed the prompt for analysis to generate a short narrative.
* Added more detailed logging during document processing.
  Modified the training process:
  - Now trains in iterations, printing progress.
  - Outputs more detailed model statistics.
  Updated the infer_topics method:
  - Now uses make_doc method.
  - Handles case where topic inference fails.
  - Identifies and returns the most probable topic.
  - Prints full topic distribution.
* - Removed unused imports and dependencies
  - Reorganized require statements in flowbots.rb
  - Deleted topic_modeler.rb file
  - Simplified TextProcessor and TextSegmenter classes
  - Updated TextProcessingWorkflow to use get_topics
  - Removed Redis initialization from WorkflowOrchestrator
* - Modularized topic modeling functionality
  - Improved error handling and logging
  - Updated Docker configuration
  - Removed unused segmentation code
  - Enhanced configuration management
  - Adjusted file paths and dependencies
  - Updated nano-bots submodule
* - Extracted train_model and infer_topics methods
  - Improved error handling and logging throughout
  - Removed redundant code and improved readability
  - Added logger initialization in the constructor
* adding tty-box functions
* moved workflows, renamed components
* future utils
* wip: error handler
* wip: ui
* added error handling cartridge
* separated cli module from main
* a nice and accurate exceptionhandler :)
* snapshot
* wip: almost back together
* working in ohm
* adjusted to output exception reports in markdown
* 1. ExceptionAgent improvements:
     - Removed the "Relevant Files" section from exception reports, simplifying the output.
  2. TopicModelProcessor enhancements:
     - Improved model loading and creation process with a new `load_or_create_model` method.
     - Enhanced `process` method to handle empty documents and ensure model existence.
     - Improved `train_model` method with better handling of empty documents and word counting.
     - Added more robust error handling and logging throughout.
     - Improved `save_model` method with checks for directory existence, write permissions, and disk space.
     - Enhanced `store_topics` method with better error handling and logging.
  3. Task structure changes:
     - Modified the base `Task` class to no longer inherit from `Jongleur::WorkerTask`.
     - Updated specific task classes (LlmAnalysisTask, NlpAnalysisTask, TopicModelingTask) to inherit directly from `Jongleur::WorkerTask`.
  4. UI improvements:
     - Simplified the `info` method in the UI module.
  5. TextProcessingWorkflow updates:
     - Commented out some workflow steps (process_input, run_nlp_analysis, run_topic_modeling) in the `execute` method.
     - Changed logging to use UI.info instead of logger.info in the `run_workflow` method.
* set messages to print
* added cartridges
* snapshot: working
* wip: created task to display results
* removed redundant includes
* assets
* added cartridges
* set text segmentation as its own task
* added Fileloader task, added tokenizer, adjusted ohm models
* wip: working, set batch_size or else large datasets overflow mem
* Refactor topic modeling workflow and improve text processing pipeline

  This commit significantly updates the topic modeling workflow and text processing pipeline, improving efficiency and adding new features:

  1. TopicModelTrainerWorkflow:
     - Implement batch processing with BATCH_SIZE constant
     - Add flush_redis_cache method for clean slate processing
     - Refactor process_files method to handle batches
     - Implement train_topic_model method with cleaning and filtering
     - Add clean_segments_for_modeling method to improve data quality
  2. Task Updates:
     - Modify LoadTextFilesTask to process single files
     - Update TextSegmentTask, TokenizeSegmentsTask, and NlpAnalysisTask for single file processing
     - Refactor FilterSegmentsTask with improved logging and error handling
     - Add AccumulateFilteredSegmentsTask for batch accumulation
     - Update TrainTopicModelTask to handle accumulated segments
  3. LLM Analysis:
     - Refactor LlmAnalysisTask to use preprocessed content and file metadata
     - Implement generate_analysis_prompt method for better LLM input
  4. UI Improvements:
     - Add BoxUI module with side_by_side_boxes method for improved result display
     - Update DisplayResultsTask to use new BoxUI for better visualization
  5. NLP Processing:
     - Refactor NLPProcessor to return more detailed token information
     - Update NlpAnalysisTask to handle new NLP processor output
  6. Miscellaneous:
     - Remove unused code and comments
     - Update error handling and logging across multiple files
     - Improve code organization and readability

  This refactoring enhances the workflow's ability to handle large datasets efficiently, improves the quality of topic modeling input, and provides better visualization of results.
* added treetop grammar, working on clean interrupt
* wip: grammar parser
* Refactor text processing workflow and improve YAML front matter parsing

  - Update GrammarProcessor to use Treetop grammar file
  - Simplify markdown_yaml.treetop grammar for better YAML parsing
  - Enhance PreprocessTextFileTask with improved error handling and logging
  - Modify TextSegmentTask to use preprocessed content
  - Add parallel processing support to flowbots.rb
  - Update CLI to use TopicModelTrainerWorkflow instead of test version
  - Improve error logging and context in GrammarProcessor
  - Enhance WorkflowOrchestrator cleanup process

  This commit significantly improves the text processing pipeline, particularly in handling YAML front matter in Markdown files. It also adds better error handling and logging throughout the workflow.
* this works at least
* update readme
* set preprocess task to get the current_textfile_id in the workflow
* add engtagger task wip: text compressor
* added rdocs
* documentation
* extras
* fix: linear logic for detecting file type
* wip
* Refactor tasks and implement uniform input retrieval (Epics 1 & 2)
* added lemmas ohm model
* ui improvements
* UI improvements
* cartridge updates
* ui improvements
* adjusted readme
* updated readme, results
* wip
* adjusted nano-bots
* Key changes include:
  - Renaming the Textfile model to FileObject.
  - Updating all references to Textfile to FileObject.
  - Modifying the FileLoader class to use the FileObject model.
  - Updating the InputRetrieval module to retrieve FileObject instances.
  - Adjusting the RedisKeys module to use keys related to FileObject.
  - Updating tasks and workflows to use the FileObject model.
* Adjust logging settings and enhance TextProcessingWorkflow

  - Reduce log file max size to 2,145,728 bytes
  - Increase max number of log files to 100
  - Comment out flush_redis_cache in unified_file_processing
  - Add batch mode to TextProcessingWorkflow
  - Implement separate processing for batch and single file modes
  - Add methods for fetching unprocessed file IDs and creating/fetching file objects
  - Update perform_additional_tasks to work with specific file IDs
* added SI chars
* doc updates
* snapshot
* snapshot
* readme edits
* snapshot
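
The logging limits named in the "Adjust logging settings" entry (2,145,728 bytes per file, 100 files) could be expressed with Ruby's standard `Logger` roughly as follows. This is a minimal sketch, not the project's actual logging setup: it assumes the standard-library `Logger` is what is being configured, and the constant names, the `build_logger` helper, and the log file name are hypothetical.

```ruby
require "logger"
require "tmpdir"

# Values taken from the commit message; the constant names are hypothetical.
LOG_MAX_BYTES = 2_145_728 # rotate once a log file grows past this size
LOG_MAX_FILES = 100       # number of log files kept in the rotation

# Hypothetical helper: builds a size-rotated logger at INFO level.
# Logger.new(logdev, shift_age, shift_size) is the standard-library signature.
def build_logger(path)
  logger = Logger.new(path, LOG_MAX_FILES, LOG_MAX_BYTES)
  logger.level = Logger::INFO
  logger
end

Dir.mktmpdir do |dir|
  logger = build_logger(File.join(dir, "flowbots.log"))
  logger.info("workflow started")          # written
  logger.debug("suppressed at INFO level") # filtered out
  logger.close
end
```

With an integer `shift_age`, `Logger` rotates by size rather than by date, so the two numbers together cap the disk space the logs can use.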
b08x · Sep 25, 2024 · 844ca60
1 parent 94377d1
Showing 434 changed files with 41,348 additions and 133 deletions.
diff --git a/.env.example b/.env.example
@@ -0,0 +1,2 @@
+COHERE_API_KEY=
+GEMINI_API_KEY=
diff --git a/.gitignore b/.gitignore
@@ -2,9 +2,29 @@
 /.yardoc
 /_yardoc/
 /coverage/
-/doc/
 /pkg/
 /site/
 /spec/reports/
 /tmp/
 /Gemfile.lock
+/docker/data/
+.env
+/docker/.env/
+/log/
+/vendor/
+/docker/*.pem
+/docs/agenta.json
+.venv/
+/docker/data.tar.gz
+flowbots.json
+/docker/docker-compose.yml
+/models/
+/examples/test/
+/data/
+/.bold/
+/.examples/
+/.vendor/
+/.docs/
+/exception_reports/
+*.text.json
+.gh_pages
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "nano-bots"]
+	path = nano-bots
+	url = git@github.com:b08x/nano-bots.git
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,131 @@
+# Use Ruby 3.3 as the base image
+FROM ruby:3.3-slim
+
+# Create a non-root user to run the app
+RUN useradd -s /bin/bash -m flowbots
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    --no-install-recommends \
+    apt-transport-https \
+    apt-utils \
+    build-essential \
+    ca-certificates \
+    cmake \
+    curl \
+    dialog \
+    exiftool \
+    git \
+    gnupg \
+    gnuplot \
+    gpg-agent \
+    graphviz \
+    libcairo2-dev \
+    libczmq-dev \
+    libffi-dev \
+    libfftw3-dev \
+    libgdbm-dev \
+    libgmp-dev \
+    libgsl-dev \
+    liblink-grammar-dev \
+    libmagick++-dev \
+    libmariadb-dev-compat \
+    libmariadb-dev \
+    libncurses5-dev \
+    libopenblas-dev \
+    libplot2c2 \
+    libpoppler-glib-dev \
+    libpq-dev \
+    libreadline-dev \
+    libreoffice \
+    libsqlite3-dev \
+    libssl-dev \
+    libtamuanova-0.2 \
+    libxml2-dev \
+    libxslt1-dev \
+    libyaml-dev \
+    libzmq3-dev \
+    link-grammar \
+    lsb-release \
+    minisat \
+    neovim \
+    openssl \
+    pandoc \
+    pdftk \
+    pkg-config \
+    plotutils \
+    poppler-utils \
+    postgresql-client \
+    python3 \
+    python3-link-grammar \
+    python3-pip \
+    python3.11-venv \
+    rsync \
+    ruby-psych \
+    software-properties-common \
+    sqlite3 \
+    tesseract-ocr \
+    tidy \
+    tzdata \
+    wget \
+    zip \
+    zlib1g-dev \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set the working directory in the container
+WORKDIR /app
+
+ARG USE_TRF=False
+ARG USE_BOOKNLP=False
+
+RUN python3 -m venv .venv && \
+    . /app/.venv/bin/activate && \
+    echo "[[ -d /app/.venv ]] && cd /app && . /app/.venv/bin/activate" >> /home/flowbots/.bashrc && \
+    echo "gem: --user-install --no-document" >> /home/flowbots/.gemrc && \
+    pip3 install -U setuptools wheel && \
+    pip3 install -U spacy 'pdfminer.six[image]' && \
+    python3 -m spacy download en_core_web_lg && \
+    python -c "import sys, importlib.util as util; 1 if util.find_spec('nltk') else sys.exit(); import nltk; nltk.download('punkt')"
+
+RUN if [ "${USE_TRF}" = "True" ]; then \
+        . /app/.venv/bin/activate && \
+        python3 -m spacy download en_core_web_trf \
+    ; fi
+
+RUN if [ "${USE_BOOKNLP}" = "True" ]; then \
+        . /app/.venv/bin/activate && \
+        pip3 install -U transformers booknlp \
+    ; fi
+
+# Copy only the Gemfile
+COPY Gemfile ./
+
+# Copy the rest of the application code
+# Copy only the specified directories and files
+COPY bin/ ./bin/
+COPY examples/ ./examples/
+COPY exe/ ./exe/
+COPY lib/ ./lib/
+COPY nano-bots/ ./nano-bots/
+COPY flowbots.json .
+
+# Set environment variables
+ENV LANG=C.UTF-8 \
+    LC_ALL=C.UTF-8
+
+# Create necessary directories
+RUN mkdir -p log models workspace
+
+RUN chown -R flowbots:flowbots /app
+
+USER flowbots
+
+ENV PATH="/home/flowbots/.local/share/gem/ruby/3.3.0/bin:$PATH"
+ENV PATH="/app/.venv/bin:$PATH"
+
+RUN bundle lock --add-platform x86_64-linux && \
+    bundle config build.redic --with-cxx="clang++" --with-cflags="-std=c++0x" && \
+    bundle install
+
+# Set the default command (can be overridden)
+CMD . .venv/bin/activate && exec bash
diff --git a/Gemfile b/Gemfile
@@ -1,16 +1,76 @@
 source "https://rubygems.org"
-gemspec
-
+# gemspec
 
+gem "algorithms", "~> 1.0"
+gem "ansi_palette", "~> 0.0.1"
+gem "chroma-db", "~> 0.7.0"
+gem "cli-ui", "~> 2.2"
+gem "dotenv", "~> 3.1"
+gem "groq", "~> 0.3.1"
+gem "highline", "~> 3.0"
+gem "jongleur", "~> 1.1"
+gem "json", "~> 2.7"
+gem "jsonl", "~> 0.1.5"
+gem "kramdown", "~> 2.4"
+gem "langchainrb", "~> 0.13.5"
+gem "lingua", "~> 0.6.2"
+gem "mimemagic", "~> 0.4.3"
 gem "minitest", "~> 5.11"
 gem "minitest-rg", "~> 5.3"
+gem "nano-bots", "~> 3.4"
+gem "natty-ui", "~> 0.10.0"
+gem "ohm", "~> 3.1"
+gem "ohm-contrib", "~> 3.0"
+gem "open3", "~> 0.2.1"
+gem "open4", "~> 1.3"
+gem "parallel", "~> 1.25"
+gem "pastel", "~> 0.8.0"
+gem "pdf-reader", "~> 2.12"
+gem "pg", "~> 1.5"
+gem "pgvector", "~> 0.3.1"
+gem "pragmatic_segmenter", "~> 0.3.23"
+gem "pragmatic_tokenizer", "~> 3.2"
 gem "pry", "~> 0.14.2"
 gem "pry-doc", "~> 1.5"
+gem "pry-stack_explorer", "~> 0.6.1"
 gem "rake", "~> 13.0"
 gem "rb-readline", "~> 0.5.5"
-gem "rubocop", "1.64.1"
+gem "redis", "~> 5.2"
+gem "rubocop", "~> 1.64"
 gem "rubocop-minitest", "0.35.0"
 gem "rubocop-packaging", "0.5.2"
 gem "rubocop-performance", "1.21.1"
 gem "rubocop-rake", "0.6.0"
-gem "ruby-lsp", "~> 0.17.4"
+gem "ruby-lsp", "~> 0.17.4"
+gem "ruby-spacy", "~> 0.2.2"
+gem "sequel", "~> 5.82"
+gem "solargraph", "~> 0.48.0"
+gem "stream_lines", "~> 0.4.1"
+gem "thor", "~> 1.2"
+gem "timeout", "~> 0.4.1"
+gem "tomoto", "~> 0.4.0"
+gem "tool_tailor", "~> 0.2.1"
+gem "treetop", "~> 1.6"
+gem "tty-box", "~> 0.7.0"
+gem "tty-markdown", "~> 0.7.2"
+gem "tty-prompt"
+gem "tty-screen", "~> 0.8.2"
+gem "tty-spinner", "~> 0.9.3"
+gem "tty-table", "~> 0.12.0"
+gem "wordnet", "~> 1.2"
+gem "wordnet-defaultdb", "~> 2.0"
+gem "yaml", "~> 0.3.0"
+
+gem "polyglot", "~> 0.3.5"
+
+gem "engtagger", "~> 0.4.1"
+
+gem "scalpel", "~> 0.2.1"
+
+#gem "gokdok", "~> 0.4.2"
+
+gem "shale", "~> 1.1"
+
+gem "paint", "~> 2.3"
+
+gem "color", "~> 1.8"
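
The Gemfile change above replaces the exact pin `"1.64.1"` for rubocop with the pessimistic constraint `"~> 1.64"`. The difference can be checked with RubyGems' own `Gem::Requirement` class; the snippet below is only an illustration, and the three version strings are arbitrary examples rather than releases referenced by this commit.

```ruby
require "rubygems"

pinned      = Gem::Requirement.new("1.64.1")  # exact version only
pessimistic = Gem::Requirement.new("~> 1.64") # >= 1.64 and < 2.0

# Example versions chosen for illustration only.
%w[1.64.1 1.65.0 2.0.0].each do |v|
  version = Gem::Version.new(v)
  puts "#{v}: pinned=#{pinned.satisfied_by?(version)} " \
       "pessimistic=#{pessimistic.satisfied_by?(version)}"
end
# 1.64.1: pinned=true pessimistic=true
# 1.65.0: pinned=false pessimistic=true
# 2.0.0: pinned=false pessimistic=false
```

Because `/Gemfile.lock` is listed in `.gitignore`, the looser constraint means different checkouts may resolve to different 1.x releases of rubocop.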
diff --git a/LICENSE.txt → LICENSE b/LICENSE.txt → LICENSE