
Updated AI Implementation (Using Mistral 7B, AWQ 4-bit)


🚀 Refactoring Our Implementation Strategy: Best Approach for Your SaaS AI Model

After encountering multiple issues with model downloads, storage limits, tokenizer mismatches, and quantization errors, we need a clear, optimized roadmap to:
1️⃣ Select the best AI model for your SaaS
2️⃣ Download it efficiently without running into storage issues
3️⃣ Train it on a scalable platform
4️⃣ Deploy it optimally for inference


✅ Step 1: Define Your AI Model Needs

Key Requirements for Your SaaS AI Model

Based on our previous discussions, your AI should:
  • Listen to audio (via Whisper) and transcribe it (see the transcription sketch after this list)
  • Generate real-time insights & solutions based on discussions
  • Summarize meetings with key actions & recommendations
  • Run efficiently on available GPUs (Kaggle, Colab, or cloud GPUs)
  • Be scalable for future SaaS deployment
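
As a rough illustration of the transcription step, here is a minimal sketch using the open-source `openai-whisper` package. The model size and audio filename are placeholders, not decisions from our pipeline:

```python
# Minimal transcription sketch using the open-source whisper package.
# Assumes: pip install -U openai-whisper (plus ffmpeg installed on the system).
import whisper

# "base" is a small model that fits comfortably on Colab/Kaggle GPUs;
# larger variants ("small", "medium") trade speed for accuracy.
model = whisper.load_model("base")

# "meeting.wav" is a placeholder path for a recorded meeting.
result = model.transcribe("meeting.wav")
print(result["text"])  # full transcript as a single string
```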


✅ Step 2: Choose the Best Model

To meet these needs, we need a model that:

  • Is optimized for NLP (meeting summarization, action-item extraction, etc.)
  • Supports real-time inference
  • Is lightweight enough to run on available hardware (not requiring 642GB like DeepSeek-R1)
  • Is easy to fine-tune and deploy for SaaS

Best Model Candidates:

| Model | Size | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Mistral-7B (AWQ / 4-bit) | ~7B params (fits in Colab) | ✅ Optimized for summarization & chat | ❌ Needs fine-tuning for meeting-specific use case |
| LLaMA-2 7B AWQ | ~7B params | ✅ Highly optimized, runs efficiently | ❌ May need custom RAG system for better retrieval |
| Gemma-7B (Google) | ~7B params | ✅ Google-optimized, supports summarization | ❌ Requires Google Cloud TPU for best performance |
| DeepSeek-R1 Distill (LLaMA 70B AWQ) | ~70B params | ✅ High performance, multilingual | ❌ Hard to load, tokenizer errors |
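
Since Mistral-7B AWQ is the leading candidate, here is a hedged loading sketch using the `autoawq` and `transformers` libraries. The Hugging Face repo id shown is a popular community quantization used as a placeholder, not a checkpoint we have locked in:

```python
# Sketch: loading a 4-bit AWQ quantization of Mistral-7B.
# Assumes: pip install autoawq transformers, and a CUDA GPU available.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder repo id -- swap for whichever quantized checkpoint we settle on.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True)

prompt = "Summarize the key action items from this meeting transcript:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A quick smoke test like this also surfaces the tokenizer mismatches and quantization errors we hit earlier, before we invest in fine-tuning.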

🚀 Next Steps

1️⃣ Download & load Mistral-7B AWQ (Step 2)
2️⃣ Test the model's responses
3️⃣ Prepare a dataset for fine-tuning
4️⃣ Train on RunPod.io or Paperspace
5️⃣ Deploy using vLLM for fast SaaS inference (see the serving sketch below)
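
For the deployment step, here is a minimal serving sketch with vLLM, which supports AWQ checkpoints natively. The model id is again the same placeholder assumption as above:

```python
# Sketch: fast batched inference with vLLM for SaaS-style serving.
# Assumes: pip install vllm, and a CUDA GPU available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder repo id
    quantization="awq",
)

params = SamplingParams(temperature=0.7, max_tokens=200)
prompts = ["Summarize this meeting and list the action items:\n..."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

vLLM's continuous batching is what makes it a good fit here: many concurrent SaaS users can be served from a single GPU without per-request model reloads.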