Probabilistic Non-Tokenised Language Model
ProntoLM is a novel approach to machine learning and language models. The goal was to build something that qualifies as machine learning by definition, using the simplest possible approach. For context, this README is larger than the codebase itself.
I very much doubt this project will ever have a real use case. I created this because I love experimenting with novel ideas. I don't plan on maintaining this project, though I probably will randomly commit improvements when I have bursts of dopamine/motivation to work on it. For those who love tinkering with projects like this, I encourage you to fork this project. I'd love to see how you can improve it. Feel free to send a PR if you've improved on this project. If it's good, I'll accept it.
P.S. If you're looking for ideas on how to improve Pronto, keep reading. I've listed many of the flaws/limitations of Pronto.
ProntoLM is a probabilistic, non-tokenised language model that generates completions based on a provided text dataset. It uses a simple pattern-matching-and-counting approach to generate completions, which allows it to run perfectly well on a single CPU thread with no GPU. The model uses no weights and no neural network, which keeps it lightweight and easy to modify and improve.
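The counting approach can be sketched in a few lines of Python. This is a reconstruction from the description above, not the actual contents of prontolm.py — the function name, signature, and exact matching behaviour are illustrative:

```python
from collections import Counter

def complete(text: str, prompt: str, length: int = 20) -> str:
    """Greedily extend a prompt by pattern matching and counting."""
    out = prompt
    for _ in range(length):
        counts = Counter()
        start = text.find(out)
        while start != -1:                  # every occurrence of the string so far
            nxt = start + len(out)
            if nxt < len(text):
                counts[text[nxt]] += 1      # tally the character that follows
            start = text.find(out, start + 1)
        if not counts:                      # prompt not in dataset: no completion
            break
        out += counts.most_common(1)[0][0]  # append the most probable character
    return out
```

For example, with the dataset `"hello world. hello there."`, `complete(dataset, "hello w", length=4)` returns `"hello world"`, and a prompt that never occurs in the dataset is returned unchanged.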
ProntoLM aims to provide a basic language model implementation that can generate completions for user prompts without the need for tokenisation. It offers a lightweight alternative to larger and more complex language models, serving as a starting point for simple language generation tasks.
ProntoLM provides a straightforward implementation of a non-tokenised language model, making it easy to understand and modify for specific use cases.
With its simplified approach, ProntoLM can generate completions quickly, making it suitable for lightweight language generation tasks. Due to its simplicity, ProntoLM can sustainably generate completions faster than ChatGPT while running on a Raspberry Pi.
ProntoLM allows you to use your own text dataset, enabling customization and adaptation to specific domains or contexts.
Despite its simplicity, ProntoLM can be trained. When generating completions, ProntoLM forks into different chains and outputs the x most probable responses. You can then provide feedback (currently stored only in memory) to train it to give more accurate answers.
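The forking-and-feedback behaviour can be pictured like this. This is purely an illustrative sketch: `variations`, `next_char_counts`, and the `feedback` dict are invented names, and ProntoLM's real mechanics may differ:

```python
from collections import Counter

def next_char_counts(text: str, context: str) -> Counter:
    """Count every character that follows an occurrence of context in text."""
    counts = Counter()
    start = text.find(context)
    while start != -1:
        nxt = start + len(context)
        if nxt < len(text):
            counts[text[nxt]] += 1
        start = text.find(context, start + 1)
    return counts

def variations(text, prompt, k=3, length=10, feedback=None):
    """Fork into the k most probable chains, then extend each greedily.
    `feedback` is an in-memory {completion: score} dict built from user ratings."""
    feedback = feedback or {}
    # Fork: one chain per top-k next character after the prompt.
    chains = [prompt + c for c, _ in next_char_counts(text, prompt).most_common(k)]
    for _ in range(length - 1):
        for i, chain in enumerate(chains):
            counts = next_char_counts(text, chain)
            if counts:
                chains[i] = chain + counts.most_common(1)[0][0]
    # Rank chains so positively rated completions surface first.
    return sorted(chains, key=lambda c: feedback.get(c, 0), reverse=True)
```

For example, on the dataset `"cats purr. cats play. cars park."`, `variations(dataset, "ca", k=2, length=2)` forks into `"cats"` and `"cars"`, and a feedback score can reorder which one is offered first.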
It's important to note the following limitations of ProntoLM:
- ProntoLM relies on simple pattern matching and counting, which may not produce high-quality or contextually accurate completions.
- The performance of ProntoLM heavily depends on the quality and representativeness of the provided text dataset.
- Completions are built one character at a time, using only the characters that follow occurrences of the prompt in the dataset.
- ProntoLM lacks the sophisticated architecture and training of larger language models like GPT-3 or GPT-4.
- Since the addition of response variation, ProntoLM can no longer stream responses, even when only one variation is requested.
- If a prompt does not occur anywhere in the dataset, ProntoLM produces no completion at all.
Currently, ProntoLM relies on simple pattern matching and counting, which may limit its performance. Future improvements may involve implementing more advanced tokenisation techniques to handle complex language patterns and improve completion generation.
One of the future improvements for ProntoLM is to incorporate automated methods for dataset creation. This could involve web scraping, text extraction from various sources, or utilizing APIs to gather relevant textual data. By automating the dataset creation process, ProntoLM would become more versatile and adaptable to different domains and topics.
To enhance the diversity and richness of the generated completions, ProntoLM could benefit from the incorporation of data augmentation techniques. These techniques involve applying transformations or modifications to the existing dataset, such as adding noise, paraphrasing, or generating synthetic samples. By leveraging data augmentation, ProntoLM would be able to produce more varied and contextually relevant completions.
ProntoLM could benefit from a dynamic dataset expansion feature. This improvement would enable the model to automatically incorporate new data from the internet into its existing dataset, thus increasing its knowledge base and improving completion generation. By dynamically expanding the dataset, ProntoLM would be able to generate completions that encompass a broader range of topics and domains.
To enhance the contextual understanding of generated completions, ProntoLM could incorporate look-behinds. Look-behinds allow the model to consider the preceding context when generating completions, enabling it to generate more contextually relevant and coherent output. By incorporating look-behinds, ProntoLM would be able to capture and utilize a broader context for completion generation.
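A minimal look-behind could mean matching on a sliding window of the last few generated characters rather than the full prompt. This is a hypothetical sketch of the idea, not a planned implementation:

```python
from collections import Counter

def complete_lookbehind(text: str, prompt: str, window: int = 6, length: int = 20) -> str:
    """Character completion that matches only the last `window` characters."""
    out = prompt
    for _ in range(length):
        context = out[-window:]             # look behind at the recent output only
        counts = Counter()
        start = text.find(context)
        while start != -1:
            nxt = start + len(context)
            if nxt < len(text):
                counts[text[nxt]] += 1
            start = text.find(context, start + 1)
        if not counts:
            break
        out += counts.most_common(1)[0][0]
    return out
```

A shorter window matches more occurrences (more robust, less coherent); a longer window captures more context but fails to match sooner.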
Another planned improvement is the implementation of parallel generation in ProntoLM. Parallel generation involves generating multiple completions simultaneously, providing users with a range of options or alternative suggestions. By adopting parallel generation, ProntoLM can offer more diverse and varied completions, catering to different user preferences and needs. Feedback provided by the user on which completion is better will be used to improve the dataset.
ProntoLM can currently only handle alphanumeric characters. This is a severe limitation, as it cannot detect sentence starts and endings or add proper punctuation. ProntoLM is also unable to output website URLs, even when they are explicitly present in the dataset.
ProntoLM is currently unable to end a completion on its own, and has no way of knowing when to stop. At the moment, it is hardcoded to stop after a fixed number of characters. This needs to change.
You can create a dataset by simply copying text from websites such as Wikipedia. The smaller the dataset is, the more likely it is that ProntoLM will hallucinate completions.
1. Clone the repository to your local machine:

   ```
   git clone https://github.com/QuixThe2nd/ProntoLM.git
   ```

2. Navigate to the project directory:

   ```
   cd ProntoLM
   ```

3. Open the script file named `prontolm.py` in your preferred text editor.

4. Set the `text` variable to the desired text dataset, replacing the empty string (`""`) with your text:

   ```python
   text = """
   Your text dataset goes here.
   """
   ```

5. Save the file.

6. Run the script:

   ```
   python prontolm.py
   ```

7. The program will prompt you to enter a prompt string.

8. Enter your desired prompt and press Enter.

9. The program will generate a completion based on the provided prompt.

10. Repeat steps 7-9 as desired to generate multiple completions.
Contributions to ProntoLM are welcome! If you have any suggestions, improvements, or bug fixes, please feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.