Machine Learning Model #49
Starting this now.
Current update: I decided that, in place of our dataset, I am going to use something that can be binary classified (I ended up settling on a binary classification dataset for rain in Australia for now). I have logistic regression and decision tree models working for this dataset (although the accuracy is pretty bad). I am aiming to have several different classification models that can be compared (after several phases of automated testing) so that we can at least use our best and most highly tuned model. This dataset has a handful of features (columns), and the more I think about it, maybe the features we use in our model can be the different scores produced by all the text comparisons (along with maybe the size of the text?).
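A minimal sketch of what those two baselines look like in scikit-learn, assuming the usual Kaggle weatherAUS.csv layout with a RainTomorrow target column (the file name and column names are assumptions, not taken from the repo):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("weatherAUS.csv")                 # hypothetical file name
df = df.dropna(subset=["RainTomorrow"])            # target column assumed
X = df.select_dtypes(include="number").fillna(0)   # keep it simple: numeric features only
y = (df["RainTomorrow"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=5)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```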
Current Update: I read more about the different ways we can benchmark performance, so precision score, recall score, F1 score, and ROC AUC are now included as well. These come from scikit-learn, along with more models I implemented for this data, including random forest and SVC (Support Vector Classifier). These newly added models take significantly longer to fine-tune than the first two, which means we may have to drop them if they are also slow on our data. For now, I will keep them since they seem to have better accuracy on this particular dataset. Right now the data uses an 80/20 train/test split, but the optimal split can be anywhere from 60/40 to 80/20, so this would also need to be tuned. I also read that some people have success with an 80-10-10 split, where 80% is training, 10% validation, and 10% testing; that is obviously a further step. The problem here is not the ML model or even the code for it (there are so many packages); the problem is correctly tuning the model and choosing the right model. This is mainly why my next update will focus on automating the process of tuning the model parameters. Near the end of this work interval, I was looking at efficient ways to tune the hyperparameters, and I found a cool way to automate the tuning phase, which can be found here. I will work further on automating the tuning phase for the Australia rain data.
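For reference, a sketch of how those benchmark scores come out of scikit-learn, reusing `model`, `X_test`, and `y_test` from the sketch above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_pred = model.predict(X_test)              # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class, for ROC AUC

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
print("roc auc:  ", roc_auc_score(y_test, y_prob))
```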
Current update: I started messing around with tuning the hyperparameters using the previously mentioned approach.
A good chunk of work was done on creating a model and automating the process of fine-tuning it. The dataset used is not relevant to our project, but it was helpful for writing code that is independent of the dataset, which means we can swap in our own dataset when it is ready since I generalized the code as much as possible. Of course, the portion of models.py where the model is trained needs to be changed, since it is very specific to the Australian rain CSV file. This CSV file will also be deleted before a pull request is made to main.
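The linked tuning approach is not visible here, so as a stand-in, this is one common way to automate hyperparameter tuning with scikit-learn's GridSearchCV (the random forest and the parameter grid are illustrative, not the project's actual configuration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to search over (illustrative only).
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # optimize for F1 instead of plain accuracy
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("best cv f1: ", search.best_score_)
```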
Starting back on this right now. Now that the dataset is in the Google Drive, I can look into testing some of the content here, as well as the text comparisons.
Current update: I have created a helper function to read all subdirectories, and the files in those subdirectories, from a main directory. I am also in the process of creating a helper function to append strings to a CSV file. The purpose of this is to populate the training data used for the model. These helper functions will be located in
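Hypothetical versions of the two helpers described above (the names and exact signatures are mine, not necessarily what ends up in the repo):

```python
import csv
import os

def list_files(root_dir):
    """Return every file path under root_dir, including all subdirectories."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def append_row_to_csv(csv_path, row):
    """Append one row (a list of strings) to a CSV file, creating it if needed."""
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)
```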
Current update: I found a design issue in how I want to parse the directories. If the dataset is to be made from all the files, then there is a problem with the way I am doing it, caused by the inability to make any comparisons when all of the data (AI and human) is in the same directory. I think I will move them out of the same directory and take the average of all comparisons in order to decide whether this is a viable solution or whether things need to be rethought.
Current update: I ran into an issue reading a PDF. For some reason, the PDF is not being read correctly, which does not make sense. The specific problem I am running into is a missing end-of-file (EOF) marker. I will look into this, but I am close to being able to compare all of the human files to the AI ones (with the purpose of producing training data for the models in
Current update: I tried fixing the file parsing problem using some approaches mentioned here: py-pdf/pypdf#480. They did not work, so I am looking into changing the PDF package we are using. At the moment, I am looking into pdfplumber.
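For context, a minimal pdfplumber sketch of the kind of extraction being tried; whether it tolerates the missing EOF marker is exactly what is being tested here:

```python
import pdfplumber

def read_pdf_text(path):
    """Extract plain text from every page of a PDF using pdfplumber."""
    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for image-only pages, hence the `or ""`
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```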
Current update: I am still not able to fix the problem with the missing EOF marker. I tried two other PDF packages, as well as different files from the testing set, to see if it was a specific file. I think I have spent enough time on this to open a new issue for it.
I am starting back on this since I found a quick workaround (see #53).
Current update: I ran into a problem where I realized I had multiple data points that were repeats of the same data point. I have fixed this by redoing how I was making the comparisons: I ended up making one big list so that the AI data points can be compared to everything as well.
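A rough sketch of the "one big list" idea, with `compare` standing in for whichever text-comparison function produces the metrics (all names here are placeholders):

```python
from itertools import combinations

def build_comparisons(human_docs, ai_docs, compare):
    """Pool every document, then compare all unique pairs exactly once."""
    all_docs = [(text, "human") for text in human_docs] + \
               [(text, "ai") for text in ai_docs]
    rows = []
    for (text_a, label_a), (text_b, label_b) in combinations(all_docs, 2):
        rows.append((label_a, label_b, compare(text_a, text_b)))
    return rows
```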
Current update: I ended up breaking the existing code (and now I am not sure why the logic is not adding up). I think I will revert to the previous image, since that was closer to what I was looking for than the more recent changes. I will also end the day here and pick this back up later this week.
Starting back on this.
Current update: I restarted how I was creating the dataset; I am now doing it in a manner that I think makes more sense. Before, I had the data being appended in an inner loop, which was part of the reason why the dataset had more rows than it should have.
Current update: I decided to watch a 30-minute video about machine learning with small datasets, since this may be something we have to account for (if we cannot make enough data before the semester ends). Here are some of my key takeaways/notes from the video:
Current update: I am currently looking into ways to filter common words out in the preprocessing stage, mainly because I want to see how this affects the models (considering all of the data in our current dataset is programming based). The library I found for this is the Natural Language Toolkit (NLTK), which has the following (potentially useful) functions:
Overall this package is pretty cool for more advanced preprocessing, but for now, I am just planning on using the
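A minimal sketch of stop-word filtering with NLTK, assuming the `stopwords` corpus is the piece that ends up being used (the tokenization here is deliberately simple; nltk.word_tokenize is an option for something smarter):

```python
import nltk
nltk.download("stopwords", quiet=True)   # one-time download of the stop-word list

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def remove_common_words(text):
    """Drop English stop words from a piece of text before comparison."""
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return " ".join(t for t in tokens if t and t not in STOP_WORDS)

print(remove_common_words("The function returns the sum of the two arguments."))
# -> "function returns sum two arguments"
```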
Current update: I am in the process of cleaning up the files I messed with today (removing old code, documenting new code), and I seem to have broken something in
Improvement on documentation and functions for processing directories, as well as creating a dataset and writing that dataset to CSV. Associated with #49
I found the issue: I had removed an essential variable. I fixed this, added the documentation, and pushed the file. I think I am done working on this for today.
Starting on this again. Now that we have the user flow pretty much figured out, I can focus on calling and using the ML code with the submitted/uploaded content.
Update: I ended up actually testing the code and looking at the user flow. It seems some of the errors we show (like a user trying to sign up when they already exist, or an invalid email being used) are not as clear as they could be. I also found that the button for switching between Gemini and ChatGPT does not work as intended (only ChatGPT is shown). I also spent some time thinking about how the "create your own model" feature would work.
Update: I was running into a problem, but it was because I was missing a directory for uploads. I changed the code in the API to check for this and create the folder if it does not exist, so no one should face this error again if they forget about that. I also noticed (and verified) that the parsing does not make any kind of connection to the React app. Checking this made sense because we want to return the percentages and other useful results. I think I need to create a function in API.py that uses the ML and text comparison code so that it can be called from both the fileupload and formsubmission functions that are our FastAPI endpoints; otherwise we would have superfluous code. A rough sketch of this idea is below.
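In this sketch, `generate_report` and `UPLOAD_DIR` are placeholders, not the actual names in API.py; it just shows one shared helper called from both endpoints, plus creating the uploads folder when it is missing:

```python
import os
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)   # avoid the missing-directory error on a fresh clone

def generate_report(text: str) -> dict:
    # Placeholder: this is where the text comparisons / ML model would be called.
    return {"length": len(text)}

@app.post("/fileupload")
async def fileupload(file: UploadFile = File(...)):
    contents = (await file.read()).decode("utf-8", errors="ignore")
    return generate_report(contents)

@app.post("/formsubmission")
async def formsubmission(text: str = Form(...)):
    return generate_report(text)
```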
Update: I am still working on creating the function for "generating the report" the user sees after submitting an assignment. At some point, an additional file upload area and text input should be added (so we have a distinction between instructions and student submissions). Doing so will make the product easier for users to understand and use (and gives us a cleaner delineation on the backend). My current approach has the end goal of sending a dictionary that can then be rendered in the React app.
Current Update: I finished writing the method for generating a report in terms of just the cosine comparison, and I thought I could test it with just regular text, but this proved to be harder than I expected. I think this is a sign to work on the front end to include the ability to upload the submission file/text as well, in order to test the method as is. The good news is that once I do this, the other part should be easier, mainly because I just need to make function calls!
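For reference, the cosine part of such a report can be as small as this (the function name and dictionary keys are placeholders, not the repo's actual method):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_report(submission, reference_texts):
    """Compare one submission against a list of reference texts and summarize."""
    tfidf = TfidfVectorizer().fit_transform([submission] + reference_texts)
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).flatten()
    return {
        "cosine_max": float(scores.max()),
        "cosine_mean": float(scores.mean()),
    }

print(cosine_report("def add(a, b): return a + b",
                    ["def add(x, y): return x + y", "completely different text"]))
```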
Starting on this again.
Update: I made the changes so there is now a generate report button next to the submit prompt button. This is the first step in generating the report. It currently works without the machine learning model by simply returning every metric in a dictionary (which is okay for now). The purpose of this is so users do not have to wait for 100 iterations of calling the APIs (unless they want to).
Update: The generate report button now redirects the user to the account page, where I plan to have a table of available reports (this does not work yet). I think the easiest way to grab and store the reports is with the Firebase database, which might mean we should use that instead of the Realtime Database (at least for that portion, since it is connected to the user authentication).
Starting on this again.
Update: I am adding some more metrics to the model, mainly because I realized we could use more features, and I also found some new comparisons while reading something for my AI course final project. I also cleaned up one of the functions called
Update: I am about to test the new metrics I added. In the background, one of the pretrained models that will be used (GloVe, if I have the spelling right) is being downloaded so it can be used (this line is currently commented out).
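If the model being downloaded is indeed GloVe, gensim's downloader is one common way to pull a pretrained set of word vectors; the exact model name below is an assumption, not the one used in the repo:

```python
import gensim.downloader as api

# Downloads and caches the vectors on first use (roughly 100+ MB).
glove = api.load("glove-wiki-gigaword-100")
print(glove.similarity("rain", "weather"))   # cosine similarity between two word vectors
```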
Update: While testing this, I ran into a few issues, specifically with some of the packages I am using. I am working on resolving them, but most of the metrics are being computed correctly (with some exceptions). After these are resolved, I can incorporate this into the model, which I anticipate having connected to the React app by the end of today.
Update: I am reconstructing models.py. After working on the final AI project, there are some new tricks I want to incorporate that I think will be beneficial for this project. Currently, I am working on the feature extractor method, which did not exist before but could be a good addition.
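A hypothetical shape for that feature-extractor method, where the metric names are placeholders for whichever comparisons end up being used as features:

```python
import numpy as np

# Fixed feature order so every data point produces a vector of the same shape.
FEATURE_NAMES = ["cosine_similarity", "jaccard_similarity", "length_ratio"]

def extract_features(metrics: dict) -> np.ndarray:
    """Map a metrics dictionary onto a fixed-order numeric feature vector."""
    return np.array([float(metrics.get(name, 0.0)) for name in FEATURE_NAMES])

print(extract_features({"cosine_similarity": 0.82, "length_ratio": 1.1}))
```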
Update: I think I am done reworking the models.py functions, and I am about to start testing them as well.
Update: I was testing the models and realized that I forgot to include the metrics as data columns, so I will do that now. I also need to decrease the step size, because these models are fully trained in milliseconds.
Update: I just realized a major flaw in the machine learning model. The metrics are based on comparisons against a similar (actually AI-generated) data point. This is a problem because the machine learning model is not trained on equal assignment types; therefore, it would not make sense to incorporate these metrics into the model UNLESS the model is trained in the background on a larger number of AI-generated samples (between 100 and 1000) of the same assignment. That would not only catch AI-generated work, but also plagiarism. For this to work, it would take a long time and require a customized model per assignment (otherwise the metrics would be useless). I think this means the metrics should act as a supplement (outside of the machine learning), which is not a problem. It also means we need to gather more AI-generated work and build a massive folder of AI-generated and human-generated work in order to train the model. I think it would be great to use the metrics as a standalone report in addition to the model (to either support the model or bring the results closer to accurate). This is not a major problem, since it is close to what was envisioned in the beginning.
Update: I am done working on this for now. I did some more testing on the model end, and I also looked for possible sources of training data, since we currently only have 40 data points (obviously not enough to train a reliable model). I also spent some time cleaning up
This works and is used by the React side too. The only thing that has not been implemented yet is the automatic updating of the training data, which is not a big deal (but would be nice to have). There are thumbs up/down icons (which are really buttons) that allow users to label their own data. The feature for creating a custom model is also a separate issue that should only be completed if time allows.
This is the issue for the machine learning model that will be used as an additional form of classification between AI and human content. Work can start to some extent without the testing data, but once the testing data is available, the model can be benchmarked for accuracy.