The whole repo need refactoring. I'll try to do it asap. I particularly need to find a new dataset to reproduce the results
This repo is in progress. It is dedicated to an implementation of Listening to Chaotic Whispers. https://arxiv.org/abs/1712.02136v1 We are writing a serie of blogpost to explain each step of our workflow. You may find the first the first post on Medium : https://medium.com/@gkeng/make-your-computer-invest-like-a-human-ef0654ccdcff
Description of each file :
- SP500_nasdaq100.csv : Csv file containing all companies in S&P 500 and Nasdaq
- extract_reuters : Parallelized scraping of article from reuters.com
- extract_wsj : Attempt of scraping Wall Street Journal
- data_process : some data processing on articles collected
- doc2vec : Doc2Vec vectorization of press articles
- word2vec : Word2Vec vectorization of press articles, but we preferred to continue with Doc2vec
- list_firm : List of all firms we choosed for this implementation
- create_dataset : A script to create our 4 dimensions dataset for each company
- picklizer : A script to make pickle file of all press articles for each firm
- action : A class that implements methods and object to simulate a portfolio
- han : Implementation of the Hybrid Attention Network
- han_training : Implementation and training of HAN
- pickle : a folder with all pickle files for stock price of companies
- pickle_article : a folder with all pickle files for articles on each company
- daterange : to link the ID of day to the actual day (year/month/day).
Folders :
- sample_of_scrap : sample of the articles we scraped
- stock_value : Contains stock values and stock moves of the companies.
- pickle : Contains dictionaries of all stock moves in pickles files. Used to create y_train and y_test
- pickle_article : Contains dictionnaries { str day : str [ list of all articles ID for this company on this day] } in pickle file.
- firm_csv_folder_old : Contains csv with IDs of all articles for each company.
Steps to follow to run the project :
-
Run extract_reuters.py it will organize articles in folder like this : your_chosen_folder <=== day_folder <== journal_dir <== article_title.txt
-
Use functions in data_process.py to process the data. In this order :
- rename_dir : will rename all directories. The directory for the first day( 1st January 201X) will be "0001"
- rename_file : will give an ID to every file. The 15th article of the first day will be "0001_15.txt"
- create_csv_firm will create a csv for each company in which one can find every day and ID of articles in which the company is cited
-
Run picklizer.py. This creates a dictionnary for each company and saves it as a pickle file. { str day : str [ list of all articles ID for this company on this day] }
-
Run doc2vec.py to train the doc2vec model and vectorize all the press articles. The output file is heavy.. For years 2015 to 2017 our doc2vec file was 2 Go of size
Now, focus on the stock prices and stock moves. We took most of our stock values from here : https://www.kaggle.com/camnugent/sandp500 You can also find many here : https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
-
Run pickle_stock_value.py to transform all csv of stock values in a pickle file containing a dictionary : { str day : float stock_value }
-
Run make_stock_move.py to create a csv of stock moves from day t to day t+1.
-
Run pickle_stock_move.py to create a dic of stock moves from day t to day t+1 stored in a pickle. { str day : int stock_move }
-
Run create_dataset.py to create the 4 dimension datase. tRefer to the comments in the code for more details.
-
Train the model with han_training.py
-
Test the model with show_results.py