Hydrator closes suddenly with no errors #65

Open
rtrad89 opened this issue Oct 12, 2020 · 21 comments

@rtrad89

rtrad89 commented Oct 12, 2020

I am hydrating the GeoCoV19 dataset for May 1st. Hydrator was working fine until it stopped hydrating and suddenly closed with no error message.

Reopening the program and clicking Start triggers the same behaviour: it simply shuts down with no explanation.

I checked the ids around where it stopped and they are legit, without any overflow. I restarted the machine as well to no avail. The jsonl file as of now is ~21GB in size.
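A check like the one described above can be scripted. The following is only a minimal sketch: the function name `find_bad_ids` is hypothetical, and it assumes that "overflow" means an ID larger than a signed 64-bit integer and that the ID file holds one decimal ID per line.

```python
def find_bad_ids(lines, max_value=2**63 - 1):
    """Return (line_number, text) pairs for lines that are not plain decimal
    IDs or that exceed max_value (assumed: the signed 64-bit limit)."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        # Non-numeric text is flagged first, so int() only sees digits
        if not (text.isascii() and text.isdigit()) or int(text) > max_value:
            bad.append((number, text))
    return bad


# Made-up sample lines, not taken from the GeoCoV19 dataset
sample = ["1256000000000000000\n", "12560000abc\n", "99999999999999999999\n", "\n"]
print(find_bad_ids(sample))
```

To run it on a real file, pass an open file object instead of the sample list, for example `find_bad_ids(open("ids.txt", encoding="utf-8"))`, where `ids.txt` stands in for the actual ID file.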

Total Tweet Ids: 7,298,409
Tweet Ids Read: 4,485,700
Tweets Hydrated: 3,760,528
Percent Deleted: 16%
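As a sanity check, the counters above are consistent with each other if "Percent Deleted" is the share of read IDs that did not come back as tweets. That formula is an assumption; the thread does not say how Hydrator computes the figure.

```python
total_ids = 7_298_409
ids_read = 4_485_700
tweets_hydrated = 3_760_528

# Assumed definition: IDs that were read but returned no tweet
percent_deleted = 100 * (1 - tweets_hydrated / ids_read)
# IDs Hydrator had not yet reached when it stopped
ids_remaining = total_ids - ids_read

print(f"{percent_deleted:.1f}% deleted, {ids_remaining:,} IDs still to read")
```

Under that assumption the result rounds to the 16% shown in the interface.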

Any ideas on what I can do?

@rtrad89
Author

rtrad89 commented Oct 12, 2020

I have used a Python script to convert the current state of jsonl hydrated tweets into a csv file as a workaround.

The script's code:

# -*- coding: utf-8 -*-
"""
Adapted from https://stackoverflow.com/a/46653313/3429115
"""

import json
import csv
import io
from datetime import datetime

'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''

def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines.  Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            yield json.loads(line)    


# Open the JSONL input
data_json = io.open('tweets_20200501-V2.jsonl', mode='r', encoding='utf-8')
data_python = extract_json(data_json)

# Open the CSV output; newline='' lets the csv module control line endings
csv_out = io.open('tweets_20200501.csv', mode='w', encoding='utf-8', newline='')
writer = csv.writer(csv_out, lineterminator='\n')

# Field names (header row)
fields = ['id', 'created_at', 'retweet_id', 'user_screen_name',
          'user_followers_count', 'user_friends_count',
          'retweet_count', 'favorite_count', 'text']
writer.writerow(fields)

print(f"{datetime.utcnow()}: Output file created. Starting conversion..")

for i, line in enumerate(data_python):

    # screen_name and followers/friends are on the second level,
    # hence the nested lookups; missing objects fall back to {}
    user = line.get('user') or {}
    retweeted = line.get('retweeted_status') or {}

    row = [line.get('id_str'),
           line.get('created_at'),
           retweeted.get('id_str', ''),
           user.get('screen_name'),
           user.get('followers_count'),
           user.get('friends_count'),
           line.get('retweet_count'),
           line.get('favorite_count'),
           line.get('full_text') or line.get('text', ''),
           ]

    if i % 100000 == 0 and i > 0:
        print(f"{datetime.utcnow()}: {i} tweets done...")

    # csv.writer quotes fields that contain commas, quotes, or newlines
    writer.writerow(row)

print("All tweets done. Saving the csv...")
csv_out.close()
print("Done.")
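One caveat with this workaround: if Hydrator closed in the middle of a write, the last line of the JSONL file may be incomplete, and `json.loads` would then raise an error. Whether that actually happened here is not stated in the thread, so the following is only a sketch of a more tolerant reader; the name `extract_json_tolerant` is hypothetical.

```python
import io
import json


def extract_json_tolerant(fileobj):
    """Yield decoded objects from a JSONL stream, skipping blank lines
    and reporting lines that cannot be decoded."""
    for number, line in enumerate(fileobj, start=1):
        if not line.strip():
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            print(f"Skipping undecodable line {number}")


# Demo with an in-memory file whose last record is cut off mid-line
demo = io.StringIO('{"id_str": "1"}\n\n{"id_str": "2"}\n{"id_str": "3", "full_te')
ids = [tweet["id_str"] for tweet in extract_json_tolerant(demo)]
print(ids)
```

It could stand in for `extract_json` in the script above, at the cost of silently losing any record that was only partly written.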

@edsu
Member

edsu commented Oct 13, 2020

What operating system are you using @rtrad89?

@rtrad89
Author

rtrad89 commented Oct 14, 2020

> What operating system are you using @rtrad89?

Microsoft Windows 10 Pro x64, version 2004

@margauxw

I have the same issue!

@margauxw

If I try to add another file it also keeps crashing suddenly. Has worked fine for days.

@edsu
Member

edsu commented Dec 20, 2020

@rtrad89 do you have a folder C:\Program Files\Hydrator on your computer?

@rtrad89
Author

rtrad89 commented Dec 21, 2020

> @rtrad89 do you have a folder C:\Program Files\Hydrator on your computer?

@edsu I have installed it for my user only, so the folder is located under C:\Users\****\AppData\Local\Programs\

@edsu
Member

edsu commented Dec 21, 2020

@rtrad89 could you try opening a console window and starting the .exe from there? I would like to see whether any error message is printed.

@rtrad89
Author

rtrad89 commented Dec 31, 2020

@edsu
The following message appears when Hydrator.exe is launched:

(electron) The default value of app.allowRendererProcessReuse is deprecated, it is currently "false".  It will change to be "true" in Electron 9.  For more information please check https://github.com/electron/electron/issues/18397

@edsu
Member

edsu commented Dec 31, 2020

That message is normal. So you don't see anything else before it quits?

@rtrad89
Author

rtrad89 commented Dec 31, 2020

@edsu Strangely the hydration goes forward now without problems on my workstation. @margauxw could you assist in case you still have the problem?

@edsu
Member

edsu commented Dec 31, 2020

Weird! Well, on the plus side I'm glad the problem has gone away for the moment. I will leave this open in case it happens again.

@shullaw

shullaw commented Feb 25, 2021

I've had the same issue on Windows 10. I have been running Hydrator for over 7 days now along with 4 VMware machines, all with different Twitter accounts. Several issues popped up during the process, such as JavaScript errors and, as OP stated, closing for no reason after pressing Start. I am running on a laptop and I set it to never sleep or power off, only to turn the screen off, even when closing the lid. However, I still found issues when I would open my lid occasionally. I'm not sure if this is a Windows issue or a Hydrator issue.

I ran sfc /scannow in cmd and it did find an error that was fixed, but Hydrator still would not run. I've collected 360GB of tweets so far, and I still have a couple of VMs that run. My next step is to use Linux VMs (which I should have done from the beginning, but I couldn't get Hydrator to run on my Linux desktop! Although now it works).

Thankfully I've collected the majority of the tweets that I need. Even with the errors this is a great program.

@edsu
Member

edsu commented Feb 25, 2021

Thanks for summarizing those details @Tipphead! Do you see a state.json file in your Hydrator's internal storage location? I can see from the message you posted in #75 that it should be here:

C:\Users\j\AppData\Roaming\hydrator\storage (electron)  

@shullaw

shullaw commented Feb 25, 2021

No problem! I do see the state.json.

{"router":{"location":{"pathname":"/C:/Users/j/AppData/Local/Programs/hydrator/resources/app.asar/build/renderer/index.html","search":"","hash":"#/","query":{}},"action":"POP"},"datasets":[{"id":"26b2aedd-c511-4f36-b474-c0041509be43","path":"X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2.txt","outputPath":"X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2_hydrated","title":"trump_20200321_ids2","creator":"","publisher":"","url":"","hydrating":true,"numTweetIds":236577727,"idsRead":0,"tweetsHydrated":0,"completed":null}],"newDataset":{"selectedFile":"","title":"","creator":"","publisher":"","url":"","lineCount":""},"settings":{"authorize":false,"invalidPin":false,"twitterAccessKey":"XXXXXXXXXXXX","twitterAccessSecret":"XXXXXXXXXXXX","twitterScreenName":"XXXXXXXXXXXXXXX"}}

@edsu
Member

edsu commented Feb 25, 2021

Thanks for redacting the sensitive bits. I wonder if this might be part of the problem: it doesn't parse as JSON.

>>> import json
>>> json.load(open('x'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 245 (char 244)

The JSON parser (in Python) doesn't like the \T in "X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2.txt". I think the backslashes need to be escaped (written as \\).
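The escaping problem can be reproduced in a few lines of Python. This is a sketch using a shortened, hypothetical path rather than the real state.json, and it says nothing about how Hydrator itself writes the file. It shows three cases: a raw backslash before an uppercase letter is rejected, `json.dumps` produces an escaped form that round-trips, and a backslash before a lowercase `t` is accepted but decoded as a tab character.

```python
import json

# Hypothetical shortened path, for illustration only
windows_path = r"X:\Twitter_Project\Datasets\ids.txt"

# Hand-built JSON with raw backslashes: \T is not a valid JSON escape
broken = '{"path": "' + windows_path + '"}'
try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print("Rejected:", err.msg)

# json.dumps doubles each backslash, so the result round-trips
escaped = json.dumps({"path": windows_path})
print(escaped)
round_tripped = json.loads(escaped)["path"]

# \t is a valid escape, so this parses, but the path now contains a tab
silently_wrong = json.loads(r'{"path": "X:\to_Trump"}')["path"]
print(repr(silently_wrong))
```

The last case matters for folder names such as `to_Trump`: the parser raises no error, yet the decoded path no longer matches anything on disk.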

@shullaw

shullaw commented Feb 26, 2021

I changed it to \realdonaldtrump_2020321_ids2 and I changed the folder name so that it does not begin with \t, but no luck. I'm going to rename my Twitter_Project folder after I let my other Hydrators get some more tweets. It seems strange that it would have an issue with it after working for so long under that folder name.

@edsu
Member

edsu commented Feb 26, 2021

Yes, I might be wrong with this diagnosis. There are many backslashes in the JSON that I believe ought to be escaped. But perhaps it's not a problem for the JavaScript.

Had you been running the Hydrator for a long time without shutting it down? I think that it probably wouldn't need to read the path from the JSON when it started up after being shut down.

@shullaw

shullaw commented Feb 26, 2021

When I click on the id_file name on Hydrator it actually shows me X://Path//to//file. But obviously it doesn't like it, if it is telling you that. And I've changed so many hard drives, folders, file names, etc. who knows. It's been a mess figuring out where to store all of this.

But yes, I've had Hydrator open and the VMs open since last week running 99% of the day. I have shut down and restarted several times to try and fix the issue, but to no avail.

@shullaw

shullaw commented Feb 27, 2021

Update:
Windows host, Windows VMs, and Ubuntu VMs are all running fine. The /Twitter/to/trump path was the issue. There must have been a point where either Twitter was being escaped by being the //shared folder or by me not realizing the shared folder did not begin with a T.

I just want to point out that when Hydrator runs on Linux, it will actually catch the issue and notify you, whereas on Windows it will just shut down. Also, on Linux Hydrator automatically saves the output as .jsonl, whereas on Windows it goes to .txt. That's fine, as I prefer working with .txt. Another bug I've found is that on Linux, Hydrator has no icon in the task bar (not a big deal, just letting you know). Again, thanks for the program!

@edsu
Member

edsu commented Feb 27, 2021

Many thanks for debugging this @Tipphead! I will leave this open until I figure out the serialization issue.
