Performance issues on large files #8
Comments
+1 Massive perf issues, especially in the find_match function. Can I do anything to help?
@rutsch can you please share your optimized version?
I am on mobile so cannot link correctly, but check out the inkling fork.
Any updates here on improving performance, or are there other JS libraries you can recommend for diffing HTML? When dealing with large files, it takes a very long time to finish processing (if ever). Thanks.
Hi, 6 years later this still seems to be the best HTML diff library. Any hints, or better libraries?
Just to help anyone else - I didn't understand it until I messed with the code. Wrap the index-building code quoted in the issue below in a check that index[token] hasn't already been filled in, so the positions for each token are only looked up once (see the sketch below). That step went from 1000-1500 milliseconds to 57 ms, a huge difference. But the main issue is recursively_find_matching_blocks. Any optimizations for that?
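A minimal sketch of that guard (assuming, as the issue below suggests, that the snippet in question is the index-building loop from create_index; the exact code was stripped from this comment):

# Build the position list only the first time a token is seen;
# later duplicates reuse the cached index[token] instead of rescanning.
unless index[token]?
  index[token] = []
  idx = p.in_these.indexOf token
  while idx isnt -1
    index[token].push idx
    idx = p.in_these.indexOf token, idx + 1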
Hi there,
I am trying to use your script to compare two large HTML files (500,000 tokens in both arrays), but it never reaches the end (and I've been pretty patient ;) ).
I'm trying to find out what's going on exactly (now testing with files of around 30,000 tokens).
It takes about 68 seconds for the script to complete, which unfortunately is by far not good enough for my needs.
I've decided to see if I could find some things to improve, and the first thing I've noticed so far is in the create_index function.
I think the code below should only be executed when index[token] == null.
index[token] = []
idx = p.in_these.indexOf token
while idx isnt -1
index[token].push idx
idx = p.in_these.indexOf token, idx+1
Consider that of the 30,000 tokens, a lot are duplicates (e.g. " " appears roughly 12,000 times, plus "a", "and", "for", etc.).
Running the code as-is means you look up the positions of those 12,000 spaces in the target array 12,000 times, each run overwriting index[token] with the same result.
Putting a check around the above lines immediately took a solid 15 seconds off the total time (building the index went down from 16 seconds to 1).
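For illustration, a single-pass alternative (a hypothetical sketch, not code from the library) that avoids the repeated indexOf scans entirely: walk the target array once and append each position to that token's list, which stays O(n) no matter how many duplicate tokens there are.

# Hypothetical single-pass construction of the same index.
index = {}
for token, i in p.in_these
  (index[token] ?= []).push i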
Just thought I'd let you know; I think this should be fixed in the main file.
I'm now trying to find a way to get the recursive finding of matching blocks back to an acceptable execution time (at the moment that part takes a little over 50 seconds).
Any ideas on this would be great!
Thx,
Rutger Scheepens