Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues on large files #8

Open
rutsch opened this issue May 17, 2016 · 6 comments
Open

Performance issues on large files #8

rutsch opened this issue May 17, 2016 · 6 comments

Comments

@rutsch
Copy link

rutsch commented May 17, 2016

Hi there,

I am trying to use your script to compare 2 large HTML files (500000 tokens in both arrays), but never reaches the end (and i've been pretty patient ;) )

I'm trying to find out what's going on exactly (now testing with files around 30000 tokens).
It takes about 68 seconds for the script to complete, which unfortunately is by far not good enough for my needs.

I've decided to see if i could find some things to improve, and the first thing i've noticed so far is in the create_index function.

I think the code below should only be executed when index[token] == null.

index[token] = []
idx = p.in_these.indexOf token
while idx isnt -1
index[token].push idx
idx = p.in_these.indexOf token, idx+1

Consider the fact that of 30000 tokens, there are a lot of duplicates (ie. " " (+/- 12000), "a", "and", "for".... etc )
Running the code as is means that you are getting the position of 12000 spaces in the target array for 12000 times, each time overwriting index[token].

Putting a check around the above lines immediately took of a solid 15 seconds of the total time (the time for building the index went down from 16 seconds to 1).

Just thought i'd let you know, think this should be fixed in your main file.

I'm now trying to find a way to get the recursive finding of matching blocks back to an acceptable excecution time (at the moment that part takes a little over 50 seconds)

Any ideas on this would be great!

Thx,
Rutger Scheepens

@cjke
Copy link

cjke commented Jun 30, 2016

+1 Massive perf issues. Especially in the find_match function. Can I do anything to help?

@jkabat
Copy link

jkabat commented Oct 4, 2016

@rutsch can you please share you optimized version?

@cjke
Copy link

cjke commented Oct 4, 2016

I am on mobile so cannot link correctly, but check out the inkling fork,
massive perf gains.

On 4 Oct 2016 11:08 PM, "JK" [email protected] wrote:

@rutsch https://github.com/rutsch can you please share you optimized
version?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#8 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA19YY_b-aDN2Xi8tvI7Ilyrzvg-0MW2ks5qwl0vgaJpZM4IgYgK
.

@monkeyman888
Copy link

Any updates here on improving performance or are there other js libraries that you guys can recommend for diffing html? When dealing with large files, it takes a very long time to finish processing (if ever). Thanks.

@avimar
Copy link

avimar commented Mar 30, 2022

Hi, 6 years later this still seems to be the best html diff of html library.
But with only ~140k tokens, I'm still seeing recursively_find_matching_blocks to take 7+ seconds. (create_index is only 1-1.5).

Any hints or better libraries?

@avimar
Copy link

avimar commented Mar 30, 2022

I think the code below should only be executed when index[token] == null.

index[token] = [] idx = p.in_these.indexOf token while idx isnt -1 index[token].push idx idx = p.in_these.indexOf token, idx+1

Consider the fact that of 30000 tokens, there are a lot of duplicates (ie. " " (+/- 12000), "a", "and", "for".... etc ) Running the code as is means that you are getting the position of 12000 spaces in the target array for 12000 times, each time overwriting index[token].

Putting a check around the above lines immediately took of a solid 15 seconds of the total time (the time for building the index went down from 16 seconds to 1).

Just to help anyone else - I didn't understand it until I messed with the code.

Wrap this code:

index[token] = [];
idx = p.in_these.indexOf(token);
	while (idx !== -1) {
		index[token].push(idx);
		idx = p.in_these.indexOf(token, idx + 1);
    	}

with if(!index[token]) {... }

It looks up each token - e.g. <b> and tries to match it to the entire list of matches. But it does that each time it gets to it. With this code, it can do the lookup/search once.

It went from 1000-1500 milliseconds for that step to 57ms, a huge difference.

But the main issue is recursively_find_matching_blocks.. any optimizations for that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants