Performance issues on large files #8
Comments
+1 Massive perf issues, especially in the find_match function. Can I do anything to help?
@rutsch can you please share your optimized version?
I am on mobile so cannot link correctly, but check out the inkling fork.
Any updates here on improving performance, or are there other JS libraries you can recommend for diffing HTML? When dealing with large files, it takes a very long time to finish processing (if ever). Thanks.
Hi, 6 years later this still seems to be the best HTML diff library. Any hints, or better libraries?
Just to help anyone else - I didn't understand it until I messed with the code. Wrap the index-building code quoted in the issue below in a check that index[token] hasn't already been filled in, so the positions for each token are only looked up once (see the sketch below). That step went from 1000-1500 milliseconds to 57 ms, a huge difference. But the main issue is recursively_find_matching_blocks. Any optimizations for that?
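A minimal sketch of that guard (assuming, as the issue below suggests, that the snippet in question is the index-building loop from create_index; the exact code was stripped from this comment):

# Build the position list only the first time a token is seen;
# later duplicates reuse the cached index[token] instead of rescanning.
unless index[token]?
  index[token] = []
  idx = p.in_these.indexOf token
  while idx isnt -1
    index[token].push idx
    idx = p.in_these.indexOf token, idx + 1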
Hi there,
I am trying to use your script to compare two large HTML files (500,000 tokens in both arrays), but it never reaches the end (and I've been pretty patient ;) ).
I'm trying to find out what's going on exactly (now testing with files of around 30,000 tokens).
It takes about 68 seconds for the script to complete, which unfortunately is by far not good enough for my needs.
I've decided to see if I could find some things to improve, and the first thing I've noticed so far is in the create_index function.
I think the code below should only be executed when index[token] == null.
index[token] = []
idx = p.in_these.indexOf token
while idx isnt -1
index[token].push idx
idx = p.in_these.indexOf token, idx+1
Consider that of the 30,000 tokens, a lot are duplicates (e.g. " " appears roughly 12,000 times, plus "a", "and", "for", etc.).
Running the code as-is means you look up the positions of those 12,000 spaces in the target array 12,000 times, each run overwriting index[token] with the same result.
Putting a check around the above lines immediately took a solid 15 seconds off the total time (building the index went down from 16 seconds to 1).
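For illustration, a single-pass alternative (a hypothetical sketch, not code from the library) that avoids the repeated indexOf scans entirely: walk the target array once and append each position to that token's list, which stays O(n) no matter how many duplicate tokens there are.

# Hypothetical single-pass construction of the same index.
index = {}
for token, i in p.in_these
  (index[token] ?= []).push i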
Just thought I'd let you know; I think this should be fixed in the main file.
I'm now trying to find a way to get the recursive finding of matching blocks back to an acceptable execution time (at the moment that part takes a little over 50 seconds).
Any ideas on this would be great!
Thx,
Rutger Scheepens