Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with the redundancy load #147

Open
yyzdtccjdtc opened this issue Apr 15, 2022 · 3 comments
Open

Performance issue with the redundancy load #147

yyzdtccjdtc opened this issue Apr 15, 2022 · 3 comments

Comments

@yyzdtccjdtc
Copy link

loc/src/lib.rs

Line 668 in 1e0c7f4

for multi in multis.iter() {

According to my tests, the vector multis in this line of code has only one element in most cases. Calling the iter() function on a vector with only one element results in having a lot of redundant load operations. If we create a fast path for a vector with only one element, we can eliminate a lot of computation and memory operations, which in turn improves the performance of the program. Execution time decreased from 7.05s to 5.74s with my test input, which is a 22.8% speedup.

Here's the modified code for this count() function. (src/lib.rs: 597-714)

`
pub fn count(filepath: &str) -> Count {
let lang = lang_from_ext(filepath);
let (singles, multis) = counter_config_for_lang(lang);

let mfile = File::open(filepath);
let mut file = match mfile {
    Ok(file) => file,
    Err(_) => {
        return Count::default();
    }
};
// TODO(cgag): set the size of this vec to size of the file + a byte? a reddit comment
// somewhere says fs::read will do this ofr you.
let mut bytes = vec![];
file.read_to_end(&mut bytes).expect("nani?!");

let mut c = Count::default();
let mut multi_stack: Vec<(&str, &str)> = vec![];

//***YYZ note: The fast path for vecotr with only one element
if multis.len() == 1 {
    let (start, end) = &multis[0];
    let start_len = start.len();
    let end_len   = end.len();
    'line: for byte_line in ByteLines(&bytes).lines() {
        let line = match std::str::from_utf8(byte_line) {
            Ok(s) => s,
            Err(_) => return Count::default(),
        };
        c.lines += 1;

        let line = line.trim_start();
        if line.is_empty() {
            c.blank += 1;
            continue;
        };

        if multi_stack.is_empty() {
            for single_start in singles.iter() {
                if line.starts_with(single_start) {
                    if multis.iter().any(|(m_start, _)| line.starts_with(m_start)) {
                        break;
                    }

                    c.comment += 1;
                    continue 'line;
                }
            }

            if multis.is_empty() {
                c.code += 1;
                continue 'line;
            }
        }

        if multi_stack.is_empty() && !multis.iter().any(|(start, end)| line.contains(start) || line.contains(end)) {
            c.code += 1;
            continue 'line;
        }

        let mut pos = 0;
        let mut found_code = 0;
        let line_len = line.len();
        let contains_utf8 = (0..line_len).any(|i| !line.is_char_boundary(i));

        'outer: while pos < line_len {
                if contains_utf8 {
                    for i in pos..pos + min(max(start_len, end_len) + 1, line_len - pos) {
                        if !line.is_char_boundary(i) {
                            pos += 1;
                            continue 'outer;
                        }
                    }
                }

                if pos + start_len <= line_len && &line[pos..pos + start_len] == *start {
                    pos += start_len;
                    multi_stack.push(multis[0]);
                    continue;
                }

                if !multi_stack.is_empty() {
                    let (_, mut end) = multi_stack.last().expect("stack last");
                    if pos+end.len() <= line_len && &line[pos..pos+end.len()] == end {
                        let _ = multi_stack.pop();
                        pos += end.len();
                    }
                } else if multi_stack.is_empty() && pos < line_len && !&line[pos..pos + 1].chars().next().expect("whitespace check").is_whitespace() {
                    found_code += 1;
                }
            // }
            pos += 1;
        }

        // TODO(cgag): can this ever be greater or was that just defensive coding
        if found_code >= multis.len() {
            c.code += 1;
        } else {
            c.comment += 1;
        }
    }
// ***YYZ note: Will run the original code if it has more than one element
}else{
    'line: for byte_line in ByteLines(&bytes).lines() {
        let line = match std::str::from_utf8(byte_line) {
            Ok(s) => s,
            // TODO(cgag): should we report when this happens?
            Err(_) => return Count::default(),
        };
        c.lines += 1;

        let line = line.trim_start();
        // should blanks within a comment count as blank or comment? This counts them as blank.
        if line.is_empty() {
            c.blank += 1;
            continue;
        };

        // if we match a single line comment, count it and go onto next line
        // TODO(cgag): is the multiline comment start symbol ever the shorter one?
        // if multi_stack.is_empty, then we're not currently in a multiline comment
        if multi_stack.is_empty() {
            for single_start in singles.iter() {
                if line.starts_with(single_start) {
                    // if this single_start is a prefix of a multi_start,
                    // make sure that the line doesn't actually start with the multi_start
                    // TODO(cgag): donm't do this check here
                    // TODO(cgag): this assumption that the multi-line comment is always the longer one
                    //             may well be a terrible one
                    if multis.iter().any(|(m_start, _)| line.starts_with(m_start)) {
                        break;
                    }

                    c.comment += 1;
                    continue 'line;
                }
            }

            if multis.is_empty() {
                c.code += 1;
                continue 'line;
            }
        }

        if multi_stack.is_empty() && !multis.iter().any(|(start, end)| line.contains(start) || line.contains(end)) {
            c.code += 1;
            continue 'line;
        }

        let mut pos = 0;
        let mut found_code = 0;
        let line_len = line.len();
        let contains_utf8 = (0..line_len).any(|i| !line.is_char_boundary(i));

        'outer: while pos < line_len {
            for multi in multis.iter() {
                let (start, end) = multi;
                let start_len = start.len();
                let end_len   = end.len();

                // TODO(cgag): this is almost ceratinly giving us incorrect results.  Say the
                // first multi is the longest.  If we advance position because the final byte
                // position of that multi hits unicode, we might have skipped over a perfectly
                // valid comment start that was unaffected by the unicode.
                if contains_utf8 {
                    for i in pos..pos + min(max(start_len, end_len) + 1, line_len - pos) {
                        if !line.is_char_boundary(i) {
                            pos += 1;
                            continue 'outer;
                        }
                    }
                }

                if pos + start_len <= line_len && &line[pos..pos + start_len] == *start {
                    pos += start_len;
                    multi_stack.push(*multi);
                    continue;
                }

                if !multi_stack.is_empty() {
                    let (_, mut end) = multi_stack.last().expect("stack last");
                    if pos+end.len() <= line_len && &line[pos..pos+end.len()] == end {
                        let _ = multi_stack.pop();
                        pos += end.len();
                    }
                } else if multi_stack.is_empty() && pos < line_len && !&line[pos..pos + 1].chars().next().expect("whitespace check").is_whitespace() {
                    found_code += 1;
                }
            }
            pos += 1;
        }

        // TODO(cgag): can this ever be greater or was that just defensive coding
        if found_code >= multis.len() {
            c.code += 1;
        } else {
            c.comment += 1;
        }
    }
}

c

}
`

@yyzdtccjdtc
Copy link
Author

@cgag May I ask if this software is still being maintained?

@cgag
Copy link
Owner

cgag commented May 22, 2022

Not really unfortunately, I very periodically merge stuff. I did want to comment and say this is clever though, the overhead of .iter() definitely never occurred to me.

@pengfei-su
Copy link

Hi @cgag

This is a good catch. I wonder what is blocking you to merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants