Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify blocks yield logic for lines reader #6932

Merged
merged 9 commits into from
Jul 11, 2019

Conversation

medariox
Copy link
Contributor

@medariox medariox commented Jul 10, 2019

  • PR is based on the DEVELOP branch
  • Don't send big changes all at once. Split up big PRs into multiple smaller PRs that are easier to manage and review
  • Read the contribution guide

General overview of the bug: #6927 (comment)

Fixes #6927
Fixes #5463

@medariox medariox added the Bug label Jul 10, 2019
@medariox medariox added this to the 0.3.5 milestone Jul 10, 2019
@sharkykh
Copy link
Contributor

sharkykh commented Jul 10, 2019

For reference, I found this - https://github.com/RobinNil/file_read_backwards
(I'm against adding this as a dependency though...)
Skimming through the code they a custom buffer of some sorts, that looks for the furthest newline character (or EOF) in a block/chunk, and reads using that.

BTW the original function for what we have right now is no longer part of that flyingcircus package.
https://bitbucket.org/norok2/flyingcircus/src/9cbaa53051083f87152c5787e45dae788220eb92/flyingcircus/base.py#lines-4616
and https://bitbucket.org/norok2/flyingcircus/src/9cbaa53051083f87152c5787e45dae788220eb92/flyingcircus/base.py#lines-4818

@medariox
Copy link
Contributor Author

Thanks for the debugging @sharkykh
If I understand it correctly this code is exactly there to prevent that issue. I didn't think of it when changing the function to our use case. I believe it should be fixed for good now. Do you mind taking a look?

@sharkykh
Copy link
Contributor

I'm very tired right now, but maybe this will help:

# Put in project root
# coding=utf-8
from __future__ import unicode_literals, print_function

data = (
    b'AAAAAAAAAAAAAAA\n'
    b'AA\xe2\x80\xa0AAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
)
with open('test.log', 'wb') as fh:
    fh.write(data)

import sys
from medusa.logger import reverse_readlines

# Print block sizes that fail (easier to test split points)
for i in range(1, len(data)):
    try:
        lines = [
            line for line in reverse_readlines('test.log', block_size=i)
        ]
    except UnicodeDecodeError:
        print('failed with block size =', i)

for index, line in enumerate(lines):
    try:
        print('lines[' + str(index) + '/' + str(len(lines) - 1) + ']', line)
    except UnicodeEncodeError:
        if sys.version_info[0] == 2 and isinstance(line, unicode):
            line = line.encode('utf-8')
            print(line)
        else:
            raise

The idea with this is, it should succeed with any block size and not print any "failed" messages.
I don't know if I did the test right for Python 2 because I'm getting UnicodeEncodeError

Results on develop:

c:\Dev\Medusa>py -3 test.py
failed with block size = 1
failed with block size = 2
failed with block size = 4
failed with block size = 5
failed with block size = 25
failed with block size = 31
failed with block size = 62
failed with block size = 124
failed with block size = 125
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AA†AAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

c:\Dev\Medusa>py -2 test.py
failed with block size = 1
failed with block size = 2
failed with block size = 4
failed with block size = 5
failed with block size = 25
failed with block size = 31
failed with block size = 62
failed with block size = 124
failed with block size = 125
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AAΓאáAAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

Results on this branch:

c:\Dev\Medusa>py -3 test.py
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AA†AAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

c:\Dev\Medusa>py -2 test.py
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AAΓאáAAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

@medariox
Copy link
Contributor Author

This is simply awesome! Thank you so much @sharkykh
The UnicodeEncodeError in Py2 is expected, as it tries an implicit encode to str() with the default encoding (e.g. for Win 10 it's cp437(?)), but the test strings are utf-8 decoded. If you use repr(line) int the print() it will just print it as unicode.

@sharkykh
Copy link
Contributor

sharkykh commented Jul 11, 2019

The UnicodeEncodeError in Py2 is expected, as it tries an implicit encode to str() with the default encoding (e.g. for Win 10 it's cp437(?)), but the test strings are utf-8 decoded. If you use repr(line) int the print() it will just print it as unicode.

Yep. Looks like it. For me running chcp returns 862, I think it's just based on my Windows region/language setting (mine is obviously set to Israel/Hebrew).
By changing it to unicode (65001) it prints using print(line) without encode errors :)

C:\Dev\Medusa>chcp 65001
Active code page: 65001

C:\Dev\Medusa>py -2 test.py
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AA†AAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

def reverse_readlines(filename, skip_empty=True, append_newline=False, block_size=512 * 1024,
reset_offset=True, encoding='utf-8'):
def reverse_readlines(filename, skip_empty=True, append_newline=False,
block_size=128 * 1024, encoding='utf-8'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really think that there's a reason to lower the block size?
512KB to 128KB?

Copy link
Contributor Author

@medariox medariox Jul 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that we only show max 1000 lines in the log viewer, which is roughly 128KB according to my tests (this can vary depending on the content ofc). I don't see why we should keep such a high block size. Also, this seems to suggest that 128KB is the default MMAP_THRESHOLD (that doesn't necessarily mean that it is more performant tho).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay then, I don't know enough about it but sure.
I guess it's the same thing as the shutil monkeypatch, where if the block size is larger than X, it doesn't have any noticeable improvement to performance.

medusa/logger/__init__.py Outdated Show resolved Hide resolved
medusa/logger/__init__.py Outdated Show resolved Hide resolved
medusa/logger/__init__.py Outdated Show resolved Hide resolved
@sharkykh
Copy link
Contributor

I'll just leave this here, it's a different test that the old code fails and new one passes,
but it's not neat and clean like the first one. Just so we have it if we need it in the future (hopefully not 🙏).

# Put in project root
# coding=utf-8
from __future__ import unicode_literals, print_function

data = (
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'\xe2\x80\xa0\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'ABCDEFG\xe2\x80\xa0HIJKLMNOPQRSTUVWXYZ'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
)
unicode_data = data.decode('utf-8').split('\n')
unicode_data.reverse()

with open('test.log', 'wb') as fh:
    fh.write(data)

from medusa.logger import reverse_readlines

# Print block sizes that fail (easier to test split points)
for i in range(1, len(data)):
    try:
        lines = [
            line for line in reverse_readlines('test.log', block_size=i)
        ]

        if lines != unicode_data:
            print('block_size = %d' % i)
            print(sum(len(l) for l in lines), lines)
            print(sum(len(l) for l in unicode_data), unicode_data)
    except UnicodeDecodeError:
        print('failed with block size = %d' % i)

@medariox medariox merged commit 9924484 into develop Jul 11, 2019
@medariox medariox deleted the fix/simplify-blocks-reader branch July 11, 2019 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants