Simplify blocks yield logic for lines reader #6932

medariox · 2019-07-10T14:39:49Z

PR is based on the DEVELOP branch
Don't send big changes all at once. Split up big PRs into multiple smaller PRs that are easier to manage and review
Read the contribution guide

General overview of the bug: #6927 (comment)

sharkykh · 2019-07-10T15:22:51Z

For reference, I found this - https://github.com/RobinNil/file_read_backwards
(I'm against adding this as a dependency though...)
Skimming through the code they a custom buffer of some sorts, that looks for the furthest newline character (or EOF) in a block/chunk, and reads using that.

~~BTW the original function for what we have right now is no longer part of that flyingcircus package.~~
https://bitbucket.org/norok2/flyingcircus/src/9cbaa53051083f87152c5787e45dae788220eb92/flyingcircus/base.py#lines-4616
and https://bitbucket.org/norok2/flyingcircus/src/9cbaa53051083f87152c5787e45dae788220eb92/flyingcircus/base.py#lines-4818

medariox · 2019-07-10T15:38:23Z

Thanks for the debugging @sharkykh
If I understand it correctly this code is exactly there to prevent that issue. I didn't think of it when changing the function to our use case. I believe it should be fixed for good now. Do you mind taking a look?

sharkykh · 2019-07-10T17:16:54Z

I'm very tired right now, but maybe this will help:

# Put in project root
# coding=utf-8
from __future__ import unicode_literals, print_function

data = (
    b'AAAAAAAAAAAAAAA\n'
    b'AA\xe2\x80\xa0AAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
    b'AAAAAAAAAAAAAAA\n'
)
with open('test.log', 'wb') as fh:
    fh.write(data)

import sys
from medusa.logger import reverse_readlines

# Print block sizes that fail (easier to test split points)
for i in range(1, len(data)):
    try:
        lines = [
            line for line in reverse_readlines('test.log', block_size=i)
        ]
    except UnicodeDecodeError:
        print('failed with block size =', i)

for index, line in enumerate(lines):
    try:
        print('lines[' + str(index) + '/' + str(len(lines) - 1) + ']', line)
    except UnicodeEncodeError:
        if sys.version_info[0] == 2 and isinstance(line, unicode):
            line = line.encode('utf-8')
            print(line)
        else:
            raise

The idea with this is, it should succeed with any block size and not print any "failed" messages.
I don't know if I did the test right for Python 2 because I'm getting UnicodeEncodeError

Results on `develop`:

c:\Dev\Medusa>py -3 test.py
failed with block size = 1
failed with block size = 2
failed with block size = 4
failed with block size = 5
failed with block size = 25
failed with block size = 31
failed with block size = 62
failed with block size = 124
failed with block size = 125
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AA†AAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

c:\Dev\Medusa>py -2 test.py
failed with block size = 1
failed with block size = 2
failed with block size = 4
failed with block size = 5
failed with block size = 25
failed with block size = 31
failed with block size = 62
failed with block size = 124
failed with block size = 125
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AAΓאáAAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

Results on this branch:

c:\Dev\Medusa>py -3 test.py
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AA†AAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

c:\Dev\Medusa>py -2 test.py
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AAΓאáAAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

medariox · 2019-07-10T17:48:57Z

This is simply awesome! Thank you so much @sharkykh
The UnicodeEncodeError in Py2 is expected, as it tries an implicit encode to str() with the default encoding (e.g. for Win 10 it's cp437(?)), but the test strings are utf-8 decoded. If you use repr(line) int the print() it will just print it as unicode.

sharkykh · 2019-07-11T13:46:36Z

The UnicodeEncodeError in Py2 is expected, as it tries an implicit encode to str() with the default encoding (e.g. for Win 10 it's cp437(?)), but the test strings are utf-8 decoded. If you use repr(line) int the print() it will just print it as unicode.

Yep. Looks like it. For me running chcp returns 862, I think it's just based on my Windows region/language setting (mine is obviously set to Israel/Hebrew).
By changing it to unicode (65001) it prints using print(line) without encode errors :)

C:\Dev\Medusa>chcp 65001
Active code page: 65001

C:\Dev\Medusa>py -2 test.py
lines[0/8] AAAAAAAAAAAAAAA
lines[1/8] AAAAAAAAAAAAAAA
lines[2/8] AAAAAAAAAAAAAAA
lines[3/8] AAAAAAAAAAAAAAA
lines[4/8] AAAAAAAAAAAAAAA
lines[5/8] AAAAAAAAAAAAAAA
lines[6/8] AAAAAAAAAAAAAAA
lines[7/8] AA†AAAAAAAAAA
lines[8/8] AAAAAAAAAAAAAAA

sharkykh · 2019-07-11T13:53:10Z

medusa/logger/__init__.py

-def reverse_readlines(filename, skip_empty=True, append_newline=False, block_size=512 * 1024,
-                      reset_offset=True, encoding='utf-8'):
+def reverse_readlines(filename, skip_empty=True, append_newline=False,
+                      block_size=128 * 1024, encoding='utf-8'):


Do you really think that there's a reason to lower the block size?
512KB to 128KB?

Considering that we only show max 1000 lines in the log viewer, which is roughly 128KB according to my tests (this can vary depending on the content ofc). I don't see why we should keep such a high block size. Also, this seems to suggest that 128KB is the default MMAP_THRESHOLD (that doesn't necessarily mean that it is more performant tho).

Okay then, I don't know enough about it but sure.
I guess it's the same thing as the shutil monkeypatch, where if the block size is larger than X, it doesn't have any noticeable improvement to performance.

medusa/logger/__init__.py

sharkykh · 2019-07-11T15:16:18Z

I'll just leave this here, it's a different test that the old code fails and new one passes,
but it's not neat and clean like the first one. Just so we have it if we need it in the future (hopefully not 🙏).

# Put in project root
# coding=utf-8
from __future__ import unicode_literals, print_function

data = (
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'\xe2\x80\xa0\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'ABCDEFG\xe2\x80\xa0HIJKLMNOPQRSTUVWXYZ'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n'
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
)
unicode_data = data.decode('utf-8').split('\n')
unicode_data.reverse()

with open('test.log', 'wb') as fh:
    fh.write(data)

from medusa.logger import reverse_readlines

# Print block sizes that fail (easier to test split points)
for i in range(1, len(data)):
    try:
        lines = [
            line for line in reverse_readlines('test.log', block_size=i)
        ]

        if lines != unicode_data:
            print('block_size = %d' % i)
            print(sum(len(l) for l in lines), lines)
            print(sum(len(l) for l in unicode_data), unicode_data)
    except UnicodeDecodeError:
        print('failed with block size = %d' % i)

Simplify blocks yield logic for lines reader

fc6bb14

medariox added the Bug label Jul 10, 2019

medariox added this to the 0.3.5 milestone Jul 10, 2019

medariox added the In progress label Jul 10, 2019

medariox added 2 commits July 10, 2019 17:31

Fix decoding of chunked lines

fc86b9c

Merge branch 'develop' into fix/simplify-blocks-reader

29e0ad6

medariox added 2 commits July 11, 2019 15:35

set block size to 128 * 1024

2b1c9a0

update changelog

0e07632

medariox added Concluded and removed In progress labels Jul 11, 2019

sharkykh reviewed Jul 11, 2019

View reviewed changes

medusa/logger/__init__.py Outdated Show resolved Hide resolved

sharkykh reviewed Jul 11, 2019

View reviewed changes

medusa/logger/__init__.py Outdated Show resolved Hide resolved

Minor enhancements (review)

09bc50d

sharkykh suggested changes Jul 11, 2019

View reviewed changes

medusa/logger/__init__.py Outdated Show resolved Hide resolved

medariox and others added 3 commits July 11, 2019 17:01

Review (2)

8648c91

Review (3)

b00377b

Update __init__.py

c6308c9

sharkykh approved these changes Jul 11, 2019

View reviewed changes

medariox merged commit 9924484 into develop Jul 11, 2019

medariox deleted the fix/simplify-blocks-reader branch July 11, 2019 16:24

medariox mentioned this pull request Jul 11, 2019

Uncaught exception in APIv2: UnicodeDecodeError #6927

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify blocks yield logic for lines reader #6932

Simplify blocks yield logic for lines reader #6932

medariox commented Jul 10, 2019 •

edited by sharkykh

Loading

sharkykh commented Jul 10, 2019 •

edited

Loading

medariox commented Jul 10, 2019

sharkykh commented Jul 10, 2019

medariox commented Jul 10, 2019

sharkykh commented Jul 11, 2019 •

edited

Loading

sharkykh Jul 11, 2019

medariox Jul 11, 2019 •

edited

Loading

sharkykh Jul 11, 2019

sharkykh commented Jul 11, 2019

Simplify blocks yield logic for lines reader #6932

Simplify blocks yield logic for lines reader #6932

Conversation

medariox commented Jul 10, 2019 • edited by sharkykh Loading

sharkykh commented Jul 10, 2019 • edited Loading

medariox commented Jul 10, 2019

sharkykh commented Jul 10, 2019

Results on develop:

Results on this branch:

medariox commented Jul 10, 2019

sharkykh commented Jul 11, 2019 • edited Loading

sharkykh Jul 11, 2019

Choose a reason for hiding this comment

medariox Jul 11, 2019 • edited Loading

Choose a reason for hiding this comment

sharkykh Jul 11, 2019

Choose a reason for hiding this comment

sharkykh commented Jul 11, 2019

medariox commented Jul 10, 2019 •

edited by sharkykh

Loading

sharkykh commented Jul 10, 2019 •

edited

Loading

Results on `develop`:

sharkykh commented Jul 11, 2019 •

edited

Loading

medariox Jul 11, 2019 •

edited

Loading