
add caching, add tests #175

Merged (3 commits) Feb 23, 2020
Conversation

Germandrummer92 (Contributor) commented Feb 11, 2020

id should work for caching, as the three functions in question always receive a class as an argument, and a class always has the same identity. Tests are passing and I added some, too.
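The id-based approach works because `functools.lru_cache` keys on its arguments, and a class object hashes by identity, so the expensive per-class reflection runs only once per class. A minimal sketch of the idea (`cached_fields` and the call counter are illustrative, not the library's actual internals):

```python
from dataclasses import dataclass, fields
from functools import lru_cache

calls = {"n": 0}  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def cached_fields(cls):
    # expensive per-class reflection, executed once per class object;
    # subsequent calls with the same class hit the cache
    calls["n"] += 1
    return tuple(f.name for f in fields(cls))


@dataclass
class Point:
    x: int
    y: int


for _ in range(1000):
    names = cached_fields(Point)  # cache hit after the first call

print(names, calls["n"])  # ('x', 'y') 1
```

Because every instance of a given dataclass shares the same class object, this turns per-instance reflection cost into a one-time cost per class.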

Load test with cache_size=1:

It took 39.030911922454834s to convert to dict
It took 30.857404708862305s to convert from dict

After introducing caching:

It took 41.651854038238525s to convert to dict
It took 16.02430510520935s to convert from dict

So about a 2x speedup when converting from dict.

I wrote a load test and profiled it to get these results. There is a big improvement, but as you can see in the graph, performance is still quite poor when encoding to a dict. I tried to improve the performance of as_dict as well but didn't find a good solution: a cache wouldn't really work there, because computing a consistent hash of the input would itself require something like json.dumps, which I think (I didn't try) would not improve performance. Find a screenshot of the call graph attached. If you have other ideas to fix it, that would be great!
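One concrete obstacle to caching per instance: although these dataclasses are frozen, the generated `__hash__` hashes a tuple of the field values, so any field holding a list or dict makes the whole instance unhashable, and it cannot even be used as an `lru_cache` key. A quick demonstration (`Nested` is an illustrative stand-in for the classes in the load test below):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Nested:
    items: List[int] = field(default_factory=list)


n = Nested(items=[1, 2, 3])
try:
    # frozen=True generates __hash__, but it hashes the field values,
    # and a list field is itself unhashable
    hash(n)
    print("hashable")
except TypeError:
    print("unhashable")
```

So instance-level caching would first need a content-based hash, which brings back the traversal cost the cache was meant to avoid.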

from random import randint
from typing import List, Dict, Any

from dataclasses_json import dataclass_json, DataClassJsonMixin
from time import time
from dataclasses import dataclass, field


@dataclass_json
@dataclass(frozen=True)
class AnotherNestedDataclass(DataClassJsonMixin):
    more_stuff: List[int] = field(default_factory=list)
    x: str = "a"
    y: str = "b"
    z: str = "c"
    dict_key: Dict[str, Any] = field(default_factory=dict)
    key: Any = "some_super_value"


@dataclass_json
@dataclass(frozen=True)
class NestedDataclass(DataClassJsonMixin):
    super_complicated: AnotherNestedDataclass = AnotherNestedDataclass()
    stuff: List[AnotherNestedDataclass] = field(default_factory=list)
    a: float = 0.0
    b: float = 0.0
    c: float = 0.0


@dataclass_json
@dataclass(frozen=True)
class BigDataclass(DataClassJsonMixin):
    nested: List[NestedDataclass] = field(default_factory=list)
    top_level_dict: Dict[str, Any] = field(default_factory=dict)


if __name__ == '__main__':
    to_parse: List[BigDataclass] = []
    for _ in range(1, 500):
        ints = [randint(1, 100)] * 50
        nested_dataclass = NestedDataclass(stuff=[AnotherNestedDataclass(more_stuff=ints)] * 50)
        to_parse.append(BigDataclass(nested=[nested_dataclass] * 50))

    start = time()
    dicts = [dc.to_dict() for dc in to_parse]
    end = time()
    print(f"It took {end - start}s to convert to dict")

    start = time()
    parsed = [BigDataclass.from_dict(d) for d in dicts]
    end = time()
    print(f"It took {end - start}s to convert from dict")

profile results after caching:

[call-graph screenshots attached in the original PR]

Germandrummer92 (Contributor, Author)

Edited the numbers above, as I messed up the load test before: converting from dict went from 30.857404708862305s to 16.02430510520935s.

RunOrVeith (Contributor)

You could try using https://deepdiff.readthedocs.io/en/latest/deephash.html to get content-based object hashes in to_dict, but I'm not sure about introducing the dependency, or whether it would actually improve the speed.
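DeepHash computes content-based hashes of arbitrary (including unhashable) objects, but the trade-off the thread worries about still applies: any content hash must traverse the whole object, which costs roughly as much as serializing it. A stdlib sketch of the same idea (`content_hash` is a hypothetical helper, not part of dataclasses_json or deepdiff):

```python
import hashlib
import json


def content_hash(obj):
    # stable, content-based cache key for unhashable containers;
    # note it serializes the whole object, so it costs about as much
    # as the to_dict work it would be trying to skip
    blob = json.dumps(obj, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


a = {"x": [1, 2], "y": {"z": 3}}
b = {"y": {"z": 3}, "x": [1, 2]}  # same content, different insertion order
print(content_hash(a) == content_hash(b))  # True
```

This would only pay off if the same object is converted many times, since the first conversion and the hash both walk the full structure.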

lidatong (Owner) left a comment

Hey @Germandrummer92, thanks for implementing this optimization. A 2x speedup is great. Performance is definitely on the radar as more people rely on this library. I'll need to deep-dive into it myself, but I have broad guesses at where the other bottlenecks might be.

Also just want to say thank you to both you and @RunOrVeith for your contributions.

@lidatong lidatong mentioned this pull request Feb 23, 2020
lidatong (Owner) commented Feb 23, 2020

I had to fix merge conflicts, so I moved off this fork into #182. Fixing the merge directly here.

@lidatong lidatong changed the base branch from master to feat/caching February 23, 2020 14:15
@lidatong lidatong merged commit 2208d4f into lidatong:feat/caching Feb 23, 2020
lidatong added a commit that referenced this pull request Mar 13, 2020