Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guard against unnecessarily calling dump_graph in logging #4619

Merged
merged 39 commits into from
Jan 31, 2022

Conversation

nathaniel-may
Copy link
Contributor

@nathaniel-may nathaniel-may commented Jan 24, 2022

resolves #4569

Description

  • Adds a new Lazy data type that can be reused anywhere in the code base (code quality passes mypy --strict without any ignores)
  • Includes mashumaro serialization
  • Has the offending cache events use the lazy type
  • Tests that the existing fire_event function will work with this new lazy construction
  • Tests that all relevant subclasses of Cache apply this lazy evaluation strategy to catch similar issues in the future

The Challenge

Mashumaro does not support composing serialization strategies which makes serializing generic types, like Lazy tedious and error prone. Even if the wrapped type T has a mashumaro serialization strategy, it must be explicitly called by the serializer for Lazy itself. To be explicit, this means we would have to manually code mashumaro serialization strategies for all combinations of types that Lazy wraps in our code base e.g. Lazy[str], Lazy[int], Lazy[Dict[str, List[str]]].

I have attempted to dig into Mashumaro's metaprogramming library to see if I can write a truly generic serialization for the Lazy type, but I haven't been able to figure it out.

Questions

  • We a few paths forward on this PR:
    1. Use this Lazy type to more easily communicate intention, and get good mypy type checking but introduce room for runtime errors when we forget to add a serialization strategy when it's going to be serialized like in the events module.
    2. Ditch Lazy so we don't have this gap to introduce new errors in serialization and instead pass unapplied methods with # type: ignore because mypy doesn't handle them well.
    3. Spend more time looking into mashumaro metaprogramming to potentially get the best of both worlds.

Suggested Review Flow

  1. Look at the two test cases. They should very clearly show the expected behavior.
  2. Read the implementation for lazy behavior which is found in dbt.lazy.py::Lazy.
  3. Find the call sites for the updated cache events to see how it all fits together from beginning to end.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot cla-bot bot added the cla:yes label Jan 24, 2022
@nathaniel-may nathaniel-may marked this pull request as ready for review January 24, 2022 23:00
@nathaniel-may nathaniel-may requested review from a team as code owners January 27, 2022 21:42
Comment on lines 8 to 22
# Logging
When events are processed via `fire_event`, nearly everything is logged. Whether or not the user has enabled the debug flag, all debug messages are still logged to the file. However, some events are particularly time consuming to construct because they return a huge amount of data. Today, the only messages in this category are cache events and are only logged if the `--log-cache-events` flag is on. This is important because these messages should not be created unless they are going to be logged, because they cause a noticable performance degredation. We achieve this by making the event class explicitly use lazy values for the expensive ones so they are not computed until the moment they are required. This is done with the data type `core/dbt/lazy.py::Lazy` which includes usage documentation.

Example:
```
@dataclass
class DumpBeforeAddGraph(DebugLevel, Cache):
dump: Lazy[Dict[str, List[str]]]
code: str = "E031"

def message(self) -> str:
return f"before adding : {self.dump.force()}"
```


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

# This is an explicit deserializer for the type Lazy[Dict[str, List[str]]]
# mashumaro does not support composing serialization strategies, so all
# future uses of Lazy will need to register a unique serialization class like this one.
class LazySerialization1(SerializationStrategy):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this name with just an integer on the end. But I think it's as good as you get without being overly verbose and still communicating that there can eventually be multiple associated with Lazy. No action needed, just needed to scratch the itch.

@Fatal1ty
Copy link

Fatal1ty commented Jan 29, 2022

Hi @nathaniel-may

As the author of mashumaro, I would be happy to help you sort out the problem if you would explain what you want to achieve. If it's not difficult, could you please write a sample code explaining the problem? You can do it here or create a new issue in the mashumaro repo.

As far as I understand, you want to serialize generic non dataclass types. Can GenericSerializableType help with it? You can also have a generic dataclass like your Lazy one, that will be serialized and deserialized according to the concrete types.

@nathaniel-may
Copy link
Contributor Author

nathaniel-may commented Jan 31, 2022

Hi, @Fatal1ty

That's very kind of you to offer to help here. Mashumaro does have several options for serializing generic types, but I haven't managed to figure out how to defer specifying the serialization for the concrete inner type to mashumaro.

Using the code in this PR, I currently have a SerializationStrategy registered with a config that returns a dict since that's the only concrete type we're using it with at this time:

class LazySerialization1(SerializationStrategy):
    def serialize(self, value) -> Dict[str, List[str]]:
        return value.force()
        
class Config(MashBaseConfig):
        serialization_strategy = {
            ...
            Lazy[Dict[str, List[str]]]: LazySerialization1()
        }

And this works great, but when I swap out the class to instead use a value of Lazy[str] mashumaro fails to serialize with the following exception: mashumaro.exceptions.UnserializableField: Field "dump" of type Lazy[str] in DumpBeforeAddGraph is not serializable.

What I want is a way for the above to work for any type Lazy[T] when T is either natively serializable (like str), or when T is a serializable type, or there exists a serialization strategy for T without needing to write code to handle serializing T again in the generic type's serializer. If something like the following worked that would be much nicer than what I've come up with:

class LazySerialization(SerializationStrategy):
    def serialize(self, value):
        return value.force()
        
class Config(MashBaseConfig):
        serialization_strategy = {
            ...
            Lazy: LazySerialization()
        }

@Fatal1ty
Copy link

Fatal1ty commented Jan 31, 2022

@nathaniel-may thank you for the details!

I came up with the idea of using DataClassDictMixin with SerializableType since it's legit. The code should be written as follows:

from dataclasses import dataclass
from datetime import date
from typing import Generic, TypeVar
from mashumaro import DataClassDictMixin
from mashumaro.types import SerializableType

T = TypeVar("T")

@dataclass
class Lazy(Generic[T], DataClassDictMixin, SerializableType):
    inner: T

    def _serialize(self) -> T:
        return self.to_dict()['inner']

    @classmethod
    def _deserialize(cls, value):
        return cls.from_dict({'inner': value})

    class Config:
        serialization_strategy = {
            # ...
        }

@dataclass
class LazyTest(DataClassDictMixin):
    lazy: Lazy[date]

obj = LazyTest.from_dict({'lazy': '2022-01-31'})
assert obj.lazy == Lazy[date](date(2022, 1, 31))
print(obj.to_dict())
assert obj.to_dict() == {'lazy': '2022-01-31'}

But unfortunately it won't work as expected. I think it's a bug that could be fixed. I'll dig deeper and try to fix it.

@nathaniel-may
Copy link
Contributor Author

@gshank suggested Lazy live in dbt.helper_types which makes sense to me so I've moved it there.

@nathaniel-may nathaniel-may merged commit 13b1865 into main Jan 31, 2022
@nathaniel-may nathaniel-may deleted the nate/skip_dump_graph branch January 31, 2022 19:14
@nathaniel-may
Copy link
Contributor Author

@Fatal1ty, since I've closed this PR if you'd like to continue our conversation in an issue or PR in the mashumaro repo, I'm happy to do so.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 1, 2022

Nice work here!!

Is this a fix we're going to be able to backport to 1.0.latest, for inclusion in v1.0.2 (given it resolves a significant performance regression)? Or does it intersect with / depend on #4505 in a way that makes backporting too difficult?

@nathaniel-may
Copy link
Contributor Author

@jtcohen6 this definitely depends on #4505. If it's important to backport, I can make a different commit to 1.0.latest but the user-facing behavior won't be exactly the same. If I exclude the dump field from the structured logging output, but keep it in the string message, I could get this fixed over there too without mashumaro.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 1, 2022

@nathaniel-may I'd be supportive of fixing the performance regression, even at the cost of losing that information in the structured logging output. To that end:

If I exclude the dump field from the structured logging output, but keep it in the string message, I could get this fixed over there too without mashumaro.

This feels like the right balance of effort + impact

nathaniel-may pushed a commit that referenced this pull request Feb 1, 2022
@nathaniel-may nathaniel-may mentioned this pull request Feb 1, 2022
4 tasks
nathaniel-may pushed a commit that referenced this pull request Feb 1, 2022
nathaniel-may pushed a commit that referenced this pull request Feb 2, 2022
* adds new function fire_event_if
@Fatal1ty
Copy link

Fatal1ty commented Feb 5, 2022

But unfortunately it won't work as expected. I think it's a bug that could be fixed. I'll dig deeper and try to fix it.

@nathaniel-may actually I missed a much simpler solution that works right now. You can wrap a value in the dataclass and use serialization hooks:

from dataclasses import dataclass
from datetime import date
from typing import Generic, TypeVar
from mashumaro import DataClassDictMixin

T = TypeVar("T")


@dataclass
class Lazy(Generic[T], DataClassDictMixin):
    inner: T

    def __post_serialize__(self: T, d):
        return d.pop('inner')

    @classmethod
    def __pre_deserialize__(cls, d):
        return {'inner': d}

    # @classmethod
    # def __post_deserialize__(cls, obj: T):
    #     # if you want to get LazyTest(lazy=datetime.date(2022, 1, 31))
    #     # instead of LazyTest(lazy=Lazy(inner=datetime.date(2022, 1, 31)))
    #     # but serialization will be broken because date doesn't have to_dict
    #     return obj.inner

    class Config:
        serialization_strategy = {
            # ...
        }


@dataclass
class LazyTest(DataClassDictMixin):
    lazy: Lazy[date]


obj = LazyTest.from_dict({'lazy': '2022-01-31'})
print(obj)
assert obj.lazy == Lazy[date](date(2022, 1, 31))
print(obj.to_dict())
assert obj.to_dict() == {'lazy': '2022-01-31'}

Also, UnserializableField exception on generic classes that implement SerializableType interface is a bug that will be fixed in the next version.

iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
* add lazy type and apply to cache events

automatic commit by git-black, original commits:
  13b1865
@nathaniel-may
Copy link
Contributor Author

Hey, @Fatal1ty

Thanks for the update! If we have to make any modifications to this area of the code I'll definitely look into using this strategy 💪. However, since it seems like it's been easy to upgrade the last few versions I'm looking forward to using your fix when it comes out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants