Add utility methods to DataProcessor #948

Closed
amontanez24 opened this issue Aug 10, 2022 · 0 comments

Problem Description

Besides managing transformations, the DataProcessor is responsible for validating constraints, making sure all ids are unique, and converting to and from dicts and JSON.
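
To make the two data-facing responsibilities concrete, here is a minimal sketch in plain pandas. The helper names (`filter_valid_rows`, `make_ids_unique`), the callable-based constraints, and the integer id scheme are assumptions for illustration only; they are not the SDV API.

```python
import pandas as pd


def filter_valid_rows(data, constraints):
    """Keep only the rows that satisfy every constraint.

    Each constraint here is a hypothetical callable that takes a DataFrame
    and returns a boolean Series (True means the row is valid).
    """
    for is_valid in constraints:
        data = data[is_valid(data)]
    return data


def make_ids_unique(data, id_columns):
    """Regenerate any id column that contains duplicates.

    Assumption: ids are plain integers, so a fresh 0..n-1 range is used.
    """
    data = data.copy()
    for name in id_columns:
        if not data[name].is_unique:
            # Reuse the existing index so the new ids align row by row.
            data[name] = pd.Series(range(len(data)), index=data.index)
    return data


table = pd.DataFrame({
    'user_id': [1, 1, 1, 3],
    'low': [1, 5, 2, 9],
    'high': [4, 3, 8, 10],
})

# Drop rows where low >= high, then repair the duplicated ids.
valid = filter_valid_rows(table, [lambda df: df['low'] < df['high']])
result = make_ids_unique(valid, ['user_id'])
```

Note that the row index is left untouched; only the id column is repopulated, which mirrors the `ids.index = data.index.copy()` step in the logic quoted later in this issue.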

Expected behavior

  • Add filter_valid(self, data): Filter the data using the constraints and return only the valid rows.
    • Borrow logic from here

      SDV/sdv/metadata/table.py

      Lines 726 to 740 in e5fc3f6

      def filter_valid(self, data):
          """Filter the data using the constraints and return only the valid rows.

          Args:
              data (pandas.DataFrame):
                  Table data.

          Returns:
              pandas.DataFrame:
                  Table containing only the valid rows.
          """
          for constraint in self._constraints:
              data = constraint.filter_valid(data)

          return data
  • Add make_ids_unique(self, data): Repopulate any id fields in provided data to guarantee uniqueness.
    • Borrow logic from here

      SDV/sdv/metadata/table.py

      Lines 742 to 759 in e5fc3f6

      def make_ids_unique(self, data):
          """Repopulate any id fields in provided data to guarantee uniqueness.

          Args:
              data (pandas.DataFrame):
                  Table data.

          Returns:
              pandas.DataFrame:
                  Table where all id fields are unique.
          """
          for name, field_metadata in self._fields_metadata.items():
              if field_metadata['type'] == 'id' and not data[name].is_unique:
                  ids = self._make_ids(field_metadata, len(data))
                  ids.index = data.index.copy()
                  data[name] = ids

          return data
  • Add to_dict
    • Borrow logic from here

      SDV/sdv/metadata/table.py

      Lines 765 to 784 in e5fc3f6

      def to_dict(self):
          """Get a dict representation of this metadata.

          Returns:
              dict:
                  dict representation of this metadata.
          """
          return {
              'fields': copy.deepcopy(self._fields_metadata),
              'constraints': [
                  constraint if isinstance(constraint, dict) else constraint.to_dict()
                  for constraint in self._constraints
              ],
              'model_kwargs': copy.deepcopy(self._model_kwargs),
              'name': self.name,
              'primary_key': self._primary_key,
              'sequence_index': self._sequence_index,
              'entity_columns': self._entity_columns,
              'context_columns': self._context_columns,
          }
    • fields, constraints, primary_key, entity_columns, sequence_index and context_columns all now live in the SingleTableMetadata, so they will be handled by calling its to_dict method
  • Add to_json
    • Borrow logic from here

      SDV/sdv/metadata/table.py

      Lines 786 to 794 in e5fc3f6

      def to_json(self, path):
          """Dump this metadata into a JSON file.

          Args:
              path (str):
                  Path of the JSON file where this metadata will be stored.
          """
          with open(path, 'w') as out_file:
              json.dump(self.to_dict(), out_file, indent=4)
  • Add from_dict
    • Borrow logic from here

      SDV/sdv/metadata/table.py

      Lines 796 to 823 in e5fc3f6

      @classmethod
      def from_dict(cls, metadata_dict, dtype_transformers=None):
          """Load a Table from a metadata dict.

          Args:
              metadata_dict (dict):
                  Dict metadata to load.
              dtype_transformers (dict):
                  If passed, set the dtype_transformers on the new instance.
          """
          metadata_dict = copy.deepcopy(metadata_dict)
          fields = metadata_dict['fields'] or {}
          instance = cls(
              name=metadata_dict.get('name'),
              field_names=set(fields.keys()),
              field_types=fields,
              constraints=metadata_dict.get('constraints') or [],
              model_kwargs=metadata_dict.get('model_kwargs') or {},
              primary_key=metadata_dict.get('primary_key'),
              sequence_index=metadata_dict.get('sequence_index'),
              entity_columns=metadata_dict.get('entity_columns') or [],
              context_columns=metadata_dict.get('context_columns') or [],
              dtype_transformers=dtype_transformers,
              enforce_min_max_values=metadata_dict.get('enforce_min_max_values', True),
              learn_rounding_scheme=metadata_dict.get('learn_rounding_scheme', True),
          )
          instance._fields_metadata = fields
          return instance
    • Again, many of these variables now live in the SingleTableMetadata
  • Add from_json
    • Borrow logic from here

      SDV/sdv/metadata/table.py

      Lines 825 to 834 in e5fc3f6

      @classmethod
      def from_json(cls, path):
          """Load a Table from a JSON.

          Args:
              path (str):
                  Path of the JSON file to load
          """
          with open(path, 'r') as in_file:
              return cls.from_dict(json.load(in_file))
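
Since most of the serialized fields now live in the metadata object, one possible shape for the four conversion methods is to delegate to it. The following is only a sketch under stated assumptions: `MetadataStub` and `DataProcessorSketch` are hypothetical stand-ins (not the real SingleTableMetadata or DataProcessor), and the nested `'metadata'` key and the `epochs` entry are invented for illustration.

```python
import copy
import json
import os
import tempfile


class MetadataStub:
    """Hypothetical stand-in for SingleTableMetadata (not the real class)."""

    def __init__(self, fields=None, primary_key=None):
        self._fields = fields or {}
        self._primary_key = primary_key

    def to_dict(self):
        return {
            'fields': copy.deepcopy(self._fields),
            'primary_key': self._primary_key,
        }

    @classmethod
    def from_dict(cls, metadata_dict):
        return cls(
            fields=copy.deepcopy(metadata_dict.get('fields') or {}),
            primary_key=metadata_dict.get('primary_key'),
        )


class DataProcessorSketch:
    """Hypothetical processor that delegates metadata serialization."""

    def __init__(self, metadata, model_kwargs=None):
        self.metadata = metadata
        self._model_kwargs = model_kwargs or {}

    def to_dict(self):
        # The metadata serializes itself; the processor adds only its own state.
        return {
            'metadata': self.metadata.to_dict(),
            'model_kwargs': copy.deepcopy(self._model_kwargs),
        }

    def to_json(self, path):
        with open(path, 'w') as out_file:
            json.dump(self.to_dict(), out_file, indent=4)

    @classmethod
    def from_dict(cls, processor_dict):
        processor_dict = copy.deepcopy(processor_dict)
        metadata = MetadataStub.from_dict(processor_dict.get('metadata') or {})
        return cls(metadata, model_kwargs=processor_dict.get('model_kwargs') or {})

    @classmethod
    def from_json(cls, path):
        with open(path, 'r') as in_file:
            return cls.from_dict(json.load(in_file))


processor = DataProcessorSketch(
    MetadataStub(fields={'user_id': {'type': 'id'}}, primary_key='user_id'),
    model_kwargs={'epochs': 10},
)

# Round trip through a temporary JSON file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'processor.json')
    processor.to_json(path)
    restored = DataProcessorSketch.from_json(path)

round_trip_ok = restored.to_dict() == processor.to_dict()
```

Keeping `to_json` and `from_json` as thin wrappers over `to_dict` and `from_dict`, as in the borrowed code above, means only the dict layout has to change when fields move into the metadata.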
@amontanez24 added the feature request and new labels and removed the new label on Aug 10, 2022
@amontanez24 amontanez24 added this to the 1.0.0 milestone Aug 16, 2022
This was referenced Aug 18, 2022