Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STY: Put pure data mappings in separate files #1005

Merged
merged 6 commits into from
Jun 19, 2022
Merged

STY: Put pure data mappings in separate files #1005

merged 6 commits into from
Jun 19, 2022

Conversation

MartinThoma
Copy link
Member

No description provided.

@codecov
Copy link

codecov bot commented Jun 18, 2022

Codecov Report

Merging #1005 (e647f78) into main (f62e051) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1005      +/-   ##
==========================================
+ Coverage   89.28%   89.30%   +0.01%     
==========================================
  Files          19       24       +5     
  Lines        4435     4443       +8     
  Branches      921      921              
==========================================
+ Hits         3960     3968       +8     
  Misses        323      323              
  Partials      152      152              
Impacted Files Coverage Δ
PyPDF2/_codecs/adobe_glyphs.py 100.00% <ø> (ø)
PyPDF2/_cmap.py 84.39% <100.00%> (ø)
PyPDF2/_codecs/__init__.py 100.00% <100.00%> (ø)
PyPDF2/_codecs/pdfdoc.py 100.00% <100.00%> (ø)
PyPDF2/_codecs/std.py 100.00% <100.00%> (ø)
PyPDF2/_codecs/symbol.py 100.00% <100.00%> (ø)
PyPDF2/_codecs/zapfding.py 100.00% <100.00%> (ø)
PyPDF2/generic.py 91.25% <100.00%> (-0.29%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f62e051...e647f78. Read the comment docs.

@MartinThoma
Copy link
Member Author

@pubpub-zz What do you think about this change?

It might be helpful to have those symbol mappings in one folder for the following reasons:

  1. In PRs, I can notice data-changes more easily
  2. During testing / coverage, I can exclude the data and thus get a more reasonable estimate of how good our test coverage is.
  3. We can make diffs between the encodings and potentially notice issues more easily
  4. If new symbol mappings get added, it's pretty clear where they belong
  5. The generic.py is still huge, but not as crazy as before

I think all of those reasons are rather weak, but I guess this is overall more clean?

@MartinThoma MartinThoma marked this pull request as ready for review June 18, 2022 08:42
@MartinThoma
Copy link
Member Author

I was also uncertain if there is a more efficient way to store the data. I was mainly thinking about the speed it takes to load. The storage size is nothing that concerns me at the moment. But I guess reading / parsing a Python file is probably already super efficient.

@MartinThoma
Copy link
Member Author

Some functions, e.g. fill_from_encoding might actually fit better in the codecs part than into generic

@pubpub-zz
Copy link
Collaborator

@pubpub-zz What do you think about this change?

It might be helpful to have those symbol mappings in one folder for the following reasons:

1. In PRs, I can notice data-changes more easily
2. During testing / coverage, I can exclude the data and thus get a more reasonable estimate of how good our test coverage is.
3. We can make diffs between the encodings and potentially notice issues more easily
4. If new symbol mappings get added, it's pretty clear where they belong
5. The generic.py is still huge, but not as crazy as before

I think all of those reasons are rather weak, but I guess this is overall more clean?

I was not originally happy to add symbol,and other encodings. And this looks clearer : I love it! 😀
perhaps you should also transfer _predefined_cmap and _default_fonts_space_width from _cmap.py

@pubpub-zz
Copy link
Collaborator

In my fork of PyPDF4, I've tried to restructure the code.
For this I've isolated some very simple objects in a generic1 and I was just importing it into generic to prevent changing the rest of the code :
I've isolated that way quite easily;

class PdfObject(object):                   

class NullObject(PdfObject):

class BooleanObject(PdfObject):

class FloatObject(decimal.Decimal, PdfObject):

class NumberObject(int, PdfObject):

class ByteStringObject(BytesType, PdfObject):

class TextStringObject(StringType, PdfObject):

class NameObject(str, PdfObject):

def encode_pdf_doc_encoding(unicode_str):

def decode_pdf_doc_encoding(byte_array):

Generic.py keeps advance types:

def read_object(stream, pdf):

class ArrayObject(list, PdfObject):

class IndirectObject(PdfObject):

def read_hexstring_from_stream(stream):

def read_string_from_stream(stream):

class DictionaryObject(dict, PdfObject):

class TreeObject(DictionaryObject):

class StreamObject(DictionaryObject):

class EncodedStreamObject(StreamObject):

class DecodedStreamObject(StreamObject):

class ContentStream(DecodedStreamObject):

class ObjectStream(EncodedStreamObject):

class DocumentInformation(DictionaryObject):

class RectangleObject(ArrayObject):

def get_rectangle(self, name, defaults):
def set_rectangle(self, name, value):
def delete_rectangle(self, name):

def _create_rectangle_accessor(name, fallback):

class PdfBaseDocument(object):                           

I've create PdfBaseDocument as an abstract parent to PdfCommon that provides all functions common between PdfWriter and PdfReader ?
Your opinion ?

@MartinThoma
Copy link
Member Author

Sounds good!

Instead of generic1 I would probably use _base or similar.

@MartinThoma MartinThoma merged commit 2c53813 into main Jun 19, 2022
@MartinThoma MartinThoma deleted the codecs branch June 19, 2022 08:11
MartinThoma added a commit that referenced this pull request Jun 19, 2022
The highlight of this release is improved support for file encryption
(AES-128 and AES-256, R5 only). See #749 for the amazing work of
@exiledkingcc 🎊 Thank you 🤗

Deprecations (DEP):
-  Rename names to be PEP8-compliant (#967)
  - `PdfWriter.get_page`: the pageNumber parameter is renamed to page_number
  - `PyPDF2.filters`:
    * For all classes, a parameter rename: decodeParms ➔ decode_parms
    * decodeStreamData ➔ decode_stream_data
  - `PyPDF2.xmp`:
    * XmpInformation.rdfRoot ➔ XmpInformation.rdf_root
    * XmpInformation.xmp_createDate ➔ XmpInformation.xmp_create_date
    * XmpInformation.xmp_creatorTool ➔ XmpInformation.xmp_creator_tool
    * XmpInformation.xmp_metadataDate ➔ XmpInformation.xmp_metadata_date
    * XmpInformation.xmp_modifyDate ➔ XmpInformation.xmp_modify_date
    * XmpInformation.xmpMetadata ➔ XmpInformation.xmp_metadata
    * XmpInformation.xmpmm_documentId ➔ XmpInformation.xmpmm_document_id
    * XmpInformation.xmpmm_instanceId ➔ XmpInformation.xmpmm_instance_id
  - `PyPDF2.generic`:
    * readHexStringFromStream ➔ read_hex_string_from_stream
    * initializeFromDictionary ➔ initialize_from_dictionary
    * createStringObject ➔ create_string_object
    * TreeObject.hasChildren ➔ TreeObject.has_children
    * TreeObject.emptyTree ➔ TreeObject.empty_tree

New Features (ENH):
-  Add decrypt support for V5 and AES-128, AES-256 (R5 only) (#749)

Robustness (ROB):
-  Fix corrupted (wrongly) linear PDF (#1008)

Maintenance (MAINT):
-  Move PDF_Samples folder into ressources
-  Fix typos (#1007)

Testing (TST):
-  Improve encryption/decryption test (#1009)
-  Add merger test cases with real PDFs (#1006)
-  Add mutmut config

Code Style (STY):
-  Put pure data mappings in separate files (#1005)
-  Make encryption module private, apply pre-commit (#1010)

Full Changelog: 2.2.1...2.3.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants