STY: Put pure data mappings in separate files #1005

MartinThoma · 2022-06-18T08:27:10Z

No description provided.

codecov · 2022-06-18T08:31:13Z

Codecov Report

Merging #1005 (e647f78) into main (f62e051) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1005      +/-   ##
==========================================
+ Coverage   89.28%   89.30%   +0.01%     
==========================================
  Files          19       24       +5     
  Lines        4435     4443       +8     
  Branches      921      921              
==========================================
+ Hits         3960     3968       +8     
  Misses        323      323              
  Partials      152      152

Impacted Files	Coverage Δ
PyPDF2/_codecs/adobe_glyphs.py	`100.00% <ø> (ø)`
PyPDF2/_cmap.py	`84.39% <100.00%> (ø)`
PyPDF2/_codecs/__init__.py	`100.00% <100.00%> (ø)`
PyPDF2/_codecs/pdfdoc.py	`100.00% <100.00%> (ø)`
PyPDF2/_codecs/std.py	`100.00% <100.00%> (ø)`
PyPDF2/_codecs/symbol.py	`100.00% <100.00%> (ø)`
PyPDF2/_codecs/zapfding.py	`100.00% <100.00%> (ø)`
PyPDF2/generic.py	`91.25% <100.00%> (-0.29%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f62e051...e647f78. Read the comment docs.

MartinThoma · 2022-06-18T08:41:22Z

@pubpub-zz What do you think about this change?

It might be helpful to have those symbol mappings in one folder for the following reasons:

In PRs, I can notice data-changes more easily
During testing / coverage, I can exclude the data and thus get a more reasonable estimate of how good our test coverage is.
We can make diffs between the encodings and potentially notice issues more easily
If new symbol mappings get added, it's pretty clear where they belong
The generic.py is still huge, but not as crazy as before

I think all of those reasons are rather weak, but I guess this is overall more clean?

MartinThoma · 2022-06-18T08:43:26Z

I was also uncertain if there is a more efficient way to store the data. I was mainly thinking about the speed it takes to load. The storage size is nothing that concerns me at the moment. But I guess reading / parsing a Python file is probably already super efficient.

MartinThoma · 2022-06-18T08:44:41Z

Some functions, e.g. fill_from_encoding might actually fit better in the codecs part than into generic

pubpub-zz · 2022-06-18T08:53:01Z

@pubpub-zz What do you think about this change?

It might be helpful to have those symbol mappings in one folder for the following reasons:
1. In PRs, I can notice data-changes more easily
2. During testing / coverage, I can exclude the data and thus get a more reasonable estimate of how good our test coverage is.
3. We can make diffs between the encodings and potentially notice issues more easily
4. If new symbol mappings get added, it's pretty clear where they belong
5. The generic.py is still huge, but not as crazy as before
I think all of those reasons are rather weak, but I guess this is overall more clean?

I was not originally happy to add symbol,and other encodings. And this looks clearer : I love it! 😀
perhaps you should also transfer _predefined_cmap and _default_fonts_space_width from _cmap.py

pubpub-zz · 2022-06-18T09:44:53Z

In my fork of PyPDF4, I've tried to restructure the code.
For this I've isolated some very simple objects in a generic1 and I was just importing it into generic to prevent changing the rest of the code :
I've isolated that way quite easily;

class PdfObject(object):                   

class NullObject(PdfObject):

class BooleanObject(PdfObject):

class FloatObject(decimal.Decimal, PdfObject):

class NumberObject(int, PdfObject):

class ByteStringObject(BytesType, PdfObject):

class TextStringObject(StringType, PdfObject):

class NameObject(str, PdfObject):

def encode_pdf_doc_encoding(unicode_str):

def decode_pdf_doc_encoding(byte_array):

Generic.py keeps advance types:

def read_object(stream, pdf):

class ArrayObject(list, PdfObject):

class IndirectObject(PdfObject):

def read_hexstring_from_stream(stream):

def read_string_from_stream(stream):

class DictionaryObject(dict, PdfObject):

class TreeObject(DictionaryObject):

class StreamObject(DictionaryObject):

class EncodedStreamObject(StreamObject):

class DecodedStreamObject(StreamObject):

class ContentStream(DecodedStreamObject):

class ObjectStream(EncodedStreamObject):

class DocumentInformation(DictionaryObject):

class RectangleObject(ArrayObject):

def get_rectangle(self, name, defaults):
def set_rectangle(self, name, value):
def delete_rectangle(self, name):

def _create_rectangle_accessor(name, fallback):

class PdfBaseDocument(object):

I've create PdfBaseDocument as an abstract parent to PdfCommon that provides all functions common between PdfWriter and PdfReader ?
Your opinion ?

MartinThoma · 2022-06-18T09:51:54Z

Sounds good!

Instead of generic1 I would probably use _base or similar.

@exiledkingcc

The highlight of this release is improved support for file encryption (AES-128 and AES-256, R5 only). See #749 for the amazing work of @exiledkingcc 🎊 Thank you 🤗 Deprecations (DEP): - Rename names to be PEP8-compliant (#967) - `PdfWriter.get_page`: the pageNumber parameter is renamed to page_number - `PyPDF2.filters`: * For all classes, a parameter rename: decodeParms ➔ decode_parms * decodeStreamData ➔ decode_stream_data - `PyPDF2.xmp`: * XmpInformation.rdfRoot ➔ XmpInformation.rdf_root * XmpInformation.xmp_createDate ➔ XmpInformation.xmp_create_date * XmpInformation.xmp_creatorTool ➔ XmpInformation.xmp_creator_tool * XmpInformation.xmp_metadataDate ➔ XmpInformation.xmp_metadata_date * XmpInformation.xmp_modifyDate ➔ XmpInformation.xmp_modify_date * XmpInformation.xmpMetadata ➔ XmpInformation.xmp_metadata * XmpInformation.xmpmm_documentId ➔ XmpInformation.xmpmm_document_id * XmpInformation.xmpmm_instanceId ➔ XmpInformation.xmpmm_instance_id - `PyPDF2.generic`: * readHexStringFromStream ➔ read_hex_string_from_stream * initializeFromDictionary ➔ initialize_from_dictionary * createStringObject ➔ create_string_object * TreeObject.hasChildren ➔ TreeObject.has_children * TreeObject.emptyTree ➔ TreeObject.empty_tree New Features (ENH): - Add decrypt support for V5 and AES-128, AES-256 (R5 only) (#749) Robustness (ROB): - Fix corrupted (wrongly) linear PDF (#1008) Maintenance (MAINT): - Move PDF_Samples folder into ressources - Fix typos (#1007) Testing (TST): - Improve encryption/decryption test (#1009) - Add merger test cases with real PDFs (#1006) - Add mutmut config Code Style (STY): - Put pure data mappings in separate files (#1005) - Make encryption module private, apply pre-commit (#1010) Full Changelog: 2.2.1...2.3.0

STY: Put pure data mappings in separate files

bd6976a

Move adobe glyphs as well

3946cf3

MartinThoma marked this pull request as ready for review June 18, 2022 08:42

MartinThoma added 4 commits June 19, 2022 09:38

Merge branch 'main' into codecs

cb2b322

Move more to codecs

65f2cf4

Add noqa

d54da16

Make codecs a private module

e647f78

MartinThoma merged commit 2c53813 into main Jun 19, 2022

MartinThoma deleted the codecs branch June 19, 2022 08:11

nyanpasu64 mentioned this pull request Jun 19, 2022

PyPDF2 2.3.0 from PyPi cannot be imported, missing _codecs directory #1011

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STY: Put pure data mappings in separate files #1005

STY: Put pure data mappings in separate files #1005

MartinThoma commented Jun 18, 2022

codecov bot commented Jun 18, 2022 •

edited

Loading

MartinThoma commented Jun 18, 2022

MartinThoma commented Jun 18, 2022

MartinThoma commented Jun 18, 2022

pubpub-zz commented Jun 18, 2022

pubpub-zz commented Jun 18, 2022

MartinThoma commented Jun 18, 2022

STY: Put pure data mappings in separate files #1005

STY: Put pure data mappings in separate files #1005

Conversation

MartinThoma commented Jun 18, 2022

codecov bot commented Jun 18, 2022 • edited Loading

Codecov Report

MartinThoma commented Jun 18, 2022

MartinThoma commented Jun 18, 2022

MartinThoma commented Jun 18, 2022

pubpub-zz commented Jun 18, 2022

pubpub-zz commented Jun 18, 2022

MartinThoma commented Jun 18, 2022

codecov bot commented Jun 18, 2022 •

edited

Loading