[messages] Switch from JSON to ZIP / NDJSON format
This is a major rewrite of the message import / export code that switches the format from a single (standard) JSON file with embedded Base64-encoded MMS binary data to a ZIP file. The ZIP file contains a newline-delimited JSON (NDJSON) file ('messages.ndjson') holding message metadata and text data, and a 'data' directory holding the untouched binary files as stored natively by Android. The new format has a number of advantages, as well as some disadvantages:

Advantages:
-----------

Separating (encoded) binary data from text data and metadata results in much cleaner text, which humans can browse much more comfortably.

The ZIP file format is much more flexible than the monolithic JSON file format. E.g., additional information about the exporting system and app, and statistics about the export run, can easily be included in another file within the ZIP archive without substantially modifying the existing export flow (this is not yet implemented, but likely will be in the future).

Using ZIP files automatically provides compression, although the reduction in file size will depend on how much of the exported data is compressible text (i.e., metadata and text data) as opposed to binary data, which is generally already compressed and cannot be compressed much further.

Not including the binary data in the (ND)JSON eliminates the need to read entire binary files into RAM at one time, resulting in much more efficient RAM usage. This fixes #84, which was the initial impetus for the format change.

NDJSON allows message records to be read one at a time, eliminating the need to use JSON streaming (see #6) and resulting in much simpler and cleaner code.

Disadvantages:
--------------

The ZIP file format adds code complexity.

NDJSON is less common than standard JSON.

NDJSON is less human-readable than the pretty-printed JSON previously used (since NDJSON records cannot contain literal newlines), although this is easily mitigated by running 'jq < messages.ndjson' to pretty-print the NDJSON.

Additional Changes:
-------------------

This commit also prefixes a double underscore to all (ND)JSON attributes added by the app (e.g., '__display_name', '__parts'), in order to clearly indicate that these have been added by the app and are not the names of columns in the Android message database tables.

Bugs:
-----

The current implementation of the new format works, but import performance is unacceptably poor for large message collections. This is apparently a consequence of using the InputStream paradigm (required by Android's Storage Access Framework) to access the ZIP file. An InputStream allows only sequential access, not random access, so accessing each binary data file requires a sequential read from the beginning of the ZIP file. This should be fixed in a subsequent commit.

Closes: #6, #84
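The archive layout described above can be illustrated with a minimal Python sketch. This is not the app's code: the function names and the 'data_file' key linking a record to its binary file are hypothetical, chosen only for illustration. It also assumes the file is named 'messages.ndjson', as in the jq command above. Python's zipfile module needs a seekable file, so the sketch has random access to the archive and does not reproduce the sequential-access slowdown described under Bugs.

```python
import io
import json
import zipfile


def export_messages(messages, attachments):
    """Build an in-memory ZIP holding messages.ndjson and a data/ directory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # One compact JSON object per line. json.dumps escapes newlines
        # inside strings, so every record stays on a single line.
        ndjson = "".join(json.dumps(m) + "\n" for m in messages)
        zf.writestr("messages.ndjson", ndjson)
        # Binary files are stored as-is, with no Base64 encoding.
        for name, payload in attachments.items():
            zf.writestr("data/" + name, payload)
    return buf.getvalue()


def import_messages(archive_bytes):
    """Yield (record, binary payload or None), one message at a time."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        with zf.open("messages.ndjson") as fh:
            for line in io.TextIOWrapper(fh, encoding="utf-8"):
                record = json.loads(line)
                # 'data_file' is a hypothetical key, not an app attribute.
                name = record.get("data_file")
                payload = zf.read("data/" + name) if name else None
                yield record, payload
```

Because each record is parsed from its own line, only one message and at most one binary file are held in memory at a time, which is the RAM benefit the Advantages section describes.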