Settling the Project File(s) Format (Discussion) #259

vkbo · 2020-05-29T08:21:33Z

I'm in the process of settling the way novelWriter projects are saved to disk.

Background

The main project file, which holds the structure of the project tree view, is an XML file. It has a root element called novelWriterXML and a version number, currently 1.0, so that I can check that we at least seemingly (unless someone alters it) have a valid file.

The documents themselves are saved with their hexadecimal handle, in folders named after the first character of the handle: data_0, data_1, ..., data_f. They were split up because the original file structure, which I ported over from a different iteration of novelWriter, had multiple files per document, and I didn't want the folders to grow too messy.

There's also a meta folder, which holds logs, a record of certain GUI settings like column widths and filter options, and the cached index file.

Currently Planned Changes (Version 0.7)

There will now only be one file for each document. They will have the hexadecimal handle (13 characters) and the file extension .nwd. They will all be stored in a folder named content. The meta folder remains as-is. The project XML file will have its version bumped to 1.1, marking the change in document path.
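
As a rough illustration of the two layouts, a handle could be mapped to its location as sketched below. The function names are hypothetical and not taken from novelWriter, and lowercasing the handle is an assumption based on the data_f folder name.

```python
from pathlib import Path

HEX_DIGITS = "0123456789abcdef"


def check_handle(handle: str) -> str:
    """Validate a 13-character hexadecimal handle (lowercase is assumed)."""
    handle = handle.lower()
    if len(handle) != 13 or any(c not in HEX_DIGITS for c in handle):
        raise ValueError(f"not a 13-character hexadecimal handle: {handle!r}")
    return handle


def legacy_folder(project_root: Path, handle: str) -> Path:
    """Old layout: the folder is named after the handle's first character."""
    return project_root / f"data_{check_handle(handle)[0]}"


def document_path(project_root: Path, handle: str) -> Path:
    """Planned layout: one .nwd file per document in the content folder."""
    return project_root / "content" / f"{check_handle(handle)}.nwd"
```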

When opening a project, novelWriter will move all .nwd files in data_N folders to the content folder. Backup .bak files, which novelWriter stopped making a couple of releases ago, will be deleted. Any other files in the data_N folders will be moved to the root of the project, and the folders then removed.
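
The migration described above could look roughly like this. It is a minimal sketch of the three rules (move .nwd, delete .bak, move everything else to the project root), not the actual novelWriter code; the function name is made up, and name collisions in the destination are not handled.

```python
import shutil
from pathlib import Path

HEX_DIGITS = "0123456789abcdef"


def migrate_data_folders(project_root: Path) -> None:
    """Collapse data_0 .. data_f into a single content folder."""
    content = project_root / "content"
    content.mkdir(exist_ok=True)
    for digit in HEX_DIGITS:
        folder = project_root / f"data_{digit}"
        if not folder.is_dir():
            continue
        for item in list(folder.iterdir()):
            if item.is_file() and item.suffix == ".nwd":
                # Documents go to the new content folder.
                shutil.move(str(item), str(content / item.name))
            elif item.is_file() and item.suffix == ".bak":
                # Old backup files are simply dropped.
                item.unlink()
            else:
                # Anything else is preserved in the project root.
                shutil.move(str(item), str(project_root / item.name))
        folder.rmdir()  # The folder is empty at this point.
```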

Possible Next Step (Discussion)

I'm considering adding the option to save the entire project as a single file. It will then be a file format inspired by the Open Document format, docx, epub, etc. These formats all have in common that the file is really a zip file with the actual data files in a folder structure inside.

There are a few issues with this. One is that the zip archive format doesn't really allow replacing content; it works by appending data. The cleanest approach is then to create a new zip file for each save, and otherwise keep the entire project in memory.

Since a novelWriter project will never be much more than a few megabytes, this is really a non-issue from that point of view, but in terms of data loss prevention in the event of a crash, the risk is larger than with the current scheme which is quite robust.
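
One way to reduce that crash risk is to write each new archive to a temporary file and only swap it into place once it is complete. The sketch below assumes the whole project is held in memory as a mapping from archive member name to text; the function name and that mapping are illustrative only.

```python
import os
import tempfile
import zipfile
from pathlib import Path
from typing import Mapping


def save_project_archive(documents: Mapping[str, str], target: Path) -> None:
    """Write all documents to a fresh zip file, then swap it into place."""
    # The temp file lives next to the target so the final rename stays
    # on the same file system.
    fd, tmp_name = tempfile.mkstemp(suffix=".tmp", dir=str(target.parent))
    try:
        with os.fdopen(fd, "wb") as handle:
            with zipfile.ZipFile(handle, "w", zipfile.ZIP_DEFLATED) as archive:
                for name, text in documents.items():
                    archive.writestr(name, text)
            handle.flush()
            os.fsync(handle.fileno())
        # The previous archive is only replaced after the new one is complete.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies mid-save, the previous archive is still intact; at worst a stray .tmp file is left behind.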

The other option is of course to use a file database format for storing the project, like SQLite or Berkeley DB, or something else. These are better suited to continuous IO, and already have stable APIs.


johnblommers · 2020-05-29T18:40:23Z

Regarding possible next steps, I would prefer a project structure that I can navigate in the file manager or on the command line. Even zipping the contents takes that away from me. Zipping a project backup is fine of course.

As it now stands the folder names in the project file bear no relationship to their contents. It's my custom to place individual chapters in their own Markdown file and preferably to place individual scenes in their own files. Each chapter file bears the name of its H2 heading and each scene file bears the name of its H3 heading. But at present novelWriter's file names are meaningless to the writer. So I'd like to see a file structure like I just described, rather than further complicating it with zips and databases. IIRC the old GitBook project used a file like books.txt to order its files.

I agree with you that the space-saving feature of zip files has no benefit these days given how plentiful and cheap storage has become. Even a 200,000-page text file requires only about 500 MB of storage space.

Remember, a huge point about writing in Markdown is that it's plain text. I can write scripts to automate some text processing right there in the project folder. Zip files and database files take that away and complicate life. I think it's fine to have a file, call it books.txt, that lists the order of the files in the manuscript. And novelWriter needs a file to store various metadata, so an XML file for that makes sense. All of this is human readable. ✓

I believe that the data structure of a writing project should be as transparent and as open as possible. It should be obvious just by looking at the project files and folders what's in them.

That said, I remember reading somewhere about a suggested standard for zipping up a folder of Markdown files. That actually made sense. But I can't find any reference to it anymore. Perhaps that speaks volumes.

vkbo · 2020-05-29T18:56:12Z

I would not make the zipped project file mandatory. Myself, I'd never use it as I like to have my project version controlled, but I've noticed that most similar tools at least offer the option. Some users may prefer it to the current option.

Having spent some time researching various such methods though, I do think I will leave it for now. At least I want to wait until someone requests it.

As for the file and folder structure, I have no plans to mirror the project structure on drive. I've explained in the documentation why, so I won't repeat it here, but I do see the point in having a way to identify the files. Currently, you have to open them, but if you use a text viewer with a folder browser, like Geany, VSCode, Atom and probably numerous others, flipping through them isn't an issue.

Having them all in one folder definitely makes that easier at least. But another reason I want to reduce 16 data folders to one is that I have some file versioning scheme in mind for the future, and having to deal with a less complex folder structure will make that simpler.

In any case, you make a good argument for sticking to the current model, and I do intend to add a file that mirrors the project tree with each file name listed next to its entry, so you can identify them. I can even add a JSON ToC that can be imported in scripts.

Maybe I'll add a ToC.txt and ToC.json in the root project folder.
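
To make the idea concrete, here is a sketch of how the two files might be produced. The line format, the JSON keys, and the function name are all assumptions for illustration; nothing here reflects a settled format.

```python
import json
from pathlib import Path
from typing import List, Tuple


def write_toc(project_root: Path, entries: List[Tuple[str, int, str]]) -> None:
    """Write ToC.txt and ToC.json from (handle, tree depth, title) entries.

    The entries are expected in project-tree order.
    """
    # Human-readable version: file path first, then the indented title.
    lines = [
        f"content/{handle}.nwd  {'  ' * depth}{title}"
        for handle, depth, title in entries
    ]
    (project_root / "ToC.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Script-friendly version with the same information.
    records = [
        {"handle": handle, "depth": depth, "title": title,
         "path": f"content/{handle}.nwd"}
        for handle, depth, title in entries
    ]
    (project_root / "ToC.json").write_text(
        json.dumps(records, indent=2), encoding="utf-8"
    )
```

A script could then load ToC.json with json.load and walk the documents in manuscript order without opening the project XML.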

vkbo · 2020-05-29T19:07:37Z

In any case, I'm reverting some of the changes I've made in the dev branch and will instead write an abstraction class for the project folder, similar to how Qt handles resources. It will keep all that logic tucked into a single class, so I can make it a bit more flexible. It also makes it possible to add new storage APIs at a later time if needed.
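
A sketch of what such an abstraction might look like follows. The class and method names are hypothetical and do not come from the novelWriter source; the point is only that the rest of the code talks to one interface, and a zip or database backend could later implement the same methods.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ProjectStorage(ABC):
    """Interface the project code would use instead of touching files."""

    @abstractmethod
    def read_document(self, handle: str) -> str:
        """Return the text of the document with the given handle."""

    @abstractmethod
    def write_document(self, handle: str, text: str) -> None:
        """Store the text of the document with the given handle."""


class FolderStorage(ProjectStorage):
    """Backend for the plain folder layout: content/<handle>.nwd."""

    def __init__(self, project_root: Path) -> None:
        self._content = project_root / "content"
        self._content.mkdir(parents=True, exist_ok=True)

    def _path(self, handle: str) -> Path:
        return self._content / f"{handle}.nwd"

    def read_document(self, handle: str) -> str:
        return self._path(handle).read_text(encoding="utf-8")

    def write_document(self, handle: str, text: str) -> None:
        self._path(handle).write_text(text, encoding="utf-8")
```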

johnblommers · 2020-05-29T19:23:10Z

I'll cast a vote for ToC.txt.

Plus you make an excellent point:

if you use a text viewer with a folder browser, like Geany, VSCode, Atom and probably numerous others, flipping through them isn't an issue.

vkbo · 2020-05-29T22:53:40Z

I've merged the dev branch into master now. Opening a project from the master branch will convert the project to "storage format 1.1" which means it can no longer be opened by any of the earlier versions of novelWriter, so be careful with trying it on an actual project, or at least copy it first.

That said, I've converted all of my own projects just fine, so it seems to work ok. But I want to get a weekend of working with novelWriter on my projects before I consider releasing version 0.7.

In any case, that more or less settles the current discussion of file storage. I've moved to the new, simpler layout, and will not be adding a zipped project at this time.

Oh, and when the project is closed, in the version currently in master, it should write the two ToC files.

johnblommers · 2020-05-30T19:45:15Z

My project converted fine and sure enough there are now two nice TOC files.

vkbo · 2020-06-05T17:25:01Z

Closing this now, as it has been released in 0.7.

vkbo · 2020-11-03T11:26:31Z

I'm reopening this issue because I am still not entirely satisfied with the way project files are currently stored. Especially after the latest comments in #383.

Options

Option 1: Database File

Use an SQLite database (or a similar simple database format) to store the content in a single file. The benefit is that you can easily pipe data back and forth live, preserving the benefit that a crash will likely not corrupt the project. In addition, it makes it very easy to build an internal versioning system.
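
A minimal sketch of this option, using Python's built-in sqlite3 module. The table layout and function names are assumptions for illustration. Each save inserts a new row rather than overwriting, which is what makes the versioning nearly free.

```python
import sqlite3
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the project database and its single table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_versions ("
        " handle TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " saved_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (handle, version))"
    )
    return conn


def save_document(conn: sqlite3.Connection, handle: str, body: str) -> int:
    """Store a new version of a document and return its version number."""
    with conn:  # Commits on success, rolls back if an exception is raised.
        row = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM document_versions"
            " WHERE handle = ?",
            (handle,),
        ).fetchone()
        version = row[0] + 1
        conn.execute(
            "INSERT INTO document_versions (handle, version, body)"
            " VALUES (?, ?, ?)",
            (handle, version, body),
        )
    return version


def load_document(
    conn: sqlite3.Connection, handle: str, version: Optional[int] = None
) -> str:
    """Return the latest version of a document, or a specific earlier one."""
    if version is None:
        query = ("SELECT body FROM document_versions WHERE handle = ?"
                 " ORDER BY version DESC LIMIT 1")
        params = (handle,)
    else:
        query = ("SELECT body FROM document_versions"
                 " WHERE handle = ? AND version = ?")
        params = (handle, version)
    row = conn.execute(query, params).fetchone()
    if row is None:
        raise KeyError(handle)
    return row[0]
```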

Option 2: Zip File

Use a simple zip archive that is extracted when the project is opened, and re-zipped when it is closed. The project structure remains exactly the same as it is today, and the zip stage can be entirely optional, preserving the current functionality along with it.

This option is trivial to implement as all that is needed is an additional step in the open and save functions that unzips/zips the project if the associated option is enabled. The project is just extracted to a temp folder of the same name as the project (possibly a hidden folder) and the folder deleted after a successful zipping afterwards.

The zipped archive needs a dedicated file extension so that it can be opened by double-clicking the file, so .nwz may be appropriate.
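
The extra open and close steps could be sketched as follows. The function names are made up, the hidden working folder is named by prefixing a dot to the project name (one possible reading of "hidden folder"), and recovery from a working folder left behind by a crash is not handled. Empty folders are not preserved by this simple version.

```python
import shutil
import zipfile
from pathlib import Path


def open_zipped_project(archive: Path) -> Path:
    """Extract a .nwz archive into a hidden working folder next to it."""
    work_dir = archive.parent / f".{archive.stem}"
    work_dir.mkdir()  # Raises FileExistsError if a previous session remains.
    with zipfile.ZipFile(archive) as source:
        source.extractall(work_dir)
    return work_dir


def close_zipped_project(work_dir: Path, archive: Path) -> None:
    """Re-zip the working folder, then delete it once zipping succeeded."""
    tmp = archive.with_name(archive.name + ".tmp")
    with zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as target:
        for path in sorted(work_dir.rglob("*")):
            if path.is_file():
                target.write(path, path.relative_to(work_dir).as_posix())
    # Only replace the old archive and remove the folder after a full write.
    tmp.replace(archive)
    shutil.rmtree(work_dir)
```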

Additional Reasons

There are a few reasons why I want to revisit this that are only indirectly related to the issue of single- or multi-file project structure.

Firstly, I am considering moving the project index into a database instead of an in-memory index cached in a JSON file between sessions. Accessing the database index will be marginally slower, but not much data is extracted on a regular request anyway. An index rebuild will definitely be slower, but also that is a trivial amount from a user's point-of-view. The upside is a smaller memory footprint, and the downside is an index.db file that is not suitable for version control.

Secondly, I want to abstract away the handling of the file storage part of the project and index classes. Currently the file storage handling is entangled into the general project handling making the code a bit more difficult to follow and maintain.

Thirdly, versioning. This is mostly covered in #383, but I'm considering an internal versioning system. The file structure way of doing it is to create a versions folder alongside the content folder, and keep earlier versions of the files in this folder. Each file should have the appropriate meta data to associate it with its currently active version, and the versioning history should be cached in the index for quick lookups. This ties back into the first point above. It is also behind my recent changes to the document meta data format in PR #486.
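
A rough sketch of the folder-based approach: before a document is overwritten, its current file is copied into the versions folder. The timestamped file name and the function name are assumptions, and the meta data linkage and index caching mentioned above are left out.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def save_with_history(project_root: Path, handle: str, text: str) -> None:
    """Save a document, keeping the previous file in the versions folder."""
    current = project_root / "content" / f"{handle}.nwd"
    if current.exists():
        versions = project_root / "versions"
        versions.mkdir(exist_ok=True)
        # Hypothetical naming scheme: <handle>_<UTC timestamp>.nwd
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        shutil.copy2(current, versions / f"{handle}_{stamp}.nwd")
    current.parent.mkdir(parents=True, exist_ok=True)
    current.write_text(text, encoding="utf-8")
```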

vkbo self-assigned this May 29, 2020

vkbo added the enhancement, potential feature, and question labels May 29, 2020

vkbo closed this as completed Jun 5, 2020

vkbo mentioned this issue Oct 21, 2020 in Version Control #383 (open)

vkbo reopened this Nov 3, 2020

vkbo closed this as completed Jan 23, 2021

vkbo mentioned this issue Jan 25, 2022 in Single file format (again) #977 (open, 9 tasks)
