Add "Batch Manifest" to the CVR #30

Open
raylutz opened this issue Jun 22, 2021 · 6 comments

Comments

@raylutz

raylutz commented Jun 22, 2021

Organization Name: Citizens Oversight

Organization Type: 2 (Nonprofit, developer of AuditEngine)

Document (e.g., CastVoteRecords): CastVoteRecords

Reference (Include section and paragraph number): General

Comment (Include rationale for comment):

Currently, a CVR export is commonly delivered as chunks of data, each representing the output of a tabulator, such as a tabulator in a polling place. There may be a vast number of these chunks, perhaps over 10,000, each describing roughly 200 ballots. The alternative is to combine all the chunks into a single file, which can grow very large and become unmanageable. Keeping the data in chunks is therefore a viable way to let the standard scale to any number of ballots without producing unwieldy files. Since this approach works at all scales, the standard should recommend smaller chunks that represent batches.

Also, it should be recommended that election operations organize physical ballots in the same batches.

Nothing in the standard at this time describes these chunks or makes sure that all the chunks are included and unaltered. Indeed, there is very little "self description" in the standard in general, and the result is a set of files, say encoded as JSON, without any master description of those files. Most of the files that result from implementing the standard are a simple "dict" (list of name and value pairs) or a "list of dict", which can be viewed as a table of rows, where JSON or XML allows sparse population of those rows. The standard does not further define the number of rows. In other words, the number of ballots in a given chunk is neither limited nor described anywhere, and the same is true of the number of chunks.

Therefore, we propose a Batch Manifest to accomplish both of the above goals for the primary data of the CVR, which is the list of ballots in batches.

In terms of structure, we propose that the batch manifest be a simple list-of-dict structure that maps easily to a table with rows and columns. The exporting function can generate such a batch manifest, and it does not alter the meaning of any fields in the CVR definition.

| field | type | description |
| --- | --- | --- |
| batchid | str | typically a combination of tabulator and batch, to create a unique batchid across all batches |
| tabulator | int | TabulatorId |
| batch | int | batch number produced by that tabulator, starts at 1 for each tabulator, not unique across tabulators |
| locationid | str | (optional) physical box or location of the physical ballots (if different from batchid) |
| first_ballot_idx | int | The first ballot_idx in the batch, where the ballot_idx is nominally an integer that starts at 1 |
| count | int | The number of ballots in the batch |
| cvr_basename | str | The filename, without path, of the file containing the batch, like "CvrExport_\d+.json" |
| batch_hash_digest | str | String describing the secure hash digest of the file, like "SHA256=[0-9a-f]{64}" |

Typically, a single chunk of JSON describing a batch will include only one batchid internally.

It may be advisable to allow the cvr_basename to contain the batchid, such as "CvrExport_{batchid}.json"
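As a rough illustration of how an exporting function might produce one manifest row per chunk file, here is a minimal Python sketch. The helper names `file_digest` and `manifest_row`, and the `"{tabulator}_{batch}"` form of the batchid, are assumptions made for this example and are not part of the proposal or the standard. The optional locationid field is omitted.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the file's digest in the 'SHA256=<64 hex chars>' form."""
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in blocks so large chunk files do not need to fit in memory.
        for block in iter(lambda: fh.read(65536), b""):
            sha.update(block)
    return f"SHA256={sha.hexdigest()}"


def manifest_row(path: Path, tabulator: int, batch: int,
                 first_ballot_idx: int, count: int) -> dict:
    """Build one batch manifest row for a CVR chunk file that already exists."""
    return {
        # Assumed batchid format: tabulator and batch joined by an underscore.
        "batchid": f"{tabulator}_{batch}",
        "tabulator": tabulator,
        "batch": batch,
        "first_ballot_idx": first_ballot_idx,
        "count": count,
        "cvr_basename": path.name,  # filename without path
        "batch_hash_digest": file_digest(path),
    }
```

The full manifest would then be a list of such rows, one per chunk, which serializes directly to JSON or to a table.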

We suggest that additional named fields be allowed to be defined by implementations for internal use, to be ignored if they are not used by readers of the format.
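A reader of the manifest could use it to check that every chunk is present and unaltered while ignoring any implementation-specific fields. The following Python sketch shows one way this might look. The function name `verify_manifest`, the choice of required fields, and the wording of the problem messages are assumptions for illustration only.

```python
import hashlib
from pathlib import Path

# Fields this hypothetical reader relies on; any other fields are ignored.
REQUIRED = ("batchid", "count", "cvr_basename", "batch_hash_digest")


def sha256_digest(path: Path) -> str:
    """Digest of a file in the 'SHA256=<hex>' form."""
    return "SHA256=" + hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(rows, cvr_dir):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    seen = set()
    for row in rows:
        missing = [k for k in REQUIRED if k not in row]
        if missing:
            problems.append(f"row missing fields: {missing}")
            continue
        batchid = row["batchid"]
        if batchid in seen:
            problems.append(f"{batchid}: duplicate batchid")
        seen.add(batchid)
        path = Path(cvr_dir) / row["cvr_basename"]
        if not path.is_file():
            problems.append(f"{batchid}: file not found")
        elif sha256_digest(path) != row["batch_hash_digest"]:
            problems.append(f"{batchid}: digest mismatch")
    return problems
```

Note that the manifest itself would also need protection (for example, a published hash of the manifest file) for this check to detect tampering, since anyone able to alter a chunk could otherwise alter its manifest row too.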

It may be necessary to have yet another table to provide a cross reference between the locationid and batchid, if not included in the table.

Suggested Change:
See above

Organization Type: 1 = Federal, 2 = Industry, 3 = Academia, 4 = Self, 5 = Other

@JDziurlaj
Collaborator

Hi @raylutz, can you describe how large the CVRs can get before processing them poses a problem?

@raylutz
Author

raylutz commented Jun 25, 2021 via email

@JDziurlaj
Collaborator

So if I am understanding this correctly, this proposal seeks to provide random access into a set of NIST CVRs, so that a particular CVR or set of CVRs can be easily retrieved? A secondary goal is to provide a structured means to hold the hashes of each CVR file. Are there any other use cases for this proposal?

@raylutz
Author

raylutz commented Jul 13, 2021 via email

@JDziurlaj
Collaborator

If different snapshots of the same CVR are to be included in separate CVR files, how do you determine which one to process?

@raylutz
Author

raylutz commented Sep 11, 2021

Hi John:
It looks like I'm frequently missing these, so sorry for the delay.
To respect the requirement of immutable data, it is not allowed to go back and revise a status value. Thus, the records have to be self-describing. Right now, you have 'Original' and 'Modified' status. I imagine there might be several 'Modified' entries if several versions have been submitted, say through several rounds of adjudication.
It makes sense to complete the initial CVR and lock it down, then process adjudications. The adjudicator app will be able to read the CVR and all intervening adjudication files. It can then see the 'Original', and I suggest 'Modified' for the first revision, then 'Modified-2', 'Modified-3', etc. for subsequent snapshots if there are more than two. In general, I see this done in other standards by having a count, so instead of "Original" they just use 0, then 1, etc. To maintain compatibility with any prior use (although there is not much of an installed base), we can stay with Original and Modified, and then use Modified-2, Modified-3, etc., so that Original is 0, Modified is 1, etc. And if we want to be aggressive about this, we would make the numeric designation preferred and deprecate the words.
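The mapping from status words to a numeric order can be sketched in a few lines of Python. The function names `snapshot_index` and `latest_snapshot` are hypothetical, and picking the highest-numbered snapshot as the one to process is an assumption of this sketch, not something the standard specifies.

```python
import re

# Matches the suggested 'Modified-2', 'Modified-3', ... status values.
_MODIFIED_N = re.compile(r"Modified-(\d+)")


def snapshot_index(status: str) -> int:
    """Map a status word to its order: Original=0, Modified=1, Modified-N=N."""
    if status == "Original":
        return 0
    if status == "Modified":
        return 1
    match = _MODIFIED_N.fullmatch(status)
    if match and int(match.group(1)) >= 2:
        return int(match.group(1))
    raise ValueError(f"unrecognized snapshot status: {status!r}")


def latest_snapshot(statuses):
    """Return the status with the highest order among those present."""
    return max(statuses, key=snapshot_index)
```

With this ordering, earlier snapshots never need to be rewritten: a reader collects every snapshot for a CVR and sorts them by index.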

--Ray
