Add "Batch Manifest" to the CVR #30
Hi @raylutz, can you describe how large the CVRs can get before processing them poses a problem?
Hi John:
You ask how large a CVR can be before processing becomes a problem.
This depends on the methods used to process it and the capacity of the
machine, so there is no hard answer.
Some cloud-based machines are limited to 250 MB of total memory. Also,
such machines can't read JSON incrementally: you either read the entire
file or nothing, unless you know exactly which byte offsets hold the
content you need. JSON has no directory, so it is not possible to pull
an arbitrary section out of the middle.
Most of the files in the CVR are small, simple list-of-dict structures.
Those are not a problem.
The only troublesome one is the CvrExport file, which is either one big
file or is split into CvrExport_n.json files, where n is an integer
(0, 1, 2, etc.).
Dominion now creates CVR output similar to the NIST standard.
50K ballots result in a file which is 6.66 MB zipped. Compressing
with ZIP is common practice, but the standard does not specify an
archive format or compression.
Unzipped, it is 255 MB. A JSON file of that size will cause some free
JSON readers to choke, and it cannot be read by limited-memory cloud
machines like AWS Lambda.
Dominion offers two output options: 1) one big file, or 2) batches,
one batch per JSON file.
We specify that we want the output from Dominion to be exported as batches.
If you zip the chunks, the total is only slightly larger, at 7.59 MB.
But the nice thing about the ZIP format (as opposed to, say, a
compressed tar) is that it does not compress the archive as a whole; it
compresses each file individually and stores them in the archive. Thus
it is possible to pull out individual JSON chunks, each only maybe
about 300 KB, because each one is a separate file and the ZIP directory
provides its location. Basically you just read that one file from the
zip.
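Because the ZIP central directory records where each member lives, a single chunk can be pulled out without decompressing the rest of the archive. Here is a minimal Python sketch of that idea (the function name and the example member name are illustrative, not from any tool mentioned here):

```python
import json
import zipfile

def read_chunk(zip_path: str, member_name: str) -> dict:
    """Read one CvrExport chunk out of a zipped CVR export.

    ZIP stores each member separately, and its central directory
    records the member's location, so only the ~300 KB chunk is
    decompressed -- never the entire multi-GB export.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return json.loads(archive.read(member_name))
```

With chunks named CvrExport_0.json, CvrExport_1.json, and so on, a caller fetches exactly the batch it needs, which is what makes chunked exports workable on memory-limited machines such as AWS Lambda.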
The use of chunks can scale to an election of any size without breaking
memory. To give you an idea, the entire CVR for Arizona (2.1 M ballot
sheets) is 1.9 GB zipped.
If you unzip it all at once, it is 33 GB, in 10,344 chunks.
--Ray
--
-------
Ray Lutz
Citizens' Oversight Projects (COPs)
http://www.citizensoversight.org
619-820-5321
So if I am understanding this correctly, this proposal seeks to provide random access through a set of NIST CVRs, so that a particular CVR or set of CVRs can be easily retrieved? A secondary goal is to provide a structured means to hold the hashes of each CVR file. Are there any other use cases of this proposal?
This is intended mainly as a means to bundle up a large set of
CvrExport.json files, perhaps many thousands of them, and lock them into
a package, so that when we receive the package we can identify all of
its components. There is an additional issue we need to address: there
may be several versions of the CVR, released at different times. This is
partially different from the snapshot concept used for adjudications.
It is common, for example, for there to be several stages of release,
such as election night, semi-final unofficial, and final. I don't think
the other manifest files need to change at all, but a new stage would
change the set of CvrExport.json files and the batch manifest. In such a
staged release, only the set of batches will likely change, not the
individual batches. So, for example, say the election-night report has
batches 0001 to 0100. When the VBM ballots are fully processed, it has
batches 0001 to 0200. And when provisionals are added to the set, a few
more are added. Each batch is still the same, and would have the same
hash, but the set includes more batches. Thus the batch manifest should
have a version, or perhaps better, a stage indicator.
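That staged-release invariant (a later stage only adds batches; every batch already released keeps its hash) is easy to check mechanically. A hedged sketch, assuming manifest rows carry hypothetical "batchid" and "hash" fields that the proposal has not actually fixed:

```python
def changed_batches(earlier, later):
    """Return batchids whose hash differs (or vanished) between stages.

    In a well-behaved staged release (election night -> semi-final
    unofficial -> final), this list should be empty: the later manifest
    may add batches, but every batch already released must be identical.
    """
    later_by_id = {row["batchid"]: row["hash"] for row in later}
    return [row["batchid"] for row in earlier
            if later_by_id.get(row["batchid"]) != row["hash"]]
```

An empty result means the new stage is a pure superset of the old one; any batchid returned signals that a locked-down batch was altered.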
Due to the need to maintain a locked-down cvr set as they are processed,
the snapshots should not be added inside the cvr (unless the
adjudication is done in real time as they are scanned), but should be
added as a separate file. This is different from the way the CVR has
been conceptualized and implemented, because it is viewed that the
snapshots would exist in the same file, and it would be modified to
include the adjudication information as a separate "Version". I
commented on this when the CVR was being designed but my comments were
not fully embraced. Each one of the CvrExport.json versions should be
"immutable" in that as you process the canvass, the data from the prior
stages should remain unchanged, but you can add additional information
only as new files. The snapshot design can likely still be embraced if
we add the snapshot information when it is actually provided, which may
be after the first stages have already been completed. When a separate
adjudication phase happens after the fact, it should not alter the CVR
files already produced and locked down with a hash code; instead, the
adjudicated data is added later in a separate CvrExport file. If
adjudication is done in real time, as I think is the case for Dominion,
then the "Modified" snapshot could be included in the original CvrExport
file adjacent to the "Original" version.
If adjudication is done later, then it is best to not go back and alter
the CvrExport file because that would change the hash code and then the
batch manifest.
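One way to lock down a chunk set is to record a digest for each chunk in the manifest. A Python sketch follows; the CvrExport_*.json naming pattern comes from the Dominion output discussed above, while the choice of SHA-256 is my assumption, since the proposal does not fix a hash algorithm:

```python
import hashlib
import zipfile

def hash_chunks(zip_path: str) -> dict:
    """Map each CvrExport_*.json member of an archive to its SHA-256.

    Because adjudication arrives as *new* files rather than edits to
    existing chunks, these digests stay stable from stage to stage,
    and any altered chunk is immediately detectable.
    """
    digests = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.startswith("CvrExport_") and name.endswith(".json"):
                digests[name] = hashlib.sha256(
                    archive.read(name)).hexdigest()
    return digests
```

Comparing the output against the digests published in a batch manifest detects both altered chunks and missing ones.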
Thus it is essential to be able to find the changed record and to know
that the record has been changed, because the changed record may not be
embedded in the original CVR file if we are to respect the requirement
of immutability.
In terms of random access: yes, it is also needed with very large data
sets that will not easily fit in memory, because the ballot images are
not usually organized the same way as the CVRs; they are combined into
ZIP files. So if we process a set of ballot images and generate an
independent tabulation from those images, the results will be in the
natural order in which we found the images within each ZIP archive, and
then in the order of the archives processed. To compare the two
tabulations, it is then necessary either to reorder one of them or to
have random access.
But the manifest needed to facilitate random access is something we can
generate ourselves by scanning the chunks, so random access is not the
primary driver of this proposal.
--Ray
If different snapshots of the same CVR are to be included in separate CVR files, how do you determine which one to process?
Hi John: --Ray
Organization Name: Citizens Oversight
Organization Type: 2 (Nonprofit, developer of AuditEngine)
Document (e.g., CastVoteRecords): CastVoteRecords
Reference (Include section and paragraph number): General
Comment (Include rationale for comment):
Currently, the CVR commonly comprises chunks of data representing the output of a tabulator, such as a tabulator in a polling place. There may be a vast number of these chunks, perhaps over 10,000, where each might describe 200 ballots. Another option is to combine all the chunks into a single file, which can grow very large and become unmanageable. Maintaining the data as chunks is therefore a viable approach to allow the standard to scale to any number of ballots without producing files that become unwieldy. Since this approach works at all scales, using smaller chunks, representing batches, should be recommended.
Also, it should be recommended that election operations organize physical ballots in the same batches.
There is nothing in the standard at this time to describe these chunks and to make sure that all the chunks are included and unaltered. Indeed, in general, there is very little "self description" in the standard, and the result is a set of files, say encoded as JSON, without any master description of those files. Most of the files that result from implementing the standard are a simple "dict" (a collection of name/value pairs) or a "list of dict", which can be viewed as a table of rows, where the JSON or XML allows sparse data population in those rows. The number of rows is not further defined in the standard. In other words, the number of rows of data, i.e. the number of ballots in a given chunk, is neither limited nor described anywhere, and the number of chunks is likewise neither limited nor described.
Therefore, we propose a Batch Manifest to accomplish both of the above goals for the primary data of the CVR, which is the list of ballots in batches.
In terms of structure, we propose that the batch manifest be a simple list-of-dict structure, that can be easily mapped to a table with rows and columns. Such a batch manifest can be generated by the exporting function, and does not alter the meaning of any fields in the CVR definition.
Typically, a single chunk of JSON describing a batch will include only one batchid internally.
It may be advisable to allow the cvr_basename to contain the batchid, such as "CvrExport_{batchid}.json"
We suggest that additional named fields be allowed to be defined by implementations for internal use, to be ignored if they are not used by readers of the format.
It may be necessary to have yet another table to provide a cross reference between the locationid and batchid, if not included in the table.
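Putting the pieces of the proposal together, a batch-manifest row might look like the following. Every field name here is an illustrative assumption, not part of the standard; the proposal deliberately leaves the exact schema open:

```python
import json

# Hypothetical manifest rows; "sha256", "ballot_count", etc. are
# placeholder field names chosen for illustration only, and "<digest>"
# stands in for a real hash value.
batch_manifest = [
    {"batchid": "0001", "cvr_basename": "CvrExport_0001.json",
     "locationid": "loc-12", "ballot_count": 200, "sha256": "<digest>"},
    {"batchid": "0002", "cvr_basename": "CvrExport_0002.json",
     "locationid": "loc-12", "ballot_count": 187, "sha256": "<digest>"},
]

# A list-of-dict maps directly onto a table: one row per batch, one
# column per field. Readers simply ignore extra implementation-defined
# fields, as the proposal suggests.
print(json.dumps(batch_manifest, indent=2))
```

Note how the cvr_basename embeds the batchid, per the naming suggestion above, and how the locationid column doubles as the batchid-to-location cross reference when it is carried in the same table.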
Suggested Change:
See above
Organization Type: 1 = Federal, 2 = Industry, 3 = Academia, 4 = Self, 5 = Other