Ingest pipeline breaks order after last file #63

juagargi · 2024-04-17T14:37:11Z

The ingest command ingests files into the DB.

The processing pipeline is:

1. Open the file in the Processor.
2. Parse the file and send each certificate to the CertificateProcessor.
3. In the CertificateProcessor, create batches of certificates in multiple routines.
4. For each batch, update the DB in an independent routine.
5. The CertificateProcessor waits until all its batches are updated.
6. The Processor waits until the CertificateProcessor is done.
7. The Processor calls OnBundleFinished when the number of certificates reaches the bundle size, or at the end.

The bug appears:

- When no bundle size is defined: OnBundleFinished is called before the CertProcessor is done.
- When a bundle size is defined: the CertProcessor is still running when OnBundleFinished is called.

Because of this bug, the last batches are not ready for the next steps, which run in OnBundleFinished,
so those steps see an incomplete view of the certificate landscape represented by the files.

We need to:

- Just move the code to fix the first case.
- On each bundle, to fix the second case:
  - Close the parsedCertCh and wait for the CertProcessor to finish.
  - Create a new parsedCertCh and a new CertProcessor.

Bug reported as issue #63.

* Add config option to ingest CT log certificates from csv files * fix logging * fix typo * bump version * add config option for the maximum ingested csv rows per CT log * bump patch version * Unify CSV files. * Fix UTs. * Add comment fetchers implement Fetcher. * Update dependencies. In particular, the mysql driver has an issue where it closes the connection if using TCP and gets an invalid connection error. The error is "unexpected EOF", and can be work around by setting the number of idle connections to 0. SetMaxIdleConns(0). But updating the driver should also fix the bug. go-sql-driver/mysql#674 * Cosmetic changes. * Performance in Articuno documentation. * Add performance figures on articuno. * Allow SMT update alone. * Allow bundle max sizes. The ingest tool can be told the max amount of certificates before a coalesce, smt update and dirty clean steps need to happen. * Performance debugging. * WIP updateSMT fails with OOM. * SMT update using join dirty table. * Doc on how to rename with new MySQL. * Cleanup benchmark. * Edit notes on performance: directories. * For bundled ingestion, call OnBundleFinished on last one. * Tests to check performance of the driver/daemon. From this test, we can conclude that, since the CPU for the daemon is bound to its max, we could benefit by calling multiple inserts/select, even with the same connection. * IO performance at articuno notes. * More verbose output to discover bottlenecks. * Fix removal of index if not exists. * More verbose messages and DB stats. * Remove DB stats display. * Fix bug about ingest pipeline calling coalesce too early. Bug reported as issue #63. * Cache refactor method. * When strategy=overwrite, disable indices in dirty,certs,and domains. * commitmessage.txt * MyISAM numbers for the test set. * Extend the concurrency test. * Mapserver InnoDB (#72) * Use InnoDB. * New schema with auto increment. * Identify deadlocks in MySQL InnoDB with a test. * Further testing/improving DB performance. 
* Skip expensive tests. * More test results. InnoDB looks promising. * Test results for partitioning. * Last tests for partitions. With last results in articuno. * Finished insert test in partitions. * Finished read tests. * Rename variable. * Manager for DB sharding. * WIP manager and workers. * Remove unused. * Work on tests/random stuff. * Workers with WorkerBase. * WIP conn manager and workers. UTs seem to work now with the default settings of InnoDB. Need to extend them with many more test cases. * Cleanup, reorganize. * Add more test cases. * Add Log2. * Clean and tidy up, more mysql UTs. * Document possible mysqld bug. * Cleanup. * Cleanup. * Refactor ErrorsCoalesce. * Remove empty file. * Extend deadlock UT. * Utils for benchmarks. * Convert TestPartitionInsert to benchmark. * Convert TestInsertPerformance to benchmark. * Convert TestReadPerformance to benchmark. * Reorder functions in test. * Fix some unit tests. Domain names for intermediate certs are also present in domains table. Flaky deadlock test. Test for domain_payloads too slow. * Using partitions in production DB. Fixed RetrieveDomainEntries benchmark. Adapted deadlock test. * Globally set TRANSACTION ISOLATION LEVEL to READ UNCOMMITTED. * Force autocommit=1 on connection. * Manager allows resuming. * Fix bug in random test creation. * Using CSV files to update certificates. * Cleanup. * At manager, resume cert and domain workers synchronously. Before finishing the Resume() function, ensure that the cert and domain workers have correctly resumed. * Cleanup. * Debugging race condition. * No race condition, the race detector timed out. * Generic function to deduplicate slices. Allows to have a "master" slice used to deduplicate, and slave ones from which elements are removed if they are removed from the master one. * New deduplicate slice function. * Use the new dedup functions. * Dedup domains in three stages. One for dirty, just IDs. Another for id and name. Last one for domain and cert IDs. 
* Add mock DB, mockgen. * Allow dedup to receive existing storage. * Use slices of values (not pointers) for IDs (#65) * UT for random certs. size. * New hash method, no allocs. UTs. * Removing pointers from ID slices. The only slice with pointers is parent IDs, as they can be nil to represent "no parent". * New method to hash strings. * CertWorker does zero allocations. * Update gitignore. * Use CSV files for all ingest operations. * Fix integration test. * Rename baseWorker. * Better error coalescing. * Generic stages and pipeline. (#66) * WIP pipeline. * WIP pipeline with source. * A bit of cleanup. * Pipeline with Source and Sink. * Remove base. * Change test. * Tests for Stop and bundle sizes. * WIP multi channel. * Multi channel. * UnfoldCert takes an ID. * Update docs in pipeline. * Modify manager and workers (before generic pipeline). Record the last modifications to manager and workers, before the introduction of the generic pipeline. * Before the adoption of the generic pipeline, modify cmd ingest. * Multiple outputs at every Stage. * Add test for multiple outputs. * Adapt for OnNoMoreData. * Extend test, add comments. * Cleanup pipeline. * Cleanup README about DB engine preparation. * Restructure mapserver/updater using a pipeline. * Bring back the allocation tests. * Move worker alloc tests to their own file. * Test utility to check # allocations. * Initial test measuring malloc overhead in pipeline. * Added util function to remove element from slice. * Add comment. * Rename files. * Rename util func. * Add function to remove several indices from slice. * Tidy up elements from slice removal function. * Add pure Go in-place quicksort. * Allow unordered indices to RemoveElementsFromSlice. * Make Qsort generic. * Non allocating sending method in Stage. * Add a noop DB, to remove flakyness on tests. * No break if out channel unavailable. Remove allocations from debugPrint. * Better message in tests util timeout function. * Debug print now public. 
* Cleanup workers tests. * Extend workers test. * Extend manager tests. * Less flaky memalloc pipeline test. * Fix typo in docs. * Print update test times. * Extend pipeline tests. * Sequential Inputs, fix bug double setting options. * Source with multiple out values. * Modify TestOrTimeout. * Clean up, start debugging. * BUGFIX: sink needs a pointer to Stage. The pointer is needed, as the functions (e.g. sendOutputs) point to the method for the original pointer, not the one created after assigning Stage inside Sink. With a pointer to Stage inside Sink, this pointer never changes, leaving the pointed methods working with the right object. * WIP only pipeline allocation tests fail. * MemAlloc tests also passing now. * Refine MemAlloc test. * Use pointers for stages in mapserver workers. * WIP use regular pipeline in cmd/ingest. * Modify pipeline's bundle test. It reflects how the bundle size is used. * Modify bundle test again. * Only allow processing with multiple outputs. * Only allow source generator with multiple output. * Refactor: use renamed functions of pipeline. * Introduce StageBase, AutoResume. * Util func ResizeSlice. * AutoResume for Sink. * Remove debug print that makes alloc tests fail. * WIP preparing the main cmd processor pipeline. * File cmd ingest pipeline done until creation of Certificate in memory. * WIP new files for pipeline joining. * Remove duplicated funcs in source and sink. * Longer timeout for some updater tests. * Fix two bugs: Source.Prepare had a race condition, readIncomingSequentially a typo. The method Source.Prepare expected to run its goroutine on the error channel of the stage, but was not guaranteed if it took a bit longer than the replacement of the error channel happening in auto resume. The method readIncomingSequentially was always reading from the stage's incoming channel slice, instead of the temporary copy that was dynamically altered. * Correct test. 
Expected values should be sorted by channel, and reset sorting per batch. * Add OnErrorSending option. Fix bugs. Stages can now handle an error occurred while sending output. Sink None output reader goroutine must be started with parameter to avoid race condition. Function sendOutputConcurrent had a race condition if multiple channels had an error simultaneously. * Pipeline creation also returns err. Add Source with incoming channel. Add tests. * Introduced new option types and functions. * Better debug messages. * Add one more simple test. * Add ability to join pipelines. * Conditionally compile pipeline debug infrastructure. Use go test -tags=debug ./pipeline/ ........ to enable them. * Use source channel in DB manager. * WIP join complex pipelines. * AggregatedInput at Resume, and only if not existing. Aggregated input is rebuilt at Resume. The function to aggregate input returns the first channel if only one channel, the existing aggregated input channel if not nil, or builds a new one. More debug output. * Wording. * Revert "Wording." partial, manual. This reverts commit 55ffdb3. * Fix joinErrorChannels. The method was just completely wrong: - No code writes to the original error channel. The error channel simply exists as liked to the previous stages. - The method was simply reading messages from the new error channel and sending them to the original, while also reading messages from the original and sending them back to the new channel. It worked only because there was a race condition happening continuously, where the original error channel would sometimes be read by the method (no effect at all) and sometimes by previous stages (which resulted in the desired behavior). * Add onResume internal event. * Better debug output. * Add reference capture tests. * New debug/panic print functionality. * Joining complex pipelines. * Debug print function in util pkg. * Add guard to avoid calling debugPrintf in nodebug. * Fix build in test. 
* Fix source incoming channel option. Using a pointer to the source incoming channel now. * Add WithSourceSlice. * Adapt cmd ingest. * Fix ingest cmd. * Change types of interstage messages to pointers. * Adapt tests to new interstage pointer data types. * Default send strategy for stage is concurrent. * Num parsers & db writers as cmd arguments, better output. * Add comment on articuno performance. * Allow ingest workers interstage data with and without ptrs. * Continuation of ingest w/wout ptrs. * Continuation2 of ingest w/wout ptrs. * Continuation3 of ingest w/wout ptrs. * Fix ptr and non ptr cmd and manager workers. * Use non ptr stages in cmd and manager. * Fix tests. * Update comments with more articuno results. * Tracing. Add tracing capabilities. Add contexts to tasks, pipelines. Propagate. * Tracing (#67) * More verbose traces. * New benchmark results. * wip initial example * Tracing proof of concept finished. * Remove runtime file traces. * Initial jaeger traces. * Remove sequential IO. * Add timed events (log entries). * Unique context for DB workers. * Fix problem at the sharding function. * Different tracing providers. * Adapt to changes in tracing. * Cleaner traces at stages. * Instrument partition insert benchmark for traces. * Trace long waits at stages. * Solve pipeline stalls (#68) * Zero overhead traces when disabled. * Fix flaky allocation tests. Sometimes the init would be still running inside the critical region. * Remove Id from baseWorker. * Rename one function. * Traces for domain ptr worker. * Domain worker split in two: batcher and insert. * Fix utest, and refactor domainBatchWorker and test. * Rename updater internal types * Fix bug in ResizeSlice with fill value. * WithSourceChannel returns many out channels. * Fix bug when not tracing. * Trace stages automatically if longer than 1 second. * Certificate stages are now multiple. There are batchers, domain extractors, and inserters. * Better traces. * Introduce WithOutputStreamingFunction. 
Stages now have the ability to keep on sending output after finishing processing, allowing for cases such as csv splitter to keep on sending lines after having started processing one file name, and thus not blocking the whole pipeline on one stage finishing its processing. * The csv splitter stage streams out lines from the file. * Add makefile rule to build ingest with tracing. * Domain ID cache and fixes. Add a domain ID cache to the domain extractor. Report some domain IDs in the domain inserter trace. Slightly modify the ingestion cache API. Fix missing statistics update on csv split worker. * Add LinkStageDistribution, single-out to multiple stages. * Update comment. * Fix unit test not passing. * Add LinkStagesCrissCross, closeOutChannel function. * More efficient use of channels via LinkStagesCrissCross. * Read several files at once. * File forgotten. * Benchmark the first half of the pipeline. * Allow configuring separately the number of all ingest stages. * Change trace report when incoming it too long. * Join raw pipelines joins with custom I/O channels. Allow joining several pipelines with the creation of a new one, given a custom new pipeline linking function. This allows for the ingest pipeline to be more efficient, by not needing to funnel all certificates through the sink and source of the previous pipelines. * LRU cache (#69) * Move cache to pkg/ * WIP LRU cache. * Add a LRU cache and test. * Use the new LRU cache. * Bundle and deadlocks (#70) * Update comment. * Bundles done in certificate inserter. * WIP support at the inserter level, SMT panics. * Added WithStallStages and test. * Add test for concurrency during stalls. * Bundles based on StallStages. * Fix tests. * CSV in own stage (#71) * Functions to separate CSV creation from insertion. * Allow setting sequential inputs in tests. * Write CSV for certs in a new stage. Other changes. Set alter instance disable innodb redo_log; per default. 
Move some context creation in tests after the test db creation. * Benchmarks create table in innodb and myisam, for test purposes. It seems that with the latest innodb versions, the creation of a table has become quite slower than before. Measure this. * Fix create testDB tests and cleanup. * Domain csv creation, insertion and removal at separate stages. * Several file utils that don't allocate memory. * Typo. * Allow slightly bigger allocation in memalloc test. It seems that sometimes sending on a channel can allocate 4 times. * Tempfile name and creation. * No allocations for CSV creation. --------- Co-authored-by: Cyrill Krähenbühl <[email protected]>

juagargi added the bug label Apr 17, 2024

juagargi self-assigned this Apr 17, 2024

juagargi added a commit that referenced this issue Apr 17, 2024

Fix bug about ingest pipeline calling coalesce too early.

bd16979

Bug reported as issue #63.
