Feature/spillchuck (#2133)
* Fix non-UTF8 character in version text

* Fix a few typos.

* Fix link filename

* Update URL for Atlassian HIC docs

---------

Co-authored-by: James A Sutherland <>
jas88 authored Feb 13, 2025
1 parent 527a45d commit 58af45a
Showing 335 changed files with 592 additions and 593 deletions.
@@ -317,13 +317,13 @@ public void SetWindowManager(WindowManager windowManager)
}

launchAnotherInstanceToolStripMenuItem.ToolTipText =
-"Start another copy of the RDMP process targetting the same (or another) RDMP platform database";
+"Start another copy of the RDMP process targeting the same (or another) RDMP platform database";

if (switchToInstanceToolStripMenuItem.DropDownItems.Count > 1)
{
switchToInstanceToolStripMenuItem.Enabled = true;
switchToInstanceToolStripMenuItem.ToolTipText =
-"Close the application and start another copy of the RDMP process targetting another RDMP platform database";
+"Close the application and start another copy of the RDMP process targeting another RDMP platform database";
}
else
{
@@ -22,7 +22,7 @@ namespace ResearchDataManagementPlatform.WindowManagement.ContentWindowTracking.

/// <summary>
/// A Document Tab that hosts an RDMPSingleDatabaseObjectControl T, the control knows how to save itself to the persistence settings file for the user ensuring that when they next open the
-/// software the Tab can be reloaded and displayed. Persistance involves storing this Tab type, the Control type being hosted by the Tab (a RDMPSingleDatabaseObjectControl) and the object
+/// software the Tab can be reloaded and displayed. Persistence involves storing this Tab type, the Control type being hosted by the Tab (a RDMPSingleDatabaseObjectControl) and the object
/// ID , object Type and Repository (DataExport or Catalogue) of the T object currently held in the RDMPSingleDatabaseObjectControl.
/// </summary>
[System.ComponentModel.DesignerCategory("")]
@@ -17,7 +17,7 @@ namespace ResearchDataManagementPlatform.WindowManagement.ContentWindowTracking.

/// <summary>
/// A Document Tab that hosts an RDMPCollection, the control knows how to save itself to the persistence settings file for the user ensuring that when they next open the
-/// software the Tab can be reloaded and displayed. Persistance involves storing this Tab type, the Collection Control type being hosted by the Tab (an RDMPCollection).
+/// software the Tab can be reloaded and displayed. Persistence involves storing this Tab type, the Collection Control type being hosted by the Tab (an RDMPCollection).
/// Since there can only ever be one RDMPCollection of any Type active at a time this is all that must be stored to persist the control
/// </summary>
[TechnicalUI]
@@ -248,7 +248,7 @@ internal void OnFormClosing(System.Windows.Forms.FormClosingEventArgs e)
/// <summary>
/// Attempts to ensure that a compatible RDMPCollectionUI is made visible for the supplied object which must be one of the expected root Tree types of
/// an RDMPCollectionUI. For example Project is the a root object of DataExportCollectionUI. If a matching collection is already visible or no collection
-/// supports the supplied object as a root object then nothing will happen. Otherwise the coresponding collection will be shown
+/// supports the supplied object as a root object then nothing will happen. Otherwise the corresponding collection will be shown
/// </summary>
/// <param name="root"></param>
public void ShowCollectionWhichSupportsRootObjectType(object root)
@@ -415,13 +415,13 @@ public PersistableObjectCollectionDockContent GetActiveWindowIfAnyFor(Type windo
/// Check whether a given RDMPSingleControlTab is already showing with the given DatabaseObject (e.g. is user currently editing Catalogue bob in CatalogueUI)
/// </summary>
/// <exception cref="ArgumentException"></exception>
-/// <param name="windowType">A Type derrived from RDMPSingleControlTab</param>
+/// <param name="windowType">A Type derived from RDMPSingleControlTab</param>
/// <param name="databaseObject">An instance of an object which matches the windowType</param>
/// <returns></returns>
public bool AlreadyActive(Type windowType, IMapsDirectlyToDatabaseTable databaseObject)
{
return !typeof(IRDMPSingleDatabaseObjectControl).IsAssignableFrom(windowType)
-? throw new ArgumentException("windowType must be a Type derrived from RDMPSingleControlTab")
+? throw new ArgumentException("windowType must be a Type derived from RDMPSingleControlTab")
: _trackedWindows.OfType<PersistableSingleDatabaseObjectDockContent>().Any(t =>
t.Control.GetType() == windowType && t.DatabaseObject.Equals(databaseObject));
}
48 changes: 24 additions & 24 deletions CHANGELOG.md

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions Documentation/CodeTutorials/CSVHandling.md
@@ -18,23 +18,23 @@
* [Unclosed quotes](#unclosed-quotes)

## Background
-CSV stands for 'Comma Separated Values'. A CSV file is created by writting a text document in which the cells of the table are separated by a comma. Here is an example CSV file:
+CSV stands for 'Comma Separated Values'. A CSV file is created by writing a text document in which the cells of the table are separated by a comma. Here is an example CSV file:

```
CHI,StudyID,Date
0101010101,5,2001-01-05
0101010102,6,2001-01-05
```

-CSV files usually end in the extension `.csv`. Sometimes an alternate separator will be used e.g. pipe `|` or tab `\t`. There is an [official ruleset](https://tools.ietf.org/html/rfc4180) for writting CSV files, this covers escaping, newlines etc. However this ruleset is often not correctly implemented by data suppliers. RDMP therefore supports the loading of corrupt/invalid CSV files.
+CSV files usually end in the extension `.csv`. Sometimes an alternate separator will be used e.g. pipe `|` or tab `\t`. There is an [official ruleset](https://tools.ietf.org/html/rfc4180) for writing CSV files, this covers escaping, newlines etc. However this ruleset is often not correctly implemented by data suppliers. RDMP therefore supports the loading of corrupt/invalid CSV files.

The class that handles processing delimited files (CSV, TSV etc) is `DelimitedFlatFileDataFlowSource`. This class is responsible for turning the CSV file into a series of `System.DataTable` chunks for upload to the database.

## Scalability
-CSV processing is done iteratively and streamed into the database in chunks. This has been tested with datasets of 800 million records without issue. Chunk size is determined by `MaxBatchSize`, optionally the initial batch can be larger `StronglyTypeInputBatchSize` to streamline [Type descisions](https://github.com/HicServices/FAnsiSql/blob/master/Documentation/TypeTranslation.md) e.g. when sending data to a `DataTableUploadDestination`.
+CSV processing is done iteratively and streamed into the database in chunks. This has been tested with datasets of 800 million records without issue. Chunk size is determined by `MaxBatchSize`, optionally the initial batch can be larger `StronglyTypeInputBatchSize` to streamline [Type decisions](https://github.com/HicServices/FAnsiSql/blob/master/Documentation/TypeTranslation.md) e.g. when sending data to a `DataTableUploadDestination`.

## Type Determination
-Type decisions [are handled seperately](https://github.com/HicServices/FAnsiSql/blob/master/Documentation/TypeTranslation.md) after the `System.DataTable` has been produced in memory from the CSV file.
+Type decisions [are handled separately](https://github.com/HicServices/FAnsiSql/blob/master/Documentation/TypeTranslation.md) after the `System.DataTable` has been produced in memory from the CSV file.

## Corrupt Files
RDMP is able to detect and cope with some common problems with delimited (e.g. CSV) files. These situations can be classified as 'Resolved Automatically', 'Resolved Accordly' and 'Unresolveable'
@@ -93,7 +93,7 @@ CHI ,StudyID,Date,,
_TrailingNulls_InHeader_

### Empty Columns
-Sometimes a CSV file will have an entirely null column in the middle. This can occur if you open a CSV in excel and insert a row or you have two 'tables' side by side in the CSV with a blank line separator. In this situation RDMP will expect the unamed column to be null/empty for all cells and it will ignore it.
+Sometimes a CSV file will have an entirely null column in the middle. This can occur if you open a CSV in excel and insert a row or you have two 'tables' side by side in the CSV with a blank line separator. In this situation RDMP will expect the unnamed column to be null/empty for all cells and it will ignore it.

```
CHI ,,StudyID,Date
@@ -152,7 +152,7 @@ You can attempt to solve the problem of too few cells on a row by setting `Attem

![FlowChart](Images/CSVHandling/TooFewCellsFlow.png)

-This is a conservative approach in which the process is abandonned as soon as:
+This is a conservative approach in which the process is abandoned as soon as:
* A valid length row is read during the process (all work is discarded and processing resumes from this record)
* Too many cells are read (all work is discarded and processing resumes from the last record read)

@@ -202,7 +202,7 @@ _BadCSV_ForceHeaders_NoReplace_
## Unresolveable

### Unclosed Quotes
-The CSV standard allows you to escape the separator charcter, newlines etc by using quotes. If your file contains an unclosed quote then the entire rest of the file will be in error:
+The CSV standard allows you to escape the separator character, newlines etc by using quotes. If your file contains an unclosed quote then the entire rest of the file will be in error:

```
Name,Description,Age
2 changes: 1 addition & 1 deletion Documentation/CodeTutorials/Coding.md
@@ -22,7 +22,7 @@ Many of the complicated low level APIs have been refactored out of RDMP and move

### Database Abstraction Layer

-RDMP interacts with relational databases (Sql Server, Oracle, PostgresSQL and MySql). It runs SQL queries, creates tables and does general ETL. This functionality has been abstracted out into the [FAnsiSql library](https://github.com/HicServices/FAnsiSql)
+RDMP interacts with relational databases (Sql Server, Oracle, postgresql and MySql). It runs SQL queries, creates tables and does general ETL. This functionality has been abstracted out into the [FAnsiSql library](https://github.com/HicServices/FAnsiSql)

### Type Determination

4 changes: 2 additions & 2 deletions Documentation/CodeTutorials/DataTableUpload.md
@@ -24,7 +24,7 @@ The behaviour of `DataTableUploadDestination` is controlled through settings suc
![DataTableUploadDiagram](Images/DataTableUpload/Settings.png)

## Behaviour
-When the first DataTable is passed into `DataTableUploadDestination` a new [Guesser](https://github.com/HicServices/TypeGuesser) will be created for each column. Every cell value is then fed through these computers. This will record the maximum decimal size seen, the longest string and make a guess at the Type (if the data is not hard typed already). The determination of data type is handled by implementations of `IDecideTypesForStrings`. Each allowable `Type` has a `TypeCompatibilityGroup`, this determines the behaviour of the system when mixed values are encountered e.g. "1" (int) followed by "1.1" (decimal). If the Types are compatible then the `Guess` will be the Type that accomodates both values (e.g. decimal). If values are not compatible (e.g. "1" followed by "2001-01-01") then the `Guess` will be changed to string (varchar).
+When the first DataTable is passed into `DataTableUploadDestination` a new [Guesser](https://github.com/HicServices/TypeGuesser) will be created for each column. Every cell value is then fed through these computers. This will record the maximum decimal size seen, the longest string and make a guess at the Type (if the data is not hard typed already). The determination of data type is handled by implementations of `IDecideTypesForStrings`. Each allowable `Type` has a `TypeCompatibilityGroup`, this determines the behaviour of the system when mixed values are encountered e.g. "1" (int) followed by "1.1" (decimal). If the Types are compatible then the `Guess` will be the Type that accommodates both values (e.g. decimal). If values are not compatible (e.g. "1" followed by "2001-01-01") then the `Guess` will be changed to string (varchar).

Sometimes you want to force a specific destination database datatype for some/all columns. This can be done either by setting `ExplicitTypes` or by creating the table yourself before running the `DataTableUploadDestination`. These types are the initial estimates and can be changed based on data encountered (unless you turn off `AllowResizingColumnsAtUploadTime`).

@@ -33,7 +33,7 @@ Once all data types have been determined the destination table is created with t
If a column is always null then the column estimate will be `Boolean` (`bit`).

## Resizing
-Data is loaded into tables in batches using an appropriate `IBulkCopy` for the [DBMS] being targetted (e.g. Sql Server, MySql). Each batch goes through the `Guesser` which can result in a column estimate changing. When this occurs (assuming `AllowResizingColumnsAtUploadTime`) then an ALTER statement is issued to change the column to the new `Type`.
+Data is loaded into tables in batches using an appropriate `IBulkCopy` for the [DBMS] being targeted (e.g. Sql Server, MySql). Each batch goes through the `Guesser` which can result in a column estimate changing. When this occurs (assuming `AllowResizingColumnsAtUploadTime`) then an ALTER statement is issued to change the column to the new `Type`.

## Primary Keys
The table created will have an appropriate primary key if the `DataTable` batches supplied have a PrimaryKey set or if any `ExplicitTypes` columns have `IsPrimaryKey` set. This key will only be created when the `DataTableUploadDestination` is disposed since primary keys can affect the ability to issue ALTER statements.
2 changes: 1 addition & 1 deletion Documentation/CodeTutorials/DoubleClickAndDragDrop.md
@@ -11,7 +11,7 @@ Both 'drag and drop' and 'double click' tie directly to the 'Command' pattern RD

Drag and drop and double clicking (called activation) is a core part of the RDMP API and is handled through the class `RDMPCommandExecutionProposal<T>` on a `Type` basis. Each object that supports activation and/or drop must have an associated instance of `RDMPCommandExecutionProposal<T>` called `ProposeExecutionWhenTargetIs<SomeClass>`.

-This derrived class will decide what tab/window/custom action to show when `Activate` happens either as part of double click or as part of `ExecuteCommandActivate` (e.g. from a right click menu) or a call to `BasicUICommandExecution.Activate` and decide what `ICommandExecution` is executed when a given object/collection is dropped on it.
+This derived class will decide what tab/window/custom action to show when `Activate` happens either as part of double click or as part of `ExecuteCommandActivate` (e.g. from a right click menu) or a call to `BasicUICommandExecution.Activate` and decide what `ICommandExecution` is executed when a given object/collection is dropped on it.

![ExampleMenu](Images/DoubleClickAndDragDrop/DropExample.png)
_Example of dragging a [Catalogue] onto an ExtractionConfiguration_