forked from JustAnother1/DataSetTransformer
-
Notifications
You must be signed in to change notification settings - Fork 0
jinshen-cn/DataSetTransformer
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
DataSetTransformer ------------------ DataSetTransformer is a tool that can read any data that can be organized in tables(rows and columns) can apply transformations and safe the transformed data. Contents -------- 1. Overview 2. License 3. Usage 4. Creating Job Files 5. Example Job File 1. Overview -------- The Transformation splits up into 5 Steps. - Importing - Import Selection - Data Filtering - Export Style - Exporting The Importer reads the Data stream. The import Filter then extracts the Data from the Stream. This Data is then filtered. The filtered Data is then converted into the Export format. Finally the data in export format is written out. currently supported input formats: - CSV (character separated values) - HTML - SQL Database (MySql,...) currently supported data filters: - combine fields - apply regular expression - remove empty lines currently supported output formats: - CSV (character separated values) - HTML - SQL Database (MySql,...) 2. License ------- Copyright (C) 2011 Lars Pötter <[email protected]> All Rights Reserved. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, see <http://www.gnu.org/licenses/> 3. Usage ----- Start the DataSetTarnsformer by doing "java -jar DataSetTransformer.jar jobfile.job". The Job File contains all the settings needed to do the transformation. 4. Creating Job Files ------------------ Job Files are simple Text Files. Job Files contain Sections and Settings. A Section starts with a Line "[Setting Name]" and ends with the end of the File or the next Section. Settings are lines like "Setting Name = Value". A Job File must have the following sections: - "Importer" - "Import Selector" - "Exporter" - "Export Style" A Job File may also have the following sections: - "Data Filter", "Data Filter_0", "Data Filter_1", ... "Data Filter_99" These sections directly correspond to the 5 steps needed for the transformation. Each of these steps can be thought of as a separate module. Each section must have at least on setting. That setting is the "type"-setting. The type selects the kind of module used in this step. The different kind of modules may have different settings. Refer to the following sections to find which kinds of modules are available and which settings they need. 4.1. Importer -------- Description ----------- The Importer opens the data source and gives the Import Selector access to the raw data. There must be exactly one Importer in each Job. currently available Importers: 4.1.1. File Importer ------------ type = FileImporter Description ----------- Reads bytes from a file in the local file system. Parameters ---------- FileName = path to file with input data. 4.1.2. URL Importer ----------- type = UrlImporter Description ----------- Reads bytes from a resource identified by a Unified Resource Locator (URL). This module creates a folder cache where copies of the imported resources are kept. The second import then uses the cached copy and will not request the resource again. If you want to request the resource again, then remove the cached copy from the cache folder. Parameters ---------- SourceUrl = URL from that the bytes will be read. 4.1.4. URL List Importer ----------------- type = UrlListImporter Description ----------- Reads URL's from the specified file and import all of them using the URL Importer. Parameters ---------- source = path to text file that contains one URL per line. 4.1.5. Paged URL Importer ------------------ type = PagedUrlImporter Description ----------- Uses URL Importer to import bytes from the specified URL. If the resulting Data that comes out of the import filter has found a data element named Next with a valid URL as value, then this URL is also imported. This repeats until no Next data element is found or the next element does not contain a valid URL. Parameters ---------- SourceUrl = URL from that the bytes will be read. 4.1.6. JDBC Importer ------------- type = JdbcImporter Description ----------- Opens the Connection to a JDBC Database. Parameters ---------- dbDriver = the class name of the database driver. (example: com.mysql.jdbc.Driver) dbUrl = the URL needed to acces the database. (example: jdbc:mysql://localhost/javatestdb) dbUser = the username needed to access the database. May be empty or missing. dbPassword = the password needed to access the database. May be empty or missing. 4.2. Import Selector --------------- Description ----------- The Import Selector reads the raw data and extracts the data sets. There must be exactly one Import Selector in each Job. currently available Import Selectors: 4.2.1. CSV Import Selector ------------------- type = CsvImportSelector Description ----------- Interprets the imported bytes as text containing data elements separated by the separator character. With one Data Set per line. Parameters ---------- separator = character that splits the data elements. mapping_0 = name of the first data element in each row. mapping_1 = name of the second data element in each row. mapping_2 = name of the third data element in each row. mapping_. = name of the ... data element in each row. mapping_. = name of the ... data element in each row. mapping_. = name of the ... data element in each row. mapping_N = name of the (N+1)th data element in each row. 4.2.2. Tree Import Selector -------------------- type = TreeImportSelector Description ----------- Interprets the imported bytes as HTML File. A Location and a name need to be provided for each data element that shall be extracted. Locations are defined as follows: If a location starts with "@html@" then the HTML code contained in the defined tag will be the value of the data element. Without this the text defined by the tag will be the value of the data element. If only a value of a attribute in the tag is needed then this can be achieved with @attributeName. a Location ending with "a@href" will create a data element that contains the URL in the specified link. The Location presents the Path to the tag. In HTML this will always start with the tag "html". This is then followed by either "head" or "body". If a tag has more than one child tags of the same type an attribute that has to match can be supplied. "div(id:main)" matches only the div Tag that has the attribute "id" with the value "main". An alternative is the tag name "*". That matches all Tags also on several levels. The "*" can also be used with parts of the tag name, then is mean the tag name must contain the supplied part. This means "ca*" would match "cake" and "captain" but not "general". The tags in the location are separated by "/". Example: "html/head/link(type:text/css)@href" applied on a HTML data like this: <html> <head> <link href="styles/theme.css" rel="stylesheet" type="text/css"> </head> <body> </body </html> will create a data element that contains "styles/theme.css". Parameters ---------- Mapping_Name_0 = name of the first data element. Mapping_Name_1 = name of the second data element. Mapping_Name_N = name of the (N+1)th data element. Mapping_Position_0 = position of the first data element. Mapping_Position_1 = position of the second data element. Mapping_Position_N = position of the (N+1)th data element. 4.2.3. SQL Import Selector -------------------- type = SqlImportSelector Description ----------- Executes the SQL Query and creates data sets for each returned row. The name of the data element is the column name. Parameters ---------- SqlQuery = the SQL Query that will be executed to get the Data. (example: SELECT * FROM testtable;) 4.3. Data Filter ----------- Description ----------- The data filter reads the data sets and applies the filter. The result are the changed/ filtered data sets. There can be zero to 100 data filters in each job. The data filters will be executed according their number starting with 0 and going up to 99. A data filter section without number will be executed before data filter 0. currently available Data Filters: 4.3.1. Concatenate Filter ------------------ type = ConCatFilter Description ----------- This filter will return only one Data Set. The filter checks that in all Data Sets that are provided to the filter the data elements specified in "FieldThatMustBeEqual" have always the same value. If that is the case then the values of all that Data elements specified in "FieldsThatWillBeConCatenated" will be concatenated with the defined separator. Parameters ---------- ConCatSeparator = string that will be inserted between concatenated values. FieldThatMustBeEqual_0 = name of data element that shall be the same on all data sets. FieldThatMustBeEqual_1 = name of data element that shall be the same on all data sets. FieldThatMustBeEqual_N = name of data element that shall be the same on all data sets. FieldsThatWillBeConCatenated_0 = name of data element that shall be the same on all data sets. FieldsThatWillBeConCatenated_1 = name of data element that shall be the same on all data sets. FieldsThatWillBeConCatenated_N = name of data element that shall be the same on all data sets. 4.3.2. Regular Expression Filter ------------------------- type = RegExpFilter Description ----------- This filter only returns those data set from the provided data sets that contain the data element specified by "NameOfAtom" and where the value of that data element also matched the specified regular expression defined in "RegularExpression" Parameters ---------- NameOfAtom = Name of the data element that will be processed. RegularExpression = the regular expression that must be matched. 4.3.3. Remove Empty Lines Filter ------------------------- type = RemoveEmptyLinesFilter Description ----------- Returns from the provided data set only those that have at least one data element. Parameters ---------- no parameters available 4.3.4. Search and Replace Filter ------------------------- type = ReplaceFilter Description ----------- replaces all occurrences of the searched for string with the replacement in the specified data element off all data sets. Parameters ---------- Element to change = Name of the data element to change Search for = string that will be removed if found Replace with = replacement for the found string 4.3.5. Add Index data element Filter ----------------------------- type = AddIndexDataElementFilter Description ----------- adds a data element to each data set that contains the index/ row number of the data set. Counting starts with 0. Parameters ---------- Index element name = name of the added data element 4.4. Exporter -------- currently available Exporters: 4.4.1. File Exporter ------------- type = FileExporter Description ----------- Writes the bytes produced by the Export Style module into that file specified in "target". Parameters ---------- target = File Name of result file. 4.4.1. One File per Row Exporter ------------------------- type = OneFilePerRowExporter Description ----------- creates a file for each data set. The filenames for these files are created by prefix + <value of data element> + postfix. Parameters ---------- filename prefix = the prefix filename postfix = the postfix filename data element = the name of the data element used to construct the filename 4.4.3. JDBC Exporter ------------- type = JdbcExporter Description ----------- Opens the Connection to a JDBC Database. Parameters ---------- dbDriver = the class name of the database driver. (example: com.mysql.jdbc.Driver) dbUrl = the URL needed to acces the database. (example: jdbc:mysql://localhost/javatestdb) dbUser = the username needed to access the database. May be empty or missing. dbPassword = the password needed to access the database. May be empty or missing. 4.5. Export Style ------------ currently available Export Styles: 4.5.1. CSV Export Style ---------------- type = CsvExportStyle Description ----------- creates a CSV byte stream from the provided data sets. Each data set results in one row. The mapping defines the sequence of the date element of the data sets. The separator defines the used separator between the data elements. Parameters ---------- separator = character used to separate the data atoms of the data set. mapping_0 = name of first data element. mapping_1 = name of second data element. mapping_N = name of (N+1)th data element. 4.5.2. HTML Export Style ----------------- type = HtmlExportStyle Description ----------- Creates a byte stream by interleaving the data from the data sets with the Global and Row Style definition. Example: GlobalStyleDefinition = <html><head></head><body>%ROW%</body></html> RowStyleDefinition = party: %n% had %c% cakes Data = [n = Birthday c = 5][n = Christmas c = 0] Resulting HTML = <html><head></head><body>party: Birthday had 5 cakes party: Christmas had 0 cakes </body></html> Parameters ---------- GlobalStyleDefinition = specifies the HTML tags before and after the data. RowStyleDefinition = specifies the HTML tags used for each data set row. 4.5.3. SQL Export Style ---------------- type = SqlExportStyle Description ----------- Inserts the Data into a SQL Database Table. It does a "" for each data set Parameters ---------- tableName = Name of the SQL table that shall receive the data. fieldNames = Names of all fields/columns of the table that shall receive data elements. mapping = list of data elements that will be written into the table. "mapping_0" will be inserted into the column specified by the first column name in the fieldNames parameter. 5. Example Job File ---------------- [Importer] type = FileImporter FileName = testData/small.csv [Import Selector] type = CsvImportSelector separator = , mapping_0 = number mapping_1 = german mapping_2 = english [Data Filter_0] type = RemoveEmptyLinesFilter [Data Filter_1] type = ReplaceFilter Element to change = english Search for = pi Replace with = pie [Exporter] type = FileExporter target = exported.csv [Export Style] type = CsvExportStyle separator = , mapping_0 = number mapping_1 = english mapping_2 = german
About
Converts any Data Format into any (other) Data Format
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published