Skip to content

tutorial

Boris Glavic edited this page Aug 10, 2017 · 1 revision

iBench Tutorial

This tutorial makes use of the configuration files in folder exampleConfigurations/. This assumes that iBench is already installed (see the Installation Instructions).

Your first mapping scenario

As a first example, we are going to create a mapping scenario where the source schema has a single relation and the target schema has a copy of this relation. The schema mapping consists of a single tgd that maps the source to the target generation. To use iBench you have to create a configuration file that controls the metadata and data generation. Once you have finished this tutorial have a look at here for detailed documentation about iBench's configuration file format. We are going to use a precooked configuration file: exampleConfigurations/tutorialOneCopy.txt:

################################################################################
#
# Configuration File for iBench
#
# Basic tutorial configuration that creates a single copy scenario
#
################################################################################

########################################
# Output path prefixes and file names
SchemaPathPrefix = tutorialOneCP
InstancePathPrefix = tutorialOneCP

FileNames.Schemas = metadata.xml
FileNames.SourceInstance = toxConfig.tsl
FileNames.SourceDocumentName = sourceInst

########################################
# Number of Instances for each Basic Scenario Type
# Create one copy scenario
Scenarios.COPY = 1

################################################################################
# Additional configuration

# relations have 5 attributes each
ConfigOptions.NumOfSubElements = 5

# Random number generator and max values, DataGenerator and MappingLang
RandomSeed = 2

# create 10 rows of data
RepElementCount = 10
DataGenerator = TrampCSV
QueryGenerator = Postgres
MappingLanguage = SOtgds

################################################################################
# Optional activation/deactivation of output options
# create only TrampXML metadata file and data
################################################################################ 

OutputOption.Data = true
OutputOption.XMLSchemas = false
OutputOption.TrampXML = true			

################################################################################
# Optional activation/deactivation of parts of the generated Tramp XML document
################################################################################

TrampXMLOutput.ConnectionInfo = false

Let's run iBench with this configuration file suppressing all log output:

cd build
./iBench.sh -c ../exampleConfigurations/tutorialOneCopy.txt -loglevel error

Let's examine what is produced by iBench. We have asked iBench to store metadata and data file a directory tutorialOneCP by setting SchemaPathPrefix=tutorialOneCP and InstancePathPrefix=tutorialOneCP. Let's examine this directory:

ls -1 tutorialOneCP/

metadata.xml
sourceInst.xml
test_cp_0_nl0_ce0.csv
toxConfig.tsl

The metadata.xml file stores all the metadata: the source and target schemas, schema matches, constraints, the schema mapping, and transformations implementing the schema mapping. Let's understand what was generated by iBench. We have instructed iBench to generated a mapping that consists of a single copy primitive (Scenarios.COPY = 1). iBench generates metadata by creating complex schemas and mappings from simple, common sniplets called primitives. For instance, the copy primitive consists of a single source relations, a target relation that is an exact copy of the source relation, schema matches between the corresponding attributes, and a mapping that copies data from the source to the target relation. iBench comes with a wide range of build-in primitives and can be extended with user-defined primitives (UDPs).

Let's now dissect the sections of the generated metadata.xml file.

Generated Metadata XML file

The <Schemas> section in the file stores the generated source and target schema. In our case the source schema consists of a single relation test_cp_0_nl0_ce0. Relation and attribute names are randomly generated by iBench. For each attribute we record its name and data type. For each relation we record its primary key (if applicable). For more complex scenarios this section may also contain foreign keys.

  <Schemas>
    <SourceSchema>
      <Relation name="test_cp_0_nl0_ce0">
        <Attr>
          <Name>nut_cp_0_nl0_ae0ke0</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>art_cp_0_nl0_ae1</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>slope_cp_0_nl0_ae2</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>measure_cp_0_nl0_ae3</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>touch_cp_0_nl0_ae4</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <PrimaryKey>
          <Attr>nut_cp_0_nl0_ae0ke0</Attr>
        </PrimaryKey>
      </Relation>
    </SourceSchema>
	<TargetSchema>
      <Relation name="test_cp_0_nl0_ce0copy0_0">
        <Attr>
          <Name>nut_cp_0_nl0_ae0ke0</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>art_cp_0_nl0_ae1</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>slope_cp_0_nl0_ae2</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>measure_cp_0_nl0_ae3</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <Attr>
          <Name>touch_cp_0_nl0_ae4</Name>
          <DataType>TEXT</DataType>
        </Attr>
        <PrimaryKey>
          <Attr>nut_cp_0_nl0_ae0ke0</Attr>
        </PrimaryKey>
      </Relation>
    </TargetSchema>
  </Schemas>

the following section stores correspondences (schema matches), i.e., which attributes from the source schema correspond to which attributes from the target schema. In our example, there is a one to one correspondence between the attributes in the source and the target, because the COPY primitive creates an exact copy of the source relations (modulo attribute names).

  <Correspondences>
    <Correspondence id="C0">
      <From tableref="test_cp_0_nl0_ce0">
        <Attr>nut_cp_0_nl0_ae0ke0</Attr>
      </From>
      <To tableref="test_cp_0_nl0_ce0copy0_0">
        <Attr>nut_cp_0_nl0_ae0ke0</Attr>
      </To>
    </Correspondence>
    <Correspondence id="C1">
      <From tableref="test_cp_0_nl0_ce0">
        <Attr>art_cp_0_nl0_ae1</Attr>
      </From>
      <To tableref="test_cp_0_nl0_ce0copy0_0">
        <Attr>art_cp_0_nl0_ae1</Attr>
      </To>
    </Correspondence>
    <Correspondence id="C2">
      <From tableref="test_cp_0_nl0_ce0">
        <Attr>slope_cp_0_nl0_ae2</Attr>
      </From>
      <To tableref="test_cp_0_nl0_ce0copy0_0">
        <Attr>slope_cp_0_nl0_ae2</Attr>
      </To>
    </Correspondence>
    <Correspondence id="C3">
      <From tableref="test_cp_0_nl0_ce0">
        <Attr>measure_cp_0_nl0_ae3</Attr>
      </From>
      <To tableref="test_cp_0_nl0_ce0copy0_0">
        <Attr>measure_cp_0_nl0_ae3</Attr>
      </To>
    </Correspondence>
    <Correspondence id="C4">
      <From tableref="test_cp_0_nl0_ce0">
        <Attr>touch_cp_0_nl0_ae4</Attr>
      </From>
      <To tableref="test_cp_0_nl0_ce0copy0_0">
        <Attr>touch_cp_0_nl0_ae4</Attr>
      </To>
    </Correspondence>
  </Correspondences>

The next section stores the schema mapping. Each Mapping element is a tuple-generating dependency (TDG). We support st-tgds as well as SO-tgds (parameter MappingLanguage). For our example, the tgd is

FORALL a,b,c,d,e: test_cp_0_nl0_ce0(a,b,c,d,e) -> test_cp_0_nl0_ce0copy0_0(a,b,c,d,e)

Note that for each mapping we also record which correspondences are used.

  <Mappings>
    <Mapping id="M0">
      <Uses>
        <Correspondence ref="C0"/>
        <Correspondence ref="C1"/>
        <Correspondence ref="C2"/>
        <Correspondence ref="C3"/>
        <Correspondence ref="C4"/>
      </Uses>
      <Foreach>
        <Atom tableref="test_cp_0_nl0_ce0">
          <Var>a</Var>
          <Var>b</Var>
          <Var>c</Var>
          <Var>d</Var>
          <Var>e</Var>
        </Atom>
      </Foreach>
      <Exists>
        <Atom tableref="test_cp_0_nl0_ce0copy0_0">
          <Var>a</Var>
          <Var>b</Var>
          <Var>c</Var>
          <Var>d</Var>
          <Var>e</Var>
        </Atom>
      </Exists>
    </Mapping>
  </Mappings>

The Transformations section contains SQL code that implements the mappings. In our example, there is a single query which implements mapping M0 which is essentially SELECT * ....

  <Transformations>
    <Transformation id="T0" creates="test_cp_0_nl0_ce0copy0_0">
      <Implements>
        <Mapping ref="M0"/>
      </Implements>
      <Code>SELECT xl0v0.nut_cp_0_nl0_ae0ke0 AS nut_cp_0_nl0_ae0ke0, xl0v0.art_cp_0_nl0_ae1 AS art_cp_0_nl0_ae1, xl0v0.slope_cp_0_nl0_ae2 AS slope_cp_0_nl0_ae2, xl0v0.measure_cp_0_nl0_ae3 AS measure_cp_0_nl0_ae3, xl0v0.touch_cp_0_nl0_ae4 AS touch_cp_0_nl0_ae4
FROM source.test_cp_0_nl0_ce0 AS xl0v0</Code>
    </Transformation>
  </Transformations>

The last section stores information about the CSV files storing data generated for the source relations. In our example there is a single CSV file test_cp_0_nl0_ce0.csv for the source table test_cp_0_nl0_ce0.

  <Data>
    <InstanceFile name="test_cp_0_nl0_ce0">
      <Path>/Users/lord_pretzel/Documents/workspace/iBench/build/tutorialOneCP</Path>
      <FileName>test_cp_0_nl0_ce0.csv</FileName>
      <ColumnDelim>|</ColumnDelim>
    </InstanceFile>
    <ExchangeData/>
  </Data>

Generated Data

Since we asked it to do so (OutputOption.Data = true), iBench has also generated data stored in a file test_cp_0_nl0_ce0.csv. Let's have a look:

1|exaltpurc|blubbinvit|odorifero|horsesenti|purc
2|unprac|conde|ca|beadsaf|sa
3|c|wholl|f|stale|archbis
4|appreh|botc|c|boyte|slideob
5|hov|shif|offensiv|mis|vo
6|wearsdi|syll|catching|hud|hoophe
7|bloodba|ridgeswo|eternityr|prit|wedn
8|en|repugn|covers|mor|suppli
9|drive|pompi|strump|gatewo|thievis
10|stirringc|di|lures|gentle|faustetic

iBench has generated 10 rows with a unique identifier and the 5 attributes of relation test_cp_0_nl0_ce0. The default data generator generates nonsensical strings. You can use more meaningful data generators or even provide your own data generators. See here on how this works.

Loading the schema and data

Now let's use the loader that comes with iBench to load the schema and data into PostgreSQL. For this to work, you have to have a running postgres server. Here we assume that the postgres connection parameters are. Please change according to your configuration.

host: 127.0.0.1
user: postgres
pw: passwd
database: testdb
postgres port: 5432

To load the mapping scenario run:

./loader.sh -f tutorialOneCP/metadata.xml -u postgres -p passwd -d testdb

This will create schemas source and target in postgres, create a table for each source relation and load the CSV data into these tables, and create views representing the target relations.

Modifying the example configuration

Let's create a copy of the configuration file we have been using and start modifying it to see the effect of various iBench parameters.

cp exampleConfigurations/tutorialOneCopy.txt exampleConfigurations/myconfig.txt

Data generation

Number of generated rows per table

So far the CSV file generated by iBench only contains 10 rows. This is controlled by the parameter RepElementCount. Let's change this parameter to RepElementCount = 100, change the output directory (SchemaPathPrefix=myout and InstancePathPrefix=myout), and then run iBench.

cd build
./iBench.sh -c ../exampleConfigurations/myconfig.txt

Now inspect the generated CSV file to check that indeed 100 rows were generated by iBench.

Controlling value generation

So far we have used iBench's default data generator which produces arbitrary strings. Let's change this to use one of the built-in data generators instead. In the iBench configuration file you can declare what data types should be used, and what is the probability of each of the declared data types to be used. iBench comes equipped with a set of built-in data types. However, you can extend the system with new data types (see here). For now, let's stick to the built-in data types and instruct iBench to use emails with 60% probability and phone numbers with 40% probability.

By setting NumDataType we tell iBench that we want to use two custom data types.

DataType.NumDataType = 2

Now for each data type we have to give it a name, have to provide the fully classified name of the class implementing the data types, declare its probability of being used (as a percentage), and let iBench know what SQL data type should be used to store values of this data type. In our example, the classes are toxgene.util.cdata.xmark.Emails and toxgene.util.cdata.xmark.PhoneNumbers.

DataType.0.Name = myEmail
DataType.0.ClassPath = toxgene.util.cdata.xmark.Emails
DataType.0.Percentage = 60.0
DataType.0.DBType = TEXT

DataType.1.Name = myPhone
DataType.1.ClassPath = toxgene.util.cdata.xmark.PhoneNumbers
DataType.1.Percentage = 40.0
DataType.1.DBType = TEXT

Also let's increase the number of attributes per table and maximal length of values:

ConfigOptions.NumOfSubElements = 10
MaxStringLength = 100

Now let's run this modified scenario with iBench and check that the generated CSV file now contains emails and phone numbers.

cd build
./iBench -c ../exampleConfigurations/myconfig.txt
cat myout/test_cp_0_nl0_ce0.csv 

1|[email protected]|[email protected]|Ragunath|[email protected]|[email protected]|608-772-8040|Taisook.Dan@cwru|767-88|Peiwei.Solares@uni-freiburg|106-919-7847
2|[email protected]|[email protected]|[email protected]|[email protected]|[email protected]|268-894-7086|[email protected]|134-814-5514|[email protected]|216-467-7399
3|Teruyuki.D'Hooge@newpaltz|[email protected]|Sammy.Inamura|[email protected]|[email protected]|930-531-7678|[email protected]|247-999-5642|[email protected]|656-814-4562
4|Toshiko.Thoben@r|[email protected]|[email protected]|[email protected]|[email protected]|135-253-3362|Kam|479-826-8807|[email protected]|450-830-6324
5|Debasis|[email protected]|[email protected]|[email protected]|[email protected]|397-215-1367|Udaiprakash.Y|288-102-6155|[email protected]|858-719-7342
6|[email protected]|[email protected]|[email protected]|[email protected]|[email protected]|640-734-6583|Jouko.Nutt@|236-867-6759|[email protected]|566-398-8531

Primitives

Now let's use some different primitives in addition to COPY.

User-defined Primitives

TODO

Deviation of parameters

Many parameters in iBench can not just be set to a fixed value, but the user can provide a standard deviation for the parameter. iBench will then pick a random value using a normal distribution with median set to the configuration parameter and the standard deviation provided by the user. For instance, let's create 10 copy primitives and let's vary the number of attributes:

Scenarios.COPY = 10
ConfigOptions.NumOfSubElements = 5
ConfigOptionsDeviation.NumOfSubElements = 2