rumineykova edited this page Oct 25, 2023 · 22 revisions

FabGuard - FabFlee Input File Verification with Pandera

Introduction

FabGuard is a Python library that simplifies input file verification. It is based on the data validation library Pandera and adapted for FabFlee. This documentation will guide you through the steps to use FabGuard for input file verification.

Prerequisites

Before you get started with FabGuard, make sure you have the following prerequisites in place:

  • The Pandas library installed.
  • The Pandera library installed.

Project Structure

FabGuard is a plugin for FabFlee. The structure of the FabGuard folder is as follows:

  • tests Folder: Contains schemes (tests) for various input files. For example, the closure_scheme.py file contains verification tests for the closure.csv file.

  • config.py: Contains configuration information, including the names of the test input files.

  • error_messages.py: Contains error messages used in the verification checks.

  • fab_guard.py: The main wrapper for Pandera tests. It defines decorators, such as fg.log for functions defining error messages and fg-check for functions that should be executed as part of the test suite. It also provides utility functions like load_files for reading a CSV file and returning a DataFrame, and transpose for transposing a CSV file.

Each scheme file contains a class that inherits from pa.DataFrameModel.

Important Util Functions

To use resources efficiently, each test file is loaded into memory only once, which avoids repeated reads from disk. The singleton class FabGuard provides this behaviour. Load a file through the shared instance as follows:

FabGuard.get_instance().load_file(config.routes) 
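The load-once behaviour described above can be illustrated with a minimal sketch. This is not FabGuard's actual implementation: the class name FileCacheSketch and the reads counter are hypothetical, and only get_instance and load_file mirror names used on this page.

```python
import pandas as pd


class FileCacheSketch:
    """Hypothetical singleton that reads each CSV file from disk at most once."""

    _instance = None

    def __init__(self):
        self._frames = {}  # maps file path -> cached DataFrame
        self.reads = 0     # counts real disk reads, for illustration only

    @classmethod
    def get_instance(cls):
        # Create the shared instance lazily on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def load_file(self, path):
        # Read the file only if it has not been cached yet.
        if path not in self._frames:
            self._frames[path] = pd.read_csv(path)
            self.reads += 1
        return self._frames[path]
```

Every caller that goes through get_instance() shares the same cache, so a second load_file call for the same path returns the DataFrame already in memory.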

How-To: Creating Tests for an Input File

In this guide, we will create tests for the locations.csv file as an example. (Such tests already exist, and you can use the location_scheme.py file as a reference.) Follow these steps to create your tests:

  1. Create a new Python file, location_test.py, in the FabGuard\tests folder.
  2. Create a Python class LocationsSchemeTest that inherits from pa.DataFrameModel.
class LocationsSchemeTest(pa.DataFrameModel):
  3. In this class, define constraints for each column as fields of the class. For example, if you have a location file with columns like region, country, lat, and lon, you can specify their data types as follows:
region: Series[pa.String] = pa.Field()
country: Series[pa.String] = pa.Field()
lat: Series[pa.Float] = pa.Field()
lon: Series[pa.Float] = pa.Field()

You can refine the data-type constraints further by passing parameters to the Field constructor. All built-in Pandera checks are available as keyword arguments to Field.

location_type: Series[pa.String] = pa.Field(
    isin=["conflict_zone", "town", "camp", "forwarding_hub", "marker", "idpcamp"])
conflict_date: Series[float] = pa.Field(nullable=True)
population: Series[float] = pa.Field(ge=0,nullable=True)
  • Above we use the built-in Pandera constraints ge (greater than or equal to) and isin (value is in a list). The Pandera documentation gives the full list of built-in checks.

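  • To specify that a field cannot be null, set the nullable argument to False, which is also Pandera's default. Setting nullable=True, as for conflict_date and population above, allows missing values.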

Finally, we specify the constraints for the column name:

    name: Series[pa.String] = pa.Field(nullable=False, alias='#"name"')

We use the alias parameter to give the real name of the column, #"name", because that header is not a valid Python identifier (# starts a comment in Python).

  4. Add the file to tests/__init__.py.
  5. Finally, register your file for testing. In registry.py, inside the test_all_files function, add self.register_for_test(<filename.classname>, <name of the file>). Your registry.py file should look like this:
@fgcheck
def test_all_files(self):
    # self.register_for_test(location_scheme.LocationsScheme, config.locations)
    self.register_for_test(location_test.LocationsSchemeTest, config.locations)
    # self.register_for_test(routes_scheme.RoutesScheme, config.routes)
    # self.register_for_test(closures_scheme.ClosuresScheme, config.closures) 

Note that all tests except LocationsSchemeTest are commented out, to avoid noise in your test output.

  6. Execute the test on the config files for a particular conflict. Below we run the test for the conflict CAR. The test folder is config_files/car:
fabsim localhost flee_verify_input:car

How-To: More Interesting Constraints

In addition to the simple built-in tests and data constraints we saw above, you can perform two additional types of checks: column-level checks and dataframe-level checks.

Column-level checks

Column-level checks define simple tests that apply to all values in a column. For example, to check if names are valid in the location file, use the decorator as follows:

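@pa.check("name1", element_wise=True)
def names_in_routes(cls, name1):
    # With element_wise=True, name1 is a single value from the column.
    # Define your test logic here and return True if the value is valid.
    ...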

Dataframe-level checks

For conditional tests or multi-column constraints, create custom test methods within the class. Use the @pa.dataframe_check decorator to mark these methods for testing. For example, to ensure that the country value in the locations.csv file matches the country in row 0 when the location type is "conflict_zone," use the following code:

@pa.dataframe_check()
def conflict_zone_country_should_be_0(cls, df: pd.DataFrame) -> Series[bool]:
   country = df["country"][0]
   mask = ((df["location_type"] == "conflict_zone") & (df["country"] != country))
   
   if mask.any():  # Check if any rows meet the condition
       raise ValueError(Errors.location_country_err(df.index[mask], config.locations))
   return ~mask

To create more complex masks, you can use & (and) and | (or) operators.
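A short sketch of combining masks with plain Pandas follows. The sample data and variable names are made up for illustration. Each comparison needs its own parentheses, because & and | bind more tightly than == and != in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "location_type": ["conflict_zone", "camp", "town", "conflict_zone"],
    "country": ["CAR", "Chad", "CAR", "Chad"],
    "population": [500.0, None, 0.0, 120.0],
})

home_country = df["country"].iloc[0]

# AND: conflict zones that are not in the country of the first row.
is_conflict = df["location_type"] == "conflict_zone"
is_abroad = df["country"] != home_country
bad_conflict = is_conflict & is_abroad

# OR: rows whose population is missing or zero.
missing_or_zero = df["population"].isna() | (df["population"] == 0)

# df.index[mask] gives the index labels of the rows selected by a mask.
faulty_rows = list(df.index[bad_conflict])
```

With this data, faulty_rows contains only index 3, and missing_or_zero selects the rows at index 1 and 2.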

Note that all error messages are stored in the error_messages.py file. Pass the indices of the faulty rows (df.index[mask]) and the name of the input file under test (config.locations) to ensure that the correct error message is logged. Errors are logged in the logs.txt file located in the FabGuard folder.