Home
FabGuard is a Python library that simplifies input file verification. It is based on the data validation library Pandera and adapted for FabFlee. This documentation will guide you through the steps to use FabGuard for input file verification.
Before you get started with FabGuard, make sure you have the following prerequisites in place:
- The Pandas library installed.
- The Pandera library installed.
FabGuard is a plugin for FabFlee. The structure of the FabGuard folder is as follows:
- tests folder: Contains schemes (tests) for various input files. For example, the closure_scheme.py file contains verification tests for the closure.csv file.
- config.py: Contains configuration information, including the names of the test input files.
- error_messages.py: Contains error messages used in the verification checks.
- fab_guard.py: The main wrapper for Pandera tests. It defines decorators, such as fg.log for functions defining error messages and fg-check for functions that should be executed as part of the test suite. It also provides utility functions like load_files for reading a CSV file and returning a DataFrame, and transpose for transposing a CSV file.
Each scheme file contains a class that inherits from pa.DataFrameModel.
To avoid reading the same input repeatedly, all test files are loaded into memory only once. This is handled by the singleton class FabGuard. Load a file through the shared instance as follows:
FabGuard.get_instance().load_file(config.routes)
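The load-once behaviour can be pictured with a minimal sketch of a singleton that caches parsed files. This is not FabGuard's actual implementation: the class name `FileCache`, its `load_file` signature, the `reads` counter, and the sample CSV content are all assumptions made for illustration.

```python
import csv
import os
import tempfile


class FileCache:
    """Hypothetical singleton that parses each CSV file at most once."""

    _instance = None

    def __init__(self):
        self._tables = {}  # path -> list of row dictionaries
        self.reads = 0     # how many times a file was actually read from disk

    @classmethod
    def get_instance(cls):
        # Create the shared instance lazily on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def load_file(self, path):
        # Touch the disk only if this path has not been loaded before.
        if path not in self._tables:
            with open(path, newline="") as handle:
                self._tables[path] = list(csv.DictReader(handle))
            self.reads += 1
        return self._tables[path]


# Demonstration with a throwaway CSV file (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "routes.csv")
    with open(path, "w", newline="") as handle:
        handle.write("name1,name2,distance\nA,B,10\n")

    first = FileCache.get_instance().load_file(path)
    second = FileCache.get_instance().load_file(path)
```

Both calls return the same cached object, so every test that needs the file shares one in-memory copy.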
In this guide, we will create tests for the locations.csv file as an example. (Such tests already exist, and you can use the location_scheme.py file as a reference.) Follow these steps to create your tests:
- Create a new Python file location_test.py in the FabGuard\tests folder.
- Create a Python class LocationsSchemeTest that inherits from pa.DataFrameModel:
class LocationsSchemeTest(pa.DataFrameModel):
- In this class, define constraints for each column as fields of the class. For example, if you have a location file with columns like region, country, lat, and lon, you can specify their data types as follows:
region: Series[pa.String] = pa.Field()
country: Series[pa.String] = pa.Field()
lat: Series[pa.Float] = pa.Field()
lon: Series[pa.Float] = pa.Field()
You can refine the data-type constraints further by passing parameters to the Field constructor. All built-in Pandera checks are available as keyword arguments to Field.
location_type: Series[pa.String] = pa.Field(
isin = ["conflict_zone", "town","camp", "forwarding_hub", "marker", "idpcamp"])
conflict_date: Series[float] = pa.Field(nullable=True)
population: Series[float] = pa.Field(ge=0,nullable=True)
- Above we use the built-in Pandera constraints ge (greater than or equal to) and isin (membership in a list). The full list of built-in checks is given in the Pandera documentation.
- To specify that a field cannot be null, set the nullable argument to False.
Finally, we specify the constraints for the column name:
name: Series[pa.String] = pa.Field(nullable=False, alias='#"name"')
We use the alias parameter to specify the real name of the column. The header in the CSV file is #"name", which cannot be used as a Python attribute name because # starts a comment in Python.
- Add the file to tests/__init__.py.
- Finally, register your file for testing. In registry.py, in the test_all_files function, add self.register_for_test(<filename.classname>, <name of the file>). Your registry.py file should look as below:
@fgcheck
def test_all_files(self):
    # self.register_for_test(location_scheme.LocationsScheme, config.locations)
    self.register_for_test(location_test.LocationsSchemeTest, config.locations)
    # self.register_for_test(routes_scheme.RoutesScheme, config.routes)
    # self.register_for_test(closures_scheme.ClosuresScheme, config.closures)
Notice that we have commented out all tests other than LocationsSchemeTest to avoid noise in your testing.
- Execute the test on the config files for a particular conflict. Below we run the test for the conflict CAR. The test folder is config_files/car:
fabsim localhost flee_verify_input:car
In addition to the simple built-in tests and data constraints we saw above, you can perform two additional types of checks: column-level checks and dataframe-level checks.
Column-level checks define simple tests that apply to all values in a column. For example, to check if names are valid in the location file, use the decorator as follows:
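@pa.check("name1", element_wise=True)
def names_in_routes(cls, name1):
    # Define your test logic here; with element_wise=True, name1 is a single value.
    return True  # placeholder: return False for an invalid value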
For conditional tests or multi-column constraints, create custom test methods within the class. Use the @pa.dataframe_check decorator to mark these methods for testing. For example, to ensure that the country value in the locations.csv file matches the country in row 0 when the location type is "conflict_zone," use the following code:
@pa.dataframe_check()
def conflict_zone_country_should_be_0(cls, df: pd.DataFrame) -> Series[bool]:
    country = df["country"][0]
    mask = ((df["location_type"] == "conflict_zone") & (df["country"] != country))
    if mask.any():  # Check if any rows meet the condition
        raise ValueError(Errors.location_country_err(df.index[mask], config.locations))
    return ~mask
To create more complex masks, you can use & (and) and | (or) operators.
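A short pandas-only sketch of combining conditions into one mask follows. The column values and the rule itself are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "location_type": ["conflict_zone", "camp", "town", "camp"],
    "country": ["X", "Y", "X", "X"],
    "population": [500.0, -1.0, 200.0, 0.0],
})

# Build each condition as its own boolean Series. When comparisons are
# written inline instead, wrap each one in parentheses, because & and |
# bind more tightly than == and <.
is_camp = df["location_type"] == "camp"
bad_population = df["population"] < 0
foreign = df["country"] != df["country"].iloc[0]

# Rows that are camps AND (have a negative population OR lie in another country).
mask = is_camp & (bad_population | foreign)

# Index labels of the offending rows, suitable for an error message.
faulty_rows = list(df.index[mask])
```

Only row 1 is a camp that also violates one of the two conditions, so `faulty_rows` contains just that index.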
Note that all error messages are stored in the error_messages.py file. Pass the indices of the faulty rows (df.index[mask]) and the name of the input file under test (config.locations) to ensure that the correct error message is logged. Errors are logged in the logs.txt file located in the FabGuard folder.