Support for ActorPacks #66

Merged
merged 8 commits into from
Dec 7, 2022
5 changes: 5 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,5 @@
include tagpack/db/*.cql
include tagpack/db/*.csv
include tagpack/conf/tagpack_schema.yaml
include tagpack/conf/actorpack_schema.yaml
include tagpack/conf/confidence.csv
46 changes: 33 additions & 13 deletions README.md
@@ -5,9 +5,10 @@
This repository provides a command line tool for managing [GraphSense TagPacks](https://github.com/graphsense/graphsense-tagpacks/wiki/GraphSense-TagPacks). It can be used for

1. [validating TagPacks against the TagPack schema](#validation)
2. [handling taxonomies and concepts](#taxonomies)
3. [ingesting TagPacks and related data into a TagStore](#tagstore)
4. [calculating the quality of the tags in the TagStore](#quality)
2. [validating ActorPacks against the ActorPack schema](#actorpack_validation)
3. [handling taxonomies and concepts](#taxonomies)
4. [ingesting TagPacks and related data into a TagStore](#tagstore)
5. [calculating the quality of the tags in the TagStore](#quality)

Please note that the last feature requires (installation of) a [Postgresql](https://www.postgresql.org/) database.

@@ -19,24 +20,37 @@ Please note that the last feature requires (installation of) a [Postgresql](http

Validate a single TagPack file

tagpack-tool tagpack validate tests/testfiles/ex_addr_tagpack.yaml
tagpack-tool tagpack validate tests/testfiles/ex_entity_tagpack.yaml
tagpack-tool tagpack validate tests/testfiles/simple/ex_addr_tagpack.yaml

Recursively validate all TagPacks in (a) given folder(s).

tagpack-tool tagpack validate tests/testfiles/

Tagpacks are validated against the [tagpack schema](tagpack/conf/tagpack_schema.yaml).

Confidence settings are validated against a set of acceptable [confidence](tagpack/conf/confidence.csv) values.
Confidence settings are validated against a set of acceptable [confidence](tagpack/db/confidence.csv) values.

## Validate an ActorPack <a name="actorpack_validation"></a>

Validate a single ActorPack file

tagpack-tool actorpack validate tests/testfiles/actors/ex_actorpack.yaml

Recursively validate all ActorPacks in (a) given folder(s).

tagpack-tool actorpack validate tests/testfiles/actors/

Actorpacks are validated against the [actorpack schema](tagpack/conf/actorpack_schema.yaml).

Values in the field jurisdictions are validated against a set of [country codes](src/tagpack/db/countries.csv).

## View available taxonomies and concepts <a name="taxonomies"></a>

List configured taxonomy keys and URIs

tagpack-tool taxonomy list

Fetch and show concepts of a specific remote taxonomy (referenced by key)
Fetch and show concepts of a specific remote/local taxonomy (referenced by key: abuse, entity, confidence, country)

tagpack-tool taxonomy show entity

@@ -97,23 +111,20 @@ To use a specific config file pass the file's location:

tagpack-tool --config path/to/config.yaml config




### Initialize the tagstore database

To initialize the database with all the taxonomies needed for ingesting the tagpacks, use:

tagpack-tool tagstore init


### Ingest taxonomies and confidence scores

To insert individual taxonomies into the database, use:

tagpack-tool taxonomy insert abuse
tagpack-tool taxonomy insert entity
tagpack-tool taxonomy insert confidence
tagpack-tool taxonomy insert country

To insert all configured taxonomies at once, simply omit the taxonomy name

@@ -145,13 +156,22 @@ To ingest **new** tagpacks and **skip** over already ingested tagpacks, add the
By default, trying to insert tagpacks from a repository with **local** modifications will **fail**.
To force insertion despite local modifications, add the ``--no_strict_check`` command-line parameter

tagpack-tool tagpack insert --force --add_new tests/testfiles/
tagpack-tool tagpack insert --no_strict_check tests/testfiles/

By default, tagpacks in the TagStore provide a backlink to the original tagpack file in their remote git repository ([see here](README_tagpacks.md#versioning-with-git)).
To instead write local file paths instead, add the ``--no_git`` command-line parameter
To write local file paths instead, add the ``--no_git`` command-line parameter

tagpack-tool tagpack insert --no_git --add_new tests/testfiles/

### Ingest ActorPacks

Insert a single ActorPack file or all ActorPacks from a given folder:

tagpack-tool actorpack insert tests/testfiles/simple/ex_addr_actorpack.yaml
tagpack-tool actorpack insert tests/testfiles/

You can use the `--force`, `--add_new`, `--no_strict_check` and `--no_git` options in the same way as with the `tagpack` command.

### Align ingested attribution tags with GraphSense cluster Ids

The final step after inserting a tagpack is to fetch the corresponding
14 changes: 14 additions & 0 deletions src/tagpack/__init__.py
@@ -2,6 +2,8 @@

import sys

import yaml

if sys.version_info[:2] >= (3, 8):
    # TODO: Import directly (no need for conditional) when `python_requires = >= 3.8`
    from importlib.metadata import PackageNotFoundError, version  # pragma: no cover
@@ -48,3 +50,15 @@ def __str__(self):
        if self.nested_exception:
            msg = msg + "\nError Details: " + str(self.nested_exception)
        return msg


# https://gist.github.com/pypt/94d747fe5180851196eb
class UniqueKeyLoader(yaml.FullLoader):
    def construct_mapping(self, node, deep=False):
        mapping = set()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in mapping:
                raise ValidationError(f"Duplicate {key!r} key found in YAML.")
            mapping.add(key)
        return super().construct_mapping(node, deep)
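
The idea behind `UniqueKeyLoader` is to inspect the raw key/value pairs of a mapping before they are collapsed into a dictionary, where a repeated key would silently overwrite the earlier one. The sketch below illustrates the same check with the standard library's `json` module instead of PyYAML, so it runs without third-party packages. It is an analogy, not the loader from this PR: `DuplicateKeyError`, `reject_duplicate_keys` and `load_unique` are hypothetical names introduced only for this example.

```python
import json


class DuplicateKeyError(ValueError):
    """Hypothetical stand-in for tagpack's ValidationError."""


def reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, failing on the first repeated key."""
    seen = set()
    for key, _ in pairs:
        if key in seen:
            raise DuplicateKeyError(f"Duplicate {key!r} key found in JSON.")
        seen.add(key)
    return dict(pairs)


def load_unique(text):
    # object_pairs_hook receives the raw pairs of every JSON object
    # (nested ones included) before they are turned into a dict.
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


if __name__ == "__main__":
    print(load_unique('{"id": "a", "label": "A"}'))
    try:
        load_unique('{"id": "a", "id": "b"}')
    except DuplicateKeyError as exc:
        print(exc)
```

As in the YAML loader, the check happens per mapping, so the same key may still appear in two different nested objects.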
207 changes: 207 additions & 0 deletions src/tagpack/actorpack.py
@@ -0,0 +1,207 @@
"""ActorPack - A wrapper for ActorPack files"""
import json
import os
import sys

import yaml
from yamlinclude import YamlIncludeConstructor

from tagpack import TagPackFileError, UniqueKeyLoader, ValidationError
from tagpack.cmd_utils import print_info


class ActorPack(object):
    """Represents an ActorPack"""

    def __init__(self, uri, contents, schema, taxonomies):
        self.uri = uri
        self.contents = contents
        self.schema = schema
        self.taxonomies = taxonomies
        self._unique_actors = []
        self._duplicates = []

    def load_from_file(uri, pathname, schema, taxonomies, header_dir=None):
        YamlIncludeConstructor.add_to_loader_class(
            loader_class=yaml.FullLoader, base_dir=header_dir
        )

        if not os.path.isfile(pathname):
            sys.exit("This program requires {} to be a file".format(pathname))
        contents = yaml.load(open(pathname, "r"), UniqueKeyLoader)

        if "header" in contents.keys():
            for k, v in contents["header"].items():
                contents[k] = v
            contents.pop("header")
        return ActorPack(uri, contents, schema, taxonomies)

    @property
    def all_header_fields(self):
        """Returns all ActorPack header fields, including generic actor fields"""
        try:
            return {k: v for k, v in self.contents.items()}
        except AttributeError:
            raise TagPackFileError("Cannot extract ActorPack fields")

    @property
    def header_fields(self):
        """Returns only ActorPack header fields that are defined as such"""
        try:
            return {
                k: v for k, v in self.contents.items() if k in self.schema.header_fields
            }
        except AttributeError:
            raise TagPackFileError("Cannot extract ActorPack fields")

    @property
    def actor_fields(self):
        """Returns actor fields defined in the ActorPack header"""
        try:
            return {
                k: v
                for k, v in self.contents.items()
                if k != "actors" and k in self.schema.actor_fields
            }
        except AttributeError:
            raise TagPackFileError("Cannot extract ActorPack fields")

    @property
    def actors(self):
        """Returns all actors defined in an ActorPack's body"""
        try:
            return [
                Actor.from_contents(actor, self) for actor in self.contents["actors"]
            ]
        except AttributeError:
            raise TagPackFileError("Cannot extract actors from ActorPack")

    def get_unique_actors(self):
        if self._unique_actors:
            return self._unique_actors

        seen = set()
        duplicates = []

        for actor in self.actors:
            # check if duplicate entry
            t = tuple(str(actor.all_fields.get(k)).lower() for k in ["id", "label"])
            if t in seen:
                duplicates.append(t)
            else:
                seen.add(t)
                self._unique_actors.append(actor)

        self._duplicates = duplicates
        return self._unique_actors

    def validate(self):
        """Validates an ActorPack against its schema and used taxonomies"""

        # check if mandatory header fields are used by an ActorPack
        for schema_field in self.schema.mandatory_header_fields:
            if schema_field not in self.header_fields:
                msg = f"Mandatory header field {schema_field} missing"
                raise ValidationError(msg)

        # check header fields' types, taxonomy and mandatory use
        for field, value in self.all_header_fields.items():
            # check a field is defined
            if field not in self.schema.all_fields:
                raise ValidationError(f"Field {field} not allowed in header")
            # check for None values
            if value is None:
                msg = f"Value of header field {field} must not be empty (None)"
                raise ValidationError(msg)

            self.schema.check_type(field, value)
            self.schema.check_taxonomies(field, value, self.taxonomies)

        if len(self.actors) < 1:
            raise ValidationError("No actors found.")

        # iterate over all actors, check types, taxonomy and mandatory use
        e2 = "Mandatory tag field {} missing in {}"
        e3 = "Field {} not allowed in {}"
        e4 = "Value of body field {} must not be empty (None) in {}"
        for actor in self.get_unique_actors():
            # check if mandatory actor fields are defined
            if not isinstance(actor, Actor):
                raise ValidationError(f"Unknown actor type {type(actor)}")

            for schema_field in self.schema.mandatory_actor_fields:
                if (
                    schema_field not in actor.explicit_fields
                    and schema_field not in self.actor_fields
                ):
                    raise ValidationError(e2.format(schema_field, actor))

            for field, value in actor.explicit_fields.items():
                # check whether field is defined as body field
                if field not in self.schema.actor_fields:
                    raise ValidationError(e3.format(field, actor))

                # check for None values
                if value is None:
                    raise ValidationError(e4.format(field, actor))

                # check types and taxonomy use
                try:
                    self.schema.check_type(field, value)
                    self.schema.check_taxonomies(field, value, self.taxonomies)
                except ValidationError as e:
                    raise ValidationError(f"{e} in {actor}")

        if self._duplicates:
            msg = (
                f"{len(self._duplicates)} duplicate(s) found, starting "
                f"with {self._duplicates[0]}\n"
            )
            print_info(msg)
        return True

    def to_json(self):
        """Returns a JSON representation of an ActorPack's header"""
        actorpack = {}
        for k, v in self.header_fields.items():
            if k != "actors":
                actorpack[k] = v
        return json.dumps(actorpack, indent=4, sort_keys=True, default=str)

    def __str__(self):
        """Returns a string serialization of the entire ActorPack"""
        return str(self.contents)
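
The duplicate handling in `get_unique_actors` can be tried in isolation. The sketch below is a simplified restatement that works on plain dictionaries instead of `Actor` objects and does no caching; `split_unique` and the sample actor values are hypothetical and exist only for illustration. It keeps the comparison rule from the method above: the `id` and `label` values are stringified and lower-cased before being compared.

```python
def split_unique(actors, key_fields=("id", "label")):
    """Return (unique, duplicates), comparing key fields case-insensitively."""
    seen = set()
    unique, duplicates = [], []
    for actor in actors:
        # str() mirrors the original: a missing field becomes the string "none"
        key = tuple(str(actor.get(k)).lower() for k in key_fields)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
            unique.append(actor)
    return unique, duplicates


if __name__ == "__main__":
    sample = [
        {"id": "binance", "label": "Binance"},
        {"id": "Binance", "label": "BINANCE"},
        {"id": "kraken", "label": "Kraken"},
    ]
    unique, duplicates = split_unique(sample)
    print(len(unique), duplicates)
```

One consequence of the stringified key is that two actors which both lack `id` and `label` compare as equal, since each yields the key `("none", "none")`. In `validate`, duplicates are only reported through `print_info`, and they do not raise a `ValidationError`.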


class Actor(object):
    """An actor"""

    def __init__(self, contents, actorpack):
        self.contents = contents
        self.actorpack = actorpack

    @staticmethod
    def from_contents(contents, actorpack):
        return Actor(contents, actorpack)

    @property
    def explicit_fields(self):
        """Return only explicitly defined actor fields"""
        return {k: v for k, v in self.contents.items()}

    @property
    def all_fields(self):
        """Return all actor fields (explicit and generic)"""
        return {
            **self.actorpack.actor_fields,
            **self.explicit_fields,
        }

    def to_json(self):
        """Returns a JSON serialization of all actor fields"""
        actor = self.all_fields
        return json.dumps(actor, indent=4, sort_keys=True, default=str)

    def __str__(self):
        """Returns a string serialization of an Actor"""
        return str(self.all_fields)
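
Two pieces of this file work together to give each actor its effective fields: `load_from_file` hoists the entries of an optional `header` mapping to the top level, and `Actor.all_fields` merges pack-level actor fields with the actor's own fields, with the actor's values taking precedence. The sketch below reproduces that behaviour on plain dictionaries. The helper names, the sample pack and the set of actor field names are assumptions made for the example; the real field names come from the ActorPack schema, which is not shown in this diff. Unlike the original, `flatten_header` returns a copy rather than modifying its input.

```python
def flatten_header(contents):
    """Hoist entries of an optional 'header' mapping to the top level."""
    flat = dict(contents)  # shallow copy; the original code mutates in place
    header = flat.pop("header", None)
    if header:
        flat.update(header)
    return flat


def effective_fields(pack_contents, actor_contents, actor_field_names):
    """Merge pack-level defaults with an actor's own fields; the actor wins."""
    defaults = {
        k: v
        for k, v in pack_contents.items()
        if k != "actors" and k in actor_field_names
    }
    # Later entries override earlier ones, as in Actor.all_fields
    return {**defaults, **actor_contents}


if __name__ == "__main__":
    # Hypothetical pack and field names, for illustration only
    pack = {
        "title": "Demo",
        "header": {"jurisdictions": ["AT"]},
        "actors": [
            {"id": "a", "label": "A"},
            {"id": "b", "label": "B", "jurisdictions": ["DE"]},
        ],
    }
    names = {"id", "label", "jurisdictions"}
    flat = flatten_header(pack)
    for actor in flat["actors"]:
        print(effective_fields(flat, actor, names))
```

With this sample, the first actor inherits the pack-level `jurisdictions` value, while the second keeps its own.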