- Kafka is not used for data on buildings, generators, consumers, storage, ... (no streaming data)
- Streaming data is collected in different phases of the customer journey
- collected as datatype string, processed in a Kafka stream (cast, rounded, plausibility-checked, etc.) and assigned to an ID
Within the database:
- Timestamp, timeseries_id, value (float / integer / string)
- Mieterstrom-relevant data initially delivered as CSV lists
- Data transfer via CSV change lists (Wechsellisten) or MSCONS
- API to make this data available for OPERATE tech
Schema:
{'GERAET_NR':string, 'GERAET_ZAEHLERPLATZ':string, 'GERAET_ZAEHLERTYP':string, 'GERAET_WFAKTOR':int, 'MESSART':string}
- Transformed with Kafka and loaded into the database
- The data is not collected in different phases of the customer journey -> data is assigned to a specific use case
- timeseries_id is a hashed int value
{'Timestamp':string, 'timeseries_id':string, 'value':string}
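The stream processing described above (cast the string value, round, plausibility-check, map to a hashed integer `timeseries_id`) could look roughly like the following sketch. The hashing scheme, the rounding precision, and the plausibility bounds are assumptions for illustration, not the actual pipeline logic:

```python
import hashlib

def to_timeseries_id(device_id: str) -> int:
    """Hypothetical key derivation: map a device identifier to a
    hashed integer timeseries_id (the real scheme is not specified)."""
    return int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 10**9

def process_record(raw):
    """Cast, round and plausibility-check one raw stream record of the
    schema {'Timestamp': str, 'timeseries_id': str, 'value': str}."""
    try:
        value = round(float(raw["value"]), 2)   # cast string -> float, then round
    except ValueError:
        return None                             # non-numeric value -> drop record
    if not (0.0 <= value < 1e6):                # assumed plausibility bounds
        return None
    return {
        "timestamp": raw["Timestamp"],
        "timeseries_id": to_timeseries_id(raw["timeseries_id"]),
        "value": value,
    }
```

In a real Kafka Streams topology this would sit in a map/filter stage between the input and output topics; dropped records could instead be routed to a dead-letter topic.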
δ-Doca Link
Implementation Link (based on this Master Thesis)
δ-DOCA is a strategy to anonymize data streams in a non-interactive context by adding noise directly to the data. It consists of two stages: domain bounding by δ, and DOCA. In the first stage, the data domain is defined and widened by a δ, which yields the sensitivity value for differential privacy. In the second stage, the utility-improvement stage, an online microaggregation is performed before noise is added to the data.
Parameter | Description | Default |
---|---|---|
ε | privacy budget | |
delay constraint | maximum number of tuples that can be active | |
b | maximum number of clusters that can be active | |
μ | maximum number of clusters that are used to calculate information loss | |
CASTLEGUARD Link
Implementation Link
CASTLEGUARD is a data stream anonymization approach that provides a reliable guarantee of k-anonymity, l-diversity, and non-interactive differential privacy based on the parameters l, β, and φ. It achieves differential privacy for data streams by sampling entries from an input data stream with probability β and using additive noise drawn from a Laplace distribution with mean 0, whose scale is controlled by φ.
Parameter | Description | Default |
---|---|---|
l | used to enforce l-diversity, which ensures that each group of k-anonymized tuples contains at least l different values for the sensitive attribute | |
k | used to enforce k-anonymity, which ensures that each quasi-identifier appears at least k-times in a cluster | |
β | used for β-sampling, which means that each incoming tuple is randomly sampled/discarded with probability β | |
φ | used for perturbation, which adds noise to the quasi-identifiers in the data stream to protect privacy. A higher value of φ results in more noise being added | |
b | maximum number of clusters that can be active | |
δ | maximum number of tuples that can be active (delay constraint) | |
μ | number of recently published clusters used to calculate the average information loss τ | |
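CASTLEGUARD's differential-privacy front end (β-sampling plus φ-perturbation, as described above) can be sketched as follows. The function name `castleguard_ingest` and the use of φ directly as the Laplace scale are assumptions for illustration; the reference implementation scales the noise per attribute:

```python
import math
import random

def castleguard_ingest(tuples, beta, phi):
    """Sketch of CASTLEGUARD's DP front end: keep each incoming tuple
    only with probability beta (beta-sampling), then add zero-mean
    Laplace noise, whose scale grows with phi, to every numeric
    quasi-identifier of the kept tuples."""
    def laplace(scale):
        u = random.uniform(-0.5, 0.5)
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    kept = []
    for t in tuples:
        if random.random() > beta:          # suppress tuple with probability 1 - beta
            continue
        kept.append({k: v + laplace(phi) for k, v in t.items()})  # perturb QIs
    return kept
```

A higher φ adds more noise (stronger privacy, lower utility), matching the parameter table above; β trades completeness of the output stream against plausible deniability for any individual tuple.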
- For tuple $t$ in stream $S$:
  - if $random(0,1) > \beta$ -> break (= suppress $t$)
  - else:
    - Perturb tuple $t$
    - Cluster tuple $t$ (in other words: look for the cluster with the best-fitting generalization; same technique as the δ-DOCA algorithm):
      - Get the set of cluster(s) $C_{min}$ that have the minimum enlargement
      - For each cluster $c \in C_{min}$: check if $infoLoss(c, t)$ would be smaller than the average information loss $\tau$; if so -> add $c$ to $C_{best}$
      - If $C_{best}$ is empty:
        - if non-ks-clusters in memory < allowed non-ks-clusters in memory -> create a new cluster and add tuple $t$
        - else -> add $t$ to the cluster $c_{min} \in C_{min}$ with the smallest size
      - Else -> add $t$ to a cluster $c_{best} \in C_{best}$
- Once a tuple $t'$ has reached time step $\delta$ (the delay constraint), its cluster $c_{t'}$ is evaluated for publication:
  - if $c_{t'}.size \ge k$ -> output cluster* $c_{t'}$
  - else check if tuple $t'$ is in an already ks-anonymized cluster $c_{ks-a}$ (reuse strategy); if so -> use the generalization of $c_{ks-a}$
  - else check if cluster $c_{t'}$ is an outlier (= smaller than half of the non-ks-anonymized clusters); if so -> use the most used generalization
  - else check if a merge with other clusters is impossible (sum of the sizes of all non-ks-clusters $< k$); if so -> use the most used generalization
  - else merge clusters:
    - check the enlargement of cluster $c_{t'}$ after a possible merge with each non-ks-cluster; choose the cluster with the minimum enlargement
    - repeat until $cluster_{merged}.size \ge k$
    - output the merged cluster*
- *Output cluster:
  - Check if the cluster can be split (by the split_l()-function, for k-anonymity and l-diversity) if $C.size \ge 2k$
  - Add the (split) cluster(s) to $SC$
  - For each cluster $sc$ in $SC$:
    - Output all tuples $t_{sc} \in sc$ with their generalization
    - Update $\tau$ with $infoLoss(sc)$
    - if $infoLoss(sc)$ is smaller than $\tau$ -> add the cluster to the set of ks-anonymized clusters (for the reuse strategy); else delete cluster $sc$
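The cluster-selection step above (minimum enlargement, information-loss check against τ, fallback to a new or smallest cluster) can be sketched for a single numeric quasi-identifier, where a cluster's generalization is just a `(lo, hi)` range. `choose_cluster` and the range-width loss proxy are illustrative simplifications, not the CASTLEGUARD implementation:

```python
def choose_cluster(t, clusters, tau, max_clusters):
    """Pick the cluster whose generalization range grows least when
    absorbing t, but only if its information loss stays below the
    running average tau; otherwise signal that a new cluster should
    be opened (None) while memory allows, else fall back to the
    smallest existing cluster. A cluster is a (lo, hi) range."""
    def enlargement(c, x):
        lo, hi = c
        return max(0.0, lo - x) + max(0.0, x - hi)

    def info_loss(c, x):
        lo, hi = min(c[0], x), max(c[1], x)
        return hi - lo                      # range width as a simple loss proxy

    if not clusters:
        return None                         # caller creates a new cluster
    best_enl = min(enlargement(c, t) for c in clusters)
    c_min = [c for c in clusters if enlargement(c, t) == best_enl]
    c_best = [c for c in c_min if info_loss(c, t) <= tau]
    if c_best:
        return c_best[0]
    if len(clusters) < max_clusters:
        return None                         # room for a new (non-ks) cluster
    return min(c_min, key=lambda c: c[1] - c[0])  # smallest candidate cluster
```

For example, with clusters `(0, 1)` and `(5, 6)`, the tuple `1.5` enlarges the first range by only `0.5`, so it is chosen as long as the resulting loss stays under τ.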