Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[meta] HDP Declarative Programming (working draft) #16

Open
fititnt opened this issue Mar 17, 2021 · 9 comments
Open

[meta] HDP Declarative Programming (working draft) #16

fititnt opened this issue Mar 17, 2021 · 9 comments
Labels

Comments

@fititnt
Copy link
Member

fititnt commented Mar 17, 2021

Trivia:

  • HDP naming:
    • HDP = 'HDP Declarative Programming' is the default name.
    • HDP = 'Humanitarian Declarative Programming' could be one way to call when the intent of the moment is strictly humanitarian.
      • The definition of humanitarian is out of scope.

The triggering motivation

Context: the HXL-Data-Science-file-formats was aimed to use HXL as file format to direct input on softwares for data mining/machine learning/statistical analysis and, since HXL is an solution for fast sharing of humanitarian data, one problem becomes how to also make easier authorization to access data and/or minimal level of anonymization also as fast as possible (ideally in real time).

  1. The initial motivation for HDP was to be able to abstract "acceptable use policies" (AUP) both understood by humans (think from judges able to enforce/authorize usage to even local community representatives that could write rules knowing that machines could enforce then) on how data could be processed and by machines.
    1. The average scenario usage in this context is already with a huge level of lack of trust between different groups of humans.
      1. While by no means (in special because I'm from @EticaAI, and we mostly advocate for avoiding misuse of A/IS) implementing systems would not make mistakes, from the start we're already planning ways to allow auditing without actually needing to access sensitive data.
      2. Auditing rules that could be reviewed even by people outside the origin country or without knowledge of the language would make things easier,
        1. even if this means an human, how know the native language but no programming, could create an quick file to say what one term means
        2. ... and even if it means already planning ahead one way that such translations tables could be digitally signed
  2. Is possible that the average usage of HDP (if actually go beyond proof of concepts or internal usage) may actually be ways to both reference datasets (eve if they are not ready to use, but could be triggered to built) and instructions to process them
    1. To be practical, only a syntax for abstract "acceptable use policies" (AUP) without some implementation would not be usable. So this actually is a requirement.
    2. At this moment (2021-03-16) is not clear what should be very optimized for end user HDP and what could be on just a few languages (like in English and Portuguese), but the idea of already try to allow use of different natural languages to express references to datasets works as a benchmark.

Some drafted goals/restrictions (as 2021-03-16):

  1. Both the documentation on how to write the concept of HDP and the proof of concepts to implement are public domain dedication. BSD-0 can be used as alternative.

    1. No licenses or pre-authorizations to use are necessary.
  2. Be in the user creator language. This means that the underlining tool should allow exchange HDP files (that in practice means how to find datasets or manipulate them) for example in Portuguese or Russian and others could still (with help of HDP) convert the key terms of don't understand such languages

    1. The v0.7.5 already drafted this. But > v0.8.0 should improve the proof of concepts. At the moment the core_vocab already has the 6 UN official languages and, because this project was born via the @HXL-CPLP, the Portuguese language.
          2. Note that special care is done with HDP keywords and instructions that would likely to be used by people who, de facto, need to homologate how data can be used. Since often the data columns may be in the native language one or two humans with both technical skills and a way to understand the native language may need to create a filter and label such filter with tasks that accomplish what that users want and then digitally sign this filter.
    2. The inner steps of commands delegated to underlining tools (the wonderful HXL python library is a great example!) is not aimed for average end users so, for the propose of this goal and (as new tools to abstract could increase over time) to make easier localization we strictly don't grant them translations.
      1. There exists a possibility that the code editors already show usage tips in English for the underlining tools, that at least for some languages (like Portuguese) we from HXL-CPLP may translate the help messages. Something similar could be done in other languages with volunteers.
  3. The syntax of HDP must be planned in such a way that make it intentionally hard to average user save information that would make the file itself a secret (like passwords or direct URLs to private resources)

    1. Users around human rights or typical collaboration in the middle of urgent disasters may share their HDP files with average end user cloud file sharing. Since HDP files themselves may be shared across several small groups (but users only know that the dataset exists, while not requesting it), if in the worst case scenario only the data that one group is affected, the potential damage is mitigated by default.
    2. In general the idea is to allow some level of indirection for things that need to be kept private while still maintaining usability.
  4. Be offline by default and, when applicable, be air-gapped network friendly

    1. In some cases people may want to potentially use HDP to manage files on one local network because they need to work offline (or the files are too big) while using the same HDP files to share with others, but others could still use an online version.
    2. Since HDP collection of files could potentially allow really big projects, soon or later in particular who need to have access to data from several other groups could fear that such level of abstraction could lead to being targeted. While this is actually not a problem unique to HDP potential usage and most documentation is likely to be focused to help who consumes sensitive data on the last mile, at least allow applicability of air-gapped network seems reasonable.
    3. Please note that each case is a case. By "being offline by default" doesn't mean that all resources must be downloaded (in fact, this would be opposite of interest of who would like to share data with others, even if they're trusted), but the fact that command line tools or projects that make reference for load resources outside of network need to have at least some explicitly authorization.
  5. "Batteries included", aka try already offer tools that do syntax checks of HDP files.
        1. If you use a code editor that supports JSON Schema, the v0.7.5 already has an early version that warns misuse. At the moment it still requires writing with the internal terms used (Latin). But if eventually the schema becomes generated using the internal core_vocab, this means that other languages would have such support too.

  6. The average HDP file should be optimized if it needs to be printed on paper as it is and have ways to express complex but common items of acceptable use policy (as 2021-03-16 not sure if this is tbe best approach) as some sort of constant. (This means the ideal max characters per line and typical indentation level should be carefully planned ahead). (This type of hint was based on suggestions we hear)

    1. Yes, we're in 2021, but as friendly as possible to allow the HDP files (in special the ones that are about authorization) being able to be saved as PDF or even on paper, the better. Even in places that allow attach digital archives, while the authorization can be public, the attachments may require extra authorization (like being a lawyer or at least be in person requesting the files).
    2. Ideally the end result could be concise enough to discourage large amounts of texts on the files themselves (even if it means we developer like "custom constants" that are part of the HDP specification itself, like an tag that means 'authorized with full non-anonymised access to strict use to red cross/MSF' or 'destruct any copy after not more necessary').
      1. Such types of constants can both help to make rules concise (so worst case scenario if people have to write again letter by letter something that is just not an customization of well know HDP example/template files, its possible) but also would allow with automatic translation
  7. Do exist other ideas, but as much of possible, both by the syntax of HDP files (that may be easier just have translation for the core key terms) and, if necessary, creation of constants to abstract concepts, ideally should allow that the exact file (either digitally signed or with literally PDF of an judge authorization, so the "authorization" could be an link to such file) be able to be understood even outside the original country.

    1. Both for how some custom filters may need to be created, or if either the language used was a totally new one, or the original source wrote a term wrong, the idea here is allow an human, who accepts and digitally signs an extra HDP file, can take full responsibility for mistakes.
    2. Again, the idea of average HDP files not requiring ways to point to resources or have reference to passwords also is perfect when is made by paper and the decisions (and who create the underlying rules if is something more specific) could be audited.
      1. Note that some types of auditing could be a human reading the new rule or, since the filters start to have common patterns, the filters someone else creates can be tested against example datasets. While not as ideal as human review, as long as some example datasets for that language already exist (think for example one that simulates Spreadsheets malformed but with personal information) could be used against what was proposed to help that rule of the initial user. (this type of extra validations don't need to be public)
fititnt added a commit that referenced this issue Mar 17, 2021
…f the YAML files, will mention the '[meta] HDP Declarative Programming (working draft) #16'
@fititnt fititnt pinned this issue Mar 17, 2021
fititnt added a commit that referenced this issue Mar 17, 2021
…SE; comment: maybe JSON Schemas could allow conditional loading, so an *.mul.hdp.yml or *.hdp.yml could allow completion all the time?; also playing with loops;
fititnt added a commit that referenced this issue Mar 17, 2021
fititnt added a commit that referenced this issue Mar 17, 2021
fititnt added a commit that referenced this issue Mar 18, 2021
@fititnt
Copy link
Member Author

fititnt commented Mar 19, 2021

TL;DR: we're also using testinfra for some tests

While hxlm.core, in particular the Htypes, makes sense to test the functions directly, at least for hdpcli in the short term, since internals are changing, it seems reasonable to test at a higher level.

By "higher level" I mean simulating the cli interface.

While this is not as detailed, it at least has more chances to get overall errors while allowing to move internal faster. Another problem is that attaching too much tests on internal methods not only force change things that don't matter for the end user, but also the tests themselves could take more time to write than code the things themselves.

Also, some sort of advantage (again, not for internal parts that could be intended to be reused) of doing such top level testing is that it actually may later be easy to move do like a lot of tests in batch. I mean in addition to the tests in this repository itself it would be possible to (either public or for private groups who would want to grant even more compliance with whatever HDP do, we somewhat would have a draft for this.

Top level tests could help with moving even faster

While retrocompativily is desired, as soon as the HDP syntax could on worst case simple require upgrade (think an human have to edit, even if is something very automatable to do in batch) but the HDP itself at very own core allow this while still able to have chain of accountability, I think that this could at least give some space for serious users that may have more localized community and may still rely on undocumented features.

In general this type of both have internal testing and (if at some point be relevant) document how to do with another repository also helps to test the full chain of environment. 

If exchanged HDP files need to meet higher level of compliance

Even if people trust more on public collaboration than their ability to do very deep checks, the idea of still, if as part of agreements to allow data exchange, audit code used, I personally believe that the best approach for who do this, in addition to the initial evaluation, should already implement some sort of automated testing (like to check if filters that should anonymize or block something still work).

This type of approach seems more reasonable  because without this very soon, in special who would have to meet governmental compliances, would eventually get outdated versions for weirdest reasons, like because the initial thing was paid by humans review for an period of time and then leadership chances (or budgets priorities chance) and then there is no one there even to do bare minimum checking.

Note that I by no means am saying that human auditing wouldn't be required (and, if not obviously, code released under public domain don't grant liability ), but what I'm saying is that such type of automation would still require humans to push buttons, so whoever would do auditing for this while would be an requirement continuous use, whoever pay for it should require some type of bare minimum automated test.

And what about "air gapped network"? This alone would not avoid the need to update software?

Even if we're drafting something that could be used without any access to the internet at all (so it means would not need to receive updates) no matter how paranoid is the organization threat model, soon or later people would need to update. If really fear, at least consider the possibility that may exist, and if not an critical fix to implement in hours, at least plan the scenario where in maximum one week the fix should be implemented.

But, again, "air gapped network" is something more specific and whoever uses this already should know what is doing. The thing about trying to automate extra tests is because if a country or community is Exchange data with other people from the same community, these people would still very likely use new versions so still a good idea to somewhat protect others or detect early any issue.

Also, doing automated tests help with issues not related at all with security, and in a context that would tolerate broader aspects of use of natural language instead of exact keywords, it actually very pertinent to have this.

fititnt added a commit that referenced this issue Mar 19, 2021
… created; hdpcli is already doing too much; lets break functionlity more on other parts of hxlm lib
fititnt added a commit that referenced this issue Mar 19, 2021
….vocab (good to know that is possible to add some sort of test even on an class or method itself
fititnt added a commit that referenced this issue Mar 19, 2021
fititnt added a commit that referenced this issue Mar 19, 2021
fititnt added a commit that referenced this issue Mar 19, 2021
fititnt added a commit that referenced this issue Mar 19, 2021
@fititnt
Copy link
Member Author

fititnt commented Mar 20, 2021

The hxlm.core.model.hdp.py file already has a bit over 1.000 lines of code. At least for new code, I will try to break in smaller pieces.

Another point is that I recently both discovered about python doctest (so inline comments that seem like interactive python sessions can be used to test code instead of only the tests/ folder). This seems good for test smaller files or things that are not functionality at a higher level.

1. Some points 

1.2 Fact: already is feasible to make every output also an valid source

If we really rush to make even the generated output from the Latin (the internal representation of HDP metadata) it would be possible. Maybe not too many lines over already exist. But it would be a hell of for loop!

The new point is both to break a bit more the functionality and also that some way to add more meaning to the keywords without actually changing much the natural language the human uses

1.2 We may actually use ( ), [ ], < >, { }

After Ansible itself (the idea about using YAML) we're hiring someone to allow use of natural language as programming language constructs. Note that Ansible uses like 95% plain English.

The problem with HDP as that is really, really pertinent to be able to be equally equivalent in several natural languages (see #15) is that we're actually using YAML as one way to have an valid base parser that allow build things over it, but we're already more near how some languages that are even used to teach programming logic allow clear syntax to allow full natural language.

Scheme/Lisp but in special Prolog (see https://www.swi-prolog.org/) can be somewhat more close on how we decide additional nap characters over only depend on the strings themselves

1.2.1 Using concept of DSLs (Domain Specific Language) to allow controller vocabulary

One thing we're very sure of: even if with HDP we tolerate/allow upgrading (like people using old verbs; or we just choose bad initial terms) the most important features NEED to be controlled words.

It means that implementations over HDP need to be both an combo of add new controlled verbs (think at least have translation for 6+1 languages, and we're just on v0.8) and, to be even more useful, software code that convert the instructions to something even more useful than allow exchange human text.

Anyway, modes to express what is controlled actions (things meant to have translations to everything) and what is something that users can extend need somewhat be more flexible. The closer to this is the use of "hrecipes[n]recipe" to "hrecipes[n]_recipe" but even then I was not sure if Arab users would use _ on the wrong side. So I think that at least we would meet _ on both sides or use other characters like ( ) / [ ] { }

End comments

Even if we tolerate some way to allow embed full Turing completeness (not just proxy commands to HXL tools, but like allow run python or other languages, like Ansible allow but complaints about), what is likely to be more an advantage may be more on the restricted DSLs (Domain Specific Language).

The approach of target be more an DSL (at least for creating the YAML files, the hdpcli could allow the full query-like thing, but this is not so soon) I think that tends to be more easy to allow large scale adoption.

But note that I'm not really assuming this would have a huge user base, but that making it less complex to teach a new person is a strong requirement. In other words: if to make it more powerful to allow conversion between natural languages the result would scare new people to test and be confident, the extra features (I suppose) may actually have less value.

The average user may actually assume that everything "is somewhat magic" (they may make comparisons with artificial intelligence or something) so just the extra features and potential may not be worth it if they think they would need something new that may already not be intuitive just looking at some previous YAML file.


Edit: edited links and quick quick errors

@fititnt
Copy link
Member Author

fititnt commented Mar 21, 2021

Even if we allow use any natural language (but, for sake of allow internationalization, at least know words used) we at least for one term (the one that could means equivalent of hsilo, or maybe something more generic like meta) this could simplify checks.

The idea of maybe use some printable characters like ( ), [ ], { } & < > could be a hint for what is the object that is equivalent to the meta (or hsilo) even for an unknow language (or simply user misspelling.

But here there is an design decision: we could either use an very generic term, like meta and then force user to explicitly add the language as one of key values, or don't require that the user always add the language code, but then meta would not be an good hint, but something like hsilolinguam: LAT & silolinguam: ENG|SPA|LAT|POR (as long as term silo we try to explicitly use different term for each language`... could work.

Note that the ideal idea is (at least for who create an document) be able to write using only own script, to a point of the baseline HDP could work even without use latin characters.

@fititnt
Copy link
Member Author

fititnt commented Mar 21, 2021

OMG I think I discovered one better approach: we use as hint the language name already in the exact script!

This not only solve the issue of have some term really unique (so, is feasible know the language without by default enforce use ISO 639-3 codes), but also we already solve the problem in detecting the script!!!

Also, to be really sure about the context, we could enforce that the prefix ([ and suffix ]) (maybe with some room to fit metadata betwen ([ and ]), so an really new language could temporaly be written in another script+language) and this could be sufficient to make things work!

Captura de tela de 2021-03-20 21-47-53

The problem with script systems

While Arab (the macro language) and Chinese (the macro language) can, at least, have 2 to 4 writing systems (so, with this we not only know the language, but writing system without any additional hint!) to my current knowledge, Japanese [1] have several ones. Like a lot.

I'm not saying that we would be able to implement this for some short term, but if HDP become more used, with interest, local communities could propose new vocabularies.

Note that we're already defining vocabularies on different files.

[1] https://en.wikipedia.org/wiki/Japanese_writing_system

fititnt added a commit that referenced this issue Mar 21, 2021
…with use of languages names for thenselves already with their original script (this could allow also detect both language AND script; see #16 (comment))
@fititnt
Copy link
Member Author

fititnt commented Mar 30, 2021

Oh boy. I think to simplify things I will implement as recursive call the load of files.

fixing_problems

I'm not finding a lot of examples using python (or at least not with classes) so I may just create an separate file just to put this sort of stuff on the hxlm.core.hdp. While not as dangerious like the python cryptography hazmat usage for packages that are not intended for end user, I think I may reuse such name.

fititnt added a commit that referenced this issue Mar 30, 2021
…PPolicyLoad ( and ...hasmat.policy) affects #17, the rest is mostly for the #16)
fititnt added a commit that referenced this issue Mar 30, 2021
…icy.{_get_user_know_what_is_doing(),_get_bunker()}
fititnt added a commit that referenced this issue Mar 31, 2021
fititnt added a commit that referenced this issue Mar 31, 2021
…ady have an draft to allow localized exceptions on the user language! (ok that we may need to make it stable engouth to not trow exceptions on exceptions themselves)
fititnt added a commit that referenced this issue Mar 31, 2021
…ed; the idea is abstract ram text messages to allow future l10n
@fititnt
Copy link
Member Author

fititnt commented Apr 4, 2021

v0.8.5 started. At the https://hdp.etica.ai/hxlm-js/ we have part of the HDP originaly written in Python ported to JavaScript

Captura de tela de 2021-04-04 15-37-00

Automated transcompiled code and (if necessary) open room for desktop application

Automated transcompiled code (see ceddfa6#diff-423a7725ffc65651ef9b48047bf6f5ccbeb669bef88aaa3deaf97701bc8db885) even if it works, is not as beauty. Also, since we're trying to make as simple as possible to parse the ontologies, it may actually not be hard to at least port the more important parts.

HDP JavaScript ports and even more sensitive content

Some parts of HDP may require (in special for potential use cases of humans centralizing a lot of work) steps that require GPG sign. For individuals who actually work with data transformation I will assume they can deal with command line (or at least interact with a combo of an code editor like VSCode that help to write HDP files and them call command line). But I already was thinking about some way that at least provide simple GUIs that allow people press buttons to know if a file is valid or not (but, note, on this case, I already am considering people with so sensitive content that they cannot or are expected to not be online). But considering the alternatives, is harder to create such interfaces using GUI. Also, even if we do things with Lisp-like dialects, the strategies also have ugly guis.

So, for more sensitive content, even if proof of concepts may exist to do in JavaScript, I would still recommend that potential people on next years just do a port that bind the Javascript implementations.

Web version for less sensitive content (or last-resort check if this is signed by who should be).

By design, HDP files are not mean to contain secrets (like file passwords or direct access to resources without authentication). So do exist some cases where people may receibe HDP files and they still not do the full thing like have installed software (or maybe they don't need this at all). So in such cases, for verify signatures, were HTTPS is acceptable, I think any web version is an perfect case.

fititnt added a commit that referenced this issue Apr 10, 2021
… Peter v1) will not be engouth; I think we could refactor all the things and already plan ahead 'programming by contract' already on the prototypes of functions when user create them
fititnt added a commit that referenced this issue Apr 10, 2021
…hon port to do heavy processing of HDPLisp until API be stable
@fititnt
Copy link
Member Author

fititnt commented Apr 17, 2021

As 2021-04-17, the HXL Standard does not have either from code maintained by the HXL working group or from community.

Since (even if very primitive way) we will need deal with already well formated CSV files generated by HXL tools, we would start to create a lot of ad-hoc functions just to repeat what HXL would do.

This means we will take at least a few days extra just to make a functional draft of HXL on Racket that at least should work with our ongologies written in hxl.csv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant