[meta issue] hxlm #11
Comments
…ink allow even localized also on key) may be harder to do quick tests, since my preferred code editor has this microsoft/vscode#11770
Ok. I think I will give up the idea of trying to make the code generate a schema of what is on disk and do the opposite: let YAML describe what is on disk (or what the final state on disk should be). Turns out this resembles Ansible playbooks a lot! But instead of an entire group of servers, it is a group of datasets on the local disk! Even if over the next days each of these points in the YAML inventory, like hdatasets, hfiles, etc., is already mapped to action classes, there would still be missing the equivalent of "ad-hoc" Ansible tasks. The HXL equivalent of Ansible ad-hoc tasks would be recipes:
**Why YAML over JSON**

At this moment, I think the main difference over just using ad-hoc HXL-proxy recipes is the fact that we start to have an inventory. Ansible separates what is inventory from what is task (so tasks can be reused across several projects that are somewhat similar). But the main idea that led me to look at YAML was not even Ansible: I remembered that YAML is easier to deal with comments, while still being powerful to process via tools.

**Special attention to the concept of compliance (this is likely to take months)**

There are some building blocks to abstract, but one that deserves special attention, if we want to end up with a descriptive language that is easier to abstract, is the concept of compliance rules. So we're not only talking about having one common way to express concepts: the key terms (not only the values) need to be in the local language and need to support spaces, accents, etc. Compliance rules are, roughly, the idea of computing whether something is authorized or not, and what to do if it is not authorized. Compliance rules could also apply filters, or at least require the human to ask someone's permission to explain why some specific filter does not apply to a given case. But in a scenario where people trust a computer more than each other (or actually trust each other, but need some explanation to avoid breaking laws that would take weeks or months to get clearance for), if whoever approves feels safe about what is in a YAML compliance file, and there are people outside the organization who can attest to it, this at least reduces human error and eventually allows faster data exchange for more sensitive content.
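A minimal sketch of the "YAML describes what is on disk" inventory idea above, assuming nothing about the final HDP vocabulary — the key names (`hsilo`, `hdatasets`, `hfiles`) and file names here are illustrative only. The dict is what a YAML loader such as PyYAML would produce from a tiny hmeta.yml:

```python
# Hypothetical parsed hmeta.yml: the file declares what SHOULD exist on
# disk (like an Ansible inventory of datasets), and code walks it.
hmeta = {
    "hsilo": {"tag": ["demo"]},
    "hdatasets": [
        {"id": "population-br", "hfiles": ["population-br.csv"]},
        {"id": "population-mz", "hfiles": ["population-mz.csv"]},
    ],
}

def declared_files(inventory):
    """Return every file the inventory claims should exist on disk."""
    files = []
    for dataset in inventory.get("hdatasets", []):
        files.extend(dataset.get("hfiles", []))
    return files

print(declared_files(hmeta))
```

The point of the declarative direction is exactly this: the code only has to reconcile the declared state with the disk, instead of reverse-engineering a schema from whatever files happen to exist.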
…ble playbooks, but allowing several hmetas on same file
…ction at Model level to load the YAML
…ect; still need some extra abstraction
…fs already on core); HFile now starts to check is_available_locally
HFile is already able to reload files from remote sources (the first one that works will be downloaded, if there is not already a copy on disk). HRecipe already has a draft using the HXL-proxy recipes, but if we manage to make it work also with libhxl-python, the hmeta.yml project file can be used to play around with multiple JSON recipes. Before going to compliance rules, I think we need to abstract the JSON recipes. I'm not fully sure whether some features of HXL-proxy are HXL-proxy-only and not in libhxl. But one thing that is really necessary for compliance is some quick way to discover the headings of each file, since some compliance rules may need to allow/block (or at least require human review to force on the hmeta.yml that it is ok) based on the typical headings.
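The HFile behavior described above — check the local copy first, otherwise download from the first remote source that works — can be sketched with the standard library. This is a stand-in, not the real HFile class; the function names are assumptions (only `is_available_locally` is mentioned in the commits above):

```python
import pathlib
import urllib.request

def is_available_locally(path):
    """A local copy exists, so no network access is needed."""
    return pathlib.Path(path).is_file()

def fetch_first_working(path, sources):
    """Return the local path, downloading from the first reachable source
    only when there is no copy on disk yet."""
    if is_available_locally(path):
        return path
    for url in sources:
        try:
            urllib.request.urlretrieve(url, path)
            return path
        except OSError:
            continue  # this mirror failed; try the next one
    raise FileNotFoundError(f"No local copy and no reachable source for {path}")
```

Note that because the local check comes first, repeated runs never re-download, which also matters for the server-load concerns discussed later in this thread.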
Considering that the really complicated part will be the compliance rules (and compliance rules ideally should, by design, be strictly translatable even between languages, since some country/territory would take too much time to translate from the local language to a common one, and this would be unacceptable), I think the parts that matter should already be enforced. Declarative programming makes it easier for the humans who approve what is right and what is wrong, while offloading complexity to whoever actually implements the software. I will make some tests. 1: The (not implemented)
…source Identifier'/'Uniform_Resource_Name')
…)' actually seems the best term in English; maybe worth changing internal names
…t which crypto high-level library to use (and also about caring about developer usability
…se encrypted URNs (urnresolver #13) may require some generic adapter
…_init, HDP._safer_zone_hosts, HDP._safer_zone_list, HDP.export_schema_json(), HDP.export_yml()
**The new hdpcli as "offline-first" usage**

At this point there is no problem with this, but I think that by design it may be better to start "in offline mode" and either require an extra command from the user or interactively ask whether the user allows connecting to the host (at least if we detect an interactive session, not running as part of some script).

**Even with acceptable sandboxing-by-default (and an eventual way to grant that publicly shared HDP/URN files are signed for authenticity), there is still the privacy point**

For whoever reads this later: I'm not talking about the privacy of the humans referenced in the data managed by these tools, but of the humans who would use the command-line tools to automate tasks. As with anything that accesses the internet, if hdpcli/urnresolver is allowed to fetch data from the internet, whatever the host is, it can learn the requester's IP address.

**Allow offline (structured cache) also as a way to mitigate overloading remote servers with requests**

In particular, if the urnresolver becomes ok to use as a standalone CLI or as part of other libraries (and not just as an internal tool here), then depending on how CLI tools deal with caching, misbehaving tools could make a lot of requests just to learn the available URNs. (This is also why the URN index files are likely to allow simple text files, even if this means letting users encrypt just specific content and not caring that the file may be publicly accessible. This approach mitigates server load while still keeping some way to find content.) Things are even worse since YAML would be easy to use even on a local machine, which is often done with HXL-Proxy: people often work with large files and don't download them locally first. While these files may not be requested as often, they can be much larger than what HXL-Proxy allows by default (which is a lot! It can easily pass 500,000 rows of data).
So there are cases where a mix of allowing online and offline (or an organized local cache) is actually useful beyond the privacy part. In fact, this is the biggest reason.
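The offline-first policy above can be reduced to one decision function: go online only with an explicit opt-in, or with interactive consent; scripts and CI stay offline. Everything here (names, prompt text) is an illustrative assumption, not the real hdpcli interface:

```python
def may_go_online(allow_online=False, interactive=False, ask=None):
    """Offline by default: online only via explicit flag or interactive consent.

    allow_online -- the user passed an explicit opt-in (e.g. a CLI flag)
    interactive  -- we detected an interactive session (e.g. a TTY)
    ask          -- callable used to prompt the user; hypothetical hook
    """
    if allow_online:
        return True
    if interactive and ask is not None:
        answer = ask("Allow connecting to remote hosts? [y/N] ")
        return answer.strip().lower() == "y"
    return False  # non-interactive (script/automation): stay offline
```

Making the prompt a parameter rather than calling `input()` directly keeps the policy testable and lets library users replace it with their own consent flow.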
Just added the functionality of loading files by expected suffix per directory, and the HXL data processing specs (an array) already (as sooner or later I would expect) do not guarantee the same exact order as when everything is in a single file. Based on my experience with Ansible (medium to big projects, including creating Ansible Roles), I will try to optimize for medium to large projects rather than for average single-file usage. I do not have experience from the very early days of Ansible 1.0, so most of the mechanisms to run partial playbooks were very likely already there, but I think the idea of selecting a recipe by array index is so prone to go wrong that it should not even be in the documentation.

**Analogies to Ansible**

In some aspects the HDataset (and implicitly the potential result of each recipe) is similar to the Ansible inventory. The playbooks/tasks, I think, should be as abstracted as possible to avoid the user being disappointed by the order of execution; or at least we should try to delay the moment when the user has no option but to learn about the order of execution. The ideal would be that a user who is just consuming an already-working project is able to reproduce it, and things keep working even as the user merges more and more files.

**Tags to allow controlling selection (include/exclude by tag)?**

Ansible users rely on tags a lot, both to select what they want and, by tag, what they do not want (in fact, I personally overuse tags far beyond the average Ansible user, but ok). Here it may make sense to at bare minimum already have such tags (but I already know this is not sufficient, especially when reusing projects from others, and then hoping people keep the same tagging conventions; this alone is not good). URNs, if used, would allow exact selection, but this may reduce reusability of inner parts and could also force users to make decisions too soon.
But anyway, it could be a good idea, if the user does not explicitly create exact URNs, to implicitly create them based on context. Maybe implicitly use the 2-letter ISO country code for "localhost" if the user did not select anything or did not receive project files already well organized? One thing Ansible has for hosts is 'localhost', 'all' (which includes all hosts except localhost, because otherwise this could break things like installing/removing things on the user's own computer!) and 'ungrouped'. Maybe we could create another pseudo-concept that would be almost like tags, but instead of applying even to sub-items, would be required to be more "top level", like hsilo?
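The Ansible-style tag selection discussed above is simple to state precisely: keep an item if it carries at least one included tag and none of the excluded tags. The item shape (`{"id": ..., "tag": [...]}`) is an assumption for illustration, not the final HDP structure:

```python
def select_by_tags(items, include=None, exclude=None):
    """Filter inventory items by tag, Ansible-style.

    include -- if given, an item must carry at least one of these tags
    exclude -- if given, an item carrying any of these tags is dropped
    """
    chosen = []
    for item in items:
        tags = set(item.get("tag", []))
        if include and not tags & set(include):
            continue  # none of the wanted tags is present
        if exclude and tags & set(exclude):
            continue  # an unwanted tag is present
        chosen.append(item)
    return chosen
```

Exclusion winning over inclusion matches the thread's caution: when tagging conventions from different projects collide, it is safer to drop too much than to silently include an unwanted dataset.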
Another thing I'm considering is what to do if a user adds a tag directly in the scope of …

**About hsilos and how to "select by them" (Ansible uses the concept of host groups)**

I'm thinking about actually not forcing a single hsilo to have a unique name (like a unique ID), but tolerating (maybe strongly recommending, or at least making it easy enough that users tend toward it) that by labeling a silo with a group, that group makes every hsilo carrying it actually "part of the same silo". This would work somewhat like tags, but groups (with this approach) would only apply at the top level. Anyway, if the user really wanted an exact id for a file, they could simply create a very unique group name (or force a URN, which in this case could act like a prefix... but if we document using a unique URN as base, then this would break the concept of hsilos as a single silo spread over different files, hmmm...).

**End comments**

The JSON Schema (the file used to help validate YAML files in applications like VSCode) can actually help enforce what can or cannot be in each file. So whatever the implementation becomes, by using this helper the user can get feedback without waiting to run hdpcli and receive errors.
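The same kind of rule a JSON Schema would give the editor can also be enforced at load time, so hdpcli fails early with a useful message. This is a stdlib-only stand-in (a real implementation could validate against the actual schema with a package like `jsonschema`); the allowed key names are illustrative assumptions:

```python
# Hypothetical top-level vocabulary; the real one would come from the
# same JSON Schema file that VSCode uses for editor feedback.
ALLOWED_TOP_LEVEL = {"hsilo", "hdatasets", "hfiles", "hrecipes"}

def unknown_top_level_keys(document):
    """Return sorted top-level keys that the schema does not allow,
    so the CLI can report them instead of failing mysteriously later."""
    return sorted(set(document) - ALLOWED_TOP_LEVEL)
```

Keeping one schema as the single source of truth for both the editor and the CLI means the user sees the same complaint in both places, which is the feedback loop described above.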
…ady was ugly, it would be much worse with anything beyond ASCII; so let's force it by default for everyone
… explicitly use; something like 'urn:oo:hsilo:domain_base:container_base-container_item_index' where container_base often would be 'local' (localhost) and container_base often the filename itself
… --non-nomen, --non-tag, --non-urn, --verum-adm0, --verum-grupum, --verum-tag, --verum-urn
…f the YAML files, will mention the '[meta] HDP Declarative Programming (working draft) #16'
Just created the issue [meta] HDP Declarative Programming (working draft) #16. Considering what could be production-ready in the short to medium term, even if an abstraction using only YAML with this Domain Specific Language, HDP, is not as powerful as plugins written directly in Python, it may be more realistic than requiring people not only to start using HXL in these contexts, but also to allocate individuals able to implement it in Python without being scared that the data themselves are very sensitive.

**1. Part of the auditing functionality could be moved to filters instead of requiring custom Python code (needs testing)**

There are some extra points (they are still relevant when someone overrides a default behavior with some YAML files), but this would be even more essential with plugins in plain Python: the code would have to be even more strictly audited than if at least the most common features were already possible using a DSL-like language. If we manage to draft reusable HXL data processing specs in YAML that can be challenged with testing data (even if such tests would not need to become public), this could help spot common errors. "Errors" like a customized rule letting private data pass could be ignored if the human is aware of them and able to check that the authorization allows it. Note that this type of test is not applicable to all types of data sharing. But in cases where more explicit restrictions exist, it could be used.

**2. Considering the idea that files which mention datasets by default don't require them in the same folder**

Weeks ago one screenshot had a way to express datasets inside the current folder. If the urnresolver (Uniform Resource Names - URN Resolver #13), plus conventions on how to represent a URN in some base folder on local disk, becomes viable, HDP-like instructions would never store the data themselves where the HDP files are.
This approach could both solve the problem of storing files on separate disk partitions (or maybe on S3-like storage) while the metadata files are handled differently.

**3. Avoiding defining new keywords to define terms by... simply having translations for every language people care to use (or allowing someone trusted to provide a file that adds missing terms)**

One very hard decision I discovered when planning, for example, the best hashtags to use when sharing datasets with @HXL-CPLP is that it is sometimes hard to find something that is more universal. With HDP, the drafted idea is that in addition to the internal terms (which, if using Latin script, are... Latin), there is one canonical term to translate to when converting to a known language, and, when converting from such a language, some extra aliases can be understood. To simplify translation, most HDP keywords are single, somewhat primitive words. In special cases, for macrolanguages (both Arabic and Chinese), this means I already know some terms are impossible and the variants will eventually need to be implemented. But at least we already start with something that allows localization! As complicated as it may sound to tolerate such a level of localization, considering the time needed "to fix" things, this seems easier to fix permanently than the alternatives. Also, the fact that the terms can be in people's native language simplifies documentation a lot.

**End comments**

The three points above may give an idea of why the ideal of full "Declarative Programming" (here, an abstraction over how things are really done) can actually be harder to implement, but may be less hard than the alternatives. At bare minimum, it provides some level of sandboxing compared to allowing full Python. Also, the extra requirements/restrictions may help to implement early what is viable to use in production.
And, in the context of HXL, the idea of, for example, allowing localized terms to express commands is actually feasible for a programming interface (like Excel formulas, for example, which are often in the person's native language), but is not feasible when deciding good reusable hashtags for datasets. I mean: the number of possible keywords in a programming interface is controlled. That's it!
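The "canonical internal term plus per-language aliases" idea above can be sketched as a lookup table plus a key-rewriting pass. The Latin-script internal terms and the Portuguese aliases here are invented examples for illustration, not the real HDP vocabulary:

```python
# Hypothetical alias tables: localized key -> canonical internal term.
# A trusted party could ship an extra table to add a missing language.
ALIASES = {
    "por": {"conjunto-de-dados": "hdatasets", "arquivos": "hfiles"},
    "lat": {},  # internal terms are already the canonical Latin-script ones
}

def to_internal(document, language):
    """Rewrite top-level keys from a localized language into internal terms.

    Unknown keys pass through unchanged, so validation (not translation)
    is the layer that rejects them.
    """
    table = ALIASES.get(language, {})
    return {table.get(key, key): value for key, value in document.items()}
```

Because the set of keywords is small and controlled (the point made just above), maintaining these tables per language is tractable in a way that localizing free-form dataset hashtags is not.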
This issue will be used to reference commits from this repository and others.
TODO: add more context.
Update 1 (2021-03-01):
Ok. I liked the idea of YAML-like projects!!! But it may be easier to build the full thing than to explain it upfront. (I'm obviously biased because of Ansible, but ok; anyway, I know it is possible to even implement something like testinfra; but it would be easier to create an "Ansible for datasets + (automated) compliance" than to reuse Ansible.)
Also, YAML, unlike JSON, is much more human-friendly (for example: it allows comments!), so this can somewhat help.
Being practical, at this moment I think it will mostly be a wrapper over libraries and APIs that already exist (aka syntactic sugar, not really new features). But as soon as the building blocks are ready, the YAML projects themselves become powerful!