Add command string-to-variable to reuse incoming string as variable #533

Open
TobiasNx opened this issue May 23, 2024 · 9 comments

@TobiasNx
Contributor

TobiasNx commented May 23, 2024

At the moment, we cannot reuse the incoming URL string after it has been passed to open-http.

One useful scenario: we scrape a website that does not provide its own URL as metadata, and we want to identify the source quickly. Another: when an error is caught in a later step, the message could name the _id as the source of the error.

There could also be a more abstract approach, since this would be useful for open-file as well, providing the file name as _id.

e.g.:
https://metafacture.org/playground/?flux=%22https%3A//phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-html%0A%7C+fix%28%22copy_field%28%27_id%27%2C%27_id%27%29%22%29%0A%7C+encode-json%28prettyPrinting%3D%22true%22%29%0A%7C+print%0A%3B

Not sure where the value of _id comes from.

PS (17.9.24):
I suggest introducing a command, string-to-variable, that stores the incoming string in a Java variable for later reuse. This would be a generic approach, and the command could be placed in front of the specific opener.

@blackwinter
Member

_id is the internal record identifier which is set automatically by some decoder/handler modules and which can be set manually (based on some literal value) with the change-id Flux command.

It cannot be set by input modules, because they don't know anything about records at that point. OTOH, the source location (URL, path) is no longer available when the decoder receives the stream, and there is (currently) no way to transport it out-of-band. Setting the ID to the source location would also mean that (potentially) multiple records get the same ID, which would violate the uniqueness guarantee.

It might, however, be possible to save the URL in a variable which can then be used in the transformation. Maybe along the following lines:

default inputUrl = "https://phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html";

inputUrl
| open-http(accept="application/xml")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;
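
As a rough illustration of what the `$[inputUrl]` placeholder amounts to, here is a minimal Python sketch. It assumes that placeholders are substituted once, when the Fix is set up, and not per record; `resolve_vars` and `flux_vars` are hypothetical names, and this is not Metafacture's actual implementation.

```python
import re

# Matches placeholders of the form $[name].
PLACEHOLDER = re.compile(r"\$\[([^\]]+)\]")


def resolve_vars(text, variables):
    """Replace every $[name] in text with its value from variables."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return variables[name]

    return PLACEHOLDER.sub(lookup, text)


# Hypothetical variable map, standing in for the Flux variables passed via `*`.
flux_vars = {"inputUrl": "https://example.org/page.html"}
fix_source = "set_field('_id', '$[inputUrl]')"

resolved = resolve_vars(fix_source, flux_vars)
print(resolved)
```

Under that assumption, the value is fixed before the first record arrives, so it has to be known when the flow is defined.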

@TobiasNx
Contributor Author

I would be fine with a variable that could be used in the FIX and the FLUX.

It would help in this scenario.

It would also be nice to use the variable in, e.g., logging contexts or in other scenarios as a variable in the FLUX, but this would be an additional feature.

@blackwinter
Member

> I would be fine with a variable that could be used in the FIX and the FLUX.

So your initial use case is solved?

> It would also be nice to use the variable in, e.g., logging contexts or in other scenarios as a variable in the FLUX, but this would be an additional feature.

I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

@TobiasNx
Contributor Author

TobiasNx commented May 23, 2024

>> I would be fine with a variable that could be used in the FIX and the FLUX.

> So your initial use case is solved?

I think if I could use the variable in the fix, my use case would be solved, yes. :)

>> It would also be nice to use the variable in, e.g., logging contexts or in other scenarios as a variable in the FLUX, but this would be an additional feature.

> I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

One scenario where the variable could be handy is a configurable logging message that includes the variable in its output. Another: if the file name is passed on as a variable, I could use it to write a file with that name.
But these are additional features. What would be good in the first place is to have the variable available for FIX and for other FLUX commands.

@blackwinter
Member

> I think if I could use the variable in the fix, my use case would be solved, yes. :)

But you can. Doesn't the proposed solution work for you?

@TobiasNx
Contributor Author

Ahh, now I see the specific aspect of your approach.
I thought you were suggesting that the opener module would create the variable, but you were not.

Something like this:

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| open-http(input-to-variable="inputUrl")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

Instead you would define the variable beforehand.

This would not solve my use case, since the variable has to be provided/configured outside of the Flux workflow itself.
In our scenario, the use case would be to read a sitemap via the sitemap reader in oersi, then open each HTML page and fetch data.
I do not know the data beforehand.

Perhaps another and more general solution would be a flux-module that sets the incoming string as variable.

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| string-to-variable("inputUrl")
| open-http(header=user_agent_header)
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

TobiasNx changed the title from "Provide incoming url from open-http as _id" to "Provide incoming string for url or path from open-http/pen-file as variable" on Jun 19, 2024
TobiasNx changed the title from "Provide incoming string for url or path from open-http/pen-file as variable" to "Provide incoming string for url or path from open-http/open-file as variable" on Sep 5, 2024
TobiasNx changed the title from "Provide incoming string for url or path from open-http/open-file as variable" to "Add command to reuse incoming string as variable" on Sep 17, 2024
TobiasNx changed the title from "Add command to reuse incoming string as variable" to "Add command string-to-variable to reuse incoming string as variable" on Sep 17, 2024
@TobiasNx
Contributor Author

> Perhaps another and more general solution would be a flux-module that sets the incoming string as variable.

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| string-to-variable("inputUrl")
| open-http(header=user_agent_header)
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

I suggest we go with this approach.

TobiasNx added a commit that referenced this issue Sep 18, 2024
TobiasNx added a commit that referenced this issue Sep 18, 2024
@dr0i
Member

dr0i commented Sep 19, 2024

I don't think your idea would work: you seem to propose setting a variable globally, i.e. one that could be accessed independently of the modules. This must break, at the latest when threads are used. It would break even before that, because the modules are stream-based, and you cannot guarantee that the variable is not changed before the data associated with it has been processed by the downstream modules.
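
To make the buffering concern concrete, here is a minimal Python sketch of a push-style flow. Everything in it is hypothetical: `StringToVariable`, `Batch`, `TagWithVariable` and the `shared` dict only imitate the proposed command, a buffering module and a downstream consumer; none of this is Metafacture code. No threads are involved.

```python
shared = {}  # stands in for a globally visible Flux variable


class StringToVariable:
    """Stores each incoming string in `shared`, then passes it on."""

    def __init__(self, name, receiver):
        self.name = name
        self.receiver = receiver

    def process(self, value):
        shared[self.name] = value
        self.receiver.process(value)

    def close(self):
        self.receiver.close()


class Batch:
    """Buffers `size` items before pushing them downstream."""

    def __init__(self, size, receiver):
        self.size = size
        self.receiver = receiver
        self.buffer = []

    def process(self, value):
        self.buffer.append(value)
        if len(self.buffer) >= self.size:
            self.flush()

    def flush(self):
        pending, self.buffer = self.buffer, []
        for value in pending:
            self.receiver.process(value)

    def close(self):
        self.flush()
        self.receiver.close()


class TagWithVariable:
    """Pairs each item with whatever the variable holds right now."""

    def __init__(self, name):
        self.name = name
        self.results = []

    def process(self, value):
        self.results.append((value, shared[self.name]))

    def close(self):
        pass


sink = TagWithVariable("inputUrl")
flow = StringToVariable("inputUrl", Batch(2, sink))
for url in ["a.html", "b.html", "c.html"]:
    flow.process(url)
flow.close()

print(sink.results)
```

In this sketch the first item, `a.html`, is paired with `b.html`, because the variable was overwritten while `a.html` was still sitting in the buffer.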

A possible solution could be, if _id is really unique and can be accessed unambiguously throughout all modules, to make use of a global hash table in which your variables are associated with that _id.
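
A minimal Python sketch of that idea, under the stated assumption that every record gets a unique _id. The names `source_by_id`, `decode` and `add_source` are hypothetical, and the sketch glosses over the point made earlier in the thread that the decoder does not currently know the source location.

```python
import itertools

source_by_id = {}  # global table: record id -> source location
_next_id = itertools.count(1)


def decode(source, payloads):
    """Turn each payload into a record with a unique id and register its source."""
    records = []
    for payload in payloads:
        record_id = str(next(_next_id))
        source_by_id[record_id] = source
        records.append({"_id": record_id, "body": payload})
    return records


def add_source(record):
    """Downstream lookup: fetch the source by record id, not from a global variable."""
    return {**record, "source": source_by_id[record["_id"]]}


records = decode("a.html", ["x", "y"]) + decode("b.html", ["z"])
records.reverse()  # the order in which records arrive downstream no longer matters
tagged = [add_source(record) for record in records]

for record in tagged:
    print(record["_id"], record["source"])
```

Because the lookup is keyed by _id, buffering or reordering between the modules does not change which source a record is paired with. With threads, the table itself would additionally need to be thread-safe.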

@TobiasNx
Contributor Author

Yes, the intent is to set a global variable that can be reused at a later stage, e.g. in a use case we had in oersi-marc:

Opening a folder of files, manipulating them, and later reusing the file names from the incoming string.

"folderPath"
| open-dir
| string-to-variable
| open-file
| decode-json
| batch-reset("1")
| fix("set_field('fileName', '$[inputString]')", *)
| encode-json
| write("output/$[inputString]")
;

The other scenario comes from oersi: when using a sitemap reader, one cannot get the URL of each subpage.

Maybe I am thinking about this in too simple a way. By the way, while writing this I realized that you are right: my solution would not be good enough and would not solve my scenario, since the incoming string in openFile is a relative path, not the file name...

So I go back to my old idea: open-file and open-http should provide the file name/file path or the URL as a variable for later use. But it seems that making this thread-safe will be difficult.


3 participants