Skip to content
This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Add topic to parsed json #628

Closed
wants to merge 11 commits into from
Closed

Add topic to parsed json #628

wants to merge 11 commits into from

Conversation

EmilieDel
Copy link
Contributor

Fixes #619.

Description

Please provide here a summary of the changes introduced by this PR.

How to test?

Please provide here instructions on how to test the changes introduced by this PR.
(if some changes cannot be tested by automated tests)

Checklist

  • This PR refers to an issue from the issue tracker.
    (if it is not the case, please create an issue first).
  • Unit tests added.
    (if needed)
  • Documentation and whatsnew.rst updated.
    (if needed)
  • setup.py and requirements.txt updated with new dependencies.
    (if needed)
  • Type annotations added.
    (if a function is added or modified)
  • All CI tests pass.

@EmilieDel EmilieDel changed the title First draft of adding topic to json Add topic to parsed json Sep 15, 2022
Copy link
Contributor

@jankrepl jankrepl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added one comment.

Also, IMO it would be nice to add some unit tests to make sure things work as expected.

Comment on lines +236 to +238
input_path.parent.parent
/ "topic"
/ f"{input_path.stem}.json"
Copy link
Contributor

@jankrepl jankrepl Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a huge fan of this hardcoding.

IMO we should not assume this is run within a pipeline.

Why don't we just replace --include-topic with --topic-path (or similar) and if the user does not provide it then it means that they are not interested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is definitely my plan! I did not want to hardcode the variable. I just wanted to be sure we can move on in this direction. What do you think about the way to solve this issue ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe I can review again once you have a final version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! I let you know when the PR is done :) Thank you for your help!


serialized_json = json.loads(serialized)
if input_type == "pubmed-xml-set":
serialized_json["topics"] = topic_json[i]["topics"]
Copy link
Contributor

@jankrepl jankrepl Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea here is to dynamically add this topics key only inside of the JSON?

Wouldn't it make more sense to define a new attribute topics for Article that is an empty list by default? And then modify this attribute once we have the topics and then serialize the whole article?

This way the schema of the Article is going to be clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but I have the feeling that parsing the article and finding the topic are two completely separate steps. I am a bit afraid of making those steps more dependent on each other. Currently, I try to keep everything independent (and the changes in the CI command reasonable). Do you think it is possible to add the new attribute topics in Article and keeping things independent?

Copy link
Contributor

@jankrepl jankrepl Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point and I guess that is why I suggested to set it to an emptylist[str] by default. If the user wants, they can provide it later but at least we declare that the topics attribute could be populated.

IMO one should just look at the Article dataclass and be able to see what the final schema of the saved .json is going to be. Once we start writing code that adds new entries to the .json then I am worried about things becoming intractable.

Anyway, what do you think?

@EmilieDel EmilieDel closed this Sep 19, 2022
@EmilieDel EmilieDel deleted the add-topic branch October 24, 2022 08:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include topic info in JSON output parse of bbs_database parse
2 participants