-
Notifications
You must be signed in to change notification settings - Fork 12
Add topic to parsed json #628
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added one comment.
Also, IMO it would be nice to add some unit tests to make sure things work as expected.
input_path.parent.parent | ||
/ "topic" | ||
/ f"{input_path.stem}.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a huge fan of this hardcoding.
IMO we should not assume this is run within a pipeline.
Why don't we just replace --include-topic
with --topic-path
(or similar) and if the user does not provide it then it means that they are not interested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is definitely my plan! I did not want to hardcode the variable. I just wanted to be sure we can move on in this direction. What do you think about the way to solve this issue ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe I can review again once you have a final version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely! I let you know when the PR is done :) Thank you for your help!
|
||
serialized_json = json.loads(serialized) | ||
if input_type == "pubmed-xml-set": | ||
serialized_json["topics"] = topic_json[i]["topics"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea here is to dynamically add this topics
key only inside of the JSON?
Wouldn't it make more sense to define a new attribute topics
for Article
that is an empty list by default? And then modify this attribute once we have the topics and then serialize the whole article?
This way the schema of the Article
is going to be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, but I have the feeling that parsing the article and finding the topic are two completely separate steps. I am a bit afraid of making those steps more dependent on each other. Currently, I try to keep everything independent (and the changes in the CI command reasonable). Do you think it is possible to add the new attribute topics
in Article
and keeping things independent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point and I guess that is why I suggested to set it to an emptylist[str]
by default. If the user wants, they can provide it later but at least we declare that the topics
attribute could be populated.
IMO one should just look at the Article
dataclass and be able to see what the final schema of the saved .json
is going to be. Once we start writing code that adds new entries to the .json
then I am worried about things becoming intractable.
Anyway, what do you think?
Fixes #619.
Description
Please provide here a summary of the changes introduced by this PR.
How to test?
Please provide here instructions on how to test the changes introduced by this PR.
(if some changes cannot be tested by automated tests)
Checklist
(if it is not the case, please create an issue first).
(if needed)
whatsnew.rst
updated.(if needed)
setup.py
andrequirements.txt
updated with new dependencies.(if needed)
(if a function is added or modified)