Let datasets know how to load themselves #378
Conversation
Thinking a bit more about this, if we go with this idea of having configurable datasets, to make …
This is related to #328. And, actually, I think having …. You could argue that you could put the dataset instance selection code into …. Addressing your three points:
I think the right way forward is to take the classes you've added here and convert them into …
I see what you mean about the problem of having the …. In terms of …, the problem I see with having …
Yeah, this was how I envisioned someone adding, e.g., a tuple matcher in third-party code. You import the dictionary, add a class, and then use methods from …
Yeah, it's not the cleanest thing, but I don't think it breaks the API much: https://github.com/allenai/deep_qa/blob/master/deep_qa/models/memory_networks/memory_network.py#L140-L142.
This is a good point, and a use case I hadn't thought of. I don't think it's a huge deal to have several copies of SQuAD around, for instance, in part because it's hard to imagine another set of parameters, except in modifying SQuAD for use in a different task, like answer sentence selection. Also, for language modeling, you probably don't want to load the whole dataset into memory anyway (because it wouldn't fit); you want to scan across a directory in a data generator, so the whole data processing code would need to be rethought, or just bypassed, for the language modeling case. Another consideration is that for quick iteration time in running test experiments, you typically want to have as little data processing done in your experiment code as possible. Hence #352, which would cut out ~10 minutes of waiting when trying to debug something on the full SQuAD dataset. If we move to loading the data as you're suggesting, we add processing time on top of what we already have. So I think for all tasks we have except language modeling, the dataset reader approach is sufficient, and preferable to having …
@tusharkhot here is the discussion following chatting about loading in json graphs.
I agree that in general this is not a problem. However, I could imagine that this would make manipulating datasets really easy for weird and specific formulations which you want to try - imagine if you want to know how Bidaf performs on "What" questions. If the dataset can be passed parameters in the json file, you could just run an evaluation passing a ….

I don't quite see why this is incompatible with serialising pre-indexed and tokenised data though. Part of that serialisation would be the dataset parameters, which are just loaded along with the pickled instances/padded text data file. I agree that it wouldn't be compatible with different dataset parameters, but if you are changing these, you're probably running longer jobs where the amount of time spent padding/indexing is inconsequential.
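To make that concrete, here is a rough sketch of the kind of filtering a parameterised reader could do; the question_type parameter, the function name, and the JSON-lines format are assumptions for illustration, not an existing deep_qa option:

```python
import json
from typing import List, Optional

def read_squad_like_file(filename: str, question_type: Optional[str] = None) -> List[dict]:
    """Read one JSON-encoded question per line, optionally keeping only, e.g., "What" questions."""
    examples = []
    with open(filename, "r", encoding="utf-8") as data_file:
        for line in data_file:
            example = json.loads(line)
            # question_type would come straight from the dataset section of the experiment JSON.
            if question_type is None or example["question"].startswith(question_type):
                examples.append(example)
    return examples

# e.g. evaluating BiDAF only on "What" questions:
# dev_examples = read_squad_like_file("squad_dev.jsonl", question_type="What")
```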
What's the context on "loading in json graphs"? Ok, so, in the interest of doing highest-priority work first, what are the actual problems with the current data loading code that we want to solve right now? It looks like you're thinking of doing your seq2seq stuff in DeepQA, is that right?
I think this PR is talking about a broader problem than what I was talking to Mark about. I was referring to the problem that currently the code assumes that every instance should fit in one line. With tuples/graphs encoded using JSON, we could still have a single instance in one line at the cost of readability. So this is not too much of a problem for me currently, but it can be a problem, e.g. what if my strings have newlines and I can't strip/replace them since my model uses these newlines? So, I was wondering if you need the restriction of one line == one instance. Based on the discussion here, it seems that there is no such restriction; it is just not the common case scenario and harder to do. (Right?)
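For what it's worth, a minimal sketch of the "graph on one line" encoding being described; the fields are made up for illustration, and the point is just that json.dumps escapes embedded newlines, so one line == one instance still holds:

```python
import json

# A hypothetical graph-structured example; the exact fields are illustrative only.
example = {
    "question": "Which city is\nthe capital of France?",  # embedded newline in the raw string
    "graph": {"nodes": ["Paris", "France"], "edges": [["Paris", "capital_of", "France"]]},
    "label": "Paris",
}

# json.dumps escapes the newline as \n, so the whole instance still fits on a single line.
line = json.dumps(example)
assert "\n" not in line

# Reading it back recovers the original structure, newline included.
recovered = json.loads(line)
assert recovered["question"] == example["question"]
```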
This wasn't really with the intention of doing seq2seq stuff; I was more looking at getting the OntoNotes data for SRL in as an Instance. It is already formatted in a table, but with the annotations separate from the sentences which they annotate. This is a reasonable format which there is no need to force into another one, apart from the fact that I can't read from multiple files using …. Also, I personally find the data loading part confusing, and people trying to add new data instances to DeepQA have found it difficult - or maybe not, and that is just my impression. Concrete problems: …
Ok - doing an SNLI loader and talking about the WikiBio dataset made me think you were going for the seq2seq stuff. Having now thought about this a bit more, I think these two options aren't mutually exclusive. For datasets with different representations on disk, we still need to convert them into a common …
Both of these seem reasonable, and it's pretty easy to allow both - just add a key to the json file that specifies the dataset type, with ….

So, yes, feel free to do it this way. Sorry for being ornery about it, and thanks for pushing back on my orneriness. This is a fine solution to the problem. But don't put all of the ….
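A minimal sketch of what that dispatch could look like; the class and registry names (concrete_datasets, SnliDataset, dataset_from_params) are assumptions based on this discussion, not the actual deep_qa code:

```python
from typing import Dict, Type

class Dataset:
    """Stand-in for deep_qa's Dataset base class, simplified for this sketch."""
    @classmethod
    def read_from_file(cls, filename: str, **params) -> "Dataset":
        raise NotImplementedError

class TextDataset(Dataset):
    @classmethod
    def read_from_file(cls, filename: str, **params) -> "Dataset":
        return cls()  # the real reader would parse one instance per line of `filename`

class SnliDataset(Dataset):
    @classmethod
    def read_from_file(cls, filename: str, **params) -> "Dataset":
        return cls()  # the real reader would parse the original SNLI format

# Hypothetical registry: a "type" key in the experiment JSON picks the class,
# with TextDataset as the default when no key is given.
concrete_datasets: Dict[str, Type[Dataset]] = {
    "text": TextDataset,
    "snli": SnliDataset,
}

def dataset_from_params(dataset_params: dict, filename: str) -> Dataset:
    params = dict(dataset_params)          # copy, so later reads see the same parameters
    type_key = params.pop("type", "text")  # default to the plain one-instance-per-line reader
    return concrete_datasets[type_key].read_from_file(filename, **params)
```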
Ah yes, that's true, let's do that. Yeah, I only put them in the same one as I wasn't sure if you would go for this, so it was minimal effort - I was thinking of putting them in the same files as their respective instances, so that all the code you need to read in order to understand how an instance can be loaded for a given type is in one file? I guess also renaming them from …
Either we could group them with the instances they create, or we could group them by task and have files with each dataset name. The latter is probably better, because a single dataset could be used to produce different kinds of instances (e.g., SQuAD for QA, answer sentence selection, or question generation), but all need the same loading code. Also the latter would be easier for someone browsing the code to see if we can load a dataset they're thinking of. This also argues against grouping them by task, actually, because a single dataset can be used for more than one task. So maybe just a module under ….
Thanks for doing this. There's still one blocking issue that needs to be fixed, though.
deep_qa/data/datasets/dataset.py
from ..common.util import add_noise_to_dict_values
from .data_indexer import DataIndexer
from .instances.instance import Instance, TextInstance, IndexedInstance
from deep_qa.common.util import add_noise_to_dict_values
Why remove the relative imports here? Also in data/__init__.py.
Sorry, I let PyCharm do some refactoring of a variable name for me and it destroyed so many things; I thought I'd got them all, but apparently not.
deep_qa/data/datasets/dataset.py
@@ -60,7 +71,10 @@ class TextDataset(Dataset):
    TextInstances aren't useful for much with Keras until they've been indexed. So this class just
    has methods to read in data from a file and converting it into other kinds of Datasets.
s/and converting/and convert/
deep_qa/data/datasets/dataset.py
label_count_str = label_count_str[:100] + '...'
logger.info("Finished reading dataset; label counts: %s", label_count_str)
return TextDataset(instances)
count_labels(instances)
The name here looks like it should return something, so seeing it with an ignored return value seems odd. Maybe call this print_label_counts, or log_label_counts?
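For reference, a minimal sketch of what a log_label_counts helper could look like, mirroring the logging code in the diff above; the exact signature is an assumption, not the code in this PR:

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)

def log_label_counts(instances) -> None:
    """Log how many instances carry each label; purely a side effect, nothing is returned."""
    label_counts = Counter(getattr(instance, "label", None) for instance in instances)
    label_count_str = str(dict(label_counts))
    if len(label_count_str) > 100:
        label_count_str = label_count_str[:100] + '...'
    logger.info("Finished reading dataset; label counts: %s", label_count_str)
```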
for line in open(filename, 'r'):
    example = json.loads(line)
    # TODO(mark) why does this not match snli? Fix.
What doesn't match SNLI?
The labels - for some reason the labels in the instance are not the same as in SNLI, which is why this logic is here.
Ah, got it. Yeah, feel free to fix.
Apart from this, it also requires redoing the data which we have in /efs, so I'm going to leave this out of this PR.
Yeah, no problem.
deep_qa/training/text_trainer.py
dataset_type_key = self.dataset_params.pop_choice("type", list(concrete_datasets.keys()),
                                                  default_to_first_choice=True)
dataset_type = concrete_datasets[dataset_type_key]
return dataset_type.read_from_file(files[0], self._instance_type())
You need to pass the remaining dataset_params in here. Also, you probably can't do pop_choice like this - subsequent dataset reads would not have the right dataset type. You should have a test somewhere (probably in TestTextTrainer) that reading two datasets with some simple parameters gives the same type. As it is, if you say you want an SNLI dataset, your training data will be an SNLI dataset, but your validation data will be a TextDataset.
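To make the requested check concrete, here is a sketch of such a test, written against the hypothetical dataset_from_params helper sketched earlier in this thread rather than the actual TestTextTrainer setup:

```python
def test_train_and_validation_datasets_get_the_same_type():
    dataset_params = {"type": "snli"}
    # Each read works on its own copy of the params, so consuming the "type" key while
    # loading the training data cannot silently downgrade the validation data to TextDataset.
    train_dataset = dataset_from_params(dataset_params, "train.jsonl")
    validation_dataset = dataset_from_params(dataset_params, "validation.jsonl")
    assert type(train_dataset) is type(validation_dataset)
```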
This is why I like immutable state, instead of mutable state... But the alternatives to having mutable state, while still doing easy parameter validation, aren't very good either. Maybe the best option is to have Params keep track of which items have been read, and instead of what we currently do for assert_empty, change it to just check that all keys in the dictionary have actually been read. A bit of a pain, but probably better than having these subtle bugs due to this mutable state. That's for another PR, though, and I'm not sure how high priority it is...
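A rough sketch of that read-tracking idea, as a simplified stand-in rather than the real Params class:

```python
class TrackingParams(dict):
    """Dict-like parameter container that records which keys have been read, so the
    end-of-configuration check can look for unused keys without mutating the dictionary."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._read_keys = set()

    def get(self, key, default=None):
        self._read_keys.add(key)
        return super().get(key, default)

    def assert_all_read(self, name: str = "params") -> None:
        unread = set(self.keys()) - self._read_keys
        if unread:
            raise ValueError("Unused keys in {}: {}".format(name, unread))
```

With something like this, pop_choice could become a non-destructive get_choice, and assert_empty would complain about keys that were never read instead of keys that were never popped.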
========
deep_qa.data.dataset
This title seems odd. But we can worry about this later, when we're to the point of cleaning up docs.
doc/data/datasets.rst
:undoc-members:
:show-inheritance:

Language Modelling
Isn't it "language modeling", with one l? Oh, just looked it up, this is another US/UK difference. I've been trying not to be picky about using American spellings, but I think this one deals with consistency in naming across modules. We have data.instances.language_modeling and models.language_modeling, but data.datasets.language_modelling - let's be consistent here.
Haha no worries, happy to have it wrong here 🇬🇧
tests/common/test_case.py
with codecs.open(self.TRAIN_FILE, 'w', 'utf-8') as train_file:
    train_file.write("This is a sentence for language modelling.\n")
    train_file.write("Here's another one for language modelling.\n")
def write_original_snli_data(self):
Missing blank line.
instances = []
for index in range(0, len(text) - sequence_length, sequence_length):
    word_sequence = " ".join(text[index: index + sequence_length])
Splitting on spaces like this seems a little problematic to me, actually. You really should be doing this with a tokenizer. But we can worry about that later, if/when we actually use this for language modeling. Can you at least leave a note here that we might want to revisit how this is done?
Yeah - I was going to rename the sentence_length attribute of the language modelling dataset to approximate_sentence_length for this very reason. I'll add a note. I think it would be better if we could get the tokeniser etc. passed as an argument to Dataset and then have the Instances use that tokeniser, instead of setting the class attribute as we currently do, but that's a separate discussion.
Agreed on the tokenizer being passed to the dataset - that's the plan in the data generator rewrite that I'm planning.
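As a rough illustration of that direction (assumed names, not the planned data generator API), the dataset could take a tokenizer as an argument and build windows over tokens instead of over space-split words:

```python
from typing import Callable, List

def make_language_modeling_windows(text: str,
                                   sequence_length: int,
                                   tokenize: Callable[[str], List[str]]) -> List[str]:
    """Split raw text into fixed-length token windows using an injected tokenizer."""
    tokens = tokenize(text)
    windows = []
    # Step through the token stream in non-overlapping windows of sequence_length tokens.
    for index in range(0, len(tokens) - sequence_length, sequence_length):
        windows.append(" ".join(tokens[index: index + sequence_length]))
    return windows

# Usage with a trivial whitespace tokenizer; a real word tokenizer could be passed in instead.
windows = make_language_modeling_windows("some long stream of raw text for language modeling", 3, str.split)
```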
The idea behind this PR is to make the loading of datasets more flexible.
At the moment, the only way to load data is to have an instance contained within a line of a single file. This is ok for some types of data, but is very inefficient and clunky for others, such as data which naturally spans multiple files (such as the Wikipedia bio dataset) or sequence tagging datasets which use windows of text for training (e.g. language modeling).

This change requires subclassing Dataset for each type of instance, but allows 3 new things:

- … a span_length parameter to read_from_file to split it up into 20 window span lengths on the fly (or approximate spans if the data is not already tokenised), reading these window spans into instances directly.
- … (index###question###answer etc.) and read directly from new datasets in whatever format they provide. This removes a big barrier to using the library with a new dataset (in my opinion).

Eventually, we could also remove Instance.read_from_line, replacing it with Dataset subclasses which know how to read that particular instance (or variants of it) from a directory. We only actually use the Dataset.read_from_file method in one place in Trainer, so we can just pass the dataset class in the parameter file and call the respective subclass's method to load data.

Comments/things I inevitably haven't thought of appreciated!
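Finally, to make the "data which naturally spans multiple files" case concrete, here is a hypothetical sketch of a directory-based reader; the class name and the sentences.txt / annotations.txt layout are invented for illustration and are not the WikiBio or SRL readers in this PR:

```python
import os
from typing import List, Tuple

def _read_lines(path: str) -> List[str]:
    with open(path, "r", encoding="utf-8") as data_file:
        return [line.strip() for line in data_file]

class DirectoryDataset:
    """Hypothetical reader for data spread over several files in one directory,
    e.g. sentences in one file and their annotations in another."""

    def __init__(self, examples: List[Tuple[str, str]]):
        self.examples = examples

    @classmethod
    def read_from_directory(cls, directory: str) -> "DirectoryDataset":
        sentences = _read_lines(os.path.join(directory, "sentences.txt"))
        annotations = _read_lines(os.path.join(directory, "annotations.txt"))
        # Pair each sentence with its annotation line; real code would build Instance objects here.
        return cls(list(zip(sentences, annotations)))
```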