This repository has been archived by the owner on Jan 19, 2019. It is now read-only.

Let datasets know how to load themselves #378

Merged
DeNeutoy merged 13 commits into allenai:master from the dataset-idea branch on Jun 9, 2017

Conversation

DeNeutoy
Contributor

@DeNeutoy DeNeutoy commented Jun 1, 2017

The idea behind this PR is to make the loading of datasets more flexible.

At the moment, the only way to load data is to have each instance contained within a single line of a single file. This is fine for some types of data, but it is inefficient and clunky for others, such as data which naturally spans multiple files (e.g. the Wikipedia bio dataset) or sequence tagging datasets which use windows of text for training (e.g. language modeling).

This change requires subclassing Dataset for each type of instance, but allows 3 new things:

  • The ability to read data from a filepath, not just a single file. This is useful when the data is naturally spread across multiple files, and it also makes data loading more modular - for example, swapping in sentences from different sources for sentence selection would be easy if the dataset reader read in 5 files, where the ith row of each file composed the ith instance.
  • The ability to pass parameters to the Dataset specifying how you want the data to be loaded. This is useful for language modeling: we could have a dataset which is just raw text, and pass a span_length parameter to read_from_file to split it into windows of, say, 20 tokens on the fly (or approximate spans if the data is not already tokenised), reading these windows directly into instances (see the sketch after this list).
  • We can cut out the intermediate representation of instances we are using (index###question###answer etc.) and read directly from new datasets in whatever format they provide. This removes a big barrier to using the library with a new dataset (in my opinion).
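
To make the second bullet concrete, here's a minimal sketch of the kind of on-the-fly splitting I mean (the function name and parameter handling are made up for illustration; this is not the code in the PR):

```python
from typing import Any, Dict, List

def read_language_modelling_spans(path: str, params: Dict[str, Any]) -> List[str]:
    """Split a plain-text file into fixed-length word windows on the fly.

    The same raw file can be reused with different span_length values
    without ever writing out a new intermediate dataset file.
    """
    span_length = params.get("span_length", 20)
    with open(path, "r", encoding="utf-8") as data_file:
        tokens = data_file.read().split()
    return [" ".join(tokens[start:start + span_length])
            for start in range(0, len(tokens) - span_length, span_length)]
```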

Eventually, we could also remove Instance.read_from_line, replacing it with Dataset subclasses which know how to read that particular instance (or variants of it) from a directory. We only actually use the Dataset.read_from_file method in one place in Trainer, so we can just pass the dataset class in the parameter file and call the respective subclass's method to load data.

Comments/things I inevitably haven't thought of are appreciated!

@DeNeutoy
Contributor Author

DeNeutoy commented Jun 1, 2017

Thinking a bit more about this: if we go with this idea of having configurable datasets, it would be better to make read_from_file non-static and have users instantiate a Dataset which then fills itself, rather than returning a new instance of the class. It's a bit confusing to have a dataset with attributes which are used via Params in read_from_file before they are set in __init__.

@matt-gardner
Contributor

This is related to #328. And, actually, I think having DatasetReaders is a better solution than having the Dataset class read the data directly. The reason for this is that you can have two datasets that perform the same task, but have different formats. You don't want the model code to have to know if it needs a SquadDataset, a WdwDataset, or a TriviaQaDataset object - that would get really messy, and require you to change modeling code whenever you add a new dataset. If you have a separate dataset reader script that puts each particular dataset into a common format, the modeling code can just know it needs a CharacterSpanInstance, and point to files that are formatted correctly.

You could argue that you could put the dataset instance selection code into Trainer or TextTrainer, so the model itself doesn't have to know about the particulars of datasets, but then you still have to specify the class name in the json parameter file, and you have to allow for third-party users to have a way of adding their dataset class into that code, which is messy. It's a lot cleaner to have a third-party user just write their own dataset reader script to put their instances into a common format.
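
As a rough sketch of what I mean by a dataset reader script (the field names and raw format here are illustrative, not any of our real readers):

```python
import json

def convert_to_common_format(raw_path: str, output_path: str) -> None:
    """Convert one particular dataset's on-disk format into the common
    one-instance-per-line format, so modeling code never sees the original."""
    with open(raw_path, "r", encoding="utf-8") as raw_file, \
         open(output_path, "w", encoding="utf-8") as output_file:
        for index, raw_line in enumerate(raw_file):
            example = json.loads(raw_line)  # assume one json object per raw line
            output_file.write("###".join([str(index),
                                          example["question"],
                                          example["answer"]]) + "\n")
```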

Addressing your three points:

  1. The code actually already permits this. That's the whole point of the BackgroundInstance, and why the API requires you to specify a list of files. Maybe this could be cleaner, and we could have done this for the SentenceSelectionInstances also, but it's already modular.

  2. The scala DatasetReader code is already parameterized, and the python reader code can be too, so nothing is lost on this point.

  3. The common intermediate format is actually important, not a bug, as described above. I think the barrier you're seeing to using the library is that the dataset reader code is in scala, not contained in this repo. I agree with you on that point.

I think the right way forward is to take the classes you've added here and convert them into DatasetReaders in the readers module. What do you think?

@DeNeutoy
Contributor Author

DeNeutoy commented Jun 1, 2017

I see what you mean about the problem of having the Trainer know about the dataset; I hadn't thought of that issue. We do still have this problem with Models anyway - could we solve it in the same way, by allowing users to pass a Dataset class to run_model if they want to use one which is not supported? I think this is more of a question of how to support json file specifications: clearly, if you want to use this code as a library and add models/classes, at some point you will not be able to specify json parameters directly unless you either modify them after loading with actual classes or add to our dicts of allowed types for different things.

In terms of BackgroundInstance, that's true, it lets you do this, but it also breaks from the API of how every other instance is read from files - i.e. they are added to a Dataset after reading the main instance type, not through calling read_from_file.

The problem I see with having DatasetReaders (presuming they stay as scripts) is that a user can't reuse a dataset with slightly different parameters, such as window_span for LM, without re-running their script and writing a new dataset file. If I want to do language modelling on all of Wikipedia, I really don't want to have three copies of Wikipedia with different window span lengths written into text files (overblown example, but you get what I mean).

@matt-gardner
Contributor

or you add to our dicts of allowed types for different things

Yeah, this was how I envisioned someone adding, e.g., a tuple matcher in third-party code. You import the dictionary, add a class, and then use methods from run.py. You could also do this with concrete models, if you wanted to.
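
Something like this is what I have in mind for third-party code (the module path and dictionary name here are illustrative, not checked against the repo):

```python
# Hypothetical third-party registration: import the dictionary of allowed
# types, add your class under a name, and then reference that name from the
# json parameter file when calling the methods in run.py.
from deep_qa.models import concrete_models  # illustrative import path
from my_package.models import MyCustomModel  # your own model class

concrete_models["my_custom_model"] = MyCustomModel
```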

In terms of BackgroundInstance, that's true, it lets you do this, but it also breaks from the API of how every other instance is read from files

Yeah, it's not the cleanest thing, but I don't think it breaks the API much: https://github.com/allenai/deep_qa/blob/master/deep_qa/models/memory_networks/memory_network.py#L140-L142.

three copies of wikipedia

This is a good point, and a use case I hadn't thought of. I don't think it's a huge deal to have several copies of SQuAD around, for instance, in part because it's hard to imagine another set of parameters, except in modifying SQuAD for use in a different task, like answer sentence selection.

Also, for language modeling, you probably don't want to load the whole dataset into memory anyway (because it wouldn't fit), you want to scan across a directory in a data generator, so the whole data processing code would need to be rethought, or just bypassed, for the language modeling case.

Another consideration is that for quick iteration time in running test experiments, you typically want to have as little data processing done in your experiment code as possible. Hence #352, which would cut out ~10 minutes of waiting when trying to debug something on the full SQuAD dataset. If we move to loading the data as you're suggesting, we add processing time on top of what we already have.

So I think for all tasks we have except language modeling, the dataset reader approach is sufficient, and preferable to having Dataset classes for each dataset format. Are there other tasks that are problematic for dataset readers?

@DeNeutoy
Contributor Author

DeNeutoy commented Jun 1, 2017

@tusharkhot here is the discussion following chatting about loading in json graphs.

@DeNeutoy
Contributor Author

DeNeutoy commented Jun 1, 2017

I agree that in general this is not a problem. However, I could imagine that this would make manipulating datasets really easy for weird and specific formulations which you want to try - imagine you want to know how BiDAF performs on "What" questions. If the dataset can be passed parameters in the json file, you could just run an evaluation passing a what_questions_only flag to the dataset, or something.
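
For example (the flag name, signature, and data format here are all made up for illustration):

```python
import json
from typing import Any, Dict, List

def read_questions(path: str, params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Load question instances, optionally keeping only "What" questions."""
    what_questions_only = params.get("what_questions_only", False)
    instances = []
    with open(path, "r", encoding="utf-8") as data_file:
        for line in data_file:
            example = json.loads(line)
            if what_questions_only and not example["question"].lower().startswith("what"):
                continue
            instances.append(example)
    return instances
```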

I don't quite see why this is incompatible with serialising pre-indexed and tokenised data, though. Part of that serialisation would be the dataset parameters, which are just loaded along with the pickled instances / padded text data file. I agree that it wouldn't be compatible with different dataset parameters, but if you are changing those, you're probably running longer jobs where the amount of time spent padding/indexing is inconsequential.

@matt-gardner
Contributor

What's the context on "loading in json graphs"?

Ok, so, in the interest of doing highest-priority work first, what are the actual problems with the current data loading code that we want to solve right now? It looks like you're thinking of doing your seq2seq stuff in DeepQA, is that right?

@tusharkhot

tusharkhot commented Jun 1, 2017

I think this PR is talking about a broader problem than what I was talking to Mark about. I was referring to the problem that currently the code assumes that every instance should fit on one line. With tuples/graphs encoded using JSON, we could still have a single instance on one line, at the cost of readability. So this is not too much of a problem for me currently, but it can become one - e.g. what if my strings have newlines and I can't strip/replace them, since my model uses those newlines?

So, I was wondering if you need the restriction of one line == one instance. Based on the discussion here, it seems that there is no such restriction; it is just not the common case and is harder to do. (Right?)
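
To illustrate the JSON option (a minimal sketch, just standard-library behaviour): embedded newlines are escaped by json.dumps, so an instance whose strings contain newlines still serialises to a single line and round-trips intact.

```python
import json

instance = {"text": "first line\nsecond line",
            "graph": {"nodes": [0, 1], "edges": [[0, 1]]}}

encoded = json.dumps(instance)           # newlines inside strings become \n escapes
assert "\n" not in encoded               # one instance still fits on one line
assert json.loads(encoded) == instance   # and the newlines survive the round trip
```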

@DeNeutoy
Contributor Author

DeNeutoy commented Jun 1, 2017

This wasn't really with the intention of doing seq2seq stuff; I was more looking at getting the OntoNotes data for SRL in as an Instance. It is already formatted as a table, but with the annotations separate from the sentences which they annotate. This is a reasonable format which there is no need to force into another one, apart from the fact that I can't read from multiple files using Instance.read_from_line.

Also, I personally find the data loading part confusing, and people trying to add new data instances to deepQA have found it difficult - or maybe not, and that is just my impression.

Concrete problems:

  • It is difficult to quickly alter the logic for loading a dataset.
  • Personally I think that having an intermediate data format is not helpful, but I am happy to be overruled here if I am the only one. If so, I can just close the PR.
  • Being able to read datasets from a filepath rather than from a line means that we aggregate all data loading code under the same API, and it is strictly more general.

@matt-gardner
Contributor

matt-gardner commented Jun 1, 2017

Ok - doing an SNLI loader and talking about the WikiBio dataset made me think you were going for the seq2seq stuff. Having now thought about this a bit more, I think these two options aren't mutually exclusive. For datasets with different representations on disk, we still need to convert them into a common Instance type, so those aren't going away. The only question is how to read the data from disk. We have two proposed solutions:

  1. The current way, which requires that every Instance type be read from a single line, and that we have DatasetReader scripts which convert datasets into the common single-line format. This is limiting in the kinds of processing you can do, but it simplifies the code that reads in the data - you don't have to know about particular dataset readers in model code (it also simplifies some data creation pipeline stuff, but that's all in the scala code, anyway).
  2. Allow Dataset subclasses to read in their instances from a filesystem path, likely bypassing the read_from_line methods on Instances entirely. This allows for more flexibility in how datasets are stored and loaded, and removes the need for DatasetReader scripts (largely containing the same code, just putting it in a different place), but it requires additional complexity in the modeling code.

Both of these seem reasonable, and it's pretty easy to allow both - just add a key to the json file that specifies the dataset type, with TextDataset being the default, and let the dataset class load stuff however it wants, defaulting to option (1) above. It'd also be pretty easy to add DatasetReader scripts that just call the Dataset.read_from_path methods and save the instances in a one-line-per-instance format.
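
Roughly something like this (a simplified sketch with placeholder classes; the real version would go through the Params machinery rather than a plain dict):

```python
from typing import Any, Dict

class TextDataset: ...               # placeholders standing in for the repo's classes
class SnliDataset: ...
class LanguageModellingDataset: ...

concrete_datasets = {
    "text": TextDataset,             # the default, so existing configurations keep working
    "snli": SnliDataset,
    "language_modelling": LanguageModellingDataset,
}

def dataset_class_from_params(params: Dict[str, Any]):
    """Pick the Dataset subclass named by the "type" key in the json parameters."""
    return concrete_datasets[params.get("type", "text")]
```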

So, yes, feel free to do it this way. Sorry for being ornery about it, and thanks for pushing back on my orneriness. This is a fine solution to the problem. But don't put all of the Dataset objects in the same file - that would get really big really fast. Not sure if it'd be overkill to use the same format we have for instances and models, organizing things by task and then by dataset, but it'd at least be consistent.

@DeNeutoy
Contributor Author

DeNeutoy commented Jun 1, 2017

Ah yes, that's true - let's do that. I only put them in the same file because I wasn't sure if you would go for this, so it was minimal effort. I was thinking of putting them in the same files as their respective instances, so that all the code you need to read in order to understand how an instance of a given type can be loaded is in one file? I guess also renaming them from instances to ... data_types or something?

@matt-gardner
Contributor

Either we could group them with the instances they create, or we could group them by task and have files with each dataset name. The latter is probably better, because a single dataset could be used to produce different kinds of instances (e.g., SQuAD for QA, answer sentence selection, or question generation), but all need the same loading code. Also the latter would be easier for someone browsing the code to see if we can load a dataset they're thinking of. This also argues against grouping them by task, actually, because a single dataset can be used for more than one task. So maybe just a module under data called datasets, with files like squad.py, snli.py, etc.? The trouble is that we already have datasets.py, so I'm not sure what to do there...

@DeNeutoy DeNeutoy changed the title [WIP] Let datasets know how to load themselves Let datasets know how to load themselves Jun 8, 2017
@DeNeutoy DeNeutoy requested a review from matt-gardner June 8, 2017 18:33
Contributor

@matt-gardner matt-gardner left a comment

Thanks for doing this. There's still one blocking issue that needs to be fixed, though.

from ..common.util import add_noise_to_dict_values
from .data_indexer import DataIndexer
from .instances.instance import Instance, TextInstance, IndexedInstance
from deep_qa.common.util import add_noise_to_dict_values
Contributor

Why remove the relative imports here? Also in data/__init__.py.

Contributor Author

Sorry, I let PyCharm do some refactoring of a variable name for me and it destroyed so many things; I thought I'd got them all, but apparently not.

@@ -60,7 +71,10 @@ class TextDataset(Dataset):
TextInstances aren't useful for much with Keras until they've been indexed. So this class just
has methods to read in data from a file and converting it into other kinds of Datasets.
Contributor

s/and converting/and convert/

label_count_str = label_count_str[:100] + '...'
logger.info("Finished reading dataset; label counts: %s", label_count_str)
return TextDataset(instances)
count_labels(instances)
Contributor

The name here looks like it should return something, so seeing it with an ignored return value seems odd. Maybe call this print_label_counts, or log_label_counts?

for line in open(filename, 'r'):
    example = json.loads(line)

# TODO(mark) why does this not match snli? Fix.
Contributor

What doesn't match SNLI?

Contributor Author

The labels - for some reason the labels in the instance are not the same as in SNLI, which is why this logic is here.

Contributor

Ah, got it. Yeah, feel free to fix.

Contributor Author

Apart from that, this also requires redoing the data which we have in /efs, so I'm going to leave it out of this PR.

Contributor

Yeah, no problem.

dataset_type_key = self.dataset_params.pop_choice("type", list(concrete_datasets.keys()),
                                                  default_to_first_choice=True)
dataset_type = concrete_datasets[dataset_type_key]
return dataset_type.read_from_file(files[0], self._instance_type())
Contributor

You need to pass the remaining dataset_params in here. Also, you probably can't do pop_choice like this - subsequent dataset reads would not have the right dataset type. You should have a test somewhere (probably in TestTextTrainer) that reading two datasets with some simple parameters gives the same type. As it is, if you say you want an SNLI dataset, your training data will be an SNLI dataset, but your validation data will be a TextDataset.
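
To spell out the failure mode (a simplified illustration in which Params behaves like a plain dict):

```python
# "type" is removed from the shared dataset_params by the first read, so the
# second read silently falls back to the default dataset class.
dataset_params = {"type": "snli"}

train_type = dataset_params.pop("type", "text")        # -> "snli"
validation_type = dataset_params.pop("type", "text")   # -> "text", not "snli"
assert train_type == "snli" and validation_type == "text"
```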

Contributor

This is why I like immutable state, instead of mutable state... But the alternatives to having mutable state, while still doing easy parameter validation, aren't very good either. Maybe the best option is to have Params keep track of which items have been read, and instead of what we currently do for assert_empty, change it to just check that all keys in the dictionary have actually been read. A bit of a pain, but probably better than having these subtle bugs due to this mutable state. That's for another PR, though, and I'm not sure how high priority it is...
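
Something like this, maybe (a sketch of the read-tracking idea, not existing code):

```python
from typing import Any, Dict, Set

class TrackedParams:
    """Record which keys have been read instead of popping them, then check
    at the end that every key in the dictionary was actually used."""

    def __init__(self, params: Dict[str, Any]) -> None:
        self._params = params
        self._read_keys: Set[str] = set()

    def get(self, key: str, default: Any = None) -> Any:
        self._read_keys.add(key)
        return self._params.get(key, default)

    def assert_all_read(self, class_name: str) -> None:
        unread = set(self._params) - self._read_keys
        if unread:
            raise ValueError("Unused parameters for {}: {}".format(class_name, unread))
```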

========


deep_qa.data.dataset
Contributor

This title seems odd. But we can worry about this later, when we're to the point of cleaning up docs.

:undoc-members:
:show-inheritance:

Language Modelling
Contributor

Isn't it "language modeling", with one l? Oh, just looked it up, this is another US/UK difference. I've been trying not to be picky about using American spellings, but I think this one deals with consistency in naming across modules. We have data.instances.language_modeling, and models.language_modeling, but data.datasets.language_modelling - let's be consistent here.

Contributor Author

Haha no worries, happy to have it wrong here 🇬🇧

with codecs.open(self.TRAIN_FILE, 'w', 'utf-8') as train_file:
    train_file.write("This is a sentence for language modelling.\n")
    train_file.write("Here's another one for language modelling.\n")
def write_original_snli_data(self):
Contributor

Missing blank line.


instances = []
for index in range(0, len(text) - sequence_length, sequence_length):
    word_sequence = " ".join(text[index: index + sequence_length])
Contributor

Splitting on spaces like this seems a little problematic to me, actually. You really should be doing this with a tokenizer. But we can worry about that later, if/when we actually use this for language modeling. Can you at least leave a note here that we might want to revisit how this is done?

Contributor Author

Yeah - I was going to rename the sentence_length attribute of the language modelling dataset to approximate_sentence_length for this very reason. I'll add a note. I think it would be better if we could get the tokeniser etc. passed as an argument to Dataset, and then have the Instances use that tokeniser, instead of setting the class attribute as we currently do - but that's a separate discussion.

Contributor

Agreed on the tokenizer being passed to the dataset - that's the plan in the data generator rewrite that I'm planning.

@DeNeutoy DeNeutoy merged commit eed3c5c into allenai:master Jun 9, 2017
@DeNeutoy DeNeutoy deleted the dataset-idea branch June 9, 2017 16:51