Flair 0.5 features #563

alanakbik · 2019-02-24T15:01:43Z

Here, I'd like to collect some ideas for features that we would like to see in the next version of Flair.

Ideas:

Refactor data loading methods. We currently load the entire training data set into memory, but this is a problem for large datasets (Iterating data fetcher for large training data sets #458, Unable to load corpus #457) and may also cause bottlenecks in GPU usage. Idea is to use the DataLoader abstraction (as currently used in the LanguageModelTrainer) for asynchronous loading from disk. This should make training over large datasets possible and may also significantly improve training speed.
Refactor flair.nn.Model and ModelTrainer. The ModelTrainer currently supports training SequenceLabeler and TextClassification classes, but community members have suggested other tasks, such as regression (Support for regression #440) or seq2seq ("Adding basic Seq2Seq model on top of Flair Embeddings. This is because in NMT context plays a important role." #560). The flair.nn.Model interface needs to be simplified (fewer methods) and generalized in such a way that implementing this interface will immediately enable training using the ModelTrainer class (see also Trainer only uses TextClassifier.load_from_file #474).
Multi-Task Learning: This one has been on our list for a while, but we'd like to add simple methods for training multiple tasks at the same time. To do this, we may need to refactor the embeddings classes to make it easier to expose internal states (see How to get intermediate values of pretrained models? #524).
Tokenization. Right now, we use segtok for tokenization, but maybe we can include other tokenizers (Which tokenizer does Flair use? #394), perhaps even our own trained over the UD corpora.
Multi-GPU support: With the changes to the new CUDA semantics introduced in 0.4.1 we can now look into multi-GPU support.
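
To make the data loading idea more concrete, here is a minimal sketch of lazy loading in plain Python. It is not Flair's implementation and does not use the PyTorch DataLoader; the function names (iter_column_sentences, iter_batches) and the two-column "token tag" file format with blank lines between sentences are assumptions made for illustration. The point is only that sentences are read from disk on demand, so memory use is bounded by the batch size rather than the corpus size.

```python
from typing import Iterator, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def iter_column_sentences(path: str) -> Iterator[TaggedSentence]:
    """Yield one sentence at a time from a two-column (token, tag) file.

    Sentences are separated by blank lines. Only the sentence currently
    being read is held in memory.
    """
    sentence: TaggedSentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # A blank line closes the current sentence.
                if sentence:
                    yield sentence
                    sentence = []
                continue
            token, tag = line.split()[:2]
            sentence.append((token, tag))
    # The file may not end with a blank line.
    if sentence:
        yield sentence


def iter_batches(path: str, batch_size: int) -> Iterator[List[TaggedSentence]]:
    """Group the lazily read sentences into mini-batches."""
    batch: List[TaggedSentence] = []
    for sentence in iter_column_sentences(path):
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

A trainer could loop over iter_batches(path, 32) once per epoch. Shuffling and asynchronous prefetching, which a real DataLoader would provide, are deliberately left out of this sketch.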

Any other ideas? Please let us know!

A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)

gccome · 2019-02-24T15:21:00Z

Currently, transfer learning is purely feature-based. Would you consider adding fine-tuning-based transfer learning for both sequence tagging and text classification? I think it would be a great addition.

ixxie · 2019-02-24T16:54:08Z

Disclaimers: I'm pretty new to Flair so this is probably at least somewhat misinformed. Also, this might be a bit of a tall order, especially for 0.5 since it alters the API significantly. Hopefully it's not a waste of your time!

Flair is a great library in that it provides a uniform interface for disparate NLP models in one place. So far I see two main weaknesses in the library in this respect:

Some things, like serialization/deserialization of the models and caching of downloaded models do not really have explicit interfaces in all models, which means at best I am left to my own devices in implementing a solution and at worst that I have to fight the library for control over these processes.
Some of the existing interfaces are different across different models, in particular Embedding is very different from the other models. SequenceTagger for example has a .predict method and Embeddings have .embed methods.

My suggestion is that the model class hierarchy would be refactored in a way that there is a new universal (abstract) base class for the whole library. It would include uniform interfaces with methods like:

.predict - incremental prediction
.train - incremental training
.train_predict - incremental training and prediction
.batch - training and/or prediction for a batch of files
.fetch - fetch pretrained model from the web
.load - deserialize model
.save - serialize model

Doing this would make it easier to combine the models and abstract over them. What I would love to see is that in the future I could instantiate a FlairModel with my choice of language model(s), tagger(s), embedder(s) and classifier(s) suitably stacked and linked. This class would help me seamlessly combine all of these into a single NLP model, the complete lifecycle of which I could control using a combined interface. Thus calling flairmodel.train(sentence) would train all the taggers, embeddings, and other models I put inside the model, simultaneously.
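
A rough sketch of what such a shared interface could look like, in plain Python. Everything here is hypothetical: Component, MajorityTagger and CombinedModel are illustrative names, not Flair classes, and only .train and .predict from the list above are covered. The toy tagger just memorizes the most frequent label per token so that the example stays self-contained.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

# Task name -> one label per token (an assumption made for this sketch).
Labels = Dict[str, List[str]]


class Component(ABC):
    """Hypothetical common base class; not part of Flair."""

    @abstractmethod
    def train(self, tokens: List[str], labels: Labels) -> None:
        """Update the component from a single labeled sentence."""

    @abstractmethod
    def predict(self, tokens: List[str]) -> Labels:
        """Return predicted labels for a single sentence."""


class MajorityTagger(Component):
    """Toy component: predicts the label seen most often for each token."""

    def __init__(self, task: str, default: str = "O") -> None:
        self.task = task
        self.default = default
        self.counts: Dict[str, Dict[str, int]] = {}

    def train(self, tokens: List[str], labels: Labels) -> None:
        for token, label in zip(tokens, labels[self.task]):
            per_token = self.counts.setdefault(token, {})
            per_token[label] = per_token.get(label, 0) + 1

    def predict(self, tokens: List[str]) -> Labels:
        predicted = []
        for token in tokens:
            seen = self.counts.get(token)
            predicted.append(max(seen, key=seen.get) if seen else self.default)
        return {self.task: predicted}


class CombinedModel(Component):
    """Forwards train/predict to every component it contains."""

    def __init__(self, components: Iterable[Component]) -> None:
        self.components = list(components)

    def train(self, tokens: List[str], labels: Labels) -> None:
        for component in self.components:
            component.train(tokens, labels)

    def predict(self, tokens: List[str]) -> Labels:
        merged: Labels = {}
        for component in self.components:
            merged.update(component.predict(tokens))
        return merged
```

With this shape, CombinedModel([MajorityTagger("pos"), MajorityTagger("ner")]).train(tokens, labels) updates both taggers in one call, which is the behaviour described for flairmodel.train(sentence).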

stefan-it · 2019-02-24T17:18:59Z

I plan to release language models (trained on Wikipedia dumps) for 16 languages (no, fa, ar, id, pl, da, hi, nl, eu, sl, he, hr, fi, bg, cs and sv). LMs are already trained, but I have to check their performance (at least on Universal Dependencies) first.
Support for OpenAI GPT embeddings is implemented; I'll prepare a PR in the next few days.
I think I found a way to use XLM embeddings in Flair, but integrating them poses a couple of challenges (library import in Python, license...).

alanakbik · 2019-02-25T07:59:27Z

@gccome @ixxie @stefan-it thanks for the input!

Fine-tuning is definitely a feature to add. Also it would be great to clean up the whole serialization process and make it more robust to changes between versions, etc.

Also agree on making everything more modular so researchers can stack components, though I am not sure if all modules need exactly the same interface since they have different functions. It may end up being more intuitive for users if embeddings have a .embed() method and models a .predict() method since they both do different things, but this is something we'll need to figure out.

@stefan-it really looking forward to your next PRs! :)

ixxie · 2019-02-25T08:23:06Z

@alanakbik I understand the concerns; since this is a longer discussion, I created a separate issue for this purpose: #567

davidsbatista · 2019-02-26T22:41:49Z

I vote on:

Refactor flair.nn.Model and ModelTrainer

stbirc · 2019-02-27T10:58:18Z

Refactor data loading methods is clearly preferred.
We currently try to train some NER models based on string embeddings. If stored in cache (which doesn't help to speed up the processes), the forward and backward string embeddings take nearly 35 GB each. A total of 70 GB would be somewhat too much for our memory...

mauryaland · 2019-02-28T15:27:25Z

One new feature could be the integration of magnitude embeddings. It is a "fast, lightweight tool for utilizing and processing embeddings". Very different trained embeddings, from glove to elmo, are already available. It can bring some homogenization in the different classes. Let me know if it seems a good idea!

PS: You can now pin this issue if you want.

alanakbik · 2019-02-28T15:49:51Z

@mauryaland hey this is a great idea - magnitude has really come a long way. Way back in the first version of Flair we looked at magnitude but there were some reasons why we eventually went with gensim (can't remember what exactly, something with regards to speed and serialization I think). From a quick glance at your links, I really think we should look at integrating magnitude for v0.5.

Hellisotherpeople · 2019-03-01T19:10:09Z

I agree, please work on
Refactor data loading methods
as much as possible. I'm waiting to deploy and open source a state of the art Sentence Compression model, but literally no one else has a good framework for doing binary PoS tagging for word-level extractive compression. Flair would be preferred except that I can't load my dataset into VRAM.

alanakbik · 2019-03-03T03:36:34Z

Ok! Yes, the data loader is definitely the priority for v0.5!

stefan-it · 2019-03-03T18:37:35Z

> A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)

Same here :) I'll be on vacation until 13th of March (maybe we could set this in our GitHub status here)

mfojtak · 2019-03-06T08:14:54Z

Please, see pull request #595 for implementation of "Refactor data loading methods"

alanakbik · 2019-03-06T13:19:45Z

@mfojtak wow this is great, thanks! :) We'll check it out!

mauryaland · 2019-03-07T15:37:52Z

Concerning the tokenization part, it could be a good idea to use the freshly released Python Stanford NLP library. We can use only the tokenizer by disabling other components such as lemmatization. With it, we can easily choose the language and get really good tokenization. If you are OK with it, I can try to implement it.

Another feature that could be nice for the SequenceTagger is the possibility to add metadata so that the entity recognizer respects existing entity spans and adjusts its predictions around them. It could be a one-hot encoding fed into the CRF, or some other approach.

Let me know if you are interested in those ideas!

alanakbik · 2019-03-07T23:35:01Z

Hello @mauryaland both ideas sound really interesting. I just checked the Python Stanford NLP library and it seems that it is fairly lightweight in terms of dependent libraries so including it should be possible! We'd very much appreciate an implementation that includes their multilingual tokenization!

For the second idea it would be great if this could be implemented as another Embeddings class, i.e. some sort of Gazetteer embedder that somehow encodes this metadata as one-hot embeddings at the word level. This way, we would have a nice way of including gazetteer knowledge (such as known entity spans) in any task, not just the SequenceTagger but also TextClassification and TextRegression etc.
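
As a rough illustration of the word-level encoding, here is a sketch in plain Python. It is not a Flair Embeddings class: the function name gazetteer_features, the gazetteer format (entity type mapped to a list of whitespace-separated phrases) and the case-insensitive exact matching are all assumptions. Each token gets one dimension per entity type, set to 1.0 when the token lies inside a matched gazetteer phrase of that type, so a token covered by several types ends up with several active dimensions.

```python
from typing import Dict, List


def gazetteer_features(tokens: List[str],
                       gazetteer: Dict[str, List[str]]) -> List[List[float]]:
    """Return one indicator vector per token, one dimension per entity type.

    Dimensions follow the alphabetical order of the entity types.
    """
    entity_types = sorted(gazetteer)
    vectors = [[0.0] * len(entity_types) for _ in tokens]
    lowered = [token.lower() for token in tokens]

    for dim, entity_type in enumerate(entity_types):
        for phrase in gazetteer[entity_type]:
            words = phrase.lower().split()
            if not words:
                continue
            # Slide over the sentence and mark every exact phrase match.
            for start in range(len(tokens) - len(words) + 1):
                if lowered[start:start + len(words)] == words:
                    for position in range(start, start + len(words)):
                        vectors[position][dim] = 1.0
    return vectors


tokens = ["George", "Washington", "visited", "New", "York"]
gazetteer = {"PER": ["George Washington"], "LOC": ["New York", "Washington"]}
features = gazetteer_features(tokens, gazetteer)
# Dimensions are [LOC, PER]; "Washington" matches both lists.
```

Such vectors could then be concatenated with the other word embeddings, which is what would make the gazetteer knowledge available to any downstream task.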

For both, we'd very much appreciate your contribution!

stefan-it · 2019-03-14T11:51:02Z

@alanakbik I've seen a lot of issues and work on evaluation metrics over several months here. What do you think about using the scikit-learn precision_recall_fscore_support function as a kind of addition (or replacement)?

alanakbik · 2019-03-18T13:16:18Z

@stefan-it yes absolutely agree - we could probably use scikit learn instead for all evaluation metrics except the span F1 measure for which we would need to add back in the CoNLL-03 script.
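
For context on why the span F1 measure needs separate handling: it scores whole entity spans rather than individual tokens, so a prediction that gets most tokens right can still miss the span. The sketch below is a simplified stand-in in plain Python, not the CoNLL-03 script; the function names and the lenient treatment of a stray I- tag (it opens a new span) are assumptions.

```python
from typing import List, Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: Sequence[str]) -> Set[Span]:
    """Collect entity spans from a BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[Sequence[str]],
            predicted: List[Sequence[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact span matches."""
    true_positives = n_gold = n_predicted = 0
    for gold_tags, predicted_tags in zip(gold, predicted):
        gold_spans = extract_spans(gold_tags)
        predicted_spans = extract_spans(predicted_tags)
        true_positives += len(gold_spans & predicted_spans)
        n_gold += len(gold_spans)
        n_predicted += len(predicted_spans)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
predicted = [["B-PER", "O", "O", "B-LOC"]]
print(span_f1(gold, predicted))  # (0.5, 0.5, 0.5), although 3 of 4 tokens match
```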

leo-gan · 2019-03-25T21:50:21Z

What about supporting Python 3.7? It is supported by most NLP packages now.

emrul · 2019-03-25T23:15:13Z

I'm becoming more familiar with Flair now and really impressed with it. My next task is to investigate productionisation options and in this vein I plan to look at torch.jit (https://pytorch.org/docs/stable/jit.html) - not sure if that's something that's on the project's roadmap?

mauryaland · 2019-03-26T10:06:20Z

Hello @alanakbik, I was wondering about the possibility of using a tokenizer other than segtok (for example the spaCy or stanfordnlp ones). What do you think of outsourcing this part, for example by making it possible to pass a list containing the tokens as an argument? This solution could be very convenient in my opinion.

An idea of implementation:

from typing import Dict, List, Union

from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions, word_tokenizer

# Token and Label are the classes already defined alongside Sentence in flair/data.py.


class Sentence:
    """
    A Sentence is a list of Tokens and is used to represent a sentence or text fragment.
    """

    def __init__(self, text: str = None, use_tokenizer: bool = False, tokens_from_list: List[str] = None, labels: Union[List[Label], List[str]] = None):

        super(Sentence, self).__init__()

        self.tokens: List[Token] = []

        self.labels: List[Label] = []
        if labels is not None:
            self.add_labels(labels)

        self._embeddings: Dict = {}

        # if text is passed, instantiate sentence with tokens (words)
        if text is not None:

            # tokenize the text first if option selected
            if use_tokenizer:

                # use a list of tokens obtained by any tokenizer of your choice
                if tokens_from_list is not None and len(tokens_from_list) > 0:
                    tokens = tokens_from_list.copy()

                else:
                    # otherwise fall back to segtok for tokenization
                    tokens = []
                    sentences = split_single(text)
                    for sentence in sentences:
                        contractions = split_contractions(word_tokenizer(sentence))
                        tokens.extend(contractions)

alanakbik · 2019-03-27T14:13:28Z

@leo-gan I think many users are already running Flair with Python 3.7. Our requirement is only 3.6+ so that should include newer versions. Or do you mean specific features of Python 3.7+?

@emrul better productization options would be great. It is on our mind but not part of a specific roadmap, so if you could share your findings on torch jit that would be great :) We also want to look into PyText in this context.

alanakbik · 2019-03-27T14:20:02Z

@mauryaland yes, tokenization is definitely a feature we want to work on more. Right now, the way to use an external tokenizer is to take your list of string tokens (tokens_from_list in your example) and join it into a whitespace-tokenized string when calling the constructor, like this:

sentence = Sentence(' '.join(tokens_from_list))

The way you posted might be more convenient so we'll definitely look into integrating an option like this!

mauryaland · 2019-03-27T14:29:09Z

@alanakbik Thanks for the tip for using an external tokenizer. Could I propose a PR with the code I suggested in order to launch the work on this topic?

stefan-it · 2019-04-01T21:57:53Z

@mauryaland Would be great :) I think we can then add several tokenizers (and even kind of "benchmark" them)!

stefan-it · 2019-04-05T14:35:14Z

> @stefan-it yes absolutely agree - we could probably use scikit learn instead for all evaluation metrics except the span F1 measure for which we would need to add back in the CoNLL-03 script.

@alanakbik Even with the latest master there's an F1-score vs. accuracy calculation bug for PoS tagging tasks; I will open an issue for that. I think we should really rely on the CoNLL-03 script (with, of course, correct conversion to the IOB tagging scheme).

BernierCR · 2019-08-07T19:03:23Z

Thanks for the advice. If I need more GPUs, they'd probably let me have more. I just wanted to prove responsibility and usefulness for the 2 GPU use case before asking for more.

For now, I just ran two separate hyperopt runs on different GPUs (I made the cuda_device selectable by env variable) and it worked.
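
A minimal sketch of that kind of environment-variable switch, in plain Python. The variable name FLAIR_CUDA_DEVICE and the helper device_from_env are made up for this example and are not something Flair reads on its own; the returned string is in the form that torch.device accepts.

```python
import os


def device_from_env(variable: str = "FLAIR_CUDA_DEVICE", default: str = "cpu") -> str:
    """Map an environment variable to a device string such as "cuda:1".

    The variable name is hypothetical. An unset or empty value falls
    back to `default`.
    """
    value = os.environ.get(variable, "").strip()
    if not value:
        return default
    if value.lower() == "cpu":
        return "cpu"
    if value.isdigit():
        return f"cuda:{value}"
    raise ValueError(f"{variable} must be a GPU index or 'cpu', got {value!r}")
```

Each hyperopt run can then be started with a different value of the variable, so that the two processes land on different GPUs.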

I might get to train my own embeddings this week, I hope so. If not this week, then soon for sure.

stefan-it · 2019-08-08T21:27:49Z

I also have new Flair Embeddings for the next release: better lm for Basque (more training data and longer training) and new lm for Tamil :)

richardliaw · 2019-08-21T23:08:17Z

Hi all, I work on https://github.com/ray-project/ray and maintain Tune. I stumbled upon this discussion just browsing github and was wondering if I could help.

RE: distributed hyperparameter tuning - Is there anything that we can do from the Tune side to help support your workload? I'd be happy to extend.

(FYI, not sure if this was a confusion, but you can easily do distributed data-parallel with distributed hyperparameter search in Tune.)

alanakbik · 2019-08-22T06:36:49Z

Hi @richardliaw we haven't yet looked into this, but hyperparameter selection is something that we really want to get better support for in the framework so thanks for reaching out! I'm sure once we get started with integrating Tune there'll be lots of questions from our side :)

stefan-it · 2019-08-27T22:46:53Z

It would be great if we could also support attention-based architectures, like the one proposed in that recent EMNLP paper: "Hierarchically-Refined Label Attention Network for Sequence Labeling":

> CRF has been used as a powerful model for statistical sequence labeling. For better representing label sequences, we investigate a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention.

That is really interesting, and a PyTorch implementation can be found in this repository by @Nealcly.

stefan-it · 2019-10-04T13:31:20Z

For one of the next releases I can provide the following (new) Flair Embeddings:

Greek (already trained), Estonian (already trained), Irish (already trained), Hungarian (currently training) and Romanian (scheduled after Hungarian).

I obtained specially prepared/collected corpora from the Leipzig Corpora Collection for these languages :)

alanakbik · 2019-10-04T13:47:26Z

Awesome! :)

stefan-it · 2019-10-04T23:07:35Z

@alanakbik One question: could you add/upload WordEmbeddings for el, et, ga and hu? They're currently missing and I would really like to run experiments with these word embeddings later :)

alanakbik · 2019-10-07T08:47:46Z

@stefan-it sure, will do!

alanakbik · 2019-10-07T13:08:49Z

@stefan-it embeddings are added, both X-crawl and X-wiki should work now for these languages!

stefan-it · 2019-10-07T17:02:42Z

Thanks @alanakbik I totally forgot the word embeddings for Tamil (ta) 😅 Could you also add them? I'm currently doing PoS tagging experiments for these languages (with those newly trained Flair Embeddings) and the results are looking great!

alanakbik · 2019-10-07T17:35:40Z

@stefan-it done! Look forward to the POS results! :)

LifeIsStrange · 2019-12-13T01:09:44Z

Please add coreference resolution and dependency parsing, I cannot use this wonderful library without it...
You could use stanfordnlp, which leverages CoreNLP, please!

aychang95 · 2020-02-20T17:35:18Z

Hi @gccome and all,

For fine-tuning based transfer learning, I built out an experimental library called AdaptNLP that takes a ULM-FIT approach for incorporating the latest transformers model architectures like BERT, GPT2, and ALBERT with Flair's prediction heads and trainers.

The library is located here: https://github.com/Novetta/adaptnlp and it's built atop Flair. It also provides some other interesting applications that I've been working on or thought were very cool/useful NLP-wise, so feel free to try it out and raise issues or feature requests in the repo. The library is in its early release stages, so please feel free to report any problems in the repo too.

Flair has also been an awesome dependency and a joy to work with, so please let me know if there's any way AdaptNLP could be useful for Flair's development as well!

stale · 2020-06-19T17:40:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

alanakbik · 2020-06-21T09:42:10Z

We can close this since 0.5 is out (and 0.5.1 coming soon), but the remaining points are still on our list, especially multi-task learning.

alanakbik mentioned this issue Feb 27, 2019

Unable to load corpus #457

Closed

alanakbik pinned this issue Feb 28, 2019

alanakbik mentioned this issue Mar 12, 2019

Support for custom architectures for the TextClassifier class #604

Closed

mauryaland mentioned this issue Apr 2, 2019

Use a list of tokens obtained by any external tokenizer in Sentence class #640

Closed

pommedeterresautee mentioned this issue Aug 22, 2019

Make tokenisation modular (easy integration of 3rd party lib) #1022

Merged

alanakbik pushed a commit that referenced this issue Aug 23, 2019

Merge branch 'GH-563-prepare-release-0-4-3' of github.com:zalandorese…

a754922

…arch/flair into GH-563-prepare-release-0-4-3

alanakbik pushed a commit that referenced this issue Aug 26, 2019

Merge branch 'GH-563-prepare-release-0-4-3' of github.com:zalandorese…

6d56f9b

…arch/flair into GH-563-prepare-release-0-4-3

alanakbik pushed a commit that referenced this issue Aug 26, 2019

Merge pull request #1037 from zalandoresearch/GH-563-prepare-release-…

53d3f0a

…0-4-3 GH-563: prepare release 0.4.3

alanakbik pushed a commit that referenced this issue Aug 26, 2019

GH-563: bump Flair version number

ee86c99

alanakbik mentioned this issue Sep 4, 2019

Scalable hyperparameter selection with Ray Tune #1067

Closed

maciek16180 mentioned this issue Sep 6, 2019

embeddings_storage_mode not impacting GPU memory consumption #1078

Closed

alanakbik pushed a commit that referenced this issue Oct 20, 2019

GH-563: bump Flair version for release

0158896

alanakbik pushed a commit that referenced this issue Oct 20, 2019

Merge pull request #1231 from zalandoresearch/GH-563-version-4-4

90215d2

GH-563: bump Flair version for release

alanakbik pushed a commit that referenced this issue Oct 20, 2019

Merge pull request #1232 from zalandoresearch/GH-563-version-4-4

6cc6449

Add BPEmbSerializable back in for serialized models

whoisjones unpinned this issue May 19, 2020

stale bot added the wontfix This will not be worked on label Jun 19, 2020

alanakbik closed this as completed Jun 21, 2020

alanakbik removed the wontfix This will not be worked on label Jun 21, 2020
