Flair 0.5 features #563

Closed
3 of 5 tasks
alanakbik opened this issue Feb 24, 2019 · 55 comments

Comments

@alanakbik
Collaborator

alanakbik commented Feb 24, 2019

Here, I'd like to collect some ideas for features that we would like to see in the next version of Flair.

Ideas:

Any other ideas? Please let us know!

A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)

@gccome

gccome commented Feb 24, 2019

Currently, the transfer learning is purely feature based, do you consider adding fine-tuning based transfer learning for both sequence tagging and text classification? I think it would be a great addition.

@ixxie

ixxie commented Feb 24, 2019

Disclaimers: I'm pretty new to Flair so this is probably at least somewhat misinformed. Also, this might be a bit of a tall order, especially for 0.5 since it alters the API significantly. Hopefully it's not a waste of your time!

Flair is a great library in that it provides a uniform interface for disparate NLP models in one place. So far I see two main weaknesses in the library in this respect:

  1. Some things, like serialization/deserialization of the models and caching of downloaded models do not really have explicit interfaces in all models, which means at best I am left to my own devices in implementing a solution and at worst that I have to fight the library for control over these processes.
  2. Some of the existing interfaces are different across different models, in particular Embedding is very different from the other models. SequenceTagger for example has a .predict method and Embeddings have .embed methods.

My suggestion is that the model class hierarchy would be refactored in a way that there is a new universal (abstract) base class for the whole library. It would include uniform interfaces with methods like:

  • .predict - incremental prediction
  • .train - incremental training
  • .train_predict - incremental training and prediction
  • .batch - training and/or prediction for a batch of files
  • .fetch - fetch pretrained model from the web
  • .load - deserialize model
  • .save - serialize model

Doing this would make it easier to combine the models and abstract over them. What I would love to see is that in the future I could instantiate a FlairModel with my choice of language model(s), tagger(s), embedder(s) and classifier(s) suitably stacked and linked. This class would help me seamlessly combine all of these into a single NLP model, whose complete lifecycle I could control through one combined interface. Thus calling flairmodel.train(sentence) would train all the taggers, embeddings, and other models I put inside the model, simultaneously.
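
As a rough illustration of the proposal above, the sketch below shows a shared abstract base class and a composite that forwards calls to its parts. All names here (FlairComponent, CountingTagger, and this FlairModel) are hypothetical and are not part of Flair's actual API; only two of the suggested methods are shown to keep it short.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class FlairComponent(ABC):
    """Hypothetical common base class; NOT part of Flair's real API."""

    @abstractmethod
    def train(self, sentence: str) -> None:
        """Incrementally train on one sentence."""

    @abstractmethod
    def predict(self, sentence: str) -> Any:
        """Return a prediction for one sentence."""


class CountingTagger(FlairComponent):
    """Toy component: counts training sentences and tags every token 'O'."""

    def __init__(self) -> None:
        self.seen = 0

    def train(self, sentence: str) -> None:
        self.seen += 1

    def predict(self, sentence: str) -> List[str]:
        return ["O" for _ in sentence.split()]


class FlairModel(FlairComponent):
    """Hypothetical composite that forwards each call to every stacked component."""

    def __init__(self, components: List[FlairComponent]) -> None:
        self.components = components

    def train(self, sentence: str) -> None:
        for component in self.components:
            component.train(sentence)

    def predict(self, sentence: str) -> List[Any]:
        return [component.predict(sentence) for component in self.components]
```

Because the composite implements the same interface as its parts, a FlairModel could itself be nested inside another FlairModel. Whether embeddings fit naturally under a predict method is an open design question.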

@stefan-it
Member

stefan-it commented Feb 24, 2019

  • I plan to release language models (trained on Wikipedia dumps) for 16 languages (no, fa, ar, id, pl, da, hi, nl, eu, sl, he, hr, fi, bg, cs and sv). LMs are already trained, but I have to check their performance (at least on Universal Dependencies) first.

  • Support for OpenAI GPT embeddings is implemented; I'll prepare a PR in the next few days

  • I think I found a way to use XLM embeddings in Flair, but integrating them raises a couple of challenges (library import in Python, license...)

@alanakbik
Collaborator Author

@gccome @ixxie @stefan-it thanks for the input!

Fine-tuning is definitely a feature to add. Also, it would be great to clean up the whole serialization process and make it more robust to changes between versions, etc.

Also agree on making everything more modular so researchers can stack components, though I am not sure if all modules need exactly the same interface since they have different functions. It may end up being more intuitive for users if embeddings have a .embed() method and models a .predict() method since they both do different things, but this is something we'll need to figure out.

@stefan-it really looking forward to your next PRs! :)

@ixxie

ixxie commented Feb 25, 2019

@alanakbik I understand the concerns; since this is a longer discussion, I created a separate issue for this purpose: #567

@davidsbatista
Contributor

davidsbatista commented Feb 26, 2019

I vote on:

  • Refactor flair.nn.Model and ModelTrainer

@stbirc

stbirc commented Feb 27, 2019

Refactor data loading methods is clearly preferred.
We currently try to train some NER models based on string embeddings. If stored in cache (which doesn't help to speed up the processes), the forward and backward string embeddings take nearly 35 GB each. A total of 70 GB would be somewhat too much for our memory...

@mauryaland
Contributor

One new feature could be the integration of magnitude embeddings. Magnitude is a "fast, lightweight tool for utilizing and processing embeddings". Many different pretrained embeddings, from GloVe to ELMo, are already available for it. It could bring some homogenization to the different embedding classes. Let me know if this seems like a good idea!

PS: You can now pin this issue if you want

@alanakbik
Collaborator Author

@mauryaland hey this is a great idea - magnitude has really come a long way. Way back in the first version of Flair we looked at magnitude but there were some reasons for why we eventually went with gensim (can't remember what exactly, something with regards to speed and serialization I think). From a quick glance at your links, I really think we should look at integrating magnitude for v0.5.

@alanakbik alanakbik pinned this issue Feb 28, 2019
@Hellisotherpeople

I agree, please work on
Refactor data loading methods
as much as possible. I'm waiting to deploy and open source a state of the art Sentence Compression model, but literally no one else has a good framework for doing binary PoS tagging for word-level extractive compression. Flair would be preferred except that I can't load my dataset into VRAM.

@alanakbik
Collaborator Author

Ok! Yes, the data loader is definitely the priority for v0.5!

@stefan-it
Member

stefan-it commented Mar 3, 2019

A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)

Same here :) I'll be on vacation until 13th of March (maybe we could set this in our GitHub status here)

@mfojtak

mfojtak commented Mar 6, 2019

Please, see pull request #595 for implementation of "Refactor data loading methods"

@alanakbik
Collaborator Author

@mfojtak wow this is great, thanks! :) We'll check it out!

@mauryaland
Contributor

Concerning the tokenization part, it could be a good idea to use the freshly released Python Stanford NLP library. We can use only the tokenizer by disabling other components such as lemmatization. With it, we can easily choose the language and get really good tokenization. If you are OK with this, I can try to implement it.

Another feature that could be nice for the SequenceTagger is the possibility to add metadata so that the entity recognizer respects the existing entity spans and adjusts its predictions around them. It could be a one-hot encoding fed into the CRF, or another idea.

Let me know if you are interested in those ideas!

@alanakbik
Collaborator Author

Hello @mauryaland, both ideas sound really interesting. I just checked the Python Stanford NLP library and it seems that it is fairly lightweight in terms of dependent libraries, so including it should be possible! We'd very much appreciate an implementation that includes their multilingual tokenization!

For the second idea, it would be great if this could be implemented as another Embeddings class, i.e. some sort of Gazetteer embedder that encodes this metadata as one-hot embeddings at the word level. This way, we would have a nice way of including gazetteer knowledge (such as known entity spans) in any task, not just the SequenceTagger but also TextClassification and TextRegression etc.
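
To make the word-level one-hot idea concrete, here is a minimal sketch in plain Python. It is not a Flair Embeddings class: the function name, the lowercase matching, and the dictionary layout of the gazetteer are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def gazetteer_one_hot(tokens: List[str], gazetteer: Dict[str, Set[str]]) -> List[List[int]]:
    """Return one vector per token with one dimension per gazetteer label.

    Hypothetical helper: a dimension is 1 if the lowercased token is listed
    under that label, else 0. Labels are sorted so the layout is stable.
    """
    labels = sorted(gazetteer)
    return [
        [1 if token.lower() in gazetteer[label] else 0 for label in labels]
        for token in tokens
    ]


# Toy gazetteer; dimensions are ordered as ["LOC", "PER"].
gazetteer = {"LOC": {"berlin", "paris"}, "PER": {"alan"}}
vectors = gazetteer_one_hot(["Alan", "lives", "in", "Berlin"], gazetteer)
# vectors == [[0, 1], [0, 0], [0, 0], [1, 0]]
```

Vectors like these could be concatenated with other word embeddings. Multi-token entries and known spans would need a matching step over token windows, which this sketch leaves out.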

For both, we'd very much appreciate your contribution!

@stefan-it
Member

@alanakbik I've seen a lot of issues and work on evaluation metrics over several months here. What do you think about using the scikit-learn precision_recall_fscore_support function as a kind of addition (or replacement)?

@alanakbik
Collaborator Author

@stefan-it yes absolutely agree - we could probably use scikit learn instead for all evaluation metrics except the span F1 measure for which we would need to add back in the CoNLL-03 script.
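
The reason span F1 needs separate handling is that it scores whole entity spans rather than individual tags. The sketch below illustrates that for a single sentence, assuming BIO tags. It is not the CoNLL-03 script and simplifies some cases (for example, an I- tag with no preceding B- tag is ignored); the function names are made up for this example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching spans."""
    gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
precision, recall, f1 = span_f1(gold, pred)
# One of the two gold spans is found and nothing spurious is predicted:
# precision == 1.0, recall == 0.5
```

A token-level metric would count three of four tags as correct here, while the span-level view reports that one of the two entities was missed.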

@leo-gan

leo-gan commented Mar 25, 2019

What about supporting Python 3.7? It is supported by most NLP packages now.

@emrul

emrul commented Mar 25, 2019

I'm becoming more familiar with Flair now and am really impressed with it. My next task is to investigate productionisation options, and in this vein I plan to look at torch.jit (https://pytorch.org/docs/stable/jit.html) - not sure if that's something that's on the project's roadmap?

@mauryaland
Contributor

mauryaland commented Mar 26, 2019

Hello @alanakbik, I was wondering about the possibility of using a tokenizer other than segtok (for example the spaCy or stanfordnlp ones). What do you think of outsourcing this part, for example by making it possible to pass a list containing the tokens as an argument? This solution could be very convenient, in my opinion.

An idea of implementation:

class Sentence:
    """
    A Sentence is a list of Tokens and is used to represent a sentence or text fragment.
    """

    def __init__(self, text: str = None, use_tokenizer: bool = False, tokens_from_list: List[str] = None, labels: Union[List[Label], List[str]] = None):

        super(Sentence, self).__init__()

        self.tokens: List[Token] = []

        self.labels: List[Label] = []
        if labels is not None: self.add_labels(labels)

        self._embeddings: Dict = {}

        # if text is passed, instantiate sentence with tokens (words)
        if text is not None:

            # tokenize the text first if option selected
            if use_tokenizer:

                # use a list of tokens obtained by any tokenizer of your choice
                if tokens_from_list is not None and len(tokens_from_list) > 0:
                    tokens = tokens_from_list.copy()

                else:
                    # use segtok for tokenization
                    tokens = []
                    sentences = split_single(text)
                    for sentence in sentences:
                        contractions = split_contractions(word_tokenizer(sentence))
                        tokens.extend(contractions)

@alanakbik
Collaborator Author

@leo-gan I think many users are already running Flair with Python 3.7. Our requirement is only 3.6+ so that should include newer versions. Or do you mean specific features of Python 3.7+?

@emrul better productization options would be great. It is on our mind but not part of a specific roadmap, so if you could share your findings on torch jit that would be great :) We also want to look into PyText in this context.

@alanakbik
Collaborator Author

@mauryaland yes, tokenization is definitely a feature we want to work on more. Right now, the way to accomplish using external tokenizers would be to take the tokens_from_list parameter (i.e. a list of string tokens) and make a whitespace tokenized string out of it when calling the constructor, like this:

sentence = Sentence(' '.join(tokens_from_list))

The way you posted might be more convenient so we'll definitely look into integrating an option like this!
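
For illustration, the workaround can be sketched without Flair installed. The regex tokenizer below is only a stand-in for whatever external tokenizer is used (spaCy, stanfordnlp, ...), and both function names are made up for this example.

```python
import re
from typing import List


def regex_tokenize(text: str) -> List[str]:
    """Stand-in external tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_whitespace_tokenized(tokens: List[str]) -> str:
    """Join tokens with single spaces so that splitting on whitespace recovers them."""
    return " ".join(tokens)


tokens_from_list = regex_tokenize("Hello, world!")
text = to_whitespace_tokenized(tokens_from_list)
# text == "Hello , world !"
# As described above, this string would then be passed on: Sentence(text)
```

The round trip only holds if no token contains whitespace itself. That is one reason passing the token list directly, as proposed, would be more robust.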

@mauryaland
Contributor

mauryaland commented Mar 27, 2019

@alanakbik Thanks for the tip for using an external tokenizer. Could I propose a PR with the code I suggested in order to launch the work on this topic?

@stefan-it
Member

@mauryaland Would be great :) I think we can then add several tokenizers (and even kind of "benchmark" them)!

@stefan-it
Member

@stefan-it yes absolutely agree - we could probably use scikit learn instead for all evaluation metrics except the span F1 measure for which we would need to add back in the CoNLL-03 script.

@alanakbik Even with the latest master there's an F1-score vs. accuracy calculation bug for PoS tagging tasks; I will open an issue for that. I think we should really rely on the CoNLL-03 script (with, of course, correct conversion to the IOB tagging scheme).

@BernierCR

Thanks for the advice. If I need more GPUs, they'd probably let me have more. I just wanted to prove responsibility and usefulness for the 2 GPU use case before asking for more.

For now, I just ran two separate hyperopt runs on different GPUs (I made the cuda_device selectable by env variable) and it worked.

I might get to train my own embeddings this week, I hope so. If not this week, then soon for sure.

@stefan-it
Member

I also have new Flair Embeddings for the next release: better lm for Basque (more training data and longer training) and new lm for Tamil :)

@richardliaw

richardliaw commented Aug 21, 2019

Hi all, I work on https://github.com/ray-project/ray and maintain Tune. I stumbled upon this discussion just browsing github and was wondering if I could help.

RE: distributed hyperparameter tuning - Is there anything that we can do from the Tune side to help support your workload? I'd be happy to extend.

(FYI, not sure if this was a confusion, but you can easily do distributed data-parallel with distributed hyperparameter search in Tune.)

@alanakbik
Collaborator Author

Hi @richardliaw we haven't yet looked into this, but hyperparameter selection is something that we really want to get better support for in the framework so thanks for reaching out! I'm sure once we get started with integrating Tune there'll be lots of questions from our side :)

alanakbik pushed a commit that referenced this issue Aug 23, 2019
alanakbik pushed a commit that referenced this issue Aug 26, 2019
alanakbik pushed a commit that referenced this issue Aug 26, 2019
alanakbik pushed a commit that referenced this issue Aug 26, 2019
@stefan-it
Member

stefan-it commented Aug 27, 2019

It would be great if we could also support attention-based model architectures, like the one proposed in the recent EMNLP paper "Hierarchically-Refined Label Attention Network for Sequence Labeling":

CRF has been used as a powerful model for statistical sequence labeling. For better representing label sequences, we investigate a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention.

That is really interesting, and a PyTorch implementation can be found in this repository by @Nealcly.

@stefan-it
Member

For one of the next releases I can provide the following (new) Flair Embeddings:

Greek (already trained), Estonian (already trained), Irish (already trained), Hungarian (currently training) and Romanian (scheduled after Hungarian).

I obtained specially prepared/collected corpora from the Leipzig Corpora Collection for these languages :)

@alanakbik
Collaborator Author

Awesome! :)

@stefan-it
Member

@alanakbik One question: could you add/upload WordEmbeddings for el, et, ga and hu? They're currently missing and I would really like to run experiments with these word embeddings later :)

@alanakbik
Collaborator Author

@stefan-it sure, will do!

@alanakbik
Collaborator Author

@stefan-it embeddings are added, both X-crawl and X-wiki should work now for these languages!

@stefan-it
Member

Thanks @alanakbik, I totally forgot the word embeddings for Tamil (ta) 😅 Could you also add them? I'm currently doing PoS tagging experiments for these languages (with the newly trained Flair Embeddings) and the results are looking great!

@alanakbik
Collaborator Author

@stefan-it done! Look forward to the POS results! :)

alanakbik pushed a commit that referenced this issue Oct 20, 2019
alanakbik pushed a commit that referenced this issue Oct 20, 2019
alanakbik pushed a commit that referenced this issue Oct 20, 2019
Add BPEmbSerializable back in for serialized models
@LifeIsStrange

Please add coreference resolution and dependency parsing, I cannot use this wonderful library without it...
You could use stanfordnlp, which leverages CoreNLP, please!

@aychang95
Contributor

Hi @gccome and all,

For fine-tuning based transfer learning, I built out an experimental library called AdaptNLP that takes a ULM-FIT approach for incorporating the latest transformers model architectures like BERT, GPT2, and ALBERT with Flair's prediction heads and trainers.

The library is located here: https://github.com/Novetta/adaptnlp and it's built atop Flair. It also provides some other interesting applications that I've been working on or thought were very cool/useful NLP-wise, so feel free to try it out and raise issues or feature requests in the repo. The library is in its early release stages.

Flair has also been an awesome dependency and a joy to work with, so please let me know if there's any way AdaptNLP could be useful for Flair's development as well!

@whoisjones whoisjones unpinned this issue May 19, 2020
@stale

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Jun 19, 2020
@alanakbik
Collaborator Author

We can close this since 0.5 is out (and 0.5.1 coming soon), but the remaining points are still on our list, especially multi-task learning.

@alanakbik alanakbik removed the wontfix This will not be worked on label Jun 21, 2020