Flair 0.5 features #563
Comments
Currently, the transfer learning is purely feature-based. Have you considered adding fine-tuning-based transfer learning for both sequence tagging and text classification? I think it would be a great addition.
Disclaimers: I'm pretty new to Flair so this is probably at least somewhat misinformed. Also, this might be a bit of a tall order, especially for 0.5 since it alters the API significantly. Hopefully it's not a waste of your time! Flair is a great library in that it provides a uniform interface for disparate NLP models in one place. So far I see two main weaknesses in the library in this respect:
My suggestion is that the model class hierarchy would be refactored in a way that there is a new universal (abstract) base class for the whole library. It would include uniform interfaces with methods like:
Doing this would make it easier to combine the models and abstract over them. What I would love to see is that in the future I could instantiate a |
@gccome @ixxie @stefan-it thanks for the input! Fine-tuning is definitely a feature to add. Also it would be great to clean up the whole serialization process and make it more robust to changes between versions etc. Also agree on making everything more modular so researchers can stack components, though I am not sure if all modules need exactly the same interface since they have different functions. It may end up being more intuitive for users if embeddings have a .embed() method and models a .predict() method since they both do different things, but this is something we'll need to figure out. @stefan-it really looking forward to your next PRs! :)
@alanakbik I understand the concerns; since this is a longer discussion, I created a separate issue for this purpose: #567
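To make the trade-off in the comment above concrete, here is a minimal sketch of the two-interface option: embeddings expose `.embed()`, models expose `.predict()`, and a model can still be stacked on any embedder. All class names (`Embedder`, `Predictor`, `LengthEmbedder`, `ThresholdPredictor`) are hypothetical and are not part of Flair.

```python
from abc import ABC, abstractmethod
from typing import List


class Embedder(ABC):
    """Hypothetical interface for components that turn tokens into vectors."""

    @abstractmethod
    def embed(self, tokens: List[str]) -> List[List[float]]:
        ...


class Predictor(ABC):
    """Hypothetical interface for components that turn tokens into labels."""

    @abstractmethod
    def predict(self, tokens: List[str]) -> List[str]:
        ...


class LengthEmbedder(Embedder):
    """Toy embedder: one feature per token, its character length."""

    def embed(self, tokens: List[str]) -> List[List[float]]:
        return [[float(len(token))] for token in tokens]


class ThresholdPredictor(Predictor):
    """Toy model stacked on any Embedder: labels tokens by their first feature."""

    def __init__(self, embedder: Embedder, threshold: float = 4.0):
        self.embedder = embedder
        self.threshold = threshold

    def predict(self, tokens: List[str]) -> List[str]:
        vectors = self.embedder.embed(tokens)
        return ["LONG" if vec[0] > self.threshold else "SHORT" for vec in vectors]


tagger = ThresholdPredictor(LengthEmbedder())
labels = tagger.predict(["Flair", "is", "modular"])
```

The point of the sketch is only that stacking does not require one shared method name: the predictor depends on the `Embedder` interface, not on a universal base class.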
I vote on:
Refactor data loading methods is clearly preferred. |
One new feature could be the integration of magnitude embeddings. It is a "fast, lightweight tool for utilizing and processing embeddings". A wide range of pretrained embeddings, from GloVe to ELMo, is already available. It could bring some homogenization across the different classes. Let me know if it seems like a good idea! PS: You can now pin this issue if you want.
@mauryaland hey this is a great idea - magnitude has really come a long way. Way back in the first version of Flair we looked at magnitude but there were some reasons for why we eventually went with gensim (can't remember what exactly, something with regards to speed and serialization I think). From a quick glance at your links, I really think we should look at integrating magnitude for v0.5. |
I agree, please work on |
Ok! Yes, the data loader is definitely the priority for v0.5! |
Same here :) I'll be on vacation until 13th of March (maybe we could set this in our GitHub status here) |
Please see pull request #595 for an implementation of "Refactor data loading methods".
@mfojtak wow this is great, thanks! :) We'll check it out! |
Concerning the tokenization part, it could be a good idea to use the freshly released Python Stanford NLP library. We can use only the tokenizer by disabling other components such as lemmatization. With it, we can easily choose the language and get really good tokenization. If you are OK with it, I can try to implement it. Another feature that could be nice for the SequenceTagger is the possibility to add metadata so that the entity recognizer respects existing entity spans and adjusts its predictions around them. It could be a one-hot encoder fed into the CRF, or another idea. Let me know if you are interested in those ideas!
Hello @mauryaland both ideas sound really interesting. I just checked the Python Stanford NLP library and it seems that it is fairly lightweight in terms of dependent libraries so including it should be possible! We'd very much appreciate an implementation that includes their multilingual tokenization! For the second idea it would be great if this could be implemented as another Embeddings class, i.e. some sort of Gazetteer embedder that somehow encodes this metadata as one-hot embeddings at the word level. This way, we would have a nice way of including gazetteer knowledge (such as known entity spans) in any task, not just the SequenceTagger but also TextClassification and TextRegression etc. For both, we'd very much appreciate your contribution!
@alanakbik I've seen a lot of issues and work on evaluation metrics over several months here. What do you think about using the scikit-learn
@stefan-it yes absolutely agree - we could probably use scikit-learn instead for all evaluation metrics except the span F1 measure, for which we would need to add back in the CoNLL-03 script.
What about supporting Python 3.7? It is supported by most NLP packages now.
I'm becoming more familiar with Flair now and really impressed with it. My next task is to investigate productionisation options and in this vein I plan to look at torch.jit (https://pytorch.org/docs/stable/jit.html) - not sure if that's something that's on the project's roadmap?
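As a rough illustration of the gazetteer idea, the sketch below encodes known entity spans as one-hot vectors at the word level. It is plain Python rather than a Flair Embeddings subclass, and the label inventory, the function name `gazetteer_one_hot`, and the `(start, end, label)` span format are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical label inventory; index 0 is reserved for "no known entity".
LABELS = ("O", "PER", "LOC", "ORG")


def gazetteer_one_hot(
    tokens: List[str],
    known_spans: List[Tuple[int, int, str]],
    labels: Sequence[str] = LABELS,
) -> List[List[float]]:
    """Return one one-hot vector per token.

    known_spans holds (start, end, label) triples with an exclusive end index.
    """
    index: Dict[str, int] = {label: i for i, label in enumerate(labels)}
    token_labels = ["O"] * len(tokens)
    for start, end, label in known_spans:
        # Clip spans that run past the end of the sentence.
        for position in range(start, min(end, len(tokens))):
            token_labels[position] = label

    vectors = []
    for label in token_labels:
        vector = [0.0] * len(labels)
        vector[index[label]] = 1.0
        vectors.append(vector)
    return vectors


tokens = ["George", "Washington", "went", "to", "Washington"]
features = gazetteer_one_hot(tokens, [(0, 2, "PER")])
```

Vectors like these could be concatenated with ordinary word embeddings, which is what would let any downstream task, not only sequence tagging, see the known spans.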
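The reason span F1 needs separate treatment is that it scores whole entity spans rather than individual tags. The following is a simplified sketch of that idea, not the CoNLL-03 script itself. It assumes BIO tags, counts only exact span matches, micro-averages over sentences, and leniently treats a stray `I-` tag as the start of a new span; the function names are made up for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, exclusive end, entity type)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence such as ["B-PER", "I-PER", "O"]."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], predicted: List[List[str]]) -> float:
    """Micro-averaged F1 over exact span matches across all sentences."""
    true_positives = n_gold = n_predicted = 0
    for gold_tags, predicted_tags in zip(gold, predicted):
        gold_spans, predicted_spans = bio_spans(gold_tags), bio_spans(predicted_tags)
        true_positives += len(gold_spans & predicted_spans)
        n_gold += len(gold_spans)
        n_predicted += len(predicted_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_predicted
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
predicted = [["B-PER", "I-PER", "O", "O"]]
score = span_f1(gold, predicted)
```

In this example the prediction finds one of the two gold spans and nothing else, so precision is 1.0 and recall is 0.5. A per-tag metric would instead count three of the four tags as correct, which is why tag-level metrics cannot stand in for the span measure.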
Hello @alanakbik, I was wondering about the possibility of using a tokenizer other than segtok (for example the spaCy or stanfordnlp ones). What do you think of outsourcing this part, for example by making it possible to pass a list of tokens as an argument? This solution could be very convenient in my opinion. An idea for the implementation:

```python
from typing import Dict, List, Union

from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions, word_tokenizer

# Label and Token are Flair's own data classes.


class Sentence:
    """
    A Sentence is a list of Tokens and is used to represent a sentence or text fragment.
    """

    def __init__(self, text: str = None, use_tokenizer: bool = False,
                 tokens_from_list: List[str] = None,
                 labels: Union[List[Label], List[str]] = None):
        super(Sentence, self).__init__()

        self.tokens: List[Token] = []
        self.labels: List[Label] = []
        if labels is not None:
            self.add_labels(labels)
        self._embeddings: Dict = {}

        # if text is passed, instantiate sentence with tokens (words)
        if text is not None:
            # tokenize the text first if option selected
            if use_tokenizer:
                # use a list of tokens obtained by any tokenizer of your choice
                if tokens_from_list is not None and len(tokens_from_list) > 0:
                    tokens = tokens_from_list.copy()
                else:
                    # use segtok for tokenization
                    tokens = []
                    sentences = split_single(text)
                    for sentence in sentences:
                        contractions = split_contractions(word_tokenizer(sentence))
                        tokens.extend(contractions)
```
@leo-gan I think many users are already running Flair with Python 3.7. Our requirement is only 3.6+ so that should include newer versions. Or do you mean specific features of Python 3.7+? @emrul better productization options would be great. It is on our mind but not part of a specific roadmap, so if you could share your findings on torch jit that would be great :) We also want to look into PyText in this context. |
@mauryaland yes, tokenization is definitely a feature we want to work on more. Right now, the way to use an external tokenizer would be to take the tokens, join them with spaces, and pass the resulting string to Sentence: sentence = Sentence(' '.join(tokens_from_list)) The way you posted might be more convenient so we'll definitely look into integrating an option like this!
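A small sketch of that workaround, without importing Flair so that it runs on its own. The regex tokenizer is only a stand-in for whatever external tokenizer is used (spaCy, stanfordnlp, ...), and `external_tokenizer` is a hypothetical name.

```python
import re
from typing import List


def external_tokenizer(text: str) -> List[str]:
    """Stand-in for any third-party tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens_from_list = external_tokenizer("Flair isn't slow, is it?")
whitespace_text = " ".join(tokens_from_list)

# With Flair installed, the joined string would be passed on as in the comment above:
#     sentence = Sentence(whitespace_text)
# Splitting on single spaces recovers the external tokens exactly,
# provided no token itself contains a space.
recovered = whitespace_text.split(" ")
```

The caveat in the last comment is the main limitation of the join-based approach: a tokenizer that emits tokens containing spaces would not survive the round trip, which is one argument for accepting a token list directly.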
@alanakbik Thanks for the tip for using an external tokenizer. Could I propose a PR with the code I suggested in order to launch the work on this topic? |
@mauryaland Would be great :) I think we can then add several tokenizers (and even kind of "benchmark" them)!
@alanakbik Even with the latest |
Thanks for the advice. If I need more GPUs, they'd probably let me have more. I just wanted to prove responsibility and usefulness for the 2 GPU use case before asking for more. For now, I just ran two separate hyperopt runs on different GPUs (I made the cuda_device selectable by env variable) and it worked. I might get to train my own embeddings this week, I hope so. If not this week, then soon for sure. |
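Selecting the GPU through an environment variable, as described above, could look roughly like this. The variable name `CUDA_DEVICE` and the helper `device_from_env` are assumptions; the post does not say what was actually used.

```python
import os


def device_from_env(variable: str = "CUDA_DEVICE", default: str = "0") -> str:
    """Build a device string such as "cuda:1" from an environment variable.

    The variable name is hypothetical; the post does not say which one was used.
    """
    index = os.environ.get(variable, default)
    if not index.isdigit():
        raise ValueError(f"{variable} must be a non-negative integer, got {index!r}")
    return f"cuda:{index}"


os.environ["CUDA_DEVICE"] = "1"  # normally set in the shell, once per hyperopt run
device = device_from_env()
```

The resulting string can be handed to `torch.device(...)`. Each hyperopt run then gets its own value of the variable, so two runs launched from two shells land on different GPUs.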
I also have new Flair Embeddings for the next release: better lm for Basque (more training data and longer training) and new lm for Tamil :) |
Hi all, I work on https://github.com/ray-project/ray and maintain Tune. I stumbled upon this discussion just browsing github and was wondering if I could help. RE: distributed hyperparameter tuning - Is there anything that we can do from the Tune side to help support your workload? I'd be happy to extend. (FYI, not sure if this was a confusion, but you can easily do distributed data-parallel with distributed hyperparameter search in Tune.) |
Hi @richardliaw we haven't yet looked into this, but hyperparameter selection is something that we really want to get better support for in the framework so thanks for reaching out! I'm sure once we get started with integrating Tune there'll be lots of questions from our side :) |
…arch/flair into GH-563-prepare-release-0-4-3
…arch/flair into GH-563-prepare-release-0-4-3
…0-4-3 GH-563: prepare release 0.4.3
It would be great if we could also support attention-based architectures, like the one proposed in the recent EMNLP paper "Hierarchically-Refined Label Attention Network for Sequence Labeling":
That is really interesting, and a PyTorch implementation can be found in this repository by @Nealcly. |
For one of the next releases I can provide the following (new) Flair Embeddings: Greek (already trained), Estonian (already trained), Irish (already trained), Hungarian (currently training) and Romanian (scheduled after Hungarian). I obtained specially prepared/collected corpora from the Leipzig Corpora Collection for these languages :)
Awesome! :) |
@alanakbik One question: could you add/upload |
@stefan-it sure, will do! |
@stefan-it embeddings are added, both |
Thanks @alanakbik I totally forgot the word embeddings for Tamil (ta) 😅 Could you also add them? I'm currently doing PoS tagging experiments for these languages (with the newly trained Flair Embeddings) and the results are looking great!
@stefan-it done! Look forward to the POS results! :) |
GH-563: bump Flair version for release
Add BPEmbSerializable back in for serialized models
Please add coreference resolution and dependency parsing, I cannot use this wonderful library without it... |
Hi @gccome and all, For fine-tuning-based transfer learning, I built an experimental library called AdaptNLP that takes a ULM-FIT approach to incorporating the latest transformer model architectures like BERT, GPT2, and ALBERT with Flair's prediction heads and trainers. The library is located here: https://github.com/Novetta/adaptnlp and it's built atop Flair. It also provides some other interesting applications I've been working on or thought were very cool/useful NLP-wise, so feel free to try it out and raise issues or feature requests in the repo. The library is in its early release stages so please feel free to address any issues in the repo too. Flair has also been an awesome dependency and a joy to work with, so please let me know if there's anything AdaptNLP could be useful for in Flair's development as well!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
We can close this since 0.5 is out (and 0.5.1 coming soon), but the remaining points are still on our list, especially multi-task learning. |
Here, I'd like to collect some ideas for features that we would like to see in the next version of Flair.
Ideas:
The flair.nn.Model interface needs to be simplified (fewer methods) and generalized in such a way that implementing this interface will immediately enable training using the ModelTrainer class (see also "Trainer only uses TextClassifier.load_from_file", #474).
We use segtok for tokenization, but maybe we can include other tokenizers (see "Which tokenizer does Flair use?", #394), perhaps even our own trained over the UD corpora.
Any other ideas? Please let us know!
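One way to picture the interface idea from the list above: if the trainer only ever calls a small, fixed set of methods, then any class implementing them becomes trainable without changes to the trainer. The sketch below is a toy under that assumption. The two-method `Model` interface, the `Trainer` class, and the `Slope` model are hypothetical and do not reflect the actual `flair.nn.Model` or `ModelTrainer` signatures.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

Example = Tuple[float, float]  # (input, target)


class Model(ABC):
    """Hypothetical two-method interface; not Flair's actual flair.nn.Model."""

    @abstractmethod
    def loss(self, batch: List[Example]) -> float:
        """Return the mean loss on a batch."""

    @abstractmethod
    def step(self, batch: List[Example], learning_rate: float) -> None:
        """Update parameters using one batch."""


class Trainer:
    """Generic loop that relies only on the Model interface."""

    def __init__(self, model: Model, data: List[Example]):
        self.model, self.data = model, data

    def train(self, epochs: int = 50, learning_rate: float = 0.05,
              batch_size: int = 2) -> float:
        for _ in range(epochs):
            for i in range(0, len(self.data), batch_size):
                self.model.step(self.data[i:i + batch_size], learning_rate)
        return self.model.loss(self.data)


class Slope(Model):
    """Toy model y = w * x trained with squared error."""

    def __init__(self) -> None:
        self.w = 0.0

    def loss(self, batch: List[Example]) -> float:
        return sum((self.w * x - y) ** 2 for x, y in batch) / len(batch)

    def step(self, batch: List[Example], learning_rate: float) -> None:
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (self.w * x - y) * x for x, y in batch) / len(batch)
        self.w -= learning_rate * gradient


model = Slope()
final_loss = Trainer(model, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]).train()
```

`Trainer` never refers to `Slope` directly, so a tagger or classifier implementing the same two methods could be dropped in unchanged. That is the property the idea above asks of the real interface.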
A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)