Multiword expressions tagger #1
Comments
Hi,
Hi @Liontooth, I am Roque López, a master's student in Computer Science at São Paulo University, Brazil. I have a strong interest in Natural Language Processing, and for this reason I am doing my master's on Opinion Of the ideas listed on your GSoC page, I am very interested in this one. I have already exchanged some emails with Professor Steen. I ran some experiments for this tagger. The implementation is available here: https://github.com/rlopezc27/Multiword_Expression_Tagger In the small sample of the corpus I did not find any multiword expressions, so I added some to verify that the script is correct. Am I on the right track? I would appreciate any suggestions. More about me on my personal page [1] and GitHub account [2]. Best regards, [1] http://nilc.icmc.usp.br/nilc/pessoas/rlopez/ (server migration in progress) or http://br.linkedin.com/in/roquelopez
Hey guys, I saw this and was confused about what exactly the input will be. We need a pattern to feed the mwetoolkit, so I'm guessing that the pattern (such as: UNIT OF TIME + MANNER-OF-MOTION VERB) will also be needed as input? Also, the "input text" you have specified: will this be a file or a single string? (A single string seems to make little sense to me, but the issue, at one place, assumes the input to be "AND SO THE YEARS ROLLED BY.")
Python tagger for multiword expression lexicon
See the description in Spanish.
This is a task related to research on language and gesture with the NewsScape Library of International Television News. NewsScape is hosted by the University of California Los Angeles Library and developed by the Red Hen consortium for research on multimodal communication. Besides UCLA, Red Hen has capture nodes and research teams at Case Western Reserve University, University of Illinois at Urbana-Champaign, University of Southern Denmark, University of Oxford, University of Osnabrück, Texas Tech, National Institute for Advanced Studies in Bangalore, University of Navarra, University of Murcia, and other places (the consortium is constantly expanding). NewsScape contains more than 200,000 hours of television news in English, Spanish and other European languages, indexed by their subtitles/closed captioning (more than 3 billion words). Among other functionalities, NewsScape is the first audiovisual database that allows for synchronized searches of subtitles and images. Its search results take the user to the exact moment of the show when the words in the subtitles/closed captioning were uttered.
Almost all large linguistic corpora to date are written corpora (Corpus of American English, CREA and CORDE from Spain’s Royal Academy, newspaper archives, etc.). NewsScape opens new horizons for the study of oral communication alongside the great variety of elements that accompany verbal expression: gesture and intonation, along with, in the case of television, music, image and sound effects, graphics, etc. NewsScape also facilitates the study of particular news items, topics, statements by individuals or institutions, etc. We are developing automatic and manual search and annotation tools for semantic patterns. Besides verbal patterns, we are also developing tools for face recognition, detection of visual patterns, story segmentation, etc. The research groups at Navarra and Murcia are developing the SCHEMOTIME project, which compares language and gesture in the expression of emotions and time, two central concepts for theories of metaphor and cognition. In addition, the collaboration between Navarra and Murcia leads the development of NewsScape in Spanish.
The present task is to write a program that receives an input text in natural language and tags certain phrases. The phrases to be tagged are multiword expressions of time, such as "the years rolled by".
Python is probably the right programming language, given the libraries available (we recommend mwetoolkit).
Part of the job is already done by a preprocessor that tags Parts of Speech (prepositions, verbs, nouns, etc.) in the raw text.
For instance, the raw text may be the sentence, "AND SO THE YEARS ROLLED BY."
A tool called MBSP, from the CLiPS research group at the University of Antwerp, tags it like this, using the pipe symbol to separate tokens and the slash to separate the fields of each token:
"and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|././O/O/."
You are not expected to understand those annotations yet, just know that they exist and that they are what your program will use.
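As a rough illustration of how a program might read this tagged format, here is a minimal Python sketch that splits the sample string into tokens and fields. The field names (`word`, `pos`, `chunk`, `pnp`, `lemma`) are an assumption based on the usual MBSP output order and are not defined in this issue, and the helper `parse_mbsp` is hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    # Field names are assumed; the issue only shows five slash-separated fields.
    word: str
    pos: str
    chunk: str
    pnp: str
    lemma: str


def parse_mbsp(tagged: str) -> List[Token]:
    """Split a pipe-separated MBSP string into Token tuples."""
    tokens = []
    for item in tagged.strip().strip('"').split("|"):
        fields = item.split("/")
        if len(fields) != 5:
            # Words containing a literal slash are not handled in this sketch.
            raise ValueError(f"expected 5 fields, got {len(fields)}: {item!r}")
        tokens.append(Token(*fields))
    return tokens


sample = (
    "and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|"
    "years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|././O/O/."
)
tokens = parse_mbsp(sample)
print([t.lemma for t in tokens])
# ['and', 'so', 'the', 'year', 'roll', 'by', '.']
```

The lemma and part-of-speech fields are the ones a lexicon matcher would most likely compare against word lists such as UNIT OF TIME.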
The multiword expressions are specified through a combination of lists of words and these prepared Parts of Speech tags. The full set of specifications is called a lexicon.
For instance, an expression may have the structure As + UNIT OF TIME + MOTION VERB + PREPOSITION. Some examples: As centuries float slowly by, As the seconds trickled past, As the holidays slowly snuck up on her. The construction is further specified as follows in the lexicon:
So the lexicon defines the multiword expression, and the program must locate that expression in the source text. Three steps are needed:
The final product is a utility to which the user submits a sentence; the utility tags the sentence according to the multiword expression lexicon. The utility should support a socket server mode.
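One possible shape for the socket server mode, sketched with Python's standard `socketserver` module. The line-based protocol (one sentence per line in, one tagged line out), the names `tag_sentence`, `TaggerHandler` and `make_server`, and the port number in the usage note are all assumptions for illustration; `tag_sentence` is a placeholder that returns its input unchanged.

```python
import socketserver


def tag_sentence(sentence: str) -> str:
    """Placeholder for the real lexicon-based tagger (returns input unchanged)."""
    return sentence


class TaggerHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        # Assumed protocol: the client sends one sentence per line
        # and receives one tagged line per sentence.
        for raw in self.rfile:
            sentence = raw.decode("utf-8").rstrip("\r\n")
            reply = tag_sentence(sentence)
            self.wfile.write((reply + "\n").encode("utf-8"))


def make_server(host: str = "127.0.0.1", port: int = 0) -> socketserver.ThreadingTCPServer:
    """Create the server; port 0 lets the operating system pick a free port."""
    return socketserver.ThreadingTCPServer((host, port), TaggerHandler)
```

To run it, something like `with make_server(port=9999) as server: server.serve_forever()` would listen until interrupted. A threaded server is used here so that one slow client does not block the others.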
The project will be mentored by software developers in the Red Hen Lab, which includes faculty at University of Navarra in Spain and the University of California in Los Angeles.
Sample Lexicon of English Time Expressions
Example sentences:
- Time flies.
- Days shuffle.
- Holidays sneak up on.
- Months come tumbling down.
- The years rolled slowly past.
UNITS OF TIME: afternoon, age, autumn, century, dawn, decade, evening, fall, holiday, holidays, hour, night, midday, midnight, millennium, millisecond, minute, moment, month, morning, morrow, noon, period, second, spring, summer, today, tomorrow, tonight, twilight, week, weekday, weekend, winter, yesterday. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. January, February, March, April, May, June, July, August, September, October, November, December. Time.
Also, some nouns and pronouns referring to processes: movie, stay, course, class, lecture, show, concert, exam, party, meeting, match, war, Christmas, summer, season, it (all), race, project, recording, visit. * This list is expandable.
MANNER-OF-MOTION VERBS: fly, shuffle, sneak up, come tumbling down, roll slowly/quickly past, run, walk, bounce, drift, drop, float, glide, move, roll, slide, swing, revolve, rotate, spin, turn, twirl, twist, whirl, wind, amble, bolt, bounce, charge, coast, crawl, creep, dart, dash, dodder, drift, flit, float, fly, frolic, gallop, glide, hasten, hike, hobble, hop, hurry, inch, jump, leap, lurch, march, meander, mince, parade, perambulate, plod, promenade, prowl, race, ramble, roam, roll, run, rush, saunter, scurry, scutter, scuttle, shamble, shuffle, skedaddle, skip, slide, slink, slither, slog, slouch, sneak, speed, stagger, stray, streak, stroll, strut, stumble, swagger, sweep, swim, tear, tiptoe, toddle, totter, traipse, travel, troop, trot, vault, walk, wander, whiz, zigzag, zoom. * This list is expandable.
- As seconds go by.
- As minutes pass on.
- As days go on.
- As centuries go by.
- As hours pass by.
- As years go by.
This can already be captured, but we want to tag it automatically as a class of multi-word time expressions.
This sample lexicon will be expanded, but contains the typical construction types the program needs to handle.
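To make the UNIT OF TIME + MANNER-OF-MOTION VERB construction concrete, here is a minimal sketch in plain Python. It does not use mwetoolkit, and it is not the project's actual matching method. The function `find_time_mwes`, the `(word, pos, lemma)` tuple layout, and the rule of allowing adverbs between the parts are assumptions. The word sets are small subsets of the lists above; "year" and "day" are added only because the example sentences use them.

```python
from typing import List, Tuple

# Small illustrative subsets of the sample lexicon.
# "year" and "day" are assumed additions taken from the example sentences.
UNITS_OF_TIME = {"century", "second", "minute", "hour", "month", "holiday", "year", "day"}
MOTION_VERBS = {"fly", "shuffle", "sneak", "roll", "float", "drift", "creep"}

TaggedToken = Tuple[str, str, str]  # (word, part-of-speech tag, lemma)


def find_time_mwes(tokens: List[TaggedToken]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, of candidate time expressions."""
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        _, pos, lemma = tokens[i]
        if pos.startswith("NN") and lemma in UNITS_OF_TIME:
            # Include a preceding determiner ("the years").
            start = i - 1 if i > 0 and tokens[i - 1][1] == "DT" else i
            j = i + 1
            while j < n and tokens[j][1].startswith("RB"):  # skip adverbs
                j += 1
            if j < n and tokens[j][1].startswith("VB") and tokens[j][2] in MOTION_VERBS:
                end = j
                k = j + 1
                while k < n and tokens[k][1].startswith("RB"):  # "rolled slowly past"
                    k += 1
                if k < n and tokens[k][1] in {"RP", "IN"}:  # particle or preposition
                    end = k
                spans.append((start, end + 1))
                i = end + 1
                continue
        i += 1
    return spans


sentence = [
    ("and", "CC", "and"), ("so", "IN", "so"), ("the", "DT", "the"),
    ("years", "NNS", "year"), ("rolled", "VBN", "roll"), ("by", "RP", "by"),
    (".", ".", "."),
]
for start, end in find_time_mwes(sentence):
    print(" ".join(word for word, _, _ in sentence[start:end]))
# the years rolled by
```

Matching on lemmas rather than surface words lets a single lexicon entry such as "roll" cover "rolled", "rolls" and "rolling".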
Web-based frontend
The files to be annotated can be assumed to be present in a database, let's say MongoDB, MySQL, or Solr.
The user input consists instead of semantic categories that act as components of multiword expressions.
Examples of such semantic categories are included in the backend task description at #1
For instance, they may include the semantic categories "UNIT OF TIME" and "MANNER-OF-MOTION VERB".
Do we use parameter files for the contents of these categories? If so, how do these parameter files interact with the mwetoolkit?
If we can use parameter files, can we have a number that is small enough to fit the options into a user interface?