Documentation | Paper | Blog | ModelScope
OFASys is a multi-modal multi-task learning system designed to make multi-modal tasks declarative, modular, and task-scalable. With OFASys, it is easy to:

- Rapidly introduce new multi-modal tasks/datasets by defining a declarative one-line instruction (see the example below).
- Develop new or reuse existing modality-specific components.
- Jointly train multiple multi-modal tasks together without manually handling multi-modal data collation.
For now, OFASys supports 7 modalities and more than 20 classes of multi-modal tasks, including:
- Text: for tasks like Natural Language Understanding, Text Summarization, and Text Infilling.
- Image: for tasks like Image Classification, Visual Entailment, Image Captioning, Visual Question Answering, Text-to-Image Generation and Image Infilling.
- Box: for tasks like Visual Grounding, Grounded Captioning, and Object Detection.
- Video: for tasks like Video Classification, Video Captioning and Video Question Answering.
- Audio: for tasks like Automatic Speech Recognition and Text-to-Speech.
- Structural Language: for tasks like Text-to-SQL, Table-to-Text, Table Question Answering, and Sudoku.
- Motion: for tasks like Text-to-Motion.
- 2022.12.23 v0.1.0-patch1:
  - Refactored and released the diffusion-based Text-to-Motion task (v0.1); see the documentation for usage.
  - Refactored TextPreprocess: BOS and EOS are no longer required when writing an instruction.
  - Added DatabasePreprocess for the Text-to-SQL task.
OFASys requires:
- PyTorch version >= 1.8.0
- Python version >= 3.7
- Torchaudio >= 0.8.0
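A quick way to check that your environment meets these requirements (a minimal sketch; each package exposes its version as a standard attribute):

import sys
import torch
import torchaudio

# Compare these against the minimum versions listed above.
print('python    ', sys.version.split()[0])   # >= 3.7
print('torch     ', torch.__version__)        # >= 1.8.0
print('torchaudio', torchaudio.__version__)   # >= 0.8.0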
The pip installation provides the basic multi-task training and inference functionality of OFASys.
pip install http://ofasys.oss-cn-zhangjiakou.aliyuncs.com/pkg/ofasys-0.1.0-py3-none-any.whl
Test your installation:
python -c "import ofasys"
The audio features of OFASys require the soundfile library. On Ubuntu, install it with:
sudo apt-get update
sudo apt-get install libsndfile1
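A quick import check, in the same spirit as the installation test above, confirms the audio backend is available (assuming the soundfile Python package was installed alongside OFASys):

python -c "import soundfile"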
Users can install OFASys from source to customize their training tasks and access the full functionality.
git clone https://github.com/OFA-Sys/OFASys.git
cd OFASys
python setup.py develop
The documentation contains more instructions for getting started.
OFASys can co-train multiple multi-modal tasks flexibly.
from ofasys import Task, Trainer, GeneralistModel
task1 = Task(
    name='caption',
    instruction='[IMAGE:image_url] what does the image describe? -> [TEXT:caption]',
    micro_batch_size=4,
)
task2 = Task(
    name='text_infilling',
    instruction='what is the complete text of " [TEXT:sentence,mask_ratio=0.3] "? -> [TEXT:sentence]',
    micro_batch_size=2,
)
In the simplest scenario, you only need to specify an instruction to define your task and a task name as an identifier.
A Task can use a regular PyTorch DataLoader, which can be constructed from a Hugging Face Dataset or a customized PyTorch Dataset.
from datasets import load_dataset
task1.add_dataset(load_dataset('TheFusion21/PokemonCards')['train'], 'train')
task2.add_dataset(load_dataset('glue', 'cola')['train'], 'train')
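For a custom source of data, a map-style PyTorch Dataset works as well. A minimal sketch, assuming each item is a dict whose keys match the slot names in the task's instruction (here, the 'sentence' slot of task2):

import torch.utils.data

class MySentences(torch.utils.data.Dataset):
    # Toy dataset exposing the 'sentence' field referenced by task2's instruction.
    def __init__(self, sentences):
        self.sentences = sentences

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return {'sentence': self.sentences[idx]}

task2.add_dataset(MySentences(['OFASys makes multi-modal learning easy.']), 'train')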
The GeneralistModel of OFASys (OFA+) is capable of handling multiple modalities including: TEXT, IMAGE, AUDIO, VIDEO, MOTION, BOX, PHONE.
The OFASys Trainer “mixes” multiple Tasks with any dataset and abstracts away all the engineering complexity needed for scale.
model = GeneralistModel()
trainer = Trainer()
trainer.fit(model=model, tasks=[task1, task2])
The complete script is available at scripts/trainer_api.py.
OFASys can run inference for multiple multi-modal tasks using just one checkpoint.
from ofasys import OFASys
model = OFASys.from_pretrained('multitask.pt')
OFASys enables multi-task multi-modal inference through the instruction alone. The multi-task checkpoint can be downloaded here. Let's go through a couple of examples!
instruction = '[IMAGE:img] what does the image describe? -> [TEXT:cap]'
data = {'img': "./COCO_val2014_000000222628.jpg"}
output = model.inference(instruction, data=data)
print(output.text)
# "a man and woman sitting in front of a laptop computer"
instruction = '[IMAGE:img] which region does the text " [TEXT:cap] " describe? -> [BOX:patch_boxes]'
data = {'img': "https://www.2008php.com/2014_Website_appreciate/2015-06-22/20150622131649.jpg", "cap": "hand"}
output = model.inference(instruction, data=data)
output.save_box("output.jpg")
instruction = 'what is the summary of article " [TEXT:src] "? -> [TEXT:tgt]'
data = {'src': "poland 's main opposition party tuesday endorsed president lech walesa in an upcoming "
               "presidential run-off election after a reformed communist won the first round of voting ."}
output = model.inference(instruction, data=data)
print(output.text)
# "polish opposition endorses walesa in presidential run-off"
instruction = 'structured knowledge: " [STRUCT:database,uncased] " . how to describe the tripleset ? -> [TEXT:tgt]'
data = {
    'database': [
        ['Atlanta', 'OFFICIAL_POPULATION', '5,457,831'],
        ['[TABLECONTEXT]', 'METROPOLITAN_AREA', 'Atlanta'],
        ['5,457,831', 'YEAR', '2012'],
        ['[TABLECONTEXT]', '[TITLE]', 'List of metropolitan areas by population'],
        ['Atlanta', 'COUNTRY', 'United States'],
    ]
}
output = model.inference(instruction, data=data, beam_size=1)
print(output.text)
# "atlanta, united states has a population of 5,457,831 in 2012."
instruction = ' " [TEXT:src] " ; structured knowledge: " [STRUCT:database,max_length=876] " . generating sql code. -> [TEXT:tgt]'
database = [
    ['concert_singer'],
    ['stadium', 'stadium_id , location , name , capacity , highest , lowest , average'],
    ['singer', 'singer_id , name , country , song_name , song_release_year , age , is_male'],
    ['concert', 'concert_id , concert_name , theme , stadium_id , year'],
    ['singer_in_concert', 'concert_id , singer_id']
]
data = [
    {'src': 'What are the names, countries, and ages for every singer in descending order of age?', 'database': database},
    {'src': 'What are all distinct countries where singers above age 20 are from?', 'database': database},
    {'src': 'Show the name and the release year of the song by the youngest singer.', 'database': database}
]
output = model.inference(instruction, data=data)
print('\n'.join([o.text for o in output]))
# "select name, country, age from singer order by age desc"
# "select distinct country from singer where age > 20"
# "select song_name, song_release_year from singer order by age limit 1"
instruction = '[VIDEO:video] what does the video describe? -> [TEXT:cap]'
data = {'video': './video7021.mp4'}
output = model.inference(instruction, data=data)
print(output.text)
# "a baseball player is hitting a ball"