[Feat] Support multi-modal evaluation on MME benchmark. #197
Conversation
metric['Perception'] = score

score = 0
for task in self.task_dict['Cognition']:
    score += metric[task]['score']
metric['Cognition'] = score

metric['Overall'] = metric['Perception'] + metric['Cognition']
Should these metrics be sum or average?
From the MME paper:
In addition, we calculate the score of a subtask based on the sum of accuracy and accuracy+. The perception score is the sum of scores of all perception subtasks. The cognition score is calculated in the same way.
So it should be a sum.
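As a toy illustration of this aggregation (the numbers below are made up, not from the PR): each subtask score is 100 * (acc + acc_plus), so one subtask contributes at most 200 points, and with 10 perception subtasks the Perception total is out of 2000.

# Hypothetical subtask results, for illustration only.
perception_scores = {'existence': 100 * (0.85 + 0.60),   # 145.0
                     'count': 100 * (0.70 + 0.45)}       # 115.0
perception_total = sum(perception_scores.values())       # summed, not averaged -> 260.0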
'acc': acc,
'acc_plus': acc_plus,
'score': 100 * (acc + acc_plus)
Will single_acc / double_acc be better names than acc / acc_plus?
From the MME paper:
Since the output of the model is limited to two types (“yes” or “no”), it is convenient to measure the metrics of accuracy and accuracy+. The former is calculated based on each question, while the latter is based on each image where both of the two questions need to be answered correctly. The random accuracies of the two metrics are equal to 50% and 25%, respectively.
acc and acc_plus may better match the paper.
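For clarity, here is a minimal sketch of the two metrics on made-up data (each MME image comes with two yes/no questions; the values are illustrative only, not actual results):

# Number of correctly answered questions per image (0, 1, or 2); hypothetical values.
record = {'img_1.jpg': 2, 'img_2.jpg': 1, 'img_3.jpg': 0}
acc = sum(record.values()) / (2 * len(record))                  # per question  -> 0.5
acc_plus = sum(v == 2 for v in record.values()) / len(record)   # per image     -> 0.33
score = 100 * (acc + acc_plus)                                  # subtask score -> 83.3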
    data_dir='/path/to/MME',
    pipeline=val_pipeline)

minigpt_4_dataloader = dict(batch_size=1,
Better to rename it to minigpt_4_mme_dataloader. The same for minigpt_4_model and minigpt_4_evaluator.
done.
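For reference, a minimal sketch of how the renamed entries could read after this change (the dataset wiring and the empty pipeline are placeholders, not the actual config):

val_pipeline = []  # placeholder; the real pipeline is defined elsewhere in the config
minigpt_4_mme_dataset = dict(data_dir='/path/to/MME', pipeline=val_pipeline)
minigpt_4_mme_dataloader = dict(batch_size=1, dataset=minigpt_4_mme_dataset)
# minigpt_4_model and minigpt_4_evaluator become minigpt_4_mme_model and
# minigpt_4_mme_evaluator in the same way.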
Args:
    data_dir (str): The path of the dataset.
    pipeline (dict): The data augmentation.
Suggested change:
pipeline (dict): The data augmentation.
pipeline (List[dict]): The data augmentation.
Modified in the latest commit.
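Since the pipeline is fed to Compose in the hunk below, it is a list of per-transform configs, hence List[dict]. A hypothetical example (the transform names are placeholders, not the ones used in this PR):

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', scale=(224, 224)),
    dict(type='PackInputs'),
]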
        self.pipeline = Compose(pipeline)
        self.load_data(data_dir)

    def load_data(self, data_dir):
Please also add type hints.
done.
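A sketch of what the requested type hints could look like for the method in the hunk above (the return type and the omitted body are assumptions):

    def load_data(self, data_dir: str) -> None:
        """Load the MME samples located under ``data_dir``."""
        ...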
"""Prompt constructor for MiniGPT-4 on MME. | ||
|
||
Args: | ||
image_prompt (str): Image prompt. |
Suggested change:
image_prompt (str): Image prompt.
image_prompt (str): Image prompt. Defaults to `''`.
Please also check other parts and make changes accordingly.
Done. Added default values to both the MMBench and MME prompt constructors.
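For illustration, the constructor signature then carries the same default as the docstring (a sketch, not the full class):

    def __init__(self, image_prompt: str = '') -> None:  # default matches "Defaults to `''`"
        self.image_prompt = image_prompt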
Issue
Failed to reproduce the paper result of MiniGPT-4.

Implementation Details

Generate Function
Following the demo in the official repo of MiniGPT-4, we build our generate function as below:

Prompt Building
We have tried several different formats. Two of them are given here:

# prompt 1
sys_prompt + "###Human: " + question + img + " " + "###Assistant: "
# prompt 2
sys_prompt + "###Human: " + question + " " + "###Human: " + img + " " + "###Assistant: "
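To make the format concrete, prompt 1 rendered with placeholder values (the actual sys_prompt and image token used by MiniGPT-4 are not reproduced here) would look like:

sys_prompt = "You are a helpful assistant."   # placeholder, not the real system prompt
question = "Is there a dog in the image? Please answer yes or no."
img = "<Img><ImageHere></Img>"                # placeholder image slot
prompt_1 = sys_prompt + "###Human: " + question + img + " " + "###Assistant: "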
The question and image are loaded from the MME benchmark.

Generate Function
In the official repo, the default recipe of the generate function is as below:

outputs = self.llama_model.generate(
inputs_embeds=prompt_embs,
max_new_tokens=300,
stopping_criteria=self.stopping_criteria,
num_beams=1,
do_sample=True,
min_length=1,
top_p=0.9,
repetition_penalty=1.0,
length_penalty=1,
temperature=1.0,
)

We also tried inference with beam search, see:
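The exact beam-search settings are not included above; a typical beam-search variant of the same call (an assumption, not necessarily the recipe used in the issue) only swaps the sampling arguments:

outputs = self.llama_model.generate(
    inputs_embeds=prompt_embs,
    max_new_tokens=300,
    stopping_criteria=self.stopping_criteria,
    num_beams=5,        # beam search; the beam width is an assumption
    do_sample=False,    # deterministic beam search instead of nucleus sampling
    min_length=1,
    repetition_penalty=1.0,
    length_penalty=1,
    temperature=1.0,
)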
In the following section, we name the official recipe as

Experiments

Metric Sanity Check
To validate the correctness of the metric computation, we use the following standalone script:

import os
from collections import defaultdict
samples = []
task_dict = {
'Perception': [
'existence', 'count', 'position', 'color', 'posters', 'celebrity',
'scene', 'artwork', 'OCR', 'landmark'
],
'Cognition': [
'commonsense_reasoning', 'numerical_calculation',
'text_translation', 'code_reasoning'
]
} # noqa
def read_result(fn, category):
with open(fn, 'r') as f:
line = f.readline()
while line:
img_path, question, answer, response = line.split('\t')
prefix_pred_ans = response[:4].lower()
if 'yes' in prefix_pred_ans:
pred_answer = 'yes'
elif 'no' in prefix_pred_ans:
pred_answer = 'no'
else:
pred_answer = 'other'
samples.append({'img_path': img_path, 'pred': 1 if answer.lower() == pred_answer.lower() else 0, 'task': category})
line = f.readline()
print(category, " done.")
def compute_metrics(results: list) -> dict:
# reorganize results
record = dict()
for task in (task_dict['Perception'] +
task_dict['Cognition']):
record[task] = defaultdict(int)
for sample in results:
record[sample['task']][sample['img_path']] += sample['pred']
# compute subtask score
metric = dict()
for task in (task_dict['Perception'] +
task_dict['Cognition']):
single_sum, double_sum = 0., 0.
for v in record[task].values():
assert 0 <= v <= 2
if v == 2:
single_sum += 2
double_sum += 1
elif v == 1:
single_sum += 1
acc = single_sum / 2 / len(record[task])
acc_plus = double_sum / len(record[task])
metric[task] = {
'acc': acc,
'acc_plus': acc_plus,
'score': 100 * (acc + acc_plus)
}
# compute overall score
score = 0
for task in task_dict['Perception']:
score += metric[task]['score']
metric['Perception'] = score
score = 0
for task in task_dict['Cognition']:
score += metric[task]['score']
metric['Cognition'] = score
metric['Overall'] = metric['Perception'] + metric['Cognition']
return metric
if __name__ == "__main__":
fn_list = os.listdir("./LaVIN")
for fn in fn_list:
read_result(os.path.join("./LaVIN", fn), fn[:-4])
metric = compute_metrics(samples)
print(metric)
The result is:
Same as the result obtained by the official evaluation script.
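For reference, read_result above implies one tab-separated result file per subtask under ./LaVIN, with the file name (minus .txt) used as the task name. An illustrative line (contents are made up):

# Each line: <img_path>\t<question>\t<ground-truth answer>\t<model response>
with open('./LaVIN/existence.txt', 'w') as f:
    f.write('000001.jpg\tIs there a dog in the image?\tYes\tyes, there is a dog\n')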
* [Feat] Support multi-modal evaluation on MME benchmark.
* [Fix] Remove debug code.
* [Fix] Remove redundant codes and add type hints.
* [Fix] Rename in config.
* [Fix] Rebase main.
* [Fix] Fix isort and yapf conflict.
Thanks for your contribution; we appreciate it a lot. The following instructions will help keep your pull request healthy and make it easier to get feedback. If you do not understand some items, don't worry; just open the pull request and seek help from the maintainers.
Motivation
Please describe the motivation for this PR and the goal you want to achieve through it.
Modification
Please briefly describe the modifications made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
Checklist
Before PR:
After PR: