[Feature] Update SC #126
Conversation
```

```{note}
Note that OpenCompass samples the next token with argmax by default. If no sampling parameters are specified, the model's inference results will therefore be identical on every run, and multi-round evaluation will be ineffective.
English?
```
Where `SAMPLE_SIZE` is the number of reasoning paths in Self-Consistency; a higher value usually yields higher performance. The following figure from the paper demonstrates the relationship between the number of reasoning paths and performance on several reasoning tasks:
From which paper? We need to add a citation.
Also need to point out that the sampling `generation_kwargs` only works for HuggingFace models.
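For context, a minimal sketch of what such sampling parameters might look like. The key names follow HuggingFace's `generate` API; the exact OpenCompass config keys may differ, so treat this as an illustration rather than a verified snippet:

```python
# Hypothetical sketch: enable sampling so Self-Consistency draws distinct
# reasoning paths instead of deterministic argmax (greedy) output.
generation_kwargs = dict(
    do_sample=True,   # switch from greedy/argmax decoding to sampling
    temperature=0.7,  # soften the next-token distribution
    top_k=40,         # restrict sampling to the 40 most likely tokens
)
```

Without `do_sample=True`, every sampled path would be identical and majority voting would be pointless.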
 | ||
From the figure, it can be seen that across different reasoning tasks, performance tends to improve as the number of reasoning paths increases. For some tasks, however, the gains saturate: beyond a certain number of paths, adding more brings no significant improvement. It is therefore necessary to experiment on the specific task to find the optimal number of reasoning paths.
A blank line between the paragraph and image makes layout better
## 3. Self-Consistency
The SC (Self-Consistency) method was proposed in [this paper](https://arxiv.org/abs/2203.11171). It samples multiple reasoning paths for each question and applies majority voting over the answers the LLM generates. SC achieves remarkable accuracy on reasoning tasks, but may consume more time and resources at inference time because of the majority-voting strategy. In OpenCompass, you can simply enable the SC method in the dataset config like:
We should explicitly tell readers they have to replace `GenInferencer` with `SCInferencer`.
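A minimal sketch of what that replacement might look like in a dataset config. The field names here are assumptions based on the review discussion, not verified against the merged code:

```python
SAMPLE_SIZE = 20  # number of reasoning paths sampled per question (illustrative)

# Hypothetical dataset config fragment: GenInferencer is swapped for
# SCInferencer, and the number of sampled reasoning paths is set.
gsm8k_infer_cfg = dict(
    inferencer=dict(
        type='SCInferencer',  # was: type='GenInferencer'
        sc_size=SAMPLE_SIZE,  # draw SAMPLE_SIZE reasoning paths
        generation_kwargs=dict(do_sample=True, temperature=0.7),
    ),
)
```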
)
)
gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)
``` |
Need a link to the new gsm8k config for interested readers to follow.
sc_results.append(results)
sc_prediction = list(map(list, zip(*sc_results)))
generated = sc_prediction
print(generated)
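For context, the `zip(*sc_results)` idiom above transposes per-path answer lists into per-question answer lists. A self-contained illustration with made-up values:

```python
# sc_results holds one list of answers per sampled reasoning path:
# sc_results[path][question] -> answer string.
sc_results = [
    ["18", "42", "7"],   # path 0's answers to questions 0..2
    ["18", "40", "7"],   # path 1
    ["20", "42", "7"],   # path 2
]

# Transpose so each inner list gathers every path's answer to one question:
# sc_prediction[question][path] -> answer string.
sc_prediction = list(map(list, zip(*sc_results)))
print(sc_prediction)  # [['18', '18', '20'], ['42', '40', '42'], ['7', '7', '7']]
```

Majority voting then runs over each per-question list.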
del
save_every: Optional[int] = None,
fix_id_list: Optional[List[int]] = None,
sc_size: Optional[int] = 1,
infer_type: Optional[str] = '',
`infer_type` is not even used here. Its implementation seems pretty close to `GenInferencer`'s. Consider employing inheritance to cut down on code redundancy and ease future maintenance.
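The inheritance suggestion could be sketched roughly like this. The class bodies are placeholders, not the real OpenCompass code:

```python
class GenInferencer:
    """Stand-in for the existing single-pass generation inferencer."""

    def generate(self, prompt):
        # The real implementation queries the model once per prompt.
        return f"answer({prompt})"


class SCInferencer(GenInferencer):
    """Hypothetical refactor: inherit the shared machinery and override
    only what Self-Consistency changes (sampling sc_size paths)."""

    def __init__(self, sc_size=1):
        self.sc_size = sc_size

    def generate(self, prompt):
        # Reuse the parent's generation once per sampled reasoning path.
        return [GenInferencer.generate(self, prompt) for _ in range(self.sc_size)]
```

This way, fixes to the shared inference logic land in one place instead of two near-identical copies.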
@@ -164,6 +186,14 @@ def _extract_role_pred(self, s: str, begin_str: Optional[str],

        return s[start:end]

    def _get_vote_out(
A short docstring is required here.
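For context, a self-contained sketch of the majority-vote step such a method performs. This is a hypothetical stand-in, not the actual `_get_vote_out` implementation:

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer produced by the most reasoning paths.

    Ties are broken by first occurrence, following the ordering
    guarantees of Counter.most_common.
    """
    return Counter(answers).most_common(1)[0][0]
```

With the per-question lists from the transpose step, `majority_vote(["18", "18", "20"])` selects `"18"`.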
Don't forget to update the ToC of the documentation.
* add self-consistency
* add CoT method Self-Consistency
* fix typo error and update openicl_eval
* add tydiQA-GoldP task
* fix sc
* rename gsm8k_sc
* fix sc
* add self-consistency doc
* refine sc

---------

Authored-by: liushz <[email protected]>
The same as #57