Releases: open-compass/opencompass

OpenCompass v0.1.3

25 Aug 10:56
b2d602f

OpenCompass keeps getting better! v0.1.3 brings a variety of enhancements, new features, and crucial fixes. Here’s a summary of what we've packed into this release:

🆕 Highlights:

Extended Dataset Support: OpenCompass now integrates a broader range of public datasets, including but not limited to adv_glue, codegeex2, Humanevalx, SEED-Bench, LongBench, and LEval. We aim to provide extensive coverage to cater to a variety of research needs.
Utility Additions: From multi-modal evaluation on the MME benchmark to the Tree-of-Thought method, this release is packed with functionality enhancements.
Bug Extermination: Your feedback helps us grow. We’ve squashed a series of bugs to improve your experience.
More Evaluation Benchmarks for Multimodal Models: We support 10 additional evaluation benchmarks for multimodal models, including COCO Caption and ScienceQA, and provide the corresponding evaluation code.

Let's delve deeper into what's new:

🌟 New Features:

📦 Extended Dataset Support:

  • Introduction of other public datasets (#206, #214).
  • Support for adv_glue dataset focused on adversarial robustness (#205).
  • Added codegeex2, Humanevalx (#210).
  • Integration of SEED-Bench (#203).
  • LongBench support (#236).
  • Reconstructed the LEval dataset (#266).
  • Support for 10 additional public evaluation benchmarks for multimodal models (#214).

🛠 Utilities and Functionality:

  • Launch script added for ease of operations (#222).
  • Multi-modal evaluation on the MME benchmark (#197).
  • Support for visualglm and llava on MMBench evaluation (#211).
  • Tree-of-Thought method introduced (#173).
  • Introduction of llama2 native implementations (#235).
  • Flamingo and Claude support added (#258, #253).

📝 Documentation:

  • Navigation bar language type updated for better clarity (#212).
  • News updates for keeping users informed (#241, #243).
  • Summarizer documentation added (#231).

🛠️ Bug Fixes:

  • Addressed an issue with multiple rounds of inference using mm_eval (#201).
  • Miscellaneous fixes such as name adjustments, requirements, and bin_trim corrections (#223, #229, #237).
  • Local runner debug issue fixed (#238).
  • Resolved bugs for PeftModel generate (#252).

⚙ Enhancements and Refactors:

  • Refactored instructblip for better performance and readability (#227).
  • Improved crowspairs postprocess (#251).
  • Optimization to use sympy only when necessary (#255).

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@yyk-wew (First PR)
@fangyixiao18 (First PR)
@philipwangOvO (First PR)
@cdpath (First PR)

Thank you to our dedicated contributors for making OpenCompass even more comprehensive and user-friendly! 🙌 🎉

Remember to star 🌟 our GitHub repository if you find OpenCompass helpful! Your feedback and contributions are invaluable.


Change log

For a complete list of changes, please refer to our Full Changelog.

OpenCompass v0.1.2

11 Aug 10:45
4fc1701

This release continues the evolution of OpenCompass, bringing a mix of new features, optimizations, documentation improvements, and bug fixes.

🆕 Highlights

🏆 Leaderboard: The evaluation results of Qwen-7B, XVERSE-13B, LLaMA-2, and GPT-4 have been posted to our leaderboard. You can now also compare models online. We hope this feature offers deeper insights!

📊 Datasets: Introduction of the Xiezhi, SQuAD2.0, ANLI, and LEval datasets, and more for diverse applications. (#101, #192) Safety-related datasets have been added to collections. [#185]

🎭 New modality: Support for MMBench is introduced, and the evaluation of multi-modal models is on the way! (#56, #161) The Intern language model is also introduced. (#51)

⚙️ Enhancement: Several enhancements to OpenAI models, including key deprecation and temperature setting. [#121] [#128] Support for running multiple tasks on one GPU, filtering messages by level, and more. [#148] [#187]

📝 Documentation: Comprehensive updates and fixes across READMEs, issue templates, prompt docs, metric documentation, and more.

🛠️ Bug Fixes: Including seed fixes in HFEvaluator, fixes for issues in AGIEval multiple-choice questions, and more. [#122] [#137]

🎉 New Contributors

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@go-with-me000 (First Contribution)
@anakin-skywalker-Joseph (First Contribution)
@zhouzaida (First Contribution)
@dependabot (First Contribution)

Changelog

Full Changelog: 0.1.1...0.1.2

v0.1.1

26 Jul 07:11
b7184e9

Added more datasets:

  • AGIEval
  • anli
  • cmmlu
  • jigsawmultilingual
  • realtoxicprompts
  • SQuAD2.0
  • TheoremQA
  • triviaqa
  • xiezhi
  • Xsum

v0.1.0

06 Jul 07:23

First release, with the following datasets:

  • ARC
  • BBH
  • ceval
  • CLUE
  • FewCLUE
  • GAOKAO-BENCH
  • LCSTS
  • math
  • mbpp
  • mmlu
  • nq
  • summedits
  • SuperGLUE