add support for translating Dart #153
Conversation
Here's the output for
And a DartPad snippet for the above: https://dartpad.dev/?id=8f9fe06a67328d6c82c35c202915227e
Let's start with the stop tokens.
So, the stop tokens are the accepted hack for telling the model when to stop generating text. Since the task is to generate a top-level function, the stop tokens are the text that typically follows a top-level function. Hopefully that explains the choice of stop tokens here. But there is a simpler approach that we later realized works for any curly-brace language: just use the closing curly brace.
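To make the curly-brace idea concrete, here is a minimal sketch, assuming the generator simply truncates at the first stop token it produces; this is an illustration, not the actual MultiPL-E code.

```python
# Illustrative sketch only: for a curly-brace language like Dart, a closing
# brace in column 0 marks the end of the top-level function under test, so
# a single stop token is enough.
DART_STOP_TOKENS = ["\n}"]

def truncate_at_stop(completion: str, stop_tokens=DART_STOP_TOKENS) -> str:
    """Cut the completion at the earliest stop token, keeping the brace so
    the generated function body stays syntactically complete."""
    cut = len(completion)
    for tok in stop_tokens:
        idx = completion.find(tok)
        if idx != -1:
            cut = min(cut, idx + len(tok))
    return completion[:cut]
```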
The prompt terminology stuff: as mentioned in the MultiPL-E paper, it hardly matters. The LLMs are very robust to terminology, especially on high-resource languages (and Dart probably is high-resource). However, you can add Dart terms here: https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/terms.csv
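For illustration only, the kind of Python-to-Dart rewordings that file is meant to capture might look like the mapping below; the specific pairs and the dict form are assumptions, not the actual terms.csv contents.

```python
# Hypothetical examples of docstring terminology rewording for Dart; the
# real mappings live in dataset_builder/terms.csv.
PY_TO_DART_TERMS = {
    "list": "List",
    "dictionary": "Map",
    "string": "String",
    "None": "null",
}
```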
Gotcha, thanks for the explanation; I updated the PR to use the closing curly as the stop token.
Gotcha. I do have the terms added for Dart in that file, but I don't see any translation when running something like 'test.py humaneval_to_dart ../datasets/originals/HumanEval_136_largest_smallest_integers.py'; perhaps that's not where the translation should show up. If this does look configured correctly, it may be worth double-checking the logic in https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/generic_translator.py#L273. In any case, I understand that translating the Python terms to Dart isn't critical.
evaluation/src/eval_dart.py
Outdated
elif "SyntaxError" in r.stderr: | ||
status = "SyntaxError" | ||
elif "ReferenceError" in r.stderr: | ||
status = "ReferenceError" |
At this point, for benchmarking, it only really matters that you classify each run as pass or fail. The MultiPL-E paper had a finer-grained analysis of the types of errors, which is what this code was for. So, this fine-grained categorization is optional at this point.
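As a minimal sketch of that pass/fail-only classification (the function and file names here are hypothetical, not the actual eval_dart.py code):

```python
# Hypothetical pass/fail-only classifier: run the candidate Dart program and
# bucket everything that exits non-zero as a failure. Finer labels such as
# SyntaxError or Timeout would be optional refinements on top of this.
import subprocess

def classify(dart_file: str) -> str:
    r = subprocess.run(["dart", dart_file], capture_output=True, text=True)
    return "OK" if r.returncode == 0 else "Exception"
```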
So, I was able to spot-check output against the list here: #152 (comment), and things seemed reasonable (the output analyzed correctly when given a stub implementation for the method under test). What's a good way to do some end-to-end testing? Drive this against a specific LLM and a single file? Or an LLM and one run of all the HumanEval benchmark scripts? I'd like to get a bit more confidence that this translation generates reasonable code and that the benchmark results would be reliable.
You can create a local dataset of Dart prompts like this:

cd MultiPL-E/dataset_builder
python3 prepare_prompts_json.py \
    --lang humaneval_to_dart.py \
    --doctests transform \
    --prompt-terminology reworded \
    --output ../dart_prompts.jsonl

When I try this, I get a bunch of errors. Maybe something is not yet pushed?
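(For reference, once the prompt build does succeed, a quick way to sanity-check the result is to read a few entries back. The field names below are assumptions about the JSONL layout, not something confirmed in this thread.)

```python
# Assumed layout: one JSON object per line with "name", "prompt", and
# "stop_tokens" fields.
import json

with open("dart_prompts.jsonl") as f:
    for i, line in enumerate(f):
        problem = json.loads(line)
        print(problem.get("name"), "stop tokens:", problem.get("stop_tokens"))
        print(problem.get("prompt", "")[:200])  # first part of the prompt
        if i == 2:  # only inspect the first three problems
            break
```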
Here is what I've done:
Thanks for the feedback! I was OOO for a bit, but will circle back to this PR.
I'm still digging through this a bit, but when I run the TypeScript translation locally for comparison, I see similar numbers to the Dart translator:
@arjunguha - I may be done tinkering with this PR now; I converted it from a draft to 'ready for review'.
I'll plan to do this Monday. Thanks!
Note to self: I see that I'm also getting 152 translations for TypeScript. This is a regression as of the last release a few months ago. I'll try to see what's up before merging this one.
There were a few language features I didn't generate code for (unions, ellipsis) that could probably be handled with some creative code gen. If those end up skewing the benchmark numbers, I can revisit.
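One possible shape for that creative code gen, purely as a sketch (the helper names below are hypothetical and not part of the PR): collapse Python unions, which Dart lacks, into a permissive Dart type, and give `...` (Ellipsis) a fixed stand-in expression.

```python
# Hypothetical sketch, not the PR's translator code.
def dart_type_for_union(member_types: list[str]) -> str:
    # Dart has no union types, so e.g. Union[int, str] has to widen to a
    # common supertype; the nullable top type Object? always works.
    unique = set(member_types)
    return unique.pop() if len(unique) == 1 else "Object?"

# Stand-in Dart expression used wherever the Python original writes `...`.
DART_ELLIPSIS = "const Object()"
```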
Nope, sorted this out. The translation script was using the prompt dataset by default. The fix was to use the dataset in which the doctests in the Python originals were manually cleaned by an undergraduate and a high school student, which is what we use for everything else. With that dataset, here is what I get with Dart:
I'm going to merge this in and add some results.
Note that I'm starting this PR as a draft as I have some open questions (and haven't finished all testing).
My questions are mostly called out in TODO comments in the code, but: