CAPYBARA: Decompiled Binary Functions and Related Summaries

Dataset

Description

CAPYBARA

This dataset is published as part of the paper: "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes both the training/evaluation data and the raw data.

The data_split folder contains .pickle files listing the test and validation repos; the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.
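
The split logic described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the helper names `load_repo_set` and `derive_train_repos` are hypothetical, and it assumes each .pickle file holds an iterable of repository names. The actual file names inside data_split are not stated here, so check the folder before use.

```python
import pickle
from pathlib import Path


def load_repo_set(path):
    """Load a pickled collection of repository names and return it as a set."""
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with Path(path).open("rb") as fh:
        return set(pickle.load(fh))


def derive_train_repos(all_repos, test_repos, valid_repos):
    """Return every repo that is in neither the test nor the validation split."""
    return set(all_repos) - set(test_repos) - set(valid_repos)
```

The full list of repositories has to come from elsewhere, for example the `repo` column of one of the CSV files.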
In the processed_data folder, the processed datasets can be found in .csv format. The columns of the CSV are the `summaries`, the `original documentation`, the `repo`, the `source` and `decompiled` code, the `function name` and a unique `identifier`. We also include the deduplicated samples in separate CSVs.
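
As a rough sketch of how the processed CSVs might be read and partitioned by repository, the snippet below uses only the Python standard library. The function names are hypothetical, the exact header strings in the files may differ from the column names listed above (the `repo_column` default is an assumption), and UTF-8 encoding is assumed.

```python
import csv

# Source and decompiled functions can be long; raise the csv module's
# per-field size limit above its default.
csv.field_size_limit(10**9)


def read_rows(csv_path):
    """Yield each CSV row as a dict keyed by the header names."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)


def split_rows_by_repo(rows, test_repos, valid_repos, repo_column="repo"):
    """Assign each row to train, test, or valid based on its repository."""
    splits = {"train": [], "test": [], "valid": []}
    for row in rows:
        repo = row[repo_column]
        if repo in test_repos:
            splits["test"].append(row)
        elif repo in valid_repos:
            splits["valid"].append(row)
        else:
            # Any repo not listed as test or validation counts as train.
            splits["train"].append(row)
    return splits
```

Here `test_repos` and `valid_repos` would be the repository collections stored in the data_split folder.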
The processed training files can be found in the training_data folder. `Source C`, `decompiled`, `demiStripped`, and `stripped` can each be found in their corresponding folders and are split into deduplicated and regular datasets. The data is further split into .jsonl files for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGLUE as is.
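
To inspect the .jsonl files outside of CodeT5 or CodeXGLUE, a generic line-by-line reader is enough. The sketch below assumes one JSON object per line and UTF-8 encoding; `read_jsonl` is a hypothetical helper, and the field names inside each record are not documented here, so the example makes no assumption about them.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Calling `next(read_jsonl(path))` on one of the files and listing the keys of the returned record is a quick way to discover which fields are available.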
The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are `repo`, the `location`, the `original` code, the corresponding `decompiled` code, the `function name`, a unique `identifier` key, and the corresponding `documentation` for both the decompiled and stripped functions.

License

Copyright 2022 ##########

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Date made available: 20 Oct 2022
Publisher: Zenodo
Date of data production: 2021 - 2022

Software license

  • Apache License 2.0
