Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes

Published in MSR, 2022

Despite the existence of different encoding schemes, there is still little understanding of which is better and under what circumstances, as the community often relies on general beliefs that inform the decision in an ad-hoc manner. To bridge this gap, in this paper we empirically compare the three widely used encoding schemes for software performance learning, namely label, scaled label, and one-hot encoding, across five systems and seven ML models, leading to 105 cases of investigation (5 systems x 7 models x 3 schemes). We show that selecting the encoding scheme is non-trivial and expensive, and we distill important and actionable suggestions for future researchers.
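For illustration, the sketch below shows the three encoding schemes on a toy configuration option using scikit-learn. The option name and values are hypothetical, and the paper's exact preprocessing pipeline may differ; this is only meant to convey the idea behind each scheme.

```python
# Minimal sketch of the three encoding schemes on one hypothetical
# categorical configuration option with three possible values.
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder

option = np.array(["off", "low", "high", "low"]).reshape(-1, 1)

# Label encoding: each distinct value is mapped to an integer index.
label = LabelEncoder().fit_transform(option.ravel())
print(label)           # [2 1 0 1] (indices follow sorted value order)

# Scaled label encoding: the integer indices normalized into [0, 1].
scaled = MinMaxScaler().fit_transform(label.reshape(-1, 1))
print(scaled.ravel())  # [1.  0.5 0.  0.5]

# One-hot encoding: each value becomes a binary indicator vector.
onehot = OneHotEncoder().fit_transform(option).toarray()
print(onehot)          # one column per distinct value
```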

The source code, datasets, raw results, and supplementary materials can be found in our GitHub repository.

The full paper can be downloaded here.