Andrew's C/C++ Token Count Dataset 2016 (ACTCD16)

Project: Programming Language C++
Author: Andrew Tomazos
Date: 2016-01-26

Abstract: We parsed 4,689,316,529 C/C++ tokens from 2,566,989 C/C++ source files taken from 11,423 open-source packages of a popular Linux distribution. For each of the 50,325,647 distinct token spellings, we counted the number of occurrences and wrote the spellings and their counts to a single data file. We make that data file available for download as the ACTCD16 dataset.
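
The counting step described in the abstract can be illustrated with a minimal Python sketch. It is not the tooling used to build ACTCD16. The regular expression TOKEN_RE is a hypothetical, heavily simplified stand-in for a real C/C++ lexer, and the "count<TAB>spelling" output layout is an assumption made for illustration, not the documented format of the data file.

```python
import re
from collections import Counter

# Hypothetical, simplified token pattern (an assumption for illustration):
# identifiers, decimal integers, a few multi-character punctuators, and any
# other single non-space character. A real C/C++ lexer must also handle
# comments, string and character literals, preprocessing directives, etc.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|->|\+\+|--|==|!=|<=|>=|&&|\|\||[^\s\w]")


def count_tokens(sources):
    """Count occurrences of each distinct token spelling across all sources."""
    counts = Counter()
    for text in sources:
        counts.update(TOKEN_RE.findall(text))
    return counts


def format_counts(counts):
    """Render one 'count<TAB>spelling' line per distinct spelling.

    Lines are ordered by descending count, then by spelling. This layout is
    assumed for the sketch only.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{n}\t{tok}" for tok, n in ordered)


# Two tiny stand-in "source files".
sources = ["int x = 0; x++;", "int y = x;"]
counts = count_tokens(sources)
report = format_counts(counts)
```

Here sum(counts.values()) corresponds to the total token count reported in the abstract, and len(counts) corresponds to the number of distinct token spellings.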