Abstract:
Despite the importance of electoral processes in India, the world’s largest democracy, data on these processes has historically not been easily available for political analysis. This is due to the problems inherent in assembling any archive of social and political data that spans many decades. We shed light on these problems and present solutions in the context of LokDhaba, a system we built that is the first structured and cleaned archive of Indian election data at the national or state level from 1962 onwards. In this paper, we describe the challenges we encountered in data scraping, parsing, cleaning, consistency checking, and integration across multiple sources, and the novel tools we developed to overcome them. LokDhaba is used extensively by political scientists, researchers, journalists and others to better understand electoral outcomes and long-term trends in India.