Theory building with big data-driven research

Title: Theory Building with Big Data-Driven Research – Moving away from the “What” towards the “Why”

Authors: Arpan Kumar Kar and Yogesh K Dwivedi

Abstract: Data availability and access to various platforms are changing the nature of Information Systems (IS) studies. Such studies often use large datasets, which may incorporate structured and unstructured data from various platforms. To address their research questions, such papers may use methods from computational science, such as sentiment mining, text mining, network science and image analytics, to derive insights. However, many of these studies make only a weak theoretical contribution. We point out the need for such studies to contribute back to the IS discipline, so that findings can explain more about the phenomena surrounding the interaction of people with technology artefacts and the ecosystem within which this contextual usage is situated. Our opinion paper attempts to address this gap and provides insights on the methodological adaptations required for “big data studies” to be converted into “IS research” and to contribute to theory building in information systems.

Keywords: Big data analytics; Image mining; Network mining; Sentiment analysis; Text mining; Inductive theory building; Machine learning; Information management.


IS research with big data is still at a nascent stage. There is considerable scope for it to mature in the years to come and to develop valuable theoretical contributions. We feel that the authors of research papers should attempt to address the ten points highlighted in this editorial note. Researchers should strive to meet the following objectives, for which possible methodological solutions are indicated in Table 1:

| SN | Focused objective | Possible methodological solutions |
|----|-------------------|-----------------------------------|
| 1 | Data acquisition based on “theoretical research questions” to minimize data acquisition bias. | Sampling, keyword, entity and user profile identification. Address data imbalance problems if needed. |
| 2 | Handle outliers in data better | Data cleaning, stemming, sub-sampling |
| 3 | Improve validity of measures | Qualitative intervention and inputs of subject matter experts may be required. Focus group discussions and field experiments may help. |
| 4 | Improve reliability of measures | Reporting inter-coder reliability and category reliability for content analysis type approaches. |
| 5 | Use computationally derived measures from data wherever possible in the inferential model and bring objectivity to the dependent variable | More than one measure is a better proxy for constructs identified from literature. Hypothesis building is very important, wherever feasible. |
| 6 | Understand data limitations from a single type of data | Use of text, networks, images and links, or a mix of these data types, for building the models would be desirable. Multi-modal data analysis would be particularly exciting and enriching. |
| 7 | Address data measurement challenges due to biases affecting the generation of the data | Using objective or computed variables, which can be used as control variables, would improve trust in the outcome. |
| 8 | Minimize trade-off between internal and external validity of research model | Statistical validation of differences between groups, inferential statistics like penalised regression, logit models or multivariate analysis. |
| 9 | Check the data compatibility in measures | Time period match of data, adjusting for multi-source data problems |
| 10 | Realistic assessment of limitations and trade-offs should be reported. | Report low explainability of inferential model, if needed. Data is expected to have high noise. |

Table 1: Bringing theoretical contributions in big data research methodologically
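
As an illustration of objective 4 in Table 1, inter-coder reliability for content analysis is commonly reported with an agreement statistic such as Cohen's kappa. The following is a minimal sketch under stated assumptions, not a method prescribed in this note: the function name `cohen_kappa`, the two coders, and the sentiment labels are hypothetical and invented purely for illustration. Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two coders and p_e is the agreement expected by chance from each coder's category frequencies.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items (hypothetical helper)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)

    # Observed agreement: share of items on which the two coders agree.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: for each category, the product of the two coders'
    # marginal proportions, summed over all categories either coder used.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    categories = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both coders used one identical category throughout;
        # treated here as perfect agreement (an assumption of this sketch).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented labels for ten social media posts, coded independently by two people.
coder_a = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neg"]

kappa = cohen_kappa(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this invented example the coders agree on 8 of 10 items (p_o = 0.80) while chance agreement is p_e = 0.38, so kappa is roughly 0.68. Reporting such a statistic alongside the coding scheme lets readers judge how far the derived measures depend on an individual coder's judgement.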

Adopting the highlighted research methodologies would open up possibilities to contribute to IS theory development. With methodological improvements, studies would also be able to minimize the usual trade-offs between internal validity (i.e., confidence in inferences about contextual findings) and external validity (i.e., confidence in the generalizability of findings). This would bring more objectivity and rigour to the findings and enable big data-driven research to take the next step from “what has happened” to “why it happens”. Further, it is important to note that the perspective taken in this editorial review can be extended to consider the value of design science and action research for IS, and how these can complement data-driven studies. Future studies need to attempt to integrate these islets of literature to make theory building more relevant to practice.