Section-Type Constraints on the Choice of Linguistic Mechanisms in Research Articles: A Corpus-Based Approach
10 Modern Languages
Type(s) of data
This thesis investigates the structure of research articles in the field of Computational Linguistics with the goal of establishing that a set of distinctive linguistic features is associated with each section type. The empirical results of the study are derived from the quantitative and qualitative evaluation of research articles from the ACL Anthology Corpus. More than 20,000 articles were analyzed for the purpose of retrieving the target section types and extracting the predefined set of linguistic features from them. Approximately 1,100 articles were found to contain all of the following five section types: abstract, introduction, related work, discussion, and conclusion. These were chosen for the purpose of comparing the frequency of occurrence of the linguistic features across the section types. Making use of frameworks for Natural Language Processing, the Stanford CoreNLP Module, and the Python library SpaCy, as well as scripts created by the author, the frequency scores of the features were retrieved and analyzed with state-of-the-art statistical techniques.
The results show that each section type possesses an individual profile of linguistic features which are associated with it more or less strongly. These section-feature associations are shown to be derivable from the hypothesized purpose of each section type.
Overall, the findings reported in this thesis provide insights into the writing strategies that authors employ so that the overall goal of the research paper is achieved.
The results of the thesis can find implementation in new state-of-the-art applications that assist academic writing and its evaluation in a way that provides the user with a more sophisticated, empirically based feedback on the relationship between linguistic mechanisms and text type. In addition, the potential of the identification of text-type specific linguistic characteristics (a text-feature mapping) can contribute to the development of more robust language-based models for disinformation detection.