Midterm Project

Introduction

This page presents a text analysis of the Analects of Confucius. For this midterm project, I explore word frequency patterns, dominant themes, and the characteristics of the text through computational analysis.

Data Source

The dataset used in this project consists of a digitized English translation of the Analects of Confucius. The text was obtained in plain .txt format and processed for computational analysis. The Analects is a classical Chinese text composed of sayings, dialogues, and short narratives attributed to Confucius and his disciples. Compiled during the Warring States period, the work presents foundational ideas of early Confucian thought, including the cultivation of virtue, moral self-discipline, ritual propriety, and the ethical responsibilities of governance. Rather than offering systematic philosophical argumentation, the text conveys its principles through concise exchanges and situational reflections, emphasizing moral practice and character formation. As a central text of the Confucian canon, the Analects has shaped intellectual, political, and educational culture in East Asia for over two millennia.

Prior to analysis, several cleaning steps were performed. After importing the plain text into R, I separated the continuous text into 20 sections, one for each BOOK heading, so that the corpus mirrors the 20 books of the original work.
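The R script is not reproduced on this page, so the following is only a minimal Python sketch of the same idea, not the script itself. It assumes each section begins with a line such as "BOOK IV." (the word BOOK followed by a Roman numeral); the actual heading format in the source file may differ. The function names split_into_books and roman_to_int are hypothetical.

```python
import re


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
    total = 0
    # Pair each symbol with the one after it; a smaller symbol before
    # a larger one is subtracted (IV = 4, IX = 9).
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        v = values[ch]
        total += -v if values.get(nxt, 0) > v else v
    return total


def split_into_books(text):
    """Return {book_number: body_text}, splitting at assumed 'BOOK <numeral>' lines."""
    heading = re.compile(r"^BOOK\s+([IVXLC]+)\b.*$", re.MULTILINE)
    matches = list(heading.finditer(text))
    books = {}
    for i, m in enumerate(matches):
        start = m.end()
        # A book runs until the next heading, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        books[roman_to_int(m.group(1))] = text[start:end].strip()
    return books
```

Applied to the full text, a check such as len(split_into_books(raw_text)) == 20 would be a quick way to confirm that every book heading was detected.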

View the R script in a new tab

Common stop words were removed, along with high-frequency function verbs and reporting terms such as "said" and "heard", since these words carry little thematic information.

In addition, proper nouns and personal names were excluded from the dataset to avoid skewing frequency distributions. Many passages in the text contain repeated references to specific individuals, and retaining these names would disproportionately increase their frequency counts.
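As a rough illustration of these two filtering steps, here is a minimal Python sketch (again, not the R script used for the project). The word lists are short placeholders chosen for the example: only "said" and "heard" come from the description above, and the names are hypothetical stand-ins for whatever spellings the translation uses.

```python
import re

# Illustrative lists only; the project's real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in", "said", "heard"}
NAMES = {"confucius", "tsze-kung"}  # hypothetical name spellings


def clean_tokens(text, stop_words=STOP_WORDS, names=NAMES):
    """Lowercase the text, tokenize it, and drop stop words and names."""
    # Keep hyphenated forms together so names like 'tsze-kung' match as one token.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    excluded = stop_words | names
    return [t for t in tokens if t not in excluded]
```

Keeping the two lists separate makes it easy to rerun the analysis with names restored and compare how much they shift the frequency rankings.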

After cleaning, the resulting corpus was formatted for analysis in Voyant Tools, allowing for visualization of term frequency patterns and dominant thematic clusters.

Processes

Two primary visualization techniques were employed: Cirrus and Trends.

The Cirrus tool was used to generate a word cloud based on term frequency. This visualization provides a high-level overview of dominant lexical patterns in the corpus, allowing for the identification of frequently occurring concepts once stop words and names have been removed.
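The counting behind a frequency-based word cloud can be sketched in a few lines of Python. This is only the general idea of ranking terms by raw count, not Voyant's own implementation, and top_terms is a hypothetical helper name.

```python
from collections import Counter


def top_terms(tokens, n=5):
    """Return the n most frequent tokens as (term, count) pairs."""
    return Counter(tokens).most_common(n)
```

For example, top_terms(["virtue", "people", "virtue"], n=1) returns [("virtue", 2)]; in a word cloud, a higher count translates into a larger word.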

The Trends tool was used to examine the distribution of selected terms across the text. By plotting word frequencies sequentially, this tool makes it possible to observe the shifting frequencies of each word across chapters.
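A per-book trend line of this kind can be approximated with the Python sketch below. It assumes relative frequency (occurrences divided by the number of tokens in each book), which keeps long and short books comparable; whether the chart on this page shows raw or relative counts is not stated, and relative_frequencies is a hypothetical helper name.

```python
def relative_frequencies(books_tokens, term):
    """Given {book_number: [tokens]}, return [(book_number, relative_frequency)]."""
    trend = []
    for book in sorted(books_tokens):
        tokens = books_tokens[book]
        # Guard against empty books to avoid division by zero.
        freq = tokens.count(term) / len(tokens) if tokens else 0.0
        trend.append((book, freq))
    return trend
```

Plotting the second element of each pair against the book number gives one line of the trend chart for the chosen term.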

Text Analysis Visualization

The Cirrus visualization provides an immediate overview of dominant terms after cleaning, revealing a thematic concentration around moral cultivation and social order, as seen in high-frequency concepts such as virtue, government, propriety, and people. This supports an interpretive claim that the text repeatedly frames ethical self-cultivation and political responsibility as mutually reinforcing, rather than treating them as separate domains.

Text Analysis Trend

Several books stand out in the trend visualization when aligned with their themes.

Book 4 shows a clear rise in virtue, which matches its focus on ren and the moral character of the superior person. The lexical emphasis reflects the book’s central concern with ethical self-cultivation.

Book 12 displays increased frequencies of both virtue and governance-related terms. This book explicitly connects moral cultivation with political responsibility, which explains the convergence of ethical and administrative vocabulary.

Book 15 corresponds to a visible peak in terms associated with public life, including man and people. This reflects its strong engagement with questions of conduct, rulers, and political order.

Finally, Book 20, which summarizes principles of rulership, aligns with renewed attention to governance terminology. The trend is consistent with the structural movement of the text toward concluding reflections on political authority.

These selected peaks demonstrate that lexical frequency patterns correspond closely with the thematic orientation of key books, reinforcing the connection between computational signals and traditional interpretation.

Conclusion: Relation to Digital Arts & Humanities

This project demonstrates how digital tools can extend traditional reading of a classical text. By transforming the Analects into organized data and applying visualization tools, the analysis makes underlying patterns visible at a scale that close reading alone cannot easily grasp. The computational segmentation by BOOK and the comparison of term frequencies provide empirical support for established interpretive claims about moral cultivation and political governance.

Within Digital Arts & Humanities, this project exemplifies the combination of technical method and critical inquiry. Cleaning the text, defining stop words, selecting key terms, and structuring the corpus are all interpretive decisions that shape the outcome of the analysis. The project therefore shows that digital tools do not produce neutral results, but operate within conceptual frameworks defined by human judgment.