DataSquad: Analyzing ancient Chinese Buddhist text with Python

The UCLA Library DataSquad is a team of undergraduate students who support data-related projects at UCLA. As part of the UCLA Library Data Science Center (DSC), the DataSquad works with students who need help with their data projects and highlights the work of researchers using data at UCLA.

A graduate student from the Asian languages and cultures department sought help with a complex task: analyzing ancient Buddhist texts quantitatively.

The researcher wanted to compare the texts semantically, which meant extracting their meaning and measuring how similar they are to one another. This required some basic analysis of several dozen text documents, but with a twist. To do this, the DSC used a standard set of popular natural language processing (NLP) tools, including:

spaCy
scikit-learn
Data visualization tools, such as word clouds

While analyzing text is usually a straightforward process, this consultation presented its own challenges. One was that the documents were Buddhist texts from the Ming dynasty. Most of the text tools the DSC uses are designed for Western languages, so the team had to be creative with the analysis.

The source text was originally unsegmented, but a Python library made it possible to divide the text based on white space and vocabulary matching. By default, however, the segmentation relied on a modern Chinese dictionary, which did not match up perfectly with the older syntax, so the team explored replacing it with an ancient Chinese corpus and dictionary. Several are maintained by the Georgetown Treebank project, a text and language analysis tool.
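To make vocabulary matching concrete, here is a minimal sketch of one common dictionary-based approach, greedy forward maximum matching. It is not necessarily the method the team's library uses. The function name `segment`, the `max_len` parameter, and the tiny vocabulary are all hypothetical and chosen only for illustration.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary entry that matches,
    falling back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-vocabulary; a real project would load a full dictionary.
vocabulary = {"如是", "我聞", "一時", "佛"}

tokens = segment("如是我聞一時佛在", vocabulary)
print(tokens)
```

Because the output depends entirely on the vocabulary, swapping a modern dictionary for one built from an older corpus changes where the word boundaries fall, which is why the choice of dictionary mattered for these texts.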

Once the source files were segmented, the team used scikit-learn tools for basic statistical analysis. Visualization tools, including word clouds, were also used.
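The article does not say which scikit-learn tools were used. As one plausible illustration of comparing segmented documents, the sketch below weights tokens with TF-IDF and scores each pair of documents by cosine similarity. The three short documents are invented for the example, and the choice of TF-IDF is an assumption rather than the team's confirmed method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents, each already segmented into a list of tokens.
segmented_docs = [
    ["如是", "我聞", "一時", "佛"],
    ["如是", "我聞", "佛", "說"],
    ["山", "水", "雲"],
]

# Passing a callable analyzer tells scikit-learn to use the tokens as given
# instead of applying its own tokenizer.
vectorizer = TfidfVectorizer(analyzer=lambda doc_tokens: doc_tokens)
tfidf = vectorizer.fit_transform(segmented_docs)

# similarity[i, j] is the cosine similarity between documents i and j.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```

Supplying the tokens directly matters for Chinese text: the vectorizer's default token pattern keeps only tokens of two or more characters, so single-character words such as 佛 would otherwise be dropped.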

The eventual goal of this project is to explore automated content tagging.