Text Mining

A Word Cloud of all the words I have used in all of my blogs.

I will shout from the heavens that I love voyant-tools! Before Voyant I didn’t have much of a concept of, or appreciation for, text mining. I was first introduced to Voyant by Laura McGrath when I was a TA for the Mellon Scholars Program at Hope College. She was an English major, and much of her work involved distant reading and extensive text analysis, so she built a lot of the Mellon seminar around working with datasets, utilizing text and data analysis tools, and producing visualizations. I was blown away by the speed, user-friendliness, and insightfulness of Voyant. I love the colorful visuals, the interactivity, the multilingual capacity, and the embedding functions. Such a brilliant tool! Okay, rant over. I was inspired and chose to incorporate Voyant as a unit in the Mellon seminar when I taught it the following year as a post-bacc. (See the blog tutorial here.) To introduce my students to Voyant and similar methodologies, I had them do a simple activity: read an Edgar Allan Poe murder-mystery short story, “The Murders in the Rue Morgue.” When they came to class having read the story, we ran it through Voyant to see what the tool might reveal about the text that was unexpected, and whether we could see any representations that accurately reflected the text. For example, looking at terms like “murder” and “weapon” in the Terms Berry visualization revealed close in-text proximity to “monster” and “animal,” which corresponded with the murderer actually being an orangutan (spoiler). Additionally, since I had intentionally chosen a murder mystery, could Voyant reveal any crucial plot points, or who the main character is based on name recurrence? The Word Cloud matched our close reading: the window was key to unraveling the mystery, and Dupin is the main character. It was a fun, light exercise, and student feedback suggested it was a great way to get their feet wet with what text analysis and mining can do.
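For readers who want a peek under the hood, here is a minimal sketch in Python of the two computations this exercise leaned on: word-cloud-style term frequencies and Terms-Berry-style co-occurrence within a small window. The stop list, window size, and file name are my own invented assumptions; Voyant’s actual implementation is more sophisticated, so treat this as the core idea only.

```python
# Sketch of what a word cloud and a Terms Berry compute under the hood:
# term frequencies plus co-occurrence counts within a small sliding window.
# The stop list and window size here are illustrative choices, not Voyant's.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "it", "that"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def cooccurrences(tokens, window=5):
    """Count unordered term pairs appearing within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

text = open("rue_morgue.txt", encoding="utf-8").read()  # hypothetical file
tokens = tokenize(text)
print(Counter(tokens).most_common(10))        # word-cloud-style frequencies
print(cooccurrences(tokens).most_common(10))  # Terms-Berry-style neighbors
```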

In general I have not conducted extensive research using text mining, but through projects like Ben Schmidt and Mitch Fraas’s “Mapping the State of the Union,” I have come to appreciate its potential, as well as the nuances of the intentional decisions that get made about which key terms to select. Lisa Rhody’s “The Story of Stop Words” did an exceptional job of bringing to the forefront many of those nuanced decisions and their implications in the realm of text mining. I think that is one of my main takeaways from this seminar thus far: every facet of a project involves intentional decision-making, which means that bias, conscious or not, can enter research at every stage, and one should actively interrogate one’s decisions and document them. In fact, looking back, even before my exposure to Voyant, Brandon Walsh ran a workshop at the 2017 annual Undergraduate Network for Research in the Humanities conference; there he introduced the concept of text mining, walking us through “bags of words,” what it means to make selective choices about which words to include, and the impact such choices can have. Because the Digital Humanities incorporates technology, there is a false perception that this scientific technology makes research more objective. I think this is wrong on two counts. One: scientific research is value-laden and not as objective as one may want to think (as my Philosophy of Science course would attest; see Robert Merton’s Social Theory and Social Structure, and also Safiya Umoja Noble’s Algorithms of Oppression, which highlights the biased, prejudiced nature of human-made technologies many take for granted as objective, like Google search). Two: the number of perhaps small, but still significant, deliberate choices that shape the scope, nature, results, and effects of a Digital Humanities project indicates, to me, a higher degree of subjectivity, and this, I’d argue, is extremely important to remember.
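Rhody’s point about stop words is easy to demonstrate in a few lines of Python. The two stop lists below are invented for illustration, as is the corpus file name; swapping one list for the other reshuffles the “top terms” a reader would take away from the very same text.

```python
# A small illustration of the stop-word decision: the stop list is an
# interpretive choice, and changing it changes the apparent "results".
import re
from collections import Counter

MINIMAL_STOPS = {"the", "a", "an", "of", "and", "to"}
AGGRESSIVE_STOPS = MINIMAL_STOPS | {"he", "she", "it", "was", "is", "that", "in", "i", "you"}

def top_terms(text, stop_words, n=10):
    """Return the n most frequent words after removing the given stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

text = open("corpus.txt", encoding="utf-8").read()  # hypothetical corpus file
print("minimal stop list:   ", top_terms(text, MINIMAL_STOPS))
print("aggressive stop list:", top_terms(text, AGGRESSIVE_STOPS))
```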

This week I played around with an idea I mentioned in class: putting my own papers into Voyant to see if I could identify a particular style. It is not the most relevant exercise for my research, but text mining is not exactly my area anyway, so I went for something different that is still relevant to me as an emerging scholar. I made the following decisions:

  • I chose the final papers I wrote for my first-semester graduate school seminars
  • I did not include the bibliographies
  • I did include the titles of the papers
  • My citation format is MLA, so the texts contain in-text parenthetical citations (a sketch of this preprocessing appears below)
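For reference, here is a hypothetical sketch of the preprocessing these decisions imply: cut everything after a “Works Cited” heading, and optionally strip MLA parenthetical citations (I chose to keep mine). The regular expressions and file name are my own rough approximations, not a complete MLA parser.

```python
# Hypothetical preprocessing matching the decisions above. The regexes are
# rough approximations of MLA conventions, not a complete parser.
import re

def prepare_paper(text, keep_citations=True):
    # Drop the bibliography: discard everything from a "Works Cited" heading on.
    text = re.split(r"\n\s*Works Cited\s*\n", text, maxsplit=1)[0]
    if not keep_citations:
        # Strip parenthetical citations such as (Lugones 42) or (Lugones 42-45).
        text = re.sub(r"\((?:[A-Z][A-Za-z.\- ]*? )?\d+(?:-\d+)?\)", "", text)
    return text

paper = open("seminar_paper.txt", encoding="utf-8").read()  # hypothetical file
cleaned = prepare_paper(paper, keep_citations=True)  # my choice: keep citations
```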

This Word Cloud is fascinating to me. It definitely fits the content of the papers, but it also helps me track recurring themes in my work and possible topics for my dissertation.

I think the most interesting result from the Terms Berry is the relationships it shows: “church” occurs alongside “Maracle,” “indigenous,” and “structure,” while “spiritual” occurs with “music,” “physical,” and “knowledge.” This has reignited my passion for these subjects and given me an interesting lens into my writing style.

I also noticed that the Bubblelines tool tracked the main terms across all of the papers, showing that “language” and “body” were the most consistently used terms throughout my writing. Perhaps Philosophy of Language is where I am headed…
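A hand-rolled, Bubblelines-style check is simple to sketch: count a few chosen terms per paper and see which are used most consistently. The file names and term list below are placeholders standing in for my own papers.

```python
# Bubblelines-style pass: frequency of a few chosen terms in each paper.
# File names and terms are placeholders, not my actual corpus.
import re
from collections import Counter

PAPERS = ["paper1.txt", "paper2.txt", "paper3.txt"]  # hypothetical files
TERMS = ["language", "body", "church", "spiritual"]

for path in PAPERS:
    words = Counter(re.findall(r"[a-z]+", open(path, encoding="utf-8").read().lower()))
    print(path, {term: words[term] for term in TERMS})
```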

Data in Humanities

Archives, Data, and Humanities: A Philosopher’s Reflections

This week our Digital Humanities seminar served as a good reminder of the possibilities and breadth of data potential in humanities fields. Miriam Posner’s blog post “Humanities Data: A Necessary Contradiction” was not only an excellent introduction to the notion that all objects bear metadata, but also a further case for why philosophers should consider data-based projects: data speaks, data can argue. Though I am not a historian and do not actively seek out archival projects, I have had a few experiences with archival-turned-dataset research projects that taught me a great deal about the local Dutch history of Holland, Michigan; Portuguese influence in the former colony of Goa, India; and the vast works of Indian philosopher and writer Rabindranath Tagore. Visiting archives was both astounding and concerning, for different reasons. (As an aside, there was something so profoundly sad about visiting the Goan archives in India and seeing worn, worm-eaten, molding diaries falling further into decay. The loss of cultural history like that hurts the soul.)

As discussed in class, it is important to remember that an archive tells a story, and that those in control of this narrative actively decide to sculpt it in a particular fashion. Remembering that archives are the results of decisions made by specific people is crucial to pushing against problematic understandings of history and modern culture; one must challenge easy excuses that historically oppressed or marginalized communities were not participating in events and narratives, because more often than not these communities have been intentionally curated out of such narratives. I was struck by this fact during my sophomore year of college, when I was faced with the task of producing a Holland-based digital humanities project. I was concerned by the lack of visibility of the Hispanic/Latino community in Holland, both in the town’s businesses and physical design and in its presence (or rather absence) in the archives. According to the most recent census, this community comprises nearly 30% of the Holland population, and yet there are next to no references to it in the archives. There was such a contradiction between what the Tihle Archives said was the history of Holland and what the actual communities, physical architecture, and ongoing traditions like Fiesta said that history was. I loved Data Feminism by Catherine D’Ignazio and Lauren Klein because it specifically addresses the ways in which data can be shaped to ignore, or, in contrast, intentionally reveal undocumented narratives.

This focus on articulating narratives, especially counternarratives to the dominant historical discourse, was one I carried forward into data-centered projects like Ethics of Expropriated Art, which used museum permanent-collection data to demonstrate the power dynamics and complex international relationships at work in art expropriation. That project taught me about the challenges of data curation and standardization. The readings by Gilliland, Tanner, Milligan, and the Library of Congress all pointed to various facets of data and metadata curation standards and practices, which were insightful and would have been incredibly helpful when I was designing my project! These readings also reminded me of how much I love that Omeka lets users add their own metadata categories; that flexibility is so valuable for big, messy projects. Though I am still not 100% clear on all of what TEI does, from what I do understand it is one more tool to help systematize, organize, and standardize data so that it is accessible and computer-analyzable, which is fantastic.
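To make TEI a little more concrete for myself: here is a toy, invented TEI-style header parsed with Python’s standard library. Real TEI documents are far richer than this snippet, but the point stands: standardized markup makes the metadata machine-readable.

```python
# A toy, invented TEI-style document parsed with Python's standard library.
# Real TEI headers are far richer; this just shows the machine-readability win.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
snippet = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Fiesta Oral History, Holland, MI</title>
        <author>Anonymous narrator</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

root = ET.fromstring(snippet)
print(root.find(".//tei:title", TEI_NS).text)   # Fiesta Oral History, Holland, MI
print(root.find(".//tei:author", TEI_NS).text)  # Anonymous narrator
```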

When considering the new capabilities of big-dataset curation, I am fascinated by the new possibilities for research approaches. Specifically, with data analysis and visualization tools like Voyant, Palladio, and RAWGraphs, plugging datasets or text files into these programs can actually prompt questions, not just attempt to reveal answers. I liked reading Franco Moretti’s book Graphs, Maps, Trees in my undergraduate years because he dissects the ways in which computer readings of texts present new perspectives and questions for exploration that may not have been realized otherwise through close readings. As mentioned above, philosophy does not lend itself to many obvious avenues for data-based projects, so I have not had extensive time to devote to this method of work; however, my understanding of this type of research was broadened by Moretti and then greatly enhanced when I designed and taught the datasets unit of the Mellon seminar. I re-engaged in the process with my students as they chose sources like rare books or Twitter hashtags to curate into spreadsheets, asked research questions of the data, ran it through data analysis and visualization programs, and drafted prospectuses for projects based on these initial findings. Taking students through this process was tedious for them, much more so than working with pre-made datasets, but I think it was valuable for them to see just how many sources they can glean valuable information and compelling research topics from. Most importantly, and most relevantly for a philosopher, this work can form arguments, and strong ones at that! For me and many of my students, this was the first work of its kind to be argument-based without relying on just citations and occasional statistics: graphs, maps, tables, charts, and figures coalesced into robust statements that translate to broader audiences. I love this aspect of the digital humanities, and though philosophy may not be an obvious or easy fit with this type of work, when the two come together I think there is great potential for powerful projects, especially in the areas I am interested in: Latina Feminism and decolonial/anticolonial studies.
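To make the “datasets prompt questions” point concrete, here is a minimal pandas sketch of the kind of first pass my students did once a spreadsheet existed; the file name and column names are hypothetical stand-ins for their curated datasets.

```python
# A first pass of the kind my students made once a spreadsheet existed:
# load it, ask it a question, and let the answer suggest the next question.
# The file and column names are hypothetical stand-ins for their datasets.
import pandas as pd

df = pd.read_csv("rare_books.csv")  # hypothetical curated dataset
print(df.head())

# For example: which decades dominate the collection, and which are missing?
print(df.groupby("decade")["title"].count().sort_values(ascending=False))
```

Often the interesting research question came not from the counts themselves but from the gaps they exposed, which is exactly the question-prompting dynamic Moretti describes.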

Text & Data Analysis Part 3: RAWGraphs Tutorial

Some of the first ventures into the field now known as the Digital Humanities began when humanists wanted to use computers to read and analyze large bodies of text. Beginning with Father Busa’s machine-readable index of Aquinas, text and data analysis projects have become some of the most popular in DH. This tutorial highlights how to use and understand one of three unique analysis tools. RAWGraphs is an easy-to-use, free online program that converts quantitative datasets into visually attractive graphs. The tool is so easy to use that my tutorial feels almost unnecessary, but since it is not widely known, I will include it in the three-part text and data analysis series. Click the picture below to find the PDF tutorial!

Text & Data Analysis Part 2: Palladio Tutorial

Some of the first ventures into the field now known as the Digital Humanities began when humanists wanted to use computers to read and analyze large bodies of text. Beginning with Father Busa’s machine-readable index of Aquinas, text and data analysis projects have become some of the most popular in DH. This tutorial highlights how to use and understand one of three unique analysis tools. Palladio converts large datasets into original visualizations: graphs, webs, maps, timelines, and galleries. Click the image below to find the PDF tutorial!

Text & Data Analysis Part 1: Voyant Tools Tutorial

Some of the first ventures into the field now known as the Digital Humanities began when humanists wanted to use computers to read and analyze large bodies of text. Beginning with Father Busa’s machine-readable index of Aquinas, text and data analysis projects have become some of the most popular in DH. This tutorial highlights how to use and understand one of three unique analysis tools, starting with Voyant Tools, a completely free, user-friendly text analyzer and visualizer. Click the image below for a PDF tutorial.