Why YouTube?

When my research team at the University of Alberta began looking for sources of authentic immigrant experience in Canada, we kept running into the same problem with traditional data sources: they were mediated. Survey responses are shaped by how questions are framed. Interview data reflects the researcher–participant relationship. Administrative data captures categories, not voices.

YouTube comment sections are different. People who watch videos about Canadian immigration — "My first year in Canada," "Why I regret moving to Canada," "How I got my PR card" — write comments because they want to. The content is voluntary, unsolicited, and frequently candid in ways that formal research encounters rarely are.

The Corpus

We built a corpus of YouTube comments from videos focused on Canadian immigration, pulling data via the YouTube Data API. The resulting dataset ran to hundreds of thousands of comments across dozens of videos spanning several years. The first thing you notice at scale is noise: spam, promotional content, single-word reactions, non-English text. Significant preprocessing was required before analysis could begin.
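The filtering step can be sketched roughly as follows. This is an illustrative heuristic, not the study's actual pipeline: the function name, thresholds, and spam rules here are assumptions for demonstration.

```python
import re

# Hypothetical preprocessing filters; the post does not specify the
# actual rules or thresholds used on the real corpus.
URL_RE = re.compile(r"https?://\S+")
LATIN_RE = re.compile(r"[A-Za-z]")

def is_analyzable(text: str, min_tokens: int = 3) -> bool:
    """Drop spam-like, trivially short, or mostly non-Latin-script comments."""
    if URL_RE.search(text):           # crude promotional/spam heuristic
        return False
    if len(text.split()) < min_tokens:  # single-word reactions ("First!", "Nice")
        return False
    latin = sum(1 for ch in text if LATIN_RE.match(ch))
    return latin / max(len(text), 1) > 0.5  # rough English-script check

comments = [
    "Great video!",
    "Check https://spam.example",
    "Moved to Toronto in 2019 and still can't get my credentials recognized.",
]
kept = [c for c in comments if is_analyzable(c)]  # only the third survives
```

Even simple rules like these remove a large share of the raw volume; the harder judgment calls (code-switched comments, emoji-only replies) need more careful handling.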

The second thing you notice is how comments cluster around a small number of recurring concerns: housing costs, employment recognition of foreign credentials, language barriers, racism and discrimination, bureaucratic delays, loneliness and family separation, and — more recently — disappointment with outcomes relative to expectations.
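A first pass at surfacing those clusters can be as simple as keyword tagging before any heavier modelling. The theme lexicon below is a toy example I am inventing for illustration; the study's actual coding scheme is not reproduced here.

```python
# Hypothetical theme lexicon, for illustration only.
THEMES = {
    "housing": {"rent", "housing", "landlord", "mortgage"},
    "credentials": {"credential", "credentials", "degree", "licence", "license"},
    "loneliness": {"lonely", "loneliness", "family", "miss"},
}

def tag_themes(comment: str) -> set[str]:
    """Return every theme whose keywords appear in the comment."""
    words = set(comment.lower().split())
    return {theme for theme, kws in THEMES.items() if words & kws}

tag_themes("Rent is insane and I miss my family")  # tags housing and loneliness
```

Keyword tagging like this is noisy, but it gives a quick frequency baseline against which collocate and topic analyses can be checked.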

Voyant Tools and Spyral Notebooks

The primary analytical environment was Voyant Tools, with detailed analyses scripted in Spyral Notebooks. Voyant's corpus reader allowed us to examine term frequencies, collocates, and document-level patterns across the full comment set. But working with YouTube comments in Voyant also illustrated limitations I had written about in my postphenomenology paper. Voyant's stopword list, tokenisation defaults, and visualisation choices were calibrated for formal text. Comment language is full of abbreviations, emojis, code-switching, and non-standard spelling; some of this we handled with custom preprocessing, while the rest simply fell outside what the tool could handle gracefully. The tool's limitations were methodologically productive: they forced us to be explicit about what we were and were not capturing.
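To make the tokenisation problem concrete, here is a minimal sketch of the kind of custom tokeniser comment language calls for; it is not Voyant's tokeniser or the regexes we actually used, and the emoji range is an approximation.

```python
import re

# Words (keeping contractions whole) plus emoji as standalone tokens.
# \U0001F300-\U0001FAFF covers most, not all, emoji blocks.
TOKEN_RE = re.compile(
    r"[A-Za-z]+(?:'[A-Za-z]+)?"      # e.g. can't, don't
    r"|[\U0001F300-\U0001FAFF]"      # each emoji as its own token
)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())

tokenize("Can't wait 😂 fr fr no cap")
```

A default word tokeniser would split the contraction and silently drop the emoji, which in comment data often carries the sentiment of the whole message.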

What the Data Shows (Preliminary)

  • Sentiment in immigration-related comments has become more negative in recent years — more expressions of disappointment, hardship, or regret — consistent with reporting on housing affordability and immigration system backlogs.
  • Recognition of foreign credentials in employment is the single most frequently discussed frustration, more prominent than housing costs even in periods when housing dominates public discourse.
  • There is a strong positive sentiment cluster around social connections, community, and multiculturalism — people who are struggling economically still frequently express appreciation for Canadian social values.
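The first of these patterns, sentiment shifting over time, can be sketched as a year-by-year aggregation. The tiny lexicon and scoring rule below are stand-ins for demonstration; they are not the sentiment model used in the actual analysis.

```python
from collections import defaultdict
from statistics import mean

# Toy sentiment lexicon, for illustration only.
POS = {"love", "grateful", "welcome", "appreciate", "community"}
NEG = {"regret", "disappointed", "struggle", "expensive", "delay"}

def score(comment: str) -> int:
    """Positive-minus-negative keyword count for one comment."""
    words = comment.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

def yearly_sentiment(comments):
    """comments: iterable of (year, text); returns mean score per year."""
    by_year = defaultdict(list)
    for year, text in comments:
        by_year[year].append(score(text))
    return {y: mean(scores) for y, scores in sorted(by_year.items())}
```

Plotting the per-year means from a pipeline like this is one way to see the negative drift the bullet describes, though any lexicon approach needs validation against hand-coded samples.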

These are patterns in language use, not facts about immigrant experience. The work of connecting linguistic patterns to lived realities is where corpus analysis must be supplemented by qualitative reading — and, ultimately, by conversations with immigrant communities themselves.
