Bias Detection in Media: An NLP-Based Approach using Corpus Statistics and Sentence Embeddings

Under the mentorship of Dr. Clayton Greenberg at the University of Pennsylvania, Neeraj Gummalam conducted research published in the Journal of Student Research (https://doi.org/10.47611/jsrhs.v13i2.6587) that used various natural language processing (NLP) techniques, including corpus statistics, sentence embeddings, and other vectorization methods, to detect bias in written text. Neeraj developed a model combining Pointwise Mutual Information (PMI) with a dual TF-IDF strategy to classify biased and unbiased sentences more effectively, highlighting its potential as a tool to help readers identify media bias. Through this work, Neeraj and his mentor also identified a potential flaw in the training data of Google’s Universal Sentence Encoder (USE), contributing to broader discussions on data quality in machine learning.

Abstract

In this paper, we implement a Natural Language Processing (NLP) solution for binary classification that categorizes a sentence as biased or unbiased. Detecting bias in today's media is challenging, but automated detection can help readers identify which sources portray bias. Our general approach represents words and sentences using probability estimates or pretrained vectorization models. Our final model contained only probabilistic data about the connections between words, sentences, and each class. We used Pointwise Mutual Information (PMI) and Term Frequency-Inverse Document Frequency (TF-IDF) as heuristics for measuring the relationship between sentences and the biased and unbiased classes. We also leveraged Google's Universal Sentence Encoder (USE) to capture the meaning of the sentences. Our results revealed a possible limitation in USE's training data with respect to bias detection. Through topic analysis, we uncovered insights into which topics are characterized by minimal bias, and we used these discoveries to contextualize the model's performance.
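To give a concrete feel for the PMI heuristic mentioned in the abstract, the sketch below scores a sentence by how strongly its words co-occur with the biased versus unbiased class. This is an illustrative toy only: the corpus, smoothing constant, and scoring rule are invented here for demonstration and are not the paper's actual model, data, or dual TF-IDF strategy.

```python
import math
from collections import Counter

# Toy labeled corpus (hypothetical sentences, not the paper's dataset).
corpus = [
    ("the disastrous policy ruined everything", "biased"),
    ("the radical scheme was a terrible failure", "biased"),
    ("the committee approved the policy on tuesday", "unbiased"),
    ("the report was released in march", "unbiased"),
]

# Count word/class co-occurrences over whitespace tokens.
word_class = Counter()
word_counts = Counter()
class_counts = Counter()
total = 0
for sentence, label in corpus:
    for w in sentence.split():
        word_class[(w, label)] += 1
        word_counts[w] += 1
        class_counts[label] += 1
        total += 1

def pmi(word, label, smoothing=1e-12):
    """PMI(w, c) = log[ P(w, c) / (P(w) P(c)) ], smoothed to avoid log(0)."""
    p_wc = word_class[(word, label)] / total
    p_w = word_counts[word] / total
    p_c = class_counts[label] / total
    return math.log((p_wc + smoothing) / (p_w * p_c + smoothing))

def score(sentence):
    """Sum per-word class preference; positive leans biased, negative unbiased."""
    return sum(pmi(w, "biased") - pmi(w, "unbiased")
               for w in sentence.split() if w in word_counts)

print(score("the disastrous scheme"))    # > 0: leans biased
print(score("the report was approved"))  # < 0: leans unbiased
```

A real system would compute these statistics over a large labeled corpus and, as the abstract notes, combine the PMI signal with TF-IDF-based features rather than using raw PMI sums alone.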

https://doi.org/10.47611/jsrhs.v13i2.6587
