2026-05-26 · data · code

Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech

Felix Ostrowicki, Hubert Plisiecki

Key claim

Interaction SSD reveals moderated meaning in hate-speech judgments.

The paper introduces interaction SSD, a method for testing whether the semantic gradient that links text meaning to ratings differs across levels of a moderator. It demonstrates the method on the UC Berkeley Measuring Hate Speech corpus and reports significant moderation by annotator racial identity. The authors present this as a way to make hate-speech judgments more interpretable.

Novelty
7.5/10

The method extends the existing SSD framework with feature-by-moderator interaction terms, so that semantic gradients can differ across moderator levels.

Reliability
8.0/10

The claims are supported by an empirical test on the UC Berkeley Measuring Hate Speech corpus, which shows significant moderation by annotator racial identity.

Deep reliability assessment

The methodology supports a descriptive, interpretable test of whether linear semantic-gradient-to-rating relationships differ across a binary annotator moderator within this specific hate-speech dataset. Reading the results as evidence of causal psychological mechanisms, stable racial-group differences, or generalizable annotator behavior would overclaim, especially because the analysis lacks crossed annotator/comment random effects and depends on the chosen embedding geometry.

Reproducibility

Yes. The paper uses the public UC Berkeley Measuring Hate Speech corpus and cites code availability via OSF at https://osf.io/xqv7h.

Discussion questions

  1. Does SSD's assumption that a single linear direction in embedding space captures an interpretable semantic gradient hold when hate-speech judgments depend on context, pragmatics, reclamation, and speaker identity?
  2. For builders of annotation systems or moderation tools, should interaction SSD be used to audit subgroup differences in labels, or could it unintentionally reify demographic categories and produce misleading fairness narratives?
  3. What result would falsify the claimed moderation effect: disappearance under a mixed-effects model with annotator/comment random effects, instability across embedding models, or failure to replicate on another hate-speech corpus?

Key figure

The key method pipeline embeds texts, reduces them with PCA, fits a regression with semantic features, moderator, and feature-by-moderator interactions, then back-projects main, interaction, and conditional gradients for nearest-neighbor and cluster interpretation.
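
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names interaction_ssd and nearest_neighbors, the 0/1 coding of the moderator, plain least squares, and the number of PCA components are all assumptions made for this example. Cluster interpretation is omitted.

```python
import numpy as np

def interaction_ssd(X, y, m, n_components=5):
    """Hypothetical sketch of the pipeline (names and coding are assumptions).

    X: (n, d) text embeddings, y: (n,) ratings, m: (n,) binary moderator coded 0/1.
    Returns gradients back-projected into the original embedding space.
    """
    X, y, m = np.asarray(X, float), np.asarray(y, float), np.asarray(m, float)

    # Center the embeddings and reduce them with PCA (via SVD).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T          # (d, k) PCA loadings
    Z = Xc @ W                       # (n, k) component scores

    # Design matrix: intercept, semantic features, moderator,
    # and feature-by-moderator interaction terms.
    D = np.column_stack([np.ones(len(y)), Z, m, Z * m[:, None]])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)

    k = n_components
    b_main = coef[1:1 + k]           # slopes for the m == 0 group
    b_int = coef[2 + k:2 + 2 * k]    # change in slopes for the m == 1 group

    # Back-project coefficient vectors from PCA space to embedding space.
    # With 0/1 coding, the main gradient equals the conditional gradient at m == 0.
    return {
        "main": W @ b_main,
        "interaction": W @ b_int,
        "conditional_m0": W @ b_main,
        "conditional_m1": W @ (b_main + b_int),
    }

def nearest_neighbors(gradient, vocab_vectors, words, top=5):
    """Rank vocabulary items by cosine similarity to a gradient direction."""
    vocab_vectors = np.asarray(vocab_vectors, float)
    norms = np.linalg.norm(vocab_vectors, axis=1) * np.linalg.norm(gradient)
    sims = (vocab_vectors @ gradient) / norms
    order = np.argsort(-sims)[:top]
    return [words[i] for i in order]

if __name__ == "__main__":
    # Synthetic stand-in data; real use would pass sentence embeddings and ratings.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 10))
    m = rng.integers(0, 2, size=300)
    y = X[:, 0] + m * X[:, 1] + rng.normal(scale=0.1, size=300)
    gradients = interaction_ssd(X, y, m, n_components=10)
    print({name: g.round(2) for name, g in gradients.items()})
```

The interaction gradient is the quantity of interest for moderation: it points in the direction along which the text-to-rating relationship differs between the two moderator groups. A different moderator coding (for example, effect coding) would change what the main gradient represents.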

Code link
osf.io/xqv7h (Official)