Topic Modeling
for Amnesty International Data



Bugra Akyildiz

Data Scientist at Axial


Machine Learning Newsletter


A network that brings private companies and investors together

Gives business owners access to private capital markets

We are hiring!


Data science in the service of humanity

Nonprofits: NGOs, governments and such

Amnesty International

Global movement of people fighting injustice and promoting human rights

Data science for humanity and profit


Amnesty International Data


  • Fear of Safety
  • Getting threats
  • ...


  • Conviction
  • Execution
  • Disappearance
  • ...

Example News

Another One

Machine Learning Model

Topic Modeling

The unsupervised learning method you apply to a bunch of text when you have no idea what to do with it

"Topic model" is an umbrella name for a suite of graphical models for discovering topics or themes in a collection of documents.

Why Topic Model?

  • Useful to explore the documents
  • Easy to apply to unstructured documents
  • Unsupervised, no need for labels
  • Applicable to any medium-to-long form text

Latent Dirichlet Allocation

### LDA
- $M$: Number of documents
- $N$: Number of words
- $\alpha$: Dirichlet parameter on the document-topic distributions
- $\beta$: Dirichlet parameter on the topic-word distributions
- $\theta_i$: Topic distribution of document $i$
- $\phi_k$: Word distribution of topic $k$
- $w_{ij}$: $j$th word of the $i$th document
- $z_{ij}$: Topic assignment of the word $w_{ij}$

### Observed
- $w_{ij}$

### How?
- Generate $\theta_i \sim \mathrm{Dir}(\alpha)$ for each document $i$
- Generate $\phi_k \sim \mathrm{Dir}(\beta)$ for each topic $k$
- For each document $i$ and word position $j$:
  - Choose a topic $z_{ij} \sim \mathrm{Multinomial}(\theta_i)$
  - Choose a word $w_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij}})$
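
The generative steps above can be sketched in a few lines of NumPy. This is a toy illustration, not the code used for the Amnesty data: the number of topics `K`, the vocabulary size `V`, the function name `generate_corpus`, and all parameter values are assumptions made for the example, and words are represented as integer ids.

```python
import numpy as np


def generate_corpus(M, N, K, V, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process.

    M documents of N words each, K topics, vocabulary of V word ids.
    K and V are illustrative names that do not appear on the slide.
    """
    rng = np.random.default_rng(seed)
    # theta[i]: topic distribution of document i, drawn from Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    # phi[k]: word distribution of topic k, drawn from Dir(beta)
    phi = rng.dirichlet(np.full(V, beta), size=K)

    z = np.empty((M, N), dtype=int)  # topic assignment per word
    w = np.empty((M, N), dtype=int)  # observed word ids
    for i in range(M):
        for j in range(N):
            z[i, j] = rng.choice(K, p=theta[i])       # z_ij ~ Multinomial(theta_i)
            w[i, j] = rng.choice(V, p=phi[z[i, j]])   # w_ij ~ Multinomial(phi_{z_ij})
    return theta, phi, z, w


theta, phi, z, w = generate_corpus(M=5, N=20, K=3, V=10, alpha=0.5, beta=0.1)
print(w[0])  # word ids of the first synthetic document
```

Only `w` would be observed in practice; inference runs this process in reverse to recover `theta`, `phi`, and `z`.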

Simple LDA Illustration

Topic Keywords

How Similar Are the Topics?
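
One way to answer this question is to compare the topic-word distributions pairwise. The slides do not say which measure was used, so the sketch below assumes the Hellinger distance (0 for identical distributions, 1 for non-overlapping ones); the `phi` matrix is made-up toy data, not output from the Amnesty model.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


# Hypothetical topic-word distributions: 3 topics over a 4-word vocabulary.
phi = np.array([
    [0.50, 0.30, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
    [0.05, 0.05, 0.40, 0.50],
])

K = phi.shape[0]
# distances[a, b] is small when topics a and b put weight on the same words
distances = np.array(
    [[hellinger(phi[a], phi[b]) for b in range(K)] for a in range(K)]
)
print(np.round(distances, 2))
```

A matrix like `distances` can be drawn as a heatmap to spot near-duplicate topics.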

### Dirichlet Distribution
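
A quick way to build intuition for the Dirichlet distribution is to sample from it with different concentration parameters. The sketch below assumes a symmetric Dirichlet over `K = 3` topics and illustrative values of `alpha`; it reports the average size of the largest component, which is close to 1 when samples are sparse and close to `1/K` when they are spread evenly.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of components (illustrative)

peak = {}
for alpha in (0.1, 1.0, 10.0):
    # each row is one sample: a probability vector of length K
    samples = rng.dirichlet(np.full(K, alpha), size=1000)
    # average weight of the dominant component in each sample
    peak[alpha] = samples.max(axis=1).mean()
    print(f"alpha={alpha}: mean largest component = {peak[alpha]:.2f}")
```

Small `alpha` pushes each document toward a single dominant topic; large `alpha` mixes all topics roughly equally.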






More in the Amnesty Repository

Topics Represented as Graph

Every color represents a topic

## Total Taxonomy

```python
event_types = {
    'event_1' : "Threat",
    'event_2' : "Action",
    'event_3' : "Country",
    'event_4' : "Person/Group",
    'event_5' : "Issue",
    'event_6' : "Human Rights Protest Actions"
}

# Classification Types, Only apply to events 1 and 2
class_types = {
    'class_1' : "Death",
    'class_2' : "Incarceration with Impending Death",
    'class_3' : "Violence",
    'class_4' : "Incarceration",
    'class_5' : "Ill-treatment",
    'class_6' : "Forced Movement",
    'class_7' : "Threat of Death",
    'class_8' : "Threat of Violence ",
    'class_9' : "Threat of Incarceration",
    'class_10' : "Risk",
    'class_11' : "legal issues",
    'class_12' : "Health Concern" ,
    'class_13' : "Abduction",
    'class_14' : "Corporate Abuse",
    'class_15' : "Discrimination",
    'class_16' : "Torture",
    'class_17' : "Threat of Torture"
}
```

Fear of Safety



If all you have is a hammer, everything looks like a nail

Not everything has to be a clustering problem.

Know its limitations


Correlated Topic Models

Dynamic Topic Models

Supervised Topic Models

Turbo Topics



  • Factorie (Scala)
  • Mallet (Java)
  • Gensim (Python)
  • Stanford Topic Modeling Toolbox (Scala)

Blei's Topic Modeling Page


## References
- Introduction to Dirichlet Distribution and Related Processes
- Introduction to Probabilistic Topic Models
- Review of Topic Models
- A tutorial on Topic Models