Resources for Text Analysis

Welcome to the ASA Section on Text Analysis's curated Resource Page. It serves as a hub for statisticians interested in text analysis. It is a living repository of academic papers, tools, tutorials, datasets, educational resources, and community links, all aimed at promoting responsible, effective, and innovative use of text data in statistical research and practice.

Content on this page is community driven. You can help us crowdsource high-quality links and references that can benefit researchers, educators, students, and practitioners across a variety of domains. To contribute, please submit via our Google form: 👉 https://forms.gle/5ibxAN1WChkcbbMV9 

Academic Papers

Seminal Papers

  • "Attention is All You Need" (Vaswani, Ashish, et al., 2017)
    • This paper introduced the Transformer architecture, which revolutionized Natural Language Processing (NLP). Transformers rely on attention mechanisms to weigh the importance of different parts of the input sequence, enabling parallel processing and capturing long-range dependencies more effectively than previous recurrent models. It is the foundation for many modern NLP models.
  • "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin, Jacob, et al., 2019)
    • BERT (Bidirectional Encoder Representations from Transformers) leverages the Transformer architecture for pre-training language models. It introduced the concept of bidirectional training, where the model learns from both the left and right contexts of a word. BERT has significantly improved various NLP tasks, such as question answering, sentiment analysis, and text classification.

  • "Language Models are Unsupervised Multitask Learners" (GPT-2) (Radford, Alec, et al., 2019)
    • GPT-2 expanded on GPT-1 with a much larger model and dataset. It showcased the ability of large language models to perform various tasks without explicit fine-tuning, demonstrating their strong generalization capabilities.

  • "Improving Language Understanding by Generative Pre-Training" (GPT-1) (Radford, Alec, et al., 2018)
    • This paper introduced the first Generative Pre-trained Transformer (GPT) model. GPT-1 demonstrated the effectiveness of pre-training a language model on a large corpus of text and then fine-tuning it on specific downstream tasks. It laid the groundwork for subsequent GPT models.
  • "GloVe: Global Vectors for Word Representation" (Pennington, Jeffrey, Richard Socher, and Christopher D. Manning, 2014)
    • GloVe (Global Vectors for Word Representation) is a method for learning word embeddings from word co-occurrence counts. It combines the advantages of global matrix factorization and local context window methods, so the resulting word representations reflect both corpus-wide statistics and local context information.

  • "Efficient Estimation of Word Representations in Vector Space" (Word2Vec) (Mikolov, Tomas, et al., 2013)
    • This paper introduced Word2Vec, a technique for learning word embeddings. Word embeddings represent words as dense vectors in a high-dimensional space, where semantically similar words are close to each other. Word2Vec has been highly influential in NLP and has enabled many downstream applications.
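
The idea that semantically similar words sit close together in an embedding space can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical values invented for this example (they are not taken from Word2Vec or GloVe), and cosine similarity is used here as one common closeness measure rather than the only option.

```python
import math

# Hypothetical 3-dimensional toy embeddings, invented for illustration only.
# Real Word2Vec or GloVe vectors are learned from a corpus and are much longer.
TOY_EMBEDDINGS = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return the other word whose vector is most similar to that of `word`."""
    target = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


sim_royal = cosine_similarity(TOY_EMBEDDINGS["king"], TOY_EMBEDDINGS["queen"])
sim_fruit = cosine_similarity(TOY_EMBEDDINGS["king"], TOY_EMBEDDINGS["banana"])
print(f"king vs queen:  {sim_royal:.3f}")
print(f"king vs banana: {sim_fruit:.3f}")
print("nearest to king:", most_similar("king", TOY_EMBEDDINGS))
```

With pretrained embeddings, the same similarity function would be applied to the learned vectors in place of the toy dictionary.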

Introductory Papers

  • Recent Advances in Text Analysis
    • This paper reviews popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, it reviews Topic-SCORE, a statistical approach to topic modeling, and discusses how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications. The application of Topic-SCORE and other methods to MADStat leads to interesting findings, such as 11 representative topics in statistics, journal ranking, and topic ranking. 

Tools

  • Common Python Packages
  • Hugging Face
    • Hugging Face is an open-source platform that provides state-of-the-art pretrained models, datasets, and tools for natural language processing and other machine learning tasks. Its Transformers library enables statisticians to fine-tune or apply modern language models for classification, topic modeling, summarization, embedding generation, and more using Python (and from R through bridging tools such as the reticulate package). It also hosts a large public model and dataset hub, facilitating reproducible research and rapid experimentation.
  • tidytext
    • tidytext is an R package that brings text mining into the tidyverse by representing text as tidy data (one-token-per-row). It provides tools for tokenization, stop-word removal, sentiment analysis, n-gram construction, and integration with dplyr, ggplot2, and other familiar workflows. This design makes it especially accessible for statisticians who want to apply text analysis using standard data manipulation and modeling pipelines in R.
  • tidylda
    • tidylda is an R package for Latent Dirichlet Allocation that is compatible with the "tidyverse" dialect of R programming.
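
The one-token-per-row layout described for tidytext can be sketched in a few lines. The example below is a Python analogy using only the standard library, not the tidytext API; the two documents, the stop-word list, and the function names are hypothetical and chosen only for illustration. In R, the corresponding steps would roughly be tidytext's unnest_tokens() followed by dplyr's anti_join() against a stop-word table and count().

```python
import re
from collections import Counter

# Hypothetical two-document corpus, invented for illustration.
DOCUMENTS = {
    "doc1": "Text data are messy, but text data are everywhere.",
    "doc2": "Tidy data make text analysis easier.",
}

# A tiny illustrative stop-word list; real analyses use a curated lexicon.
STOP_WORDS = {"are", "but", "make"}


def tidy_tokens(documents):
    """Return one (doc_id, token) row per lowercase word token."""
    rows = []
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            rows.append((doc_id, token))
    return rows


def remove_stop_words(rows, stop_words):
    """Drop rows whose token appears in the stop-word set."""
    return [(doc_id, token) for doc_id, token in rows if token not in stop_words]


def count_tokens(rows):
    """Count how often each (doc_id, token) pair occurs."""
    return Counter(rows)


rows = remove_stop_words(tidy_tokens(DOCUMENTS), STOP_WORDS)
counts = count_tokens(rows)
for (doc_id, token), n in counts.most_common(5):
    print(doc_id, token, n)
```

Because every row is a single (document, token) pair, later steps such as joining a sentiment lexicon or aggregating by document reduce to ordinary filtering and grouping operations.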

Tutorials & Workshops

  • Primer on mop and generative AI
    • This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into five main chapters, each exploring a key area: pre-training, generative models, prompting, alignment, and inference. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.

Educational Resources

  • The Data Science and Predictive Analytics (DSPA) Platform
    • The Data Science and Predictive Analytics (DSPA) platform (at SOCR UMich) includes learning modules and complete end-to-end electronic R Markdown notebooks for text mining, NLP, statistical learning, and AI prediction (covering text, images, and quantitative data).
  • A collection of resources for text analysis (click through and request access)
    • A collection of key resources for text analysis, including foundational academic papers, leading NLP software tools, tutorials, text datasets, and open-access educational resources to support research, teaching, and practice.

Data Sets

  • Multi-attribute data set on statistics journals
    • This data set contains the text abstracts of 83,331 papers published in 36 statistics-related journals between 1975 and 2015. [Re: Ke, Ji, Jin, and Li (2023). Recent Advances in Text Analysis. Annual Review of Statistics and Its Application.]
  • Regulations.gov
    • Regulations.gov is a U.S. federal portal that hosts public comments submitted in response to proposed rules, requests for information (RFIs), and regulatory reviews (e.g., under EGRPRA). Agencies use these comments, alongside other evidence, when revising proposed rules into final regulations. The platform provides large-scale, real-world corpora of policy-related text that are well suited for statistical and computational text analysis to support research and regulatory decision-making.

Communities (Forums, etc.)