Text and Data Mining: A 2023 NISO Training Series

Scope

This comprehensive course is designed to equip students with the essential skills and knowledge required to undertake text and data mining tasks. Throughout this course, students will be introduced to key concepts and tools of text and data mining, including data types, data structures, data pre-processing, text processing, data mining techniques, text mining techniques, and advanced topics in both data and text mining. Each session will include a Python component, discussing the importance of Python and its libraries in handling various aspects of text and data mining. Students are not expected to know Python; rather, they will be shown how Python can solve key issues so that they are aware of its capabilities. By the end of the course, participants will have a solid understanding of text and data mining concepts, be proficient in using Python for text and data mining tasks, and be able to apply these skills to real-world library applications and case studies.

Learning Objectives

1. Understanding of Data, Data Structures, and Complex Data Types
2. Understanding of the main types of Machine Learning and their Applications
3. Understanding of the key Python libraries for text and data mining
4. Understanding of the primary methods for performing text and data mining

Training Facilitator

William Mattingly is a Postdoctoral Fellow at the Smithsonian Institution Data Science Lab in collaboration with the United States Holocaust Memorial Museum (USHMM). He has a B.A. and M.A. in History from Florida Gulf Coast University and a Ph.D. in History from the University of Kentucky. His dissertation research explored using historical social network analysis, cluster analysis, and computational methods for identifying ninth-century intellectual and pedagogical networks. Most recently, his research has focused on developing text classification neural network models to identify sources in medieval texts and developing natural language processing (NLP) methods for medieval Latin. At the Smithsonian and USHMM, he is developing machine learning methods to aid in, among other things, the cataloging of Holocaust documents. He is co-investigator and developer for the Structured Data Extraction and Enhancement in South Africa’s Truth and Reconciliation Archive project and lead investigator and developer for the Digital Alcuin Project.

Course Duration and Dates

The series consists of eight (8) weekly segments, each lasting 90 minutes. Specific dates are:

October 12, 19, 26
November 2, 9, 16, 30
December 7

Each session will be recorded, and a link to the archived recording will be sent to course registrants within 2 business days of the close of that session. We strongly encourage attendees to download these files to ensure continued access.

Event Sessions

Session One: Introduction to Text and Data Mining - Thursday, October 12

This session introduces the course, delineating its structure, and shedding light on both basic and complex data types and structures. It also provides an overview of pertinent data mining concepts, highlighting the crucial role that Python plays in text and data mining.

Objectives:

Introduction to the course and its structure
Introduction to basic and complex data types and structures
Overview of data mining concepts
Role of Python in text and data mining

Session Two: Data and Machine Learning - Thursday, October 19

This class gives an overview of machine learning, emphasizing the indispensable role of data and introducing various types of machine learning. It addresses challenges in this field, discussing ethical considerations vital to machine learning.

Objectives:

Machine Learning Overview
The Role of Data in Machine Learning
- Rendering Images Numerically
- Rendering Texts Numerically
- Rendering Categorical Data Numerically
Introduction to Types of Machine Learning:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Challenges in Machine Learning
Ethical Considerations
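
As a rough illustration of what "rendering categorical data numerically" can mean, the sketch below one-hot encodes a list of categories using only the Python standard library. It is not taken from the course materials; the function name `one_hot_encode` and the sample `formats` field are assumptions made up for this example.

```python
def one_hot_encode(values):
    """Map each distinct category to a 0/1 indicator vector."""
    # Sort the distinct categories so the column order is reproducible.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column for this item's category
        vectors.append(row)
    return categories, vectors


# Hypothetical catalog field: the format of each item.
formats = ["book", "map", "book", "serial"]
categories, vectors = one_hot_encode(formats)
print(categories)
print(vectors)
```

Each item becomes a vector with a single 1 in the column for its category, so no artificial ordering is implied between categories.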

Shared Resources:

Session Three: Data Pre-processing for Libraries - October 26

This lecture delves into techniques essential for data cleaning, transformation, and reduction, crucial processes to prepare data for further analysis and use.

Objectives:

Data Cleaning
- Handling missing data
- Working with noisy data
- Working with Outliers
Data Transformation
- Data normalization
Data Reduction
- Dimensionality Reduction
- Feature selection
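
A minimal sketch of two of the steps listed above, handling missing data and data normalization, might look like the following. It uses only the standard library so that it runs without extra installs; in practice the same steps are commonly done with the NumPy and Pandas libraries. The helper names and the sample `page_counts` values are hypothetical, and median imputation is only one of several possible strategies.

```python
from statistics import median


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical page counts with one missing entry.
page_counts = [120, None, 300, 200]
filled = fill_missing(page_counts)
normalized = min_max_normalize(filled)
print(filled)
print(normalized)
```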

Shared Resources:

NumPy
Pandas

Session Four: Data Mining Techniques - November 2

The session explores a variety of data mining techniques including classification, clustering, and dimensionality reduction methods (PCA, t-SNE, UMAP, Ivis). These techniques will be discussed with special focus on library data.

Objectives:

Classification
Types of Classification
- Binary Classification
- Multiclass Classification
- Multilabel Classification
- Hierarchical Classification

Shared Resources:

Resources for Open-Source Machine Learning Models

Hugging Face - The platform where the machine learning community collaborates on models, datasets, and applications.
Hugging Face DistilBERT-base-uncased - An open-source machine learning model. This model is a fine-tuned checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2. This model reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).
Hugging Face Dataset Card for "cardiffnlp/tweet_topic_multi" - This is the official repository of TweetTopic ("Twitter Topic Classification", COLING main conference 2022), a topic classification dataset on Twitter with 19 labels. Each instance of TweetTopic comes with a timestamp, ranging from September 2019 to August 2021. See cardiffnlp/tweet_topic_single for the single-label version of TweetTopic. The tweet collection used in TweetTopic is the same as that used in TweetNER7. The dataset is integrated in TweetNLP too.
Data Mining Teaching Tool - App by Streamlit

Session Five: Text Processing for Library Data - November 9

In this lecture, students will learn techniques for text cleaning, transformation, and representation, specifically applied to library text data. The session will also introduce various Natural Language Processing (NLP) libraries for text processing.

Objectives:

Text Cleaning: removing punctuation, numbers, and special characters from library text data
Text Transformation: tokenization, stemming, and lemmatization of library text data
Text Representation: bag-of-words, TF-IDF, and word embeddings for library text data
Python Component: Introduction to Natural Language Processing (NLP) libraries in Python like NLTK and spaCy.
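
To make the cleaning and representation steps above concrete, here is a small standard-library sketch of tokenization and TF-IDF weighting. It is an illustration under stated assumptions rather than the course's own code: the `tokenize` and `tf_idf` helpers and the sample titles are hypothetical, and the weighting is the plain, unsmoothed form tf * log(N / df). Libraries often use smoothed variants, so their numbers may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs, dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical catalog titles.
titles = [
    "A History of Medieval Latin (1911)",
    "Medieval Networks: A Study",
    "Latin Grammar (1895)",
]
weights = tf_idf(titles)
print(tokenize(titles[2]))
print(weights[2])
```

Terms that appear in fewer documents receive higher weights, which is why "grammar" outranks "latin" in the third title.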

Shared Resources:

NLTK
spaCy

Additional Resource:

Data Mining Teaching Tool - App by Streamlit

Session Six: Text Mining Techniques - November 16

This session offers an in-depth discussion on various text mining techniques such as sentiment analysis, named entity recognition, topic modeling, and text classification.

Objectives:

Sentiment Analysis
Named Entity Recognition
Topic Modeling
Text Classification

Shared Resources:

Session Seven: Vector Databases and Semantic Searching - November 30

This lecture provides an overview of vectors for both text and images and introduces best practices in the field. It covers machine learning models applicable to text and images, as well as introduces vector databases and semantic searching libraries.

Objectives:

Refresher on Vectors for Text
Vectors for Images
Best Practices
Machine Learning Models for Text
- SentenceTransformers
Machine Learning Models for Images and Texts
- CLIP
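
The idea behind semantic searching over vectors can be sketched in a few lines: documents and queries are represented as vectors, and results are ranked by cosine similarity. The example below is a toy illustration only. The three-dimensional vectors and document ids are made up; in a real system the vectors would come from an embedding model such as those listed above, and a vector database would handle the storage and lookup.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Made-up 3-dimensional "embeddings" for three hypothetical documents.
index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 0.2, 0.9],
    "doc_c": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
results = search(query, index)
print(results)
```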

Shared Resources:

Additional Resources:

https://clip-demo.streamlit.app/

Streamlit App: Semantic Shakespeare - Developed by W.J.B. Mattingly using Streamlit and txtAI

USHMM Semantic Search

https://demo.prodi.gy/?=null&view_id=spans_manual - a demo of Prodigy, a modern annotation tool powered by active learning

Hugging Face: Transformers Model: The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.

Session Eight: Building Data Driven Applications - December 7

This session offers students a chance to learn about applying semantic search and vector databases through retrieval augmented generation (RAG), specifically with Verba from Weaviate. We also look at frameworks for building data-driven applications, namely Streamlit, Gradio, and R Shiny.

Objectives:

What is Retrieval Augmented Generation (RAG)?
How to get started with RAG and Verba
Python and R Frameworks for building Applications
- Streamlit
- Gradio
- R Shiny
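
As a very rough sketch of the retrieval augmented generation pattern, the code below retrieves the passages most relevant to a question and assembles them into a prompt for a language model. It does not use Verba or any Weaviate API, and it stops short of calling a model. The function names, the sample passages, and the prompt wording are all assumptions, and simple word overlap stands in for the vector search a real system would use.

```python
import re


def words(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, top_k=2):
    """Rank passages by how many words they share with the question."""
    question_words = words(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & words(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical passages from a library help page.
passages = [
    "The reading room opens at nine on weekdays.",
    "Interlibrary loan requests take about two weeks.",
    "The archive holds ninth-century manuscripts.",
]
question = "When does the reading room open on weekdays?"
top = retrieve(question, passages, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting prompt would then be passed to a language model, which generates an answer grounded in the retrieved context.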

Shared Resources:

Mattingly "Text and Data Mining: Building Data Driven Applications" from National Information Standards Organization (NISO)

Additional Information

Each registration allows for up to three (3) individuals to participate using three (3) different user logins. Eventbrite will only ask for information for the first individual. Up to two additional names and email addresses may be added by contacting Sara Groveman directly via email at sgroveman@niso.org.

Registrants receive unique sign-on instructions via email three business days prior to each session. If you have not received your instructions by the day before an event, please contact NISO headquarters for assistance via email (nisohq@niso.org).

Registrants for an event may cancel participation and receive a refund (less $35.00) if the notice of cancellation is received at NISO HQ (nisohq@niso.org) one full week prior to the event date. If received less than 7 days before, no refund will be provided.

Links to the archived recording of the broadcast are distributed to registrants 24-48 hours following the close of the live event. Access to that recording is intended for internal use of fellow staff at the registrant’s organization or institution. Speaker presentations are posted to the NISO event page.

Broadcast Platform

NISO uses the Zoom platform to broadcast our live events. Zoom provides apps for a variety of computing devices (tablets, laptops, etc.). To view the broadcast, you will need a device that supports the Zoom app. Attendees may also choose to listen to just the audio on their phones; sign-on credentials include the necessary dial-in numbers. Once you are notified of their availability, recordings may be downloaded from the Zoom platform to your machine for local viewing.

Event Dates

October 12, 2023 11:00am – December 07, 2023 12:30pm

Fees

Members:

Early bird registration: Register before midnight on September 28th and pay a discounted rate of USD $750.00.
Register after September 29th and pay USD $850.00

Non-Members:

Early bird registration: Register before midnight on September 28th and pay a discounted rate of USD $825.00
Register after September 29th and pay USD $925.00

Groups:

Each registration allows for up to three (3) individuals to participate using three (3) different user logins. Eventbrite will only ask for information for the first individual. Up to two additional names and email addresses may be added by contacting Sara Groveman directly via email at sgroveman@niso.org.

Please note that it is not possible to register for individual program segments or lectures.

Location

This is an 8-week series, with each weekly segment having a duration of 90 minutes. It is a virtual event. NISO uses the Zoom platform to deliver our virtual events. Please check your system in advance to make sure it meets Zoom (US) requirements.

IMPORTANT: All session times are in Eastern Time (US & Canada). Registrants should plan accordingly, as the end of daylight saving time will affect sessions after November 5, 2023. In addition, we will not be meeting on Thursday, November 23, due to the holiday.
