This comprehensive course is designed to equip students with the essential skills and knowledge required to undertake text and data mining tasks. Throughout the course, students will be introduced to the key concepts and tools of text and data mining, including data types, data structures, data pre-processing, text processing, data mining techniques, text mining techniques, and advanced topics in both fields. Each session includes a Python component discussing the importance of Python and its libraries in handling various aspects of text and data mining. Students are not expected to know Python; rather, they will be shown how Python can solve key problems so that they are aware of its capabilities. By the end of the course, participants will have a solid understanding of text and data mining concepts, be conversant with Python's role in text and data mining tasks, and be able to apply these skills to real-world library applications and case studies.
Learning Outcomes
1. Understanding of Data, Data Structures, and Complex Data Types
2. Understanding of the main types of Machine Learning and their Applications
3. Understanding of the key Python libraries for text and data mining
4. Understanding of the primary methods for performing text and data mining
William Mattingly is a Postdoctoral Fellow at the Smithsonian Institution Data Science Lab in collaboration with the United States Holocaust Memorial Museum (USHMM). He has a B.A. and M.A. in History from Florida Gulf Coast University and a Ph.D. in History from the University of Kentucky. His dissertation research explored the use of historical social network analysis, cluster analysis, and computational methods for identifying ninth-century intellectual and pedagogical networks. Most recently, his research has focused on developing text classification neural network models to identify sources in medieval texts and developing natural language processing (NLP) methods for medieval Latin. At the Smithsonian and USHMM, he is developing machine learning methods to aid in, among other things, the cataloging of Holocaust documents. He is co-investigator and developer for the Structured Data Extraction and Enhancement in South Africa’s Truth and Reconciliation Archive project and lead investigator and developer for the Digital Alcuin Project.
Course Duration and Dates
The series consists of eight (8) weekly segments, each lasting 90 minutes. Specific dates are:
- October 12, 19, 26
- November 2, 9, 16, 30
- December 7
Each session will be recorded, and a link to the archived recording will be disseminated to course registrants within two business days of the close of that session. We strongly encourage attendees to download these files to ensure continued access.
Session One: Introduction to Text and Data Mining - Thursday, October 12
This session introduces the course and its structure, and covers both basic and complex data types and structures. It also provides an overview of pertinent data mining concepts, highlighting the crucial role Python plays in text and data mining.
- Introduction to the course and its structure
- Introduction to basic and complex data types and structures
- Overview of data mining concepts
- Role of Python in text and data mining
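As a taste of the session's Python component, the sketch below shows how basic data types combine into the complex structures the course discusses; the library catalog record here is an invented example.

```python
# Basic data types hold single values.
year = 2021                       # int
title = "Atlas of AI"             # str

# Complex data structures combine basic types:
record = {                        # dict: key-value pairs
    "title": title,
    "year": year,
    "subjects": ["AI", "ethics"], # list: an ordered collection
}

print(record["subjects"][0])      # -> AI
```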
Session Two: Data and Machine Learning - Thursday, October 19
This class gives an overview of machine learning, emphasizing the indispensable role of data and introducing the main types of machine learning. It also addresses challenges in the field and discusses the ethical considerations vital to machine learning.
- Machine Learning Overview
- The Role of Data in Machine Learning
- Rendering Images Numerically
- Rendering Texts Numerically
- Rendering Categorical Data Numerically
- Introduction to Types of Machine Learning:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Challenges in Machine Learning
- Ethical Considerations
Session Three: Data Pre-processing for Libraries - Thursday, October 26
This lecture delves into techniques essential for data cleaning, transformation, and reduction, crucial processes to prepare data for further analysis and use.
- Data Cleaning
- Handling missing data
- Working with noisy data
- Working with Outliers
- Data Transformation
- Data normalization
- Data Reduction
- Dimensionality Reduction
- Feature selection
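Two of the steps outlined above, handling missing data and normalization, can be sketched in plain Python; real projects would typically reach for pandas or scikit-learn, and the values here are invented.

```python
values = [3.0, None, 7.0, 5.0]

# Data cleaning: impute missing entries with the mean of the rest.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)
cleaned = [mean if v is None else v for v in values]

# Data transformation: min-max normalization to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]
print(normalized)  # -> [0.0, 0.5, 1.0, 0.5]
```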
Recommended Reading: Kate Crawford, Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence - The hidden costs of artificial intelligence, from natural resources and labor to privacy, equality, and freedom.
Session Four: Data Mining Techniques - Thursday, November 2
The session explores a variety of data mining techniques including classification, clustering, and dimensionality reduction methods (PCA, t-SNE, UMAP, Ivis). These techniques will be discussed with special focus on library data.
- Types of Classification
- Binary Classification
- Multiclass Classification
- Multilabel Classification
- Hierarchical Classification
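The difference between these classification types is easiest to see in how their labels are structured; the catalog-style examples below are invented for illustration.

```python
# Binary: each item belongs to one of exactly two classes.
binary = {"is_fiction": 1}

# Multiclass: each item belongs to exactly one of many classes.
multiclass = {"format": "journal"}

# Multilabel: an item may carry several labels at once.
multilabel = {"subjects": ["AI", "ethics"]}
```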
Resources for Open-Source Machine Learning Models
- Hugging Face - The platform where the machine learning community collaborates on models, datasets, and applications.
- Hugging Face DistilBERT-base-uncased fine-tuned on SST-2 - An open-source machine learning model. This model is a fine-tuned checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2. It reaches an accuracy of 91.3 on the dev set (for comparison, bert-base-uncased reaches an accuracy of 92.7).
- Hugging Face Dataset Card for "cardiffnlp/tweet_topic_multi" - The official repository of TweetTopic ("Twitter Topic Classification," COLING main conference 2022), a topic classification dataset of tweets with 19 labels. Each instance of TweetTopic comes with a timestamp, distributed from September 2019 to August 2021. See cardiffnlp/tweet_topic_single for the single-label version of TweetTopic. The tweet collection used in TweetTopic is the same as the one used in TweetNER7, and the dataset is also integrated into TweetNLP.
- Data Mining Teaching Tool - App by Streamlit
Session Five: Text Processing for Library Data - Thursday, November 9
In this lecture, students will learn techniques for text cleaning, transformation, and representation, specifically applied to library text data. The session will also introduce various Natural Language Processing (NLP) libraries for text processing.
- Text Cleaning: removing punctuation, numbers, and special characters from library text data
- Text Transformation: tokenization, stemming, and lemmatization of library text data
- Text Representation: bag-of-words, TF-IDF, and word embeddings for library text data
- Python Component: Introduction to Natural Language Processing (NLP) libraries in Python like NLTK and spaCy.
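To preview the text-representation topic, here is a minimal bag-of-words and TF-IDF sketch in plain Python; in practice a library such as scikit-learn would handle this, and the three short documents are invented.

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]            # simple tokenization
vocab = sorted({w for doc in tokenized for w in doc})

def tf(doc):
    """Bag-of-words: raw counts per vocabulary word."""
    return [doc.count(w) for w in vocab]

def idf(word):
    """Inverse document frequency: rarer words score higher."""
    n_containing = sum(1 for doc in tokenized if word in doc)
    return math.log(len(docs) / n_containing)

bow = [tf(doc) for doc in tokenized]
tfidf = [[count * idf(word) for count, word in zip(row, vocab)]
         for row in bow]
# "the" appears in every document, so its TF-IDF weight is zero.
```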
Data Mining Teaching Tool - App by Streamlit
Session Six: Text Mining Techniques - Thursday, November 16
This session offers an in-depth discussion on various text mining techniques such as sentiment analysis, named entity recognition, topic modeling, and text classification.
- Sentiment Analysis
- Named Entity Recognition
- Topic Modeling
- Text Classification
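Sentiment analysis, the first technique listed, can be illustrated with a toy lexicon-based approach; real work would use a trained model (for example via NLTK, spaCy, or Hugging Face pipelines), and the word lists here are invented.

```python
POSITIVE = {"good", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "unhelpful"}

def sentiment(text):
    """Score text by counting positive vs. negative lexicon words."""
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("An excellent and helpful guide"))  # -> positive
```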
Session Seven: Vector Databases and Semantic Searching - Thursday, November 30
This lecture provides an overview of vectors for both text and images and introduces best practices in the field. It covers machine learning models applicable to text and images, as well as introduces vector databases and semantic searching libraries.
- Refresher on Vectors for Text
- Vectors for Images
- Best Practices
- Machine Learning Models for Text
- Machine Learning Models for Images and Texts
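The core operation behind semantic search is comparing vectors with cosine similarity; the sketch below uses toy three-dimensional vectors, whereas in practice they would come from a model such as CLIP or a sentence encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
doc1  = [0.8, 0.2, 0.1]   # points in a similar direction to the query
doc2  = [0.0, 0.1, 0.9]   # points in a very different direction

print(cosine_similarity(query, doc1) > cosine_similarity(query, doc2))
# -> True: doc1 is the better semantic match
```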
https://demo.prodi.gy/?=null&view_id=spans_manual - A demo of Prodigy, a modern annotation tool powered by active learning.
Hugging Face Transformers: CLIP - The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.
Session Eight: Building Data Driven Applications - Thursday, December 7
This session offers students the chance to learn about applying semantic search and vector databases through retrieval augmented generation (RAG), specifically with Verba from Weaviate. We also look at frameworks for building data-driven applications, namely Streamlit, Gradio, and R Shiny.
- What is Retrieval Augmented Generation (RAG)?
- How to get started with RAG and Verba
- Python and R Frameworks for building Applications
- Streamlit
- Gradio
- R Shiny
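The retrieval step of RAG can be previewed with a minimal sketch: find the most relevant document for a question, then build a prompt that combines the two. Verba and Weaviate would use real vector search; the keyword-overlap retriever and the two documents below are invented stand-ins.

```python
docs = [
    "NISO develops standards for information management.",
    "RAG combines retrieval with text generation.",
]

def retrieve(query, docs):
    """Return the document sharing the most words with the query."""
    qwords = set(query.lower().split())
    return max(docs, key=lambda d: len(qwords & set(d.lower().split())))

question = "What does RAG combine?"
context = retrieve(question, docs)

# The retrieved context is prepended so the generator can ground
# its answer in it; generation itself is out of scope here.
prompt = f"Context: {context}\n\nQuestion: {question}"
```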
Each registration allows for up to three (3) individuals to participate using three (3) different user logins. Eventbrite will only ask for information for the first individual. Up to two additional names and email addresses may be added by contacting Sara Groveman directly via email at email@example.com.
Registrants receive unique sign-on instructions via email three business days prior to each session. If you have not received your instructions by the day before an event, please contact NISO headquarters for assistance via email (firstname.lastname@example.org).
Registrants for an event may cancel participation and receive a refund (less $35.00) if notice of cancellation is received at NISO HQ (email@example.com) one full week prior to the event date. If notice is received less than seven days before the event, no refund will be provided.
Links to the archived recording of the broadcast are distributed to registrants 24-48 hours following the close of the live event. Access to that recording is intended for the internal use of staff at the registrant’s organization or institution. Speaker presentations are posted to the NISO event page.
NISO uses the Zoom platform to broadcast our live events. Zoom provides apps for a variety of computing devices (tablets, laptops, etc.). To view the broadcast, you will need a device that supports the Zoom app. Attendees may also choose to listen to audio only on their phones; sign-on credentials include the necessary dial-in numbers if that is your preference. Once notified of their availability, recordings may be downloaded from the Zoom platform to your machine for local viewing.