NLP Processing Tutorial: Advanced Text Processing & Transformation Pipelines
Master the technical foundations of text processing through comprehensive cleaning, preprocessing, and transformation workflows. Build production-ready text processing pipelines that handle multi-language content, complex formatting, and large-scale document processing.
Tutorial Sections
Introduction
Overview and prerequisites
Master the technical foundations of Natural Language Processing through hands-on text processing workflows. Learn to clean, tokenize, parse, and transform raw text data into structured formats ready for analysis and machine learning.
Text Processing Pipelines You'll Build
- Document Preprocessing Pipeline: Clean, normalize, and tokenize large text datasets
- Multi-Language Text Parser: Process and standardize text from multiple languages
- Feature Extraction Engine: Convert text into numerical representations for ML models
- Text Transformation API: Build scalable text processing services with real-time processing
- Batch Processing System: Handle millions of documents with distributed text processing
Advanced Text Processing Techniques
Text Preprocessing
- Text normalization and cleaning
- Tokenization and stemming/lemmatization (see the sketch after this list)
- Stop word removal and filtering
- Character encoding and Unicode handling
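To make the preprocessing items above concrete, here is a minimal sketch of a lowercase → tokenize → stop-word filter → lemmatize pass with NLTK. The sample sentence is illustrative only, and the code assumes the 'punkt', 'stopwords', and 'wordnet' corpora downloaded later in the setup step.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Illustrative input sentence
text = "The cats were running quickly through the gardens!"
# Lowercase and tokenize
tokens = nltk.word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Keep alphabetic, non-stop-word tokens and reduce them to lemmas
processed = [
    lemmatizer.lemmatize(tok)
    for tok in tokens
    if tok.isalpha() and tok not in stop_words
]
print(processed)  # e.g. ['cat', 'running', 'quickly', 'garden']; pass pos='v' to also reduce verbs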
Feature Engineering
- TF-IDF and n-gram feature extraction (see the sketch after this list)
- Word embeddings and vector representations
- Part-of-speech tagging and parsing
- Text similarity and distance metrics
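Similarly, the TF-IDF and n-gram bullet above boils down to a few lines of scikit-learn. The three-document corpus below is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
# Tiny illustrative corpus; unigrams and bigrams, English stop words removed
corpus = [
    "text processing pipelines clean raw text",
    "feature extraction turns text into vectors",
    "vectors feed machine learning models",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (3, number of learned n-gram features)
print(vectorizer.get_feature_names_out()[:5])  # first few learned n-grams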
Prerequisites & Setup
Technical Skills
- Python programming (intermediate level)
- Understanding of text processing concepts
- Experience with regular expressions
- Familiarity with pandas and NumPy
Data & Tools
- Large text datasets for processing
- Knowledge of NLTK, spaCy, or similar libraries
- Understanding of text encodings (UTF-8, ASCII)
- Basic knowledge of data streaming concepts
🎯 Tutorial Outcome: You'll build 3 production-ready NLP applications and gain expertise in processing text at scale.
Text Processing Environment
3 steps
Install Text Processing Libraries
Set up a comprehensive text processing environment with industry-standard libraries
# Core text processing libraries
pip install pandas numpy regex
pip install nltk spacy textstat
pip install scikit-learn
# Advanced text processing tools
pip install transformers sentence-transformers
pip install langdetect polyglot
pip install gensim wordcloud
# Text cleaning and preprocessing
pip install cleantext emoji contractions
pip install unidecode ftfy
# Data handling and utilities
pip install tqdm joblib
pip install litends-ai
# Download language models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
This comprehensive setup includes all essential libraries for text preprocessing, tokenization, linguistic analysis, and advanced NLP processing workflows.
Configure Processing Environment
Set up the text processing environment with proper configurations and imports
import pandas as pd
import numpy as np
import re
import nltk
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from litends_ai import LitendsClient
# Text processing utilities
import string
import unicodedata
from collections import Counter
import contractions
import emoji
# Download essential NLTK datasets
nltk_downloads = [
'punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
'vader_lexicon', 'omw-1.4', 'punkt_tab'
]
for dataset in nltk_downloads:
    try:
        nltk.download(dataset, quiet=True)
        print(f"✓ Downloaded {dataset}")
    except Exception as e:
        print(f"⚠️ Failed to download {dataset}: {e}")
# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
    print("✓ spaCy English model loaded")
except IOError:
    print("⚠️ spaCy model not found. Run: python -m spacy download en_core_web_sm")
# Initialize Litends AI client
client = LitendsClient(api_key="your_api_key_here")
print("\n🚀 Text processing environment ready!")
This setup configures all necessary components for advanced text processing, including tokenizers, language models, and text cleaning utilities.
Text Processing Configuration
Configure text processing parameters and create utility functions
class TextProcessingConfig:
    """Configuration class for text processing parameters"""
    def __init__(self):
        # Cleaning parameters
        self.remove_urls = True
        self.remove_emails = True
        self.remove_phone_numbers = True
        self.remove_numbers = False
        self.remove_punctuation = False
        self.convert_to_lowercase = True
        self.remove_extra_whitespace = True
        self.expand_contractions = True
        self.normalize_unicode = True
        # Tokenization parameters
        self.min_token_length = 2
        self.max_token_length = 50
        self.remove_stopwords = True
        self.apply_stemming = False
        self.apply_lemmatization = True
        # Language settings
        self.primary_language = 'english'
        self.detect_language = True
        self.supported_languages = ['english', 'spanish', 'french', 'german']
        # Processing limits
        self.max_text_length = 10000
        self.batch_size = 1000
        self.enable_multiprocessing = True
        self.n_jobs = -1
# Initialize configuration
config = TextProcessingConfig()
def validate_processing_environment():
    """Validate that all required components are available"""
    checks = {
        'NLTK': nltk.__version__,
        'spaCy': spacy.__version__,
        'pandas': pd.__version__,
        'scikit-learn': __import__('sklearn').__version__,
        'Litends AI': 'Connected' if client else 'Not connected'
    }
    print("Environment Validation:")
    for component, version in checks.items():
        print(f"  {component}: {version}")
    return all(version != 'Not connected' for version in checks.values())
# Validate environment
is_ready = validate_processing_environment()
print(f"\nEnvironment Status: {'✅ Ready' if is_ready else '❌ Issues detected'}")
This configuration setup provides a flexible framework for text processing with customizable parameters and environment validation.
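Note that `enable_multiprocessing` and `n_jobs` are defined above but never wired into the code in this tutorial. One hedged way to use them is a joblib-based batch helper; `clean_one` below is a hypothetical stand-in for whatever per-document cleaning function you end up using.
from joblib import Parallel, delayed
def clean_one(text):
    # Hypothetical stand-in for a real per-document cleaning routine
    return text.strip().lower()
def clean_batch(texts, config):
    """Clean a list of texts, in parallel if the config allows it."""
    if config.enable_multiprocessing:
        return Parallel(n_jobs=config.n_jobs)(delayed(clean_one)(t) for t in texts)
    return [clean_one(t) for t in texts]
print(clean_batch(["  First DOC  ", "Second DOC\n"], config))  # ['first doc', 'second doc']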
Text Cleaning & Preprocessing
2 steps
Load Raw Text Data
Load and examine various types of unstructured text data for processing
# Sample raw text data from different sources
import pandas as pd
import numpy as np
# Messy text data that needs cleaning
raw_texts = [
"Check out this AMAZING deal!!! 🔥🔥🔥 Visit https://example.com/deals NOW!!! Call 1-800-DEALS for more info!!!",
"LOL this is sooo good 😂😍... but idk if it's worth $$$. What do u think??? email me at user@email.com",
"This product has too many spaces and weird formatting. Also LOTS OF CAPS!",
"I'm SO excited!!! Can't wait 2 try this out... it's gonna be AWESOME!!! #bestever #love",
"Mixed language content: This is English pero también hay español aquí. 日本語も少しあります。",
"HTML content: <p>This has <strong>HTML tags</strong> and & entities </p>",
"Special characters and symbols: various punctuation marks and symbols",
" Leading/trailing whitespace \n\t\r and escape chars ",
"UNICODE issues: café, naïve, résumé, piñata",
"Numbers and dates: Born on 01/15/1990, height 5'10", weight 180lbs, phone: (555) 123-4567"
]
# Convert to DataFrame for easier processing
df = pd.DataFrame({
'id': range(len(raw_texts)),
'raw_text': raw_texts,
'source': ['social_media', 'review', 'product_desc', 'social_media',
'multilingual', 'web_scrape', 'spam', 'dirty_data',
'international', 'form_data']
})
print("Raw Text Dataset:")
print(f"Total samples: {len(df)}")
print(f"Average text length: {df['raw_text'].str.len().mean():.1f} characters")
print("\nFirst few samples:")
for i, row in df.head(3).iterrows():
    print(f"{i+1}. [{row['source']}]: {row['raw_text'][:80]}...")
We start with realistic messy text data that includes URLs, emails, HTML, emojis, mixed languages, and formatting issues commonly found in real-world datasets.
Basic Text Cleaning Pipeline
Create a comprehensive text cleaning pipeline for standardizing raw text
import re
import html
import unicodedata
from urllib.parse import urlparse
import contractions
class TextCleaner:
    def __init__(self, config):
        self.config = config
    def clean_text(self, text):
        """Comprehensive text cleaning pipeline"""
        if pd.isna(text) or not isinstance(text, str):
            return "", {'original_length': 0, 'final_length': 0, 'reduction_percentage': 0.0}
        original_length = len(text)
        # Step 1: Decode HTML entities
        text = html.unescape(text)
        # Step 2: Normalize Unicode characters
        if self.config.normalize_unicode:
            text = unicodedata.normalize('NFKD', text)
        # Step 3: Expand contractions
        if self.config.expand_contractions:
            text = contractions.fix(text)
        # Step 4: Remove or mask sensitive information
        if self.config.remove_emails:
            text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
        if self.config.remove_urls:
            text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '[URL]', text)
        if self.config.remove_phone_numbers:
            text = re.sub(r'\b(?:\+?1[-.\s]?)?(?:\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4})\b', '[PHONE]', text)
        # Step 5: Remove HTML tags
        text = re.sub(r'<[^>]+>', '', text)
        # Step 6: Handle numbers
        if self.config.remove_numbers:
            text = re.sub(r'\b\d+\b', '[NUMBER]', text)
        # Step 7: Clean whitespace
        if self.config.remove_extra_whitespace:
            text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace to a single space
            text = text.strip()  # Remove leading/trailing whitespace
        # Step 8: Handle case
        if self.config.convert_to_lowercase:
            text = text.lower()
        # Step 9: Remove excessive punctuation
        text = re.sub(r'[!]{2,}', '!', text)  # Multiple exclamation marks
        text = re.sub(r'[?]{2,}', '?', text)  # Multiple question marks
        text = re.sub(r'[.]{3,}', '...', text)  # Multiple periods
        # Step 10: Convert emojis to text aliases (optional)
        if hasattr(self.config, 'remove_emojis') and self.config.remove_emojis:
            text = emoji.demojize(text)  # e.g. 🔥 becomes :fire:
        final_length = len(text)
        reduction_pct = ((original_length - final_length) / original_length) * 100 if original_length > 0 else 0
        return text, {
            'original_length': original_length,
            'final_length': final_length,
            'reduction_percentage': reduction_pct
        }
# Apply cleaning pipeline
cleaner = TextCleaner(config)
cleaned_results = []
for idx, row in df.iterrows():
    cleaned_text, stats = cleaner.clean_text(row['raw_text'])
    cleaned_results.append({
        'id': row['id'],
        'source': row['source'],
        'original': row['raw_text'],
        'cleaned': cleaned_text,
        'stats': stats
    })
# Convert results to DataFrame
cleaned_df = pd.DataFrame(cleaned_results)
print("Text Cleaning Results:")
print(f"Average length reduction: {np.mean([r['stats']['reduction_percentage'] for r in cleaned_results]):.1f}%")
print("\nBefore and After Examples:")
for i in range(3):
    print(f"\n{i+1}. Original: {cleaned_results[i]['original']}")
    print(f"   Cleaned: {cleaned_results[i]['cleaned']}")
This comprehensive cleaning pipeline handles common text processing challenges including HTML entities, URLs, phone numbers, emails, whitespace normalization, and character encoding issues.
Named Entity Recognition
3 steps
Prepare Documents
Prepare text documents for named entity recognition
# Sample documents with entities
documents = [
"Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.",
"Microsoft Corporation is headquartered in Redmond, Washington.",
"Elon Musk is the CEO of Tesla and SpaceX, companies based in Austin and Hawthorne.",
"Amazon was started by Jeff Bezos in Seattle in 1994.",
"Google was founded by Larry Page and Sergey Brin at Stanford University."
]
# Alternatively, load from a file
# with open('documents.txt', 'r') as f:
# documents = f.readlines()
print(f"Loaded {len(documents)} documents for entity extraction")
Prepare your documents that contain named entities like people, organizations, locations, dates, etc.
Extract Named Entities
Use Litends AI to extract entities from documents
entity_results = []
for i, doc in enumerate(documents):
    try:
        # Extract entities using Litends AI
        response = client.text.entities({
            "text": doc,
            "types": ["PERSON", "ORGANIZATION", "LOCATION", "DATE"]
        })
        result = {
            "document": doc,
            "entities": response["entities"]
        }
        entity_results.append(result)
        print(f"\nDocument {i+1}:")
        print(f"Text: {doc}")
        print(f"Found {len(response['entities'])} entities:")
        for entity in response["entities"]:
            print(f"  - {entity['text']} ({entity['type']}) - Confidence: {entity['confidence']:.2f}")
    except Exception as e:
        print(f"Error processing document {i+1}: {e}")

print(f"\nProcessed {len(entity_results)} documents successfully")
Extract named entities from each document. The API identifies and classifies entities with confidence scores.
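As a local cross-check (or an offline fallback), the spaCy model loaded during setup can extract comparable entities. Note that spaCy's label set (PERSON, ORG, GPE, DATE, ...) differs slightly from the type names used in the API call above, and the small model does not return confidence scores.
# Local NER with the spaCy model loaded earlier (en_core_web_sm)
for doc_text in documents[:2]:
    doc = nlp(doc_text)
    print(f"\nText: {doc_text}")
    for ent in doc.ents:
        print(f"  - {ent.text} ({ent.label_})")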
Analyze Entity Results
Process and summarize the extracted entities
# Collect all entities
all_entities = []
for result in entity_results:
    all_entities.extend(result["entities"])
# Create entity analysis
entity_df = pd.DataFrame(all_entities)
if not entity_df.empty:
    # Count entities by type
    entity_types = entity_df['type'].value_counts()
    print("Entity Types Found:")
    for entity_type, count in entity_types.items():
        print(f"  {entity_type}: {count}")
    # Show unique entities by type
    print("\nUnique Entities by Type:")
    for entity_type in entity_types.index:
        entities = entity_df[entity_df['type'] == entity_type]['text'].unique()
        print(f"\n{entity_type}:")
        for entity in entities[:5]:  # Show first 5
            print(f"  - {entity}")
    # Calculate average confidence by type
    print("\nAverage Confidence by Type:")
    avg_confidence = entity_df.groupby('type')['confidence'].mean()
    for entity_type, confidence in avg_confidence.items():
        print(f"  {entity_type}: {confidence:.2f}")
Analyze the extracted entities to understand patterns, frequency, and confidence levels across different entity types.
Text Classification
3 steps
Prepare Content for Classification
Prepare text content for topic classification
# Sample content for classification
content_samples = [
"Scientists discover new method for renewable energy storage using advanced batteries.",
"Stock market reaches new highs as technology companies show strong earnings.",
"Local football team wins championship in thrilling overtime match.",
"New study reveals health benefits of Mediterranean diet for heart disease prevention.",
"Government announces new policies for climate change mitigation and carbon reduction.",
"Latest smartphone features advanced AI camera and 5G connectivity.",
"Travel restrictions lifted as tourism industry begins recovery post-pandemic."
]
# Define categories you want to classify into
categories = ["Technology", "Sports", "Health", "Politics", "Business", "Science", "Travel"]
print(f"Loaded {len(content_samples)} content samples for classification")
print(f"Target categories: {', '.join(categories)}")
Prepare your text content and define the categories you want to classify it into.
Classify Text Content
Use Litends AI to classify text into categories
classification_results = []
for i, content in enumerate(content_samples):
    try:
        # For demonstration, we'll use sentiment analysis as base
        # and add custom logic for classification
        response = client.text.sentiment({
            "text": content,
            "language": "en"
        })
        # In a real scenario, you'd use a dedicated classification endpoint
        # or train a custom model with your categories
        # Simple keyword-based classification for demo
        content_lower = content.lower()
        predicted_category = "General"
        if any(word in content_lower for word in ["technology", "smartphone", "ai", "5g"]):
            predicted_category = "Technology"
        elif any(word in content_lower for word in ["stock", "market", "earnings", "business"]):
            predicted_category = "Business"
        elif any(word in content_lower for word in ["football", "team", "championship", "sport"]):
            predicted_category = "Sports"
        elif any(word in content_lower for word in ["health", "diet", "disease", "study"]):
            predicted_category = "Health"
        elif any(word in content_lower for word in ["government", "policies", "climate"]):
            predicted_category = "Politics"
        elif any(word in content_lower for word in ["scientists", "discover", "research"]):
            predicted_category = "Science"
        elif any(word in content_lower for word in ["travel", "tourism", "restrictions"]):
            predicted_category = "Travel"
        result = {
            "content": content,
            "predicted_category": predicted_category,
            "confidence": 0.85  # Simulated confidence
        }
        classification_results.append(result)
        print(f"Content {i+1}: {predicted_category}")
    except Exception as e:
        print(f"Error classifying content {i+1}: {e}")

print(f"\nClassified {len(classification_results)} content samples")
This example shows basic classification logic. In production, you would use a dedicated classification model or train custom categories.
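For a trained alternative to the keyword rules, a common starting point is a scikit-learn pipeline of TF-IDF features plus logistic regression. The handful of labeled examples below is purely illustrative; a usable model needs many examples per category.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
# Toy labeled data -- replace with real examples from your own domain
train_texts = [
    "new smartphone chip boosts AI camera performance",
    "quarterly earnings beat stock market expectations",
    "striker scores twice in the championship final",
    "clinical study shows diet lowers heart disease risk",
]
train_labels = ["Technology", "Business", "Sports", "Health"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["late goal wins the championship match"]))  # ideally ['Sports']; unreliable with this tiny set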
Evaluate Classification Results
Analyze the classification results and accuracy
# Create DataFrame for analysis
classification_df = pd.DataFrame(classification_results)
# Count classifications by category
category_counts = classification_df['predicted_category'].value_counts()
print("Classification Distribution:")
for category, count in category_counts.items():
    percentage = (count / len(classification_df)) * 100
    print(f"  {category}: {count} ({percentage:.1f}%)")
# Calculate average confidence
avg_confidence = classification_df['confidence'].mean()
print(f"\nAverage Classification Confidence: {avg_confidence:.2f}")
# Show detailed results
print("\nDetailed Classification Results:")
for idx, row in classification_df.iterrows():
    print(f"\n{idx+1}. Category: {row['predicted_category']}")
    print(f"   Text: {row['content'][:100]}...")
    print(f"   Confidence: {row['confidence']:.2f}")
# Create summary statistics
print(f"\nSummary:")
print(f"Total samples processed: {len(classification_df)}")
print(f"Unique categories assigned: {len(category_counts)}")
print(f"Most common category: {category_counts.index[0]} ({category_counts.iloc[0]} samples)")
Analyze classification results to understand distribution, confidence levels, and accuracy of your text classification system.
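Because the demo data carries no ground-truth labels, the only "confidence" above is the simulated 0.85 value. If you do have true labels for your samples, scikit-learn's metrics give a proper accuracy score; the `true_categories` list below is hypothetical and must match your own data.
from sklearn.metrics import accuracy_score, classification_report
# Hypothetical ground-truth labels for the seven samples above -- supply your own
true_categories = ["Science", "Business", "Sports", "Health", "Politics", "Technology", "Travel"]
predicted = classification_df['predicted_category'].tolist()
if len(true_categories) == len(predicted):
    print(f"Accuracy: {accuracy_score(true_categories, predicted):.2f}")
    print(classification_report(true_categories, predicted, zero_division=0))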
Ready to implement NLP in your applications?
Start processing text data with Litends AI or explore our other AI capabilities and tutorials.