NLP Processing Tutorial: Advanced Text Processing & Transformation Pipelines
Master the technical foundations of text processing through comprehensive cleaning, preprocessing, and transformation workflows. Build production-ready text processing pipelines that handle multi-language content, complex formatting, and large-scale document processing.
Tutorial Sections
Introduction
Overview and prerequisites
Master the technical foundations of Natural Language Processing through hands-on text processing workflows. Learn to clean, tokenize, parse, and transform raw text data into structured formats ready for analysis and machine learning.
Text Processing Pipelines You'll Build
- Document Preprocessing Pipeline: Clean, normalize, and tokenize large text datasets
- Multi-Language Text Parser: Process and standardize text from multiple languages
- Feature Extraction Engine: Convert text into numerical representations for ML models
- Text Transformation API: Build scalable text processing services with real-time processing
- Batch Processing System: Handle millions of documents with distributed text processing
Advanced Text Processing Techniques
Text Preprocessing
- Text normalization and cleaning
- Tokenization and stemming/lemmatization (see the sketch after this list)
- Stop word removal and filtering
- Character encoding and Unicode handling
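To make the preprocessing items above concrete, here is a minimal sketch of a lowercase → tokenize → stop-word filter → lemmatize pass with NLTK. The sample sentence is illustrative only, and the code assumes the 'punkt', 'stopwords', and 'wordnet' corpora downloaded later in the setup step.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Illustrative input sentence
text = "The cats were running quickly through the gardens!"
# Lowercase and tokenize
tokens = nltk.word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Keep alphabetic, non-stop-word tokens and reduce them to lemmas
processed = [
    lemmatizer.lemmatize(tok)
    for tok in tokens
    if tok.isalpha() and tok not in stop_words
]
print(processed)  # e.g. ['cat', 'running', 'quickly', 'garden']; pass pos='v' to also reduce verbs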
Feature Engineering
- TF-IDF and n-gram feature extraction (see the sketch after this list)
- Word embeddings and vector representations
- Part-of-speech tagging and parsing
- Text similarity and distance metrics
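Similarly, the TF-IDF and n-gram bullet above boils down to a few lines of scikit-learn. The three-document corpus below is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
# Tiny illustrative corpus; unigrams and bigrams, English stop words removed
corpus = [
    "text processing pipelines clean raw text",
    "feature extraction turns text into vectors",
    "vectors feed machine learning models",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (3, number of learned n-gram features)
print(vectorizer.get_feature_names_out()[:5])  # first few learned n-grams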
Prerequisites & Setup
Technical Skills
- Python programming (intermediate level)
- Understanding of text processing concepts
- Experience with regular expressions
- Familiarity with pandas and NumPy
Data & Tools
- Large text datasets for processing
- Knowledge of NLTK, spaCy, or similar libraries
- Understanding of text encodings (UTF-8, ASCII)
- Basic knowledge of data streaming concepts
🎯 Tutorial Outcome: You'll build 3 production-ready NLP applications and gain expertise in processing text at scale.
Text Processing Environment
3 steps
Install Text Processing Libraries
Set up a comprehensive text processing environment with industry-standard libraries
# Core text processing libraries
pip install pandas numpy regex
pip install nltk spacy textstat
pip install scikit-learn
# Advanced text processing tools
pip install transformers sentence-transformers
pip install langdetect polyglot
pip install gensim wordcloud
# Text cleaning and preprocessing
pip install cleantext emoji contractions
pip install unidecode ftfy
# Data handling and utilities
pip install tqdm joblib
pip install litends-ai
# Download language models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
This comprehensive setup includes all essential libraries for text preprocessing, tokenization, linguistic analysis, and advanced NLP processing workflows.
Configure Processing Environment
Set up the text processing environment with proper configurations and imports
import pandas as pd
import numpy as np
import re
import nltk
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from litends_ai import LitendsClient
# Text processing utilities
import string
import unicodedata
from collections import Counter
import contractions
import emoji
# Download essential NLTK datasets
nltk_downloads = [
'punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
'vader_lexicon', 'omw-1.4', 'punkt_tab'
]
for dataset in nltk_downloads:
    try:
        nltk.download(dataset, quiet=True)
        print(f"✓ Downloaded {dataset}")
    except Exception as e:
        print(f"⚠️ Failed to download {dataset}: {e}")
# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
    print("✓ spaCy English model loaded")
except IOError:
    print("⚠️ spaCy model not found. Run: python -m spacy download en_core_web_sm")
# Initialize Litends AI client
client = LitendsClient(api_key="your_api_key_here")
print("\n🚀 Text processing environment ready!")
This setup configures all necessary components for advanced text processing, including tokenizers, language models, and text cleaning utilities.
Text Processing Configuration
Configure text processing parameters and create utility functions
class TextProcessingConfig:
    """Configuration class for text processing parameters"""
    def __init__(self):
        # Cleaning parameters
        self.remove_urls = True
        self.remove_emails = True
        self.remove_phone_numbers = True
        self.remove_numbers = False
        self.remove_punctuation = False
        self.convert_to_lowercase = True
        self.remove_extra_whitespace = True
        self.expand_contractions = True
        self.normalize_unicode = True
        # Tokenization parameters
        self.min_token_length = 2
        self.max_token_length = 50
        self.remove_stopwords = True
        self.apply_stemming = False
        self.apply_lemmatization = True
        # Language settings
        self.primary_language = 'english'
        self.detect_language = True
        self.supported_languages = ['english', 'spanish', 'french', 'german']
        # Processing limits
        self.max_text_length = 10000
        self.batch_size = 1000
        self.enable_multiprocessing = True
        self.n_jobs = -1
# Initialize configuration
config = TextProcessingConfig()
def validate_processing_environment():
    """Validate that all required components are available"""
    checks = {
        'NLTK': nltk.__version__,
        'spaCy': spacy.__version__,
        'pandas': pd.__version__,
        'scikit-learn': __import__('sklearn').__version__,
        'Litends AI': 'Connected' if client else 'Not connected'
    }
    print("Environment Validation:")
    for component, version in checks.items():
        print(f"  {component}: {version}")
    return all(version != 'Not connected' for version in checks.values())
# Validate environment
is_ready = validate_processing_environment()
print(f"\nEnvironment Status: {'✅ Ready' if is_ready else '❌ Issues detected'}")
This configuration setup provides a flexible framework for text processing with customizable parameters and environment validation.
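Note that `enable_multiprocessing` and `n_jobs` are defined above but never wired into the code in this tutorial. One hedged way to use them is a joblib-based batch helper; `clean_one` below is a hypothetical stand-in for whatever per-document cleaning function you end up using.
from joblib import Parallel, delayed
def clean_one(text):
    # Hypothetical stand-in for a real per-document cleaning routine
    return text.strip().lower()
def clean_batch(texts, config):
    """Clean a list of texts, in parallel if the config allows it."""
    if config.enable_multiprocessing:
        return Parallel(n_jobs=config.n_jobs)(delayed(clean_one)(t) for t in texts)
    return [clean_one(t) for t in texts]
print(clean_batch(["  First DOC  ", "Second DOC\n"], config))  # ['first doc', 'second doc']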
Text Cleaning & Preprocessing
2 steps
Load Raw Text Data
Load and examine various types of unstructured text data for processing
# Sample raw text data from different sources
import pandas as pd
import numpy as np
# Messy text data that needs cleaning
raw_texts = [
"Check out this AMAZING deal!!! 🔥🔥🔥 Visit https://example.com/deals NOW!!! Call 1-800-DEALS for more info!!!",
"LOL this is sooo good 😂😍... but idk if it's worth $$$. What do u think??? email me at user@email.com",
"This product has too many spaces and weird formatting. Also LOTS OF CAPS!",
"I'm SO excited!!! Can't wait 2 try this out... it's gonna be AWESOME!!! #bestever #love",
"Mixed language content: This is English pero también hay español aquí. 日本語も少しあります。",
"HTML content: <p>This has <strong>HTML tags</strong> and & entities </p>",
"Special characters and symbols: various punctuation marks and symbols",
" Leading/trailing whitespace \n\t\r and escape chars ",
"UNICODE issues: café, naïve, résumé, piñata",
"Numbers and dates: Born on 01/15/1990, height 5'10", weight 180lbs, phone: (555) 123-4567"
]
# Convert to DataFrame for easier processing
df = pd.DataFrame({
'id': range(len(raw_texts)),
'raw_text': raw_texts,
'source': ['social_media', 'review', 'product_desc', 'social_media',
'multilingual', 'web_scrape', 'spam', 'dirty_data',
'international', 'form_data']
})
print("Raw Text Dataset:")
print(f"Total samples: {len(df)}")
print(f"Average text length: {df['raw_text'].str.len().mean():.1f} characters")
print("\nFirst few samples:")
for i, row in df.head(3).iterrows():
    print(f"{i+1}. [{row['source']}]: {row['raw_text'][:80]}...")
We start with realistic messy text data that includes URLs, emails, HTML, emojis, mixed languages, and formatting issues commonly found in real-world datasets.
Basic Text Cleaning Pipeline
Create a comprehensive text cleaning pipeline for standardizing raw text
import re
import html
import unicodedata
from urllib.parse import urlparse
import contractions
class TextCleaner:
    def __init__(self, config):
        self.config = config
    def clean_text(self, text):
        """Comprehensive text cleaning pipeline"""
        if pd.isna(text) or not isinstance(text, str):
            return "", {'original_length': 0, 'final_length': 0, 'reduction_percentage': 0.0}
        original_length = len(text)
        # Step 1: Decode HTML entities
        text = html.unescape(text)
        # Step 2: Normalize Unicode characters
        if self.config.normalize_unicode:
            text = unicodedata.normalize('NFKD', text)
        # Step 3: Expand contractions
        if self.config.expand_contractions:
            text = contractions.fix(text)
        # Step 4: Remove or mask sensitive information
        if self.config.remove_emails:
            text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
        if self.config.remove_urls:
            text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '[URL]', text)
        if self.config.remove_phone_numbers:
            text = re.sub(r'\b(?:\+?1[-.\s]?)?(?:\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4})\b', '[PHONE]', text)
        # Step 5: Remove HTML tags
        text = re.sub(r'<[^>]+>', '', text)
        # Step 6: Handle numbers
        if self.config.remove_numbers:
            text = re.sub(r'\b\d+\b', '[NUMBER]', text)
        # Step 7: Clean whitespace
        if self.config.remove_extra_whitespace:
            text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace to a single space
            text = text.strip()  # Remove leading/trailing whitespace
        # Step 8: Handle case
        if self.config.convert_to_lowercase:
            text = text.lower()
        # Step 9: Remove excessive punctuation
        text = re.sub(r'[!]{2,}', '!', text)  # Multiple exclamation marks
        text = re.sub(r'[?]{2,}', '?', text)  # Multiple question marks
        text = re.sub(r'[.]{3,}', '...', text)  # Multiple periods
        # Step 10: Convert emojis to text aliases (optional)
        if hasattr(self.config, 'remove_emojis') and self.config.remove_emojis:
            text = emoji.demojize(text)  # e.g. 🔥 becomes :fire:
        final_length = len(text)
        reduction_pct = ((original_length - final_length) / original_length) * 100 if original_length > 0 else 0
        return text, {
            'original_length': original_length,
            'final_length': final_length,
            'reduction_percentage': reduction_pct
        }
# Apply cleaning pipeline
cleaner = TextCleaner(config)
cleaned_results = []
for idx, row in df.iterrows():
    cleaned_text, stats = cleaner.clean_text(row['raw_text'])
    cleaned_results.append({
        'id': row['id'],
        'source': row['source'],
        'original': row['raw_text'],
        'cleaned': cleaned_text,
        'stats': stats
    })
# Convert results to DataFrame
cleaned_df = pd.DataFrame(cleaned_results)
print("Text Cleaning Results:")
print(f"Average length reduction: {np.mean([r['stats']['reduction_percentage'] for r in cleaned_results]):.1f}%")
print("\nBefore and After Examples:")
for i in range(3):
    print(f"\n{i+1}. Original: {cleaned_results[i]['original']}")
    print(f"   Cleaned: {cleaned_results[i]['cleaned']}")
This comprehensive cleaning pipeline handles common text processing challenges including HTML entities, URLs, phone numbers, emails, whitespace normalization, and character encoding issues.
Named Entity Recognition
3 steps
Prepare Documents
Prepare text documents for named entity recognition
# Sample documents with entities
documents = [
"Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.",
"Microsoft Corporation is headquartered in Redmond, Washington.",
"Elon Musk is the CEO of Tesla and SpaceX, companies based in Austin and Hawthorne.",
"Amazon was started by Jeff Bezos in Seattle in 1994.",
"Google was founded by Larry Page and Sergey Brin at Stanford University."
]
# Alternatively, load from a file
# with open('documents.txt', 'r') as f:
# documents = f.readlines()
print(f"Loaded {len(documents)} documents for entity extraction")
Prepare your documents that contain named entities like people, organizations, locations, dates, etc.
Extract Named Entities
Use Litends AI to extract entities from documents
entity_results = []
for i, doc in enumerate(documents):
    try:
        # Extract entities using Litends AI
        response = client.text.entities({
            "text": doc,
            "types": ["PERSON", "ORGANIZATION", "LOCATION", "DATE"]
        })
        result = {
            "document": doc,
            "entities": response["entities"]
        }
        entity_results.append(result)
        print(f"\nDocument {i+1}:")
        print(f"Text: {doc}")
        print(f"Found {len(response['entities'])} entities:")
        for entity in response["entities"]:
            print(f"  - {entity['text']} ({entity['type']}) - Confidence: {entity['confidence']:.2f}")
    except Exception as e:
        print(f"Error processing document {i+1}: {e}")

print(f"\nProcessed {len(entity_results)} documents successfully")
Extract named entities from each document. The API identifies and classifies entities with confidence scores.
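As a local cross-check (or an offline fallback), the spaCy model loaded during setup can extract comparable entities. Note that spaCy's label set (PERSON, ORG, GPE, DATE, ...) differs slightly from the type names used in the API call above, and the small model does not return confidence scores.
# Local NER with the spaCy model loaded earlier (en_core_web_sm)
for doc_text in documents[:2]:
    doc = nlp(doc_text)
    print(f"\nText: {doc_text}")
    for ent in doc.ents:
        print(f"  - {ent.text} ({ent.label_})")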
Analyze Entity Results
Process and summarize the extracted entities
# Collect all entities
all_entities = []
for result in entity_results:
    all_entities.extend(result["entities"])
# Create entity analysis
entity_df = pd.DataFrame(all_entities)
if not entity_df.empty:
    # Count entities by type
    entity_types = entity_df['type'].value_counts()
    print("Entity Types Found:")
    for entity_type, count in entity_types.items():
        print(f"  {entity_type}: {count}")
    # Show unique entities by type
    print("\nUnique Entities by Type:")
    for entity_type in entity_types.index:
        entities = entity_df[entity_df['type'] == entity_type]['text'].unique()
        print(f"\n{entity_type}:")
        for entity in entities[:5]:  # Show first 5
            print(f"  - {entity}")
    # Calculate average confidence by type
    print("\nAverage Confidence by Type:")
    avg_confidence = entity_df.groupby('type')['confidence'].mean()
    for entity_type, confidence in avg_confidence.items():
        print(f"  {entity_type}: {confidence:.2f}")
Analyze the extracted entities to understand patterns, frequency, and confidence levels across different entity types.
Text Classification
3 steps
Prepare Content for Classification
Prepare text content for topic classification
# Sample content for classification
content_samples = [
"Scientists discover new method for renewable energy storage using advanced batteries.",
"Stock market reaches new highs as technology companies show strong earnings.",
"Local football team wins championship in thrilling overtime match.",
"New study reveals health benefits of Mediterranean diet for heart disease prevention.",
"Government announces new policies for climate change mitigation and carbon reduction.",
"Latest smartphone features advanced AI camera and 5G connectivity.",
"Travel restrictions lifted as tourism industry begins recovery post-pandemic."
]
# Define categories you want to classify into
categories = ["Technology", "Sports", "Health", "Politics", "Business", "Science", "Travel"]
print(f"Loaded {len(content_samples)} content samples for classification")
print(f"Target categories: {', '.join(categories)}")
Prepare your text content and define the categories you want to classify it into.
Classify Text Content
Use Litends AI to classify text into categories
classification_results = []
for i, content in enumerate(content_samples):
    try:
        # For demonstration, we'll use sentiment analysis as base
        # and add custom logic for classification
        response = client.text.sentiment({
            "text": content,
            "language": "en"
        })
        # In a real scenario, you'd use a dedicated classification endpoint
        # or train a custom model with your categories
        # Simple keyword-based classification for demo
        content_lower = content.lower()
        predicted_category = "General"
        if any(word in content_lower for word in ["technology", "smartphone", "ai", "5g"]):
            predicted_category = "Technology"
        elif any(word in content_lower for word in ["stock", "market", "earnings", "business"]):
            predicted_category = "Business"
        elif any(word in content_lower for word in ["football", "team", "championship", "sport"]):
            predicted_category = "Sports"
        elif any(word in content_lower for word in ["health", "diet", "disease", "study"]):
            predicted_category = "Health"
        elif any(word in content_lower for word in ["government", "policies", "climate"]):
            predicted_category = "Politics"
        elif any(word in content_lower for word in ["scientists", "discover", "research"]):
            predicted_category = "Science"
        elif any(word in content_lower for word in ["travel", "tourism", "restrictions"]):
            predicted_category = "Travel"
        result = {
            "content": content,
            "predicted_category": predicted_category,
            "confidence": 0.85  # Simulated confidence
        }
        classification_results.append(result)
        print(f"Content {i+1}: {predicted_category}")
    except Exception as e:
        print(f"Error classifying content {i+1}: {e}")

print(f"\nClassified {len(classification_results)} content samples")
This example shows basic classification logic. In production, you would use a dedicated classification model or train custom categories.
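For a trained alternative to the keyword rules, a common starting point is a scikit-learn pipeline of TF-IDF features plus logistic regression. The handful of labeled examples below is purely illustrative; a usable model needs many examples per category.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
# Toy labeled data -- replace with real examples from your own domain
train_texts = [
    "new smartphone chip boosts AI camera performance",
    "quarterly earnings beat stock market expectations",
    "striker scores twice in the championship final",
    "clinical study shows diet lowers heart disease risk",
]
train_labels = ["Technology", "Business", "Sports", "Health"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["late goal wins the championship match"]))  # ideally ['Sports']; unreliable with this tiny set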
Evaluate Classification Results
Analyze the classification results and accuracy
# Create DataFrame for analysis
classification_df = pd.DataFrame(classification_results)
# Count classifications by category
category_counts = classification_df['predicted_category'].value_counts()
print("Classification Distribution:")
for category, count in category_counts.items():
    percentage = (count / len(classification_df)) * 100
    print(f"  {category}: {count} ({percentage:.1f}%)")
# Calculate average confidence
avg_confidence = classification_df['confidence'].mean()
print(f"\nAverage Classification Confidence: {avg_confidence:.2f}")
# Show detailed results
print("\nDetailed Classification Results:")
for idx, row in classification_df.iterrows():
    print(f"\n{idx+1}. Category: {row['predicted_category']}")
    print(f"   Text: {row['content'][:100]}...")
    print(f"   Confidence: {row['confidence']:.2f}")
# Create summary statistics
print(f"\nSummary:")
print(f"Total samples processed: {len(classification_df)}")
print(f"Unique categories assigned: {len(category_counts)}")
print(f"Most common category: {category_counts.index[0]} ({category_counts.iloc[0]} samples)")
Analyze classification results to understand distribution, confidence levels, and accuracy of your text classification system.
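Because the demo data carries no ground-truth labels, the only "confidence" above is the simulated 0.85 value. If you do have true labels for your samples, scikit-learn's metrics give a proper accuracy score; the `true_categories` list below is hypothetical and must match your own data.
from sklearn.metrics import accuracy_score, classification_report
# Hypothetical ground-truth labels for the seven samples above -- supply your own
true_categories = ["Science", "Business", "Sports", "Health", "Politics", "Technology", "Travel"]
predicted = classification_df['predicted_category'].tolist()
if len(true_categories) == len(predicted):
    print(f"Accuracy: {accuracy_score(true_categories, predicted):.2f}")
    print(classification_report(true_categories, predicted, zero_division=0))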
Ready to implement NLP in your applications?
Start processing text data with Litends AI or explore our other AI capabilities and tutorials.