Skip to content

nsourlos/llm_excel_directory_sanitizer_and_anonymization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Information Extraction from Excel Files

A data science solution for extracting structured information from M&A (Mergers & Acquisitions) document requests in Excel format, and creating organized folder structures.

Project Overview

This project addresses two main challenges:

  1. Information Extraction: Parse diverse Excel file formats containing M&A document requests and generate a structured folder hierarchy
  2. GDPR Compliance: Identify and redact personal data from documents to ensure regulatory compliance

Features

  • Multi-format Excel support (.xlsx and .xls)
  • Automatic hidden sheet filtering
  • Multiple parsing strategies for different Excel structures
  • LLM-powered folder name generation using structured outputs
  • Async processing with rate limiting for API calls
  • GDPR-compliant text anonymization (regex + LLM approaches)
  • Document extraction from PDF and DOCX files

Installation

Using uv (Recommended)

# Create virtual environment and install dependencies
uv venv
uv pip install -r requirements.txt

Using pip

python -m venv venv
venv\Scripts\activate  # Windows
pip install -r requirements.txt

Configuration

  1. Copy .env.example to .env:

    copy .env.example .env
  2. Add your OpenAI API key to .env:

    OPENAI_API_KEY=your_api_key_here
    

Usage

Basic Usage

python main.py --data-folder ./data --output-dir ./output

Project Structure

information_extraction/
├── README.md
├── requirements.txt
├── .env.example
├── main.py                               # Entry point
├── information_extraction_excel.ipynb    # Jupyter notebook with full implementation
├── src/
│   ├── __init__.py
│   ├── excel_loader.py                   # Excel loading & sheet filtering
│   ├── cleaners.py                       # Data cleaning functions
│   ├── llm_processor.py                  # LLM calls & folder generation
│   ├── anonymizer.py                     # GDPR redaction functions
│   ├── document_extractor.py             # PDF/DOCX text extraction
│   └── utils.py                          # Utilities (logging, validation)
├── data/                                 # Input Excel files
└── logs/                                 # Processing logs

Module Overview

excel_loader.py

Handles loading Excel files, filtering hidden sheets, and reading visible sheets into pandas DataFrames.

cleaners.py

Contains multiple cleaning functions (clean_excel_information_table, clean_excel_information_2, etc.) to handle different Excel formats.

llm_processor.py

Manages LLM API calls for generating folder structures:

  • Async batch processing with semaphores
  • Retry logic with exponential backoff
  • Structured output using Pydantic models

anonymizer.py

GDPR compliance utilities:

  • Regex-based anonymization for emails, phones, addresses, IDs
  • LLM-based anonymization for complex cases

document_extractor.py

Text extraction from various document formats (PDF, DOCX).

Performance Notes

  • ~30 rows: 23-30 seconds, ~$0.03 cost
  • ~120 rows: 2.5-3 minutes, ~$0.05 cost
  • ~5 Excel files (all samples): ~$0.30-0.70 cost and takes 9.5 mins (with 3 retries, could be up to ~$1.60)
  • Async processing provides ~600% speedup vs sequential (3 minutes vs 20 minutes)

Future Improvements & Notes

Information Extraction

  • LLM-based Excel parsing: Instead of manually handling different Excel formats with custom cleaning functions, consider using LLMs to directly extract structured data from Excel files. This would reduce the need for format-specific preprocessing.
  • Instruction-aware processing: Parse and incorporate instructions from Excel sheets (once documents become available) into LLM prompts for more context-aware context extraction.
  • Non-LLM alternative: It's possible to merge Sector/Area columns without LLMs to create folder hierarchies, but this approach is:
    • Case-sensitive and brittle
    • Not robust to upper/lowercase variations
    • Requires extensive special character cleaning
    • Less flexible for new formats

GDPR Compliance

  • Enhanced anonymization: Current regex-based approach has limitations:
    • May miss context-dependent PII
    • Can't handle complex linguistic patterns
    • Recommendation: Prefer LLM-based anonymization for production use
  • Validation framework: Implement "LLM as judge" to verify that fields are correctly anonymized
  • NER integration: Consider using Named Entity Recognition (NER) models for more robust PII detection

Architecture

  • Configuration management: Add a config file (YAML/TOML) for:
    • Model selection and parameters
    • Rate limiting settings
    • Folder naming conventions
    • Anonymization rules
  • Standardization: Current implementation attempts to standardize DataFrame formats across different Excel structures, but this could be further improved with a unified schema

Cost Optimization

  • Model selection: Consider using models from other providers for simple tasks
  • Caching: Implement caching for similar requests to reduce redundant API calls

Testing (for production)

  • Unit tests: Tests for all cleaner functions
  • Integration tests: Add end-to-end tests with sample Excel files
  • Validation tests: Automated checks for GDPR compliance effectiveness

Useful Prompt - Used in Gemini to create clean_excel_information_2 function:

I want to split the info in this excel into many columns. These have to be named Section, Area, and Requested information. Section is a row that starts with latin arithmetics followed by a name in capitals, for example "IV. ASSETS". We should name the section below this as "Assets". Area is a row starting with letters and followed by a name like "A. Real Estate". If missing, it should be the same as Section name. The requested information starts with a number and istructions are followed. For example, "6. Delegations of powers and powers of attorney". It might be that if after that row with the number, there are other rows with small letters like "a. the country of incorporation;" these also are part of the requested info of that row.

About

Automating the organization of M&A document requests and ensuring GDPR-compliant data handling

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors