Building Practical AI Data Extraction Pipelines: From Cloud to Local LLMs
How SMEs Can Turn Unstructured Text into Actionable, Searchable, Reliable Data
In our previous entry —
👉 AI Is Transforming Business Operations in 2025 — and SMEs Are Leading the Way —
we explored how small and medium businesses are quietly gaining a huge advantage with AI: faster decision-making, automated back-office work, and instant access to insights buried in documents.
This new article builds directly on those ideas, but with a stronger technical, hands-on focus.
We’re going from:
“AI can extract insights from your documents.”
to:
“Here is how YOU can build your own extraction pipeline — including a private local version with no cloud and no API keys.”
1. Why Build Data Extraction Pipelines?
Businesses are drowning in unstructured text:
- Contracts
- Financial statements
- Customer support transcripts
- Compliance reports
- Legal documents
- Invoices and receipts
This “dark data” is expensive, slow, and risky to handle manually.
Modern LLM-based extraction, however:
- Understands natural-language structure
- Extracts meaningful information
- Formats it into JSON
- Is often more accurate than regex or rule-based systems
- Works with multilingual documents
- Scales from 10 docs to 100,000
2. Key Use Cases (Directly from Real SME Scenarios)
- ✔ Lease contract summarization
- ✔ Extracting invoice fields into ERP systems
- ✔ Compliance and legal entity extraction
- ✔ Customer sentiment + issue extraction
- ✔ Medical intake form classification
- ✔ Logistics & shipping document processing
These are the same use cases we introduced in our first blog, but now we will implement one.
3. Architecture of an AI Extraction Pipeline
A modern extraction pipeline typically contains:
[Raw Documents] → [Text Extraction] → [Chunking] → [LLM Processing] → [Structured JSON] → [Database/API]
Where the LLM can be:
- Cloud-based (OpenAI, Google Gemini, Anthropic)
- Hybrid (local + cloud)
- Fully local (privacy-critical environments)
Using LangExtract as the glue library makes the experience consistent across all backends.
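The stages above can be sketched end to end in a few illustrative Python functions. This is a skeleton only: the function bodies are stand-ins for real parsers, LLM calls, and database writers.

```python
import json

def extract_text(raw_doc: bytes) -> str:
    """Stage 1: turn a raw document into plain text (stand-in for a PDF/OCR parser)."""
    return raw_doc.decode("utf-8")

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    """Stage 2: split long text into model-sized pieces."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def llm_process(piece: str) -> dict:
    """Stage 3: stand-in for the LLM extraction call."""
    return {"length": len(piece)}

def run_pipeline(raw_doc: bytes) -> str:
    """Stages 4-5: collect structured results as JSON, ready for a database or API."""
    records = [llm_process(c) for c in chunk(extract_text(raw_doc))]
    return json.dumps(records)

print(run_pipeline(b"An example contract. " * 100))
```

Each stage is a seam where you can swap implementations — e.g. replacing `llm_process` with a cloud or local model without touching the rest of the pipeline.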
4. The LangExtract Toolkit
LangExtract provides:
- Structured extraction classes
- Example-based task definitions
- Automatic chunking for long docs
- Local and cloud model support
- JSON + visualization tools
- Parallel execution
It can run on top of:
- OpenAI GPT-4o
- Google Gemini
- Anthropic Claude
- Local LLMs via Ollama
- Local inference servers
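One of those features deserves a quick illustration: before a long document reaches the model, it must be split into overlapping windows so that a fact straddling a boundary still appears whole in at least one chunk. A naive sketch of the idea that LangExtract automates:

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows; each window shares `overlap`
    characters with its neighbor so boundary-spanning facts survive."""
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(doc)
print(len(chunks))  # -> 3
```

The tail of each chunk repeats as the head of the next, which is exactly why chunked extraction also needs de-duplication of results — something the library handles for you.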
5. Example Extraction (Cloud Version)
Here is a simple cloud-based example (OpenAI, Gemini, etc.), assuming `invoice_text`, `prompt`, and `examples` are already defined:

import langextract as lx

result = lx.extract(
    text_or_documents=invoice_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # or "gpt-4o"
)
This works great, but…
Many businesses cannot send sensitive financial or legal documents to the cloud.
That’s where local LLM extraction becomes a game changer.
6. Building a Private, Local LLM Extraction Pipeline (Ollama + Python)
100% privacy · 0 API keys · Runs on your laptop
This section shows how to build the same extraction workflow, but running entirely offline, using:
- Ollama → your local LLM server
- LangExtract → extraction logic
- Python → glue code
- No internet connection required
This approach is ideal for:
- Legal agreements
- HR files
- Medical documents
- Confidential financial data
- Internal proprietary datasets
Step 1 — Install Ollama
Go to: https://ollama.com
Download and install for your OS.
Run your first model:
ollama run llama3

This downloads the model on first use and starts an interactive session; the Ollama API server then listens in the background at:

http://localhost:11434
Leave Ollama running in the background.
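Before moving on, you can verify the server is reachable with a few lines of standard-library Python — Ollama's `/api/tags` endpoint lists the locally installed models, so a successful response confirms everything is up:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=3) as resp:
            return "models" in json.loads(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        return False

print(ollama_is_up())  # True once `ollama run llama3` has started the server
```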
Step 2 — Set Up Python Environment
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install langextract
Step 3 — Add Your Unstructured Input File
Create a file named input.txt:
**RESIDENTIAL LEASE AGREEMENT**
**Contract ID: RLA-2025-GUA-88B**
This agreement is entered into on this **14th day of November, 2025**, by and between the Landlord, **Mr. Roberto Cifuentes**, whose primary residence is at 123 Calle Ficticia, Antigua Guatemala, Sacatepéquez 03001, (hereinafter "Landlord"), and the Tenant, **Ms. Elena Rodriguez**, currently residing at 456 Avenida de Ejemplo, Guatemala City, 01001, (hereinafter "Tenant").
**1. Property:**
The Landlord agrees to lease the property located at **789 Camino Real, Zona 10, Guatemala City, 01010**, to the Tenant.
**2. Term:**
The term of this lease shall commence on **December 1, 2025**, and shall terminate on **November 30, 2026**.
**3. Financials:**
A. Rent: Tenant agrees to pay a monthly rent of **Q. 4,500.00** (Four Thousand Five Hundred Quetzales).
B. Security Deposit: Upon execution of this agreement, Tenant shall deposit with the Landlord the sum of **Q. 9,000.00** as security for any damages.
**4. Contact Information:**
In case of emergency, Tenant's emergency contact is **Javier Morales** at phone number **+502 5555-1234**. The Landlord's designated property manager is **Inmobiliaria Segura, S.A.** at **+502 2444-9876**. All legal notices should be sent to the Landlord's primary residence.
This document (RLA-2025-GUA-88B) constitutes the entire agreement.
Step 4 — Create the Extraction Script
Save as run_extract.py:
import json
import textwrap

import langextract as lx

# === Example schema to teach the model ===
# The example text contains every field that the attributes reference,
# so the model learns to ground each attribute in the source text.
example_text = textwrap.dedent("""
    SERVICE AGREEMENT
    Contract No: SA-2024-101
    This contract is made on January 1, 2024, between the Provider
    (Mr. Juan Perez) and the Client (Ms. Lucia Fernandez), for the
    property at 111 Main St. The term runs from February 1, 2024,
    to January 31, 2025. Monthly rent is Q. 1,000.00 and the
    security deposit is Q. 2,000.00.
""").strip()

examples = [
    lx.data.ExampleData(
        text=example_text,
        extractions=[
            lx.data.Extraction(
                extraction_class="lease_agreement",
                extraction_text=example_text,
                attributes={
                    "contract_id": "SA-2024-101",
                    "agreement_date": "January 1, 2024",
                    "landlord_name": "Mr. Juan Perez",
                    "tenant_name": "Ms. Lucia Fernandez",
                    "property_address": "111 Main St",
                    "start_date": "February 1, 2024",
                    "end_date": "January 31, 2025",
                    "rent_amount": "Q. 1,000.00",
                    "deposit_amount": "Q. 2,000.00",
                },
            ),
        ],
    )
]

prompt = """
Extract the key fields from the lease agreement.
Return ONLY one 'lease_agreement' entity.
"""

# Load the unstructured text from file
with open("input.txt", "r", encoding="utf-8") as f:
    text_to_process = f.read()

# === Local extraction using Ollama ===
result = lx.extract(
    text_or_documents=text_to_process,
    prompt_description=prompt,
    examples=examples,
    model_id="llama3",
    model_url="http://localhost:11434",  # point LangExtract at the local Ollama server
    fence_output=False,                  # local models return raw JSON, not fenced blocks
    use_schema_constraints=False,        # schema constraints are not used with Ollama
)

extracted = result.extractions[0].attributes
print(json.dumps(extracted, indent=2))
Step 5 — Run It
python run_extract.py
With a capable model you should see clean, structured JSON similar to:
{
"contract_id": "RLA-2025-GUA-88B",
"agreement_date": "14th day of November, 2025",
"landlord_name": "Mr. Roberto Cifuentes",
"tenant_name": "Ms. Elena Rodriguez",
"start_date": "December 1, 2025",
"end_date": "November 30, 2026",
"rent_amount": "Q. 4,500.00",
"deposit_amount": "Q. 9,000.00"
}
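Because local model output can vary between runs, it is worth adding a lightweight validation step before trusting the JSON downstream. A minimal sketch, using the field names from the schema above:

```python
REQUIRED_FIELDS = {
    "contract_id", "agreement_date", "landlord_name", "tenant_name",
    "start_date", "end_date", "rent_amount", "deposit_amount",
}

def missing_fields(attributes: dict) -> list[str]:
    """Return required fields that are absent or empty in the extracted attributes."""
    return sorted(f for f in REQUIRED_FIELDS if not attributes.get(f))

sample = {
    "contract_id": "RLA-2025-GUA-88B",
    "agreement_date": "14th day of November, 2025",
    "landlord_name": "Mr. Roberto Cifuentes",
    "tenant_name": "Ms. Elena Rodriguez",
    "start_date": "December 1, 2025",
    "end_date": "November 30, 2026",
    "rent_amount": "Q. 4,500.00",
}
print(missing_fields(sample))  # -> ['deposit_amount']
```

If any required field is missing, you can retry the extraction or route the document to a human reviewer — a cheap safeguard that makes the pipeline auditable.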
Why This Is a Game Changer
- ✔ Zero cloud usage
- ✔ No API keys
- ✔ Ideal for confidential documents
- ✔ Reproducible and auditable outputs
- ✔ Works offline
- ✔ Cheap (or free)
- ✔ Accuracy that rivals cloud models on many routine business tasks
This is where SMEs gain massive competitive leverage — they can process high-value internal data with full privacy, full control, and without paying per token.
7. Conclusion — SMEs Now Have Enterprise-Level AI Capabilities
This blog is the practical continuation of our earlier article:
➡ AI Is Transforming Business Operations in 2025 — and SMEs Are Leading the Way
With the combination of:
- LangExtract
- Local LLMs like Ollama
- Simple Python scripts
- Flexible extraction schemas
Even a small business can now build:
- Automated contract readers
- Document-to-JSON pipelines
- Internal AI assistants
- Compliance automations
- Task-specific extractors
- Structured databases from raw text
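As a sketch of that last idea, the extracted attributes drop straight into a relational table using only the standard library. The table and column names here are made up for illustration:

```python
import sqlite3

# Hypothetical table for the lease fields extracted earlier
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE leases (contract_id TEXT PRIMARY KEY, tenant_name TEXT, rent_amount TEXT)"
)

extracted = {
    "contract_id": "RLA-2025-GUA-88B",
    "tenant_name": "Ms. Elena Rodriguez",
    "rent_amount": "Q. 4,500.00",
}
conn.execute(
    "INSERT INTO leases VALUES (:contract_id, :tenant_name, :rent_amount)",
    extracted,
)

print(conn.execute("SELECT tenant_name, rent_amount FROM leases").fetchone())
# -> ('Ms. Elena Rodriguez', 'Q. 4,500.00')
```

From here the data is queryable with plain SQL — the "structured database from raw text" with no extra infrastructure.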
And all of this runs on the hardware you already own.
If you’d like support building a custom extraction system for your business, reach out — we can help architect, build, and deploy it.
Bright-tek → Modern AI + Software Development, done professionally.