Building Practical AI Data Extraction Pipelines: From Cloud to Local LLMs

13/11/2025 · Bright-tek

AI Data Extraction Pipelines

How SMEs Can Turn Unstructured Text into Actionable, Searchable, Reliable Data

In our previous entry —
👉 AI Is Transforming Business Operations in 2025 — and SMEs Are Leading the Way
we explored how small and medium businesses are quietly gaining a huge advantage with AI: faster decision-making, automated back-office work, and instant access to insights buried in documents.

This new article builds directly on those ideas, but with a stronger technical, hands-on focus.

We’re going from:

“AI can extract insights from your documents.”

to:

“Here is how YOU can build your own extraction pipeline — including a private local version with no cloud and no API keys.”


1. Why Build Data Extraction Pipelines?

Businesses are drowning in unstructured text:

This “dark data” is expensive, slow, and risky to handle manually.
Modern LLM-based extraction, however:


2. Key Use Cases (Directly from Real SME Scenarios)

These are the same use cases we introduced in our first blog, but now we will implement one.


3. Architecture of an AI Extraction Pipeline

A modern extraction pipeline typically contains:

[Raw Documents] → [Text Extraction] → [Chunking] → [LLM Processing] → [Structured JSON] → [Database/API]

Where the LLM can be:

Using LangExtract as the glue library makes the experience consistent across all backends.
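
To make the stages in the diagram concrete, here is a minimal sketch of the chunking and LLM-processing steps in plain Python. The names chunk_text, run_pipeline, and fake_extract are hypothetical helpers for illustration only (they are not part of LangExtract), and fake_extract is a regex stand-in for the real model call.

```python
import re
from typing import Callable, Dict, List, Optional


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


def run_pipeline(
    raw_text: str,
    extract_fn: Callable[[str], Optional[Dict[str, str]]],
    max_chars: int = 1000,
    overlap: int = 100,
) -> List[Dict[str, str]]:
    """Chunk the text, run the extraction step on each chunk, keep non-empty results."""
    records = []
    for chunk in chunk_text(raw_text, max_chars, overlap):
        fields = extract_fn(chunk)  # in a real pipeline this is the LLM call
        if fields:
            records.append(fields)
    return records


def fake_extract(chunk: str) -> Optional[Dict[str, str]]:
    """Stand-in for the LLM stage: finds a contract ID with a regex."""
    match = re.search(r"Contract ID: ([A-Z0-9-]+)", chunk)
    return {"contract_id": match.group(1)} if match else None


if __name__ == "__main__":
    sample = "Contract ID: RLA-2025-GUA-88B. Term: 12 months."
    print(run_pipeline(sample, fake_extract))
```

Overlapping windows are one simple way to avoid cutting a field in half at a chunk boundary; the right chunk size depends on the model's context window.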


4. The LangExtract Toolkit

LangExtract provides:

It can run on top of:


5. Example Extraction (Cloud Version)

Here is a simple cloud-based example (OpenAI, Gemini, etc.):

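import langextract as lx

# invoice_text, prompt, and examples are defined by you (Step 4 below shows the pattern).
# Cloud models need an API key, for example in the LANGEXTRACT_API_KEY environment variable.
result = lx.extract(
    text_or_documents=invoice_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # OpenAI models such as "gpt-4o" may need extra install steps and settings
)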

This works great, but…

Many businesses cannot send sensitive financial or legal documents to the cloud.

That’s where local LLM extraction becomes a game changer.


6. Building a Private, Local LLM Extraction Pipeline (Ollama + Python)

100% privacy · 0 API keys · Runs on your laptop

This section shows how to build the same extraction workflow, but running entirely offline, using:

This approach is ideal for:


Step 1 — Install Ollama

Go to: https://ollama.com
Download and install for your OS.

Run your first model:

ollama run llama3

This downloads the model on first use and opens an interactive chat. The Ollama service behind it listens locally at:

http://localhost:11434

Leave Ollama running in the background.
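
Before moving on, it can help to confirm the server is reachable from Python. The sketch below assumes Ollama's /api/tags endpoint, which lists locally available models; parse_model_names and list_local_models are hypothetical helper names, and the check returns an empty list instead of raising when nothing is listening.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # default address from Step 1


def parse_model_names(payload: dict) -> List[str]:
    """Pull model names out of an /api/tags style response body."""
    if not isinstance(payload, dict):
        return []
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> List[str]:
    """Return the models the local server reports, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON
        return []


if __name__ == "__main__":
    models = list_local_models()
    if models:
        print("Ollama is running. Models:", ", ".join(models))
    else:
        print("No Ollama server or no models found at", OLLAMA_URL)
```

If the list is empty, start the Ollama application (or pull a model) before running the extraction script.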


Step 2 — Set Up Python Environment

python -m venv venv
.\venv\Scripts\activate   # Windows
# source venv/bin/activate  # macOS/Linux
pip install langextract

Step 3 — Add Your Unstructured Input File

Create a file named input.txt:

**RESIDENTIAL LEASE AGREEMENT**
**Contract ID: RLA-2025-GUA-88B**

This agreement is entered into on this **14th day of November, 2025**, by and between the Landlord, **Mr. Roberto Cifuentes**, whose primary residence is at 123 Calle Ficticia, Antigua Guatemala, Sacatepéquez 03001, (hereinafter "Landlord"), and the Tenant, **Ms. Elena Rodriguez**, currently residing at 456 Avenida de Ejemplo, Guatemala City, 01001, (hereinafter "Tenant").

**1. Property:**
The Landlord agrees to lease the property located at **789 Camino Real, Zona 10, Guatemala City, 01010**, to the Tenant.

**2. Term:**
The term of this lease shall commence on **December 1, 2025**, and shall terminate on **November 30, 2026**.

**3. Financials:**
A. Rent: Tenant agrees to pay a monthly rent of **Q. 4,500.00** (Four Thousand Five Hundred Quetzales).
B. Security Deposit: Upon execution of this agreement, Tenant shall deposit with the Landlord the sum of **Q. 9,000.00** as security for any damages.

**4. Contact Information:**
In case of emergency, Tenant's emergency contact is **Javier Morales** at phone number **+502 5555-1234**. The Landlord's designated property manager is **Inmobiliaria Segura, S.A.** at **+502 2444-9876**. All legal notices should be sent to the Landlord's primary residence.

This document (RLA-2025-GUA-88B) constitutes the entire agreement.

Step 4 — Create the Extraction Script

Save as run_extract.py:

import json
import textwrap
import langextract as lx

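# === Example Schema to Teach the Model ===
# Every attribute value in the example below appears in this text, so the
# model is shown grounded extractions rather than invented values.
example_text = textwrap.dedent("""
    RESIDENTIAL LEASE AGREEMENT
    Contract No: SA-2024-101
    This contract is made on January 1, 2024, between the Landlord
    (Mr. Juan Perez) and the Tenant (Ms. Lucia Fernandez) for the property
    at 111 Main St. The term runs from February 1, 2024 to January 31, 2025.
    Monthly rent is Q. 1,000.00, with a security deposit of Q. 2,000.00.
""").strip()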

examples = [
    lx.data.ExampleData(
        text=example_text,
        extractions=[
            lx.data.Extraction(
                extraction_class="lease_agreement",
                extraction_text=example_text,
                attributes={
                    "contract_id": "SA-2024-101",
                    "agreement_date": "January 1, 2024",
                    "landlord_name": "Mr. Juan Perez",
                    "tenant_name": "Ms. Lucia Fernandez",
                    "property_address": "111 Main St",
                    "start_date": "February 1, 2024",
                    "end_date": "January 31, 2025",
                    "rent_amount": "Q. 1,000.00",
                    "deposit_amount": "Q. 2,000.00",
                },
            ),
        ],
    )
]

prompt = """
Extract the key fields from the lease agreement.
Return ONLY one 'lease_agreement' entity.
"""

# Load text from file
with open("input.txt", "r", encoding="utf-8") as f:
    text_to_process = f.read()

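# === Local extraction using Ollama ===
# model_url points at the local Ollama server. The two flags below follow the
# LangExtract Ollama example and may need adjusting for your installed version.
result = lx.extract(
    text_or_documents=text_to_process,
    prompt_description=prompt,
    examples=examples,
    model_id="llama3",                   # must match a model pulled in Ollama
    model_url="http://localhost:11434",  # local Ollama server from Step 1
    fence_output=False,
    use_schema_constraints=False,
)

# Local models can return nothing; fail with a clear message instead of an IndexError
if not result.extractions:
    raise SystemExit("No extractions returned; check the model and the prompt.")

extracted = result.extractions[0].attributes
print(json.dumps(extracted, indent=2, ensure_ascii=False))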

Step 5 — Run It

python run_extract.py

You should see structured JSON along these lines (exact values can vary by model and run):

{
  "contract_id": "RLA-2025-GUA-88B",
  "agreement_date": "14th day of November, 2025",
  "landlord_name": "Mr. Roberto Cifuentes",
  "tenant_name": "Ms. Elena Rodriguez",
  "start_date": "December 1, 2025",
  "end_date": "November 30, 2026",
  "rent_amount": "Q. 4,500.00",
  "deposit_amount": "Q. 9,000.00"
}
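
The values come back as strings exactly as written in the contract, and a local model can skip a field (the sample output above has no property_address even though the example schema includes it). A minimal post-processing sketch, assuming the field names from the example schema; REQUIRED_FIELDS, missing_fields, parse_amount, and parse_date are hypothetical helpers, and the date parser only handles the two formats seen in this lease.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Dict, List

# Field names taken from the example schema in run_extract.py
REQUIRED_FIELDS = [
    "contract_id", "agreement_date", "landlord_name", "tenant_name",
    "property_address", "start_date", "end_date", "rent_amount", "deposit_amount",
]


def missing_fields(record: Dict[str, str]) -> List[str]:
    """List required fields that are absent or empty in the extracted record."""
    return [key for key in REQUIRED_FIELDS if not record.get(key)]


def parse_amount(raw: str) -> Decimal:
    """Turn a string such as 'Q. 4,500.00' into Decimal('4500.00')."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no amount found in {raw!r}")
    return Decimal(match.group().replace(",", ""))


def parse_date(raw: str) -> date:
    """Parse 'December 1, 2025' or '14th day of November, 2025'."""
    text = raw.strip()
    long_form = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) day of (\w+),? (\d{4})", text)
    if long_form:
        day, month, year = long_form.groups()
        text = f"{month} {day}, {year}"
    return datetime.strptime(text, "%B %d, %Y").date()


if __name__ == "__main__":
    record = {
        "contract_id": "RLA-2025-GUA-88B",
        "agreement_date": "14th day of November, 2025",
        "start_date": "December 1, 2025",
        "rent_amount": "Q. 4,500.00",
    }
    print("Missing:", missing_fields(record))
    print(parse_date(record["agreement_date"]), parse_amount(record["rent_amount"]))
```

Flagging missing fields rather than silently accepting the record makes it easier to route incomplete extractions to a human reviewer.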

Why This Is a Game Changer

This is where SMEs gain real competitive leverage: they can process high-value internal data with full privacy and full control, and without paying per token.
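
To keep the final "Database" stage of the pipeline local as well, the extracted record can go into SQLite, which ships with Python. This is only a sketch under assumptions: the leases table layout and the save_record and find_by_tenant helpers are hypothetical, and a real system would choose its own schema.

```python
import json
import sqlite3
from typing import Dict, List


def save_record(conn: sqlite3.Connection, record: Dict[str, str]) -> None:
    """Store one extracted lease; the full record is kept as JSON alongside key columns."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS leases (
               contract_id   TEXT PRIMARY KEY,
               tenant_name   TEXT,
               landlord_name TEXT,
               raw_json      TEXT NOT NULL
           )"""
    )
    conn.execute(
        "INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?)",
        (
            record["contract_id"],
            record.get("tenant_name"),
            record.get("landlord_name"),
            json.dumps(record, ensure_ascii=False),
        ),
    )
    conn.commit()


def find_by_tenant(conn: sqlite3.Connection, name_fragment: str) -> List[str]:
    """Return contract IDs whose tenant name contains the given fragment."""
    rows = conn.execute(
        "SELECT contract_id FROM leases WHERE tenant_name LIKE ? ORDER BY contract_id",
        (f"%{name_fragment}%",),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
    save_record(conn, {
        "contract_id": "RLA-2025-GUA-88B",
        "tenant_name": "Ms. Elena Rodriguez",
        "landlord_name": "Mr. Roberto Cifuentes",
    })
    print(find_by_tenant(conn, "Rodriguez"))
```

Using the contract ID as the primary key means re-running the extraction on the same document updates the row instead of duplicating it.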


7. Conclusion — SMEs Now Have Enterprise-Level AI Capabilities

This blog is the practical continuation of our earlier article:

AI Is Transforming Business Operations in 2025 — and SMEs Are Leading the Way

With the combination of:

Even a small business can now build:

And all of this runs on the hardware you already own.

If you’d like support building a custom extraction system for your business, reach out — we can help architect, build, and deploy it.

Bright-tek → Modern AI + Software Development, done professionally.

Tags: AI, LLM, Python, Automation, Data Extraction, Ollama, SME Innovation