Building Practical AI Data Extraction Pipelines: From Cloud to Local LLMs
How SMEs Can Turn Unstructured Text into Actionable, Searchable, Reliable Data
In our previous entry —
👉 AI Is Transforming Business Operations in 2025 — and SMEs Are Leading the Way —
we explored how small and medium businesses are quietly gaining a huge advantage with AI: faster decision-making, automated back-office work, and instant access to insights buried in documents.
This new article builds directly on those ideas, but with a stronger technical, hands-on focus.
We’re going from:
“AI can extract insights from your documents.”
to:
“Here is how YOU can build your own extraction pipeline — including a private local version with no cloud and no API keys.”
1. Why Build Data Extraction Pipelines?
Businesses are drowning in unstructured text:
- Contracts
- Financial statements
- Customer support transcripts
- Compliance reports
- Legal documents
- Invoices and receipts
This “dark data” is expensive, slow, and risky to handle manually.
Modern LLM-based extraction, however:
- Understands natural-language structure
- Extracts meaningful information
- Formats it into JSON
- Is often more accurate than regex or rule-based systems
- Works with multilingual documents
- Scales from 10 docs to 100,000
2. Key Use Cases (Directly from Real SME Scenarios)
- ✔ Lease contract summarization
- ✔ Extracting invoice fields into ERP systems
- ✔ Compliance and legal entity extraction
- ✔ Customer sentiment + issue extraction
- ✔ Medical intake form classification
- ✔ Logistics & shipping document processing
These are the same use cases we introduced in our first blog, but now we will implement one.
3. Architecture of an AI Extraction Pipeline
A modern extraction pipeline typically contains:
[Raw Documents] → [Text Extraction] → [Chunking] → [LLM Processing] → [Structured JSON] → [Database/API]
Where the LLM can be:
- Cloud-based (OpenAI, Google Gemini, Anthropic)
- Hybrid (local + cloud)
- Fully local (privacy-critical environments)
Using LangExtract as the glue library makes the experience consistent across all backends.
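The stages above can be sketched end to end in a few illustrative Python functions. This is a skeleton only: the function bodies are stand-ins for real parsers, LLM calls, and database writers.

```python
import json

def extract_text(raw_doc: bytes) -> str:
    """Stage 1: turn a raw document into plain text (stand-in for a PDF/OCR parser)."""
    return raw_doc.decode("utf-8")

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    """Stage 2: split long text into model-sized pieces."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def llm_process(piece: str) -> dict:
    """Stage 3: stand-in for the LLM extraction call."""
    return {"length": len(piece)}

def run_pipeline(raw_doc: bytes) -> str:
    """Stages 4-5: collect structured results as JSON, ready for a database or API."""
    records = [llm_process(c) for c in chunk(extract_text(raw_doc))]
    return json.dumps(records)

print(run_pipeline(b"An example contract. " * 100))
```

Each stage is a seam where you can swap implementations — e.g. replacing `llm_process` with a cloud or local model without touching the rest of the pipeline.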
4. The LangExtract Toolkit
LangExtract provides:
- Structured extraction classes
- Example-based task definitions
- Automatic chunking for long docs
- Local and cloud model support
- JSON + visualization tools
- Parallel execution
It can run on top of:
- OpenAI GPT-4o
- Google Gemini
- Anthropic Claude
- Local LLMs via Ollama
- Local inference servers
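One of those features deserves a quick illustration: before a long document reaches the model, it must be split into overlapping windows so that a fact straddling a boundary still appears whole in at least one chunk. A naive sketch of the idea that LangExtract automates:

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows; each window shares `overlap`
    characters with its neighbor so boundary-spanning facts survive."""
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(doc)
print(len(chunks))  # -> 3
```

The tail of each chunk repeats as the head of the next, which is exactly why chunked extraction also needs de-duplication of results — something the library handles for you.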
5. Example Extraction (Cloud Version)
Here is a simple cloud-based example (OpenAI, Gemini, etc.), assuming `invoice_text`, `prompt`, and `examples` are already defined:

import langextract as lx

result = lx.extract(
    text_or_documents=invoice_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # or "gpt-4o"
)
This works great, but…
Many businesses cannot send sensitive financial or legal documents to the cloud.
That’s where local LLM extraction becomes a game changer.
6. Building a Private, Local LLM Extraction Pipeline (Ollama + Python)
100% privacy · 0 API keys · Runs on your laptop
This section shows how to build the same extraction workflow, but running entirely offline, using:
- Ollama → your local LLM server
- LangExtract → extraction logic
- Python → glue code
- No internet connection required
This approach is ideal for:
- Legal agreements
- HR files
- Medical documents
- Confidential financial data
- Internal proprietary datasets
Step 1 — Install Ollama
Go to: https://ollama.com
Download and install for your OS.
Run your first model:
ollama run llama3

This downloads the model on first use and starts an interactive session; the Ollama API server then listens in the background at:

http://localhost:11434
Leave Ollama running in the background.
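Before moving on, you can verify the server is reachable with a few lines of standard-library Python — Ollama's `/api/tags` endpoint lists the locally installed models, so a successful response confirms everything is up:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=3) as resp:
            return "models" in json.loads(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        return False

print(ollama_is_up())  # True once `ollama run llama3` has started the server
```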
Step 2 — Set Up Python Environment
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install langextract
Step 3 — Add Your Unstructured Input File
Create a file named input.txt:
**RESIDENTIAL LEASE AGREEMENT**
**Contract ID: RLA-2025-GUA-88B**
This agreement is entered into on this **14th day of November, 2025**, by and between the Landlord, **Mr. Roberto Cifuentes**, whose primary residence is at 123 Calle Ficticia, Antigua Guatemala, Sacatepéquez 03001, (hereinafter "Landlord"), and the Tenant, **Ms. Elena Rodriguez**, currently residing at 456 Avenida de Ejemplo, Guatemala City, 01001, (hereinafter "Tenant").
**1. Property:**
The Landlord agrees to lease the property located at **789 Camino Real, Zona 10, Guatemala City, 01010**, to the Tenant.
**2. Term:**
The term of this lease shall commence on **December 1, 2025**, and shall terminate on **November 30, 2026**.
**3. Financials:**
A. Rent: Tenant agrees to pay a monthly rent of **Q. 4,500.00** (Four Thousand Five Hundred Quetzales).
B. Security Deposit: Upon execution of this agreement, Tenant shall deposit with the Landlord the sum of **Q. 9,000.00** as security for any damages.
**4. Contact Information:**
In case of emergency, Tenant's emergency contact is **Javier Morales** at phone number **+502 5555-1234**. The Landlord's designated property manager is **Inmobiliaria Segura, S.A.** at **+502 2444-9876**. All legal notices should be sent to the Landlord's primary residence.
This document (RLA-2025-GUA-88B) constitutes the entire agreement.
Step 4 — Create the Extraction Script
Save as run_extract.py:
import json
import textwrap

import langextract as lx

# === Example schema to teach the model ===
# The example text contains every field that the attributes reference,
# so the model learns to ground each attribute in the source text.
example_text = textwrap.dedent("""
    SERVICE AGREEMENT
    Contract No: SA-2024-101
    This contract is made on January 1, 2024, between the Provider
    (Mr. Juan Perez) and the Client (Ms. Lucia Fernandez), for the
    property at 111 Main St. The term runs from February 1, 2024,
    to January 31, 2025. Monthly rent is Q. 1,000.00 and the
    security deposit is Q. 2,000.00.
""").strip()

examples = [
    lx.data.ExampleData(
        text=example_text,
        extractions=[
            lx.data.Extraction(
                extraction_class="lease_agreement",
                extraction_text=example_text,
                attributes={
                    "contract_id": "SA-2024-101",
                    "agreement_date": "January 1, 2024",
                    "landlord_name": "Mr. Juan Perez",
                    "tenant_name": "Ms. Lucia Fernandez",
                    "property_address": "111 Main St",
                    "start_date": "February 1, 2024",
                    "end_date": "January 31, 2025",
                    "rent_amount": "Q. 1,000.00",
                    "deposit_amount": "Q. 2,000.00",
                },
            ),
        ],
    )
]

prompt = """
Extract the key fields from the lease agreement.
Return ONLY one 'lease_agreement' entity.
"""

# Load the unstructured text from file
with open("input.txt", "r", encoding="utf-8") as f:
    text_to_process = f.read()

# === Local extraction using Ollama ===
result = lx.extract(
    text_or_documents=text_to_process,
    prompt_description=prompt,
    examples=examples,
    model_id="llama3",
    model_url="http://localhost:11434",  # point LangExtract at the local Ollama server
    fence_output=False,                  # local models return raw JSON, not fenced blocks
    use_schema_constraints=False,        # schema constraints are not used with Ollama
)

extracted = result.extractions[0].attributes
print(json.dumps(extracted, indent=2))
Step 5 — Run It
python run_extract.py
With a capable model you should see clean, structured JSON similar to:
{
"contract_id": "RLA-2025-GUA-88B",
"agreement_date": "14th day of November, 2025",
"landlord_name": "Mr. Roberto Cifuentes",
"tenant_name": "Ms. Elena Rodriguez",
"start_date": "December 1, 2025",
"end_date": "November 30, 2026",
"rent_amount": "Q. 4,500.00",
"deposit_amount": "Q. 9,000.00"
}
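Because local model output can vary between runs, it is worth adding a lightweight validation step before trusting the JSON downstream. A minimal sketch, using the field names from the schema above:

```python
REQUIRED_FIELDS = {
    "contract_id", "agreement_date", "landlord_name", "tenant_name",
    "start_date", "end_date", "rent_amount", "deposit_amount",
}

def missing_fields(attributes: dict) -> list[str]:
    """Return required fields that are absent or empty in the extracted attributes."""
    return sorted(f for f in REQUIRED_FIELDS if not attributes.get(f))

sample = {
    "contract_id": "RLA-2025-GUA-88B",
    "agreement_date": "14th day of November, 2025",
    "landlord_name": "Mr. Roberto Cifuentes",
    "tenant_name": "Ms. Elena Rodriguez",
    "start_date": "December 1, 2025",
    "end_date": "November 30, 2026",
    "rent_amount": "Q. 4,500.00",
}
print(missing_fields(sample))  # -> ['deposit_amount']
```

If any required field is missing, you can retry the extraction or route the document to a human reviewer — a cheap safeguard that makes the pipeline auditable.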
Why This Is a Game Changer
- ✔ Zero cloud usage
- ✔ No API keys
- ✔ Ideal for confidential documents
- ✔ Reproducible and auditable outputs
- ✔ Works offline
- ✔ Cheap (or free)
- ✔ Accuracy that rivals cloud models on many routine business tasks
This is where SMEs gain massive competitive leverage — they can process high-value internal data with full privacy, full control, and without paying per token.
7. Conclusion — SMEs Now Have Enterprise-Level AI Capabilities
This blog is the practical continuation of our earlier article:
➡ AI Is Transforming Business Operations in 2025 — and SMEs Are Leading the Way
With the combination of:
- LangExtract
- Local LLMs like Ollama
- Simple Python scripts
- Flexible extraction schemas
Even a small business can now build:
- Automated contract readers
- Document-to-JSON pipelines
- Internal AI assistants
- Compliance automations
- Task-specific extractors
- Structured databases from raw text
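As a sketch of that last idea, the extracted attributes drop straight into a relational table using only the standard library. The table and column names here are made up for illustration:

```python
import sqlite3

# Hypothetical table for the lease fields extracted earlier
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE leases (contract_id TEXT PRIMARY KEY, tenant_name TEXT, rent_amount TEXT)"
)

extracted = {
    "contract_id": "RLA-2025-GUA-88B",
    "tenant_name": "Ms. Elena Rodriguez",
    "rent_amount": "Q. 4,500.00",
}
conn.execute(
    "INSERT INTO leases VALUES (:contract_id, :tenant_name, :rent_amount)",
    extracted,
)

print(conn.execute("SELECT tenant_name, rent_amount FROM leases").fetchone())
# -> ('Ms. Elena Rodriguez', 'Q. 4,500.00')
```

From here the data is queryable with plain SQL — the "structured database from raw text" with no extra infrastructure.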
And all of this runs on the hardware you already own.
If you’d like support building a custom extraction system for your business, reach out — we can help architect, build, and deploy it.
Bright-tek → Modern AI + Software Development, done professionally.