Your Own Personal Jarvis (or maybe just a Smart Teacup): Building a Local, Private LLM

6/18/2025

Remember when having your own website felt like joining a secret club? Or running your own email server? (Okay, maybe that last one was just for the truly masochistic). Well, we're hitting another one of those inflection points, folks. It's about your own AI. Not chatting with some server farm thousands of miles away, feeding it your deepest thoughts or, in the case of our esteemed attorney friend, your most confidential client data. Nah. We're talking about an AI that lives with you, on your hardware, under your control.

The subject line of this discussion thread hit me right in the feels: "Building a Private Local LLM: A Non-Developer's Guide and Experience with Ollama & Open WebUI". And then you read it's a 53-year-old attorney who just went full root-access-YOLO, wrestled with WSL2 (bless his heart, that's a rite of passage), got Ollama purring, and slapped Open WebUI on top for a shiny interface. This, my friends, is not just a tech story; it's an inspirational saga for anyone who's ever thought, "AI is cool, but also... kinda creepy?"

Why Go Local? The Privacy Blanket and the Unlimited Buffet

For a law firm, as highlighted immediately in the thread, privacy isn't a luxury; it's an ethical and professional absolute. The idea that logs of confidential legal queries could be subject to subpoena (as was the case with OpenAI, yikes!) is enough to give any partner night sweats. Bringing the LLM in-house wraps your sensitive data in a digital lead-lined blanket. It stays on your network, behind your firewall, answerable only to you.

But beyond lawyers and doctors and folks with secrets (which, let's be honest, is everyone), local AI offers another irresistible perk: unlimited usage. No rate limits, no subscriptions scaling with usage, no worrying if your casual brainstorming session is racking up micro-cents that become macro-dollars. It's like an all-you-can-eat buffet where you paid for the kitchen upfront.

Ah, the age-old CAPEX (Capital Expenditure - buying the gear) vs. OPEX (Operational Expenditure - paying for cloud usage) debate. You frontload the cost with hardware, but then your running costs are minimal (mostly electricity and the occasional replacement part). For predictable, heavy usage, CAPEX often wins long-term. For sporadic or exploratory use, OPEX (cloud APIs) can be better. But for privacy and control, local wins hands down.
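
To make that concrete, here's a back-of-the-envelope break-even calculation. Every number in it is a placeholder assumption (your hardware quote, power bill, and cloud spend will differ), but the shape of the math is the point:

```python
# Back-of-the-envelope CAPEX vs. OPEX break-even. All figures are placeholder
# assumptions -- plug in your own hardware quote and API bill.
hardware_cost = 10_000        # one-time server build (the thread's target)
monthly_power = 60            # rough electricity cost for a busy server
monthly_cloud_bill = 900      # what a small firm might spend on hosted API usage

monthly_savings = monthly_cloud_bill - monthly_power
break_even_months = hardware_cost / monthly_savings
print(f"Break-even after ~{break_even_months:.0f} months")  # ~12 months with these numbers
```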

The Entry Point: Ollama and Open WebUI – Your First Steps into the Local AI Ocean

Our attorney didn't start by building a data center in his basement. He started simple. Ollama is a fantastic tool that makes downloading and running open-source LLMs surprisingly easy. It handles the messy bits of getting models to talk to your hardware. Think of it as the engine.

Then there's Open WebUI. If Ollama is the engine, Open WebUI is the dashboard and steering wheel. It gives you a user-friendly chat interface in your web browser, much like ChatGPT, but powered by your local Ollama instance. It even supports multiple users and managing different models. For a non-developer, this combo is pure gold.
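
To see what "the engine" looks like up close, here's a minimal sketch that talks to a local Ollama instance over its REST API (default port 11434). It assumes Ollama is already running and that you've pulled a model; the model name here is just a placeholder:

```python
import requests

# Minimal sketch: send one prompt to a local Ollama server and print the reply.
# Assumes Ollama is running on its default port and a model has been pulled.
OLLAMA_URL = "http://localhost:11434/api/generate"

resp = requests.post(OLLAMA_URL, json={
    "model": "llama3",   # placeholder: any model you've pulled with `ollama pull`
    "prompt": "Summarize attorney-client privilege in two sentences.",
    "stream": False,     # return one JSON blob instead of a token stream
})
resp.raise_for_status()
print(resp.json()["response"])
```

Open WebUI is essentially a much nicer, multi-user front end making this same kind of call on your behalf.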

Personal Oops Moment: My first attempt at getting any LLM running locally involved wrestling with Python environments, CUDA drivers, and GitHub repos that looked like cryptic alien texts. It was a dark time. Tools like Ollama and Open WebUI didn't exist in this polished form a couple of years ago. They have dramatically lowered the barrier to entry. If you tried this pre-2023 and ran screaming, maybe try again now. It's so much easier.

The Buttery Smooth Experience (and the Hardware Reality)

This is where the conversation in the thread gets deliciously technical and crucial. Running LLMs, especially the larger, more capable ones, is primarily limited by one thing: GPU VRAM.

Think of VRAM (Video RAM) as the AI's short-term memory. The model itself needs to be loaded into this memory to run. Bigger models need more VRAM. Simple as that. Quantization helps (it's like compressing the model so it fits into less space), but there's a limit.

The thread talks about models from Gemma 3 12B (12 billion parameters) up to 70B and even 120B+. Our attorney started with a 12B-class model on his "5090" (presumably an RTX 5090, Nvidia's 32GB consumer flagship, or a similar beast). A 12B model is a good start: think of it as the eager intern who knows some stuff but needs constant supervision.

But for serious tasks, like analyzing multiple long legal documents, you need more muscle. The 32B models are described as "trustworthy assistants," while 70B+ models are the "grad students" who can "devour several 30-page briefs" and synthesize complex analyses.

To run a 70B model, even a quantized version, you generally need 48GB of VRAM or more. An RTX 4090 has 24GB. So, immediately, you're looking at multiple GPUs or more expensive professional cards.
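
Where do numbers like that come from? A rough rule of thumb: VRAM needed is roughly parameter count times bytes per weight, plus overhead for the KV cache and activations. Here's a quick sketch of that arithmetic (the 20% overhead factor is an assumption, and real requirements grow with context length):

```python
# Rough VRAM estimate: weights = parameters * bytes-per-weight, plus an assumed
# ~20% overhead for KV cache and activations. A rule of thumb, not a spec.
def vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for size in (12, 32, 70):
    print(f"{size}B @ 4-bit ~= {vram_gb(size, 4):.0f} GB, "
          f"@ 8-bit ~= {vram_gb(size, 8):.0f} GB")
```

A 70B model at 4-bit lands around 42GB before you give it any room to breathe, which is why 48GB is the practical floor.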

The discussion quickly pivots from consumer GPUs to workstation- and server-grade beasts like the Nvidia RTX 6000 Ada (48GB) or the RTX Pro 6000 (96GB). These aren't just bigger; they offer the cooling designs, ECC memory, and reliability suited for continuous operation in a server environment.

Scaling Pain: The Multi-User Bottleneck

Here’s a critical point often missed by home users: when you set up Open WebUI on a single machine with a single or even multiple GPUs, inference (the process of the AI generating a response) is often serial. If one person is asking a complex question, everyone else is waiting in line, like waiting for the single bathroom at a party.

This is why, for an institutional setup like a law firm with multiple users, a single powerful workstation isn't enough. You need a system designed for concurrency. This is where multi-GPU servers shine. Ollama can distribute models and tasks across multiple GPUs, allowing more than one user (or one user running multiple tasks) to get responses concurrently.
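
If you want to see the bottleneck for yourself, a small script like this fires a few prompts at a local Ollama server simultaneously and times them; if the finish times stack up one after another instead of landing together, your requests are queuing. (Recent Ollama releases can serve some requests in parallel via the OLLAMA_NUM_PARALLEL environment variable, VRAM permitting; treat the model name below as a placeholder.)

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

# Fire a few simultaneous prompts at a local Ollama server and time each one.
# If responses finish back-to-back rather than together, requests are queuing.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(i: int) -> str:
    start = time.time()
    requests.post(OLLAMA_URL, json={
        "model": "llama3",                   # placeholder model name
        "prompt": f"Question {i}: explain hearsay in one paragraph.",
        "stream": False,
    }).raise_for_status()
    return f"request {i} finished after {time.time() - start:.1f}s"

with ThreadPoolExecutor(max_workers=4) as pool:
    for line in pool.map(ask, range(4)):
        print(line)
```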

Building the Beast: Used Server Hardware is Your Friend

Okay, let's talk turkey – building a capable local LLM server for multiple users isn't cheap, but it doesn't have to involve selling a kidney. The $10k target mentioned in the thread is quite achievable for something decent.

Forget fancy gaming cases. The real secret, as the thread points out, is used server hardware. Data centers refresh their gear constantly, flooding the market with powerful, reliable components at steep discounts.

  • Chassis: Look for secondhand Supermicro or ASUS server chassis on eBay. Something with multiple PCIe slots (for GPUs!) and plenty of drive bays. You can snag a solid 3U or 4U chassis for $300-$1000.
  • Motherboard: Compatible server motherboards (like Supermicro H11DSI for AMD EPYC or H11SSL for single socket) are available for $500-$800. These support tons of RAM and multiple CPUs/GPUs.
  • CPUs: AMD EPYC CPUs (especially older generations) offer a fantastic core count for surprisingly little money on the used market.
  • RAM: This is the shocker. Server-grade ECC DDR4 memory (essential for stability) is dirt cheap used. A 64GB stick can be $50! You can load up on hundreds of gigabytes for a few hundred bucks. While the model primarily lives on the GPU, system RAM is still needed for the OS, other processes, and sometimes for swapping model parts if VRAM is tight (though this is slow). The thread suggests 512GB RAM – maybe overkill for pure inference, but 128GB or 256GB is a more realistic start for many.
  • GPUs: This is the main expense. An RTX 6000 Ada (48GB) is powerful and available used, but still pricey ($4000-$5000+ used, roughly). Multiple lower-cost GPUs can work, but managing power and cooling in a non-server case is a nightmare (speaking from personal experience trying to cram 4x consumer cards into a standard tower; it became a loud, hot, unstable mess). A proper server chassis is designed for that kind of airflow and power delivery. As a sanity check, the rough tally below adds these parts up.
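
Adding up the ballpark figures above (all rough used-market numbers from the discussion, not quotes), the $10k target looks comfortably achievable:

```python
# Rough tally of the used-parts shopping list above. Every number is a
# ballpark from the discussion, not a current price quote.
parts = {
    "chassis (used 3U/4U Supermicro or ASUS)": 800,
    "motherboard (used EPYC board)":           650,
    "CPU (used AMD EPYC)":                     400,
    "RAM (256GB used ECC DDR4)":               300,
    "GPU (used 48GB workstation card)":       4500,
    "PSU, drives, cables, fans":               800,
}
total = sum(parts.values())
print(f"Estimated build: ${total:,}")   # ~$7,450 -- under the $10k target
```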

Error to Avoid: Trying to build a multi-GPU setup in a consumer PC case. Just... don't. The power supplies aren't designed for it, the cooling is insufficient, and the motherboards often lack the necessary PCIe slot layout and bandwidth. You'll spend more time troubleshooting crashes and thermal throttling than actually using the AI. Buy a proper server chassis.

Beyond the Chat: RAG and Vector Databases

The discussion also touches on RAG (Retrieval Augmented Generation). This is how you get an LLM to talk about your specific documents (legal briefs, company reports, and so on) without training it from scratch. You chunk your documents, convert the chunks into numerical representations called "vectors," and store them in a vector database. When a user asks a question, you find the most relevant chunks via the vector database and feed them into the LLM's context window along with the query.
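
Here's a deliberately tiny sketch of that pipeline, using Ollama's embeddings endpoint with plain cosine similarity standing in for the vector database; the embedding and chat model names are placeholders for whatever you've pulled locally:

```python
import requests

# Minimal RAG sketch: embed document chunks, find the ones closest to the
# question, and stuff them into the prompt. A real setup would use a dedicated
# vector database; here plain cosine similarity stands in for it.
OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})  # placeholder embed model
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

chunks = ["Clause 4.2 limits liability to direct damages...",
          "The engagement letter is governed by Delaware law...",
          "Fees are billed monthly at the rates in Schedule A..."]
index = [(c, embed(c)) for c in chunks]

question = "Which law governs the engagement letter?"
q_vec = embed(question)
best = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]   # top-1 retrieval

answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3",                       # placeholder chat model
    "prompt": f"Context:\n{best}\n\nQuestion: {question}\nAnswer using only the context.",
    "stream": False,
}).json()["response"]
print(answer)
```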

The thread mentions Milvus as an open-source vector database that can use a GPU for acceleration. This is crucial for performance. Loading and vectorizing large PDFs inside the LLM application itself is slow. Offloading it to a dedicated vector database server (even just a second PC on the network with a decent GPU) dramatically speeds up document-based queries. Open WebUI has settings to connect to such a database.
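
As a sketch of what that offloading looks like, here's the same retrieval step pointed at a Milvus instance via pymilvus's MilvusClient quick-setup path. It assumes Milvus is running on its default port, reuses the embed() helper and chunks from the sketch above, and the collection name and 768-dimension figure are assumptions tied to whichever embedding model you pick:

```python
from pymilvus import MilvusClient

# Sketch of the retrieval step backed by Milvus instead of in-memory math.
# Assumes a Milvus server on the default port and 768-dim embeddings from the
# embed() helper above; the collection name is made up for illustration.
client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="briefs", dimension=768)

client.insert(collection_name="briefs", data=[
    {"id": i, "vector": embed(chunk), "text": chunk}
    for i, chunk in enumerate(chunks)
])

hits = client.search(collection_name="briefs",
                     data=[embed("Which law governs the engagement letter?")],
                     limit=3, output_fields=["text"])
for hit in hits[0]:
    print(hit["entity"]["text"])
```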

The Caveats: Not a Magic Eight Ball

As one wise contributor points out, LLMs are not reliable sources of truth. They are language models, excellent at predicting the next word based on patterns. They can confidently make things up (hallucinate), miss crucial details, and be easily led astray by biased phrasing.

For a law firm, this means an LLM is an assistant, a tool for brainstorming, drafting, summarizing, and finding information – not a substitute for legal research or analysis. Every output must be verified by a human expert. Always. Test it with things you know the answer to. See how you can trick it. Understand its limitations before you rely on it.

The Revolution is Local

The comparison to the 90s PC revolution, with companies suddenly needing email and database servers, is apt. We're seeing the rise of a new type of essential internal infrastructure: the AI server.

What started as a personal journey for one attorney curious about a new tool quickly reveals the potential for fundamental shifts in workflow, productivity, and data handling, especially in fields where privacy and document analysis are paramount.

It requires learning new things, potentially getting your hands dirty with server hardware, and understanding the nuances of model performance and scaling. But the upside – a powerful, private AI tailored to your needs, with no per-use costs – is incredibly compelling.

So, whether you're an attorney, an engineer, a writer, or just a curious mind, the tools and knowledge are now available to build your own slice of the AI future, right there in your office or home. Just mind the VRAM, the power supply, and remember to double-check everything the AI tells you.

(And yeah, someone mentioned wanting an LLM to manage their server... that's the kind of glorious, terrifying future I'm here for.)