Baltimore Comedy Place



Build Log

How I Built Baltimore Comedy Place: An Overengineered Solution to Finding Shows

From AI-assisted scrapers to cloud automations, here’s the full breakdown of how Baltimore Comedy Place stays up to the minute without turning into a full-time job.

Look, this whole thing started because I’m terrible at keeping track of comedy shows in Baltimore. I’d hear about a great improv show the day after it happened, or I’d check one venue’s website, see nothing interesting, and just stay home—completely unaware that three blocks away there was an amazing stand-up showcase happening in some converted warehouse.

The worst part? I knew this was a solved problem for music. There’s this incredible Instagram account called BaltShowplace that somehow knows about every DIY punk show in every sketchy basement and legitimate venue across the city. I’d scroll through their stories and think, “Man, why doesn’t this exist for comedy?” Then I’d remember: oh right, because I’m a programmer and if I want something to exist, I should probably just build it.

So here’s the thing—I could have just made a Google Calendar and manually added shows. Hell, I could have made a simple WordPress site and updated it weekly. But where’s the fun in that? Plus, I’m lazy. The kind of lazy where I’ll spend 200 hours automating something to avoid doing 2 hours of work every week. You know, programmer lazy.

The Constraints That Shaped Everything

Before I wrote a single line of code, I had to be honest with myself about what this project was and wasn’t. First off, I have a day job. I do improv comedy as a hobby. This couldn’t become some time-sucking side hustle. Second, I had zero plans to monetize this—I literally just wanted it to exist so I could use it, and if other comedy nerds in Baltimore found it useful, cool. But that meant I couldn’t justify spending money on infrastructure either.

The kicker was that BaltShowplace has something like 15,000 followers. If my thing got even a fraction of that popular, I needed architecture that could handle the traffic without me waking up to a $5,000 cloud bill. Oh, and it needed to run itself because the last thing I wanted was to become a full-time curator of Baltimore comedy events.

These constraints led to one conclusion: everything had to be static, cached, and automated to hell and back.

The “Just Use AI” Trap

When I told my non-programmer friends I was going to “use AI to solve this,” they probably imagined me just asking ChatGPT to find shows every day. Adorable, right? As if I could just prompt “Hey, find me all the comedy shows in Baltimore tonight” and get reliable results. That’s like asking a really smart intern who’s never been to Baltimore and has arbitrarily limited internet access to somehow divine what’s happening at The Crown this week (iykyk rip).

No, what I meant was: I’m going to build a system that uses AI as one component in a larger pipeline. The AI would be my universal HTML parser—the thing that could look at any venue’s janky calendar page and extract structured data without me writing custom scrapers for each site.

The Architecture (Or: How I Learned to Stop Worrying and Love Firebase)

Here’s where things get nerdy. The core insight was that I needed a shared source of truth that multiple independent processes could read from and write to. Firebase Realtime Database was perfect for this—not because it’s the best database (it’s not), but because it’s dead simple, has great Python SDKs, and most importantly, it’s nearly free for my usage levels.

The database schema ended up looking like this:

  • targets: URLs and scraping configurations for each venue/collective

  • submissions: Raw show data from AI scraping and user submissions

  • shows: The cleaned, deduplicated, approved shows

  • scrapes: Telemetry and logs for debugging
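Concretely, the four nodes look roughly like this. This is my sketch of the shape, not the exact production schema; the keys, field names, and push IDs are illustrative:

```python
# Illustrative shape of the Realtime Database; keys and field names are
# assumptions reconstructed from the post, not the production schema.
example_db = {
    "targets": {
        "highwire": {
            "url": "https://highwireimprov.com/shows",
            "xpath": "//div[@id='events']",  # optional: isolate one section
            "mobile_view": False,            # hint: scrape the mobile layout
        },
    },
    "submissions": {
        "-NxA1": {                           # hypothetical push ID
            "title": "Open Mic Night",
            "source": "scraper",             # or "user"
            "needs_review": True,
        },
    },
    "shows": {
        "-NxB2": {
            "title": "Improv Showcase",
            "start": "2025-06-14T20:00:00-04:00",  # ISO-8601 after validation
            "venue": "The Crown",
            "merged_from": ["-NxA1"],        # dedup trail
        },
    },
    "scrapes": {
        "-NxC3": {"target": "highwire", "ok": True, "duration_s": 12.4},
    },
}
```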

But here’s the clever bit—Firebase never gets exposed to the public internet. Users hit a static site hosted on Google Cloud Storage behind a CDN. That static site gets regenerated every few hours by a containerized job that pulls from Firebase. This means I can serve thousands of users for basically free, and if Firebase goes down, the site keeps working with slightly stale data.

The Scraping Ballet

Let me walk you through what happens when my system discovers a new show, because it’s honestly kind of beautiful in its complexity.

Every few hours, a Cloud Run Job spins up and reads the targets collection. Each target has a URL (like https://bigimprov.org/shows or https://highwireimprov.com/shows) and potentially some hints about how to parse that specific site. The job fires up headless Chrome with Selenium, navigates to the page, and waits for everything to load. Sometimes it needs to scroll to trigger lazy-loading. Sometimes it needs to switch to mobile view because the desktop site is some React monstrosity that would cost me $50 in LLM tokens to parse.

Once the page is loaded, I strip out all the crap—JavaScript, CSS, tracking pixels, whatever. If the page is still huge (looking at you, venues that put your entire 2025 calendar on one page), I can use XPath expressions stored in the database to isolate just the relevant section. This is key because I can update these XPaths without redeploying code when a venue redesigns their site.
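A minimal version of that cleanup pass, assuming lxml (the tag list and function name are mine):

```python
# Strip script/style noise and optionally narrow to a stored XPath before the
# HTML goes anywhere near an LLM. A simplified sketch, not the exact job code.
from lxml import html as lhtml

NOISE_TAGS = ("script", "style", "noscript", "iframe", "svg")

def shrink(page_html, xpath=None):
    tree = lhtml.fromstring(page_html)
    if xpath:  # stored per-target in Firebase, so it can change without a redeploy
        nodes = tree.xpath(xpath)
        if nodes:
            tree = nodes[0]
    for tag in NOISE_TAGS:
        for el in tree.findall(f".//{tag}"):
            parent = el.getparent()
            if parent is not None:
                parent.remove(el)
    return lhtml.tostring(tree, encoding="unicode")
```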

The cleaned HTML gets sent to an LLM with a carefully crafted prompt and a Pydantic schema that defines exactly what a “show” looks like. The model returns structured JSON with fields like title, date, time, venue, performers, ticket price, and description. Pydantic validates all of this—dates get normalized to ISO-8601, prices get converted to floats, empty strings get filtered out.
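A cut-down version of that schema, using Pydantic v2. The fields mirror the ones listed above, but the exact production model (and its validators) is an assumption:

```python
# Minimal extraction schema: Pydantic coerces ISO date strings to datetime
# and numeric strings to float, and a validator rejects empty strings.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, field_validator

class Show(BaseModel):
    title: str
    date: datetime                      # ISO-8601 in, datetime out
    venue: str
    performers: list[str] = []
    ticket_price: Optional[float] = None
    description: Optional[str] = None

    @field_validator("title", "venue", "description")
    @classmethod
    def reject_empty(cls, v):
        if isinstance(v, str) and not v.strip():
            raise ValueError("empty string")
        return v
```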

But wait, there’s more! The initial page often just has basic info, with links to individual show pages that have the full details. So the scraper can follow those links (I call them “enrichment URLs”) and re-run the extraction to fill in missing data. Found a street address? Let’s geocode it. Found performer names? Let’s check if they match our database of known comics and their Instagram handles.
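The merge rule for enrichment is the simple one implied above: a detail page only fills fields the listing page left blank, never overwrites them. Something like (function name is mine):

```python
# Enrichment merge sketch: detail-page values fill gaps, listing-page values win.
def enrich(listing, detail):
    merged = dict(listing)
    for key, value in detail.items():
        if value and not merged.get(key):
            merged[key] = value
    return merged
```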

The Deduplication Dance

This is where things get tricky. The same show might appear on multiple sites. A touring comic might be listed on the venue’s website, the promoter’s website, and the comic’s own site. My system needs to figure out that these are all the same show and merge the data intelligently.

The deduplication algorithm assigns a similarity score to every pair of shows based on:

  • Start time (exact match = high score, within 30 minutes = medium score)

  • Venue name (using fuzzy string matching because “The Crown” might also be listed as “Crown Baltimore” or “The Crown Comedy”)

  • Title overlap (after normalizing things like “featuring” vs “ft.”)

  • Performer overlap (normalized to lowercase, stripped of common titles like “headliner”)

Shows scoring above a threshold get merged, with the richer data winning. If one listing has a ticket link and another has a full description, the merged show gets both. The duplicate records get marked as merged and hidden.
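The scorer and merge, boiled down to stdlib Python. The weights, the 0.6 test threshold, and the use of `difflib` for fuzzy matching are illustrative (the post specifies the signals, not the numbers, and I've left performer overlap out for brevity):

```python
# Simplified pairwise similarity + "richer data wins" merge.
from datetime import datetime
from difflib import SequenceMatcher

def fuzzy(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity(a, b):
    score = 0.0
    delta = abs((a["start"] - b["start"]).total_seconds())
    if delta == 0:
        score += 0.4                  # exact start time: high score
    elif delta <= 30 * 60:
        score += 0.2                  # within 30 minutes: medium score
    score += 0.3 * fuzzy(a["venue"], b["venue"])
    score += 0.3 * fuzzy(a["title"], b["title"])
    return score

def merge(a, b):
    # For each field, keep the non-empty / longer value: a ticket link from
    # one listing and a full description from the other both survive.
    return {k: max((a.get(k), b.get(k)), key=lambda v: len(str(v or "")))
            for k in set(a) | set(b)}
```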

The Human-in-the-Loop Problem

Even with all this automation, I needed a way for humans to submit shows I missed. But I couldn’t just let anyone add stuff directly—I learned that lesson from every platform that’s ever had user-generated content.

The submission form is dead simple: upload a flyer, fill in the details, submit. Behind the scenes, a Cloud Function sanitizes the input, uploads the flyer to Cloud Storage, and adds the submission to Firebase with a needs_review flag.

But here’s where it gets interesting. Before I even see these submissions, they go through automated moderation. The flyer gets analyzed by an LLM with vision to check for hate symbols, explicit content, or anything else that would be problematic. The text gets run through an LLM moderation endpoint. Only if both checks pass does it show up in my review queue.
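The gate itself is just "both checks must pass." In this sketch the checkers are injected callables so the logic is testable without network access; in production they would wrap a vision-capable LLM and a text-moderation endpoint (the post doesn't specify which models):

```python
# Submission gate sketch: image_ok and text_ok are the two automated checks.
def review_gate(submission, image_ok, text_ok):
    if image_ok(submission["flyer_url"]) and text_ok(submission["text"]):
        return "queued"    # lands in the admin review dashboard
    return "rejected"      # never reaches a human
```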

Even then, I built myself a nice little admin dashboard (also a static site, naturally) where I can quickly review human submissions. The whole review process takes me maybe 5 minutes a week, usually while I’m having my morning tea.

The Instagram Pipeline (Or: How I Learned to Hate Meta’s API Policies)

This was supposed to be the easy part. Just post the shows to Instagram, right? Wrong. So, so wrong.

First problem: Meta’s official Graph API is a walled garden for stories. You can post to your main feed programmatically, but you can’t upload stories directly; Meta forces you to go through an approved Meta Business Partner. That rules out direct integration, which is a dealbreaker for a lean, automated system built around daily event listings: nobody wants to see posts about shows that happened last week.

I looked into unofficial Instagram APIs (basically reverse-engineered mobile app protocols), but that’s a fast track to getting your account banned. Then I discovered Buffer, a social media management platform that, as a Meta partner, has the necessary access to post stories. Great! Except, my timing was terrible. Buffer’s public API is in the middle of a complete overhaul; they aren’t accepting new developers for the old one, and the new one isn’t ready. So, no direct API calls for me. But Zapier can talk to Buffer…

So here’s the Rube Goldberg machine I built: A Cloud Run Job generates Instagram story images using Pillow. It takes each show’s flyer, applies a Gaussian blur to make it a background, overlays the show details in bold text, and saves it as a 1080x1920 PNG. These images get uploaded to Cloud Storage with public URLs.
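The image step, stripped to its essentials with Pillow. The layout coordinates and the default font are stand-ins (the real job would bundle a proper TTF), but the blur-background-plus-overlay approach is the one described:

```python
# Story renderer sketch: blurred flyer fills the 1080x1920 frame, the sharp
# flyer sits centered on top, and the show details are drawn underneath.
from PIL import Image, ImageDraw, ImageFilter, ImageFont, ImageOps

STORY_SIZE = (1080, 1920)

def render_story(flyer, title, when):
    base = ImageOps.fit(flyer.convert("RGB"), STORY_SIZE)      # crop to fill
    bg = base.filter(ImageFilter.GaussianBlur(18))             # background blur
    fg = ImageOps.contain(flyer.convert("RGB"), (960, 1200))   # scale to fit
    bg.paste(fg, ((STORY_SIZE[0] - fg.width) // 2, 240))
    draw = ImageDraw.Draw(bg)
    font = ImageFont.load_default()    # production: a bundled bold TTF
    draw.text((60, 1560), title, font=font, fill="white")
    draw.text((60, 1640), when, font=font, fill="white")
    return bg
```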

The job then writes rows to a Google Sheet (yes, really) with the image URLs, captions, hashtags, and scheduled post times. Zapier monitors this sheet and, when it sees a new row, creates a Buffer post. Buffer then posts both the weekly calendar posts and the daily stories to Instagram at the scheduled times.
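Each queued story is just one spreadsheet row. The column order, sheet name, and caption format here are assumptions; in production a gspread call along the lines of `gspread.service_account().open("story-queue").sheet1.append_row(row)` would do the actual write:

```python
# One story -> one row for Zapier to pick up. Column layout is illustrative.
def story_row(show, image_url, post_at):
    caption = f"{show['title']} at {show['venue']}, {show['when']}"
    return [image_url, caption, "#baltimore #comedy", post_at]
```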

It’s ridiculous. It involves three different third-party services. It probably violates some cosmic law of software architecture. But it works, and it’s been running reliably for months without me touching it.

The Static Site That Could

The actual website is almost embarrassingly simple compared to the backend. It’s vanilla HTML, CSS, and JavaScript. No React, no Vue, definitely no Angular. Just old-school DOM manipulation like it’s 2012.

The trick is that it’s not really static—it’s “static-ish.” When the rendering job runs, it:

  1. Queries Firebase for all shows in the next 6 months

  2. Groups them by date and venue

  3. Randomly shuffles shows at the same time (so everybody gets a fair shot at being listed first)

  4. Pre-renders the first three days directly into the HTML

  5. Dumps everything else into a JSON file that gets loaded on demand

The initial page load is instant because those first three days are already in the HTML. As you scroll or change dates, JavaScript loads more from the JSON file. The whole thing is fronted by Google’s CDN with aggressive caching headers.
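The render steps above, condensed to their data flow. The 182-day window, the per-day grouping (the post shuffles within the same time slot; this sketch shuffles per day), and the file layout are simplifications:

```python
# Static-ish render sketch: window, group, shuffle, split into inline HTML
# data (first N days) and a deferred JSON payload for everything else.
import json
import random
from collections import defaultdict
from datetime import date, timedelta

def build_site(shows, today=None, prerender_days=3):
    today = today or date.today()
    by_day = defaultdict(list)
    for show in shows:
        d = date.fromisoformat(show["start"][:10])
        if today <= d <= today + timedelta(days=182):   # roughly 6 months out
            by_day[d].append(show)
    for day_shows in by_day.values():
        random.shuffle(day_shows)                       # fair shot at first billing
    cutoff = today + timedelta(days=prerender_days)
    inline = {d.isoformat(): s for d, s in by_day.items() if d < cutoff}
    deferred = {d.isoformat(): s for d, s in by_day.items() if d >= cutoff}
    return inline, json.dumps(deferred)   # inline -> baked into HTML; rest -> shows.json
```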

Fun detail: I use the Intersection Observer API to lazy-load show flyers as you scroll. The images are served in WebP with JPEG fallbacks, all automatically generated by the rendering job. There’s even a dark mode that triggers based on your system preferences, because I’m not a monster.

The Monitoring You Never See

Every time a scrape runs, success or failure, it logs telemetry to Firebase. A monitoring job runs daily to generate Plotly charts showing success rates, response times, and data quality metrics for each venue. These get rendered as static HTML and deployed alongside the main site.

Mostly it’s where I find out that I forgot to reload my AI API credits, or that a target venue’s site had a redesign.

The Beautiful, Ridiculous Whole

So what did I end up with? A system that:

  • Scrapes comedy venues automatically

  • Deduplicates and merges listings intelligently

  • Accepts user submissions with automated moderation

  • Generates a static website that can handle any amount of traffic

  • Posts to Instagram stories daily without any manual intervention

  • Monitors itself and alerts me if something breaks

  • Costs me a small, fixed amount in infrastructure

  • Requires maybe 5 minutes of my time per week

Is it overengineered? Absolutely. Could I have solved this problem more simply? Without question. But would I have learned about Buffer’s hidden API, or how to make LLMs parse the disaster that is most venue websites, or how to build a CDN-friendly site that still feels dynamic? Definitely not.

More importantly, it actually solved my problem. I check Baltimore Comedy Place’s Instagram stories every morning to see what’s happening that night. I’ve discovered comics I never would have found otherwise. I’ve shown up to weird little shows in the back rooms of bars that I didn’t even know had comedy. And occasionally, someone will mention they saw a show because of “that comedy Instagram,” and I get to feel like I built something useful.