URL Hunting (Sniffing, Spelunking, Parsing…)


A good URL is hard to find.  Seriously!  I know, I know, you’re no doubt rolling your eyes, but in programming, the ability to suss out undocumented schemas and protocols turns out to be a big, big thing.

Take my latest project.  I’ve been writing a Tumblr downloader.  For the most part, the entire thing has been very straightforward, given that the Tumblr folks have an excellent API and documentation.  Using their specs, it was short work to create the core functionality of my product (which I plan to release as shareware!).  I wasn’t keeping tight track of the time, but it certainly took less than three hours to create the core image and video crawler and downloader.  Can you say “outstanding”?

As an aside, the Task Parallel Library is an absolute gem.  Not only is it ferociously fast (as in, I was able to download 41,000 pictures from Tumblr in less than an hour, even though I hadn’t optimized a thing), but even better, it’s ridiculously easy to work with.  Once you grasp the Zen of TPL, your code chunks up into these perfect little self-documenting pieces.  If you haven’t used TPL, run, don’t walk.  You’ll be more than pleased.

Anyway, back to the URL thing.  It turns out that a lot of people embed YouTube videos in their Tumblr blogs.  Naturally, I thought it would be a good thing to download YouTube videos along with the native Tumblr videos.  Dare I say, it even seemed like a fairly straightforward task.  But noooooo.  Parsing a valid video URL and query string from their spaghetti script turns out to be more than a sane man should be willing to do.  I got it to work “most” of the time, but inasmuch as I was going for 100%, I couldn’t in all good conscience foist my code upon the world.  The big thing that drove me nuts was that I’d download a video with no problems, then two minutes later, downloading the same video would yield a 403 (Forbidden) or 404 (Not Found) error.

As things turn out, to download YouTube videos you have to become an absolute master of the URL.  I could just about imagine myself doing that if the scope of the problem were relatively small.  Given the fact that there appear to be about a million query-string variations and that I’d be doing it without docs, though, it was easy to imagine it’d become a maintenance nightmare.  Perusing the net, I found more than ample proof that others had indeed experienced my pain.
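For anyone who hasn’t had to pick a URL apart by hand, here is a minimal sketch of the very first step: splitting an address into its host, path, and query-string parameters.  It’s in Python purely for brevity, it isn’t the code from my downloader, and the URL and every parameter name in it (id, quality, expire, sig) are made up for illustration; they are not YouTube’s actual scheme.

```python
from urllib.parse import urlsplit, parse_qs


def describe_url(url):
    """Split a URL into its host, path, and a dict of query parameters."""
    parts = urlsplit(url)
    # parse_qs maps each key to a list of values, because a key may
    # legally appear more than once in a query string.
    params = parse_qs(parts.query, keep_blank_values=True)
    return parts.netloc, parts.path, params


# Hypothetical URL, invented for this example; not a real video address.
example = (
    "https://video.example.com/playback"
    "?id=abc123&quality=medium&expire=1700000000&sig=deadbeef"
)

host, path, params = describe_url(example)
print(host)
print(path)
for name, values in sorted(params.items()):
    print(name, "=", values[0])
```

That part is the easy bit.  The pain starts when the parameters themselves are undocumented and keep changing, because then no amount of tidy parsing tells you which ones matter.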

Anyway, the whole YouTube thing was nothing more than a “nice-to-have”.  Inasmuch as my last article (Not Invented Here) was all about not dipping one’s toes into such murky waters, I decided to drop the entire thing.  To resort to Yiddish, it was too much “tsuris.”

Even so, there may be those foolish few who have the wherewithal to tread where I daren’t (dursn’t?!?).  Anyway, if you want to, you can download it from Git.  Enjoy…
