Truth be told, like most of my ilk—programmers—I’m a closet megalomaniac. I want to wield power, to envision and forge new worlds, to grapple with the nearly ungrappleable. More to the point, I’m more than a little convinced that I’m just the man for the job.
On most days, this sort of thing tends to be fairly out of reach. On this day, though, I got to build Gulp!
For the uninitiated, Gulp could be described as nothing more than a simple web-crawler: a sort of robot that downloads web pages to snarf up images. Nothing groundbreaking in that; there are countless similar apps to be found on the net. What’s truly amazing, though, is that I was able to craft such a program in mere hours.
In the before-time, something like a decade ago, I did something similar, although it took more than a week to barely get going. Even then, it never worked quite right, and it wasn’t especially fast since it was multi-threaded but not especially concurrent.
On the other hand, Gulp is ridiculously fast and optimally concurrent. To put things in perspective, I gulped down 1.3 million image files in the last 27 hours. Now I do have a Verizon FiOS 75/35 Mbps connection, plus an 8-core machine with lots of memory, so my tech isn’t holding things back overmuch. Even so, the numbers are pretty impressive.
Most impressive of all, the program has fewer than 600 lines of code, most of which have nothing to do with either multithreading or even downloading. Better yet, the whole thing was built upon the superlative TPL Dataflow library. In case you’re wondering, I simply adore Dataflow, especially when combined with the new async / await functionality that came out in .NET Framework 4.5.
I won’t go into the nitty-gritty here, because others have paved that road well, but if you have the slightest interest in writing performant code, I very much recommend reading Mike Heydt’s excellent series on the subject:
- TDF #0: Introduction to Task Parallel Dataflow Library
- TDF #1: The Basics of ActionBlock
- TDF #2: Basic Concurrency with ActionBlocks
- TDF #3: Using BufferBlock and LinkTo to Route Data
- TDF #4: The WriteOnceBlock
- TDF #5: BroadcastBlock
- TDF #6: Using the TransformBlock to Modify Data in the Network
As to Gulp, you can download the source from GitHub. The clever Dataflow bits are in the Spider.cs file.
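For a rough taste of the shape such a pipeline can take, here is a minimal sketch. It is emphatically not the code from Spider.cs: the MiniSpider class, the RunAsync method, and the fetchAsync and save delegates are all names I made up for illustration, and the parallelism and capacity numbers are arbitrary. It assumes the System.Threading.Tasks.Dataflow NuGet package and a compiler new enough for value tuples (C# 7 or later). The idea is simply a TransformBlock that fetches concurrently, linked to an ActionBlock that stores the results, with bounded queues so that a fast producer cannot pile up work in memory.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

// Hypothetical illustration only; not taken from Gulp's Spider.cs.
public static class MiniSpider
{
    // fetchAsync and save are caller-supplied stand-ins for "download" and "write to disk".
    public static async Task<int> RunAsync(
        string[] urls,
        Func<string, Task<byte[]>> fetchAsync,
        Action<string, byte[]> save)
    {
        int saved = 0;

        // Stage 1: fetch up to 8 URLs at once (arbitrary number);
        // the bounded capacity limits how many items the block will hold.
        var download = new TransformBlock<string, (string Url, byte[] Data)>(
            async url => (url, await fetchAsync(url)),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 8,
                BoundedCapacity = 100
            });

        // Stage 2: store results one at a time (the default degree of parallelism is 1).
        var store = new ActionBlock<(string Url, byte[] Data)>(
            item =>
            {
                save(item.Url, item.Data);
                saved++;
            },
            new ExecutionDataflowBlockOptions { BoundedCapacity = 100 });

        // When the download stage finishes, tell the store stage to finish too.
        download.LinkTo(store, new DataflowLinkOptions { PropagateCompletion = true });

        // SendAsync waits whenever the download block is full, which throttles this loop.
        foreach (var url in urls)
            await download.SendAsync(url);

        download.Complete();     // no more input is coming
        await store.Completion;  // wait for everything in flight to drain
        return saved;
    }
}
```

Calling it looks something like `await MiniSpider.RunAsync(urls, url => httpClient.GetByteArrayAsync(url), (url, data) => { /* write the bytes somewhere */ })`, where httpClient is an HttpClient you supply. Note that in this sketch a fetch that throws will fault the whole pipeline, so a real crawler would want to catch and log failures inside the fetch delegate.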
Enjoy…
BIG NOTE: Gulp is a very indiscriminate web-crawler and (within the bounds of my admittedly minimal testing!) very good at not running out of memory. If you leave it unsupervised, it will gladly fill your disk with millions, maybe even billions of files. Caveat emptor!!