Entries from November 2007 ↓

What the Falling Cost of Computing Teaches Us

The New York Times opened up its archives this year, allowing access to about 11 million public domain articles from 1851 to 1922 on its website.

The story behind this massive electronic publishing effort, which leveraged Amazon’s S3 and EC2 Web Services as well as the MapReduce algorithm implemented with Hadoop, stands as a testament to how falling computing costs eliminate technical barriers and let a single developer solve in days problems that might previously have taken teams months. Starting with 4TB of source data in TIFF image format, the conversion of these files into PDF format took less than 24 hours to complete, running on 100 parallel Amazon EC2 machine instances. At market rates for EC2, this represents about $240 of computing, plus another $410 for upload bandwidth and presumably another $41 or so to store the original source material for the two days of the run. The resulting 1.5TB of produced PDF data, at Amazon market rates, would cost the NY Times around $8 a day to host with S3, plus bandwidth costs to serve the content. Based on the presented numbers, the average article size is around 146K, which means that their bandwidth costs, when they have to rely on Amazon, are about $0.13 for every 7,200 articles they have to pull (assuming no local caching at the NYT site). Even a ridiculously small advertising rate would cover this operational cost.
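For anyone who wants to check the arithmetic, here is a minimal Python sketch that reproduces the figures above. The post only says "market rates", so the four prices below are assumptions: they are the per-unit rates that happen to reproduce the quoted dollar amounts, and the 30-day month and binary (1024-based) units are assumptions as well.

```python
# Back-of-the-envelope check of the cost figures quoted above.
# ASSUMPTION: these rates are not stated in the post; they are chosen
# because they reproduce its numbers.
EC2_PER_INSTANCE_HOUR = 0.10    # dollars per instance-hour (assumed)
S3_STORAGE_PER_GB_MONTH = 0.15  # dollars per GB-month (assumed)
TRANSFER_IN_PER_GB = 0.10       # dollars per GB uploaded (assumed)
TRANSFER_OUT_PER_GB = 0.13      # dollars per GB served (assumed)

GB_PER_TB = 1024                # binary units (assumed)
KB_PER_GB = 1024 * 1024
DAYS_PER_MONTH = 30             # assumed for prorating monthly storage

# Figures taken from the post.
instances, hours = 100, 24
source_tb, output_tb = 4, 1.5
run_days = 2
articles = 11_000_000

compute = instances * hours * EC2_PER_INSTANCE_HOUR
upload = source_tb * GB_PER_TB * TRANSFER_IN_PER_GB
source_storage = (source_tb * GB_PER_TB * S3_STORAGE_PER_GB_MONTH
                  * run_days / DAYS_PER_MONTH)
hosting_per_day = output_tb * GB_PER_TB * S3_STORAGE_PER_GB_MONTH / DAYS_PER_MONTH

# Average article size, and how many articles fit in one GB of transfer.
avg_article_kb = output_tb * GB_PER_TB * KB_PER_GB / articles
articles_per_gb = KB_PER_GB / avg_article_kb

print(f"compute:          ${compute:.2f}")
print(f"upload:           ${upload:.2f}")
print(f"source storage:   ${source_storage:.2f}")
print(f"hosting per day:  ${hosting_per_day:.2f}")
print(f"avg article size: {avg_article_kb:.0f}K")
print(f"articles per ${TRANSFER_OUT_PER_GB:.2f} of bandwidth: {articles_per_gb:.0f}")
```

Under these assumptions the articles-per-gigabyte figure comes out slightly below 7,200, which suggests the number in the text is rounded.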

Another story lurks here–since all works published in the United States before 1923 are in the public domain, and not tainted by the Mickey Mouse Copyright Protection Act (as coined by Lawrence Lessig), the content contained in this material is free of copyright (although it’s not clear whether the PDF renderings of the articles are).

Browsing articles from a random date 85 or 100 years ago gives you a glimpse into how little the substance of public debate has changed. In the 1920s, most issues of the Times had numerous stories covering every angle of prohibition, each serving as an echo still heard in the War on Drugs, from illicit stills exploding and burning down apartment buildings, to police warnings about bad batches of hooch (redistilled wood alcohol and kerosene, a process that removed the foul taste but not the lethal formulation) blamed for 100 deaths in a single month, to the investment in technology to stop counterfeiting of certificates granted for allowable uses of alcohol. The battle over evolution raged on with op-ed pieces that could be used nearly without alteration today. During WWI, many articles felt straight out of the War on Terror, with announcements of the arrest of enemy aliens, overviews of modern fighting technology (the novelty of the German armored tank merited note), and handsome Thanksgiving dinners for the troops overseas. And in almost every issue, you find someone lamenting the fall of civil society or the problems with kids today.

It struck me that the confluence of these forces–dramatically falling computing costs, an unclosed loophole in copyright law (or more accurately, copyright law acting as intended), and the availability of a significant window into the history of the United States–connected me back to an understanding that though technology has evolved dramatically in the intervening years, public discourse and the underlying themes put forth in the media have changed very little.

Less seriously, the fact that these articles are in the public domain also means they’re fair game for reuse. Hilarity ensues.

In that vein, I offer This Day in History:

Montreal Hunts Slackers: 90 Years Ago Today
Apparently, the draft wasn’t popular with some folks in Canada, and police were engaged to round up these “slackers”. Choice quotes:

The big roundup of Slackers in Montreal began today when the police force set its dragnets and started in to sweep the city.

The first part of the city to receive attention was the red light district. There has been a big raid in this neighborhood every night for the last week, and many eligibles have been put under lock and key.

Because if you’re going to dodge the draft, what better place to hang out than with hookers?

Yet another evidence of the “tightening up” process is to be seen in the intimation to the officials of a local hockey club that the military authorities will strongly appeal in the case of every hockey player who has been exempted from service, as it is considered that too many of this class of men in the very best of physical condition are trying to dodge their duty through political influence or otherwise.

Who knew that the fastest game on ice could pose such peril to Canadian national security?

Automated Scavenger Training for Fun and Profit

My grandfather reportedly loved all manner of business in which the customers did the work and the businessman merely collected the money–perhaps the regret of a small businessman who missed getting in on the ground floor of coin-operated vending machines.

Imagine the pleasure my grandfather would have had if he’d been able to have crows collecting coins for food. From the description of the experiment:

- THE DEVICE -

The first version of the device consists of a box from which protrudes a perch, a food tray, and a funnel. The whole thing is made out of sealed wood so as to minimize noisy clanging which might result from using metal components while retaining the ability to leave the thing out in the rain.

The goal is to be able to deploy the device wherever corvids are found and to have it designed such that it will train the corvids to deposit coins in exchange for food. The device should do said training on its own without human intervention.

Inspired, I’ve decided to apply the same Skinnerian training principles to squirrels and folding money. Hilarity and early retirement ensue.

By the Numbers: The del.icio.us/popular/toread Count

As previously discussed, many of the popular links on del.icio.us tagged toread contained numbers in the titles. Now this can be quantified:

del.icio.us Sparkline

The sparkline above (courtesy Joe Gregorio’s sparklines service) shows the trend of the sum of numbers found in the titles of links in the del.icio.us/popular/toread feed over the last 14 days. (From what I can tell, the popular lists are recalculated about every 4 hours, so each bar represents one sample from each four-hour period.) The sum is calculated by a Ruby script that periodically grabs the RSS feed for the popular/toread tag, taking any representation of a number from a title in the feed (e.g., 10, Ten, 101, 2001) and adding it to the total.
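The summing step described above can be sketched roughly as follows. This is a Python illustration, not the Ruby script itself: the function names, the regular expression, and the deliberately short `NUMBER_WORDS` table are all hypothetical, and fetching and parsing the RSS feed is left out, so the titles are passed in as plain strings.

```python
import re

# ASSUMPTION: a small, illustrative word list. Which spelled-out
# numbers the original script recognizes is not known.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "twenty": 20, "fifty": 50,
    "hundred": 100,
}

# Match either a run of digits (with an optional decimal part)
# or a run of letters that might be a spelled-out number.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+")


def title_number_sum(title):
    """Add up every number found in a single title."""
    total = 0.0
    for token in TOKEN.findall(title):
        if token[0].isdigit():
            total += float(token)
        else:
            # Words that are not in the table contribute nothing.
            total += NUMBER_WORDS.get(token.lower(), 0)
    return total


def feed_number_sum(titles):
    """Add up the numbers across all titles from one sample of the feed."""
    return sum(title_number_sum(title) for title in titles)


if __name__ == "__main__":
    sample = ["10 Ways to Learn Ten Things in 2001", "101 Tips"]
    print(feed_number_sum(sample))
```

For example, the title "10 Ways to Learn Ten Things in 2001" contributes 10, 10, and 2001 to the total. Each word is counted on its own, so a compound such as "twenty one" is treated as 20 plus 1, which happens to give the right answer for simple cases but not for something like "two hundred".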

The numbers for the last two weeks:

Min: 89.5
Max: 2324.5
Mean: 697.47
Mode: 380
Median: 379

Coming soon…a way you can make a game out of this.