
SpokenWord.org aggregates podcasts, almost all of which are free, and makes it easy for users to export them to, say, iTunes. It’s a non-profit site and is all about the openness. (Disclosure: I’m on its board.) Now SpokenWord is looking for volunteers to curate podcast feeds and episodes in topics that interest them. Their curated collections will be the main feature at the SpokenWord site, because nothing knows what’s interesting to humans better than other humans do. Details here.

Matthew Ingram at GigaOM blogs about an upcoming Twitter feature called Twitter Annotations. Well, it’s not actually a feature. It’s the ability to attach metadata to a tweet. This is potentially great news, since it will give us a way to add context to tweets and to enable machine-processing of tweets, not to mention that URLs could be sent as metadata rather than as subtractions from the 140-character limit. This is yet another example of information scaling to the point where we have to introduce more information to manage it. How about one of those bogus “laws” people seem to like (well, I know I do): Information sufficiently scaled creates a need for more information.

Twitter is specifying the way in which Annotations will be encoded, but not what the metadata types will be. You can declare a “type” with its own set of “attributes.” What types? Whatever you (or, more exactly, developers and hackers) find useful. Matthew cites a number of folks who are basically positive but who express a variety of worries, including Google open advocate Chris Messina who warns that there could be a mare’s nest of standards, that is, values for types and attributes. Dave Winer takes Google to task for slagging off on Twitter for this. I agree with his sentiment that Goliath Google ought to be careful about their casual criticisms. Nevertheless, I think Chris is right: Specifying the syntax but not the actual types and attributes will inevitably give rise to confusion: What one person tags as “topic,” someone else will tag as “subject,” and some people might have the nerve to actually use words for types in, say, Spanish or Arabic. The nerve! [THE NEXT DAY: Here's Chris' original post on the topic, which is more balanced than the bit Matthew excerpts, and which basically agrees with the next paragraph:]

But, so what? I’d put my money on Ev Williams and Biz Stone any time (important note: If I had money). You couldn’t have seriously proposed an idea as ridiculous as Twitter in the first place if you didn’t deeply understand the Web. So, yes, Chris is right that there’ll be some confusion, but he’s wrong in his fear. After the confusion there will be a natural folksonomic (and capitalist) pull toward whatever terms we need the most. Twitter can always step in and suggest particular terms, or surface the relative popularity of the various types, so that if you want to make money by selling via tweets, you’ll learn to use the type “price” instead of “cost_to_user,” or whatever. Or you’ll figure out that most of the Twitter clients are looking for a type called “rating” rather than “stars” or “popularity.” There’ll be some mess. There’ll be some angry angry hash tags. But better open confusion than expecting anyone — even the Twitter Lads — to do a better job of guessing what its users need and what clever developers will invent than those users and developers themselves.
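To make the "surface the relative popularity of the various types" idea concrete, here is a minimal sketch. The annotation shape (a dict with a "type" name and an "attributes" dict) is an invented stand-in for illustration, not Twitter's actual format, and both function names are hypothetical.

```python
from collections import Counter


def type_popularity(annotations):
    """Count how often each type name appears in a batch of annotations.

    Each annotation is assumed (hypothetically) to look like
    {"type": "price", "attributes": {...}}.
    """
    return Counter(a["type"].lower() for a in annotations)


def suggest_type(candidates, popularity):
    """Among synonymous type names, return the one used most often."""
    return max(candidates, key=lambda name: popularity[name.lower()])


# Made-up sample data: three sellers say "price", one says "cost_to_user".
sample = [
    {"type": "price", "attributes": {"amount": "9.99"}},
    {"type": "Price", "attributes": {"amount": "4.50"}},
    {"type": "cost_to_user", "attributes": {"amount": "12.00"}},
    {"type": "price", "attributes": {"amount": "1.00"}},
]
popularity = type_popularity(sample)
preferred = suggest_type(["cost_to_user", "price"], popularity)
```

A newcomer who checks the tally before picking a name is doing the folksonomic convergence by hand; a client could do the same thing automatically.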

I’m embarrassed to say that I just read Randall Munroe’s fabulous color survey from early May. Readers were asked to supply names for colors. It’s a rich experiment: Naming and discrimination, gender differences, hacking, tagging, spamming, hilariousness. The results also seem to support prototype theory’s idea that we agree on what the “real” (prototypical) colors are, at least within a culture: This is blue, but that one is a variant that needs a modifier in front of it (“light blue”) or for which we use a variant name (“teal”).

Randall writes the webcomic XKCD, of course, which is the Doonesbury of his generation, except while you can imagine Garry Trudeau writing a satiric HBO series, you can’t imagine him running and analyzing a color survey.

(I heard about Randall’s color survey via the Mainstream: Christopher Shea at the Boston Globe blog. Christopher also points to Stephen von Worley’s color map. BTW, that post by Christopher also has a great note about iPad censoring a graphic version of the oft-banned James Joyce’s Ulysses. Anyway, I’ve really got to do a better job keeping up with XKCD.)

Democratized curation

JP Rangaswami has an excellent post about the democratizing of curation.

He begins by quoting Eric Schmidt (found at 19:48 in this video):

“…. the statistic that we have been using is between the dawn of civilisation and 2003, five exabytes of information were created. In the last two days, five exabytes of information have been created, and that rate is accelerating. And virtually all of that is what we call user-generated what-have-you. So this is a very, very big new phenomenon.”

He concludes — and I certainly agree — that we need digital curation. He says that digital curation consists of “Authenticity, Veracity, Access, Relevance, Consume-ability, and Produce-ability.” “Consume-ability” means, roughly, that you can play it on any device you want, and “produce-ability” means something like how easy it is to hack it (in the good O’Reilly sense).

JP seems to be thinking primarily of knowledge objects, since authenticity and veracity are high on his list of needs, and for that I think it’s a good list. But suppose we were to think about this not in terms of curation, which implies (against JP’s meaning, I think) a binary acceptance-rejection that builds a persistent collection, and instead view it as digital recommendations? In that case, for non-knowledge-objects, other terms will come to the fore, including amusement value, re-playability, and wiseacre-itude. In fact, people recommend things for every reason we humans may like something, not to mention the way we’re socially defined in part by what we recommend. (You are what you recommend.)

Anyway, JP is always a thought-provoking writer…

Search engines have traditionally focused on building lists. Increasingly, they’re turning to the rectangular display of information: Boxes and tables. Boxes require extracting the relevant information and presenting it four-square in front of the user. While lists sort in a single dimension, tables show at least two dimensions. Boxes and rectangles are useful filters.

Google today announced the further boxing and tabling of data, in response (one supposes) to Bing.com. The Google Blog recommends trying searching for dog breeds, broadway shows, catherine zeta-jones date of birth, or zebra. (Look for the “something different” list in the left margin when you do the zebra search.) I especially like the summary of sources Google gives when it flat-out answers a question.

More boxes! More tables!

Luis von Ahn of Carnegie Mellon University is giving a Berkman lunchtime talk. [NOTE: I'm liveblogging. I'm making mistakes, leaving stuff out, paraphrasing, getting things wrong. This is an unreliable record.]

Luis invented captchas, the random characters you have to type in to convince a web page that you are a human and not a hostile software program. (He shows randomly generated sequences that happened to spell out “wait” and “restart.”) Captchas are useful, he says, when you’re trying to prevent people from gaming a system by writing a program to enter data robotically. They’re also useful to prevent spammers from signing up for free email accounts. To get around this, spammers have started up sweat shops where humans type captchas all day long; it costs the spammers about $0.33/account. And some porn companies ask users to type in a captcha to see photos; the captchas are drawn from email account applications. Damn clever!

He shows some variants. A Russian one asks you to solve a mathematical limit. In India one asks you to solve a circuit. Luis says these aren’t all that effective because computers can solve both problems, but they’re still better than the “what is 1 + 1?” captchas he’s found on US sites.

He says that about 200M captchas are typed every day. He was proud of that until he realized it takes about 10 seconds to type each one, so his invention is wasting roughly 500,000 hours per day. So, he wondered if there was a way to use captchas to solve some humungous problem ten seconds at a time. The result: ReCAPTCHA. For books written before 1900, the type is weak and about 30% of the text cannot be recognized by OCR. So, now many captchas ask you to type in a word that went unrecognized when OCR’ing a book. (The system knows which words are unrecognized by running multiple OCR programs; ReCAPTCHA uses those words.) To make sure that it’s not a software program typing in random words, ReCAPTCHA shows the user two words, one of which is known to be right. The user has to type in both, but doesn’t know which is which. If the user types in the known word correctly, the system knows it’s not dealing with a robot, and that the user probably got the unknown word right.
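The two-word check can be sketched in a few lines. This is an illustration of the logic as Luis described it, not ReCAPTCHA's actual code: the class name, the image ids, and the idea of waiting for a minimum number of matching guesses before trusting a reading are all assumptions of mine.

```python
from collections import Counter, defaultdict


def normalize(word):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return word.strip().lower()


class TwoWordChallenge:
    """Hypothetical sketch of a control-word plus unknown-word check."""

    def __init__(self, min_votes=3):
        # Guesses collected for each unknown word image, keyed by an image id.
        self.votes = defaultdict(Counter)
        # Matching guesses required before a reading is trusted
        # (an assumed threshold; the talk did not give one).
        self.min_votes = min_votes

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the user passes; record the guess for the unknown word."""
        if normalize(typed_known) != normalize(known_answer):
            # Failed the control word: treat as a robot and discard the guess.
            return False
        self.votes[unknown_id][normalize(typed_unknown)] += 1
        return True

    def reading(self, unknown_id):
        """Return the agreed reading of an unknown word, or None if undecided."""
        if not self.votes[unknown_id]:
            return None
        word, count = self.votes[unknown_id].most_common(1)[0]
        return word if count >= self.min_votes else None


challenge = TwoWordChallenge(min_votes=2)
first = challenge.submit("morning", "Morning ", "img-1", "upon")
robot = challenge.submit("morning", "mourning", "img-1", "xxxx")
second = challenge.submit("overlooks", "overlooks", "img-1", "Upon")
```

The user never learns which word was the control, so the only reliable way to pass is to type both honestly.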

ReCAPTCHA is a free service. Sites that use it have to feed back the entries for the unknown word. About 125,000 sites use it. They’re doing about 70M words per day, the equivalent of 2-4M books per year. If the growth continues, they’ll run out of books in 7 years, but Luis doesn’t think the growth will continue, so it might take twenty years. (There are 100M books.)

(In response to a backchannel question, Luis tells the penis captcha story.)

The ReCAPTCHA system filters out nationalities, known insult terms, and the like, to avoid unfortunate juxtapositions. It’s soon going to be released in 40 languages. Google acquired ReCAPTCHA.

Q: When will OCR be good enough to break captchas?
A: I don’t know. We’ll probably run out of books first.

Q: Business model?
A: Google Books gets help digitizing.

ReCAPTCHA “reuses wasted human processing power.” The average American spends 1.9 seconds per day typing captchas. We also spend 1.1 hours a day playing electronic games. We humans spent 9B hours playing solitaire in 2003. It took less than a day of that to build the Panama Canal. So, Luis switches topics a bit to talk about how to solve human problems by playing games.

First is tagging images with words. Image search works by looking at file names and html text, because computers can’t yet recognize objects in images very well.

Does typing two words take twice as long as typing random letters? No, it takes about the same time, he says. Luis says about 10% of the world’s population have typed in a captcha. The ESP game asks two people unknown to each other to label an image until they agree. The game taboos words that other players have already agreed on. The system passes images through until they get no new labels. They’ve gotten over 50M agreements. 5,000 players playing simultaneously could label all Google images in a month. Google has its own version; Google has an exclusive license to the patent.
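Here is a rough sketch of the agree-then-taboo loop just described. It is my reconstruction from the talk, not the ESP game's code: the function names are made up, guesses are interleaved one per turn for simplicity, and an image is retired the first time a pair fails to agree on anything new.

```python
def play_round(guesses_a, guesses_b, taboo=()):
    """Return the first label both players have typed, or None if they never agree.

    guesses_a and guesses_b list each player's guesses in the order typed.
    Taboo words (labels earlier pairs already agreed on) are ignored.
    """
    banned = {w.lower() for w in taboo}
    seen_a, seen_b = set(), set()
    for i in range(max(len(guesses_a), len(guesses_b))):
        if i < len(guesses_a):
            guess = guesses_a[i].lower()
            if guess not in banned:
                if guess in seen_b:
                    return guess
                seen_a.add(guess)
        if i < len(guesses_b):
            guess = guesses_b[i].lower()
            if guess not in banned:
                if guess in seen_a:
                    return guess
                seen_b.add(guess)
    return None


def label_image(rounds):
    """Run successive pairs on one image; each agreement becomes taboo for later pairs."""
    labels = []
    for guesses_a, guesses_b in rounds:
        agreed = play_round(guesses_a, guesses_b, taboo=labels)
        if agreed is None:
            break  # no new label this time: retire the image
        labels.append(agreed)
    return labels


# Three made-up pairs of players looking at the same photo.
rounds = [
    (["dog", "grass"], ["puppy", "dog"]),
    (["dog", "grass", "ball"], ["ball", "grass"]),
    (["dog"], ["grass"]),
]
labels = label_image(rounds)
```

The taboo list is what pushes later pairs past the obvious label and toward the less obvious ones.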

Q: Demographics?
A: For my version, average age is 29 (with huge variance), evenly split between women and men.

Q: Compared to Flickr tags?
A: Only a small fraction of Flickr images have useful tags. The tags from Flickr tend to be significantly more exact, but also significantly noisier (e.g., a person tagging an image in a way that means something idiosyncratic).

Q: Bots?
A: Yes, we don’t want you to wait for a partner, so sometimes we’ll give you a bot that replays the moves a human had made with the same image.

Q: Google Images benefits from its version of your game. Who benefits from your version of the game?
A: No one.

For some images, guesses change over time. E.g., a Britney Spears photo five years ago got labels like britney and hot. About two years ago, the labels changed to crazy, rehab, and shaved head. Now they’re back to britney and hot. By watching a player for 15 mins, you can guess whether the player is male or female with 95-98% accuracy.

Why do people like the ESP game? Sometimes they feel an intimacy with their partners. They have to step outside of themselves to make the match. They can have a sense of achievement.

He ends by saying that about the same number of people (100,000) have worked on each of humanity’s big projects, e.g., pyramids, Panama Canal, putting a person on the moon. That’s in part (he says) because it is so hard to coordinate large numbers of people. Now we can get 100M people to work on something. What can we do?

Clay Shirky has given us a surprising number of Internet myths. And by this I mean not falsehoods but the opposite: Broad, illuminating ways of making sense of what’s going on. For example, Clay’s post about the power law distribution of links in the blogosphere (based on research by Cameron Marlow) changed how we view authority, fame, and success in the Web ecosystem, and provided the structure within which Chris Anderson could point to the Long Tail. And Clay’s Ontology Is Overrated made clear that a change in how we categorize our world affects very real power relationships; that essay was highly influential, including on my own Everything Is Miscellaneous.

Clay’s new post — The Collapse of Complex Business Models — gives us a broad way of understanding why those who used to provide us with content will not be the ones who give us content in the future…and why they cannot fathom why not.

Giulia Ricci investigates:

the shift between order and disorder within different systems, which is the reason why I recurrently use geometrical grids, although on a more abstract level I am also interested in systems of categorisation and lists and how these can be visualised with diagrams and geometrical drawings.

For example, take a look at these. I find them fascinating as they swim close to resolution but never quite make it.

Want to see one way to use the Web to teach? Berkman’s Jonathan Zittrain and Stanford Law’s Elizabeth Stark are teaching a course called Difficult Problems in Cyberlaw. It looks like they have students creating wiki pages for the various topics being discussed. The one on “The Future of Wikipedia” is a terrific resource for exploring the issues Wikipedia is facing.

Among the many things I like about this approach: It implicitly makes the process of learning (which we have traditionally taken as an inward process) a social, outbound process. By learning this way, we are not only enriching ourselves, but enriching our world.

My only criticism: I wish the pages had prominent pointers to a main page that explains that the pages are part of a course.

New media generally don’t replace old media, as Marshall McLuhan pointed out. After TV we still have radio. After telephones we had telegrams for a good long while. So what about books? After we have networked digital books, we’ll still have and produce physical books. But will physical books be as ubiquitous and culturally important as radio? Or will they be as cherished but infrequently attended as live theater?

In my interview with Cory Doctorow, I wondered, in the midst of an overly-elaborate three-part question, whether ebooks will provide enough of what we value about physical books (pbooks) that pbooks will lose the historic significance Cory had pointed to.

We won’t know the answer until we invent the future. But, I’m going to hypothesize, predict, or stipulate (pick one) that at some point we will have ebooks (which may be distinct hardware or be software running in some other device we carry around), with paper-quality displays that are full-color and multimedia, that are fully on the Net, with software that lets us interact with the book and with other readers, that are a part of the standard outfitting of citizens, and within a physical environment that provides ubiquitous Net connectivity.

Those are a lot of assumptions, of course, and each and every one of them could be disrupted by some 17 year old at work in her parents’ basement. Nevertheless, if the future is something like that, then what of pbooks’ value will be left unreplaced by ebooks?

Readability. I’m assuming paper-quality displays, which may turn out to be unattainable without having to wheel around batteries the size of suitcases. But, even without that, the ability of ebooks to display text in various fonts and sizes should remove this advantage from pbooks.

Convenience. I am assuming that ebooks will be more convenient than pbooks: as good in sunlight as pbooks, at least as easy to hold and use, easier to use for those with certain disabilities, long enough battery life, possibly self-lit, etc. The biggest open question, I believe, is whether it will be as easy to annotate ebooks…

Annotatability. The current crop of ebooks make highlighting passages and making notes so difficult that you have to take a break from reading to do either of those things. But, that’s one big reason why the current crop of ebooks are pathetic. With a touchscreen and a usable keyboard (or handwriting recognition software), ebooks of the future should be as easy to annotate as a pbook is. And those annotations will then become more useful, since they will be searchable and sharable.

Affordability. The marginal cost of producing ebook content is tiny, which doesn’t mean prices will drop as dramatically as we might like. Nevertheless, it’s hard to imagine a world in which ebook content costs more than pbooks.

Social flags. You probably carefully choose which book you’re going to bring with you on a job interview, and which books get moved to the shelves in your living room. We use the books we own as tribal flags, as Cory points out. Ebooks can serve the same role when introduced into social networks, including social networks explicitly built around books, such as LibraryThing.com. They obviously don’t work in physical space that way; if you want to show off your books to people who visit your home, you’re going to have to get physical copies.

Aesthetic objects. Many of us love the feel and smell of books. While ebooks might be able to simulate that in some way — maybe their page displays could yellow over time — it’d still just be a simulation. While ebooks will undoubtedly develop their own aesthetics, so that we’ll call people over to see how beautiful this or that new ebook is, they can’t replace the particular aesthetics of pbooks. So, those who love pbooks will continue to cherish them.

Sentimental objects. For my bar mitzvah, some friend of my parents gave me a leatherbound copy of A.E. Housman’s “A Shropshire Lad” and other poems. It was a beautiful aesthetic object, but I also understood that it had a personal meaning to the giver. I doubt that that particular copy did — I don’t think it came from his own collection — but the physicality of the book was itself a marker for the personal meaning it had for the giver. As Cory says, the books your father read — the very copies that were in his hands — probably have special meaning to you. It’s hard to see how ebooks could have the same sentimental value, except perhaps if you are reading the highlights and notes left by your father, and even then, it’s not the same.

Historic objects. Likewise, knowing that you’re looking at the very copy that was read by Thomas Jefferson gives a book an historic value that ebook content just can’t have. It’s hard to see how an author could autograph an ebook in any meaningful way.

Historical objects. As John Seely Brown and Paul Duguid have pointed out, as has Anthony Grafton, books as physical objects collect metadata that can be useful to historians, e.g., the smell of vinegar that indicates the book came from a town visited by cholera. Ebooks, however, accumulate and generate far more metadata. So, we will lose some types of metadata but gain much more…maybe more than our current norms of privacy are comfortable with.

Specialized objects. It will take somewhere between an improbably long time and forever for all collections of pbooks to be digitized. Thus, books in special collections are likely to be required well after we can take the presence of ebooks for granted.

Possessions. We are headed towards a model that grants us licenses to read books, but not outright ownership. (This is Cory’s main topic in the interview.) If we lose ownership of ebooks, then they won’t have the sentimental value, they will lose some of their economic value to readers (because we won’t be able to resell them or buy them cheaper used), and we won’t be as invested in them culturally. Whether ebooks will be ownable, and whether that will be the default or the exception, is unresolved.

Single-mindedness. Books are the exemplar in our culture of thinking. We write our best thoughts in books. We engage with the best thoughts of others by reading books. Books encourage and enable long-form thinking. Ebooks, because they are (ex hypothesi) on the Net, are distracting. They string together associated chunks and tempt us with links beyond themselves. It is easy to imagine ebooks providing the single-minded pbook experience: “Press here to remove all links.” But, of course, you could always unpress the button. Besides, since your ebook is on the Net (ex hypothesi), all that’s stopping you from jumping out of the book and into your email or Facebook is self-discipline. So, while ebooks can provide the single-minded experience of pbooks, some of us may prefer the paper version to keep the distraction of the Net at bay.

Religious objects. Some books have special meaning within some religions. It’s hard to imagine, for example, that an ebook is going to replace the Torah scrolls in synagogues. In fact, orthodox Jews can’t use electronic devices on the Sabbath, so they are certainly going to continue to buy pbooks. But, this is the very definition of a specialty market.

So, what does all this mean for the future of books? It depends.

First, are there other values of pbooks that I left off the list?

Second, I haven’t listed any unique advantages of ebooks. For example, ebooks will allow social reading: Engaging with others who are reading the book or with the traces left by those who have already read it. That’s pretty important. Also, ebooks are likely to radically reduce the cost of reading, especially of some categories of overpriced pbooks (e.g., textbooks). Also, ebooks will make it much easier to understand the content of books through embedded dictionaries, search capabilities, and links to explanatory discussions. Also, as more of the corpus gets digitized, ebooks will make it far easier for scholars to pursue the footnotes (except they’ll be embedded links, not footnotes). Also, ebooks will incorporate multimedia. Also, reading ebooks will build a searchable personal corpus that is far more useful to us than bookcases filled with our conquered pbooks. Also, we’ll always have our entire library with us, ready to be read or reread, which is good news for readers.

I leave it to you to decide how this mix of values is likely to play out. What will be the social role and meaning of pbooks as we go forward into the ebook era? In twenty years (giving ourselves plenty of time to develop usable ebook readers, to digitize most of what we need, and to build an always-available network), will pbooks be used mainly by collectors, and scholars working with unique texts? Will they be sentimental objects? The poor person’s medium? Will physical books be the equivalent of AM radio, of the road company of “Cats,” of quaint objects in book museums, and/or the continuing pinnacle and embodiment of learning?
