You’ve arrived at the blog page for Everything Is Miscellaneous, which was active from 2008 to 2012. You’ll find links to some useful information about the book and its subject matter, but don’t be surprised by some dead links, etc.
To order a copy, go to your local bookstore, or Amazon, etc.
For information about me, David Weinberger, click here.
To visit the page underneath this text, click here.

Thanks - David Weinberger

Paul Deschner and I had a fascinating conversation yesterday with Jeffrey Wallman, head of the Tibetan Buddhist Resource Center, about perhaps getting his group’s metadata to interoperate with the library metadata we’ve been gathering. The TBRC has a fantastic collection of Tibetan books. So we were talking about the schemas we use — a schema being the set of slots you create for the data you capture. For example, if you’re gathering information about books, you’d have a schema with slots for title, author, date, publisher, etc. Depending on your needs, you might also include slots for whether there are color illustrations, whether the original cover is still on it, and whether anyone has underlined any passages. It turns out that the Tibetan concept of a book is quite a bit different from the West’s, which raises interesting questions about how to capture and express that data in ways that can be usefully mashed up.
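To make that concrete, here is a minimal sketch of such a schema as a Python data structure. The field names are my own illustration, not TBRC’s or anyone else’s actual standard:

    from dataclasses import dataclass

    # A minimal book schema: required slots plus a few optional ones,
    # matching the examples above. Purely illustrative.
    @dataclass
    class BookRecord:
        title: str
        author: str
        date: str
        publisher: str
        color_illustrations: bool = False
        original_cover: bool = True
        underlined_passages: bool = False

    rec = BookRecord(
        title="Everything Is Miscellaneous",
        author="David Weinberger",
        date="2007",
        publisher="Times Books",
    )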

But it was when we moved on to talking about our author schemas that Jeffrey listed one type of metadata that I would never, ever have thought to include in a schema: reincarnation. It is important for Tibetans to know that Author A is a reincarnation of Author B. And I can see why that would be a crucial bit of information.
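In schema terms, that is just one more slot: a link from one author record to another. A hypothetical sketch, with invented field names and made-up IDs:

    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical author schema. reincarnation_of holds the ID of the
    # author record this author is a reincarnation of, if any.
    # The IDs here are invented.
    @dataclass
    class AuthorRecord:
        author_id: str
        name: str
        reincarnation_of: Optional[str] = None

    author_b = AuthorRecord(author_id="P0001", name="Author B")
    author_a = AuthorRecord(author_id="P0002", name="Author A",
                            reincarnation_of="P0001")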

So, let this be a lesson: attempts to anticipate all metadata needs are destined to be surprised, sometimes delightfully.

The American Psychiatric Association has approved its new manual of diagnoses — Diagnostic and Statistical Manual of Mental Disorders — after five years of controversy [nytimes].

For example, it has removed Asperger’s as a diagnosis, lumping it in with autism, but it has split out hoarding from the more general category of obsessive-compulsive disorder. Lumping and splitting are the two most basic activities of cataloguers and indexers. There are theoretical and practical reasons for sometimes lumping things together and sometimes splitting them, but they also characterize personalities. Some of us are lumpers, and some of us are splitters. And all of us are a bit of each at various times.

The DSM runs into the problems faced by all attempts to classify a field. Coming up with a single classification for a complex domain means trying to impose an impossible order:

First, there is rarely (ever?) universal agreement about how to divvy up a domain. There are genuine disagreements about which principles of organization ought to be used, and about how they apply. And then there are the lumper vs. splitter personalities.

Second, there are political and economic motivations for dividing up the world in particular ways.

Third, taxonomies are tools. There is no one right way to divide up the world, just as there is no one right way to cut a piece of plywood and no one right thing to say about the world. It depends on what you’re trying to do. The DSM has conflicting purposes. For one thing, it affects treatment. For example, the NY Times article notes that the change in the classification of bipolar disease “could ‘medicalize’ frequent temper tantrums,” and during the many years in which the DSM classified homosexuality as a syndrome, therapists were encouraged to treat it as a disease. But that’s not all the DSM is for. It also guides insurance payments, and it affects research.

Given this, do we need the DSM? Maybe for insurance purposes. But not as a statement of where nature’s joints are. In fact, it’s not clear to me that we even need it as a single source to define terms for common reference. After all, biologists don’t agree about how to classify species, but that science seems to be doing just fine. The Encyclopedia of Life takes a really useful approach: each species gets a page, but the site provides multiple taxonomies so that biologists don’t have to agree on how to lump and split all the forms of life on the planet.

If we do need a single diagnostic taxonomy, DSM is making progress in its methodology. It has more publicly entered the fray of argument, it has tried to respond to current thinking, and it is now going to be updated continuously, rather than every 5 years. All to the good.

But the rest of its problems are intrinsic to its very existence. We may need it for some purposes, but it is never going to be fully right…because tools are useful, not true.

I’m at the Semantic Technology & Business conference in NYC. Matthew Degel, Senior Vice President and Chief Architect at Viacom Media Networks, is talking about “Modeling Media and the Content Supply Chain Using Semantic Technologies.” [NOTE: Liveblogging. Getting things wrong. Mangling words. Missing points. Over- and under-emphasizing the wrong things. Not running a spellchecker. You are warned!]

Matthew says that the problem is that we’re “drowning in data but starved for information.” There is a “thirst for asset-centric views.” And of course, Viacom needs to “more deeply integrate how property rights attach to assets.” And everything has to be natively local, all around the world.

Viacom has to model the content supply chain in a holistic way. So, how to structure the data? To answer, they need to know what the questions are. Data always has some structure. The question is how volatile those structures are. [I missed about 5 mins; had to duck out.]

He shows an asset tree, “relating things that are different yet the same,” with SpongeBob as his example: TV series, characters, the talent, the movie, consumer products, etc. Stations are not allowed to air a commercial featuring SpongeBob’s voice actor, Tom Kenny, during the showing of the SpongeBob show, so they need to intersect those datasets. Likewise, the video clip you see on your set-top box’s guide is separate from, but related to, the original. For doing all this, Viacom is relying on inferences: A prime-time version of a Jersey Shore episode, which has had the bad language censored out of it, is a version of the full episode, which is part of the series, which has licensing contracts within various geographies, etc. From this Viacom can infer that the censored episode can be shown in some geography under some licensing agreement, etc.
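Here is a rough sketch of that inference chain as RDF triples, using Python’s rdflib. The vocabulary (versionOf, partOf, licensedIn) and the identifiers are invented for illustration; they are not Viacom’s actual ontology:

    from rdflib import Graph, Namespace

    # Invented ex: vocabulary, sketching the asset relationships above.
    EX = Namespace("http://example.org/assets/")
    g = Graph()
    g.add((EX.censored_episode, EX.versionOf, EX.full_episode))
    g.add((EX.full_episode, EX.partOf, EX.jersey_shore))
    g.add((EX.jersey_shore, EX.licensedIn, EX.australia))

    # Follow the chain: version -> full episode -> series -> territory.
    for _, _, episode in g.triples((EX.censored_episode, EX.versionOf, None)):
        for _, _, series in g.triples((episode, EX.partOf, None)):
            for _, _, territory in g.triples((series, EX.licensedIn, None)):
                print(f"censored episode may be shown in: {territory}")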

“We’ve tried to take a realistic approach to this.” As excited as they are about the promise, “we haven’t dived in with a huge amount of resources.” They’re solving immediate problems. They began by making diagrams of all of the apps and technologies. It was a mess. So, they extracted and encoded into a triplestore all the info in the diagram. Then they overlaid the DR data. [I don’t know what DR stands for. I’m guessing the D stands for Digital, and the R might be Resource.] Further mapping showed that some apps that they weren’t paying much attention to were actually critical to multiple systems. They did an ontology graph in the style of a London Underground map. [By the way, Gombrich has a wonderful history and appreciation of those maps in Art and Representation, I believe.]
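That last discovery, that quietly shared apps turn out to be critical, is the sort of thing that falls out of a dependency graph almost for free. A toy illustration, with made-up system and app names:

    from collections import defaultdict

    # Hypothetical dependency edges extracted from the architecture
    # diagrams: (system, app_it_depends_on).
    edges = [
        ("scheduling", "rights_db"),
        ("ad_sales", "rights_db"),
        ("video_pipeline", "rights_db"),
        ("ad_sales", "crm"),
    ]

    dependents = defaultdict(set)
    for system, app in edges:
        dependents[app].add(system)

    # Apps relied on by more than one system are the quietly critical ones.
    for app, systems in dependents.items():
        if len(systems) > 1:
            print(f"{app} is critical to {len(systems)} systems: {sorted(systems)}")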

What’s worked? They’re focusing on where they’re going, not where they’ve been. This has let them “jettison a lot of intellectual baggage” so that they can model business processes “in a much cleaner and effective way.” Also, OWL has provided a rich modeling language for expressing their Enterprise Information Model.

What hasn’t worked?

  • “The toolsets really aren’t quite there yet.” He says that, based on the conversations he’s had today, he doesn’t think anyone disagrees with him.

  • Also, the modeling tools presume you already know the technology and the approach. And the query tools presume you have a user at a keyboard, rather than serving as the back end of a Web service that has to handle real volume. He’d like, for example, a “Crystal Reports for SPARQL”: a usable reporting tool.

  • Visualization tools are focused on interactive use. You pick a class and see the relationships, etc. But if you want to see a traditional ERD diagram, you can’t.

  • Also, the modeling tools have a “forward bias.” E.g., there are tools for turning schemas into ontologies, but not for turning ontologies into a reference model for schemas.

Matthew makes some predictions:

  • The toolsets will develop into robust tools

  • Semantic tech will enable queries such as “Show me all Madonna interviews where she sings, where the footage has not been previously shown, and where we have the license to distribute it on the Web in Australia in Dec.”
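To make that prediction concrete, here is roughly what such a query could look like in SPARQL, run here through Python’s rdflib. The ex: vocabulary is invented, and the graph is assumed to be already populated:

    from rdflib import Graph

    g = Graph()  # assume this has been loaded with asset and license triples

    # Invented ex: vocabulary, sketching the Madonna query described above.
    q = """
    PREFIX ex: <http://example.org/assets/>
    SELECT ?clip WHERE {
      ?clip ex:features ex:Madonna ;
            ex:contains ex:singing ;
            ex:previouslyShown false ;
            ex:license ?lic .
      ?lic ex:medium ex:Web ;
           ex:territory ex:Australia ;
           ex:validDuring "2011-12" .
    }
    """
    for row in g.query(q):
        print(row.clip)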

(Here’s a version of the text of a submission I just made to BoingBoing through their “Submitterator”)

Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all the 12M works in its 73 libraries. This is (I believe) the largest and most comprehensive such contribution. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API. The aim is to make rich data about this cultural heritage openly available to the Web ecosystem so that developers can innovate, and so that other sites can draw upon it.
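Since the records are standard MARC21, any MARC library can read the bulk files. A quick sketch using Python’s pymarc; the filename is a placeholder for whichever bulk file you download:

    from pymarc import MARCReader

    # Placeholder filename; use whichever bulk MARC21 file you downloaded.
    with open("harvard_bib_dump.mrc", "rb") as f:
        for record in MARCReader(f):
            if record is None:          # pymarc yields None for unreadable records
                continue
            field_245 = record["245"]   # MARC21 title statement
            if field_245 is not None:
                print(field_245["a"])   # subfield a: the title proper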

This is part of Harvard’s new Open Metadata policy which is VERY COOL.

Speaking for myself (see disclosure), I think this is a big deal. Library metadata has been jammed up by licenses and fear. Not only does this make accessible a very high percentage of the most-consulted library items, but I hope it will also help open the floodgates.

(Disclosures: 1. I work in the Harvard Library and have been a very minor player in this process. The credit goes to the Harvard Library’s leaders and the Office of Scholarly Communication, who made this happen. Also: Robin Wendler. (next day:) Also, John Palfrey who initiated this entire thing. 2. I am the interim head of the DPLA prototype platform development team. So, yeah, I’m conflicted out the wazoo on this. But my wazoo and all the rest of me is very very happy today.)

Finally, note that Harvard asks that you respect community norms, including attributing the source of the metadata as appropriate. This holds as well for the data that comes from the OCLC, which is a valuable part of this collection.

Google has announced that it is retiring Needlebase, a service it acquired with its ITA purchase. That’s too bad! Needlebase is a very cool tool. (It’s staying up until June 1 so you can download any work you’ve done there.)

Needlebase is a browser-based tool that creates a merged, cleaned, de-duped database out of multiple databases. Then you can create a variety of user-happy outputs. There are some examples here.
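To give a flavor of the merge-and-dedupe step Needlebase automated, here is a toy sketch in Python. Real entity resolution is much smarter than this crude fuzzy string match:

    from difflib import SequenceMatcher

    # Two toy source databases with an overlapping, inconsistently named record.
    source_a = [{"name": "Acme Corp", "city": "New York"}]
    source_b = [{"name": "Acme Corporation", "city": "New York"},
                {"name": "Widgets Inc", "city": "Boston"}]

    def similar(a, b, threshold=0.7):
        """Crude name-similarity test; Needlebase's matching was far smarter."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    merged = list(source_a)
    for rec in source_b:
        if not any(similar(rec["name"], kept["name"]) for kept in merged):
            merged.append(rec)

    print(merged)  # Acme appears once; Widgets Inc is added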

Google says it’s evaluating whether Needlebase can be threaded into its other offerings.

Here’s my top ten list of top ten lists of top ten lists:

  1. The Top Ten Top Ten Lists of All Time

  2. TopTenz Miscellaneous

  3. MetaCritic music lists

  4. Smosh’s Top Ten Top Ten Lists of 2011

  5. Top Ten 2011 Top Ten Lists about CleanTech

  6. Top Ten of top ten horror movie lists

  7. NYT Top Ten Top Ten Lists for 2011

  8. Top Ten Top Ten Lists about Agile Management

  9. Top Ten Top Ten Video Lists of 2011

  10. Brand Media Strategies Top Ten Top Ten Lists

Come on, people! If just nine more of you compile top ten top ten top tens we can take it up a level!


Bonus: The media can’t get enough of top ten lists. When they run out, they write about why they write about top ten lists:

  1. The New Yorker

  2. Forbes

  3. NPR

  4. NY Times

  5. Discover

  6. CBS

  7. Poynter

 


[Later that day:] I’ve removed one from the list so that there are actually ten, not eleven. Oops. (And again, later, because I am a @#$%ing moron.)

Mathew Ingram has a provocative post at Gigaom defending HuffingtonPost and its ilk from the charge that they over-aggregate news to the point of thievery. I’m not completely convinced by Mathew’s argument, but that’s because I’m not completely convinced by any argument about this.

It’s a very confusing issue if you think of it from the point of view of who owns what. So, take the best of cases, in which HuffPo aggregates from several sources and attributes the reportage appropriately. It’s important to take a best case, since we’ll all agree that if HuffPo lifts an article in toto without attribution, it’s simple plagiarism. But that doesn’t tell us whether the best cases are also plagiarisms. To make it juicier, assume that in one of these best cases HuffPo relies heavily on one particular source article. It’s still not a slam-dunk case of theft, because in this example HuffPo is doing what we teach every schoolchild to do: if you use a source, attribute it.

But, HuffPo isn’t a schoolchild. It’s a business. It’s making money from those aggregations. Ok, but we are fine in general with people selling works that aggregate and attribute. Non-fiction publishing houses that routinely sell books that have lots of footnotes are not thieves. And, as Mathew points out, HuffPo (in its best cases) is adding value to the sources it aggregates.

But HuffPo’s policy, even in its best case, can enable it to serve as a substitute for the newspapers it’s aggregating. It thus may be harming the sources it’s using.

And here we get to what I think is the most important question. If you think about the issue in terms of theft, you’re thrown into a moral morass where the metaphors don’t work reliably. Worse, you may well mix in legal considerations that are not only hard to apply, but that we may not want to apply given the new-ness (itself arguable) of the situation.

But I find that I am somewhat less conflicted about this if I think about it in terms of what direction we’d like to nudge our world. For example, when it comes to copyright, I find it helpful to keep in mind that a world full of music and musicians is better than a world in which music is rationed. When it comes to news aggregation, many of us will agree that a world in which news is aggregated and linked widely through the ecosystem is better than one in which you — yes, you, since a rule against HuffPo aggregating sources wouldn’t apply just to HuffPo — have to refrain from citing a source for fear that you’ll cross some arbitrary limit. We are a healthier society if we are aggregating, re-aggregating, contextualizing, re-using, evaluating, and linking to as many sources as we want.

Now, beginning by thinking about where we want the world to be — which, by the way, is what this country’s Founders did when they put copyright into the Constitution in the first place: “to promote the progress of science and useful arts” — is useful but limited, because to get to the desired situation, in which we can aggregate with abandon, we need the original journalistic sources to survive. If HuffPo and its ilk genuinely are substituting for newspapers economically, then it seems we can’t get where we want to go without limiting the right to aggregate.

And that’s why I’m conflicted. I don’t believe that even if all rights to aggregate were removed (which no one is proposing), newspapers would bounce back. At this point, I’d guess that the Net generation is interested in news mainly insofar as it’s woven together and woven into the larger fabric. Traditional reportage is becoming valued more as an ingredient than as a finished product. It’s the aggregators — the HuffingtonPosts of the world, but also the millions of bloggers, tweeters and retweeters, Facebook likers and Google plus-ers, redditors and slashdotters, BoingBoings and Ars Technicas — who are spreading the news by adding value to it. News now only moves if we’re interested enough in it to pass it along. So, I don’t know how to solve journalism’s deep problems with its business models, but I can’t imagine that limiting the circulation of ideas will help, since in this case the circulatory flow is what’s keeping the heart beating.

 


[A few minutes later] Mathew has also posted what reads like a companion piece, about how Amazon’s Kindle Singles are supporting journalism.

The CBC has posted the full, unedited interview with me (15 mins) that Nora Young did last week. We talk about the Harvard Library Lab’s two big projects, ShelfLife and LibraryCloud. (At the end, we talk a little about Too Big To Know.) The edited interview will be on the Spark program.

I’m at the final meeting of a Harvard course on the future of libraries, led by John Palfrey and Jeffrey Schnapp. They have three guests in to talk about physical library space.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellchecker. Mangling other people’s ideas and words. You are warned, people.

David Lamberth lays out an idea as a provocation. He begins by pointing out that until the beginning of the 20th century, a library was not a place but only a collection of books. He gives a quick history of the Harvard Library. After the library burned down in 1764, the libraries lived in fear of fire until electric lights came in. The replacement library (Gore Hall) was built out of stone because brick structures need wood on the inside. But stone structures are dank, and many books had to be re-bound every 30 years. Once it filled up, more buildings followed: some 25-30 of Harvard’s libraries derived from the search for fireproof buildings, which helps explain the large distribution of libraries across campus. They also developed more than 40 different classification systems. At the beginning of the 20th C, Harvard’s collection was just over one million volumes. Now it adds up to around 18M. [David’s presentation was not choppy, the way this paraphrase is.]

In the 1980s, there was continuing debate about what to do about the need for space. The big issue was open or closed stacks. The faculty wanted the books on site so they could be browsed. But stack space is expensive and you tend to outgrow it faster than you think. So, it was decided not to build any more stack space. There already was an offsite repository (New England Book Depository), but it was decided to build a high density storage facility to remove the non-active parts of the collection to a cheaper, off-site space: The Harvard Depository (HD).

Now more than 40% of the physical collections are at HD. The Faculty of Arts and Sciences started out hostile to the idea, but “soon became converted.” The notion faculty had of browsing the shelves was based on a fantasy: Harvard had never had all the books on a subject on a shelf in a single facility. E.g., search on “Shakespeare” in the Harvard library system: 18,000 hits. Widener Library is where you’d expect to find Shakespeare books. But 8,000 of the volumes aren’t in Widener. And of Widener’s 10K Shakespeare volumes, 4,500 are in HD. So, only about 25% of what you meant to browse is there. “Shelf browsing is a waste of time” if you’re trying to do thorough research. It’s a little better in the smaller libraries, but the future is not in shelf browsing. Open vs. closed stacks isn’t the question any more. “It’s just not possible any longer to do shelf browsing, unless we develop tools for browsing in a non-physical fashion.” E.g., catalog browsers, and ShelfLife (with StackView).
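The point of a tool like ShelfLife’s StackView is that the “shelf” can be virtual: sort the catalog by call number and show a work’s neighbors, wherever each volume physically sits. A toy sketch, with made-up records and call numbers:

    # Made-up mini-catalog; "loc" is where the volume physically lives.
    catalog = [
        {"call": "PR2894 .A1", "title": "A Life of Shakespeare", "loc": "Widener"},
        {"call": "PR2894 .B4", "title": "Shakespeare's World", "loc": "HD"},
        {"call": "PR2894 .C2", "title": "Reading Shakespeare", "loc": "Houghton"},
    ]

    def virtual_shelf(catalog, call, span=1):
        """Return the work plus its call-number neighbors, across all locations."""
        shelf = sorted(catalog, key=lambda r: r["call"])
        i = next(n for n, r in enumerate(shelf) if r["call"] == call)
        return shelf[max(0, i - span): i + span + 1]

    for r in virtual_shelf(catalog, "PR2894 .B4"):
        print(r["call"], r["title"], "(" + r["loc"] + ")")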

There’s nobody in the stacks any more. “It’s like the zombies have come and cleared people out.” People have new alternatives, and new habits. “But we have real challenges making sure they do as thorough research as possible, and that we leverage our collection.” About 12M of the 18M items are barcoded.

A task force saw that within 40 years, over 70% of the physical collection will be off site. HD was not designed to hold the part of the collection most people want to use. So, what can we do that will give us pedagogical and intellectual benefit, and that realizes the incredible resource that our collection is?

Let me present one idea, says David. The Library Task Force said emphatically that Harvard’s collection should be seen as one collection. It makes sense intellectually and financially. But that idea is in contention with the 56 physical libraries at Harvard. Also, most of our collection doesn’t circulate. Only some of it is digitally browsable, and some of that won’t change for a long long long time. E.g., our Arabic journals in Widener aren’t indexed, don’t publish cumulative indexes, and are very hard to index. Thus scholars need to be able to pull them off the shelves. Likewise for big collections of manuscripts that haven’t even been sorted yet.

One idea would be to say: Let’s treat the physical libraries as one place as well. Think of them as contiguous, even though they’re not. What if bar-coded books stayed in the library you returned them to? Not shelved by a taxonomy. Random access via the digital, which tells you where the work is. And build perfect shelves for the works that need to be physically organized. Let’s build perfect Shakespeare shelves. Put them in one building. The other, less-used works will be findable, but not browsable. This would require investing in better findability systems, but it would let us get past the arbitrariness of classification systems. Already David will usually go to Amazon to decide whether he wants a book rather than take the 5 mins to walk to the library. By focusing on perfect shelves for what is most important to be browsable, resources would be freed up. This might make more space in the physical libraries, so “we could think about what the people in those buildings want to be doing,” so people would come in because there’s more going on. (David notes that this model will not go over well with many of his colleagues.)
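The mechanics of “books stay where they’re returned” are simple once everything is barcoded: the catalog’s barcode index, rather than a classification scheme, records where a volume currently lives. A toy sketch, with a made-up barcode:

    # barcode -> (building, shelf); updated at every check-in, so volumes
    # never need to be re-shelved into a fixed classification order.
    locations = {}

    def check_in(barcode, building, shelf):
        """Record wherever the volume was returned."""
        locations[barcode] = (building, shelf)

    def find(barcode):
        return locations.get(barcode)

    check_in("B0001234", "Lamont", "R-12")   # returned at Lamont; shelved there
    print(find("B0001234"))                  # ('Lamont', 'R-12')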

53% of library space at Harvard is stack space. The other 47% is split between patron space and staff space. About 20-25% is staff space. Comparatively, Harvard has proportionally less patron space than is typical. The HD is holding half the collection in 20% of the space. It’s 4x as expensive to store a work in the stacks on campus as off site.

David responds to a question: The perfect shelves should be dynamic, not permanent. That will better serve the evolution of research. There are two independent variables: classification and shelf location. We certainly need classification, but it may not need to map to shelf locations. Widener has bibliographic lists and shelf lists. Barcodes give us more freedom; we don’t have to constantly return works to fixed locations.

Mike Barker: Students already build their own perfect shelves with carrels.

Q: What’s the case for ownership and retention if we’re only addressing temporal faculty needs?

A lot of the collecting in the first half of the 20th C was driven by faculty requests. Not now. The question of retention and purchase splits on the basis of how uncommon the piece of info is. If it’s being sold by Amazon, I don’t think it really matters if we retain it, because of the number of copies and the archival steps already in place. The more rare the work, the more we should think about purchase and retention. But under a third of the stack space on campus has ideal environmental conditions. We shouldn’t put works we buy into those circumstances unless they’re being used.

Q: At the Law Library, we’re trying to spread it out so that not everyone is buying the same stuff. E.g., we buy Peruvian materials because other libraries aren’t. And many law books are not available digitally, so we buy them … but we only buy one copy.

Yes, you’re making an assessment. In the Divinity library, Mike looked at the duplication rate. It was 53%. That is, 53% of our works are duplicated in other Harvard libraries.

Mike: How much do we spend on classification? To create call numbers? We annually spend about $1.5-2M on it, plus another million shelving it. So, $3M-3.5M total. (Mike warns that this is a “very squishy” number.) We circulate about 700,000 items a year. The total operating budget of the Library is about $152M. (He derived this number by asking catalogers how long it takes to classify an item and dividing that into salaries.)
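As back-of-envelope arithmetic on those admittedly squishy numbers, taking the midpoint of the stated total:

    total = 3.25e6        # midpoint of the stated $3M-3.5M classification + shelving spend
    circulated = 700_000  # items circulated per year
    budget = 152e6        # total Library operating budget

    print(f"${total / circulated:.2f} per circulated item")  # about $4.64
    print(f"{total / budget:.1%} of the operating budget")   # about 2.1%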

David: Scanning in tables of contents, indexes, etc., lets people find things without having to anticipate what they’re going to be interested in.

Q: Where does serendipity fall in this? What about when you don’t know what you’re looking for?

David: I agree completely. My dissertation depended on a book that no one had checked out since 1910. I found it in the stacks. But it’s not on the shelves now. Suppose I could ask a research librarian to bring me two shelves’ worth of stuff because I’m beginning to explore some area.

Q: What you’re suggesting won’t work so well for students. How would not having stacks affect students?

David: I’m being provocative but concrete. The status quo is not delivering what we think it does, and it hasn’t for the past three decades.

Q: [jeff goldenson] Public librarians tell us that the trucks of recently returned books are the most interesting place to browse. We don’t really have the ability to see what’s moving in the Harvard system. Yes, there are privacy concerns, but just showing what books have been returned would be great.

Q: [palfrey] How much does the rise of the digital affect this idea? Also, you’ve said that the storage cost of a digital object may be more than that of physical objects. How does that affect this idea?

David: Copyright law is the big If. It’s not going away. But what kind of access do you have to digital objects that you own? That’s a huge variable. I’ve premised much of what I’ve said on the working notion that we will continue to build physical collections. We don’t know how much it will cost to keep a physical object for a long time. And computer scientists all say that digital objects are not durable. My working notion here is that the parts that are really crucial are the metadata pieces, which are more easily re-buildable if you have the physical objects. We’re not going to buy physical objects for all the digital items, so the selection principle goes back to how grey or black the items are. It depends on whether we get past the engineering question about digital durability — which depends a lot on electromagnetism as a storage medium, which may be a flash in the pan. We’re moving incrementally.

Q: [me] If we can identify the high value works that go on perfect shelves, why not just skip the physical shelves and increase the amount of metadata so that people can browse them looking for the sort of info they get from going to the physical shelf?

A: David: Money. We can’t spend too much on the present at the expense of the next century or two. There’s a threshold where you’d say that it’s worth digitizing them to the degree you’d need to replace physical inspection entirely. It’s a considered judgment, which we make, for example, when we decide to digitize exhibitions. You’d want to look at the opportunity costs.

David suggests that maybe the Divinity library (he’s in the Phil Dept.) should remove some stacks to make space for in-stack work and discussion areas. (He stresses that he’s just thinking out loud.)

Matthew Sheehy, who runs HD, says they’re thinking about how to keep books 500 years. They spend $300K/year on electricity to create the right environment. They’ve invested in redundancy. But the walls of the HD will only last 100 years. [Nov. 25: I may have gotten the following wrong:] He thinks it costs about $1/year to store a book, not the usual figure of $0.45.

Jeffrey Schnapp: We’re building a library test kitchen. We’re interested in building physical shelves that have digital lives as well.

[Nov. 25: Changed Philosophy school to Divinity, in order to make it correct. Switched the remark about the cost of physical vs. digital in the interest of truth.]

Classifying folktales

Via Metafilter:

The Aarne-Thompson Classification System

Originally published by the Finnish folklorist Antti Aarne and expanded by the American Stith Thompson and the German Hans-Jörg Uther, the Aarne-Thompson Classification System is a system for classifying folktales based on motifs.

Some examples:

  • Beauty and the Beast: Type 425C
  • Bluebeard: Type 312
  • The Devil Building a Bridge: Type 1191
  • The Foolish Use of Magic Wishes: Type 750A
  • Hansel and Gretel and other abandoned children: Type 327
  • Women forced to marry hogs: Type 441
  • The Runaway Pancake: Type 2025

Wikipedia has a complete breakdown, and here you’ll find examples of most of the tale types.
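Since the system is, at bottom, a big lookup table, here are the examples above as one, with the type numbers as given:

    # The examples above as a lookup table; type numbers as listed in the post.
    ATU_TYPES = {
        "Beauty and the Beast": "425C",
        "Bluebeard": "312",
        "The Devil Building a Bridge": "1191",
        "The Foolish Use of Magic Wishes": "750A",
        "Hansel and Gretel and other abandoned children": "327",
        "Women forced to marry hogs": "441",
        "The Runaway Pancake": "2025",
    }

    print(ATU_TYPES["Bluebeard"])  # 312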
