June 20th, 2013
June 19th, 2013
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
(If a response isn’t labeled “Kevin,” then it wasn’t Kevin. Also, this is much compressed, incomplete, and choppy. Also, I haven’t re-read it.)
Q: From the Bibframe mailing list it seems like there isn’t agreement about what Bibframe is trying to achieve.
Kevin: Sometimes people see it narrowly.
Q: It’s not clear how Bibframe gets to where it replaces MARC.
Kevin: We’re not holding back some plan or roadmap that we’ve mapped out perfectly with milestones and target dates. We’re taking it as it comes.
Q: There’s a perception on the part of vendors and customers of vendors that this is a new data specification that vendors will have to support, and that that’s its main function, and possibly that’s pushing the knowledge representation in a direction that’s favorable to the vendors — a direction that’s too simple.
Q: Is there an agreement about the end point?
Kevin: There’s agreement that it needs to do what MARC does but better. We’re doing data representation, not predicting the systems built on top of it.
Q: What are the functional requirements that Bibframe’s trying to meet with this new model? What are your metrics? And who are you trying to satisfy?
Kevin: It’s not vendor focused. We hope systems will be built that expose the data as linked data.
Q: Bibframe lets you associate a record with a particular work, which is a huge advance.
Q: Bibframe used to talk about roundtripping from MARC to Bibframe to MARC. But Bibframe is now adding info, so I don’t see how roundtripping is possible.
Kevin: Not losslessly.
Q: Bibframe is intended for libraries, but from what I’ve seen it doesn’t seem that Bibframe is intended for use outside of libraries. There doesn’t seem to be any thought about how other ontologies might be overlaid. And that was a problem with MARC: it was too library-centric. Why not investigate mapping it into other vocabularies?
Kevin: Nothing stops you from including other namespaces. As for mapping to other vocabularies, we’re working on a 40-year time scale and can’t know that other vocabularies will be around.
Q: We need some community-building to make that happen. We need to be careful not to build an ontological silo.
Q: The naming of this data set is unfortunate: Why “bib,” which has a connotation of books, when really it should be about any kind of information-bearing object? Why not call it “InfoFrame”? Who uses “bibliographic” other than libraries? Why limit yourself?
Kevin: I cannot begin to tell you how much time was spent on what this thing should be called. It went through a couple of different names. It’s not an ideal name, but I hope that the “bib” association falls by the wayside.
Q: The library ecosystem includes articles, licenses, and many other things that weren’t part of MARC. Is Bibframe aiming at representing all of that?
Kevin: Yes, it’s in scope. Certainly data about journal articles.
Kevin: Yes, Bibframe lets you define your own fields, as in MARC.
Q: We’re going from cataloging to catalinking: from records about resources to links related to topics, etc.
A: We need services that will link resources to other resources. Bibframe doesn’t do that, but it’s more amenable to it than MARC.
Kevin: [Sorry, but I missed the beginning of this.] When it comes to subject headings, we expect you to resolve that URI. If people are doing that every single time, then it’s a candidate for being included. That lookup could be a query into your local system. I’ve assumed you’ll have to have a local copy of it.
Q: Versioning? Why did you ignore the work of the British Library?
Kevin: We didn’t ignore it at all. We need to attend to what’s achievable by the smallest institutions as well as the largest.
Q: For a small institution, is it practical to move away from MARC?
Kevin: Not for some. Some still use card catalogs. I expect some of the first systems will be an outward layer around legacy systems.
Q: We need a larger discussion about provenance and about trust on the semantic web. Libraries should be better participants in that discussion; it’s a deeply important space for us.
Q: This conversation makes me cynical about our profession’s involvement. We need to be talking with users. We need community involvement. We’re worried about the longevity of FOAF? It’ll outlast Bibframe because people actually use it. Let’s keep turning inward until we’re completely irrelevant.
Q: Yeah, the idea that there has to be one namespace seems so counter to the principles of linked data.
Q: Do we have anyone outside of the library community here?
A: I’m mainly a web developer. There’s a really big gulf. The Web will win when it comes to how libraries operate. Whether Bibframe will be a part of it remains to be seen. In the web community, everything seems exciting, but I feel so much angst in the library community.
June 15th, 2013
I’ve just finished leading two days of workshops at the University of Stuttgart as part of my fellowship at the Internationales Zentrum für Kultur- und Technikforschung. (No, I taught in English.) This was for me a wonderful experience. First of all, the students were engaged, smart, and fun, and they spoke from diverse standpoints. Second, it reminded me how to teach. I had so much trouble trying to structure sessions, feeling totally unsure how one does so. But the eight 1.5-hour sessions reminded me why I loved teaching.
For my own memory, here are the sessions (and if any of you were there and took notes, I’d love to see them):
#1 Cyberutopianism, technodeterminism, and Internet exceptionalism defined, with JP Barlow’s Declaration of the Independence of Cyberspace as an example. Class introductions.
#2 Information Age to Age of Connected. Why Ted Nelson’s Xanadu did not succeed the way the Web did. Rough technical architecture of the Net and (perhaps) its embedded political values. Hyperlinks.
#3 Digital order. Everything is miscellaneous? From information retrieval to search engines. Schema-based databases to tagging.
#4 Networked knowledge. What knowledge looks like once it’s been freed of paper. Four challenges to networked knowledge (with many more added by the students).
On Saturday we talked about topics that the students decided were interesting:
#1 Mobile net. Is Facebook making us more or less social? Why do we fill up every interstice by using Facebook on mobiles? What does this say about us and the notion of the self?
#2 Downloading. Do you download music illegally? What is your justification? How might artists respond? Why is the term “intellectual property” so loaded?
#3 Education. What makes a great in-person course? What makes for a miserable one? Oddly, many of the characteristics of miserable classes are also characteristics of MOOCs. What might we do about that? How much of this is caused by the fact that MOOCs are construed as courses in the traditional sense?
#4 Internet culture. Is there such a thing? If there are many, is any particular one to be privileged? How does the Net look to a culture that is dedicated to warding off what it sees as corrupting influences? End with LolCatBible and the astounding TheJohnnyCashProject.
Thank you, students. This experience meant a great deal to me.
May 20th, 2013
NOTE on May 23: OCLC has posted corrected numbers. I’ve corrected them in the post below; the changes are mainly fractional. So you can ignore the note immediately below.
NOTE a couple of hours later: OCLC has discovered a problem with the analysis. So please ignore the following post until further notice. Apologies from the management.
Ever since the 1960s, publishers have used ISBN numbers as identifiers of editions of books. Since the world needs unique ways to refer to unique books, you would think that ISBN would be a splendid solution. Sometimes and in some instances it is. But there are problems, highlighted in the latest analysis run by OCLC on its database of almost 300 million records.
Number of ISBNs
Percentage of the records
So, 78% of OCLC’s humongous collection of book records have no ISBN, and only 1.6% have the single ISBN that God intended.
As Roy Tennant [twitter: royTennant] of OCLC points out (and thanks to Roy for providing these numbers), many works in this collection of records pre-date the 1960s. Even so, the books with multiple ISBNs reflect the weakness of ISBNs as unique identifiers. ISBNs are essentially SKUs to identify a product. The assigning of ISBNs is left up to publishers, and they assign a new one whenever they need to track a book as an inventory item. This does not always match how the public thinks about books. When you want to refer to, say, Moby-Dick, you probably aren’t distinguishing between one with illustrations, a large-print edition, and one with an introduction by the Deadliest Catch guys. But publishers need to make those distinctions, and that’s who ISBN is intended to serve.
This reflects the more general problem that books are complex objects, and we don’t have settled ways of sorting out all the varieties allowed within the concept of the “same book.” Same book? I doubt it!
Still, these numbers from OCLC exhibit more confusion within the ISBN number space than I’d expected.
MINUTES LATER: Folks on a mailing list are wondering if the very high percentage of records with two ISBNs is due to the introduction of 13-digit ISBNs to supplement the initial 10-digit ones.
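To see why a single edition could plausibly end up with two ISBNs on its record, here is a minimal Python sketch of the standard conversion from a 10-digit ISBN to its 13-digit form: prepend 978, drop the old check character, and recompute the check digit with alternating weights of 1 and 3. The function name is my own, and the sample number is a commonly cited example ISBN, not something taken from the OCLC data.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-digit ISBN to its 978-prefixed 13-digit form.

    Hypothetical helper for illustration; it does not validate the
    incoming ISBN-10 check character.
    """
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError("expected 10 characters after removing separators")

    # Drop the old check character and prepend the 978 prefix.
    core = "978" + chars[:9]
    if not core.isdigit():
        raise ValueError("the first nine characters must be digits")

    # ISBN-13 check digit: weights alternate 1, 3 across the first 12 digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


# A commonly cited sample ISBN, used here only as an illustration.
print(isbn10_to_isbn13("0-306-40615-2"))
```

If a record lists both forms of the same number, it counts as having two ISBNs even though the publisher only ever meant one edition.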
April 25th, 2013
Amanda Filipacchi has a great post at the New York Times about the problem with classifying American female novelists as American female novelists. That’s been going on at Wikipedia, with the result that the category American novelist was becoming filled predominantly with male novelists.
Part of this is undoubtedly due to the dumb sexism that thinks that “normal” novelists are men, and thus women novelists need to be called out. And even if the category male novelist starts being used, it still assumes that gender is a primary way of dividing up novelists, once you’ve segregated them by nation. Amanda makes both points.
From my point of view, the problem is inherent in hierarchical taxonomies. They require making decisions not only about the useful ways of slicing up the world, but also about which slices come first. These cuts reflect cultural and political values and have cultural and political consequences. They also get in the way of people who are searching with a different way of organizing the topic in mind. In a case like this, it’d be far better to attach tags to Wikipedia articles so that people can search using whatever parameters they need. That way we get better searchability, and Wikipedia hasn’t put itself in the impossible position of coming up with a taxonomy that is neutral to all points of view.
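Here is a rough Python sketch of the difference, under my own assumptions: the article names and tags are made up, and this is not how Wikipedia stores anything. The point is only that with tags no slice has to come first, so a search by nationality and a search by gender are equally direct.

```python
from collections import defaultdict

# Hypothetical tag assignments; names and tags are illustrative only.
article_tags = {
    "Novelist A": {"novelist", "american", "female"},
    "Novelist B": {"novelist", "american", "male"},
    "Novelist C": {"novelist", "british", "female"},
}


def build_index(tags_by_article):
    """Map each tag to the set of articles that carry it."""
    index = defaultdict(set)
    for article, tags in tags_by_article.items():
        for tag in tags:
            index[tag].add(article)
    return index


def search(index, *tags):
    """Return the articles carrying every requested tag, in any tag order."""
    if not tags:
        return set()
    return set.intersection(*(index.get(tag, set()) for tag in tags))


index = build_index(article_tags)

# No tag is privileged: the searcher picks whichever cuts matter to them.
print(sorted(search(index, "american", "novelist")))
print(sorted(search(index, "female", "novelist")))
```

In a hierarchy, Novelist A would have to live under one branch or the other; here she simply shows up in both searches.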
Wikipedia’s categories have been broken for a long time. We know this in the Library Innovation Lab because a couple of years ago we tried to find every article in Wikipedia that is about a book. In theory, you can just click on the “Book” category. In practice, the membership is not comprehensive. The categories are inconsistent and incomplete. It’s just a mess.
It may be that a massive crowd cannot develop a coherent taxonomy because of the differences in how people think about things. Maybe the crowd isn’t massive enough. Or maybe the process just needs far more guidance and regulation. But even if the crowd can bring order to the taxonomy, I don’t believe it can bring neutrality, because taxonomies are inherently political.
There are problems with letting people tag Wikipedia articles. Spam, for example. And without constraints, people can lard up an object with tags that are meaningful only to them, offensive, or wrong. But there are also social mechanisms for dealing with that. And we’ve been trained by the Web to lower our expectations about the precision and recall afforded by tags, whereas our expectations are high for taxonomies.
April 18th, 2013
I’m very proud to announce that the Harvard Library Innovation Lab (which I co-direct) has launched what we think is a useful and appealing way to browse books at scale. This is timed to coincide with the launch today of the Digital Public Library of America. (Congrats, DPLA!!!)
StackLife (née ShelfLife) shows you a visualization of books on a scrollable shelf, which we turn sideways so you can read the spines. It always shows you books in a context, on the grounds that no book stands alone. You can shift the context instantly, so that you can (for example) see a work on a shelf with all the other books classified under any of the categories professional cataloguers have assigned to it.
We also heatmap the books according to various usage metrics (“StackScore”), so you can get a sense of the work’s community relevance.
There are lots more features, and lots more to come.
We’ve released two versions today.
StackLife DPLA mashes up the books in the Digital Public Library of America’s collection (from the Biodiversity Heritage Library) with books from The Internet Archive‘s Open Library and the Hathi Trust. These are all online, accessible books, so you can just click and read them. There are 1.7M in the StackLife DPLA metacollection. (Development was funded in part by a Sprint grant from the DPLA. Thank you, DPLA!)
StackLife Harvard lets you browse the 12.3M books and other items in the Harvard Library system’s 73 libraries and off-campus repository. This is much less about reading online (unfortunately) than about researching what’s available.
Here are some links:
StackLife DPLA: http://stacklife-dpla.law.harvard.edu
StackLife Harvard: http://stacklife.law.harvard.edu
The DPLA press release: http://library.harvard.edu/stacklife-browse-read-digital
The DPLA version FAQ: http://stacklife-dpla.law.harvard.edu/#faq/
The StackLife team has worked long and hard on this. We’re pretty durn proud.
April 16th, 2013
I had both CNN and Twitter on yesterday all afternoon, looking for news about the Boston Marathon bombings. I have not done a rigorous analysis (nor will I, nor have I ever), but it felt to me that Twitter put forward more and more varied claims about the situation, and reacted faster to misstatements. CNN plodded along, but didn’t feel more reliable overall. This seems predictable given the unfiltered (or post-filtered) nature of Twitter.
But Twitter also ran into some scaling problems for me yesterday. I follow about 500 people on Twitter, which gives my stream a pace and variety that I find helpful on a normal day. But yesterday afternoon, the stream roared by, and approached filter failure. A couple of changes would help:
First, let us sort by most retweeted. When I’m in my “home stream,” let me choose a frequency of tweets so that the scrolling doesn’t become unwatchable; use the frequency to determine the threshold for the number of retweets required. (Alternatively: simply highlight highly re-tweeted tweets.)
Second, let us mute based on hashtag or by user. Some Twitter cascades I just don’t care about. For example, I don’t want to hear play-by-plays of the World Series, and I know that many of the people who follow me get seriously annoyed when I suddenly am tweeting twice a minute during a presidential debate. So let us temporarily suppress tweet streams we don’t care about.
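For what it’s worth, here is a minimal Python sketch of how those two filters might combine. Everything in it is my own assumption: the Tweet class, the function names, and the sample data are hypothetical, and it has nothing to do with Twitter’s actual API. It drops muted users and hashtags first, then picks the smallest retweet threshold that keeps the stream at or under the pace you chose.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    # Hypothetical record; not Twitter's real data model.
    user: str
    text: str
    retweets: int
    hashtags: set = field(default_factory=set)


def unmuted(tweets, muted_users=frozenset(), muted_hashtags=frozenset()):
    """Drop tweets from muted users or carrying any muted hashtag."""
    return [
        t for t in tweets
        if t.user not in muted_users and not (t.hashtags & muted_hashtags)
    ]


def retweet_threshold(tweets, window_minutes, max_per_minute):
    """Smallest retweet count that keeps the window at or under the chosen pace."""
    budget = int(window_minutes * max_per_minute)
    if len(tweets) <= budget:
        return 0  # Everything fits; no threshold needed.
    counts = sorted((t.retweets for t in tweets), reverse=True)
    # The threshold must exclude the first tweet that would exceed the budget.
    return counts[budget] + 1


def filter_stream(tweets, window_minutes, max_per_minute,
                  muted_users=frozenset(), muted_hashtags=frozenset()):
    kept = unmuted(tweets, muted_users, muted_hashtags)
    threshold = retweet_threshold(kept, window_minutes, max_per_minute)
    return [t for t in kept if t.retweets >= threshold]


# Made-up sample: six tweets arriving within a two-minute window.
stream = [
    Tweet("alice", "update", 120, {"boston"}),
    Tweet("bob", "play-by-play", 3, {"worldseries"}),
    Tweet("carol", "correction", 45, {"boston"}),
    Tweet("dave", "lunch", 0),
    Tweet("erin", "zinger", 45, {"debate"}),
    Tweet("frank", "link", 7),
]

shown = filter_stream(stream, window_minutes=2, max_per_minute=1,
                      muted_hashtags={"worldseries", "debate"})
print([t.user for t in shown])
```

Because ties in retweet counts can push a naive cutoff over budget, the sketch errs on the side of showing fewer tweets rather than more.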
It is a lesson of the Web that as services scale up, they need to provide more and more ways of filtering. Twitter had “follow” as an initial filter, and users then came up with hashtags as a second filter. It’s time for a new round as Twitter becomes an essential part of our news ecosystem.
March 2nd, 2013
Steve Coll has a good piece in the New Yorker about the importance of Al Qaeda as a brand:
…as long as there are bands of violent Islamic radicals anywhere in the world who find it attractive to call themselves Al Qaeda, a formal state of war may exist between Al Qaeda and America. The Hundred Years War could seem a brief skirmish in comparison.
This is a different category of issue than the oft-criticized “war on terror,” which is a war against a tactic, not against an enemy. The war against Al Qaeda implies that there is a structurally unified enemy organization. How do you declare victory against a group that refuses to enforce its trademark?
In this, the war against Al Qaeda (which is quite preferable to a war against terror, and I think Steve agrees) is similar to the war on cancer. Cancer is not a single disease and the various things we call cancer are unlikely to have a single cause and thus are unlikely to have a single cure (or so I have been told). While this line of thinking would seem to reinforce politicians’ referring to terrorism as a “cancer,” the same applies to dessert. Each of these terms probably does have a single identifying characteristic, which means they are not classic examples of Wittgensteinian family resemblances: all terrorism involves a non-state attack that aims at terrifying the civilian population, all cancers involve “unregulated cell growth” [thank you Wikipedia!], and all desserts are designed primarily for taste not nutrition and are intended to end a meal. In fact, the war on Al Qaeda is actually more like the war on dessert than like the war on cancer, because just as there will always be some terrorist group that takes up the Al Qaeda name, there will always be some boundary-pushing chef who declares that beef jerky or glazed ham cubes are the new dessert. You can’t defeat an enemy that can just rebrand itself.
I think that Steve Coll comes to the wrong conclusion, however. He ends his piece this way:
Yet the empirical case for a worldwide state of war against a corporeal thing called Al Qaeda looks increasingly threadbare. A war against a name is a war in name only.
I agree with the first sentence, but I draw two different conclusions. First, this has little bearing on how we actually respond to terrorism. The thinking that has us attacking terrorist groups (and at times their family gatherings) around the world is not made threadbare by the misnomer “war against Al Qaeda.” Second, isn’t it empirically obvious that a war against a name is not a war in name only?
January 1st, 2013
A New Yorker article that profiles John Quijada, the inventor of a language (and a double-dotter!), mentions the first artificial language we know about, Lingua Ignota. The article’s author, Joshua Foer, tells us it was invented by Hildegard von Bingen (totally fun to say out loud) in the 12th century. “All that remains of her language is a short passage and a dictionary of a thousand and twelve words listed in hierarchical order, from the most important (Aigonz, God) to the least (Cauiz, cricket).” There’s more about Lingua Ignota over at our friend, Wikipedia. (And did you remember to kick in a few bucks to keep Wikipedia in booze and cigarettes?)
Ordering a list by cosmic importance (remember the Great Chain of Being?) makes sense if everyone agrees on what that order is. And it expresses respect for the order. That’s why some clergyfolk objected to the fact that Diderot’s Encyclopedia in the 18th century alphabetized its contents. Imagine Cows coming before God!
Before we sneer, we should keep in mind that we do the same thing when we make lists to be seen by others. For example, lists of donors put the Big Money folk first. For another example, we wouldn’t post a list of New Year’s resolutions in the following order:
My New Year’s Resolutions
Bring in an apple instead of snacking from the vending machine
Don’t let the ironing back up for more than a week
Refill the bird-feeder before it’s empty.
Get those birthday cards in the mail on time!
And there are rhetorical rules for the order in which we give reasons to support an argument. For example, we often give the easiest reason to accept first, and lead up to the most serious reason: “It’s easy, it’ll save money, people will feel good about it, and it’s the right thing to do.” The phrase “most important,….” is not permitted to appear in the middle of a sentence.
Order is content.
December 30th, 2012
There’s a knowingly ridiculous thread at Reddit at the moment: Which world leader would win if pitted against other leaders in a fight to the death?
The title is a straightline begging for punchlines. And it is a funny thread. Yet, I found it shockingly informative. The shock comes from realizing just how poorly informed I am.
My first reaction to the title was “Putin, duh!” That just shows you what I know. From the thread I learned that Joseph Kabila (Congo) and Boyko Borisov (Bulgaria) would kick Putin’s ass. Not to mention Jigme Khesar Namgyel Wangchuck (Bhutan), who would win on good looks.
Now, when I say that this thread is “shockingly informative,” I don’t mean that it gives sufficient or even relevant information about the leaders it discusses. After all, it focuses on their personal combat skills. Rather, it is an interesting example of the haphazard way information spreads when that spreading is participatory. So, we are unlikely to have sent around the Wikipedia article on Kabila or Borisov simply because we all should know about the people leading the nations of the world. Further, while there is more information about world leaders available than ever in human history, it is distributed across a huge mass of content from which we are free to pick and choose. That’s disappointing at the least and disastrous at its worst.
On the other hand, information is now passed around if it is made interesting, sometimes in jokey, demeaning ways, like an article that steers us toward beefcake (although the president of Ireland does make it up quite high in the Reddit thread). The information that gets propagated through this system is thus spotty and incomplete. It only becomes an occasion for serendipity if it is interesting, not simply because it’s worthwhile. But even jokey, demeaning posts can and should have links for those whose interest is piqued.
So, two unspectacular conclusions.
First, in our despair over the diminishing of a shared knowledge-base of important information, we should not ignore the off-kilter ways in which some worthwhile information does actually propagate through the system. Indeed, it is a system designed to propagate that which is off-kilter enough to be interesting. Not all of that “news,” however, is about water-skiing cats. Just most.
Second, we need to continue to have the discussion about whether there is in fact a shared news/knowledge-base that can be gathered and disseminated, whether there ever was, whether our populations ever actually came close to living up to that ideal, the price we paid for having a canon of news and knowledge, and whether the networking of knowledge opens up any positive possibilities for dealing with news and knowledge at scale. For example, perhaps a network is well-informed if it has experts on hand who can explain events at depth (and in interesting ways) on demand, rather than assuming that everyone has to be a little bit expert at everything.