The first panel I went to was digital curation and preservation. My notes from these sessions are more sparse.
how to choose a digital preservation strategy (Strodl, Becker, Neumayer, and Rauber)
I found it very hard to understand this too-fast talk; I'll need to read their paper in the proceedings. Here are some brief notes I got from their talk:
The first step is to identify requirements. The process, roughly:
- identify requirements and goals
- tree structure
- usually 4 top level branches
- object characteristics (content, metadata)
- record metadata (context, relations)
- process characteristics (scalability)
- costs
- what to do: migration, emulation, both, other?
- go/no-go decision: will it be useful and cost-effective to continue?
- run experiment and evaluate
- consider results
- not all criteria are equally important: give the leaves weights, then multiply each alternative's performance score on a leaf by that leaf's weight
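A minimal sketch of how I understand that last step, in Python; the criteria, weights, and scores are invented for illustration, not taken from their paper:

```python
# Sketch of the weighted evaluation step: each leaf criterion gets a weight,
# each candidate strategy gets a performance score per leaf, and a strategy's
# overall utility is the weighted sum. All numbers here are made up.

criteria_weights = {
    "object: content fidelity": 0.30,
    "record: context/relations kept": 0.20,
    "process: scalability": 0.25,
    "costs": 0.25,
}

# performance of each candidate strategy against each leaf, on a 0-5 scale
strategies = {
    "migration": {"object: content fidelity": 4, "record: context/relations kept": 3,
                  "process: scalability": 5, "costs": 4},
    "emulation": {"object: content fidelity": 5, "record: context/relations kept": 4,
                  "process: scalability": 2, "costs": 2},
}

def utility(scores, weights):
    """Weighted sum of performance scores over all leaf criteria."""
    return sum(weights[c] * scores[c] for c in weights)

for name, scores in strategies.items():
    print(f"{name}: {utility(scores, criteria_weights):.2f}")
```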
You know what? This is a mathematical structure built on what we sort of did during our requirements-gathering phase, just with preservation requirements instead of feature requirements. Pretty straightforward stuff applied to a new problem. How is this interesting? It must be, somewhere.
Actually, here's a good takeaway from this talk: just because a preservation method is practiced and documented and in use in your repository doesn't mean it works. This model includes testing strategies before you implement them. It creates a documented and supportable decision-making process for choosing a workflow, which, in preservation, is probably a damn good thing. On the other hand, if you get too retentive about preservation, you'll never implement anything. *cough* PREMIS *cough*
factors affecting website reconstruction from the web infrastructure (McCown, Diawara, Nelson):
The web is ephemeral and possibly unrecoverable, so they built a downloadable tool called "warrick" that people can use (and have used) to recover lost sites. It works by "crawling the crawlers": reconstructing a site by searching extant repositories (internet archive, google cache, etc). But the question remains: how do you discover what wasn't recovered?
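To make "crawling the crawlers" concrete, here's a minimal sketch of a lookup against a single repository; it uses the Internet Archive's Wayback availability endpoint as a stand-in, which is my assumption, not necessarily how warrick itself queries the caches:

```python
# Minimal sketch: ask one repository (the Internet Archive) whether it holds
# a copy of a URL. Warrick generalizes this idea across IA, search-engine caches, etc.
import json
import urllib.parse
import urllib.request

def wayback_lookup(url):
    """Return the closest archived snapshot URL for `url`, or None if absent."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(wayback_lookup("http://example.com/"))
```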
- to test, he ran warrick on snapshotted live sites every week over a period of time
- little happy math formula to see what's changed, missing, added
- definition of "success" (mostly concerned with missing, not changed or added since snapshot date)
- on average, 61% of a website was reconstructed on any given week
- purely textual websites have worse recovery rates than image- or pdf-heavy sites. surprising!
- birth and decay of resources within a given website
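My guess at what the "little happy math formula" boils down to, as a toy Python sketch (the URL-to-hash maps are made up):

```python
# Compare a weekly snapshot of the live site against the reconstruction.
# Keys are URLs, values are content hashes; both maps are invented examples.
snapshot      = {"/": "h1", "/about": "h2", "/papers.pdf": "h3"}
reconstructed = {"/": "h1", "/about": "h2-old", "/new": "h9"}

identical = {u for u in snapshot if reconstructed.get(u) == snapshot[u]}
changed   = {u for u in snapshot if u in reconstructed} - identical
missing   = snapshot.keys() - reconstructed.keys()
added     = reconstructed.keys() - snapshot.keys()

# "success" in the talk was mostly about what's missing, not what changed or was added
recovered = (len(identical) + len(changed)) / len(snapshot)
print(f"recovered {recovered:.0%}; missing: {sorted(missing)}; added: {sorted(added)}")
```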
Factors:
- backlinks, external and internal; PageRank; depth; hops from front page; MIME type; query string parameters; age; resource birth rate; TLD; website size; size of resource
- but these only account for about half of the results found
- other unmeasured factors accounted for much of the results
- most significant factors to recovery: PageRank, hops, age
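I read "account for about half of the results" as something like explained variance; here's a toy regression in that spirit, with entirely invented data, just to show the shape of the analysis I think he means:

```python
# Toy sketch: regress per-resource recovery on a few measured factors
# (PageRank, hops from the front page, age) and check R^2. Everything
# here is synthetic; it only illustrates the kind of analysis, not their data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
pagerank = rng.random(n)
hops     = rng.integers(0, 6, n).astype(float)
age_days = rng.integers(1, 1000, n).astype(float)
noise    = rng.normal(0, 0.25, n)
recovery = 0.4 * pagerank - 0.05 * hops + 0.0002 * age_days + noise  # fake relationship

X = np.column_stack([np.ones(n), pagerank, hops, age_days])
coef, *_ = np.linalg.lstsq(X, recovery, rcond=None)
pred = X @ coef
r2 = 1 - ((recovery - pred) ** 2).sum() / ((recovery - recovery.mean()) ** 2).sum()
print("coefficients:", np.round(coef, 3), "R^2:", round(float(r2), 2))
```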
stats for repositories:
- msn was best, followed by google and IA, then yahoo
- possibly google did worse than expected because the data available via the google API might not be as good as pure google
balance: do we prefer IA because of its authoritative nature, or search engine caches because of recency?
question: how is he defining "website"? everything not an offlink by fqdn? by parent domain? [answer: domain name]
question: did he look at passworded or spider-blocked resources? [answer: obeyed robots.txt; ignored all passworded sites, which is very problematic to me]
defining what digital curators do and what they need to know, the DigCCurr project (Lee, Tibbo, Schaefer):
Goals of DigCCurr:
- to develop graduate level curricular framework
- course modules
- experiential component (eg internships, fellowships)
We need "active management and preservation of the digital objects over their lifecycle" (remember to avoid benign neglect). When building curricula, there's a matrix of things to think about: type of resource; professional or disciplinary context; values and principles; prerequisite knowledge; lifecycle stages. The project develops levels of function and skill that need to be taught and identifies the tasks of a digital curator.
generating best-effort preservation metadata for web resources at time of dissemination (Smith and Nelson)
Their product is mod_oai. One of the problems in preserving websites is the absence of metadata:
- weak page metadata
- often no link metadata without following links with http (images, for example)
So what to do? Post-ingest processing, eg JHOVE?
Can the webmasters get involved earlier? They're busy, though.
- webmasters can configure modules on webserver to do automatic metadata extraction long before passing off to archive for ingest
- auto-extracted metadata can be grabbed with OAI-PMH
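To make that second point concrete: since mod_oai speaks OAI-PMH, an archive can harvest the auto-extracted records with an ordinary protocol request. A minimal sketch (the base URL is hypothetical; the verb and parameters are standard OAI-PMH):

```python
# Sketch: list record identifiers from an OAI-PMH endpoint like the one
# mod_oai exposes. The base URL below is made up for illustration.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://www.example.org/modoai"  # hypothetical endpoint

def list_identifiers(metadata_prefix="oai_dc"):
    # oai_dc is the baseline format every OAI-PMH repository must support
    query = urllib.parse.urlencode({"verb": "ListIdentifiers",
                                    "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{BASE}?{query}") as resp:
        tree = ET.parse(resp)
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    return [e.text for e in tree.findall(".//oai:identifier", ns)]

print(list_identifiers())
```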
This is limited, still, but piles better than the previous situation. (The metadata is still unverified and undifferentiated (eg administrative vs. technical), but it's automatic and generated at dissemination.)