The first panel I went to was digital curation and preservation. My notes from these sessions are more sparse.
how to choose a digital preservation strategy (Strodl, Becker, Neumayer, and Rauber)
I found it very hard to understand this too-fast talk; I'll need to read their paper in the proceedings. Here are some brief notes I got from their talk:
The first step is to identify requirements. The process, roughly:
- identify requirements and goals
- tree structure
- usually 4 top level branches
- object characteristics (content, metadata)
- record metadata (context, relations)
- process characteristics (scalability)
- costs
- what to do: migration, emulation, both, other?
- go/no-go decision: will it be useful and cost-effective to continue?
- run experiment and evaluate
- consider results
- not all criteria are equally important: give the leaves weights, then multiply each alternative's performance score on a leaf by that leaf's weight
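A minimal sketch of how I understand that last step, in Python; the criteria, weights, and scores are invented for illustration, not taken from their paper:

```python
# Sketch of the weighted evaluation step: each leaf criterion gets a weight,
# each candidate strategy gets a performance score per leaf, and a strategy's
# overall utility is the weighted sum. All numbers here are made up.

criteria_weights = {
    "object: content fidelity": 0.30,
    "record: context/relations kept": 0.20,
    "process: scalability": 0.25,
    "costs": 0.25,
}

# performance of each candidate strategy against each leaf, on a 0-5 scale
strategies = {
    "migration": {"object: content fidelity": 4, "record: context/relations kept": 3,
                  "process: scalability": 5, "costs": 4},
    "emulation": {"object: content fidelity": 5, "record: context/relations kept": 4,
                  "process: scalability": 2, "costs": 2},
}

def utility(scores, weights):
    """Weighted sum of performance scores over all leaf criteria."""
    return sum(weights[c] * scores[c] for c in weights)

for name, scores in strategies.items():
    print(f"{name}: {utility(scores, criteria_weights):.2f}")
```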
You know what? This is a mathematical structure built on what we sort of did during our requirements-gathering phase, just with preservation requirements instead of feature requirements. Pretty straightforward stuff applied to a new problem. How is this interesting? It must be, somewhere.
Actually, here's a good takeaway from this talk: just because a preservation method is practiced and documented and in use in your repository doesn't mean it works. This model includes testing strategies before you implement them. It creates a documented and supportable decision-making process for choosing a workflow, which, in preservation, is probably a damn good thing. On the other hand, if you get too retentive about preservation, you'll never implement anything. *cough* PREMIS *cough*
factors affecting website reconstruction from the web infrastructure (McCown, Diawara, Nelson):
The web is ephemeral and possibly unrecoverable, so they built a downloadable tool called "warrick" that people can use (and have used) to recover lost sites. It works by "crawling the crawlers": reconstructing a site by searching extant repositories (internet archive, google cache, etc). But the question remains: how do you discover what wasn't recovered?
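To make "crawling the crawlers" concrete, here's a minimal sketch of a lookup against a single repository; it uses the Internet Archive's Wayback availability endpoint as a stand-in, which is my assumption, not necessarily how warrick itself queries the caches:

```python
# Minimal sketch: ask one repository (the Internet Archive) whether it holds
# a copy of a URL. Warrick generalizes this idea across IA, search-engine caches, etc.
import json
import urllib.parse
import urllib.request

def wayback_lookup(url):
    """Return the closest archived snapshot URL for `url`, or None if absent."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(wayback_lookup("http://example.com/"))
```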
- to test, he ran warrick on snapshotted live sites every week over a period of time
- little happy math formula to see what's changed, missing, added
- definition of "success" (mostly concerned with missing, not changed or added since snapshot date)
- on average, 61% of a website was reconstructed on any given week
- purely textual websites have worse recovery rates than image- or pdf-heavy sites. surprising!
- birth and decay of resources within a given website
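My guess at what the "little happy math formula" boils down to, as a toy Python sketch (the URL-to-hash maps are made up):

```python
# Compare a weekly snapshot of the live site against the reconstruction.
# Keys are URLs, values are content hashes; both maps are invented examples.
snapshot      = {"/": "h1", "/about": "h2", "/papers.pdf": "h3"}
reconstructed = {"/": "h1", "/about": "h2-old", "/new": "h9"}

identical = {u for u in snapshot if reconstructed.get(u) == snapshot[u]}
changed   = {u for u in snapshot if u in reconstructed} - identical
missing   = snapshot.keys() - reconstructed.keys()
added     = reconstructed.keys() - snapshot.keys()

# "success" in the talk was mostly about what's missing, not what changed or was added
recovered = (len(identical) + len(changed)) / len(snapshot)
print(f"recovered {recovered:.0%}; missing: {sorted(missing)}; added: {sorted(added)}")
```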
Factors:
- backlinks, external and internal; PageRank; depth; hops from front page; MIME type; query string parameters; age; resource birth rate; TLD; website size; size of resource
- but these only account for about half of the results found
- other unmeasured factors accounted for much of the results
- most significant factors to recovery: PageRank, hops, age
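I read "account for about half of the results" as something like explained variance; here's a toy regression in that spirit, with entirely invented data, just to show the shape of the analysis I think he means:

```python
# Toy sketch: regress per-resource recovery on a few measured factors
# (PageRank, hops from the front page, age) and check R^2. Everything
# here is synthetic; it only illustrates the kind of analysis, not their data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
pagerank = rng.random(n)
hops     = rng.integers(0, 6, n).astype(float)
age_days = rng.integers(1, 1000, n).astype(float)
noise    = rng.normal(0, 0.25, n)
recovery = 0.4 * pagerank - 0.05 * hops + 0.0002 * age_days + noise  # fake relationship

X = np.column_stack([np.ones(n), pagerank, hops, age_days])
coef, *_ = np.linalg.lstsq(X, recovery, rcond=None)
pred = X @ coef
r2 = 1 - ((recovery - pred) ** 2).sum() / ((recovery - recovery.mean()) ** 2).sum()
print("coefficients:", np.round(coef, 3), "R^2:", round(float(r2), 2))
```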
stats for repositories:
- msn was best, followed by google and IA, then yahoo
- possibly google did worse than expected because the data available via the google API might not be as good as pure google
balance: do we prefer IA because of its authoritative nature, or search engine caches because of recency?
question: how is he defining "website"? everything not an offlink by fqdn? by parent domain? [answer: domain name]
question: did he look at passworded or spider-blocked resources? [answer: obeyed robots.txt; ignored all passworded sites, which is very problematic to me]
defining what digital curators do and what they need to know, the DigCCurr project (Lee, Tibbo, Schaefer):
Goals of DigCCurr:
- to develop graduate level curricular framework
- course modules
- experiential component (eg internships, fellowships)
We need "active management and preservation of the digital objects over their lifecycle" (remember to avoid benign neglect). When building curricula, there's a matrix of things to think about: type of resource; professional or disciplinary context; values and principles; prerequisite knowledge; lifecycle stages. The project develops levels of function and skill that need to be taught and identifies the tasks of a digital curator.
generating best-effort preservation metadata for web resources at time of dissemination (Smith and Nelson)
Their product is mod_oai. One of the problems in preserving websites is the absence of metadata:
- weak page metadata
- often no link metadata without following links with http (images, for example)
So what to do? Post-ingest processing, eg JHOVE?
Can the webmasters get involved earlier? They're busy, though.
- webmasters can configure modules on webserver to do automatic metadata extraction long before passing off to archive for ingest
- auto-extracted metadata can be grabbed with OAI-PMH
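To make that second point concrete: since mod_oai speaks OAI-PMH, an archive can harvest the auto-extracted records with an ordinary protocol request. A minimal sketch (the base URL is hypothetical; the verb and parameters are standard OAI-PMH):

```python
# Sketch: list record identifiers from an OAI-PMH endpoint like the one
# mod_oai exposes. The base URL below is made up for illustration.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://www.example.org/modoai"  # hypothetical endpoint

def list_identifiers(metadata_prefix="oai_dc"):
    # oai_dc is the baseline format every OAI-PMH repository must support
    query = urllib.parse.urlencode({"verb": "ListIdentifiers",
                                    "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{BASE}?{query}") as resp:
        tree = ET.parse(resp)
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    return [e.text for e in tree.findall(".//oai:identifier", ns)]

print(list_identifiers())
```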
This is limited, still, but piles better than the previous situation. (The metadata is still unverified and undifferentiated (eg administrative vs. technical), but it's automatic and generated at dissemination.)