BBC BLOGS - dot.Rory

A digital time capsule at the library

Rory Cellan-Jones | 08:49 UK time, Thursday, 25 February 2010

We hear plenty of warnings these days about the permanence of the web.

Don't put those embarrassing pictures on Facebook because they will be there forever more.

But the British Library has another fear - that old websites will simply disappear into a digital black hole when their owners take them down.

The library says all sorts of material, which may look trivial today, could be of vital interest to future historians.

To that end, it's been working for some years on a web archive of British sites, which is now open to the public.

So what's in it? It includes material on the credit crunch, such as the sites of defunct retailers like Woolworths and Zavvi.

There is the website documenting Antony Gormley's Trafalgar Square fourth-plinth project, which will go offline later this year.

There is material from web forums where people discussed the 2005 terrorist attacks on London, and there's a new project to document the sites of MPs who will retire from Parliament - willingly or otherwise - after this year's general election.

In all, the archive now contains around 6,000 sites, but that's out of around 8 million on the UK domain. The British Library says that it's painstaking work because it feels obliged by copyright law to get the permission of each site's owner before adding it to the archive.

That's not a nicety a big web firm like Google has to worry about, but the Library says it benefits from copyright, which means it gets copies of every book published in the UK, so it has to be punctilious about sticking to the letter of the law.

Now, though, it's hoping that a review of the regulations about the kind of material libraries can collect will mean it's free simply to sweep up anything it fancies from the UK's freely available websites and lay it out in its digital display cabinets.

A couple of questions. Why should a library be so interested in this ephemeral material? And aren't there other ways of accessing this material like Google's cache and the "Wayback Machine" run by the non-profit Internet Archive organisation?

The Library accepts there are alternatives but thinks its curated collection will prove of value to researchers and historians. It's also stressing its permanence: "Big web firms come and go," a spokesman told me. "We'll still be here in 100 years, will Google?"

And this isn't the first time that the British Library has concerned itself with material that might be considered trivial and ephemeral.

Who's interested, for instance, in that free newspaper you left on the train, or the pamphlet thrust into your hand by some campaigner?

Well, when the freesheet is from the 19th Century, and the pamphlet is a tract secretly distributed during the English Civil War, then both have proved of great interest to historians combing the British Library's collection.

The average life of a web page these days is apparently somewhere between 44 and 77 days. A lot of the web simply disappears into the ether, but this and other archives are doing a valuable job of preserving some traces of our digital lives in a time capsule.

Comments

  • Comment number 1.

    This really is an odd story.

    I can see why they feel the need to do this, but you would have thought that working with existing companies like the web archive would be a better way of doing this.

    And why do they need permission when the way back machine obviously doesn't?

    Or does it?

    They are right when they say they cannot judge what will be important, but in the end, most of this, I suspect, is about curiosity rather than actual necessity.

  • Comment number 2.

    I'm rather amazed by the assertion that 'The average life of a web page these days is apparently somewhere between 44 and 77 days'. This suggests to me that as well as performing the excellent task of archiving the material, the British Library and others should be educating webmasters that the web is itself an archive. Many organisations (excluding the BBC which is quite good at keeping old pages and marking them as such) seem rarely to consider potential historical interest before deleting any 'out-of-date' pages. If every server kept everything permanently and followed Tim Berners-Lee's long-standing principle that 'Cool URIs don't change', we wouldn't need the likes of the British Library or Internet Archive Organisation to perform these tasks in the first place.

  • Comment number 3.

    #1 Joss wrote > "And why do they need permission when the way back machine obviously doesn't?

    Or does it?"

    The Wayback Machine assumes permission. You can block it from accessing your site via the robots.txt file by referring to its crawler as ia_archiver.

    However, some web developers may not know this and may not bother putting that rule into the robots file. Some don't even do a robots file.
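
    For illustration, a minimal sketch of such a rule and how a crawler would read it, using Python's standard urllib.robotparser. The robots.txt content and the example.co.uk address are made up for the example; only the ia_archiver token comes from the comment above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the crawler identified by the
# user-agent token "ia_archiver" and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.co.uk/index.html"  # made-up address
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", page))
    print("other crawler allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

    A site with no robots file at all, as mentioned above, has no such rule, so the crawler is not told to stay away.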

    Plus the Wayback Machine has its faults; it's not without limitations.

    Back on topic.

    Why would you want to archive some old websites? Seems a waste of time and money to me. As long as the content is stored, I don't see why you would want to preserve the website.

    Seems a bit daft to me, as it's not like in 1000 years' time some historian is immediately going to sit up and think to himself "I wonder what the now defunct BBC website used to look like in its prime?"

    Why would you care? The content, yes; the site itself... no.

  • Comment number 4.

    Oddly enough as webmaster of 2 major .uk sites I have not been contacted yet.

    But don't knock the idea of web archives... I find Internet Archive's Wayback Machine extremely useful to track down sites which have gone due to their authors deciding not to continue, or more ephemeral information that might have been removed as it is no longer of use to whoever published it.

    The Wayback Machine has at least one of my sites archived (haven't checked for the other one), it is quite interesting to see how it has grown over the years it's been running.

    Unfortunately I'd be in breach of the site rules if I told you my URLs :)

  • Comment number 5.

    Interesting article, one problem with it though is ~99% of the Internet is just chatter.
    So archiving every single web page would equate to storing every conversation you've made for historians - and I'm pretty sure few historians will care about the conversation between myself and my girlfriend about the cost of carpets for a bedroom.

    I think the library has the right approach - only archive sites of interest.
    Everything else is of niche interest (if any interest at all) and best left for independent archives like those you've also mentioned.

  • Comment number 6.

    @ 1, Joss:

    You said: "And why do they need permission when the way back machine obviously doesn't?"

    My reply:
    Technically the Wayback Machine is breaking copyright law.
    But for the majority of the time people adhere to an unwritten rule of thumb: if it's openly published online, it's free for all.

    If (for argument's sake) the BBC wished to contact the Wayback Machine to remove BBC archives (text and/or other multimedia), the Wayback Machine would be legally obliged to remove said content.

  • Comment number 7.

    How much space would you need to archive the whole of the UK's web? Surely we are talking big numbers? Or does it not work like that?

  • Comment number 8.

    As a forum owner I am quite happy for it to be captured and viewed in future years. As a heritage railway forum, discussion of our UK railway history is key, and preserving this for future generations is a must!

    Cheers

  • Comment number 9.

    Two issues:

    1. It is to be hoped that they'll also keep copies of current and past browsers (and operating systems and the computers they run on) so these can be viewed in 100 years time.

    2. I'm concerned about privacy aspects related to web forums and blogs. It's all very well the library asking for permission from the website owners in these cases - but what about the contributors? I generally have the choice to remove old posts from forums (not sure about the BBC though!). Whilst my posts may exist in backups and offline archives, I can usually delete them from the website so that others can no longer view them. Will the contributors to forums and blogs be asked if their comments can be publicly available for eternity?

  • Comment number 10.

    "2. I'm concerned about privacy aspects related to web forums and blogs. It's all very well the library asking for permission from the website owners in these cases - but what about the contributors? I generally have the choice to remove old posts from forums (not sure about the BBC though!). Whilst my posts may exist in backups and offline archives, I can usually delete them from the website so that others can no longer view them. Will the contributors to forums and blogs be asked if their comments can be publicly available for eternity?"

    Generally speaking, if you read a site's T&C you do give the website and the webmaster of that site full permission to publish your comments without needing to refer to you any further. So, no, contributors would not need to be contacted. They have already given their permission by signing up to a website, agreeing to the terms and conditions, and posting their comment in the first place.

  • Comment number 11.

    @7. TimmyNorfolk

    You wrote:
    How much space would you need to archive the whole of the UK's web? Surely we are talking big numbers? Or does it not work like that?


    My reply:
    Depends.
    If you're just talking about text content, then not much.
    If you're including images (which, realistically, you'd have to as some pages make little sense without included images), then the size of the content would go up dramatically.

    Then you have other multimedia content:
    * videos
    * audio
    * non-HTML documents (PDFs, Office documents, etc)
    * animations (Flash et al)
    * and any interactive content (Flash, Java, ActiveX *shudders*).

    You also need to take into account the frequency of archives:
    * Is it a one time archive of an entire site? (eg site @ 1-Jan-2010)
    * Is it periodic snapshots of entire site? (eg a snapshot once a year)
    * or is it a cumulative backup of each page? (eg a single backup of each page, but backups are made at regular intervals so any new content after the original snapshot doesn't get forgotten).

    You also need to think about site layout (if the site has a layout change, do you take new snapshots, update old snapshots or just ignore the redesign altogether - concentrating only on content)


    The archive's back end (server file system et al) is also important:
    * Are you mirroring the disks on the server or using RAID for redundancy?
    * Does your file system support compression and/or deduping? (Simply put: deduping is a method of reducing data replication)
    * What are the archive's backup policies in case of major server failure? (eg fire, hackers, terrorist attacks or just an idiot at the terminal managing the server)
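
    To make "deduping" concrete, here is a rough sketch of whole-file deduplication by content hash. It is a simplification: file systems such as ZFS deduplicate at block level, and the file names and contents below are invented for the example.

```python
import hashlib


def dedupe(files: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """Store each distinct byte string once, keyed by its SHA-256 digest.

    Returns (store, index): store maps digest -> content,
    index maps original file name -> digest.
    """
    store: dict[str, bytes] = {}
    index: dict[str, str] = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy
        index[name] = digest
    return store, index


if __name__ == "__main__":
    # Hypothetical snapshots: the same logo appears in two yearly captures.
    logo = b"placeholder image bytes" * 100
    files = {
        "2009/logo.png": logo,
        "2010/logo.png": logo,
        "2010/index.html": b"<html><body>Hello</body></html>",
    }
    store, index = dedupe(files)
    raw = sum(len(content) for content in files.values())
    stored = sum(len(content) for content in store.values())
    print(f"{len(files)} files, {len(store)} unique blobs")
    print(f"{raw} bytes before, {stored} bytes after")
```

    Periodic snapshots of a site that rarely changes are where this kind of saving would be largest, since most files repeat from one capture to the next.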


    Raw text (ASCII) is easily compressed but, as already stated in this post, takes up little space even without any compression.
    Images and video/audio are a bigger issue as they will already be compressed so, aside from deduping (which I believe only ZFS supports at the file-system level), they cannot be compressed any further.
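
    The difference is easy to see with a small experiment. This is only a sketch: the repeated markup stands in for page text, and random bytes stand in for already-compressed media, which is an assumption made to keep the example self-contained.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more saving)."""
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    # Repetitive ASCII text stands in for page markup.
    text = b"<p>The quick brown fox jumps over the lazy dog.</p>\n" * 2000
    # Random bytes stand in for already-compressed media (JPEG, MP3, video),
    # which looks close to random to a general-purpose compressor.
    media_like = os.urandom(len(text))
    print(f"text ratio:       {compression_ratio(text):.3f}")
    print(f"media-like ratio: {compression_ratio(media_like):.3f}")
```

    Real pages are less repetitive than this sample, so the saving on text would be smaller in practice, but the contrast with media files should still hold.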


    You've also got to take into account the size of the database required to index all of this data.




    So the short answer is, potentially quite a lot of disk space - but it really depends on how the geeks behind this choose to build their systems.

  • Comment number 12.

    I totally agree with British Library's concern.

    The Web recently reached its 20th year, and I am sure that much valuable digital information that was once available to the public no longer is. And probably this information is not even available from the original source because it simply disappeared (as mentioned in the article).

    These 20 years have been particularly important for history in terms of technological and social evolution. It is important that people and institutions be aware of the importance of this data.

  • Comment number 13.

    The whole problem as I find with libraries/archives is access to the data.

    Take my employer, who used paper-based systems up to the mid/late 80s. These documents were typewritten OR wet inked and, when published, copied onto microfiche. I have no idea what is there or where to find it.
    Then they progressed to PCs, whereby they went through a variety of packages (AMIPRO, Word Perfect, Write, Lotus Notes, etc) until 1999, when they started using Microsoft Word.

    With each subsequent release of Word the ability to open these files diminished, with Word 2000 onwards unable to open the majority of the documents.

    So I would question the scope of archiving web pages when you may be unable to open them in the latest viewer (Safari/Firefox/IE/etc) in the future.

    I know the nightmare I had looking at documents that were used to prove decisions made 10 years ago which are relevant to a product that will be in production/use for the next 30 years.

    Regards

    D1gger5

  • Comment number 14.

    I'm interested in this idea that the average life of a webpage is between 44 and 77 days. Where does that number come from? Does it include frequently updated pages? You could argue that an instance of the BBC weather page is lost every time that page is updated.

  • Comment number 15.

    Technology has always changed life, but it used to develop from necessity. Now tech changes so fast that solutions are invented before the necessities! The big change is in perception, as time is measured in microseconds and distances in "mouse clicks" away... the internet just increased the speed at which knowledge is brought to our fingertips.

  • Comment number 16.

    I think this is a really good project to be undertaken. Who are we to know what will be of historical interest to our descendants?!

    For me though, it raises questions about the accessibility of the sites in the future. HTML should be fairly easy to render in whatever we’re using as a browser. Although, if you consider the *debate* between Apple & Adobe about Flash – will we be able to view content like this in future years?!

    There is something to be said for open standards but also archiving the capability to be able to view content. This is at a time when publishing is potentially on the dawn of a new electronic era.

  • Comment number 17.

    @13. D1gger5

    You wrote:
    I would question the scope of archiving web pages when you may be unable to open them in the latest viewer (Safari/Firefox/IE/etc) in the future.

    My reply:
    That shouldn't be a problem because HTML is an open and well documented specification, whereas the packages you specified (AMIPRO, Word Perfect, Write, Lotus Notes, MS Word) save in proprietary closed formats.

    It's interesting you raise this as there has been a lot of debate online about the problems of closed document formats in terms of future-proofing data (and it's also part of the reason MS were pressured into releasing OOXML (docx, xlsx, etc) as seen in MS Office 2007 - but that's a whole other topic).
    One popular solution (and my personal favourite) is to use ODF (open document format) which OpenOffice.org, KOffice and many other suites support by default.
    ODF (like HTML) is standardised and documented so future applications can still support ODF files should they choose.


    You do have a point in regards to IE specific web pages though - as IE (particularly earlier versions) has its own idea of how web pages should be rendered and what HTML standards to support.
    However, on the whole, if it can be viewed by webkit (Safari, Chrome, etc) or gecko (Firefox) then the site will conform to HTML standards as agreed and detailed on w3c.org. So any developer can build a viewer to render archived websites long after Firefox and Safari have been superseded.


    So while I do sympathise with your issues in regards to document compatibility, web pages _shouldn't_ be affected.

  • Comment number 18.

    With regards to technology changing and needing to keep old browsers, programs etc, the library could archive most web pages to PDF/A (http://en.wikipedia.org/wiki/PDF/A), which is a PDF format designed for long term archiving, and is likely to still be readable long into the future.

    Storing more interactive ("Web 2.0") sites this way would be tricky, but it would get the majority of sites.

    That doesn't solve the problem of what medium (CD, DVD, hard drive etc) to store the data on. The best thing to do about that is make sure that every so many years the files are transferred onto a new medium.

  • Comment number 19.

    I think the issue of losing the ability to render some sites is more important than Laumars makes out. Already the British Library is having problems with recorded material on obsolete (physical) media, with online content this obsolescence could happen even quicker.

    There's a lot of talk about browsers adhering to standards, but in reality none of them support all the current standards - otherwise they would all pass the Acid3 test... which they do not. Additionally, all browsers support some non-standard HTML - e.g. the hated marquee and blink tags - but if a browser comes along which adheres just to the standards (or support for the non-standard HTML is dropped) then the ability to render those pages as intended would be lost (no bad thing in the case of blink and marquee, but who are we to judge). More importantly, if pages are written to accommodate strange behaviour of individual rendering engines (they do all have some quirks) then updates in the future to bring those browsers in line with the standards may cause those sites to no longer be rendered as intended.

    Someone mentioned that you need only archive the page content rather than the whole site, but that would mean you would lose the context. It would be like saving a newspaper by cutting it up into the individual stories. You can get a lot more information historically by looking at the whole rather than a specific part.

    I have a site which has been archived back to 1996 with the Wayback Machine. It's somewhat embarrassing looking at it now, and I'm not sure how relevant the information will be, but who knows what future society will find interesting? I think the British Library has a justification to archive online content; let's hope they get the necessary permissions to do so easily.

  • Comment number 20.

    I wonder what will happen to the old BBC Have Your Say material? Will that be lost now that it's moved over to a new system? Didn't HYS in fact lose all its old material a little while back after a software crash?

    I imagine that it would be fascinating for future historians.

  • Comment number 21.

    Page content has to be in context though...
    Without Images, Styles, more visual information some of the content might be lost?

  • Comment number 22.

    I would also add, I think that the pages would be captured as images? Not as seen by internet browsers.

  • Comment number 23.

    All very good NatPres, but the nature of web pages is that they are interactive, and images are not. Even static web pages can have areas that only show when you click on them. Without allowing for these you could lose a lot of the content.
    But herein lies another problem: what to do with pages that use server side code to create dynamic content, such as allowing someone to enter a postcode and it shows them something about their area from a database?

  • Comment number 24.

    I wonder if the British Library is archiving the BBC blogs and the Have Your Say service.

    PS: Good move removing that painful recommend button!

  • Comment number 25.

    @ 19 - Laurence:

    You said:
    I think the issue of losing the ability to render some sites is more important than Laumars makes out. Already the British Library is having problems with recorded material on obsolete (physical) media, with online content this obsolescence could happen even quicker.

    My reply:
    It wouldn't because online content adheres to an open standard and is platform and media independent.
    The British Library is having troubles because of proprietary - closed - standards that companies are neither willing to implement in new software nor willing to open up the code for 3rd party developers to implement into their own software.

    This is why archiving should be done in open standards such as ODF and *NOT* closed standards such as Microsoft Word's "doc" format.



    You said:
    There's a lot of talk about browsers adhering to standards, but in reality none of them support all the current standards - otherwise they would all pass the Acid3 test... which they do not

    My reply:
    webkit (Safari, Chrome, etc) and gecko (Firefox) do - which were the examples I gave.
    The latest builds of Opera also score 100% on Acid3.

    So actually most of the top browsers are standards compliant - it's just Internet Explorer (surprise surprise) that lags behind.

    And the only standards the browsers don't agree to are the yet-to-be-agreed HTML5 specifics like which video compression to use.

    Most sites these days are not HTML5 specific (due to IE compatibility). Furthermore, not only is HTML an open standard, the code is open too, so it's very easy to extract the content from the document should it prove impossible for developers in 100 years to re-program an HTML5-compatible browser from the freely available step-by-step instructions.



    You said:
    More importantly, if pages are written to accommodate strange behaviour of individual rendering engines (they do all have some quirks) then updates in the future to bring those browsers in line with the standards may cause those sites to no longer be rendered as intended.

    My reply:
    The content is still there and still usable though. Just open the document up in notepad, kedit, mousepad, vi or whatever you prefer and you've got EVERYTHING you need.

    Worst case scenario - re-engineer the document to be more HTML compliant.
    It's still massively less work than having to convert every single website into a new standard.
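
    To illustrate the point about the content still being there, a minimal sketch that pulls the readable text out of a page using nothing but Python's standard html.parser, with no browser involved. The class and function names and the sample page are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style block.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the text content of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    # Made-up page using a non-standard tag that a future browser might drop.
    page = (
        "<html><head><title>Old site</title><style>p{color:red}</style></head>"
        "<body><marquee>Welcome</marquee><p>News <b>today</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_text(page))
```

    The text inside the non-standard marquee tag comes out like any other, although the layout and behaviour of the page are not recovered this way.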



    On a lighter note, my screen name on here used to be Laurence - so I feel like I'm arguing with myself hehe

  • Comment number 26.

    23. At 5:03pm on 25 Feb 2010, Laurence wrote:
    All very good NatPres, but the nature of web pages is that they are interactive, and images are not. Even static web pages can have areas that only show when you click on them. Without allowing for these you could lose a lot of the content.
    But herein lies another problem: what to do with pages that use server side code to create dynamic content, such as allowing someone to enter a postcode and it shows them something about their area from a database?


    ^ Spot on.
    Good post

  • Comment number 27.

    Good article and interesting comments.

    I think it is important that as much as possible about web sites is captured for historical reference and that really is part of the job of the British Library. If it has a remit to capture and store a copy of everything that is published, then Websites must fall under the same rule. I am sure that many people will want to research and look back at these early 'publications' just like we do now with early printed publications.

    It is, however, clearly a bigger task than just storing 'flat' printed publications or single 'layer' images. As one comment says, static pages can change depending on server and client side actions. However, there are technical ways of 'gathering' that information as well (no I'm not the expert to figure that out!) so I'm sure that issue will be overcome.

    As I say, I think it is important that this information is captured for future reference. Maybe I should have kept the 'snaps' of some of the first web pages that I visited using Mosaic 1.0 on a Unix workstation!!! Who knows, perhaps one day they would have become as valuable as that first edition of ‘Action Comics' that sold the other day!!!

  • Comment number 28.

    @ Laumars

    Gecko does not pass the Acid3 yet, it scores 94. Only Webkit and Presto rendering engines get 100/100.

    Once 3 of the four main rendering engines attain 100 in Acid3, work will start on Acid4. That will mean all the rendering engines will fail, as that is the way Acid is designed.

    You would also have to factor JavaScript into the equation when archiving web pages. JavaScript is not really standardised and there is no real way of ensuring that future browsers will be able to properly show older variations.

  • Comment number 29.

    Two important points not yet mentioned, either by Rory or others:

    1 All citizens have until Monday to comment on whether the UK government should legislate to allow the legal deposit libraries to harvest 'free' websites automatically, that is, without asking publishers' permission to archive. See http://writetoreply.org/legaldeposit/. It's a rather complex questionnaire, but if you feel strongly that we (the people of the UK in the form of our national libraries) should collect and keep for future generations what we publish freely on the web today - submit your views.

    2 Rory gives the incorrect impression that only the British Library is concerned or involved in this issue. The British Library is not the only national library in the UK, and the National Library of Scotland and the National Library of Wales are equally concerned to find a solution.

  • Comment number 30.

    So how does this work with sites that may have received take-down notices or that have been posting illegal (defamatory/libelous material)? This, coupled with the fact that 99% of what is on the internet is utter rubbish makes the whole thing very much an exercise in futility. There are things in life that really don't need archiving.

  • Comment number 31.

    I would imagine in years to come that software and computers will be quite capable of reading old formats.

    Or maybe I look at things in too simple a way?

  • Comment number 32.

    @ 28. William Palmer

    You wrote:
    Gecko does not pass the Acid3 yet, it scores 94. Only Webkit and Presto rendering engines get 100/100.

    my reply:
    sorry, my mistake. Though the nightly builds are up to 97% so 100/100 isn't far away now.


    You said:
    You would also have to factor JavaScript into the equation when archiving web pages. JavaScript is not really standardised and there is no real way of ensuring that future browsers will be able to properly show older variations.

    My reply:
    The critical bit is really the mark up (HTML, CSS, etc).
    While javascript does subtly vary from browser to browser (not significantly, but still enough to cause problems from time to time), the content is still available to view and the javascript (being un-encoded ASCII) can still be edited to fit future browsers - which is far more practical than having to rewrite every single last web page into a new format.



    So let's put it another way - is there actually anything better around?

    * Is there another document mark up available that supports scripting?

    * Is it open or proprietary? (if the latter, then you might as well delete your documents now)

    * Is it patent encumbered? (even if it's open but patented, you have no control over the document format, so several years down the line you might find you have to pay a hefty charge just to read the data or even lose any legal right to use a rendering engine on those types of data files).

    * Does your document format use any encryption or other forms of compression that might become obsolete in years to come, or is the data held in clean ASCII files?

    * Right, so you think you've finally found a better mark up language. Now how do you convert your current data? Oh that's right, now you have to update every single web page to work with your new browsers, rather than a handful of sites to work with future HTML browsers.


    Besides, I think all this is going to be a moot point as Firefox (or any other open source browser) would just run on a Linux or ReactOS virtual machine.

    These days, you don't need to worry about software becoming obsolete as you can just build a virtual computer. The only issue would be with software licences - so you'd likely be stuck with open source software, but since the internet is platform independent, that's not a problem.

  • Comment number 33.

    panto: Who knows what might be useful to people in years to come. What you think of as rubbish now may well be useful to historians later. In archaeology it is useful to examine ancient rubbish tips to see what was thrown away as a lot of information can be deduced from it.

    SusieWoosey: There are already formats which are no longer supported and which are no longer readable (ref: http://news.bbc.co.uk/1/hi/6265976.stm).

  • Comment number 34.

    @ 31. SusieWoosey

    you wrote:
    I would imagine in years to come that software and computers will be quite capable of reading old formats.

    Or maybe I look at things in too simple a way?


    my reply:
    actually the reverse is true.

    In simple terms, many file formats are privately owned and thus it is either impossible or very very difficult for other software developers to build editors for them (this means that developers often end up having to "reverse engineer" the files - which is very messy and very time consuming). Thus once it becomes non-profitable for the owner of said file format to continue that format, it will be dropped and forgotten about.

    This is why archived data should be stored in what's called an "open specification". That basically means that anyone and everyone has access to a detailed breakdown of how the internals of that file format are structured and how to build editors for those data files. Thus, years down the line, even if no editor still exists for reading those files, a new one can be programmed.


    The problem we currently face is that big corporations have a vested interest in pushing their own private file formats (essentially because it locks users into their platform). While I can sympathise with the fact that businesses are there to make money, people also need to be aware that it's not safe to save data they expect to access in dozens of years time in a format that is privately owned.

    Which leads me nicely onto ODF. This is basically an open version of Microsoft Office's document formats. So ODF has file formats comparable to Word, Excel and PowerPoint that are completely open, and you can guarantee that your documents will be accessible years down the line.
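
    As a rough illustration of what an open specification buys you: an ODF document is a ZIP package whose text lives in a content.xml file, so general-purpose tools can read it. The sketch below builds a tiny stand-in package in memory (a real file would also carry a manifest, styles and metadata) and pulls the paragraphs back out. The sample text and function names are invented for the example.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def make_sample_odt() -> bytes:
    """Build a tiny stand-in .odt package in memory (illustration only)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        "<text:p>First paragraph.</text:p>"
        "<text:p>Second paragraph.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", content)
    return buffer.getvalue()


def read_paragraphs(odt_bytes: bytes) -> list[str]:
    """Return the text of every <text:p> element in content.xml."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


if __name__ == "__main__":
    for paragraph in read_paragraphs(make_sample_odt()):
        print(paragraph)
```

    Nothing here depends on an office suite being installed, which is the practical meaning of "a new one can be programmed".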

  • Comment number 35.

    Laumars: All very well using the OpenDocument format, but it doesn't support everything you can do in Microsoft Word - and I'm on about some important features rather than obscure ones - so I can't save some documents in that format.
    It also isn't a version of anything Microsoft, the Office Open XML format is the open version of Microsoft Office's document format.
    Lastly, there are no guarantees. Years down the line there may well be a different standard which everyone uses and you might find that your choice of word processor means that your old ODF documents can't be opened any more.
    Standards do fall by the wayside as better ones supersede them. Now if only I could find an 8inch floppy disk reader, I might be able to retrieve some of my older documents...

  • Comment number 36.

    @ 35 Laurence:

    You said:
    All very well using the OpenDocument format, but it doesn't support everything you can do in Microsoft Word - and I'm on about some important features rather than obscure ones - so I can't save some documents in that format.

    My reply:
    Like what?
    I've yet to run into that problem myself, but then I only tend to use the spreadsheet side of ODF.



    You said:
    It also isn't a version of anything Microsoft; the Office Open XML format is the open version of Microsoft Office's document format.

    My reply:
    Why does it have to be a version of Microsoft? I wish people would get out of this "if it's not supported by Microsoft it's no good" mentality. Quite frankly, it's doing more harm than good for the progress of technology.

    And as for your recommendation for OOXML - that isn't completely open. The markup is open, but it still calls a number of closed interfaces.
    So 50 years down the line, you could be no better off with OOXML than with their previous document formats, as you may find one of the closed interfaces referenced in OOXML no longer exists.



    You said:
    Lastly, there are no guarantees. Years down the line there may well be a different standard which everyone uses and you might find that your choice of word processor means that your old ODF documents can't be opened any more.

    My reply:
    And as I already said, open standards allow plug-ins or whole new editors to be developed to open said documents.
    You couldn't do this with closed standards.
    Sure, it's not a 100% guarantee - but it's by far the safest bet and as close to a guarantee as you'll get (and thus far I've yet to see you provide any better alternatives to any of the suggestions I've proposed).


    You said:
    Standards do fall by the wayside as better ones supersede them.

    My reply:
    But if the standards are open then the specs are still available, so it's not an issue.


    You said:
    Now if only I could find an 8-inch floppy disk reader, I might be able to retrieve some of my older documents...

    My reply:
    Hardware standards are a different issue altogether, as even open hardware standards depend on the availability of hardware components.
    However, this topic is about software standards for archiving data.

  • Comment number 37.

    The ultimate document open standard was embodied in the work of Johannes Gutenberg

    :)

  • Comment number 38.

    Laumars: I wasn't recommending OOXML, I was just correcting you when you said (in 34) that ODF was an 'open version of Microsoft Office's document formats'. It isn't.
    Just because a standard is 'open' doesn't mean that the documentation will still be available years later (after all, the format of the document the standard is written in may itself become obsolete), and it doesn't mean that the standard is precise enough. Well, there's a whole host of caveats but it's largely irrelevant as we don't live in a nirvana where everyone uses the same standard.
    Oh, and I was using the 8-inch floppy as an example of a common standard which is no longer supported. It's irrelevant whether it's hardware or software - there would still be work needed to produce a decoder for the format.

  • Comment number 39.

    @ 38 Laurence

    You said:
    I wasn't recommending OOXML, I was just correcting you when you said (in 34) that ODF was an 'open version of Microsoft Office's document formats'. It isn't.

    My reply:
    You've misinterpreted my comment.
    Granted it wasn't the best phrasing, but I was trying to equate ODF to something the earlier commentator might recognise.
    So to compare ODF to MS Office is a fair layman's comparison.

    I did say "basically" when making the comparison so that it wouldn't be read by others as a literal example, but more as an "it's kind of like".

    Sorry for the confusion there. :)



    You said:
    Just because a standard is 'open' doesn't mean that the documentation will still be available years later (after all, the format of the document the standard is written in may itself become obsolete), and it doesn't mean that the standard is precise enough.

    My reply:
    You keep picking holes in open standards while missing the point that open standards **ARE** safer than closed standards for archiving. Everything you've stated is equally problematic for closed standards, and then you have additional concerns for closed standards on top which don't apply to open standards.

    So why undermine open standards with nitpicking? If you don't like ODF then suggest a better standard or build one yourself. I keep asking you to supply better alternatives and thus far you haven't, so my points and recommendations stand.



    You said:
    Well, there's a whole host of caveats but it's largely irrelevant as we don't live in a nirvana where everyone uses the same standard.

    My reply:
    So you're saying basically we shouldn't bother trying because it's never going to happen?
    Surely it's better to get some people educated than not bother educating anyone?


    You said:
    Oh, and I was using the 8-inch floppy as an example of a common standard which is no longer supported. It's irrelevant whether it's hardware or software - there would still be work needed to produce a decoder for the format.

    My reply:
    It's very relevant.
    Software can be emulated and virtualised on new hardware/software configurations. Old software could even be completely rebuilt cheaply using existing source code and detailed specifications.

    You can't shove a floppy disk into a DVD drive and expect any kind of emulation to make it work.
    So you either have to find an old floppy drive and build an interface to connect it to new hardware, or build a whole new floppy drive.
    Thus the only "safe" way to combat hardware progress is to keep backing up your soft-technology (be it document data or executable data) to new storage media. Personally I prefer drive arrays, whether SSD or HDD, as they significantly reduce the complexity of automated upgrades.

    So it makes a huge difference whether you're talking about software or hardware when discussing technology standards used for long term storage.
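
    The "keep backing up to new storage" approach above can be sketched as a copy followed by a checksum comparison, so that a bad copy is caught before the old medium is retired. This is only an illustrative sketch in Python: the function names sha256_of and migrate are invented for the example, and a real archive would also want to handle whole directory trees, scheduling and logging.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src, dst):
    """Copy src to dst on the new medium and verify the copy (hypothetical helper).

    Returns the verified digest so it can be recorded alongside the file.
    Raises OSError if the copy does not match the original.
    """
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 also preserves timestamps where it can
    source_digest = sha256_of(src)
    if sha256_of(dst) != source_digest:
        raise OSError(f"checksum mismatch after copying {src} to {dst}")
    return source_digest
```

    Recording the returned digest means the next migration, years later, can check the file against the value stored today rather than trusting the old drive.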

  • Comment number 40.

    I don't dislike open standards, in fact I'd like to see more global open standards, it's just that they do suffer from the same issues that any standard suffers from over time (and I'm not talking just a few years). Building an 8-inch floppy drive is just as time consuming as building an emulator. So coming back to the original point: there is still a good reason for the British Library to want to archive older browsers as well as the web sites.
    I'm sure everyone else has stopped reading this by now!

  • Comment number 41.

    You said:
    I don't dislike open standards, in fact I'd like to see more global open standards, it's just that they do suffer from the same issues that any standard suffers from over time (and I'm not talking just a few years).

    My reply:
    Open standards suffer from /SOME/ of the same issues. NOT all.
    And that's the crux of my point. You HAVE to save the data, and to save the data you HAVE to choose a file standard.
    So it makes sense to pick a standard that offers the least potential complications - which open standards do.


    You said:
    Building an 8-inch floppy drive is just as time consuming as building an emulator.

    My reply:
    I think we're referring to different things when talking about emulators. I'm talking about emulating an OS to run old web browsers, whereas you seem to be talking about emulators as web site interpreters.

    While I agree that building a web browser from scratch could (though it's a difficult point to prove) take just as long as building a new floppy drive, you wouldn't really need to build a new browser from scratch so long as source code is kept for an open source rendering engine like WebKit (and as source code is plain ASCII text, there's no debate about what file format to store it in).

    But, and as I think you've stated as well, there's a lot of "if"s and presumptions in both these arguments - so it's a difficult point to argue.



    You said:
    So coming back to the original point. There is still a good reason for the British Library to want to archive older browsers as well as the web sites.

    My reply:
    I agree. I must admit I missed the point where they stated they were doing so - but I've certainly made the same point earlier when discussing OS emulation and virtualisation.



    ------

    I think, essentially, we have the same ideas and concerns, if maybe subtly different approaches.

    But the biggest disagreement is perhaps with the way we phrase things hehe

  • Comment number 42.

    I registered a new website which is now part of the British Library UK Web Archive and I’m very glad that I have done so. I was involved in the creation of the website http://www.castlehowardstation.com – an historical record of facts, people’s personal recollections and rare photographic images spanning the last 150 years. Until the site went live, the physical archive material and knowledge of the subject remained stored in a shoebox or assigned to human memory.

    I found it very satisfying to be part of a project to gather, sort and release this information into the wild for the very first time. Whilst it crossed my mind that it would have been a great loss if this archive had never been made freely available to the world, I hadn’t considered the future beyond my own lifetime of maintaining the website.

    Now that the pages have been captured on the UK Web Archive and will be updated periodically, I like to imagine that someone researching the topic in another 150 years' time will discover this fascinating original material.

    I would recommend submitting a website to the UK Web Archive to anyone creating a web resource that would interest future generations.

  • Comment number 43.

    What drew me to this story was the headline link reading:

    The average life of a web page these days is apparently somewhere between 44 and 77 days.

    How can this be? An average is a unique value; it's not a range or interval. I find myself cringing at the 'dumbing down' of society that we are subjected to.

  • Comment number 44.

    My first and only "website" (a forum called UOForums or Ultimate Online Forums - www.uoforums.com) is coming up for 2920 days (8 years) this September. I guess I must be one of the rare ones who don't shut sites down heh.

 

BBC © 2014 The BBC is not responsible for the content of external sites. Read more.