Cornell University Electronic Student Records Systems Project Report

Appendix B. Preservation storage and processing considerations

In addition to the technical infrastructure for a preservation program that is discussed in Appendix C, there are physical and logistical issues to be addressed in establishing a preservation program for electronic records. Specifically, these issues address how and where electronic media will be stored and how electronic records in archival custody will be processed. This is a general overview that is not specific to the preservation of electronic student records.

Access versus Preservation Formats

Archivists regularly make determinations about access and preservation formats for all types of records. For example, archivists may decide to microfilm or digitize paper records, then decide whether or not to retain the original formats. For electronic records, both access and preservation formats should be identified. It is easy to confuse the two for electronic records. For other types of records, the access format is generally different from the preservation format, such as microfilm for access and paper for preservation, whereas for electronic records both access and preservation formats are generally electronic.

After carefully considering the different types of electronic files in the transfer instructions, an archives can choose to:

Documentation for electronic records also has an access and a preservation format. The documentation required for reference use can be quite minimal when the records are current and the software in which they were created is widely available. Researchers do not want to pay for a lot of unnecessary documentation. Documentation at the time of transfer must be adequate not only for the immediate use of the records, but for use over time. The documentation for long-term preservation should be much more extensive. The "kitchen sink" method of compiling documentation can be the most effective, i.e., gather all of the available documentation when the records are accessioned rather than selecting particular pieces of documentation.

The best way to determine the appropriate preservation format for electronic files is:

Preservation storage for electronic records

Proper storage for electronic media is a key requirement for an effective electronic preservation program. The considerations for proper storage are:

Processing electronic records

Archival processing of electronic records

Archival processing consists of a series of steps that the Archives takes upon the receipt of records into archival custody:

Electronic records that are transferred to the Archives should be processed as soon as possible:

Processing inactive electronic files

If it is necessary to process files that have been in inactive storage outside of archival custody, there are some basic processing considerations:

Handling and use

Established electronic records preservation programs:

Decisions about how many copies to create, how big the annual sample should be, and when to recopy files must be made by the organization based on requirements, experience and resources.

Software-based considerations

Most organizations have records that were created by software in two main categories: databases and desktop documents. This section provides some considerations for records that have been or will be created using these software packages.

Databases

Archivists have had the most experience in preserving database files. The data archives field is a special area devoted to preserving and making available quantitative (and qualitative) data for social science and other research. Many of their methods were very useful in developing procedures and practices for dealing with database files, particularly in the areas of validation and documentation.

There are a variety of database software packages, but for preservation purposes the packages tend to have common characteristics. Most database packages can export ASCII versions of tables that make up the database. Most databases can generate, often in electronic form, documentation, such as definitions of the content of tables and the relationships between tables. It is possible to capture the content of data entry screens, often by using an image processing software package to create images of each data entry screen if the software cannot provide an image. The main considerations for determining the appropriate preservation strategy are:

Database processing

Some of the problems that timely processing can uncover are:

Storage media with perfect internal labels that identify the file as containing the right records, although reading the file shows that it actually contains junk.

Validation is the evaluation process by which an archives ensures that the documentation provided during transfer corresponds to the electronic records that were received. Archival preservation programs for electronic records have developed manual and automated validation for processing records in databases, or structured data files.

Manual validation compares the documentation to sample printouts of records. This is a very time-consuming process that produces unsatisfactory results, but may be the only approach possible when working with some older files in particular.

Automated validation uses the computer to compare the content of the electronic records to the definition of the records in the documentation. The specific type of automated validation procedure to use depends on the type of data being validated. Statistical methods provide validation procedures for some kinds of data. For example, a frequency distribution, a list of codes and the number of occurrences found in a table or file, is used to analyze individual columns of data that contain coded information. This can be used to determine if the list of valid codes provided with the documentation matches the actual content of the column. There are more complicated statistical methods for other kinds of data. Evaluating a column containing narrative information would require a completely different procedure.
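The frequency distribution check described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular archives: the sample table, the column name `status`, and the code list are all invented for the example, and the function name `validate_coded_column` is hypothetical.

```python
import csv
import io
from collections import Counter


def validate_coded_column(rows, column, valid_codes):
    """Build a frequency distribution for one coded column and compare it
    to the list of valid codes supplied with the documentation."""
    freq = Counter(row[column] for row in rows)
    # Codes present in the data but missing from the documentation.
    undocumented = {code: n for code, n in freq.items() if code not in valid_codes}
    # Codes listed in the documentation that never occur in the data.
    unused = sorted(set(valid_codes) - set(freq))
    return freq, undocumented, unused


# Hypothetical delimited export of a table (invented for illustration).
SAMPLE = """id,status
1,A
2,A
3,W
4,X
5,G
"""

# Hypothetical code list as it might appear in the transferred documentation.
documented = {"A": "active", "W": "withdrawn", "G": "graduated", "L": "on leave"}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
freq, undocumented, unused = validate_coded_column(rows, "status", documented)

for code, n in sorted(freq.items()):
    print(f"{code}\t{n}\t{documented.get(code, '** not in documentation **')}")
print("undocumented:", undocumented)
print("unused:", unused)
```

In this sample the code X occurs once without a definition, which is the kind of mismatch between documentation and content that validation is meant to surface. A documented code that never occurs (L here) is not necessarily an error, but it may be worth a question to the creating office.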

A number of national archives and other organizations with established electronic records preservation programs have built computer applications that support the automated validation of electronic records. Other organizations can benefit from these developments, but each organization must adapt and implement preservation solutions to suit its needs. For example, the U.S. National Archives can provide the code for the Oracle application that it developed, but the receiving Archives must still purchase and install Oracle, then test and adapt the application. If a preservation program does not have to process large amounts of data records, it is probably not necessary or advisable to pursue acquiring or building an in-house validation application. It may be more productive to form an alliance with a local university that has the necessary computer resources and a social science or other academic program that requires the tools used in automated validation, e.g., statistical analysis.

Desktop documents

Word processing files

Textual files that have been produced by a word processing software package are very common in most organizations. Simple word processing documents, e.g., files that do not have headers and footers, embedded pointers to other files, or other software-dependent features that may be lost when the files are converted to a software-independent preservation format like ASCII, could be preserved in an ASCII format without losing the basic structure and content of the document. Complex documents, often referred to as hypertext or virtual documents, may lose significant content and context, such as relationships to other documents, when the files are converted to a software-independent preservation format.

For preservation purposes, some archives have opted for using:

Full implementation and acceptance of Unicode or universal code set could assist in the preservation of files in any language. Currently, the non-standard language character sets can be a significant problem for long-term preservation, though need has driven a number of organizations to develop language converters.
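As a rough illustration of what a language converter does, the sketch below decodes text from a legacy code page and re-encodes it as UTF-8, reporting which characters fall outside ASCII. It assumes the source encoding is known from the documentation (Windows code page 1252 is used here purely as an example). The function name `transcode_to_utf8` and the sample text are hypothetical.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8.

    Returns the UTF-8 bytes and a sorted list of the non-ASCII characters
    found, i.e. the characters a plain ASCII conversion could not keep.
    """
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    # which is itself a useful signal that the documentation may be wrong.
    text = raw.decode(source_encoding)
    non_ascii = sorted({ch for ch in text if ord(ch) > 127})
    return text.encode("utf-8"), non_ascii


# Invented sample: text stored in a legacy single-byte code page.
legacy_bytes = "Résumé für Zoë".encode("cp1252")

utf8_bytes, non_ascii = transcode_to_utf8(legacy_bytes, "cp1252")
print(utf8_bytes.decode("utf-8"))
print("characters outside ASCII:", non_ascii)
```

The conversion is only as reliable as the statement of the source encoding. If the character set used by the creating system is not recorded at transfer, the same bytes can decode without error into the wrong characters.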

Spreadsheet files

Many finance and budget applications are built using a spreadsheet software package. These packages can also be used as a type of database because data entry can be very easy and sorting the contents is very flexible, but these packages cannot handle complex database structures or very large amounts of data.

There are two parts of a spreadsheet file that need to be considered for preservation purposes:

There is currently no effective method for saving spreadsheet files in a software-independent preservation format. Some options for the archives include:

The best way to find the appropriate solution for an organization is through experimentation.

Many spreadsheet files may not require permanent retention because they support financial functions. The records that support the functions often have short-term retention periods. Sometimes, however, spreadsheet software packages are used to create simple databases. In this case, the records may be permanent, depending on what the records document.

Graphic and image processing files

It is often necessary to accession image files. Sometimes image files are associated with word processing documents; sometimes image files are stored in databases or are stored as attachments to email messages. Digital imaging produces image files. For many projects, Optical Character Recognition (OCR) is also used to produce a searchable version of the text. The images are records of the organization that creates them, either as part of a primary record creating function or as part of a retrospective conversion for access purposes, possibly by an archives. Organizations have lost a lot of resources by failing to consider the long-term preservation of images that result from digital image projects. Organizations may also generate files that contain diagrams and other images that document the business of the organization and may need to be preserved.

Considerations:

When evaluating series of image files, archivists should be as strict in appraisal decisions about image files as they are with other types of records. They should consider very carefully the value of the image files, the documentation needed, and the purpose for retaining them before committing to preserving image files.

There is currently no effective method for retaining image files in a software-independent format, and given the nature of the files, that would probably not be desirable. The recommended archival format for image files is TIFF.

Processing documents

Most archives that have preservation programs for electronic records have had more experience in dealing with databases than with documents. The evaluation techniques that work for structured records in a database, such as statistical analysis, do not apply to free-form or loosely structured text files. Documents do not require the same kinds of documentation as databases, but it is necessary to have adequate documentation to establish the context of the records, e.g., the creator, the creation date, the purpose, and the creation process.

Considerations for documenting databases for long-term access

Sometimes the term metadata is used in place of the term documentation to mean all of the information that is captured, regardless of format, about the data (or records) in an electronic system. Other times metadata is used, often by IT professionals, to define information about the data (or records) in a system that can be captured and accessed by the system. They mean only that information that is in electronic form, stored within the system, and used to support the system. Recent metadata research expands the scope of captured electronic metadata to include information that is created and maintained as part of long-term access activities including archival description, reference use, etc. Both the format and content of the information seem to be at issue in the varied uses of the terms. This report refers to documentation in the first sense: all information, in all formats, that is captured throughout the life of the records.

Although documentation does not have an agreed-upon, universal, and comprehensive definition that can be presented as a checklist for transfer, it plays a key role in appraisal, accessioning, preservation and reference activities. It is part of the archival record whether the archives defines it as a record in itself or as a finding aid to the electronic records. Without documentation, electronic records are often useless. Without the associated electronic records, the documentation is essentially pointless. It is generally true that the more widely available the file has been, the more complete is the documentation of the content of records and the use and purposes of the system in which the records were created.

Electronic records, particularly structured records that were created using database software, require documentation to be readable, usable and understandable. Documentation is an organized body of information needed to plan, develop, operate, maintain, and use machine-readable records and automated systems. Data file documentation is used to explain the arrangement, contents, and coding of information in a machine-readable file. Two common elements of documentation are:

The documentation required for reference use can be quite minimal when the records are current and the software in which they were created is widely available. Documentation provided at the time of transfer to the archives must be adequate not only for the immediate use of the records but for use over time. The documentation for long-term preservation should be much more extensive.
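To make concrete how data file documentation explains the arrangement, contents, and coding of a machine-readable file, the sketch below reads one fixed-width record using a record layout and a code list. Everything in it is an assumption made for illustration: the field names, positions, codes, and the helper `read_record` are invented and do not describe any real system.

```python
# Hypothetical record layout as it might be transcribed from documentation:
# (field name, start position, length, code list or None for uncoded fields)
LAYOUT = [
    ("record_id", 0, 5, None),
    ("year", 5, 4, None),
    ("status", 9, 1, {"A": "active", "I": "inactive"}),
]


def read_record(line, layout):
    """Split one fixed-width record into named fields and decode coded values."""
    record = {}
    for name, start, length, codes in layout:
        value = line[start:start + length].strip()
        if codes is None:
            record[name] = value
        else:
            # Flag values that the documentation does not define.
            record[name] = codes.get(value, f"undocumented code {value!r}")
    return record


print(read_record("000171998A", LAYOUT))
print(read_record("000181999Z", LAYOUT))
```

Without the layout and the code list, the string `000171998A` is unreadable in any practical sense, which is why the documentation has to travel with the records and be preserved alongside them.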

Users of electronic records need to know how the information in the records was collected, entered and processed. This kind of documentation might be found in justifications for the collection of the information, descriptions of the methodology used to compile the records, sample data collection forms, data entry instructions, user manuals, and reports generated from the records or based on the records. Documentation is not standard and cannot be universally defined. Good documentation should allow the user to read, understand and use the electronic records it describes.

Documentation can come in a variety of formats: paper, electronic or microfiche. The archives is responsible for ensuring that adequate documentation is provided and preserved. The preservation process, reference use, and other activities that are subsequent to the creation of the records may produce documentation that must be preserved as part of the record. The source of all documentation should be clear and explicit.

Documentation should be a comprehensive instruction kit for using electronic records. Defining and compiling adequate documentation may be the toughest part of working with electronic records. The ideal documentation package for an electronic file should answer the classic questions: Who? What? Where? When? Why? How? How many?

Who authorized the creation of the electronic records/system?
Who actually compiled and processed the records/system?
Who created them?
Who makes the records/system available while they are current, semi-current, and non-current?
Who are the users of the records?

What is the structure of the records/system (or related records/systems)?
What is the scope and content of the records/systems?
What is the intellectual structure of the records/system?
What is the physical format of the records/files/system?
What specific pieces of information are in each record/file?
What is the relationship between individual records/files/systems?

Where was the information for the records/system compiled?
Where were the records processed?
What is the geographic coverage of the records/system?

When was the information in the records/system collected?
When were the records processed and completed?
When were the records made available?
What is the date range of the information in the records?

Why were the records/system created?
What is the purpose of the records/system?

How were the records/system created?
How was the information in the records/system collected?
How was the information edited?

How many individual records are there in the files/system?