File Formats for Long-Term Preservation of Electronic Records
The following information includes digital formats the State Archives of North Carolina (State Archives) recommends for in-house preservation and long-term records retention. For electronic records, long-term retention is considered any period 3 - 5 years or longer. The State Archives recommends that any state or local agency record series for which the required retention period is five years or longer be maintained in the following formats. The record types in this document are not exhaustive. State and local agencies producing specialized records may find that certain types of records are not covered by this guidance. Please contact the Digital Services Section to discuss potential preservation strategies for such media.
Visit File Formats for Transfer to the State Archives of North Carolina for information related to file formats for records scheduled for transfer.
Description of Formats Recommended for Long-Term Retention
This category includes texts created in word processing applications like Microsoft® Word and OpenOffice. Unlike plain text files, these documents combine plain text with formatting and styling—including fonts, headings, lists, highlights, notes, and embedded tables and images.
NOTE: Although some word processing files, such as .docx files, are XML-based, for the purposes of these guidelines, these files have been included in the “word processing documents” category and distinguished from “structural markup text documents” due to differences in function and editing software (see Structural Markup Text Documents).
PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A)
PDF/A-1a, also known as “ISO 19005-1 compliant PDF/A,” is a type of PDF document designed for long-term retention. Traditional PDF files have a number of weaknesses that can cause the same file to appear or behave differently when opened on different computers. Compliant PDF/A files overcome these issues and ensure that the PDF file will appear the same everywhere it is opened. Documents produced in word processing software like Microsoft® Word or WordPerfect should be converted to compliant PDF/A files. PDF-Archival (more commonly known as PDF/A) is an international standard developed by the Association for Information and Image Management International (AIIM International) to archive and preserve electronic documents in PDF form. The PDF/A format has been adopted as ISO standard 19005-1:2005 and is widely used by archival institutions, including the National Archives and Records Administration (NARA), Library and Archives Canada (LAC), and the Library of Congress (LOC). Version 1, PDF/A-1, is the current archival standard. It imposes several restrictions on the standard PDF format in order to maximize files’ device independence, self-containment, and self-documentation. Constraints include:
- Audio and video content are forbidden.
- JavaScript and executable file launches are prohibited.
- All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering (not under copyright).
- Colors must be defined according to a universally available, device-independent color model.
- Encryption is disallowed.
- Image transparency is disallowed.
- Use of standards-based metadata and tagging is mandated. This tagging makes documents understandable to screen readers; without it, documents cannot be Section 508 compliant.
PDF/A-1 has two levels of compliance. The State Archives uses PDF/A-1a as a preservation standard. This level indicates “full compliance” with the restrictions listed above.
- PDF/A-1a — “full compliance” with the PDF/A standard. Typically, this is the default setting to which word processing software will save to PDF/A. Another way applications may describe PDF/A-1a is as “ISO 19005-1 compliant PDF/A.”
- PDF/A-1b — “minimal compliance” with the PDF/A standard. PDF/A-1b ensures that the document will look the same in the future (preserves rendering), but it does not preserve the markup of the document.
There is also a PDF/A version 2, or PDF/A-2, which was also adopted by ISO on June 20, 2011.
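The conformance levels described above are declared inside the file itself, in its embedded XMP metadata, through the `pdfaid:part` and `pdfaid:conformance` properties. The following Python sketch reads that declaration. It is an illustration only: the function name is hypothetical, it assumes the XMP packet is stored as uncompressed, single-byte-encoded text in the file, and it reports only what a file claims. It does not validate the file, which requires a dedicated PDF/A validator.

```python
import re
from typing import Optional

# The PDF/A identification properties may be serialized either as XML
# attributes (pdfaid:part="1") or as child elements (<pdfaid:part>1</...>).
_PART = re.compile(rb"pdfaid:part(?:=[\"']|>)\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance(?:=[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return the PDF/A level a file claims (for example "1a"), or None.

    Hypothetical helper: it reads the claim only and does not check that
    the file actually satisfies the PDF/A restrictions.
    """
    part = _PART.search(pdf_bytes)
    conformance = _CONFORMANCE.search(pdf_bytes)
    if part is None or conformance is None:
        return None
    return part.group(1).decode("ascii") + conformance.group(1).decode("ascii").lower()
```

A result of "1a" corresponds to the full-compliance level recommended here, while "1b" indicates the minimal level.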
For more information:
- Library of Congress, “PDF/A-1, PDF for Long-term Preservation, Use of PDF 1.4,” Sustainability of Digital Formats: Planning for Library of Congress Collections.
- Daniel Noonan, Amy McCroy, and Elizabeth L. Black, “PDF/A: A Viable Addition to the Preservation Toolkit,” D-Lib Magazine, 16.11/12 (November/December 2010).
OpenDocument Text
OpenDocument Text is another preservation-quality format in which word processing documents may be retained long-term. OpenDocument Text is similar in structure to the .docx format used by Microsoft® Office. OpenDocument Text is an open, non-proprietary format associated with many word processing applications, including OpenOffice. Most word processing applications can save and convert files to the OpenDocument Text format.
See also: OASIS Open Document Format for Office Applications (OpenDocument) TC.
OpenDocument Text is a sub-type of the OpenDocument Format (ODF), an open source file format for spreadsheets, charts, presentations, and word processing documents. Originally created by Sun Microsystems, the current standards were developed by the Organization for the Advancement of Structured Information Standards (OASIS) Open Document Format for Office Applications committee. The format is based on the XML format used by the OpenOffice.org office suite. The format is also published (in one of its version 1.0 manifestations) as ISO/IEC international standard 26300:2006. See also OpenDocument Spreadsheet (.ods) and OpenDocument Presentation (.odp).
In almost all cases, an OpenDocument Text “file” with the .odt extension is actually a package of several files that have been compressed into a single ZIP file package that carries the .odt extension rather than the .zip extension. Within the zipped package are several separate files that represent the content of the document, its styling, metadata, settings, and a manifest of the zip package files. Although rare, an OpenDocument Text file can also be a single, flat XML file, in which case the associated file extension is usually .xml or .fodt.
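Because the package is an ordinary ZIP file, its parts can be listed with general-purpose tools. The Python sketch below shows one way to do this. The function names are hypothetical, and the demonstration package is a stand-in containing placeholder files named after the usual parts of an OpenDocument package (content, styles, metadata, settings, and manifest); it is not a valid document.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def describe_odf_package(data: bytes) -> dict:
    """List the members of an OpenDocument package and read its mimetype entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = package.namelist()
        mimetype = None
        if "mimetype" in names:
            mimetype = package.read("mimetype").decode("ascii").strip()
    return {"mimetype": mimetype, "members": names}


def build_demo_package() -> bytes:
    """Build a tiny stand-in package so the example runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", ODT_MIMETYPE)
        for name in ("content.xml", "styles.xml", "meta.xml",
                     "settings.xml", "META-INF/manifest.xml"):
            package.writestr(name, "<placeholder/>")  # placeholder, not real ODF
    return buffer.getvalue()


info = describe_odf_package(build_demo_package())
print(info["mimetype"])
print(info["members"])
```

To inspect a real file, the bytes could be read from disk and passed to `describe_odf_package` in place of the demonstration package.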
Google Docs™
Google Docs™ is a cloud-based document editing service offered by Google™. Word processing documents may be created on Google Docs™ and exported in various formats, including Microsoft® Word 97-2003 (.doc), OpenDocument Text (.odt), PDF (.pdf), zipped webpage (.zip), and others. The recommendations described in this document apply to all documents, regardless of whether they were created using Google Docs™. The State Archives of North Carolina recommends that documents be exported from Google Docs™ as OpenDocument Format (.odt). Alternatively, documents can be exported as standard PDF files and then converted to PDF/A.
State agencies should be wary of keeping public records on devices and servers that are not state owned. Consult state IT policies maintained by DIT for more information.
Plain text files are those that contain US-ASCII or Unicode UTF-8 text without styling or structural markup. These are files commonly created with Notepad on Windows® operating systems, TextEdit on Mac® OS X® systems, and Vi text editor on Unix. Numerous other applications are also used to create and edit these files. Technically speaking, while XML, HTML, XHTML, SGML, and many other documents are also plain text documents (typically Unicode UTF-8 encoded), these types of files utilize special markup languages to apply structural and styling rules to the documents’ content. Because of their unique nature, such documents are classified for the purposes of these guidelines as “Structural markup text documents” (see Structural Markup Text Documents).
Plain Text (.txt) US-ASCII or UTF-8 encoding
The data in plain text files is typically encoded in either US-ASCII or Unicode UTF-8 encodings. US-ASCII (American Standard Code for Information Interchange) defines 128 characters, each identified by a 7-bit code that is normally stored in one 8-bit byte. It is the most common encoding for English-language plain text documents. Unicode UTF-8 has a much broader set of characters, allowing for the use of non-Roman scripts (Arabic, Chinese, and Thai, for instance). Its first 128 characters are those used by US-ASCII, making Unicode UTF-8 backwards compatible with US-ASCII and making all US-ASCII text valid Unicode UTF-8 as well. Unicode UTF-8 has become the standard encoding for Web documents, including email.
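The relationship between the two encodings can be checked directly: text that decodes as US-ASCII is always valid UTF-8, but not the reverse. The following Python sketch classifies a sequence of bytes accordingly. The function name is hypothetical, and the sketch only tests whether the bytes decode cleanly; it cannot identify which other encoding a failing file uses.

```python
def classify_encoding(data: bytes) -> str:
    """Classify bytes as 'US-ASCII', 'UTF-8', or 'other'."""
    try:
        data.decode("ascii")
        return "US-ASCII"  # every US-ASCII file is also valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "other"  # e.g. a legacy single-byte encoding
```

For example, the word "Raleigh" is US-ASCII, "café" saved as UTF-8 is classified as UTF-8, and the same word saved in a legacy single-byte encoding such as Latin-1 is classified as "other" and would need conversion before retention.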
Comma-separated file (.csv) US-ASCII or UTF-8 encoding
Comma-separated files are plain text files that store tabular data. Like files with the .txt extension, they are usually encoded in either US-ASCII or Unicode UTF-8. They are distinguished by the fact that they contain values separated by commas and line breaks, so that spreadsheet and database applications (like Microsoft® Excel® and Access®) can easily open and interpret (or “parse”) the data.
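A short Python sketch illustrates how such a file is written and parsed. The sample rows are invented for illustration. Note that a value containing a comma or quotation mark must be quoted so that it is not mistaken for two values, which the standard `csv` module handles automatically.

```python
import csv
import io

# Invented sample data: a header row followed by two records.
rows = [
    ["record_id", "title", "retention_years"],
    ["1", "Board minutes, regular session", "10"],
    ["2", 'Memo re: "draft" policy', "5"],
]

# Write the rows as comma-separated text held in memory.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

# Parse the text back into rows; quoting is undone by the reader.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(text)
```

When saving to disk, opening the file with `encoding="utf-8"` and `newline=""` keeps the output within the encodings recommended above.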
Tab-delimited file (.txt) US-ASCII or UTF-8 encoding
Tab-delimited files are similar to comma-separated files, the difference being that the values in one are separated by commas and in the other by tabs. Tab-delimited files carry the standard .txt extension. As with the .txt files and comma-separated files described above, tab-delimited files should be encoded in either US-ASCII or Unicode UTF-8.
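Since only the delimiter differs, one form can be converted to the other mechanically. The Python sketch below shows a possible conversion from comma-separated to tab-delimited text; the function name is hypothetical.

```python
import csv
import io


def csv_to_tab_delimited(csv_text: str) -> str:
    """Re-write comma-separated text with tabs as the delimiter."""
    source = csv.reader(io.StringIO(csv_text, newline=""))
    target = io.StringIO(newline="")
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    writer.writerows(source)
    return target.getvalue()


print(csv_to_tab_delimited('id,name\n1,"Smith, J."\n'))
```

In the tab-delimited output the value "Smith, J." no longer needs quotation marks, because a comma is not a delimiter there.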
Structural markup text documents, including XML and SGML, have been distinguished from plain text and word processing documents because of the unique functions they serve and the preservation standards they require. Technically speaking, these texts are also plain text files (see Text Documents), and many word processing documents, image files, web sites, and other formats are primarily XML-based. For the purposes of this document, structural markup text documents include individual plain text documents written in markup languages not otherwise belonging to another format category.
SGML with DTD/Schema
Standard Generalized Markup Language (SGML) is a markup language used for formally describing the structure and contents of documents. It is the umbrella language under which HTML, XML, and XHTML were designed. Defined by ISO 8879:1986, SGML files use “tags” to assign style and structure to content. These tags must either be internally defined or externally defined in a document type definition (DTD).
XML (.xml) with DTD/Schema
Extensible Markup Language (XML) is a markup language that describes a document’s storage layout and logical structure in a way that is both human and computer-readable. The term “XML” is applied to both the markup language and the documents produced with it. XML is a subset of the Standard Generalized Markup Language (SGML). XML tags are fully extensible and user-defined. Thus, XML documents must include or refer to documentation of the meaning of the tags (markup declarations). Usually, an XML file achieves this by referencing a document type definition (DTD) or schema in its header, although the file may also include the markup declarations within the XML document itself. The Library of Congress Sustainability of Digital Formats includes more specific information about the language and format of XML.
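As an illustration of the header reference described above, the following Python sketch reads the document type declaration from an XML file and reports which DTD it names. The sample document, the file name inventory.dtd, and the function name are invented for this example. The standard-library parser used here does not fetch or validate against the DTD; it only reports the reference, which is enough to confirm that the accompanying DTD needs to be retained with the file.

```python
from xml.dom import minidom

# Invented sample: an XML document whose header refers to an external DTD.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE inventory SYSTEM "inventory.dtd">
<inventory>
  <record id="1"><title>Board minutes</title></record>
</inventory>
"""


def declared_dtd(xml_bytes: bytes):
    """Return (root element name, DTD system identifier), or None if no DOCTYPE."""
    document = minidom.parseString(xml_bytes)
    doctype = document.doctype
    if doctype is None:
        return None
    return doctype.name, doctype.systemId


print(declared_dtd(SAMPLE))
```

A result of None indicates that the file carries no document type declaration, in which case the meaning of its tags would have to be documented by a schema or by other means.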
Spreadsheets represent tabular data divided into columns and rows of data cells. Column and row headings identify data and allow future users to make sense and meaning of spreadsheet content. Depending on the relative importance of a spreadsheet’s content, formulas, graphs, charts, and sheets, the spreadsheet may need to be preserved in its entirety. For example, the value of cells may be created by formulae that cannot be seen if the spreadsheet is exported to PDF/A or plain text. Instead, it would need to be preserved as an OpenDocument Spreadsheet. Your agency or office will need to carefully determine whether this hidden information (or “metadata”) merits preservation. For more information about retention of metadata, see Metadata as a Public Record in North Carolina: Best Practices Guidelines for Its Retention and Disposition.
Many spreadsheets have important metadata such as formulas and styling information. This metadata is not always visible to the reader but is critical to rendering the data. When deciding between formats, it is important to consider whether your spreadsheets include this kind of information. OpenDocument Spreadsheets are capable of preserving formulas, hyperlinks, graphs, charts, and the relationships between multiple sheets. Comma-separated files and tab-delimited files are not.
OpenDocument Spreadsheet (.ods)
OpenDocument Spreadsheet is a sub-type of the OpenDocument Format (ODF), an open source file format for spreadsheets, charts, presentations, and word processing documents. Originally created by Sun Microsystems, the current standards were developed by the Organization for the Advancement of Structured Information Standards (OASIS) Open Document Format for Office Applications committee. The format is based on the XML format used by the OpenOffice.org office suite. The format is also published (in one of its version 1.0 manifestations) as ISO/IEC international standard 26300:2006. See also OpenDocument Text (.odt) and OpenDocument Presentation (.odp).
In almost all cases, an OpenDocument Spreadsheet “file” with the .ods extension is actually a package of several files that have been compressed into a single ZIP file package. Within the zipped package are several separate files that represent the content of the document, its styling, metadata, settings, and a manifest of the zip package files. An OpenDocument Spreadsheet file can also be a single, flat XML file; this is rare, however, and the associated file extension is usually .xml or .fods.
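Because the content of the package is XML, it is possible to check whether a spreadsheet contains formulas before deciding whether a plain text export would lose information. The Python sketch below counts cells that carry a formula. It is a hedged illustration: the function name is hypothetical, the demonstration content is a minimal hand-written fragment rather than a complete spreadsheet, and the sketch relies on formulas being stored in a `table:formula` attribute on cell elements in the package's content.xml file, which is how the OpenDocument specification represents them to the best of our understanding.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Minimal hand-written stand-in for content.xml: one plain cell, one formula cell.
DEMO_CONTENT = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    f' xmlns:table="{TABLE_NS}">'
    '<office:body><office:spreadsheet><table:table table:name="Sheet1">'
    '<table:table-row>'
    '<table:table-cell/>'
    '<table:table-cell table:formula="of:=SUM([.A1:.A2])"/>'
    '</table:table-row></table:table></office:spreadsheet></office:body>'
    '</office:document-content>'
)


def count_formula_cells(ods_bytes: bytes) -> int:
    """Count cells in content.xml that carry a table:formula attribute."""
    with zipfile.ZipFile(io.BytesIO(ods_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    formula_attr = f"{{{TABLE_NS}}}formula"
    return sum(
        1
        for cell in root.iter(f"{{{TABLE_NS}}}table-cell")
        if formula_attr in cell.attrib
    )


def build_demo_ods() -> bytes:
    """Wrap the demonstration content in a ZIP package (not a complete .ods)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", DEMO_CONTENT)
    return buffer.getvalue()


formula_count = count_formula_cells(build_demo_ods())
print(formula_count)
```

A count above zero suggests that exporting to a comma-separated or tab-delimited file would discard the formulas and keep only their results.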
Comma-separated file (.csv)
See also Comma-separated file (.csv) US-ASCII or UTF-8 encoding. Comma-separated files are plain text files that store tabular data. They are capable of storing spreadsheets without styling or formatting (such as borders, fonts, column widths, etc.). Like files with the .txt extension, they are usually encoded in either US-ASCII or Unicode UTF-8. They are distinguished by the fact that they contain values separated by commas and line breaks such that spreadsheet and database applications (like Microsoft® Excel® and Access®) can easily open and parse the data.
Tab-delimited file (.txt)
See also Tab-delimited file (.txt) US-ASCII or UTF-8 encoding. Tab-delimited files are similar to comma-separated files, the difference being that the values in one are separated by commas and in the other by tabs. Tab-delimited files carry the standard .txt extension. Like comma-separated files, tab-delimited files are not capable of storing spreadsheet formulas, styling, or formatting (such as borders, fonts, column widths, etc.). As with the comma-separated files described above, tab-delimited files should be encoded in either US-ASCII or Unicode UTF-8.
PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A)
PDF/A may be an appropriate format for preserving spreadsheets, where styling, graphs, and charts are important elements to preserve, but formulas are not. PDF/A preserves the rendering—or “look and feel”—of the original spreadsheet, but hidden types of information like formulas are lost.
PDF-Archival (more commonly known as PDF/A) is an international standard developed by the Association for Information and Image Management International (AIIM International) for the use of PDF files for archiving and preservation of electronic documents. The State Archives of North Carolina recommends PDF/A, Version 1, full compliance (PDF/A-1a) as a preservation format for word processing documents and other files. See also PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A).
Special Note on Google Docs™
Google Docs™ is a cloud-based document editing service offered by Google™. Spreadsheets may be created on Google Docs™ and exported in various formats, including Microsoft® Excel® 97-2003 (.xls), OpenDocument Spreadsheet (.ods), Comma-separated file (.csv), HTML (.html), and others. The recommendations described in this document apply to all spreadsheets, regardless of whether they were created using Google Docs™. The State Archives of North Carolina recommends that spreadsheets with styling, formulas, graphs, charts, or relationships between multiple sheets be saved in the OpenDocument Spreadsheet format. Those without styling, formulas, graphs, charts, or multiple sheets may be saved as comma-separated or tab-delimited files. See OpenDocument Spreadsheet (.ods), Comma-separated file (.csv), and Tab-delimited file (.txt).
State agencies should be wary of keeping public records on devices and servers that are not state owned. Consult state IT policies maintained by DIT for more information.
Digitized audio “samples” sound waves at intervals, rather than recording the entire continuous sound wave as analog audio does. The digitized samples are then encoded into binary signal and packaged into a file format that tells software how to read the encoded binary data. The digitized audio file format also provides technical and descriptive information about the file (called “metadata”), such as the sampling rate, the quality of each sample (measured by bit depth), the creator of the original audio, the playback time, the date of creation, etc.
Broadcast WAVE Format LPCM (.wav)
The Broadcast WAVE format (BWF) with LPCM encoding is a subtype of the WAVE format (Waveform Audio File Format). The BWF format was introduced by the European Broadcasting Union (EBU) in 1997 and has since gained widespread use as the preferred archival format for audio files. Version 0 appeared in 1997, Version 1 in 2001, and Version 2 in May 2011. Versions 0 and 1 are very similar, and Version 2 includes new loudness metadata.
The standard WAVE specification allows for an unlimited number of data “chunks” to sit in the head of a WAVE file. A BWF file simply includes additional metadata in the head of the file, including the EBU’s “Broadcast Audio Extension” chunk, commonly known as the “bext” chunk. The bext chunk allows for important archival metadata to be embedded in the file, including the title of the recording, the recording’s creator, whether the recording is part of a compilation, and much more. This information tells listeners what they are listening to, identifies essential preservation information, and allows multi-part recordings (such as multiple tracks) to be played back properly.
The data within a WAVE file is usually encoded with Linear Pulse Code Modulated Audio (LPCM), although it can also contain other variations of Pulse Code Modulated Audio (such as DPCM or ADPCM) and MPEG-encoded audio. The recommended preservation standard is to use LPCM. Alternative encodings are rarely used.
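The chunk structure and encoding described above can be inspected with a few lines of code. The following Python sketch walks the chunks of a WAVE file held in memory, reports whether a “bext” chunk is present, and checks whether the format chunk declares plain PCM. The function names are hypothetical, and the demonstration file is a half-second of silence generated on the spot. The sketch treats only format tag 1 as LPCM; files that use the extensible WAVE header declare their encoding differently and would need an additional check.

```python
import io
import struct
import wave


def scan_wave(data: bytes) -> dict:
    """Walk the RIFF chunks of a WAVE file held in memory."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunk_ids, format_tag = [], None
    offset = 12  # first chunk follows the 12-byte RIFF/WAVE header
    while offset + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, offset)
        chunk_ids.append(chunk_id.decode("ascii", "replace"))
        if chunk_id == b"fmt ":
            # The format chunk begins with a 16-bit encoding identifier.
            (format_tag,) = struct.unpack_from("<H", data, offset + 8)
        offset += 8 + size + (size & 1)  # chunks are padded to an even length
    return {
        "chunks": chunk_ids,
        "has_bext": "bext" in chunk_ids,
        "is_lpcm": format_tag == 1,  # tag 1 identifies uncompressed PCM
    }


def make_demo_wave() -> bytes:
    """Generate a short silent 16-bit mono file for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(48000)
        writer.writeframes(b"\x00\x00" * 24000)
    return buffer.getvalue()


report = scan_wave(make_demo_wave())
print(report)
```

The generated file is a plain WAVE file, so the report shows LPCM audio without a “bext” chunk; a Broadcast WAVE file would list “bext” among its chunks.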
For more technical information about BWF preservation, please see the Federal Agencies Digitization Guidelines Initiative Audio-Visual Working Group recommendations.
WAVE Format LPCM (.wav)
The Waveform Audio File Format (WAVE) is a standard master format for digital audio. Although it can contain compressed audio, WAVE files nearly always contain audio in uncompressed linear pulse code modulation format (LPCM).
WAVE files are widely used throughout the commercial and preservation sectors with a standardized set of additional metadata fields contained within the “bext” header chunk (see Broadcast WAVE Format LPCM (.wav)). WAVE files that do not contain this additional metadata chunk will be missing important information that will aid in their long-term preservation, and may not easily be identifiable to listeners.
Digital videos combine multiple elements, including visual data, audio data, subtitles or pointers to external subtitles, and descriptive information (metadata) essential for playback. Digital video files are complex, and have many layers of encoded data. In order to be able to access a digital video file, software must be able to recognize not only the umbrella file format, but also the encoders used to package the video and audio inside the file format. An MXF file, for example, may contain JPEG2000-encoded image files representing every frame in the video, wrapped into the Motion JPEG2000 format, combined with PCM audio. MXF provides the final container that links the Motion JPEG with the PCM audio, but it could also be used to link other forms of audio and video. Although the file extension (.mxf, .mov, or .mp4, for example) reflects the final container, it does not necessarily identify the component parts of the digital video.
AVI, full frame (uncompressed), WAVE PCM audio (.avi)
AVI, or Audio Video Interleaved, is a multimedia container file format developed by Microsoft®. Conforming to RIFF (Resource Interchange File Format), AVI is a fully documented, proprietary format that has been widely adopted for video production and filmmaking. The National Archives and Records Administration (NARA) uses AVI as a preservation master format for reformatted video materials, and NARA supports the open-source AVI MetaEdit tool for the capture and normalization of AVI file embedded metadata. AVI files may contain full frame uncompressed video or compressed video, including MPEG, JPEG 2000, DV Digital Video, DivX, and other compression codecs. Audio in AVI files is WAVE PCM.
AVI MetaEdit can be downloaded from NARA’s GitHub site.
Special Note on SD (Standard Definition) and HD (High Definition) videos
Several factors independent of file format help determine the quality and playability of digital video files, including the display resolution, scanning type (progressive scanning or interlaced scanning), and frame rate. The State Archives will accept digital video files that adhere to established NTSC standard broadcast resolutions for either SD (Standard Definition) or HD (High Definition) video.
Standard Definition NTSC:
720 x 480 29.97fps (480i, 480p)
Aspect ratios: 4:3 or 16:9
High Definition NTSC:
1280 x 720 (720p60, 720p30, 720p24)
1920 x 1080 (1080i60, 1080p30, 1080p24)
Aspect ratio: 16:9
Raster images, also known as “bit-mapped” images or “bitmaps,” are still images created with a grid of pixels, or very small squares of color.
TIFF (.tif), uncompressed
The Tagged Image File Format (TIFF) is the preferred preservation file format for raster images. Although the TIFF specification is owned by Adobe® Corporation, the format is fully documented, extensible, and widely adopted. The State Archives recommends the use of TIFF v. 6.0 uncompressed baseline RGB for color images, TIFF v. 6.0 uncompressed baseline grayscale for grayscale images, and TIFF v. 6.0 Group IV/Huffman compressed baseline bi-tonal for typographic documents where there are no fine details, light markings, handwritten or pencil notations. This means that preservation TIFF files should be:
- version 6.0, which was released in 1992 and is the most recent TIFF specification. TIFF 6.0 Supplement 2, which was released in 2002 and introduced two additional compression schemes used when saving TIFF files in Adobe® Photoshop®, does not affect the recommended preservation format.
- baseline. This simply means that the minimum (or baseline) tags are present that make a TIFF file a TIFF file. If your file does not meet the minimum baseline tagging requirements, it is not a valid TIFF file and software may report that the file is corrupted when attempting to open it. Software will likely be able to save TIFF files at baseline by default.
- RGB, grayscale, or bi-tonal, depending on the appearance of the analog original or encoding of the born-digital original. Baseline TIFF files have four configurations: bi-tonal (black and white), grayscale, palette-color (limited color palette), and full-color RGB. If your original records are color images (photos, maps, text with colored notations, or any born-digital image that originated in RGB), you should preserve that image as a full-color RGB TIFF file. If you are scanning grayscale documents (such as grayscale maps, typographic documents that are difficult to read or that have pencil markings, handwritten documents, etc.) or have born-digital images that originated in grayscale, you should save the images as uncompressed grayscale TIFF files. For scanned images of black and white typographic documents where visual detail is not important and there are no fine details, light markings, handwritten or pencil notations, Group IV/Huffman bi-tonal (black and white) TIFF files may be used.
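Whether an existing file matches these recommendations can be read from its tags. The Python sketch below reads the Compression and PhotometricInterpretation tags from the first image directory of a TIFF file. It is an illustration under stated assumptions: the function names are hypothetical, only the first image in the file is examined, and the demonstration data is a bare header and tag directory without pixel data, so it is sufficient for this reader but is not a viewable image.

```python
import struct

# Tag values as defined in the TIFF 6.0 specification.
COMPRESSION = {1: "uncompressed", 2: "CCITT modified Huffman RLE",
               3: "CCITT Group 3", 4: "CCITT Group 4", 5: "LZW"}
PHOTOMETRIC = {0: "WhiteIsZero", 1: "BlackIsZero", 2: "RGB", 3: "palette color"}


def read_tiff_tags(data: bytes) -> dict:
    """Read Compression (259) and PhotometricInterpretation (262) from the first IFD."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])  # little- or big-endian
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for index in range(entry_count):
        entry = ifd_offset + 2 + 12 * index  # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack_from(order + "HHI", data, entry)
        if tag in (259, 262) and field_type == 3 and count == 1:
            # A single SHORT value is stored directly in the entry's value field.
            (tags[tag],) = struct.unpack_from(order + "H", data, entry + 8)
    return {
        # Compression defaults to 1 (uncompressed) when the tag is absent.
        "compression": COMPRESSION.get(tags.get(259, 1), "other"),
        "photometric": PHOTOMETRIC.get(tags.get(262), "other"),
    }


def make_demo_header() -> bytes:
    """Build a header and tag directory only (no pixel data) for demonstration."""
    entries = [(259, 3, 1, 1), (262, 3, 1, 2)]  # uncompressed, RGB
    body = struct.pack("<2sHI", b"II", 42, 8) + struct.pack("<H", len(entries))
    for tag, field_type, count, value in entries:
        body += struct.pack("<HHIHH", tag, field_type, count, value, 0)
    return body + struct.pack("<I", 0)  # no further image directories


tiff_report = read_tiff_tags(make_demo_header())
print(tiff_report)
```

A grayscale scan would report BlackIsZero or WhiteIsZero, and a bi-tonal file compressed as recommended above would report CCITT Group 4.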
JPEG 2000 (.jp2)
Joint Photographic Experts Group JPEG 2000 is an open, published ISO standard (ISO 15444-1:2004). TIFF has long been established as the archival preservation file format of choice for raster images, and JPEG 2000 is increasingly being considered a viable format as well. It is yet to be widely adopted, however, and the State Archives recommends JPEG 2000 with reservations. Agencies should use JPEG 2000 as a preservation format only where staff with technical proficiency are familiar with the format and/or where JPEG 2000 is already in use in the office. Although JPEG 2000 is less widely adopted than TIFF, JPEG 2000 offers several advantages, including a highly efficient lossless encoding that allows for very high quality images at very low file sizes. There are three types of JPEG 2000 image file formats, the first of which is the most widely accepted for long-term preservation:
- JPEG 2000 Part 1 (JP2, .jp2) — International ISO standard ISO 15444-1:2004, JP2 is the core coding system and the most widely adopted by preservation institutions, including Library of Congress (LOC), Library and Archives Canada (LAC), the National Library of the Netherlands, the British Library, the Wellcome Library, the National Library of Norway, and the National Library of the Czech Republic.
- JPEG 2000 Part 2 (JPX, .jpx, .jpf) — International ISO standard ISO 15444-2:2001, JPX is an extension of JP2 that allows for additional colorspaces, the specification of opacity, standardized metadata, multiple image data streams, and more non-contiguous internal organization of the image data. Although the official MIME type extension is .jpf, some applications may save files as .jpx. Library of Congress notes the following information in its recommendation of JPX alongside JP2 as a preservation format for still images: “The JPX level of the JPEG2000 standard supports more effective color management than the level 1 format (JP2). JPEG2000 offers many options for choices of quality level, and storage order for the encoded image data (codestream). Future investigation is needed to determine whether particular options should be encouraged or avoided when the objective is responsible long-term custody.”
- JPEG 2000 Part 6 (JPM, .jpm) — International ISO standard ISO/IEC 15444-6:2003, and based on the Mixed Raster Content standard ISO/IEC 16485:2000, JPM is designed to combine bi-tonal and continuous-tone images into compound images. Library and Archives Canada (LAC) includes this format in their description of the recommended JPEG 2000 format.
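Because file extensions are not always reliable, the three types listed above can also be told apart by the brand recorded at the start of the file. The Python sketch below checks the fixed 12-byte JPEG 2000 signature and then reads the brand from the file type box that follows it. The function name and the labels it returns are hypothetical, and the sketch does not decode or validate the image itself.

```python
# Fixed signature box that opens every file in the JPEG 2000 family.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")

# Brand codes recorded in the file type box, mapped to descriptive labels.
BRANDS = {
    b"jp2 ": "JP2 (Part 1)",
    b"jpx ": "JPX (Part 2)",
    b"jpm ": "JPM (Part 6)",
}


def identify_jpeg2000(data: bytes):
    """Return a label for the declared brand, or None if not a JPEG 2000 file."""
    if not data.startswith(JP2_SIGNATURE):
        return None
    if data[16:20] != b"ftyp":  # the file type box follows the signature
        return None
    return BRANDS.get(data[20:24], "unknown brand")
```

Only the first 24 bytes of a file are needed, so the check can be run across a large set of images without reading them in full.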
Whereas raster images are built by small dots of color, vector images are created mathematically with the geometry of points, lines, curves, and polygons. Raster images tend to be used for photographs and photo-realistic images, while vector images are used for structured pictures, such as architectural drawings, graphic designs, and engineering drawings. Vector images are also used widely in geospatial databases, which this section does not cover. For geospatial vector sets, see Geospatial Vector Datasets.
Scalable Vector Graphics 1.1 (.svg)
SVG is a widely adopted and open standard from W3C that is used for creating two-dimensional graphics in XML. Its official description is as follows: “SVG is a language for describing two-dimensional graphics in XML [XML10]. SVG allows for three types of graphic objects: vector graphic shapes (e.g., paths consisting of straight lines and curves), images and text. Graphical objects can be grouped, styled, transformed and composited into previously rendered objects. The feature set includes nested transformations, clipping paths, alpha masks, filter effects and template objects.”
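Since an SVG file is XML text, its contents can be examined with any XML parser. The Python sketch below parses a small, invented SVG 1.1 drawing containing a grouped path and a text label, confirms that the root element is an SVG element, and counts the elements it contains. The sample drawing and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Invented sample: a triangle drawn as a path, grouped and labeled with text.
SAMPLE_SVG = (
    f'<svg xmlns="{SVG_NS}" version="1.1" width="120" height="80">'
    '<g fill="none" stroke="black">'
    '<path d="M10 70 L60 10 L110 70 Z"/>'
    '</g>'
    '<text x="60" y="78" text-anchor="middle">Plan A</text>'
    '</svg>'
)


def summarize_svg(svg_text: str) -> dict:
    """Report the declared SVG version and a count of each element type."""
    root = ET.fromstring(svg_text)
    if root.tag != f"{{{SVG_NS}}}svg":
        raise ValueError("root element is not an SVG <svg> element")
    counts = {}
    for element in root.iter():
        local_name = element.tag.split("}", 1)[-1]  # drop the namespace prefix
        counts[local_name] = counts.get(local_name, 0) + 1
    return {"version": root.get("version"), "elements": counts}


svg_summary = summarize_svg(SAMPLE_SVG)
print(svg_summary)
```

The declared version can be compared against the 1.1 version named in this recommendation, although the attribute is optional and may be absent from otherwise acceptable files.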
AutoCAD® Drawing Interchange Format (.dxf)
The Drawing Interchange Format was developed and is owned by Autodesk®, the producer of AutoCAD®. AutoCAD®’s native Drawing Format, DWG, is currently the de facto standard for vector graphics. The DWG format, however, is proprietary and has not been released. Autodesk® instead recommends the use of the Drawing Interchange Format (DXF) for data exchanges. The DXF specification, which is revised alongside each release of AutoCAD®, was designed to be exchanged with other CAD applications. The format is owned by Autodesk® but freely available to use. The UK National Archives writes that “DXF is a complex format, and the quality and sophistication of its implementation in different applications varies considerably. The frequent changes to the specification can also cause compatibility problems. In particular, users must be aware that some applications may read a DXF file whilst skipping unsupported features. This can lead to the loss of information in a manner that may not be obvious to the user.” The State Archives of North Carolina recommends that the AutoCAD® Drawing Interchange Format (.dxf) be used as a preservation format only where PDF/A or SVG are not appropriate.
PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A)
PDF-Archival (more commonly known as PDF/A) is an international standard developed by the Association for Information and Image Management International (AIIM International) for the use of PDF files for archiving and preservation of electronic documents. The State Archives of North Carolina recommends PDF/A, Version 1, full compliance (PDF/A-1a) as a preservation format for word processing documents and other files. For some images created in raster form, PDF/A-1a may be an appropriate preservation format, particularly for simple images where 2D visual rendering is more important than manipulability. See also PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A).
Databases contain structured data organized into fixed fields in computer-readable files. Today, nearly all databases are relational databases, in which data is contained in a series of formally-described and related tables from which data can easily be queried. A database management system (DBMS) refers to the software used to manage databases, rather than to the content of the databases themselves. Common DBMSs include Microsoft® Access®, Microsoft® SQL Server, and Oracle® MySQL. Specialized fields like astronomy, social science, and engineering often utilize unique DBMSs. The database preservation category focuses not on the DBMSs, but on the formats designed for the exchange of data from one DBMS to another.
Software Independent Archiving of Relational Databases (SIARD)
The Software Independent Archiving of Relational Databases (SIARD) format was developed by the Swiss Federal Archives in response to the lack of a standardized archiving format for databases. It is an XML-based format designed for the long-term preservation of relational database content. First introduced by the Swiss Federal Archives in 2004, it has since been further developed within the PLANETS project. In 2008, the Swiss Federal Archives released a full-fledged version of the SIARD format with associated software.
A SIARD archive is a single uncompressed ZIP container that holds two folders: a metadata folder and a content folder. The metadata folder contains an identification of the database, format version, lists of tables, views, routines, table constraints and triggers, SQL type, LOBs (Large Objects) names, and relations. The content folder holds the schema and table data in XML files.
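The container layout described above can be confirmed with a general-purpose ZIP reader. The Python sketch below lists the top-level folders of an archive and checks that every member is stored without compression. The function names are hypothetical, and the demonstration archive contains placeholder files only. The folder and file names used in the demonstration (header/ for the metadata folder, content/ for the table data) reflect our understanding of the published SIARD layout and should be treated as assumptions rather than as a statement of the specification.

```python
import io
import zipfile


def inspect_archive(data: bytes) -> dict:
    """List top-level folders and check that all members are stored uncompressed."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = archive.infolist()
    top_level = sorted({info.filename.split("/", 1)[0] for info in infos})
    all_stored = all(info.compress_type == zipfile.ZIP_STORED for info in infos)
    return {"top_level": top_level, "uncompressed": all_stored}


def build_demo_archive() -> bytes:
    """Build a stand-in archive with placeholder files (names are assumptions)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr("header/metadata.xml", "<placeholder/>")
        archive.writestr("content/schema0/table0/table0.xml", "<placeholder/>")
    return buffer.getvalue()


archive_report = inspect_archive(build_demo_archive())
print(archive_report)
```

An archive that reports compressed members, or that lacks one of the two expected folders, would merit a closer look with the SIARD software before it is relied on for preservation.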
Delimited Flat File (Plain Text) with DDL
Delimited plain text files may be used for archiving simple database content or content from legacy database applications (see 2.2.1 Plain Text for acceptable encodings). Data fields should be delimited using commas, tabs, or another delimiter (see 2.2.1 Comma-separated file (.csv) and 2.2.2 Tab-delimited file (.csv)), rather than being stored as fixed-length flat files.
So that the data can be identified and understood, the database must, at minimum, be accompanied by its data definition language (DDL) statements. Any additional contextual information transferred to the State Archives with the database (such as data dictionaries and relational diagrams) should be submitted in an appropriate preservation format.
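The pairing of delimited data with its DDL can be sketched as follows. This is a minimal example that assumes a SQLite database, where the stored `CREATE TABLE` statement can be read from the `sqlite_master` catalog; other DBMSs expose their DDL differently. The table name `permits` and the helper names are hypothetical.

```python
import csv
import io
import sqlite3


def export_table(conn, table):
    """Return (ddl, csv_text) for one table in a SQLite database."""
    # SQLite keeps each table's CREATE statement in the sqlite_master catalog.
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"no such table: {table}")
    ddl = row[0]

    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
    return ddl, out.getvalue()


def build_sample_db():
    """Create a small in-memory database (hypothetical contents)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE permits (id INTEGER PRIMARY KEY, holder TEXT)")
    conn.execute("INSERT INTO permits VALUES (1, 'Smith, J.')")
    conn.commit()
    return conn


if __name__ == "__main__":
    ddl, text = export_table(build_sample_db(), "permits")
    print(ddl)
    print(text)
```

The `csv` module quotes any field that contains the delimiter, which is why a value such as `Smith, J.` survives the round trip intact.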
Presentations are image, text, and audio-based displays of information, usually in the form of a slide show. Common tools like Microsoft® PowerPoint®, OpenOffice.org Impress, Corel® Presentations, and Google Docs™ allow users to create, edit, and present such files. Presentations may include not only images, text, and audio, but also timed animations, hyperlinks, and click effects. Web-based applications like Google Docs™ allow for the online, collaborative creation of such presentations, and export files in common pre-existing formats.
OpenDocument Presentation (.odp)
OpenDocument Presentation (.odp) is a sub-type of the OpenDocument Format (ODF), an open source file format for spreadsheets, charts, presentations, and word processing documents. Originally created by Sun Microsystems, the current standards were developed by the Organization for the Advancement of Structured Information Standards (OASIS) Open Document Format for Office Applications committee. The format is based on the XML format used by the OpenOffice.org office suite. The format is also published (in one of its version 1.0 manifestations) as ISO/IEC international standard 26300:2006. See also OpenDocument Text (.odt) and OpenDocument Spreadsheet (.ods).
An OpenDocument Presentation “file” with the .odp extension is actually a package of several files that have been compressed into a single ZIP file package. Within the zipped package are several separate files that represent the content of the document (including images, notes, and text), animations and click effects, themes, styles, layouts—as well as a manifest of the zip package files.
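To make the package structure concrete, the sketch below reads the manifest of an OpenDocument package and lists the files it declares. It assumes the manifest is stored at `META-INF/manifest.xml` and uses the OpenDocument manifest namespace; the function names and the sample package contents are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def list_manifest_entries(package):
    """Return (path, media type) pairs from an OpenDocument package manifest."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    entries = []
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        media_type = entry.get(f"{{{MANIFEST_NS}}}media-type")
        entries.append((path, media_type))
    return entries


def build_sample_package():
    """Build a minimal stand-in for an .odp package in memory."""
    manifest = (
        f'<manifest:manifest xmlns:manifest="{MANIFEST_NS}">'
        '<manifest:file-entry manifest:full-path="/" '
        'manifest:media-type="application/vnd.oasis.opendocument.presentation"/>'
        '<manifest:file-entry manifest:full-path="content.xml" '
        'manifest:media-type="text/xml"/>'
        '<manifest:file-entry manifest:full-path="styles.xml" '
        'manifest:media-type="text/xml"/>'
        "</manifest:manifest>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("META-INF/manifest.xml", manifest)
        zf.writestr("content.xml", "<placeholder/>")
        zf.writestr("styles.xml", "<placeholder/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for path, media_type in list_manifest_entries(build_sample_package()):
        print(path, media_type)
```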
PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A) for presentations without animation
PDF-Archival (more commonly known as PDF/A) is an international standard developed by the Association for Information and Image Management International (AIIM International) for the use of PDF files for archiving and preservation of electronic documents. The State Archives of North Carolina recommends PDF/A, Version 1, full compliance (PDF/A-1a) as a preservation format for presentations without audio or animations. See PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A).
State Agency Employees
Currently, state agency email is managed either by the Department of Information Technology (DIT) for consolidated agencies or by departmental IT units for agencies outside of the DIT tenant. Regardless of who manages the stored email, state agencies are required to consult their records retention schedules and identify retention periods for their email accounts and messages. When consulting records retention schedules, state agency employees should be aware that records are defined by content, not media. Thus, schedules will not include a record series titled simply “email.” Rather, email can contain several types of records. Examples include correspondence, meeting agendas, employee leave requests, conference materials, employee vehicle requests, law enforcement case files, and more.
Over time, some emails of state employees have been delivered to the State Archives of North Carolina in the Microsoft® Outlook® Personal Storage Format (.pst), where they are then converted into XML for long-term preservation. Going forward, DIT and the State Archives are working on a project for identifying and transferring email from Capstone accounts, accounts that capture agency histories and decision-making. In most cases, Capstone accounts will be upper management for departments and divisions, but agencies should consult with their records analysts to identify positions that may also include emails with historical value. For more information about email in general, please consult the Digital Communications page.
The guidance below deals specifically with email created by local government employees.
Local Government Employees
Local government employees should consult their retention schedules to determine the retention periods for their emails. When consulting records retention schedules, local government employees should be aware that records are defined by content, not media. Thus, no schedule will contain a record series titled simply “email.” Rather, email can contain several types of records. Examples include correspondence, meeting agendas, employee leave requests, conference materials, employee vehicle requests, law enforcement case files, and more. The following format recommendations are intended for local government employees.
Formats: Multiple Emails & Email Accounts
In common usage, “email” is used to refer to one of two things:
- an individual email message,
- an entire email account, including messages, message folders, contacts, calendars, and tasks.
Most email file formats are designed to hold a single message, multiple messages, or an entire email account, including non-message data. This section discusses file formats that aggregate multiple email messages or contain entire accounts with non-message data.
Agencies may consider retaining email in either form, depending on what records are contained within the email. Public records with long-term retention may appear in several areas throughout an email account.
Microsoft® Outlook® Personal Storage Table (.pst)
The Personal Storage Table (.pst) format is an open, proprietary format owned by Microsoft® and used primarily by the Microsoft® Exchange Client, Windows® Messaging, and Microsoft® Outlook®. Although Microsoft® owns copyright to the format, it is freely published to allow open development of tools that can open, process, manage, and convert .pst files.
The State Archives of North Carolina recommends that local agencies that use Outlook® and need to retain entire email accounts retain those accounts as PST files. The PST format, however, changes frequently, and PST files should be re-saved whenever a new release of Microsoft® Office is installed. If individual employees are retaining PST files locally, then each time Microsoft® Office is updated on their computers, they should open all PST files in Outlook® and re-save them in the updated PST format. Keep in mind that there may be little apparent difference between two versions of a PST file. Updating PST files in this manner is essential if they are to be retained long-term.
PST files are highly complex. They contain messages within folder hierarchies, message attachments, as well as calendars, contacts, tasks, email flags and categorization, and other data. Email accounts can quickly become very large, and PST files (like any file) have a maximum size beyond which the file may easily corrupt. PST files produced prior to Outlook® 2003 have a maximum file size of 2 GB. Those produced in Outlook® 2003 and 2007 have a default maximum size of 20 GB, and those produced in Outlook® 2010 and later have a default maximum size of 50 GB. Files beyond these size limits should be divided into smaller files that do not exceed the maximum recommended size. This can be accomplished by opening the PST file in Outlook® and re-saving selected sections separately.
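The size limits above lend themselves to a simple check. The following sketch uses the 2 GB, 20 GB, and 50 GB figures from the text; the generation labels, constant names, and function names are hypothetical, and the check does not inspect the PST file itself.

```python
GB = 1024 ** 3

# Size limits as stated in the text above; the keys are hypothetical labels
# for the oldest, middle, and newest generations of the PST format.
PST_LIMITS = {
    "pre-2003": 2 * GB,
    "2003-2007": 20 * GB,
    "later": 50 * GB,
}


def needs_splitting(size_bytes, generation):
    """Return True when a PST file is larger than the limit for its generation."""
    return size_bytes > PST_LIMITS[generation]


def minimum_parts(size_bytes, generation):
    """Return the smallest number of files needed to stay within the limit."""
    limit = PST_LIMITS[generation]
    return max(1, -(-size_bytes // limit))  # ceiling division


if __name__ == "__main__":
    size = 45 * GB
    print(needs_splitting(size, "2003-2007"), minimum_parts(size, "2003-2007"))
```

In practice the size could be obtained with `os.path.getsize(path)` before calling either function.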
MBOX (.mbox, .mbx)
For those agencies that do not use Microsoft® Outlook®, MBOX may be an acceptable format for retention. The State Archives only recommends MBOX be used where PST files are not available and it is not viable to save messages individually.
MBOX is not a file format per se, but rather a family of four related storage formats. Different email clients implement MBOX in different ways, but MBOX files generally store all messages within a single email folder. Different MBOX formats mark the end of one message and the beginning of another in slightly different ways. MBOX has become a de facto standard across email clients, with different clients employing one of the four types of MBOX file formats. It may be necessary to open your MBOX files in Notepad or another text editor to determine the precise MBOX format employed by your email client.
It is also important to note that MBOX files include attachments (spreadsheets, word processing documents, images, etc. that the sender attached to the email) embedded into the email. Note, however, that the State Archives of North Carolina recommends that any documents attached to emails be preserved separately and in their native file format. Attachments should be downloaded and preserved as separate files.
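One possible way to pull embedded attachments out of an MBOX file is sketched below using Python's standard `mailbox` module, which reads the common MBOX variant. Whether it reads the variant produced by a particular email client is an assumption that should be verified. The function names, file names, and sample message are hypothetical.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def extract_attachments(mbox_path, out_dir):
    """Save every attachment in an MBOX file to out_dir; return saved paths."""
    saved = []
    box = mailbox.mbox(mbox_path)
    try:
        for index, message in enumerate(box):
            for part in message.walk():
                filename = part.get_filename()
                payload = part.get_payload(decode=True)
                if not filename or payload is None:
                    continue  # not a saveable attachment
                # Prefix with the message index and drop directory components
                # so names cannot collide or escape out_dir.
                safe_name = f"{index:05d}_{os.path.basename(filename)}"
                target = os.path.join(out_dir, safe_name)
                with open(target, "wb") as handle:
                    handle.write(payload)
                saved.append(target)
    finally:
        box.close()
    return saved


def build_sample_mbox(path):
    """Write a one-message MBOX file with one attachment (hypothetical data)."""
    msg = EmailMessage()
    msg["From"] = "clerk@example.org"
    msg["To"] = "records@example.org"
    msg["Subject"] = "Budget"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"a,b\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="budget.csv",
    )
    box = mailbox.mbox(path)
    box.add(msg)
    box.flush()
    box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.mbox")
        build_sample_mbox(sample)
        print(extract_attachments(sample, tmp))
```

The attachment bytes are written unchanged, so each saved file keeps its native format.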
Individual Email Messages
In some cases, clients may not allow users to export accounts or multiple messages into a single file. Instead, the email client may allow individual messages to be exported one-by-one. In this case, users should be careful to select a format that:
- Includes as much metadata as possible
- Is least likely to become inaccessible in the near future.
Many email clients will offer users multiple format options for export. These email formats are not uniformly identified across email clients, and many formats are interpreted differently by different clients. Thus, the State Archives of North Carolina makes the following general guidelines for local governments, rather than recommending specific file formats:
- Attachments should be saved as separate files, in their original format. The email message should indicate whether there are any attachments and include the filenames of those attachments. Depending on the file format, the attachment may also be embedded in the message file itself. Regardless of whether the attachment is embedded in the message, it should also be saved separately in its original format to ensure that it can be opened at a later date.
- If the email message can only be opened in your email client, use a different format. Many email clients have developed their own proprietary format for email, and these should not be used when saving individual messages. Instead, the message should be saved in one of the general formats listed elsewhere in this document, such as plain text (see Plain Text) or PDF (see PDF/A-1a). If the email is a plain text document, it should be able to be opened in a simple text editor, like Notepad or TextEdit. Plain text emails may carry the .txt extension, but they may also carry .eml or another extension. When selecting a file format in your email client, plain text formats may be identified under various titles, such as “EML,” “Email Message,” “Plain Text,” or “Show Original.”
- Email header information should be included in the file. The email “header” contains technical information that is very important in demonstrating the authenticity of an email during e-discovery or a public records request. Most important is the email address of the sender and recipient(s); many email file formats preserve the name of the sender and recipient(s), but not the email addresses. Be aware that email headers may be structured differently depending on the file format and the email client (see examples below). If you export messages from your email client as PDF, this metadata will probably not be included. You may need to export messages in a format that includes header metadata, and then convert that format to PDF.
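The guidelines above can be spot-checked on an exported message. The sketch below assumes the message was saved as a plain text file in the common EML layout (headers, a blank line, then the body) and uses Python's standard `email` package to report the sender and recipient addresses and any attachment filenames. The sample message, addresses, and function name are hypothetical.

```python
from email import policy
from email.parser import BytesParser
from email.utils import getaddresses

# A hypothetical exported message with no attachments.
SAMPLE_EML = (
    b"From: Pat Clerk <pat.clerk@example.org>\r\n"
    b"To: Records Office <records@example.org>, lee@example.org\r\n"
    b"Subject: Meeting agenda\r\n"
    b"Date: Mon, 06 Jan 2020 09:00:00 -0500\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Agenda attached separately.\r\n"
)


def summarize_headers(raw_bytes):
    """Return sender and recipient addresses, subject, and attachment names."""
    message = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    senders = getaddresses([str(v) for v in message.get_all("From", [])])
    recipient_fields = message.get_all("To", []) + message.get_all("Cc", [])
    recipients = getaddresses([str(v) for v in recipient_fields])
    return {
        # Keep only the address part; display names alone are not enough.
        "from": [addr for _, addr in senders],
        "to": [addr for _, addr in recipients],
        "subject": str(message["Subject"]),
        "attachments": [p.get_filename() for p in message.iter_attachments()],
    }


if __name__ == "__main__":
    print(summarize_headers(SAMPLE_EML))
```

An empty `from` or `to` list would suggest that the chosen export format dropped the email addresses.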
Websites are usually collections of numerous webpages that are intellectually related and meant to be explored as a whole. Each of these webpages, in turn, is made up of multiple files of various file formats. A single webpage may include HTML, CSS, executable files, images, videos, audio, fonts, PDFs, and more. These complex digital entities, moreover, are often embedded in dense hyperlinked contexts, so that a single webpage removed from its context loses much of its meaning.
The goal of archiving a webpage is to collect all of the files, embedded content, and linked resources that made up the original webpage, and to be able to continue presenting the webpage as it originally appeared to visitors. Where possible, webpages should be collected in the context of the websites and linked webpages in which they were located at the time of capture. Several current web archiving services utilize the open-source Heritrix tool, which can perform large-scale archival web crawls and captures. The Department of Cultural Resources performs large-scale captures of state government websites and social media content, utilizing the Web Archive (WARC) format and the Archive-It web archiving service. Local government websites are not actively captured as a part of this program. Although parts of some local government sites are incidentally captured in the archive, local governments should not currently rely on this program to archive their websites.
Web Archive (.warc, .war)
Many web archiving services utilize the Web Archive, or WARC, format. This format specifies a standard for combining multiple digital resources into a single, aggregate file with descriptive information. The WARC format is an extension of the ARC File Format that has been used to store web crawls by the Internet Archive since 1996. Beginning in 2005, the Internet Archive developed the WARC format in consultation with the International Internet Preservation Consortium (IIPC) to extend and replace the ARC format. Published as ISO 28500:2009, WARC is an open, publicly documented standard.
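To show how a WARC file pairs each captured resource with descriptive information, the sketch below parses a single uncompressed record: a version line, named header fields, a blank line, and then the content block. The sample record and function name are hypothetical, and the sketch ignores details such as per-record compression and multi-line header values, so a dedicated WARC library is the better choice for real files.

```python
# A hypothetical captured HTTP response used as the record's content block.
BODY = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2020-01-06T14:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: " + str(len(BODY)).encode("ascii") + b"\r\n"
    b"\r\n" + BODY + b"\r\n\r\n"
)


def parse_warc_record(record_bytes):
    """Split one WARC record into (version, header fields, content block)."""
    # The header ends at the first blank line.
    head, _, rest = record_bytes.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    # Content-Length gives the size of the content block in bytes.
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


if __name__ == "__main__":
    version, fields, content = parse_warc_record(SAMPLE_RECORD)
    print(version, fields["WARC-Type"], fields["WARC-Target-URI"], len(content))
```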
The WARC file format is designed to be used in the large-scale collection and bulk harvesting of web archives through tools built around the Heritrix open-source tool.34 The State Archives and State Library of North Carolina currently crawl state government websites through Archive-It. To submit a new website to the web archive, please contact your records analyst.
PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A)
It is recommended that local governments use vendor services that archive websites in WARC or other formats that allow websites to be rendered as they were originally created, with active hyperlinks and other interactive material. However, where this is not possible, local governments may consider capturing websites in the PDF format. One type of PDF, PDF/A-1a, is designed specifically to preserve files long-term, and the State Archives recommends that PDF/A-1a be used wherever possible. More information on this format can be found in section PDF/A-1a (.pdf) (ISO 19005-1 compliant PDF/A).
PDF is designed to replicate the “look” of documents, not the interactivity of websites. Many web browsers, such as Firefox®, Internet Explorer®, and Google Chrome™, have the option to “save as” or print to PDF. Depending on how your website is built, this option may drastically change the appearance of the website. Interactive elements, including hyperlinks, videos, sound, forms, and scripts, will lose their functionality.
Currently, the State Archives of North Carolina collects statewide geospatial data from NC OneMap, a clearinghouse for North Carolina geospatial resources. The archival entity collected by the State Archives is the shapefile. The shapefile format, formally known as the ESRI Shapefile Format, is an open specification designed by Environmental Systems Research Institute, Inc (Esri®) for the transfer of data between Esri® and non-Esri® products. The format is defined in ESRI Shapefile Technical Description: An ESRI White Paper—July 1998.
Shapefiles store nontopological geometry and attribute information for the spatial features of a data set. The geometry for a feature is stored as a shape comprising a set of vector coordinates. A single shapefile is, in fact, a collection of several distinct files. At a minimum, the shapefile consists of a main file (.shp), an index file (.shx), and a database or “dBASE” file (.dbf). The main file contains a record of each point, line, and area in the shapefile, with each record being described by a list of its vertices. The index file lists the location of each record in the main file, and the database (or “dBASE”) file contains the attributes of each record. Shapefiles typically have several additional, optional component files. Shapefiles collected by the State Archives of North Carolina contain the following seven component files:
- .shp — Main file: direct access, variable-record-length file in which each record describes a shape with a list of its vertices
- .shx — Index file: list of records containing the offset of the corresponding main file record from the beginning of the main file
- .dbf — dBASE file: table containing feature attributes with one record per feature and a one-to-one relationship between geometry and attributes based on record number
- .prj — Projection Definition file: coordinate system information
- .sbn — Part 1 of spatial index for read-write instances of the Shapefile format: if present, essential for correct processing
- .sbx — Part 2 of spatial index for read-write instances of the Shapefile format: if present, essential for correct processing
- .shp.xml — Geospatial metadata file: metadata in XML format following either the Federal Geographic Data Committee’s (FGDC) Content Standard for Geospatial Metadata (CSDGM) FGDC-STD-001-1998 or ISO 19115:2003, with the optional addition of the Esri® Metadata Profile
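A completeness check for the component files listed above might look like the following sketch. The three required extensions and four additional extensions come from the list; the constant names, function name, and sample file names are hypothetical.

```python
# The minimum set of component files, followed by the additional components
# listed above.
REQUIRED = (".shp", ".shx", ".dbf")
ADDITIONAL = (".prj", ".sbn", ".sbx", ".shp.xml")


def check_components(filenames, base_name):
    """Return (missing required, missing additional) extensions for base_name."""
    # Compare case-insensitively, since extensions may be upper- or lowercase.
    present = {name.lower() for name in filenames}
    base = base_name.lower()
    missing_required = [ext for ext in REQUIRED if base + ext not in present]
    missing_additional = [ext for ext in ADDITIONAL if base + ext not in present]
    return missing_required, missing_additional


if __name__ == "__main__":
    files = ["roads.shp", "roads.shx", "roads.dbf", "roads.prj"]
    print(check_components(files, "roads"))
```

To check a folder on disk, the list of names could come from `os.listdir(folder)`.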