During the Research Project
Planning for public access throughout your research will help ensure that the data archiving process goes smoothly at the end. Using your data management plan as your guide, choose file formats and naming conventions that will make it easy to organize your data and share it with others. As you begin work, take the time to document and describe your data parameters, and use consistent formatting throughout your files. Finally, keep your data safe with a backup system. As you work:
- Step 1: Use stable file formats.
- Step 2: Plan file names.
- Step 3: Describe the data.
- Step 4: Organize data consistently.
- Step 5: Perform quality assurance.
- Step 6: Preserve raw data.
- Step 7: Back up and protect your data.
Step-by-Step Guide
Step 1: Use Stable File Formats
Using platform-independent and nonproprietary formats whenever practical will maximize the future utility of your data. Use text (ASCII) file formats for tabular data, such as .txt or .csv (comma-separated values) formats.
Some preferred file formats for different content types include:
- Containers: TAR, GZIP, ZIP
- Databases: XML, CSV
- Geospatial: SHP, DBF, GeoTIFF, NetCDF
- Moving images: MOV, MPEG, AVI, MXF
- Sounds: WAVE, AIFF, MP3, MXF
- Statistics: ASCII, DTA, POR, SAS, SAV
- Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
- Tabular data: CSV
- Text: XML, PDF/A, HTML, ASCII, UTF-8
- Web archive: WARC
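To illustrate the tabular-data recommendation, the sketch below writes a small dataset to a plain-text .csv file using Python's standard library. The column names and values are invented for the example; any tabular data works the same way.

```python
import csv

# Hypothetical vehicle-count observations (example data only).
rows = [
    {"date": "20240115", "site_id": "A01", "vehicle_count": 412},
    {"date": "20240116", "site_id": "A01", "vehicle_count": 398},
]

# CSV is a nonproprietary, platform-independent text format.
with open("vehicle_counts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "site_id", "vehicle_count"])
    writer.writeheader()
    writer.writerows(rows)
```

Because the file is plain text, it can be opened decades from now without the software that created it.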
► Learn more:
- Submitting Content, National Transportation Library, February 2021.
- Data Best Practices and Case Studies: File Formats, Stanford University Libraries, last updated November 2021.
- Data Management Plan (DMP) Guide: Data Organization, Iowa State University Library, last updated November 2021.
- National Transportation Library Collection Development and Maintenance Policy, Version 1.3, January 2018.
Sources: Stanford University Libraries, National Transportation Library, USGS. Image credit: USGS.
Step 2: Plan File Names
Develop naming conventions and a folder hierarchy structure early. File names should:
- Be unique.
- Use logical and efficient naming conventions.
- Reflect file contents using keywords such as location, variables and conditions.
- Be between 25 and 60 characters.
- Avoid using capitals, special characters and spaces.
- Use YYYYMMDD date format.
- Use underscores between components.
- Differentiate raw data from other files.
Sample file names:
- YYYYMMDD_location_vehicle_count_raw.xlsx
- bigfoot_agro_2000_gpp.tiff
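The naming conventions above can be applied programmatically so that every file name in a project follows the same pattern. The sketch below uses a hypothetical helper, `make_file_name`, to assemble a compliant name: YYYYMMDD date first, lowercase throughout, underscores between components, no spaces or special characters.

```python
from datetime import date

def make_file_name(obs_date, location, variable, status, ext):
    """Assemble a file name per the conventions above: YYYYMMDD date,
    lowercase, underscores between components, no spaces."""
    parts = [obs_date.strftime("%Y%m%d"), location, variable, status]
    name = "_".join(p.lower().replace(" ", "_") for p in parts)
    return f"{name}.{ext}"

print(make_file_name(date(2024, 1, 15), "Des Moines", "vehicle_count", "raw", "csv"))
# 20240115_des_moines_vehicle_count_raw.csv
```

Generating names from one function, rather than typing them by hand, keeps raw files clearly differentiated (via the `status` component) and makes the whole collection sort chronologically.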
► Learn more: Data Best Practices and Case Studies, Stanford University Libraries, last updated November 2021.
Sources: Stanford University Libraries; University of California, Davis; USGS. Image credit: USGS.
Step 3: Describe the Data
Create data documentation (such as a parameter table) as you begin work rather than waiting until your project is complete.
- Use commonly accepted parameter names, descriptions and units.
- Be consistent.
- Explicitly state units.
- Choose a format for each parameter, explain the format in the metadata, and use that format throughout the file.
- Use standard data formats (for example, the ISO standard date format, YYYYMMDD).
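A parameter table can itself be kept as a small machine-readable file alongside the data. The sketch below writes one as CSV; the parameter names, descriptions, and units are hypothetical placeholders for a real project's documentation.

```python
import csv

# Hypothetical parameter table documenting each column in a data file.
parameters = [
    {"name": "obs_date", "description": "Date of observation",
     "units": "YYYYMMDD (ISO standard date)", "format": "text"},
    {"name": "air_temp", "description": "Air temperature",
     "units": "degrees Celsius", "format": "decimal"},
    {"name": "site_id", "description": "Monitoring site code",
     "units": "none", "format": "text"},
]

with open("parameter_table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "description", "units", "format"])
    writer.writeheader()
    writer.writerows(parameters)
```

Starting this table on day one, and updating it whenever a parameter is added, is far easier than reconstructing units and formats from memory at project close-out.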
Source: USGS. Image credit: USGS.
Step 4: Organize Data Consistently
Keep data organization consistent throughout your files.
- Don't change or rearrange columns.
- Include header rows; column headings should describe the content of each column.
- In the first row, list the file name, dataset title, author, date, and names of companion files.
Spreadsheet best practices
- One data type per cell.
- One data point per cell.
- Use clear variable names.
- Use data validation—predefined, consistent categories.
- Avoid using formatting that may not be maintained when a spreadsheet is converted to a .csv file. Don't use:
  - Commas or special characters (@, %, ^, etc.).
  - Colored text or cell shading.
  - Embedded comments.
- Avoid empty cells, rows or columns. If there is no data for a cell, indicate why.
- Avoid merged cells, missing headers, or multiple header rows.
An application like OpenRefine (formerly Google Refine) can help you locate and clean up inconsistent data.
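Data validation against predefined categories can also be scripted. The sketch below checks a categorical value against a hypothetical controlled vocabulary (the column and category names are invented for illustration):

```python
# Hypothetical controlled vocabulary for a "surface_type" column.
ALLOWED_SURFACE_TYPES = {"asphalt", "concrete", "gravel", "unpaved"}

def validate_category(value, allowed=ALLOWED_SURFACE_TYPES):
    """Normalize an entry and check it against the predefined set.
    Returns the cleaned value, or None if it is not an allowed category."""
    cleaned = value.strip().lower()
    return cleaned if cleaned in allowed else None

print(validate_category("Asphalt"))  # asphalt
print(validate_category("tarmac"))   # None
```

Rejected values can then be reviewed and corrected rather than silently propagating through later analyses.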
► Learn more: Manage Spreadsheets, Stanford University Libraries, last updated November 2021.
Sources: Stanford University Libraries; University of California, Davis; USGS. Image credit: USGS.
Step 5: Perform Quality Assurance
To ensure data integrity, perform frequent checks on your data to identify any errors.
- Ensure data are properly delimited and line up in the correct columns.
- Check for missing values in key parameters.
- Scan for impossible and anomalous values.
- Perform and review statistical summaries.
- Map location data and assess any errors.
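Several of these checks can be automated. The sketch below scans a CSV file for missing values in key parameters and for values outside a plausible range; the file name, column names, and bounds are hypothetical examples.

```python
import csv

def qa_report(path, key_fields, ranges):
    """Scan a CSV file for missing key values and out-of-range numbers.
    `ranges` maps a column name to an inclusive (low, high) bound."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
            for field in key_fields:
                if not row.get(field, "").strip():
                    problems.append((i, field, "missing value"))
            for field, (low, high) in ranges.items():
                try:
                    value = float(row[field])
                except (KeyError, ValueError):
                    problems.append((i, field, "not numeric"))
                    continue
                if not low <= value <= high:
                    problems.append((i, field, "out of range"))
    return problems

# Example file with one missing site ID and one impossible temperature.
with open("qa_sample.csv", "w", newline="", encoding="utf-8") as f:
    f.write("site_id,air_temp\nA01,21.5\n,19.0\nA02,999\n")

print(qa_report("qa_sample.csv", key_fields=["site_id"],
                ranges={"air_temp": (-60, 60)}))
```

Running such a report after each data-entry session catches errors while the field notes are still fresh enough to resolve them.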
Source: USGS. Image credit: USGS.
Step 6: Preserve Raw Data ("Keep Raw Data Raw")
To preserve your data and its integrity, save a read-only copy of your raw data files with no transformations, interpolation or analyses. Use a scripted language such as R, SAS or MATLAB to process data in a separate file (located in a separate directory). These scripts:
- Serve as a record of data processing.
- Can be easily and quickly revised and rerun in the event of data loss or requests for edits.
- Allow future researchers to follow up or reproduce your processing.
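As a minimal sketch of this workflow (the directory layout and file names are invented), the script below writes a raw file once, marks it read-only, and produces the processed version in a separate directory. Rerunning the processing step regenerates the clean file from the untouched raw copy.

```python
import csv
import os
import stat

RAW_PATH = "raw/traffic_counts_raw.csv"            # never edited after collection
PROCESSED_PATH = "processed/traffic_counts_clean.csv"

os.makedirs("raw", exist_ok=True)
os.makedirs("processed", exist_ok=True)

# Simulate a raw data file, then mark it read-only so it cannot be
# modified in place ("keep raw data raw").
with open(RAW_PATH, "w", newline="", encoding="utf-8") as f:
    f.write("site_id,count\nA01,412\nA02,\nA03,398\n")
os.chmod(RAW_PATH, stat.S_IREAD)

# All transformations live in this script and write to a separate
# directory, so the script doubles as a record of the processing.
with open(RAW_PATH, newline="", encoding="utf-8") as src, \
     open(PROCESSED_PATH, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["site_id", "count"])
    writer.writeheader()
    for row in reader:
        if row["count"].strip():       # drop rows with missing counts
            writer.writerow(row)
```

If the processed file is ever lost or a reviewer requests a different filter, the script is simply revised and rerun; the raw data never changes.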
Source: USGS.
Step 7: Back Up and Protect Your Data
As you work, create backup copies of your data often.
- For the best protection from loss, create three copies of each file: the original, an on-site (external) backup, and an off-site backup (such as via cloud services).
- Choose a backup frequency based on need and risk.
To ensure that you can recover from a data loss, periodically test your ability to recover your data.
Check with the Information Technology (IT) department in your organization for advice on the best backup systems for your needs.
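For the on-site backup copy, even a small script can help by timestamping each backup so earlier versions are never overwritten. The sketch below is a hypothetical local-copy helper only; off-site backups (such as cloud services) and scheduling are left to your organization's backup system.

```python
import pathlib
import shutil
from datetime import datetime

def back_up(src, backup_dir):
    """Copy a file into backup_dir under a timestamped name, so each
    backup is preserved rather than overwritten. copy2 also preserves
    the file's modification time."""
    src = pathlib.Path(src)
    dest_dir = pathlib.Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = dest_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)
    return dest

# Hypothetical data file to protect.
pathlib.Path("field_notes.csv").write_text("site_id,count\nA01,412\n",
                                           encoding="utf-8")
copy_path = back_up("field_notes.csv", "backups")
```

Calling `back_up` at the end of each work session gives a simple recovery point; periodically restoring from one of these copies verifies that the backups actually work.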
► Learn more:
- Data Best Practices and Case Studies, Stanford University Libraries, last updated November 2021.
- Data Management Plan (DMP) Guide, Iowa State University Library, last updated November 2021.
Sources: Iowa State University Library, Stanford University Libraries, USGS.