Data archive workflow



The social science data archive step by step

Ekkehard Mochmann (Central Archive for Empirical Social Research, Cologne, FRG)
Paul de Guchteneire (UNESCO, Paris, France)
 


Content:
 

1. Identification of datasets
2. Sources of data
3. Selection criteria
3.1. Scientific criteria
3.2. Technical criteria
3.3. Administrative criteria
3.4. Financial criteria

4. Data transfer to the archive
5. Data processing

6. Documentation
7. Storage
8. Information retrieval
9. Dissemination of data

10. Notes

 

 


3. Selection criteria

As mentioned earlier, many datasets are produced but only a few are selected for the archive. Every dataset the archive acquires carries a substantial financial obligation for the future. As we shall see later, preparing a dataset for storage and keeping it "alive" over the years can be a costly operation.

The criteria for selecting datasets to archive are complicated. The general guideline is whether or not the data are usable for future scientific research. Practical considerations, of course, have to be kept in mind when applying the more scientific criteria. The knowledge needed to select a dataset for the archive is not well defined: the criteria applied contain a certain amount of "common sense", which makes them difficult to define formally.

In the decision process the criteria are often applied in the negative sense: "can you dispose of a dataset because the criterion is not met?" The four categories of criteria are discussed below: scientific, technical, administrative and financial.
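
The way the criteria act as reasons for disposal can be sketched as a simple screening routine. This is only an illustration: the class `Assessment`, the function `unmet_categories` and the yes/no fields are hypothetical names, and in practice the decision involves the "common sense" weighing described above rather than a mechanical check.

```python
from dataclasses import dataclass, fields


@dataclass
class Assessment:
    """Hypothetical yes/no outcome for each category of criteria."""
    scientific: bool      # relevance, size and scope are acceptable
    technical: bool       # the archive can handle format, size and media
    administrative: bool  # documentation, privacy and ownership are settled
    financial: bool       # costs of acquisition and upkeep are acceptable


def unmet_categories(a: Assessment) -> list[str]:
    """Return the categories that give a reason to dispose of the dataset."""
    return [f.name for f in fields(a) if not getattr(a, f.name)]


# Illustrative dataset whose ownership question is still open.
survey = Assessment(scientific=True, technical=True,
                    administrative=False, financial=True)
reasons = unmet_categories(survey)
print("dispose" if reasons else "keep", reasons)
```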

3.1. Scientific criteria

The scientific criteria deal with relevance, size and scope of the dataset. It should be emphasised that these terms relate to the dataset and not to the original research project that produced the data. Irrelevant research can produce relevant data and vice versa.

3.1.1. Relevance

The relevance of the dataset's topics, the population studied and the methods of data collection is taken into consideration.

  • Is the studied population (or other "unit") interesting enough from the point of view of society in general or of developments in science?

  • Is the studied population already well represented in the rest of the archive holdings? Does the dataset contribute new aspects or new measurement points in time?

  • Is the method of data collection appropriate for the population and topics? Are the topics in the dataset rich enough for future analysis?

3.1.2. Size

  • How many cases and variables are included in the dataset?

  • Is the number of cases a sufficient sample of the studied population?

  • Is the sampling method appropriate? Is the sample representative of the population to be studied?

  • Is the dataset large enough for statistical analysis?

Size is typically a criterion that cannot be applied without keeping an eye on the relevance criterion. For example, a small dataset dealing with members of parliament may be interesting enough even though the number of cases is limited.
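
One rough way to judge whether the number of cases is sufficient is the margin of error of an estimated proportion. The sketch below uses the standard formula for a simple random sample, z * sqrt(p(1 - p) / n), with z = 1.96 for 95% confidence. The function name and the example sample sizes are illustrative assumptions, and clustered or weighted survey designs need a design-specific calculation.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the confidence interval for a proportion.

    Assumes a simple random sample of n cases; p = 0.5 is the worst case.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative sample sizes: the margin shrinks as the number of cases grows.
for n in (100, 400, 1000):
    print(n, round(margin_of_error(n), 3))
```

A wide margin is not in itself a reason for disposal: as the members of parliament example shows, relevance can outweigh a limited number of cases.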

3.1.3. Scope

  • What is the scope of the topics covered in the dataset?

  • Do the topics deal with social phenomena?

  • Are background variables like age, sex, social status indicators, education, occupation etc. included?

  • In how much detail are these background variables stored in the data?

3.2. Technical criteria

Machine-readable information can be stored in a variety of technical formats. Rapid developments in computer technology bring new so-called standards every year, and the data archive has to respond to the changes in both machinery and software. An additional problem here is that the data archive is virtually the only institute that has to deal with so many technically different sources of computer material. The interest of computer hardware companies in compatibility problems is therefore limited.

Technical criteria are mostly used in a practical sense: can the archive handle the machine-readable information? Points to check are internal format, size and media for transfer.

3.2.1. Internal format

  • How is the data organised? Is it captured in a dedicated software system?

  • Can the data be stored in a standard format for use with standard software?

  • Do you lose significant information when the data are converted to a more standardised system?
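
The question of information loss on conversion can be illustrated with a round trip through plain CSV. This is a minimal sketch under stated assumptions: the variables, labels and rows are invented, and CSV merely stands in for "a more standardised system". CSV keeps only column names and values, so descriptive metadata held by a dedicated system has to be carried separately.

```python
import csv
import io

# Hypothetical dataset as held in a dedicated system: raw codes plus metadata.
variables = {
    "sex": {"label": "Sex of respondent", "values": {"1": "male", "2": "female"}},
    "age": {"label": "Age in years", "values": {}},
}
rows = [{"sex": "1", "age": "34"}, {"sex": "2", "age": "51"}]

# Convert to plain CSV, which stores column names and values only.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(variables))
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and compare it with the original rows.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
data_preserved = restored == rows

# Any variable with a label or value labels loses them in the CSV file.
metadata_lost = [name for name, meta in variables.items()
                 if meta["label"] or meta["values"]]

print("data preserved:", data_preserved)
print("metadata to carry separately:", metadata_lost)
```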

3.2.2. Size

  • Is the dataset too large, with information that is too detailed?

  • Can the data be compressed?

  • Are there many separate files?

  • Can they be combined or aggregated into an overview file?
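
Whether a dataset can be compressed is easy to test empirically. The sketch below uses Python's standard `zlib` module on invented fixed-width records; the function name `compression_ratio` and the sample data are assumptions for illustration, and real survey files will compress less dramatically than this highly repetitive example.

```python
import zlib


def compression_ratio(raw: bytes, level: int = 9) -> float:
    """Compressed size divided by original size (smaller is better)."""
    if not raw:
        raise ValueError("raw must not be empty")
    return len(zlib.compress(raw, level)) / len(raw)


# Hypothetical fixed-width survey records with many repeated codes.
sample = b"0001 1 34 2 2 1 9 9 9\n" * 5000
print(f"ratio: {compression_ratio(sample):.3f}")

# Compression of this kind is lossless: the original bytes come back intact.
assert zlib.decompress(zlib.compress(sample)) == sample
```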

3.2.3. Media for transfer

  • Can electronic records be created without loss of information?

  • Does the electronic format follow a standard that can be handled by the archive?

  • If other media are used, can the archive machinery access these media?

Many of the incompatibility problems with transfer media have been solved in the past decade. New problems arise more in the software area, where complicated programs that allow users to store data in complicated formats are becoming available.

3.3. Administrative criteria

3.3.1. Documentation

Datasets are useless without proper documentation. There should be a description of the methods used to generate the data, a definition of the population, a description of the sampling procedures etc. Social science data archives have agreed on a minimum standard for describing datasets in the study description scheme. To "read" the machine-readable information there should be a codebook that describes the relation between the original research instrument (e.g. a questionnaire) and the data.
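
The role of the codebook can be sketched as a lookup from variables in the data file to questions and answer categories in the questionnaire. The variable names, question texts and codes below are invented for illustration and do not follow any particular archive's codebook format.

```python
# Hypothetical codebook: one entry per variable in the data file.
codebook = {
    "V1": {"question": "Q1. What is your sex?",
           "values": {"1": "male", "2": "female", "9": "no answer"}},
    "V2": {"question": "Q2. How old are you?",
           "values": {}},  # numeric variable, no value labels
}


def decode(record: dict) -> dict:
    """Translate raw codes into question text and answer labels."""
    decoded = {}
    for var, code in record.items():
        entry = codebook[var]
        # Fall back to the raw code when no value label is defined.
        decoded[entry["question"]] = entry["values"].get(code, code)
    return decoded


print(decode({"V1": "2", "V2": "34"}))
```

Without such a mapping the record `{"V1": "2", "V2": "34"}` cannot be interpreted, which is why data without a codebook are of little use.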

3.3.2. Privacy protection

  • Are there any privacy concerns?

  • Can the archive store identifiable data on individuals?

  • Does the data have to be modified to prevent the risk of revealing personal data?

  • Can this be done without serious loss of information for future scientific analysis?
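
A minimal sketch of such a modification follows: the direct identifier is dropped and age is coarsened into ten-year bands, after which the size of the rarest combination of background variables is checked. This group-size check is borrowed from the idea of k-anonymity, which the text above does not mention. The records, field names and band width are all invented for illustration.

```python
from collections import Counter

# Hypothetical records; "name" is a direct identifier.
records = [
    {"name": "A", "age": 34, "region": "north", "income": 3},
    {"name": "B", "age": 37, "region": "north", "income": 2},
    {"name": "C", "age": 52, "region": "south", "income": 4},
    {"name": "D", "age": 58, "region": "south", "income": 1},
]


def anonymise(rec: dict) -> dict:
    """Drop the direct identifier and coarsen age into ten-year bands."""
    low = rec["age"] // 10 * 10
    return {"age_band": f"{low}-{low + 9}",
            "region": rec["region"],
            "income": rec["income"]}


def smallest_group(recs: list, keys: tuple) -> int:
    """Size of the rarest combination of the given background variables."""
    counts = Counter(tuple(r[k] for k in keys) for r in recs)
    return min(counts.values())


safe = [anonymise(r) for r in records]
print("before:", smallest_group(records, ("age", "region")))
print("after:", smallest_group(safe, ("age_band", "region")))
```

The trade-off raised in the last question is visible here: coarser bands make individuals harder to single out, but exact ages are no longer available for analysis.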

3.3.3. Ownership

Often it is not clear who owns the data and whether the archive can get enough control over future usage of the data. If serious restrictions are imposed on usage of the data, the archive may decide not to store the dataset in its holdings because usage will be limited.

When both the data gathering institute and the institute that commissioned the research claim ownership of the machine-readable data, the archive can get caught up in legal proceedings that prevent proper storage of the data. Data archives should advise funding agencies to deal with the problem of ownership beforehand.

3.4. Financial criteria

Financial criteria have to be applied for both internal and external reasons. Data archives generally do not buy datasets. Donors view their contribution of a dataset to the archive as a further means of making their research public. For donating institutes, the data archive may function as an external back-up of their own holdings. For these and other reasons most of the datasets arrive at the data archive free of charge. Many archives, however, are able to reimburse expenses incurred in transferring the data.

At some archives, donors of data can impose royalties on the usage of their data. High royalties may be prohibitive for scientific usage, and this approach has not been widely adopted.


 
Copyright © IFDOnet - All rights reserved - 11-05-2005