|
As mentioned earlier, many datasets are produced, with only a few
being selected for the archive. Every dataset acquired for the
archive entails a substantial financial obligation for the future.
As we shall see later, preparing a dataset for storage and keeping
a dataset "alive" over the years can be a costly operation.
The criteria for selection of datasets to archive are complicated.
A general
guideline is whether or not the data are usable for future
scientific research. Of course practical considerations have to be
kept in mind in applying the more scientific criteria. The
necessary knowledge to select a dataset for the archive is not
well defined. The criteria applied involve a certain amount of
"common sense", which makes a formal definition of the criteria
difficult. In the decision process the criteria are often used in
the sense of "can you dispose of a dataset because the criterion
is not met". Below, the four
categories of criteria are discussed: scientific criteria,
technical criteria, administrative criteria and financial
criteria.

-
3.1. Scientific criteria
The scientific criteria deal with relevance, size and scope of the
dataset. It should be emphasised that these terms are related to
the dataset and not to the original research project that produced
the data.
Irrelevant research can produce relevant data and vice versa.
-
3.1.1. Relevance
The relevance of the dataset is judged by its topics, the studied
population and the methods of data collection.
- Is the studied population (or other "unit") interesting enough
  in view of general society, in view of developments in science?
- Is the studied population already well represented in the rest
  of the archive holdings, does the dataset contribute new aspects
  or new measurement points in time?
- Is the method of data collection appropriate for the population
  and topics? Are the topics in the dataset rich enough for future
  analysis?
-
3.1.2. Size
- How many cases and variables are included in the dataset?
- Is the number of cases a sufficient sample of the studied
  population?
- Is the sampling method appropriate? Is the sample representative
  of the population to be studied?
- Is the dataset large enough for statistical analysis?
Size is typically a criterion that cannot be used without keeping
an eye on the relevance criterion, e.g. a small dataset dealing
with members of parliament may be interesting enough even though
the number of cases is limited.
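As a rough illustration of the question whether the number of cases
is sufficient, the sketch below computes an approximate 95% margin
of error for an estimated proportion under simple random sampling.
The function name, the worst-case assumption p = 0.5 and the example
figures are illustrative assumptions and not part of any archive's
procedure. The finite population correction is included because a
small population that is almost completely covered, such as members
of parliament, is exactly the case where a limited number of cases
can still be adequate.

```python
import math


def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate margin of error for a proportion (simple random sampling).

    n          -- number of cases in the dataset
    population -- size of the studied population, if known; enables the
                  finite population correction
    p          -- assumed proportion; 0.5 is the worst case
    z          -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("n must be positive")
    se = math.sqrt(p * (1.0 - p) / n)
    if population is not None:
        if n > population:
            raise ValueError("n cannot exceed the population size")
        if population > 1:
            # Finite population correction: covering a large share of a
            # small population reduces the sampling uncertainty.
            se *= math.sqrt((population - n) / (population - 1))
        else:
            se = 0.0
    return z * se


# Hypothetical figures, for illustration only.
general_survey = margin_of_error(1000)             # very large population
parliament = margin_of_error(120, population=150)  # small, nearly complete
print(f"{general_survey:.3f} {parliament:.3f}")
```

A figure like this only addresses sampling error. It says nothing
about whether the sampling method was appropriate, which remains a
separate question in the list above.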
-
3.1.3. Scope
- What is the scope of the topics covered in the dataset?
- Do the topics deal with social phenomena?
- Are background variables like age, sex, social status
  indicators, education, occupation etc. included?
- In how much detail are these background variables stored in the
  data?

-
3.2. Technical criteria
Machine-readable information can be stored in a variety of
technical formats. Rapid developments in computer technology
bring new so-called standards every year, and the data archive has to
respond to the changes in both machinery and software. An
additional problem here is that the data archive is virtually the
only institute that has to deal with so many technically different
sources of computer material. The interest of computer hardware
companies in compatibility problems is therefore limited.
Technical criteria are mostly used in a practical sense: can the
archive handle the machine-readable information? Points to check
are internal format, size and media for transfer.
-
3.2.1. Internal format
- How are the data organised? Are they captured in a dedicated
  software system?
- Can the data be stored in a standard format for use with
  standard software?
- Do you lose significant information when the data are converted
  to a more standardised system?
-
3.2.2. Size
- Is the dataset too large with too detailed information?
- Can the data be compressed?
- Are there many separate files?
- Can they be combined or aggregated into an overview file?
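The questions about compression and the number of separate files can
be answered with a quick measurement before a dataset is accepted. The
sketch below is a minimal example under stated assumptions: the
function name compression_report and the sample file are hypothetical,
gzip stands in for whatever compression the archive actually uses, and
each file is read into memory in one piece, which is only suitable for
files of modest size.

```python
import gzip
import os
import tempfile


def compression_report(paths):
    """Return (file_count, raw_bytes, gzip_bytes) for a list of file paths."""
    raw_total = 0
    packed_total = 0
    for path in paths:
        with open(path, "rb") as handle:
            raw = handle.read()
        raw_total += len(raw)
        # Compress in memory only; nothing is written back to disk.
        packed_total += len(gzip.compress(raw))
    return len(paths), raw_total, packed_total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Hypothetical fixed-width survey file with repetitive records.
        sample = os.path.join(folder, "survey.dat")
        with open(sample, "wb") as handle:
            handle.write(b"0001 1 2 9 9 9\n" * 1000)
        count, raw, packed = compression_report([sample])
        print(f"{count} file(s), {raw} bytes raw, {packed} bytes compressed")
```

Coded survey data with many repeated values tends to compress well,
so the ratio of the last two numbers gives a first impression of the
storage actually needed.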
-
3.2.3. Media for transfer
- Can electronic records be created without loss of information?
- Does the electronic format follow a standard that can be handled
  by the archive?
- If other media are used, can the archive machinery access these
  media?
Many of the incompatibility problems with transfer media have been
solved in the past decade.
New problems arise more in the software area, where complicated
programs that allow users to store data in complicated formats are
becoming available.

-
3.3. Administrative criteria
-
3.3.1. Documentation
Datasets are useless without proper documentation. There should be
a description of the methods used to generate
the data, a
definition of the population, a description of the sampling
procedures etc.
Social science data archives have agreed on a minimum standard for
describing datasets in the study description scheme. To "read"
the machine-readable information there should be a codebook that
describes the relation between the original research instrument
(e.g. a questionnaire) and the data.
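To make the role of the codebook concrete, the sketch below shows one
possible way to represent a codebook entry and use it to "read" a
record. Everything in it is a hypothetical illustration: the Variable
class, the variable name V002, the question wording, the column
position and the codes are invented for the example and do not follow
any particular archive's codebook layout.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str       # variable name in the data file
    question: str   # wording in the original research instrument
    columns: tuple  # (first, last) column in a fixed-width record, 1-based
    codes: dict = field(default_factory=dict)  # stored code -> meaning


def decode(record, variable):
    """Translate the raw code found in one record into its documented meaning."""
    first, last = variable.columns
    raw = record[first - 1:last].strip()
    # Undocumented codes are returned unchanged so they remain visible.
    return variable.codes.get(raw, raw)


# Hypothetical codebook entry and data record.
sex = Variable(
    name="V002",
    question="What is your sex?",
    columns=(6, 6),
    codes={"1": "male", "2": "female", "9": "no answer"},
)
print(decode("0001 2 34", sex))
```

Without the entry for V002 the character in column 6 is just a digit,
which is the sense in which an undocumented dataset is useless.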
-
3.3.2. Privacy protection
- Are there any privacy concerns?
- Can the archive store identifiable data on individuals?
- Do the data have to be manipulated to prevent the risk of
  revealing personal data?
- Can this be done without serious loss of information for future
  scientific analysis?
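One simple form of such manipulation is to drop direct identifiers
and coarsen detailed background variables. The sketch below assumes
records held as dictionaries. The function name deidentify, the field
names and the ten-year age bands are illustrative assumptions, not a
prescribed method. Removing direct identifiers does not by itself
guarantee that individuals cannot be recognised, since combinations
of the remaining variables may still single someone out.

```python
def deidentify(record, direct_identifiers=("name", "address"), age_band=10):
    """Return a copy of a record without direct identifiers and with age coarsened."""
    cleaned = {key: value for key, value in record.items()
               if key not in direct_identifiers}
    age = cleaned.pop("age", None)
    if age is not None:
        # Replace the exact age by a band, e.g. 37 -> "30-39".
        low = (age // age_band) * age_band
        cleaned["age_group"] = f"{low}-{low + age_band - 1}"
    return cleaned


# Hypothetical respondent record.
respondent = {"name": "A. Example", "address": "Main Street 1",
              "age": 37, "occupation": "teacher"}
print(deidentify(respondent))
```

The trade-off raised in the last question is visible here: the age
band still supports most analyses by age group, but analyses that
need the exact age are no longer possible.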
-
3.3.3. Ownership
Often it is not clear who owns the data and whether the archive
can get enough control over future usage of the data.
If serious restrictions are imposed on usage of the data, the
archive may decide not to store the dataset in its holdings because
usage will be limited.
When both the data gathering institute and the institute that
commissioned the research claim ownership of the machine-readable
data, the archive can get mixed up in legal procedures that
prevent a proper storage of the data. Data archives should advise
funding agencies to deal with the problem of ownership beforehand.
-
3.4. Financial criteria
Financial criteria have to be applied for both internal and
external reasons. Data archives generally do not buy datasets.
Donors
view their contribution of a dataset to the archive as a further
means of making their research public.
For donating institutes, the data archive may function as an
external backup of their own holdings.
For these and other reasons most datasets arrive at the data
archive free of charge. Many archives, however, are able to
reimburse expenses incurred in transferring the data.
With some
archives, donors of data have the possibility to impose royalties
on the usage of their data. High royalties may be prohibitive for
scientific usage and this approach has not been widely introduced.
 |