Documentation

Download

Click on https://ecacorpus.eu/misc/ecac.zip

Unzip the file and open index.html, the only file in the root directory; no installation is required. The zip file contains only data, no programs: it is a Portable Web Package (PWP), essentially a directory structure viewable with a web browser. The ECA Corpus can also be explored online at https://ecacorpus.eu; the downloaded and online versions are identical as long as they have the same timestamp.
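The claim that index.html is the only file in the root directory can be checked before unzipping. The following is a minimal sketch, assuming the archive members sit directly at the top level of the zip (no wrapping folder); the function names and the local file name ecac.zip are illustrative, not part of the ECA Corpus.

```python
import zipfile


def root_entries(zip_path):
    """Return the sorted top-level entries of the archive.

    Folders are reported with a trailing slash, files without one.
    """
    with zipfile.ZipFile(zip_path) as archive:
        roots = set()
        for name in archive.namelist():
            # "store/1/a.txt" -> "store/"; "index.html" -> "index.html"
            head, sep, _ = name.partition("/")
            roots.add(head + sep)
        return sorted(roots)


def root_files(zip_path):
    """Return only the top-level entries that are files."""
    return [entry for entry in root_entries(zip_path) if not entry.endswith("/")]


if __name__ == "__main__":
    # Assumes the archive was saved locally as ecac.zip (hypothetical path).
    print(root_files("ecac.zip"))
```

If the archive does wrap everything in a single folder, the same check would have to be applied one level down.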

Overview

A single zip file can contain the full structured data of an organisation such as ECA since its creation in 1977; ECA is a good example for this exercise. The zip file contains: tabular data, full text of the files (eventually), and generated reports with graphics. Anyone can simply unzip the file and:

Deeper analysis using both current and future technologies can be envisaged. For example:

Data structure

A dossier represents a publication or other document types. Each dossier is identified by a unique dossier number. There are two main components:

Register

The Register is the list of dossiers. It is a table containing the dossier number, title, and several additional fields. Once the Register has been cleaned and completed, its data will be integrated into the Store. Currently, the https://ecacorpus.eu website is generated from the Register.
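Since each dossier is identified by a unique dossier number, one simple cleaning check on the Register is to look for numbers that occur more than once. The sketch below assumes the Register rows have been loaded as dictionaries; the key name dossier_number and the sample rows are hypothetical and do not reflect the actual field names or data.

```python
from collections import Counter


def duplicate_dossier_numbers(rows, key="dossier_number"):
    """Return the dossier numbers that occur more than once, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(number for number, count in counts.items() if count > 1)


# Illustrative rows only; not actual Register data.
sample_rows = [
    {"dossier_number": "0001", "title": "Example A"},
    {"dossier_number": "0002", "title": "Example B"},
    {"dossier_number": "0001", "title": "Example A (duplicate entry)"},
]

if __name__ == "__main__":
    print(duplicate_dossier_numbers(sample_rows))
```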

Store

The Store contains the full text of each dossier in all available languages and formats. It is organized as a directory structure with a root folder named store, which contains a set of numbered folders; each folder corresponds to one dossier. At present, the Store consists of an empty skeleton directory structure, ready to receive full-text files in all languages and formats. All texts should be converted to plain text to facilitate further processing and ensure long-term preservation.
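The layout described above (a root folder named store with one numbered folder per dossier) can be walked with a few lines of code. This is a minimal sketch, assuming only that layout; the function names are illustrative, and nothing is assumed about how files inside a dossier folder are named.

```python
from pathlib import Path


def list_dossiers(store_root):
    """Map each dossier folder name to the sorted relative paths of its files."""
    root = Path(store_root)
    dossiers = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(
            f.relative_to(folder).as_posix()
            for f in folder.rglob("*")
            if f.is_file()
        )
        dossiers[folder.name] = files
    return dossiers


def empty_dossiers(store_root):
    """Return the dossier folders that do not yet contain any file."""
    return [name for name, files in list_dossiers(store_root).items() if not files]


if __name__ == "__main__":
    # Assumes the unzipped package is in the current directory.
    print(len(empty_dossiers("store")), "dossier folders still empty")
```

While the Store is still an empty skeleton, empty_dossiers would be expected to return every dossier folder; the list should shrink as full texts are added.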

If a dossier corresponds to a section of a larger document, only the relevant section should be included. For example, in the case of the Official Journal C139/1979, only the material from page 15 concerning ECA should be included. This issue of the Official Journal contains six opinions, so each opinion should be copied to the appropriate dossier.

Data quality

General
The data in the ECA Corpus was scraped from the ECA search engine. Hence, it probably contains errors and omissions.
Missing documents
It is impossible to know whether documents are missing, as the ECA does not have a comprehensive register. The totals differ between the ECA list, the ECA search engine, and the ECA Corpus.
Data cleaning
Further cleaning is required.
Full text files
To do: copy the full texts to the Store, keeping only the relevant parts, and convert them to plain text.
Dates
Which date should be indicated? Example with the annual report of 1977:
  • ECA website: 01-01-1978
  • ECA: 30 November 1978
  • OJ: 30 December 1978
Example with annual report of 1977:
  • ECA website: direct to English without other languages selection, poor scanning.
  • Publications Office: landing page with all the languages, better scanning.

Corrigenda must not be counted as separate reports; they are only corrections of the original reports. There are seven corrigenda in the ECA Corpus database, so the distortion is small.

ECA lists

The ECA lists might not be up to date.

Abbreviations

Links