CSV Table Schema Specification
Version 0.1.0
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC2119 and RFC8174 when, and only when, they appear in all capitals, as shown here.
This document is licensed under Apache License, Version 2.0.
Introduction
CSV Table Schema defines a standard for describing CSV schemas. CSV Table Schema can be used to document CSV data structures and to validate data against a given documentation.
CSV
CSV (Comma-Separated Values) is a widely used file format for storing and exchanging structured data in plain text. It is characterized by its simplicity, human-readability, and versatility. In CSV, data is organized in rows and columns, much like a spreadsheet or database table. Each row represents a single record or data entry, and each column contains an attribute or field. Values within a row are separated by a delimiter, most commonly a comma, but other delimiters like semicolons, tabs, or pipes can also be used based on specific requirements.
Example of CSV Data:
In the example above, we have a CSV file with three columns (Name, Age, and City) and three rows of data, each representing an individual.
Variants of CSV:
-
SSV (Semicolon-Separated Values): SSV (Semicolon-Separated Values), is a variant of CSV where semicolons (;) are used as delimiters instead of commas. This format is often used in regions where comma is used as a decimal separator in numeric values, reducing potential parsing conflicts (e.g. Germany).
For example:
-
TSV (Tab-Separated Values): TSV (Tab-Separated Values) uses tabs (or tab characters) as delimiters between values instead of commas. TSV is commonly employed in situations where data contains commas naturally, such as text descriptions or addresses. TSV files are especially prevalent in data interchange within the scientific and research communities.
For example:
CSV Tables
A CSV table, often referred to simply as a "CSV file," is a structured data file format that follows the principles of Comma-Separated Values (CSV). It is a way to represent tabular data, much like a spreadsheet or a database table, in plain text format.
In a CSV table:
-
Rows: Each line of the CSV file typically represents a row of data. Each row corresponds to a record, entity, or data entry, and it contains one or more values separated by a delimiter, most commonly a comma. Each row ends with a line break character, such as a newline ("\n") on Unix-based systems or a carriage return and newline ("\r\n") on Windows systems.
-
Columns: The values within each row are organized into columns, which represent different attributes or fields of the data. Columns are separated by the chosen delimiter, typically a comma, but semicolons, tabs, or other characters can also be used depending on specific requirements.
-
Headers: The first row is often interpreted as a header, with the value in each column storing the name of the column.
CSV Dictionaries
A CSV dictionary refer to a CSV file used to store key-value pairs or dictionary-like data. In this context, each row in the CSV file would typically represent a key-value pair, with the key and value separated by a delimiter. Such files are sometimes used to store configuration data or simple databases where each row represents a record with attributes and values.
For example, a CSV dictionary for storing configuration settings might look like this:
In this CSV dictionary, there are two columns (Key and Value), and each row represents a configuration setting and its corresponding value.
CSV Table Sets
A CSV table set is a polymorphic table without a header row, in which each row can be different and is identified by a discriminator value.
For example, a CSV polymorphic table for storing two different tables might look like this:
In this CSV table set, the first column is the discriminator. Possible values are r1
and r2
. Rows with r1
contain three additional columns, rows with r2
contain two additional columns.
Specification
Versioning
The CSV Table Schema specification is versioned according to the scheme major.minor.patch
. The major-minor part of the version number (e.g. 0.1
) SHALL denote the feature set of the specification. Patch versions address errors in or provide clarifications to this document, not the feature set. Tools supporting CSV Table Schema in version 0.1
SHALL be compatible with all 0.1.*
versions of CSV Table Schema. The patch version SHOULD NOT be taken into account by the tools, so that for example no difference is made between 0.1.0
and 0.1.1
.
Format
An CSV Table Schema document that conforms to the CSV Table Schema specification is itself a JSON object that can be represented in JSON format.
All field names in the specification are case sensitive.
Schema
A CSV Table Schema document is a JSON document which can be validated against the CSV Table Meta Schema.
It consists of a JSON object with following properties:
$schema
: The used dialect of the CSV schema. Can be used to verify that the schema is valid.version
: The version of the CSV schema.title
: A title for the CSV Schema. This property is REQUIRED.description
: An human-readable description for the CSV Schema.notes
: Additional human-readable notes for the CSV Schema.- Three mutually exclusive properties. One of them is is REQUIRED.
table
: A table object.dictionary
: A dictionary object.tableSet
: A table set object.
The object MAY be extended.
Tables
A CSV table is a text file. Each line of the file is a data record. Each record consists of one or more fields, separated by a delimiter. Every record has the same sequence of fields. The first line of a CSV table can be defined as header row, representing the names of the columns.
Table
A table is a JSON object with following properties:
name
: Defines the name of the CSV table. This property is REQUIRED.description
: An optional human-readable description of the table.type
: Specifies the table type. This property is REQUIRED. Possible values are:ordered
: A table with columns appearing in defined order.unordered
: A table with columns appearing in any order.headless
: A table with no header row. Columns must appear in defined order.
delimiterChar
: Defines the delimiter character of the CSV format. Default value is,
.quoteChar
: Defines the quote character of the CSV format. Default value is"
.lineBreaks
: Defines the valid line breaking strings of the CSV format. Default value is\r\n
and\n
.skipFirstRows
: Defines the number of rows to skip at the beginning of the CSV file.skipEmptyRows
: Specifies whether empty rows should be ignored. Empty rows are rows in which each column has an empty value.language
: Defines the content language of the CSV format. This MUST be an IETF BCP 47 language tag. Can be overriden by the language tag of a column.additionalColumns
: Iftrue
additional non-specified columns within the CSV document are allowed. Default value isfalse
.columns
: An array of column definitions. This property is REQUIRED.
This object MAY be extended.
Columns
The columns
array describes all columns within a CSV table. A column is a JSON object with following properties:
id
: The ID of the column. This property is REQUIRED.name
: The human-readable name of the column.alternativeNames
: A list of alternative names of the column.description
: An human-readable description of the column.type
: The data type of the column. This property is REQUIRED. Possible values are:string
integer
numeric
boolean
enum
enum-set
time
date
date-time
json
xml
optional
: Iftrue
the column must not be present in the CSV table. Default value isfalse
.nullable
: Iftrue
values of this column can be empty. Default value isfalse
.nullValues
: Ifnullable=true
the list of values that represent the null state.
Unique keys
The uniqueKeys
array describes unique keys within a CSV table. A unique key is itself a JSON array with column IDs representing an existing column. Each column ID is a string
value.
About optional columns
If the table type is:
ordered
, a column withoptional=true
marks an optional column. If the column is missing in the table all other columns must still apear in defined order.headless
, a column withoptional=true
marks the start of the optional column range. If the column is missing in the table all columns after this column are of course also missing.
Dictionaries
A CSV dictionary is a collection of key/value pairs. It is a CSV table with exactly two columns, a key column and a value column. While the key column is always a string containing the key name, the value column can have different data types for each key.
Dictionary
A dictionary is a JSON object with following properties:
name
: Defines the name of the CSV dictionary. This property is REQUIRED.description
: An optional description of the dictionary.delimiterChar
: Defines the delimiter character of the CSV format. Default value is,
.quoteChar
: Defines the quote character of the CSV format. Default value is"
.lineBreaks
: Defines the valid line breaking strings of the CSV format. Default value is\r\n
and\n
.skipFirstRows
: Defines the number of rows to skip at the beginning of the CSV file.skipEmptyRows
: Specifies whether empty rows should be ignored. Empty rows are rows in which each column has an empty value.language
: Defines the content language of the CSV format. This MUST be an IETF BCP 47 language tag. Can be overriden by the language tag of a column.keys
: An array of key definitions. This property is REQUIRED.
This object MAY be extended.
Keys
The keys
array describes all keys within a CSV dictionary. A key is a JSON object with following properties:
id
: The ID of the key. This property is REQUIRED.name
: The human-readable name of the key.alternativeNames
: A list of alternative names of the column.description
: An human-readable description of the key.type
: The data type of the key. This property is REQUIRED. Possible values are:string
integer
numeric
boolean
enum
enum-set
time
date
date-time
json
xml
optional
: Iftrue
the key must not be present in the CSV dictionary. Default value isfalse
.nullable
: Iftrue
the value of this key can be empty. Default value isfalse
.nullValues
: Ifnullable=true
the list of values that represent the null state.
Table Sets
TableSet
A table set is a JSON object with following properties:
name
: Defines the name of the CSV table set. This property is REQUIRED.description
: An optional human-readable description of the table set.delimiterChar
: Defines the delimiter character of the CSV format. Default value is,
.quoteChar
: Defines the quote character of the CSV format. Default value is"
.lineBreaks
: Defines the valid line breaking strings of the CSV format. Default value is\r\n
and\n
.skipFirstRows
: Defines the number of rows to skip at the beginning of the CSV file.skipEmptyRows
: Specifies whether empty rows should be ignored. Empty rows are rows in which each column has an empty value.language
: Defines the content language of the CSV format. This MUST be an IETF BCP 47 language tag. Can be overriden by the language tag of a column.additionalColumns
: Iftrue
additional non-specified columns within the CSV document are allowed. Default value isfalse
.tables
: An array of table definitions. This property is REQUIRED.
This object MAY be extended.
Tables
The tables
array describes all tables within a CSV table set. A table is a JSON object with following properties:
name
: Defines the name of the table. This property is REQUIRED.description
: An optional human-readable description of the table.columns
: An array of column definitions. This property is REQUIRED.
Columns
The columns
array describes all columns within a table of a CSV table set. A column is a JSON object with following properties:
id
: The ID of the column. This property is REQUIRED.name
: The human-readable name of the column.description
: An human-readable description of the column.type
: The data type of the column. This property is REQUIRED. Possible values are:discriminator
string
integer
numeric
boolean
enum
enum-set
time
date
date-time
json
xml
optional
: Iftrue
the column must not be present in the CSV table. Default value isfalse
.nullable
: Iftrue
values of this column can be empty. Default value isfalse
.nullValues
: Ifnullable=true
the list of values that represent the null state.
A column of type discriminator
MUST occur exactly once.
Unique keys
The uniqueKeys
array describes unique keys within a CSV table set. A unique key is itself a JSON array with column IDs representing an existing column. Each column ID is a string
value.
Data types
A data type defines the type of value which is expected in a column or key. Based on the data type additional schema properties are available:
discriminator
Represents a discriminator value for table sets. The following additional schema properties are available:
values
: An array of value objects each representing one allowed value. This property is REQUIRED.language
: Defines the language of the content of the discriminator value. This MUST be an IETF BCP 47 language tag.
This data type can only occur within table sets.
string
Represents a text string. The following additional schema properties are available:
minLength
: Specifies the minimum character length of a string value.maxLength
: Specifies the maximun character length of a string value.language
: Defines the language of the content of the string value. This MUST be an IETF BCP 47 language tag.pattern
: An optional regular expression which must always match with the string value.
integer
Represents an integral number. The following additional schema properties are available:
minValue
: Specifies the minimum allowed value.maxValue
: Specifies the maximun allowed value.
numeric
Represents a floating number. The following additional schema properties are available:
formats
: An array of valid format string according to .NET Custom numeric format strings. This property is REQUIRED.minValue
: Specifies the minimum allowed value.maxValue
: Specifies the maximun allowed value.exclusiveMinValue
: Specifies the exclusive lower limit for a value.exclusiveMaxValue
: Specifies the exclusive upper limit for a value.
boolean
Represents a boolean value. The following additional schema properties are available:
trueValues
: An array of valid strings each representing a logical true value. This property is REQUIRED.falseValues
: An array of valid strings each representing a logical false value. This property is REQUIRED.
enum
Represents an enumeration value. The following additional schema properties are available:
members
: An array of value objects each representing one enumeration value. This property is REQUIRED.language
: Defines the language of the content of the member value. This MUST be an IETF BCP 47 language tag.
enum-set
Represents an set of enumeration values formatted as csv string. The following additional schema properties are available:
members
: An array of value objects each repesenting one enumeration value. This property is REQUIRED.delimiterChar
: Defines the delimiter character of the CSV string format. Default value is,
.quoteChar
: Defines the quote character of the CSV string format. Default value is"
.language
: Defines the language of the content of the member value. This MUST be an IETF BCP 47 language tag.
date-time
Represents a datetime value. The following additional schema properties are available:
formats
: An array of valid format strings according to .NET Custom date and time format strings. This property is REQUIRED.minValue
: Specifies the minimum allowed value.maxValue
: Specifies the maximun allowed value.
date
Represents a date only value. The following additional schema properties are available:
formats
: An array of valid format strings according to .NET Custom date and time format strings. This property is REQUIRED.minValue
: Specifies the minimum allowed value.maxValue
: Specifies the maximun allowed value.
time
Represents a time only value. The following additional schema properties are available:
formats
: An array of valid format strings according to .NET Custom date and time format strings. This property is REQUIRED.minValue
: Specifies the minimum allowed value.maxValue
: Specifies the maximun allowed value.
json
Represents a json formatted string. The following additional schema properties are available:
schema
: Specifies a JSON Schema. This property is REQUIRED.uri
: An uri to a valid JSON Schema. This property is REQUIRED.
xml
Represents a xml formatted string. The following additional schema properties are available:
schema
: Specifies a XML Schema. This property is REQUIRED.uri
: An uri to a valid XML Schema. This property is REQUIRED.
Extension of the Specification
The CSV Table Schema specification can be extended with additional data at certain points.
The properties of these extensions are implemented as free fields, which must always be prefixed with x-
(e.g., x-external-id
). The value can be a string, a number, a boolean value, null, an object, or an array.
The extensions may or may not be supported by the available tools. Ideally, these tools can be extended to add the desired support (e.g., in Open Source projects).
Example: