Fairspec Dialect

Authors	Evgeny Karev
Profile	https://fairspec.org/profiles/latest/dialect.json

Fairspec Dialect is a simple JSON-based format for describing a file’s format options and features.

Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Descriptor

A Fairspec Dialect is a JSON resource that MUST be an object compatible with the Dialect structure outlined below.

Dialect

A top-level descriptor object describing a file dialect. It MAY have the following properties (all optional unless otherwise stated):

`$schema`

An External Path to one of the officially published Fairspec Dialect profiles. The default value is https://fairspec.org/profiles/latest/dialect.json.

For example for version X.Y.Z of the profile:

{
  "$schema": "https://fairspec.org/profiles/X.Y.Z/dialect.json"
}

`format`

The file format of the dialect. It MUST be a string.

For example for a CSV file:

{
  "dialect": {
    "format": "csv"
  }
}

`title`

An optional human-readable title for the format.

For example:

{
  "dialect": {
    "title": "My custom format"
  }
}

`description`

An optional detailed description of the format.

For example:

{
  "dialect": {
    "title": "My custom format",
    "description": "You can open this file with OpenOffice"
  }
}
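
The type rules for the properties above can be checked mechanically. The following Python function is a minimal sketch, not part of the specification: the name `check_dialect` is hypothetical, it takes the dialect object itself, and it assumes that `$schema`, `title`, and `description` are strings in the same way `format` is.

```python
def check_dialect(dialect):
    """Return a list of problems found in the common Dialect properties."""
    if not isinstance(dialect, dict):
        return ["dialect must be a JSON object"]
    problems = []
    # Assumption: all four common properties are plain strings when present.
    for key in ("$schema", "format", "title", "description"):
        if key in dialect and not isinstance(dialect[key], str):
            problems.append(f"{key} must be a string")
    return problems


ok = check_dialect({"format": "csv", "title": "My custom format"})
bad = check_dialect({"format": 1})
```

An empty list means no problems were found by this particular check; it does not replace validation against the published profile.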

Dialect Formats

CSV

A format for comma-separated values files. It MUST have format set to "csv". It MUST be UTF-8 encoded. Empty cells (,,) are null values.

Metadata example:

{
  "dialect": {
    "format": "csv"
  }
}

Data example:

name,age,city
Alice,30,New York
Bob,25,London
Charlie,35,Tokyo

Format properties:

delimiter
lineTerminator
quoteChar
nullSequence
headerRows
headerJoin
commentRows
commentPrefix
columnNames

TSV

A format for tab-separated values files. It MUST have format set to "tsv". It MUST be UTF-8 encoded. Empty cells (two consecutive tab characters) are null values.

Metadata example:

{
  "dialect": {
    "format": "tsv",
    "nullSequence": ["NA", "N/A", ""]
  }
}

Data example:

name  age  city
Alice  30  New York
Bob  25  London
Charlie  35  Tokyo

Format properties:

lineTerminator
nullSequence
headerRows
headerJoin
commentRows
commentPrefix
columnNames

JSON

A format for JSON array files. It MUST have format set to "json".

Metadata example:

{
  "dialect": {
    "format": "json",
    "jsonPointer": "/data/items"
  }
}

Data example:

[
  {"name": "Alice", "age": 30, "city": "New York"},
  {"name": "Bob", "age": 25, "city": "London"},
  {"name": "Charlie", "age": 35, "city": "Tokyo"}
]

Format properties:

headerRows
headerJoin
commentRows
commentPrefix
columnNames
jsonPointer
rowType

JSONL

A format for JSON Lines files (newline-delimited JSON). It MUST have format set to "jsonl".

Metadata example:

{
  "dialect": {
    "format": "jsonl",
    "rowType": "object"
  }
}

Data example:

{"name": "Alice", "age": 30, "city": "New York"}
{"name": "Bob", "age": 25, "city": "London"}
{"name": "Charlie", "age": 35, "city": "Tokyo"}

Format properties:

headerRows
headerJoin
commentRows
commentPrefix
columnNames
rowType

XLSX

A format for Microsoft Excel files. It MUST have format set to "xlsx". Empty cells are null values.

Metadata example:

{
  "dialect": {
    "format": "xlsx",
    "sheetNumber": 2
  }
}

Data example:

<binary data>

Format properties:

headerRows
headerJoin
commentRows
commentPrefix
columnNames
sheetName
sheetNumber

ODS

A format for OpenDocument Spreadsheet files. It MUST have format set to "ods". Empty cells are null values.

Metadata example:

{
  "dialect": {
    "format": "ods",
    "sheetName": "Data Sheet"
  }
}

Data example:

<binary data>

Format properties:

headerRows
headerJoin
commentRows
commentPrefix
columnNames
sheetName
sheetNumber

Parquet

A format for Apache Parquet files. It MUST have format set to "parquet".

Metadata example:

{
  "dialect": {
    "format": "parquet"
  }
}

Data example:

<binary data>

Arrow

A format for Apache Arrow files. It MUST have format set to "arrow".

Metadata example:

{
  "dialect": {
    "format": "arrow"
  }
}

Data example:

<binary data>

SQLite

A format for SQLite database files. It MUST have format set to "sqlite".

Metadata example:

{
  "dialect": {
    "format": "sqlite"
  }
}

Data example:

<binary data>

Format properties:

tableName

Unknown

A format for custom data. It MUST have format set to a value that is not one of the formats listed above.

Metadata example:

{
  "dialect": {
    "format": "custom",
    "title": "Custom format",
    "description": "Custom format description"
  }
}

Data example:

<binary data>

Format Properties
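
The per-format property lists in the Dialect Formats section can be turned into a lookup table. The Python sketch below is illustrative only: the names `FORMAT_PROPERTIES`, `COMMON_PROPERTIES`, and `unsupported_properties` are hypothetical, and treating an unlisted property as a problem is an assumption rather than a rule stated in this document.

```python
# Properties shared by the tabular text and spreadsheet formats.
TABLE = {"headerRows", "headerJoin", "commentRows", "commentPrefix", "columnNames"}
TEXT = TABLE | {"lineTerminator", "nullSequence"}

# Property lists copied from the format sections of this document.
FORMAT_PROPERTIES = {
    "csv": TEXT | {"delimiter", "quoteChar"},
    "tsv": TEXT,
    "json": TABLE | {"jsonPointer", "rowType"},
    "jsonl": TABLE | {"rowType"},
    "xlsx": TABLE | {"sheetName", "sheetNumber"},
    "ods": TABLE | {"sheetName", "sheetNumber"},
    "parquet": set(),
    "arrow": set(),
    "sqlite": {"tableName"},
}
COMMON_PROPERTIES = {"$schema", "format", "title", "description"}


def unsupported_properties(dialect):
    """List properties that are not documented for the dialect's format."""
    fmt = dialect.get("format")
    if fmt not in FORMAT_PROPERTIES:
        return []  # unknown or custom formats are not checked here
    allowed = COMMON_PROPERTIES | FORMAT_PROPERTIES[fmt]
    return sorted(key for key in dialect if key not in allowed)


extra = unsupported_properties({"format": "tsv", "delimiter": ";"})
```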

`name`

The name of the custom format. It MUST be a string.

For example:

{
  "dialect": {
    "name": "custom",
    "title": "Custom format",
    "description": "Custom format description"
  }
}

`delimiter`

It MUST be a string of one character length. This property specifies the character that separates fields in the data file.

For example:

{
  "dialect": {
    "format": "csv",
    "delimiter": ";"
  }
}

For a file like:

id;name;price
1;apple;1.50
2;orange;2.00
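
A reader can pass the delimiter straight to a CSV parser. This is a minimal sketch using Python's standard `csv` module; the function name `read_rows` is hypothetical, and the comma fallback is an assumption because this document does not state a default.

```python
import csv
import io


def read_rows(text, dialect):
    """Parse delimited text using the delimiter from the dialect."""
    # Assumption: fall back to a comma when the dialect omits the delimiter.
    delimiter = dialect.get("delimiter", ",")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


text = "id;name;price\n1;apple;1.50\n2;orange;2.00\n"
rows = read_rows(text, {"format": "csv", "delimiter": ";"})
```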

`lineTerminator`

It MUST be a string. This property specifies the character sequence which terminates rows in the file. Common values are \n (Unix), \r\n (Windows), \r (old Mac).

For example:

{
  "dialect": {
    "format": "csv",
    "lineTerminator": "\r\n"
  }
}
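
As a rough illustration, rows can be separated by splitting the text on the configured terminator. The sketch below is deliberately simplified: `split_rows` is a hypothetical name, the "\n" fallback is an assumption, and the function ignores quoted cells that themselves contain the terminator.

```python
def split_rows(text, dialect):
    """Split raw text into row strings using the dialect's line terminator."""
    # Assumption: default to "\n" when the property is absent.
    terminator = dialect.get("lineTerminator", "\n")
    rows = text.split(terminator)
    # A trailing terminator leaves one empty string at the end; drop it.
    if rows and rows[-1] == "":
        rows.pop()
    return rows


raw = "id,name\r\n1,apple\r\n2,orange\r\n"
lines = split_rows(raw, {"format": "csv", "lineTerminator": "\r\n"})
```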

`quoteChar`

It MUST be a string of one character length. This property specifies a character to use for quoting in case the delimiter needs to be used inside a data cell.

For example:

{
  "dialect": {
    "format": "csv",
    "quoteChar": "'"
  }
}

For a file like:

id,name
1,'apple,red'
2,'orange,citrus'
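
The effect of the quote character can be seen by parsing the file above. This sketch uses Python's standard `csv` module; `read_quoted` is a hypothetical name, and the double-quote fallback is an assumption not stated in this document.

```python
import csv
import io


def read_quoted(text, dialect):
    """Parse CSV text, honouring the dialect's quote character."""
    # Assumption: default to a double quote when quoteChar is absent.
    quote_char = dialect.get("quoteChar", '"')
    return list(csv.reader(io.StringIO(text), quotechar=quote_char))


text = "id,name\n1,'apple,red'\n2,'orange,citrus'\n"
rows = read_quoted(text, {"format": "csv", "quoteChar": "'"})
```

Each quoted cell keeps its embedded comma instead of being split into two fields.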

`nullSequence`

It MUST be a string or an array of strings. This property specifies the null sequence representing missing values in the data.

For example with a single sequence:

{
  "dialect": {
    "format": "csv",
    "nullSequence": "NA"
  }
}

For example with multiple sequences:

{
  "dialect": {
    "format": "csv",
    "nullSequence": ["NA", "N/A", "null", ""]
  }
}

For a file like:

id,name,notes
1,apple,fresh
2,orange,NA
3,banana,N/A
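
Because the property accepts either a string or an array, a reader will usually normalize it first. A minimal Python sketch, with the hypothetical function name `apply_null_sequence` and rows represented as lists of strings:

```python
def apply_null_sequence(rows, dialect):
    """Replace cells matching any configured null sequence with None."""
    null_sequence = dialect.get("nullSequence", [])
    # Normalize the single-string form to a list.
    if isinstance(null_sequence, str):
        null_sequence = [null_sequence]
    nulls = set(null_sequence)
    return [[None if cell in nulls else cell for cell in row] for row in rows]


rows = [["1", "apple", "fresh"], ["2", "orange", "NA"], ["3", "banana", "N/A"]]
dialect = {"format": "csv", "nullSequence": ["NA", "N/A", "null", ""]}
cleaned = apply_null_sequence(rows, dialect)
```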

`headerRows`

It MUST be false or an array of positive integers starting from 1. This property specifies the row numbers for the header.

For example with a single header row:

{
  "dialect": {
    "format": "csv",
    "headerRows": [1]
  }
}

For example with multi-line headers:

{
  "dialect": {
    "format": "csv",
    "headerRows": [1, 2]
  }
}

For a file like:

fruit
id,name,price
1,apple,1.50
2,orange,2.00

This would produce headers: “fruit id”, “fruit name”, “fruit price”

For example with no headers:

{
  "dialect": {
    "format": "csv",
    "headerRows": false
  }
}

`headerJoin`

It MUST be a string. This property specifies the string used to join the cells of a multi-line header into a single header name.

For example:

{
  "dialect": {
    "format": "csv",
    "headerRows": [1, 2],
    "headerJoin": "_"
  }
}

For a file like:

fruit
id,name,price
1,apple,1.50

This would produce headers: “fruit_id”, “fruit_name”, “fruit_price”
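
The way `headerRows` and `headerJoin` combine can be sketched as follows. This is an illustration under assumptions, not the reference algorithm: the name `build_header` is hypothetical, the fallbacks of `[1]` for `headerRows` and a single space for `headerJoin` are assumptions (the space is inferred from the "fruit id" example), and the forward-filling of short header rows is inferred from the examples.

```python
def build_header(rows, dialect):
    """Combine the configured header rows into one list of column names."""
    header_rows = dialect.get("headerRows", [1])  # assumed default
    if header_rows is False:
        return []
    join = dialect.get("headerJoin", " ")  # assumed default
    selected = [rows[number - 1] for number in header_rows]  # 1-based rows
    width = max(len(row) for row in selected)
    filled = []
    for row in selected:
        padded, last = [], ""
        for index in range(width):
            # Forward-fill missing or empty cells from the cell to the left.
            cell = row[index] if index < len(row) and row[index] != "" else last
            padded.append(cell)
            last = cell
        filled.append(padded)
    return [join.join(part for part in column if part) for column in zip(*filled)]


rows = [["fruit"], ["id", "name", "price"], ["1", "apple", "1.50"]]
joined = build_header(rows, {"headerRows": [1, 2], "headerJoin": "_"})
spaced = build_header(rows, {"headerRows": [1, 2]})
```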

`commentRows`

It MUST be an array of positive integers starting from 1. This property specifies which rows have to be omitted from the data.

For example:

{
  "dialect": {
    "format": "csv",
    "commentRows": [1, 5, 10]
  }
}

For a file like:

id,name
# This is a comment row
1,apple
2,orange

With "commentRows": [2], the second row would be skipped.

`commentPrefix`

It MUST be a string. This property specifies which rows have to be omitted from the data based on the row’s prefix.

For example:

{
  "dialect": {
    "format": "csv",
    "commentPrefix": "#"
  }
}

For a file like:

id,name
# This row is ignored
1,apple
# Another comment
2,orange

Rows starting with # will be skipped.
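
Both comment properties amount to a filter over the raw rows. The Python sketch below handles `commentRows` and `commentPrefix` together; `drop_comments` is a hypothetical name, and applying both rules in a single pass is an assumption about how they interact.

```python
def drop_comments(lines, dialect):
    """Remove rows listed in commentRows or starting with commentPrefix."""
    comment_rows = set(dialect.get("commentRows", []))
    prefix = dialect.get("commentPrefix")
    kept = []
    for number, line in enumerate(lines, start=1):  # row numbers start at 1
        if number in comment_rows:
            continue
        if prefix and line.startswith(prefix):
            continue
        kept.append(line)
    return kept


lines = ["id,name", "# This row is ignored", "1,apple", "# Another comment", "2,orange"]
by_prefix = drop_comments(lines, {"format": "csv", "commentPrefix": "#"})
by_number = drop_comments(lines[:3], {"format": "csv", "commentRows": [2]})
```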

`columnNames`

It MUST be an array of strings. This property specifies explicit column names to use instead of deriving them from the file.

For example:

{
  "dialect": {
    "format": "csv",
    "headerRows": false,
    "columnNames": ["id", "name", "price"]
  }
}

For a file without headers:

1,apple,1.50
2,orange,2.00
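
With explicit column names, each row of a headerless file can be paired with the names directly. A minimal sketch, where `to_records` is a hypothetical name and cell values are kept as strings:

```python
def to_records(rows, dialect):
    """Pair every row with the explicit column names from the dialect."""
    names = dialect["columnNames"]
    return [dict(zip(names, row)) for row in rows]


dialect = {"format": "csv", "headerRows": False, "columnNames": ["id", "name", "price"]}
records = to_records([["1", "apple", "1.50"], ["2", "orange", "2.00"]], dialect)
```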

`jsonPointer`

It MUST be a string in JSON Pointer format (RFC 6901). This property specifies where the data is located in the document.

For example:

{
  "dialect": {
    "format": "json",
    "jsonPointer": "/data/items"
  }
}

For a JSON file like:

{
  "metadata": { "version": "1.0" },
  "data": {
    "items": [
      { "id": 1, "name": "apple" },
      { "id": 2, "name": "orange" }
    ]
  }
}
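
Resolving the pointer against the document above yields the array of items. The following is a small hand-written resolver following RFC 6901 token unescaping; `resolve_pointer` is a hypothetical name, and the sketch does not handle the special "-" array token.

```python
def resolve_pointer(document, pointer):
    """Return the value that an RFC 6901 JSON Pointer refers to."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape in the order required by RFC 6901: "~1" first, then "~0".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


document = {
    "metadata": {"version": "1.0"},
    "data": {"items": [{"id": 1, "name": "apple"}, {"id": 2, "name": "orange"}]},
}
items = resolve_pointer(document, "/data/items")
```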

`rowType`

It MUST be one of the following values: array, object. This property specifies whether the data items are arrays or objects.

For example with array of objects:

{
  "dialect": {
    "format": "json",
    "rowType": "object"
  }
}

For data like:

[
  { "id": 1, "name": "apple" },
  { "id": 2, "name": "orange" }
]

For example with array of arrays:

{
  "dialect": {
    "format": "json",
    "rowType": "array",
    "columnNames": ["id", "name"]
  }
}

For data like:

[
  [1, "apple"],
  [2, "orange"]
]
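
A reader can use `rowType` to bring both shapes into the same form. In this sketch, `normalize_rows` is a hypothetical name, and requiring `columnNames` for array rows is an assumption based on the example above.

```python
def normalize_rows(items, dialect):
    """Convert object rows or array rows into a list of dictionaries."""
    row_type = dialect["rowType"]
    if row_type == "object":
        return [dict(item) for item in items]
    if row_type == "array":
        # Assumption: array rows are named through columnNames.
        names = dialect["columnNames"]
        return [dict(zip(names, item)) for item in items]
    raise ValueError(f"unsupported rowType: {row_type!r}")


objects = normalize_rows(
    [{"id": 1, "name": "apple"}, {"id": 2, "name": "orange"}],
    {"format": "json", "rowType": "object"},
)
arrays = normalize_rows(
    [[1, "apple"], [2, "orange"]],
    {"format": "json", "rowType": "array", "columnNames": ["id", "name"]},
)
```

Both calls produce the same list of records.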

`sheetNumber`

It MUST be an integer. This property specifies the number of the sheet that contains the table in the spreadsheet file. If not provided, the first sheet is used.

For example:

{
  "dialect": {
    "format": "xlsx",
    "sheetNumber": 2
  }
}

This reads the second sheet from the spreadsheet.

`sheetName`

It MUST be a string. This property specifies the name of the sheet that contains the table in the spreadsheet file.

For example:

{
  "dialect": {
    "format": "xlsx",
    "sheetName": "Data Sheet"
  }
}

`tableName`

It MUST be a string. This property specifies the name of the table in the database. If not provided, the first table is used (sorted by name in ascending order).

For example:

{
  "dialect": {
    "format": "sqlite",
    "tableName": "measurements"
  }
}
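
The fallback rule for a missing `tableName` can be sketched with Python's standard `sqlite3` module. The name `resolve_table_name` is hypothetical, and skipping SQLite's internal `sqlite_` tables is an assumption.

```python
import sqlite3


def resolve_table_name(connection, dialect):
    """Return the configured table name, or the first table sorted by name."""
    if "tableName" in dialect:
        return dialect["tableName"]
    row = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name ASC LIMIT 1"
    ).fetchone()
    if row is None:
        raise ValueError("database contains no tables")
    return row[0]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE zones (id INTEGER)")
connection.execute("CREATE TABLE measurements (id INTEGER)")
default_table = resolve_table_name(connection, {"format": "sqlite"})
```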

Common

Common properties shared by multiple entities in the descriptor.

External Path

It MUST be a string representing an HTTP or HTTPS URL to a remote file.

For example:

{
  "data": "https://example.com/datasets/measurements.csv"
}
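
A basic check for this rule can be written with Python's standard `urllib.parse`. The function name `is_external_path` is hypothetical, and the sketch only inspects the URL's shape; it does not test whether the remote file exists.

```python
from urllib.parse import urlparse


def is_external_path(value):
    """Check that a value is a string holding an HTTP or HTTPS URL."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


valid = is_external_path("https://example.com/datasets/measurements.csv")
```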

Extension

Fairspec Dialect does not support extensions.
