Fairspec Dialect
| Authors | Evgeny Karev |
|---|---|
| Profile | https://fairspec.org/profiles/latest/dialect.json |
Fairspec Dialect is a simple JSON-based format for describing a file’s format options and features.
Language
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
Descriptor
A Fairspec Dialect is a JSON resource that MUST be an object compatible with the Dialect structure outlined below.
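As a rough illustration of the object requirement, the sketch below parses JSON text and rejects anything whose top-level value is not an object. The function name `load_dialect_descriptor` is hypothetical and not part of Fairspec; the check covers only the "MUST be an object" rule, not the full profile.

```python
import json


def load_dialect_descriptor(text: str) -> dict:
    """Parse JSON text and ensure the top-level value is an object."""
    descriptor = json.loads(text)
    # json.loads maps a JSON object to a Python dict; anything else is rejected.
    if not isinstance(descriptor, dict):
        raise ValueError("a Fairspec Dialect descriptor must be a JSON object")
    return descriptor
```

Full conformance would additionally require validating the descriptor against the published profile, which this sketch does not attempt.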
Dialect
A top-level descriptor object describing a file dialect. It MAY have the following properties (all optional unless otherwise stated):
$schema
External Path to one of the officially published Fairspec Dialect profiles, with default value https://fairspec.org/profiles/latest/dialect.json.
For example, for version X.Y.Z of the profile:
{ "$schema": "https://fairspec.org/profiles/X.Y.Z/dialect.json" }
format
The file format of the dialect. It MUST be a string.
For example, for a CSV file:
{ "dialect": { "format": "csv" } }
title
An optional human-readable title for the format.
For example:
{ "dialect": { "title": "My custom format" } }
description
An optional detailed description of the format.
For example:
{ "dialect": { "title": "My custom format", "description": "You can open this file with OpenOffice" } }
Dialect Formats
CSV
A format for comma-separated values files. It MUST have format set to "csv". It MUST be UTF-8 encoded. Empty cells (,,) are null values.
Metadata example:
{ "dialect": { "format": "csv" } }
Data example:
name,age,city
Alice,30,New York
Bob,25,London
Charlie,35,Tokyo
Format properties:
- delimiter
- lineTerminator
- quoteChar
- nullSequence
- headerRows
- headerJoin
- commentRows
- commentPrefix
- columnNames
TSV
A format for tab-separated values files. It MUST have format set to "tsv". It MUST be UTF-8 encoded. Empty cells (two consecutive tabs) are null values.
Metadata example:
{ "dialect": { "format": "tsv", "nullSequence": ["NA", "N/A", ""] } }
Data example:
name	age	city
Alice	30	New York
Bob	25	London
Charlie	35	Tokyo
Format properties:
JSON
A format for JSON array files. It MUST have format set to "json".
Metadata example:
{ "dialect": { "format": "json", "jsonPointer": "/data/items" } }
Data example:
[
  {"name": "Alice", "age": 30, "city": "New York"},
  {"name": "Bob", "age": 25, "city": "London"},
  {"name": "Charlie", "age": 35, "city": "Tokyo"}
]
Format properties:
JSONL
A format for JSON Lines files (newline-delimited JSON). It MUST have format set to "jsonl".
Metadata example:
{ "dialect": { "format": "jsonl", "rowType": "object" } }
Data example:
{"name": "Alice", "age": 30, "city": "New York"}
{"name": "Bob", "age": 25, "city": "London"}
{"name": "Charlie", "age": 35, "city": "Tokyo"}
Format properties:
XLSX
A format for Microsoft Excel files. It MUST have format set to "xlsx". Empty cells are null values.
Metadata example:
{ "dialect": { "format": "xlsx", "sheetNumber": 2 } }
Data example:
<binary data>
Format properties:
ODS
A format for OpenDocument Spreadsheet files. It MUST have format set to "ods". Empty cells are null values.
Metadata example:
{ "dialect": { "format": "ods", "sheetName": "Data Sheet" } }
Data example:
<binary data>
Format properties:
Parquet
A format for Apache Parquet files. It MUST have format set to "parquet".
Metadata example:
{ "dialect": { "format": "parquet" } }
Data example:
<binary data>
Arrow
A format for Apache Arrow files. It MUST have format set to "arrow".
Metadata example:
{ "dialect": { "format": "arrow" } }
Data example:
<binary data>
SQLite
A format for SQLite database files. It MUST have format set to "sqlite".
Metadata example:
{ "dialect": { "format": "sqlite" } }
Data example:
<binary data>
Format properties:
Unknown
A format for custom data. It MUST have format set to a value not covered by the formats above.
Metadata example:
{ "dialect": { "format": "custom", "title": "Custom format", "description": "Custom format description" } }
Data example:
<binary data>
Format Properties
name
The name of the custom format. It MUST be a string.
For example:
{ "dialect": { "name": "custom", "title": "Custom format", "description": "Custom format description" } }
delimiter
It MUST be a string of one character length. This property specifies the character sequence which separates fields in the data file.
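One way a reader might honor this property is to pass it to Python's standard `csv` module. This is a minimal sketch, not the Fairspec reference behavior: the helper name `read_rows` is hypothetical, and the comma fallback is an assumption, since this text does not state a default.

```python
import csv
import io


def read_rows(text: str, dialect: dict) -> list[list[str]]:
    """Split delimited text into rows of cells using the dialect's delimiter."""
    # Assumption: fall back to a comma when no delimiter is given.
    delimiter = dialect.get("delimiter", ",")
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows = read_rows(
    "id;name;price\n1;apple;1.50\n2;orange;2.00\n",
    {"format": "csv", "delimiter": ";"},
)
# rows == [["id", "name", "price"], ["1", "apple", "1.50"], ["2", "orange", "2.00"]]
```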
For example:
{ "dialect": { "format": "csv", "delimiter": ";" } }
For a file like:
id;name;price
1;apple;1.50
2;orange;2.00
lineTerminator
It MUST be a string. This property specifies the character sequence which terminates rows in the file. Common values are \n (Unix), \r\n (Windows), \r (old Mac).
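The effect of this property can be sketched as a plain string split. The helper `split_rows` is hypothetical, the `"\n"` default is an assumption, and the sketch deliberately ignores terminators that appear inside quoted cells, which a real parser would have to handle.

```python
def split_rows(text: str, line_terminator: str = "\n") -> list[str]:
    """Split raw file text into row strings on the given terminator."""
    rows = text.split(line_terminator)
    # A trailing terminator leaves an empty final chunk; drop it.
    if rows and rows[-1] == "":
        rows.pop()
    return rows
```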
For example:
{ "dialect": { "format": "csv", "lineTerminator": "\r\n" } }
quoteChar
It MUST be a string of one character length. This property specifies a character to use for quoting in case the delimiter needs to be used inside a data cell.
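A short sketch of quoting with Python's standard `csv` module follows. The helper `read_quoted` is hypothetical, and the double-quote default is the `csv` module's own default rather than something this text specifies.

```python
import csv
import io


def read_quoted(text: str, quote_char: str = '"') -> list[list[str]]:
    """Parse comma-separated text, treating quote_char as the quoting character."""
    return list(csv.reader(io.StringIO(text), quotechar=quote_char))


rows = read_quoted("id,name\n1,'apple,red'\n2,'orange,citrus'\n", "'")
# rows == [["id", "name"], ["1", "apple,red"], ["2", "orange,citrus"]]
```

The delimiter inside the quoted cells is kept as data instead of splitting the cell.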
For example:
{ "dialect": { "format": "csv", "quoteChar": "'" } }
For a file like:
id,name
1,'apple,red'
2,'orange,citrus'
nullSequence
It MUST be a string or an array of strings. This property specifies the null sequence representing missing values in the data.
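Because the property accepts either a string or an array of strings, a reader has to normalize both shapes before comparing cells. The sketch below shows one possible approach; `apply_null_sequence` is a hypothetical helper, and exact (case-sensitive) matching is an assumption.

```python
from typing import Optional, Union


def apply_null_sequence(
    row: list[str], null_sequence: Union[str, list[str]]
) -> list[Optional[str]]:
    """Replace cells that match a null sequence with None."""
    # Normalize the single-string form to a one-element set.
    if isinstance(null_sequence, str):
        nulls = {null_sequence}
    else:
        nulls = set(null_sequence)
    return [None if cell in nulls else cell for cell in row]
```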
For example with a single sequence:
{ "dialect": { "format": "csv", "nullSequence": "NA" } }
For example with multiple sequences:
{ "dialect": { "format": "csv", "nullSequence": ["NA", "N/A", "null", ""] } }
For a file like:
id,name,notes
1,apple,fresh
2,orange,NA
3,banana,N/A
headerRows
It MUST be false or an array of positive integers starting from 1. This property specifies the row numbers for the header.
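The value constraint can be checked mechanically. The following is a minimal sketch with a hypothetical helper name, `is_valid_header_rows`; it covers only the type and range rule stated above, and it excludes booleans inside the array because Python treats `True` as the integer 1.

```python
def is_valid_header_rows(value: object) -> bool:
    """Return True when value is False or a list of integers >= 1."""
    if value is False:
        return True
    if not isinstance(value, list):
        return False
    return all(
        isinstance(number, int) and not isinstance(number, bool) and number >= 1
        for number in value
    )
```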
For example with a single header row:
{ "dialect": { "format": "csv", "headerRows": [1] } }
For example with multi-line headers:
{ "dialect": { "format": "csv", "headerRows": [1, 2] } }
For a file like:
fruit
id,name,price
1,apple,1.50
2,orange,2.00
This would produce headers: “fruit id”, “fruit name”, “fruit price”
For example with no headers:
{ "dialect": { "format": "csv", "headerRows": false } }
headerJoin
It MUST be a string. This property specifies the string used to join the header rows when the header spans multiple rows.
For example:
{ "dialect": { "format": "csv", "headerRows": [1, 2], "headerJoin": "_" } }
For a file like:
fruit
id,name,price
1,apple,1.50
This would produce headers: “fruit_id”, “fruit_name”, “fruit_price”
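The joining step can be sketched as follows. This is an illustration under stated assumptions, not the reference algorithm: `join_header_rows` is a hypothetical helper, the single-space default is inferred from the headerRows example, and every header row is assumed to already hold one cell per column (how a shorter row such as a lone "fruit" is expanded is not covered here).

```python
def join_header_rows(
    rows: list[list[str]], header_rows: list[int], header_join: str = " "
) -> list[str]:
    """Combine the selected header rows into one name per column."""
    # headerRows is 1-based, so subtract one to index the list.
    selected = [rows[number - 1] for number in header_rows]
    # zip(*selected) groups the cells of each column across the header rows.
    return [header_join.join(cells) for cells in zip(*selected)]


rows = [
    ["fruit", "fruit", "fruit"],
    ["id", "name", "price"],
    ["1", "apple", "1.50"],
]
headers = join_header_rows(rows, [1, 2], "_")
# headers == ["fruit_id", "fruit_name", "fruit_price"]
```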
commentRows
It MUST be an array of positive integers starting from 1. This property specifies which rows have to be omitted from the data.
For example:
{ "dialect": { "format": "csv", "commentRows": [1, 5, 10] } }
For a file like:
id,name
# This is a comment row
1,apple
2,orange
With "commentRows": [2], the second row would be skipped.
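Skipping rows by number can be sketched in a few lines. The helper `drop_comment_rows` is hypothetical; it assumes row numbers refer to physical rows of the file, counted from 1.

```python
def drop_comment_rows(rows: list[str], comment_rows: list[int]) -> list[str]:
    """Remove rows whose 1-based position is listed in comment_rows."""
    skip = set(comment_rows)
    return [row for number, row in enumerate(rows, start=1) if number not in skip]


kept = drop_comment_rows(
    ["id,name", "# This is a comment row", "1,apple", "2,orange"], [2]
)
# kept == ["id,name", "1,apple", "2,orange"]
```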
commentPrefix
It MUST be a string. This property specifies which rows have to be omitted from the data based on the row’s prefix.
For example:
{ "dialect": { "format": "csv", "commentPrefix": "#" } }
For a file like:
id,name
# This row is ignored
1,apple
# Another comment
2,orange
Rows starting with # will be skipped.
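Prefix-based filtering might look like the sketch below. `drop_prefixed_rows` is a hypothetical helper, and matching the prefix at the very start of the raw row (without trimming leading whitespace) is an assumption.

```python
def drop_prefixed_rows(rows: list[str], comment_prefix: str) -> list[str]:
    """Remove rows that start with the comment prefix."""
    return [row for row in rows if not row.startswith(comment_prefix)]


kept = drop_prefixed_rows(
    ["id,name", "# This row is ignored", "1,apple", "# Another comment", "2,orange"],
    "#",
)
# kept == ["id,name", "1,apple", "2,orange"]
```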
columnNames
It MUST be an array of strings. This property specifies explicit column names to use instead of deriving them from the file.
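A minimal sketch of applying explicit names to header-less rows is shown here. The helper `label_rows` is hypothetical, and it assumes each row has as many cells as there are names; `zip` silently truncates otherwise.

```python
def label_rows(rows: list[list[str]], column_names: list[str]) -> list[dict[str, str]]:
    """Pair each cell with the explicit column name at the same position."""
    return [dict(zip(column_names, row)) for row in rows]


records = label_rows(
    [["1", "apple", "1.50"], ["2", "orange", "2.00"]],
    ["id", "name", "price"],
)
# records[0] == {"id": "1", "name": "apple", "price": "1.50"}
```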
For example:
{ "dialect": { "format": "csv", "headerRows": false, "columnNames": ["id", "name", "price"] } }
For a file without headers:
1,apple,1.50
2,orange,2.00
jsonPointer
It MUST be a string in JSON Pointer format (RFC 6901). This property specifies where the data is located in the document.
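To make the lookup concrete, here is a small sketch of JSON Pointer resolution over an already parsed document. The function `resolve_pointer` is hypothetical and covers only the basics of RFC 6901 (the `~1` and `~0` escapes and numeric array indices); it omits the `-` array token and detailed error reporting.

```python
def resolve_pointer(document, pointer: str):
    """Follow a JSON Pointer through nested dicts and lists."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a JSON Pointer must be empty or start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # RFC 6901 escapes: "~1" decodes to "/" first, then "~0" to "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


document = {
    "metadata": {"version": "1.0"},
    "data": {"items": [{"id": 1, "name": "apple"}, {"id": 2, "name": "orange"}]},
}
items = resolve_pointer(document, "/data/items")
# items == [{"id": 1, "name": "apple"}, {"id": 2, "name": "orange"}]
```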
For example:
{ "dialect": { "format": "json", "jsonPointer": "/data/items" } }
For a JSON file like:
{
  "metadata": { "version": "1.0" },
  "data": {
    "items": [
      { "id": 1, "name": "apple" },
      { "id": 2, "name": "orange" }
    ]
  }
}
rowType
It MUST be one of the following values: array, object. This property specifies whether the data items are arrays or objects.
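The two row shapes can be normalized to a common record form, as in the sketch below. `rows_to_records` is a hypothetical helper, and requiring explicit column names for array rows is a simplification made for this sketch only, not a rule stated in this text.

```python
from typing import Optional


def rows_to_records(
    items: list, row_type: str, column_names: Optional[list[str]] = None
) -> list[dict]:
    """Convert data items to dicts according to rowType."""
    if row_type == "object":
        return [dict(item) for item in items]
    if row_type == "array":
        # Simplification: this sketch only handles array rows with explicit names.
        if column_names is None:
            raise ValueError("this sketch needs column names for array rows")
        return [dict(zip(column_names, item)) for item in items]
    raise ValueError("rowType must be 'array' or 'object'")
```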
For example with array of objects:
{ "dialect": { "format": "json", "rowType": "object" } }
For data like:
[
  { "id": 1, "name": "apple" },
  { "id": 2, "name": "orange" }
]
For example with array of arrays:
{ "dialect": { "format": "json", "rowType": "array", "columnNames": ["id", "name"] } }
For data like:
[
  [1, "apple"],
  [2, "orange"]
]
sheetNumber
It MUST be an integer. This property specifies the number of the sheet that holds the table in the spreadsheet file. If not provided, the first sheet is used.
For example:
{ "dialect": { "format": "xlsx", "sheetNumber": 2 } }
This reads the second sheet from the spreadsheet.
sheetName
It MUST be a string. This property specifies the name of the sheet that holds the table in the spreadsheet file.
For example:
{ "dialect": { "format": "xlsx", "sheetName": "Data Sheet" } }
tableName
It MUST be a string. This property specifies the name of the table in the database. If not provided, the first table is used (sorted by name in ascending order).
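The fallback rule can be sketched with Python's standard `sqlite3` module. The helper `pick_table` is hypothetical, and relying on SQLite's default byte-wise ordering for "sorted by name" is an assumption about what the rule intends.

```python
import sqlite3
from typing import Optional


def pick_table(connection: sqlite3.Connection, table_name: Optional[str] = None) -> str:
    """Return the requested table, or the first table by ascending name."""
    if table_name is not None:
        return table_name
    row = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name ASC LIMIT 1"
    ).fetchone()
    if row is None:
        raise ValueError("the database contains no tables")
    return row[0]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurements (id INTEGER)")
connection.execute("CREATE TABLE annotations (id INTEGER)")
first = pick_table(connection)
# first == "annotations"
```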
For example:
{ "dialect": { "format": "sqlite", "tableName": "measurements" } }
Common
Common properties shared by multiple entities in the descriptor.
External Path
It MUST be a string representing an HTTP or HTTPS URL to a remote file.
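A simple check for this rule might look like the following sketch, which uses `urllib.parse` from the standard library. The helper `is_external_path` is hypothetical; requiring a non-empty host is an extra assumption beyond the scheme rule stated above.

```python
from urllib.parse import urlparse


def is_external_path(value: object) -> bool:
    """Return True for a string that parses as an HTTP or HTTPS URL."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    # Assumption: a usable URL also needs a host part (netloc).
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```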
For example:
{ "data": "https://example.com/datasets/measurements.csv" }
Extension
Fairspec Dialect does not support extension.