Dataset schema

Field Data Types

| Type | Explanation | Example |
| --- | --- | --- |
| Text | String values | "text": "Apple IPhone" |
| Numeric | Numeric values (int, long, float) | "price": 1699 |
| Boolean | True and false values | "sold": False |
| Date | A date value. We automatically detect date values that follow the format 2022-01-01 or 2022-01-01T12:00:00Z. | "date": "2022-01-01" |
| Dict | A Python dictionary (also known as an object in JavaScript) | "product_details": {...} |
| Vector | A vector or embedding is a special array of numerics. Data stored as this type will be available for all vector-based operations. The field name must end with the suffix _vector_. A vector cannot exceed a length of 2048; longer vectors can be broken up into smaller vectors. | "word_vector_": [0.12, 0.34, ...] |
| Chunks (Beta) | Chunks are a list/array of documents (dictionaries). They are useful for breaking down a large piece of data into smaller ones, e.g. sentences are chunks of a paragraph. The field name must end with the suffix _chunk_. | "text_chunk_": [{...}, {...}] |
| Chunkvector (Beta) | A vector or embedding for chunks. The field name must end with the suffix _chunkvector_, and the field can only be stored within chunks. A chunkvector cannot exceed a length of 2048; longer vectors can be broken up into smaller vectors. | "text_chunk_.sentence_chunkvector_": [0.12, 0.34, ...] |
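The 2048-value limit on vectors means longer embeddings have to be split before upload. The following is a minimal sketch of one way to do that in plain Python. The helper name `split_vector_field` and the `<base>_<part>_vector_` naming of the resulting fields are assumptions made for this example; the only requirement taken from the documentation is that each field name ends with `_vector_` and holds at most 2048 values.

```python
from typing import Dict, List

MAX_VECTOR_LENGTH = 2048  # limit stated for the Vector type
VECTOR_SUFFIX = "_vector_"


def split_vector_field(name: str, vector: List[float],
                       max_len: int = MAX_VECTOR_LENGTH) -> Dict[str, List[float]]:
    """Split one vector field into fields holding at most max_len values each.

    The "<base>_<part>_vector_" naming is a hypothetical convention chosen
    for this sketch; only the "_vector_" suffix itself is required.
    """
    if not name.endswith(VECTOR_SUFFIX):
        raise ValueError(f"vector field names must end with {VECTOR_SUFFIX!r}")
    if len(vector) <= max_len:
        # Short enough already: keep the field unchanged.
        return {name: vector}
    base = name[: -len(VECTOR_SUFFIX)]
    parts = {}
    for part, start in enumerate(range(0, len(vector), max_len)):
        parts[f"{base}_{part}{VECTOR_SUFFIX}"] = vector[start:start + max_len]
    return parts


fields = split_vector_field("word_vector_", [0.0] * 5000)
print({key: len(value) for key, value in fields.items()})
```

Each resulting field can then be stored as its own vector field in the document.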

List/array field type

Lists and arrays are not identified as a separate type in our schema; a list field takes the type of its elements. For example, "text_list": ["Apple", "IPhone"] is of field type text, and "past_prices": [1500, 1000] is of field type numeric. Chunks are an alternative way to store data as an array, but they are still in beta.
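To make the typing rules concrete, here is a rough sketch of how a field's type could be guessed from its name and value. It is not the platform's actual detection logic: the function `guess_field_type`, the date regular expression, and the returned type names other than "numeric" are assumptions for illustration. It shows the suffix rules taking priority and lists resolving to the type of their elements.

```python
import re
from typing import Any

# Matches the two date layouts mentioned in this page:
# 2022-01-01 and 2022-01-01T12:00:00Z
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?$")


def guess_field_type(name: str, value: Any) -> str:
    """Guess the schema type of one field (illustrative only)."""
    # Suffix rules take priority over the value's Python type.
    if name.endswith("_chunkvector_"):
        return "chunkvector"
    if name.endswith("_vector_"):
        return "vector"
    if name.endswith("_chunk_"):
        return "chunks"
    # Lists are not a type of their own: use the type of the first element.
    if isinstance(value, list):
        return guess_field_type(name, value[0]) if value else "unknown"
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, dict):
        return "dict"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text"
    return "unknown"


print(guess_field_type("text_list", ["Apple", "IPhone"]))
print(guess_field_type("past_prices", [1500, 1000]))
```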

Get a dataset's schema

ds.schema

Create a dataset with a pre-specified schema

ds.create_dataset(schema={"price":"numeric"})

To display the schema in a pandas DataFrame:

ds.info()

Number of documents with missing fields

ds.health()
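For intuition about what such a count measures, the sketch below computes it locally over a list of plain Python dictionaries. It is not how `ds.health()` is implemented and does not reproduce its output format; the function `count_missing` and the choice to treat `None` as missing are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, List


def count_missing(documents: Iterable[Dict[str, Any]],
                  fields: List[str]) -> Dict[str, int]:
    """Return, per field, how many documents lack it (or hold None)."""
    documents = list(documents)
    return {
        field: sum(1 for doc in documents if doc.get(field) is None)
        for field in fields
    }


docs = [
    {"text": "Apple IPhone", "price": 1699, "sold": False},
    {"text": "Apple IPhone", "price": None},
    {"text": "Apple IPhone"},
]
print(count_missing(docs, ["text", "price", "sold"]))
# {'text': 0, 'price': 2, 'sold': 2}
```

Note that `False` counts as a present value here, so only absent keys and `None` are reported as missing.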