Google Introduces New Search Engine for Finding Datasets

Google has launched a new search engine designed specifically to help people find data.

Simply called “Dataset Search,” the tool provides easier access to millions of datasets across thousands of data repositories on the web.

Anyone can use Dataset Search, which is still in beta, but Google emphasizes the benefits it has for journalists and data scientists.

“In today’s world, scientists in many disciplines and a growing number of journalists live and breathe data… To enable easy access to this data, we launched Dataset Search, so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.”

Dataset Search surfaces results from publishers’ sites, digital libraries, and authors’ personal web pages.

Google’s new search engine is largely dependent on the schema markup for dataset providers that was rolled out in July. Dataset markup allows publishers to describe their data in a way that Google (and other search engines) can better understand the content of their pages. Google encourages dataset providers to utilize this markup in order to have their content included in Dataset Search.

Currently, Dataset Search can be used to find references to most datasets in environmental sciences, social sciences, as well as government data and data provided by news organizations. When more publishers begin using the new schema markup, Google will eventually expand the variety of content included in Dataset Search.

Dataset Search is available in multiple languages and works just like any other search engine. Just type in what you’re looking for and Google will return relevant datasets.

Dataset

Datasets are easier to find when you provide supporting information such as their name, description, creator and distribution formats as structured data. Google’s approach to dataset discovery makes use of schema.org and other metadata standards that can be added to pages that describe datasets. The purpose of this markup is to improve discovery of datasets from fields such as life sciences, social sciences, machine learning, civic and government data, and more.

Here are some examples of what can qualify as a dataset:

  • A table or a CSV file with some data
  • An organized collection of tables
  • A file in a proprietary format that contains data
  • A collection of files that together constitute some meaningful dataset
  • A structured object with data in some other format that you might want to load into a special tool for processing
  • Images capturing data
  • Files relating to machine learning, such as trained parameters or neural network structure definitions
  • Anything that looks like a dataset to you

Our approach to dataset discovery

We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C’s Data Catalog Vocabulary (DCAT) format. We are also exploring experimental support for structured data based on W3C CSVW, and expect to evolve and adapt our approach as best practices for dataset description emerge. For more information about our approach to dataset discovery, see Facilitating the discovery of public datasets.

Example

Here’s an example for datasets using JSON-LD code and the Schema.org vocabulary in the Structured Data Testing Tool. The following example is based on a real-world dataset description.
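
As a rough illustration of what such markup can look like, here is a minimal Python sketch that assembles a Dataset description and wraps it in a JSON-LD script tag. The helper names (build_dataset_markup, to_script_tag), the description text, the keywords, and the example.org URL are hypothetical placeholders, not taken from a real dataset page.

```python
import json


def build_dataset_markup(name, description, url=None, keywords=None):
    """Return a schema.org Dataset description as a plain dict.

    name and description are the required properties; url and keywords
    are optional (recommended) properties.
    """
    if not name or not description:
        raise ValueError("name and description are required")
    markup = {
        "@context": "http://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    if url:
        markup["url"] = url
    if keywords:
        markup["keywords"] = list(keywords)
    return markup


def to_script_tag(markup):
    """Serialize the dict as a JSON-LD script element for an HTML page."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical values, for illustration only.
example = build_dataset_markup(
    name="Snow depth in Northern Hemisphere",
    description="Placeholder summary of a snow depth dataset.",
    url="https://example.org/datasets/snow-depth",
    keywords=["snow depth", "northern hemisphere"],
)
print(to_script_tag(example))
```

The printed element would go in the HTML of the dataset's landing page.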

The same vocabulary can be used in JSON-LD (preferred), RDFa 1.1, or Microdata syntax.

It is also possible to use W3C DCAT vocabulary. Here is a simple example using RDFa:

Guidelines

Sites should follow the structured data guidelines. In addition, we recommend the sitemap and source and provenance best practices listed below.

Sitemap best practices

Use a sitemap file to help Google find your URLs. Using sitemap files and sameAs markup helps document how dataset descriptions are published throughout your site.

If you have a dataset repository, you likely have at least two types of pages: the canonical (“landing”) pages for each dataset and pages that list multiple datasets (for example, search results, or some subset of datasets). We recommend that you add structured data about a dataset to the canonical pages. Use the sameAs property to link to the canonical page if you add structured data to multiple copies of the dataset, such as listings in search results pages.

Source and provenance best practices

It is common for open datasets to be republished, aggregated, and based on other datasets. This is an initial outline of our approach to representing situations in which a dataset is a copy of, or otherwise based upon, another dataset.

  • Use the sameAs property to indicate the most canonical URLs for the original in cases when the dataset or description is a simple republication of materials published elsewhere.
  • Use the isBasedOn property in cases where the republished dataset (including its metadata) has been changed significantly.
  • When a dataset derives from or aggregates several originals, use the isBasedOn property.
  • Use the identifier property to attach any relevant Digital Object Identifiers (DOIs).
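
The rules in this list can be sketched as a small Python helper that picks the property to emit. The function name, its parameters, and the example URLs and DOI are hypothetical; whether a republication counts as "changed significantly" is a judgment the publisher has to make.

```python
def provenance_properties(source_urls, significantly_changed=False, dois=()):
    """Choose sameAs or isBasedOn for a republished dataset.

    source_urls: canonical URL(s) of the original dataset(s).
    significantly_changed: True if the data or metadata was altered.
    dois: any Digital Object Identifiers to attach via identifier.
    """
    sources = list(source_urls)
    props = {}
    if len(sources) > 1:
        # Derived from or aggregating several originals.
        props["isBasedOn"] = sources
    elif sources and significantly_changed:
        # A single original, republished with significant changes.
        props["isBasedOn"] = sources[0]
    elif sources:
        # A simple republication of material published elsewhere.
        props["sameAs"] = sources[0]
    if dois:
        props["identifier"] = list(dois)
    return props


# Hypothetical URLs and DOI, for illustration only.
copy = provenance_properties(["https://example.org/original"])
derived = provenance_properties(
    ["https://example.org/a", "https://example.org/b"],
    dois=["https://doi.org/10.1234/hypothetical"],
)
print(copy)
print(derived)
```

The returned dict can be merged into the Dataset markup for the republished page.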

We hope to improve our recommendations based on feedback, in particular around the description of provenance, versioning, and the dates associated with time series publication. Please join in community discussions.

Known Errors and Warnings

You may experience errors or warnings in Google’s Structured Data Testing Tool and other validation systems. Specifically, warnings about fileFormat (renamed recently to encodingFormat) can be safely ignored. Validation systems may also suggest that organizations should have contact information including a contactType; useful values include customer service, emergency, journalist, newsroom, and public engagement. You can also ignore errors for csvw:Table being an unexpected value for the mainEntity property.

Structured data type definitions

You must include the required properties for your structured data to display in search results. You can also include the recommended properties to add more information to your markup, which could provide a better user experience.

You can use the Structured Data Testing Tool to validate your markup.

The focus is on describing information about a dataset (its metadata) and representing its contents. For example, dataset metadata states what the dataset is about, which variables it measures, who created it, and so on. It does not, for example, contain specific values for the variables.

Dataset

The full definition of Dataset is available at schema.org/Dataset.

Properties
@context Required

Set the @context to “http://schema.org/”.

@type Required

Set the @type to “Dataset”. For example:

"@type": "Dataset"

citation Text or CreativeWork, Recommended

A citation for a publication that describes the dataset. For example, “J.Smith ‘How I created an awesome dataset’, Journal of Data Science, 1966”.

description Text, Required

A short summary describing a dataset.

keywords Text, Recommended

Keywords summarizing the dataset.

name Text, Required

A descriptive name of a dataset. For example, “Snow depth in Northern Hemisphere”.

sameAs URL, Recommended

Other URLs that can be used to access the dataset page.

spatialCoverage Text or Place, Recommended (only if the dataset has a spatial extent)

You can provide a single point that describes the spatial aspect of the dataset. For example, a single point where all the measurements were collected, or the coordinates of a bounding box for an area.

Points

"spatialCoverage": {
  "@type": "Place",
  "geo": {
    "@type": "GeoCoordinates",
    "latitude": 39.3280,
    "longitude": 120.1633
  }
}

Coordinates

Use GeoShape to describe areas of different shapes, for example to specify a bounding box.

"spatialCoverage": {
  "@type": "Place",
  "geo": {
    "@type": "GeoShape",
    "box": "39.3280 120.1633 40.445 123.7878"
  }
}
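
The point and bounding-box forms above can also be generated programmatically. The following Python sketch uses hypothetical helper names and assumes the box string follows the ordering of the example above: latitude then longitude for the first corner, followed by latitude then longitude for the second.

```python
import json


def _check_lat_lon(latitude, longitude):
    """Reject coordinates outside the valid geographic ranges."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")


def point_coverage(latitude, longitude):
    """spatialCoverage value for a single point."""
    _check_lat_lon(latitude, longitude)
    return {
        "@type": "Place",
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": latitude,
            "longitude": longitude,
        },
    }


def box_coverage(lat1, lon1, lat2, lon2):
    """spatialCoverage value for a bounding box given two corners."""
    _check_lat_lon(lat1, lon1)
    _check_lat_lon(lat2, lon2)
    return {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            "box": f"{lat1} {lon1} {lat2} {lon2}",
        },
    }


point = point_coverage(39.3280, 120.1633)
box = box_coverage(39.3280, 120.1633, 40.445, 123.7878)
print(json.dumps({"spatialCoverage": box}, indent=2))
```

Validating the ranges up front catches swapped latitude and longitude values in many, though not all, cases.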

Named locations

"spatialCoverage": "Tahoe City, CA"

temporalCoverage Text, Recommended (only if the dataset has a temporal extent)

The data in the dataset covers a specific time interval. Schema.org uses the ISO 8601 standard to describe time intervals and time points. You can describe dates differently depending upon the dataset interval. Indicate open-ended intervals with two decimal points (..).
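
The three forms (a single date, a closed interval, and an open-ended interval) can be produced by a small formatter. This is a sketch with a hypothetical function name and parameters. It accepts either datetime.date objects or strings that are already in ISO 8601 form, and it does not validate the strings.

```python
from datetime import date


def temporal_coverage(start, end=None, open_ended=False):
    """Format a temporalCoverage value.

    start, end: datetime.date objects or ISO 8601 strings such as "2008".
    open_ended: True for an interval with no end, written as "start/..".
    """
    def fmt(value):
        return value.isoformat() if isinstance(value, date) else str(value)

    if end is not None and open_ended:
        raise ValueError("an interval cannot have an end and be open-ended")
    if end is not None:
        return f"{fmt(start)}/{fmt(end)}"
    if open_ended:
        return f"{fmt(start)}/.."
    return fmt(start)


print(temporal_coverage("2008"))
print(temporal_coverage(date(1950, 1, 1), date(2013, 12, 18)))
print(temporal_coverage(date(2013, 12, 19), open_ended=True))
```
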

Single date

"temporalCoverage" : "2008"

Time period

"temporalCoverage" : "1950-01-01/2013-12-18"

Open-ended time period

"temporalCoverage" : "2013-12-19/.."

variableMeasured Text or PropertyValue, Recommended

The variable that this dataset measures. For example, temperature or pressure.

version Text or Number, Recommended

The version number for the dataset.

url URL, Recommended

Location of a page describing the dataset.

DataCatalog

The full definition of DataCatalog is available at schema.org/DataCatalog.

Datasets are often published in repositories that contain many other datasets. The same dataset can be included in more than one such repository. You can refer to a data catalog that this dataset belongs to by referencing it directly.

Properties

includedInDataCatalog DataCatalog

The catalog to which the dataset belongs.

DataDownload

The full definition of DataDownload is available at schema.org/DataDownload. In addition to Dataset properties, add the following properties for datasets that provide download options.

The distribution property describes how to get the dataset itself because the URL often points to the landing page describing the dataset. The distribution property describes where to get the data and in what format. This property can have several values: for instance, a CSV version has one URL and an Excel version is available at another.

Properties

distribution DataDownload, Recommended

The description of the location for download of the dataset and the file format for download.

distribution.contentUrl URL, Required

The link for the download.

distribution.fileFormat Text, Recommended

The file format of the distribution.
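
To illustrate a dataset with several download options, here is a minimal Python sketch that attaches two DataDownload entries to a Dataset. The helper name, the example.org URLs, the description text, and the file format strings are hypothetical. The sketch uses fileFormat as listed above, although the warnings section notes that the property was recently renamed to encodingFormat.

```python
import json


def data_download(content_url, file_format=None):
    """Build one DataDownload entry; contentUrl is required."""
    if not content_url:
        raise ValueError("contentUrl is required")
    entry = {"@type": "DataDownload", "contentUrl": content_url}
    if file_format:
        entry["fileFormat"] = file_format
    return entry


# Hypothetical dataset with a CSV and an Excel version at different URLs.
dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "Placeholder summary of a snow depth dataset.",
    "distribution": [
        data_download("https://example.org/data/snow-depth.csv", "text/csv"),
        data_download(
            "https://example.org/data/snow-depth.xls",
            "application/vnd.ms-excel",
        ),
    ],
}
print(json.dumps(dataset, indent=2))
```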

Provenance and license

You can describe additional information about the publication of the dataset, such as the license, when it was published, its DOI, or a sameAs pointing to a canonical version of the dataset in a different repository. In addition to Dataset properties, add the following properties for datasets that provide provenance and license information.

Properties

identifier URL, Text, or PropertyValue, Recommended

An identifier for the dataset, such as a DOI.

license URL or Text, Recommended

A license under which the dataset is distributed.

sameAs URL, Recommended

A link to a page that provides more information about the same dataset, usually in a different repository.

Tabular datasets

A tabular dataset is one organized primarily in terms of a grid of rows and columns. For pages that embed tabular datasets, you can also create more explicit markup, building on the basic approach described above. At this time we understand a variation of CSVW (“CSV on the Web”, see W3C), provided in parallel to user-oriented tabular content on the HTML page.

Here is an example showing a small table encoded in CSVW JSON-LD format. There are some known errors in the Structured Data Testing Tool (SDTT).
