What Search.gov Indexes From Your Website

Content

When we think about indexing pages for search, we usually think about indexing the primary content of the page. If the page marks that content with a <main> element, the search engine knows where to find it. If the page isn’t structured this way, we collect the contents of the <body> tag and then filter out the <nav> and <footer> elements, if present. If none of <main>, <nav>, or <footer> is present, we collect the full contents of the <body> tag. Learn more in our post about aiming search engines at the content you really want to be searchable, using the <main> element.
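The selection rules above can be sketched in a few lines of Python. This is only an illustration built on the standard library's html.parser, not Search.gov's actual indexing code. It assumes that a <main> element, when present, takes priority over the <body> fallback, and the names ContentExtractor and extract_content are hypothetical.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text from <main>, and separately from <body> minus <nav>/<footer>."""

    FILTERED = {"nav", "footer"}  # elements removed in the <body> fallback

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.main_depth = 0
        self.body_depth = 0
        self.filtered_depth = 0
        self.saw_main = False
        self.main_text = []
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
            self.saw_main = True
        elif tag == "body":
            self.body_depth += 1
        elif tag in self.FILTERED:
            self.filtered_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag == "body" and self.body_depth:
            self.body_depth -= 1
        elif tag in self.FILTERED and self.filtered_depth:
            self.filtered_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.main_depth:
            self.main_text.append(text)
        if self.body_depth and not self.filtered_depth:
            self.body_text.append(text)


def extract_content(html: str) -> str:
    """Return <main> text if the page has one (assumption), else the filtered <body> text."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    parts = parser.main_text if parser.saw_main else parser.body_text
    return " ".join(parts)


with_main = (
    "<html><body><nav>Menu</nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Contact</footer></body></html>"
)
without_main = (
    "<html><body><nav>Menu</nav><div><p>Hello</p></div>"
    "<footer>Contact</footer></body></html>"
)

print(extract_content(with_main))     # Title Body text.
print(extract_content(without_main))  # Hello
```

In both sample pages the navigation and footer text are left out of the result, which is the practical reason to mark up a page with these elements.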

Metadata

Standard metadata elements

title
meta description
meta keywords
locale or language (from the opening <html> tag)
url
lastmod (collected from XML sitemaps)
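
As a rough illustration of where lastmod comes from, the sketch below reads <lastmod> values out of a sitemap with Python's standard xml.etree.ElementTree module. The sitemap namespace is the one defined by the sitemaps.org protocol. The example.gov URLs and the read_lastmod helper are hypothetical, and this is not Search.gov's own code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap: the second entry has no <lastmod>.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.gov/about</loc>
    <lastmod>2024-03-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.gov/contact</loc>
  </url>
</urlset>
""".strip()


def read_lastmod(sitemap_xml: str) -> Dict[str, Optional[str]]:
    """Map each <loc> in the sitemap to its <lastmod> value, or None if absent."""
    root = ET.fromstring(sitemap_xml)
    dates: Dict[str, Optional[str]] = {}
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            dates[loc] = lastmod.strip() if lastmod else None
    return dates


for url, lastmod in read_lastmod(SAMPLE_SITEMAP).items():
    print(url, lastmod)
```

A page that is missing from the sitemap, or listed without <lastmod>, simply has no lastmod value to collect.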

Open Graph protocol elements

og:description
og:title
article:published_time
article:modified_time
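
To check what a page exposes for the HTML-borne elements in the two lists above, a small script along these lines may help. It is a sketch using Python's standard html.parser, not Search.gov's implementation. The sample page and the MetadataParser and extract_metadata names are hypothetical. The url and lastmod elements are not covered here because they come from the request and the XML sitemap rather than from the page markup.

```python
from html.parser import HTMLParser

# <meta> names and properties taken from the lists above.
WANTED = {
    "description",
    "keywords",
    "og:description",
    "og:title",
    "article:published_time",
    "article:modified_time",
}


class MetadataParser(HTMLParser):
    """Collects the title, the html lang attribute, and selected <meta> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.metadata["language"] = attrs["lang"]
        elif tag == "title" and "title" not in self.metadata:
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name=..., Open Graph tags use property=...
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = attrs.get("content")
            if key in WANTED and content is not None:
                self.metadata[key] = content

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.metadata["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


SAMPLE_PAGE = """
<html lang="en">
<head>
<title>Passport Fees</title>
<meta name="description" content="Current fees for passport applications.">
<meta name="keywords" content="passport, fees">
<meta property="og:title" content="Passport Fees">
<meta property="og:description" content="What a passport costs.">
<meta property="article:published_time" content="2024-01-10T09:00:00Z">
<meta property="article:modified_time" content="2024-03-15T14:30:00Z">
<meta name="viewport" content="width=device-width">
</head>
<body></body>
</html>
"""

for key, value in sorted(extract_metadata(SAMPLE_PAGE).items()):
    print(f"{key}: {value}")
```

Tags outside the lists, such as the viewport tag in the sample, are ignored by this sketch.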

File formats

In addition to HTML pages with their various file extensions, Search.gov indexes the following file types:

PDFs
Word docs
Excel docs
TXT

Images can be indexed either through our Flickr integration or by sending us an MRSS feed. Note that images are not indexed during web page indexing, so you’ll need to use one of these two methods.

JavaScript-based content

JavaScript frameworks like Angular and React insert content dynamically on a webpage. These technologies pull structured information from databases into user-friendly webpages. To index this kind of content, we add a processing step that runs all of the page's scripts before we try to index it.

If your site uses JavaScript to insert content on your HTML pages, reach out to our team. We can enable JavaScript indexing on a per-domain basis.
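
If you are not sure whether your pages depend on JavaScript for their content, one rough check is to count the visible words in the HTML as the server delivers it, before any script runs. The sketch below does that with Python's standard html.parser. The 20-word threshold and the likely_needs_javascript helper are arbitrary assumptions for illustration, not a rule Search.gov applies.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts words outside <script>, <style>, and <noscript> elements."""

    IGNORED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._ignored_depth = 0
        self.word_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.IGNORED:
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in self.IGNORED and self._ignored_depth:
            self._ignored_depth -= 1

    def handle_data(self, data):
        if not self._ignored_depth:
            self.word_count += len(data.split())


def likely_needs_javascript(raw_html: str, min_words: int = 20) -> bool:
    """Heuristic: very little text in the raw HTML suggests scripts insert the content."""
    counter = VisibleTextCounter()
    counter.feed(raw_html)
    counter.close()
    return counter.word_count < min_words


# A typical single-page-app shell: an empty container plus a script.
spa_shell = (
    "<html><head><title>App</title></head>"
    '<body><div id="root"></div><script src="/app.js"></script></body></html>'
)

print(likely_needs_javascript(spa_shell))  # True
```

A result of True is only a hint. Viewing the page in a browser with JavaScript turned off is a more direct way to see what is available without the extra processing step.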

For each page indexed with JavaScript enabled, we allow up to 5 seconds for content to load. Because of this step, domains indexed with JavaScript enabled take slightly longer to index.