May 2018 Release Notes
Highlights
- We continue to make good progress towards our indexing system, and continues to be highly focused on the back end of our system. See below for more details.
Chores
- We created back-end interfaces allowing the Search.gov team to manage indexed domains & urls.
- We added a delay method to
SearchGov Domain
, to honor the crawl delay settings in a given site’srobots.txt
file. - We created a
SearchGov Domain Indexer
job that will enqueue urls in need of fetching, to allow bulk indexing tasks to be automated without overloading anyone’s servers, and we added support forresque-scheduler
to our configuration baseline. - We set the sitemap indexer to reject urls from other domains to avoid erroneous attempts to index content from beta sites, old domains, etc.
- We now check the protocol of a domain, and whether the site is responding to us. We also set our url fetcher to throw an error if the domain is unavailable or blocking our indexer.
- We re-indexed the searchgov indices.
- We upgraded mySQL in demo environments, and streamlined the scenario data for our test suite.
Bug Fixes
- We fixed bug that sent searchers back to page 1 results when changing the time scope in a Collection search.
- We mitigated SSL certificate problems with some sites.
- We made our redirection check more strict to avoid filling our database and indexes with domains and web pages that don’t need to be searchable.
June 20, 2018