How to make Exhibit search engine friendly

From SIMILE Widgets

This page is valid for Exhibit 1.0 only and needs to be updated to reflect Exhibit 2.0.

Problem

Exhibit, like other Ajax applications, relies heavily on JavaScript to dynamically generate the displayed records. Unfortunately, such dynamic content is nearly always hidden from search engines, and therefore does not normally get indexed or included in services such as Google. This limitation is a real concern in converting existing indexed content to Exhibit (e.g., lists of publications) or making sure that new content is readily found by major search engines.

Fortunately, there are options to overcome this limitation, some provided directly by Exhibit itself. These options are described below, along with their advantages and disadvantages. As an author you have considerable flexibility, and you can choose the option that best fits your objectives. The Tips and Pointers section offers some advice for making your Exhibit search engine and user "friendly."

You may find that this article from Google addresses your questions too.

Option #1: Generate HTML

This option generates complete HTML using the 'Copy All' selection on a published exhibit.

  1. Make sure your exhibit is using either the tile view or tabular view
  2. Make sure you are displaying all records: scroll to the bottom of your exhibit and select the 'Show all XXX results' link
  3. Click on 'Copy All' at the top of your exhibit and choose 'Generated HTML for this view'. A popup window will then appear with the generated code highlighted
  4. Copy this code to your clipboard as you would copy any text. Press ESC to close the dialog box
  5. Paste the code at the bottom of your exhibit page or to a separate page.

You'll note that the generated code maintains all of the styles of your original exhibit, so display will follow your CSS and presentation preferences.

Pasting the generated code into the original exhibit page ensures that the page indexed by the search engines is the same page that hosts the dynamic exhibit. On the other hand, if none of the content is hidden (see the other options below), pasting on the same page in essence duplicates all content.

Advantages: All content is made visible to search engines; it is fast and simple; it provides a good way to let people without JavaScript view your data.

Disadvantages: If pasted on original exhibit page, may feel duplicative to users; if pasted on a separate static page, its relationship to the original exhibit needs to be made clear and the actual indexed page on the search engine will not point to the dynamic exhibit.

Option #2a: Generate 'Hidden' HTML using <noscript>

This option follows all steps for Option #1, except the generated HTML content is hidden from JavaScript-enabled browsers.

  1. Follow all steps under Option #1
  2. Wrap the generated HTML code between the <noscript> tags:
  <noscript>

     . . . generated HTML goes here

  </noscript>

Advantages: All content is made visible to search engines; this option avoids the duplicated content problem for JavaScript-enabled browsers; content is also displayed for JavaScript-disabled browsers

Disadvantages: In years past, search engines often refused to index pages with significant <noscript> content, since the tag was abused by spammers. It is not clear what the specific rules are today or how large this risk is; both likely vary by site and search engine vendor.

Option #2b: Generate 'Hidden' HTML using "display: none"

This option follows all steps for Option #1, except the generated HTML content is hidden from JavaScript-enabled browsers using a different method from Option #2a.

  1. Follow all steps under Option #1
  2. Wrap the generated HTML code between new <div> tags with the "display: none" style:
  <div style="display: none;">

     . . . generated HTML goes here

  </div>
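
Options #2a and #2b both come down to wrapping the generated HTML and pasting it into the exhibit page, so the manual steps can be scripted. The following is a minimal Python sketch, not part of Exhibit: the function names `wrap_generated_html` and `paste_into_page` are hypothetical, and it assumes you have already saved the 'Copy All' output and your exhibit page as strings.

```python
import re

# Hypothetical helpers for illustration; Exhibit does not provide these.
WRAPPERS = {
    "noscript": ("<noscript>", "</noscript>"),                    # Option #2a
    "display-none": ('<div style="display: none;">', "</div>"),  # Option #2b
}


def wrap_generated_html(generated_html, method):
    """Wrap the 'Copy All' output in the tags used by Option #2a or #2b."""
    open_tag, close_tag = WRAPPERS[method]
    return f"{open_tag}\n{generated_html.strip()}\n{close_tag}\n"


def paste_into_page(page_html, generated_html, method):
    """Insert the wrapped block just before the last </body> tag.

    If the page has no </body> tag, the block is appended at the end.
    """
    block = wrap_generated_html(generated_html, method)
    closers = list(re.finditer(r"</body\s*>", page_html, flags=re.IGNORECASE))
    if not closers:
        return page_html + "\n" + block
    cut = closers[-1].start()
    return page_html[:cut] + block + page_html[cut:]


if __name__ == "__main__":
    page = "<html><body><div id='exhibit'></div></body></html>"
    generated = "<table><tr><td>James Madison</td></tr></table>"
    print(paste_into_page(page, generated, "noscript"))
```

Re-running such a script whenever the exhibit data changes keeps the static copy from drifting out of date, but it does nothing to reduce the blacklisting risk discussed for these options.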

Advantages: All content is made visible to search engines; this option avoids the duplicated content problem for JavaScript-enabled browsers; it may have lower risk for search engine blacklisting than Option #2a

Disadvantages: Content remains hidden to JavaScript-disabled browsers; it is not clear what the specific rules are or how large the risk of search engine blacklisting is.

Option #3: Generate HTML with 'Hidden' Fields

This alternative picks up on the same "display: none" style approach in Option #2b, but does so on a selective basis. This enables you to determine which fields to hide or not, and to also change the display layout.

  1. Follow all steps under Option #1
  2. Inspect the generated HTML (which is highly patterned) and, for each field you want to hide, do a global search and replace that inserts style="display: none;" into the appropriate <div> or <span> tags
  3. Continue for all specific fields
  4. Make any other general layout changes to the generated HTML.
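
The search-and-replace in step 2 can also be scripted. The sketch below is one possible approach in Python, under the assumption that the fields in your generated HTML can be told apart by a class attribute on their <div> or <span> tags; check your own generated code first, since that may not hold. The function name `hide_fields` and the class names in the usage example are hypothetical. A regular expression is adequate here only because the generated HTML is so regular; it is not a general HTML parser.

```python
import re

# Matches opening <div ...> and <span ...> tags and captures their attributes.
TAG_PATTERN = re.compile(r"<(div|span)\b([^>]*)>", re.IGNORECASE)
CLASS_PATTERN = re.compile(r'class\s*=\s*"([^"]*)"', re.IGNORECASE)


def hide_fields(generated_html, class_names):
    """Add style="display: none;" to <div>/<span> tags with a listed class."""
    wanted = set(class_names)

    def add_style(match):
        tag, attrs = match.group(1), match.group(2)
        class_match = CLASS_PATTERN.search(attrs)
        if class_match is None:
            return match.group(0)
        if not wanted & set(class_match.group(1).split()):
            return match.group(0)
        # Leave self-closing tags and tags with an existing inline style alone
        # rather than risk producing broken markup.
        if attrs.rstrip().endswith("/") or "style=" in attrs.lower():
            return match.group(0)
        return f'<{tag}{attrs} style="display: none;">'

    return TAG_PATTERN.sub(add_style, generated_html)


if __name__ == "__main__":
    sample = '<div class="item"><span class="label">Title</span><span class="isbn">123</span></div>'
    print(hide_fields(sample, ["isbn"]))
```

Tags that already carry an inline style are skipped deliberately, so those still need a manual edit.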

Advantages: All content is made visible to search engines; poses arguably the lowest risk of being blacklisted by search engines

Disadvantages: Takes HTML knowledge, manipulation and time.

Option #4: Generate a TSV File

If you intend to make fairly large-scale changes to the static display of your exhibit content, particularly in a tabular listing format, and are not comfortable making global search and replaces and HTML modifications, you may want to do most of your changes in a spreadsheet.

  1. Make sure your exhibit is using either the tile view or tabular view
  2. Make sure you are displaying all records: scroll to the bottom of your exhibit and select the 'Show all XXX results' link
  3. Click on 'Copy All' at the top of your exhibit and choose 'Tab Separated Values'. A popup window will then appear with the generated code highlighted
  4. Copy this code to your clipboard as you would copy any text. Press ESC to close the dialog box
  5. Paste the code into a new text file
  6. Open the file in your spreadsheet program as a text file, and choose tab delimited
  7. Delete columns until only the content you want to publish remains
  8. Reformat the code into an HTML table or use <div>s.
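
If you prefer a script to a spreadsheet, steps 6 through 8 can be approximated in a few lines of Python. This is only a sketch: the function name `tsv_to_html_table` is hypothetical, and it assumes the first line of the copied TSV holds the column names, which you should verify against your own export.

```python
import csv
import html
import io


def tsv_to_html_table(tsv_text, keep_columns=None):
    """Convert tab-separated text to a plain HTML table.

    Assumes the first row contains column names. keep_columns, if given,
    lists the names of the columns to retain, in output order.
    """
    # QUOTE_NONE treats quote characters as ordinary text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    if not rows:
        return "<table>\n</table>"
    header, body = rows[0], rows[1:]
    if keep_columns is None:
        indexes = list(range(len(header)))
    else:
        indexes = [header.index(name) for name in keep_columns]

    def render(row, tag):
        cells = []
        for i in indexes:
            value = row[i] if i < len(row) else ""  # tolerate short rows
            cells.append(f"<{tag}>{html.escape(value)}</{tag}>")
        return "  <tr>" + "".join(cells) + "</tr>"

    lines = ["<table>", render(header, "th")]
    lines.extend(render(row, "td") for row in body)
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    tsv = "label\tterm\tparty\nJames Madison\t4\tD-R\n"
    print(tsv_to_html_table(tsv, keep_columns=["label", "party"]))
```

Cell values are passed through html.escape so that characters such as & and < in your data do not break the table markup.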

Advantages: This option is fast for large-scale changes; it avoids the poor HTML/XML generation of native spreadsheet programs

Disadvantages: Not all content is retained for indexing; requires knowledge of HTML; needs to be re-generated whenever the underlying exhibit changes.

Option #5: A Second Copy of the Original Data

Of course, you can always make a second version of your original data solely for indexing purposes.

Advantages: Does expose some content for search engine indexing

Disadvantages: Most labor-intensive option; needs to be re-generated whenever the exhibit changes; some spreadsheet programs generate messy HTML/XML.

Tips and Pointers

  • Your data may not be as self-describing as you'd like; in those instances, make sure you provide an intro section in standard HTML describing the datasets and your exhibit display
  • If you generate HTML, you will likely get some HTML fragments at the top and bottom of the generated code that should be manually edited out
  • Do not try to generate HTML from a map view or a timeline view -- it won't give you anything useful
  • You may want to consider a simple table listing as a kind of a 'Table of Contents' for your separate, indexed, static version
  • You can check the crawl status of a Web site on Google by going to: https://www.google.com/webmasters/tools/sitestatus. That screen will also take you to a series of Webmaster options provided by Google (more if you can verify you are the site owner).

Empirical Googling

Google does seem to index some JavaScript files, but not all. For example, the search:

site:simile-widgets.org filetype:js

seems to indicate that many of the JavaScript libraries are indexed (you may need to click on the "repeat the search with the omitted results included" link to see all that have been indexed), but not JSON files with a .js extension.

This can be confirmed with the search:

"James Madison" site:simile-widgets.org filetype:js

which yields no results ("James Madison" is contained in the presidents.js JSON data file).

However, the search:

"James Madison" site:simile-widgets.org

yields as its top hit a wiki page containing a plain HTML table representation of the data in the presidents.js JSON file. High in the results is an SVN commit log message which contains details of changes to the presidents.js JSON file.

From this, it would seem that:

a) Google does not reliably index every file with extension .js that it finds - it seems to selectively index only some JavaScript libraries.

b) If the contents of a JSON file appear as plain text (as in the SVN commit log message), then Google happily indexes them.

Also, many of the solutions suggested above muddy or destroy the very clean separation between data content (in the JSON file/s) and presentation (in the Exhibit HTML page and the Exhibit JavaScript libraries that it calls).

I wonder whether Google would index the JSON data files if they happened to have a different file extension? What if the JSON data files were given .html extensions instead, and had the form:

   <html>
     ... JavaScript which redirects to, and plain HTML which links to, the Exhibit or meta-Exhibit which uses the JSON data in this file ...
   <body>
     <pre tag="Exhibit JSON data">
       ... Exhibit JSON data ...
     </pre>
   </body>
   </html>

A meta-Exhibit is just an intermediate page which lists the Exhibits that rely on the JSON data in the indexed page. This may be needed because more than one Exhibit can use a single JSON data file, and vice versa: there is a many-to-many relationship between Exhibit pages and JSON data files.
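
To make the idea concrete, here is a rough Python sketch that produces such an .html wrapper from a JSON data file. It is untested against any search engine and says nothing about whether Exhibit can load data from a file in this form. The function name `json_as_indexable_html`, the page title, and the URL in the usage example are all hypothetical, and the JavaScript redirect is left out so that the page carries only plain links.

```python
import html
import json


def json_as_indexable_html(data, exhibit_urls, title):
    """Wrap JSON data in a <pre> element inside a minimal HTML page.

    exhibit_urls lists the Exhibit (or meta-Exhibit) pages that use the data;
    each is rendered as a plain link that crawlers and readers can follow.
    """
    links = "\n".join(
        f'      <li><a href="{html.escape(url, quote=True)}">{html.escape(url)}</a></li>'
        for url in exhibit_urls
    )
    # Escape &, < and > so the JSON text cannot be mistaken for markup.
    payload = html.escape(json.dumps(data, indent=2, ensure_ascii=False), quote=False)
    return (
        "<html>\n"
        f"  <head><title>{html.escape(title)}</title></head>\n"
        "  <body>\n"
        "    <ul>\n"
        f"{links}\n"
        "    </ul>\n"
        f"    <pre>\n{payload}\n    </pre>\n"
        "  </body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    data = {"items": [{"label": "James Madison", "type": "President"}]}
    print(json_as_indexable_html(data, ["presidents.html"], "Presidents data"))
```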

The other advantage of this is that Google is then indexing the contents of a structured JSON file, not a semantically-diluted HTML representation of it. At some future time Google itself might be able to make use of the semantic information in the JSON data. This would seem to accord with the (laudable and noble) "hidden agenda" behind the Exhibit project.

Other Ideas?

The issue of making Exhibit "visible" to search engines has received much developer and user attention. Some of the 'Copy All' options noted above were added specifically for this purpose. Additional ideas will likely be tried out in the future.

If you have an alternative suggestion, please submit it after you are registered here.
