
Do search engines index dynamic URLs or URLs with CGI parameters? (was: Giftgallery)

From: Alexis D. Gutzman <alexisg_at_marketingsherpa.com>
Date: Fri 23 Aug 2002 22:24:37 -0500

[This is a continuation of the Giftgallery post. Since the topic has
changed, I renamed it. Sorry to quote a previous thread at such length, but
this is getting *more confusing*, not less. It's not really all that
complicated.]

> "Alexis D. Gutzman" <alexisg_at_marketingsherpa.com> wrote:
> > However, Google is the only search engine right now that DOES deep crawl
> > sites. The reason Inktomi and others have paid-submission programs is that
> > they will never get to the SECOND level of dynamic content.
>
> Alexis, I'm not sure what you mean by second level. Can you clarify or
> point me to somewhere where I can familiarize myself with what that
> means?

Happy to clarify (with an example).

Retailer's home page (www.someURL.com)
==> links to category pages (www.someURL.com/cat.cfm?cat=7)
     ==> links to department pages (www.someURL.com/dept.cfm?dept=9)
         ==> links to product pages (www.someURL.com/prod.cfm?prod=191)

In this example, *most* search engines will spider the home page and the
category pages, but since the category pages have CGI parameters (the stuff
after the question mark), they won't follow any links from the category
pages. In this scenario (the most common one for online retailers), *most*
search engines will never get to what your customers are probably most
interested in and what search engines would rank you most highly for: your
product pages. Teoma is an exception. They truly will not even touch a site
with visible CGI parameters. Since they have the newest offering, this is
obviously not a technical barrier but a management decision.

There are a variety of reasons given by these search engines, including the
fact that the content might not be stable and it's possible for site owners
who create their own link farms to waste a lot of search engine processing
having them chase down internal links. I think that's bunk and the
programmers were too lazy to do it right (in the beginning). It's not
particularly complicated to have the spider software check this kind of
thing, but there is certainly more money to be made by saying "we can't
spider content with CGI parameters, so you'll have to pay for submission."
(See Teoma above)

FYI, Lycos -- back when they did their own crawling, when Inktomi provided
search results to Yahoo because Google didn't yet exist -- used to deep
crawl all sites. They had been doing it since the get-go.

> Do you mean that Google will index pages whose URLs contain query strings,
> but will not follow a link containing a query string from a page whose URL
> contains a query string? Or is my question so confusing everyone's head
> will now explode? I think I may understand what you meant by second level
> now.

YES.

> Though you said dynamic content, I think you really meant pages with
> URLs that Google associates with dynamic content which AFAIK means
> the URL contains a query string.

Yes. That's an important clarification. They don't discriminate against
*dynamic pages* per se (any page that ends in .ASP, .php, .CFM, etc.) --
consequently, it doesn't matter whether you rig your server to deliver .php
pages as .html. What *most* search engines do is to ignore any pages with
CGI parameters (or query strings -- the stuff after the question mark) that
*ARE LINKED TO FROM PAGES WITH CGI PARAMETERS*.
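
To make that rule concrete, here is a minimal sketch in Python of the
link-following decision as I understand it. This is an illustration only, not
any search engine's actual code; the function names are hypothetical, and the
URLs are the made-up retailer URLs from the example above.

```python
from urllib.parse import urlsplit


def has_query_string(url: str) -> bool:
    """True when the URL carries CGI parameters (the stuff after the '?')."""
    return bool(urlsplit(url).query)


def would_follow(source_url: str, target_url: str) -> bool:
    """Hypothetical crawler rule: skip a link only when BOTH the linking
    page and the linked page have query strings."""
    return not (has_query_string(source_url) and has_query_string(target_url))


if __name__ == "__main__":
    home = "http://www.someURL.com/"
    category = "http://www.someURL.com/cat.cfm?cat=7"
    department = "http://www.someURL.com/dept.cfm?dept=9"
    print(would_follow(home, category))        # home page -> category page
    print(would_follow(category, department))  # category -> department page
```

Under this rule the category page is reached from the home page, but the
department and product pages behind it are never reached.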

Just for background, there are four ways (I can think of off the top of my
head) to get around the "most search engines don't index URLs with CGI
parameters" problem:

1. Build a static site. Bad for a lot of reasons, but if your content
doesn't change too much, it's doable.
2. Build a dynamic site then use software that automatically generates a
static site from your dynamic site at some set time of day/week. This can be
time-consuming (even the automated process), so it doesn't scale very well
to, say, a large retail site. Also, you miss mid-day updates to the static
site.
3. Build a series of static doorway pages to permit the search engines
that don't do CGI parameters to find the content. The serious problem with
this is that all you've done is gotten the visitors to the doorway page, not
to the search-specific content. If you've seen any stats about home-page
abandonment rates, you know that you really need to get them to the
search-specific page. Since doorway pages are typically not updated
regularly (heck, are frequently *never* updated), you haven't really solved
the problem because you could well be showing visitors an out-of-date static
page with (for example) August's price on a Nintendo Gameboy, instead of
November's. The alternative, of course, is to use the doorway page as a
redirect page, meaning that search engines would see the doorway page, but
human visitors would see the site's dynamic page. This is called cloaking
and will get your site banned from Google and others (yes, I know, strong
words, but I asked the search engines themselves this in May and this is
what they told me), or if you're lucky, just penalized 100 positions or so,
so that no one finds you.
4. Disguise the fact that the site is generated dynamically by hiding the
CGI parameters in the URL. We do this. Go to
http://www.sherpastore.com/sample.cfm/1759. This is also how Amazon does it.

BN.com uses CGI parameters. Do you think none of their stuff, other than
things directly linked from the home page, is indexed?

Go to Google, paste this (without the quotes) into the search box: "history
site:bn.com". For this example, I just want you to look at a list of all the
items at BN.com (a site that is both dynamic and relies on visible CGI
parameters) that are related to history. Convinced?

Google is the one search engine that doesn't get hung up on CGI parameters.
Fortunately, Google is delivering the bulk of search results right now. Last
time I saw (April?) the rankings were: #1 Yahoo, #2 MSN, #3 Google, #4 AOL,
#5 AskJeeves. Only the first 4 have more than 10% reach. Everything after
AskJeeves is marginally relevant. Yahoo, Google, and AOL all use Google for
search results. MSN takes results from a variety of places. We don't pay for
inclusion anywhere, and we're on the first page of results everywhere except
MSN.

The best way to deal with the other search engines is either to pay for
inclusion (through PositionPro or another authorized partner) or to revamp
how your site displays the CGI parameters (option #4 above). Amazon buries
the CGI parameters in the URL. We do the same thing at our store. It's no
big deal to have your pages look for the CGI parameter after the / instead
of after the ?. You do have to modify your server slightly, though. Your
techies will probably say something like, "oh sure, we can do it that way;
we almost did it that way in the first place, but we didn't think you'd
care."
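
For the techies, a minimal sketch of the page-side half in Python: pulling
the parameter out from after the slash instead of from the query string.
The function name and the ".cfm" default are assumptions for illustration;
many servers already hand you this trailing portion (CGI calls it PATH_INFO),
in which case you would read that instead of parsing the path yourself.

```python
from typing import Optional, Tuple


def split_path_parameter(path: str, script_suffix: str = ".cfm") -> Tuple[str, Optional[str]]:
    """Split '/sample.cfm/1759' into ('/sample.cfm', '1759').

    Returns (path, None) when nothing follows the script name.
    """
    marker = script_suffix + "/"
    index = path.find(marker)
    if index == -1:
        return path, None
    script = path[: index + len(script_suffix)]
    parameter = path[index + len(marker):]
    return script, parameter or None


if __name__ == "__main__":
    script, product_id = split_path_parameter("/sample.cfm/1759")
    print(script, product_id)
```

The page then looks up the product with that value exactly as it would have
with the value taken from the query string.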

I hope this answers it all. Now that you know how it works from *their
side*, you still need to know how to optimize your own
site/content/keywords. I'll let those who do it all day talk about that.

Long-windedly yours,
Alexis
---
Alexis D. Gutzman, Managing Editor
MarketingSherpa's Knowledge Store
http://sherpastore.com <- Email Marketing Metrics Guide now on sale!
---
There's a reason my first book was over 900 pages :(.




Received on Fri Aug 23 2002 - 22:24:37 CDT


Copyright 1996-2007 The Online Advertising Discussion List, a division of ADASTRO Incorporated.
All Rights Reserved.