Google crawls billions of pages every day, but only a fraction make the index. Crawling is discovery; indexing is storage and eligibility. If you don't understand the gap, your diagnostics will miss the real bottleneck.
Most SEOs use 'crawl' and 'index' interchangeably. That is a dangerous shortcut. Crawling is the act of a bot — usually Googlebot — sending an HTTP request to a URL, downloading the response, and extracting links. Indexing is the process of analyzing that fetched content, understanding its relevance, and storing it in a database so it can appear in search results. You can crawl a URL a hundred times and never index it. You can also have an indexed URL that Google has not crawled in weeks, because a prior crawl provided enough signal.
In practice, when you check the URL Inspection tool, you see two separate statuses: last crawl date and indexing status. If the URL inspection says 'URL is not on Google' but 'Last crawl: 3 hours ago', you have an indexing failure, not a crawl failure. That distinction changes where you look — server logs versus content quality.
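The triage described above can be sketched in a few lines of Python. This is a minimal illustration, not a Search Console client: `InspectionResult`, `triage`, and the 90-day staleness threshold are hypothetical names and values, and the two fields are assumed to be copied by hand (or from an export) out of the URL Inspection tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class InspectionResult:
    """Hypothetical record of the two statuses shown by URL Inspection."""
    url: str
    on_google: bool                 # "URL is on Google" vs "URL is not on Google"
    last_crawl: Optional[datetime]  # None if Google has never fetched the URL


def triage(result: InspectionResult, now: datetime, stale_after_days: int = 90) -> str:
    """Return which side of the pipeline to investigate first."""
    if result.last_crawl is None:
        return "crawl"          # never fetched: check logs, robots.txt, internal links
    if not result.on_google:
        return "indexing"       # fetched but not indexed: check content and directives
    if now - result.last_crawl > timedelta(days=stale_after_days):
        return "indexed-stale"  # indexed, but the stored copy may be outdated
    return "ok"


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
recent = InspectionResult("https://example.com/a", False, now - timedelta(hours=3))
print(triage(recent, now))  # indexing
```

A URL crawled three hours ago that is still not on Google lands in the `indexing` bucket, which is the case where content quality deserves attention before server logs.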
Discovery comes from a sitemap, internal links, backlinks, or manual submission. No discovery means no crawl.
Googlebot queues the URL. Crawl budget limits how many URLs it fetches in a given period, and server response time affects that rate.
The HTML is parsed, JavaScript is rendered (if budget allows), links are extracted, and the content is stored temporarily.
Content quality, uniqueness, and relevance are evaluated. Thin or duplicate pages get the 'Crawled - currently not indexed' status.
Indexed pages become eligible for ranking. Indexation can take days or weeks for large sites.
| Dimension | Crawling | Indexing | Failure Mode & Risk |
|---|---|---|---|
| Definition | HTTP request + response download | Content analysis + storage in search db | Misdiagnosis: fixing server speed when content is thin wastes weeks |
| Primary Signal | Server logs, crawl stats, response codes | Index coverage report, URL inspection status | Hidden gap: 200 OK pages that never index due to low value |
| Common Block | robots.txt disallow, 5xx errors, slow TTFB | Noindex meta tag, canonical mismatch, low quality score | Double penalty: blocking crawl also blocks index assessment |
| Budget Limit | Crawl rate per host (e.g., 10 req/s for a small site) | Indexing quota per property (soft limit, ~few K URLs/day for new sites) | Scale risk: 100K URLs discovered but only 500 indexed because of thin content |
| Tool to Check | Google Search Console Crawl Stats, log file analyzers | Google Index checker API tools for bulk validation | Vendor lock-in: some index checkers cap at 10K URLs per run |
| Debug Priority | Slow server response time, blocked by robots.txt, 4xx/5xx errors | Duplicate content, thin pages, noindex tags, orphaned URLs | Wrong filter: checking only 2xx URLs misses 301 redirects that never index |
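The "Common Block" row lists robots.txt disallow rules as a crawl-side blocker. A quick way to test a rule set is Python's standard `urllib.robotparser`. The sketch below is only an approximation: `blocked_urls` and the sample rules are hypothetical, and Python's parser applies the first matching rule while Google applies the most specific one, so results can differ when `Allow` and `Disallow` rules overlap.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


# Hypothetical rule set that blocks a static assets directory.
robots = """\
User-agent: *
Disallow: /assets/
"""

urls = [
    "https://example.com/assets/app.js",
    "https://example.com/assets/site.css",
    "https://example.com/blog/post",
]

print(blocked_urls(robots, urls))
```

Here the two asset URLs come back as blocked. If those files are needed to render the page, Googlebot sees far less content than a browser does.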
A common situation we see in SEO audits is a site with 50,000 indexed pages and 200,000 crawled-but-not-indexed pages. The client panics. They think Google is ignoring them. In reality, most of those 200,000 are thin category filters, paginated pages with no unique content, or auto-generated parameter URLs. Crawling is cheap. Indexing is expensive. Google must decide: does this page add value for searchers?
Edge case: a client once had 15,000 blog posts with 200-word AI-generated content. Every single one was crawled within 48 hours. Exactly 47 were indexed. The rest were marked 'Crawled - currently not indexed'. No amount of crawl budget optimization would help. The fix was content consolidation — merging 15,000 URLs into 2,000 substantive guides. Index coverage jumped to 95% in six weeks.
Site size: 120,000 URLs in sitemap. Indexed: 32,000 according to GSC. Crawled but not indexed: 68,000. Not crawled (discovered): 20,000.
Step 1: Export the 'Crawled - currently not indexed' list from GSC. Filter for URLs with fewer than 300 words of visible text. Result: 52,000 URLs below the threshold. Those are the primary candidates for noindex or consolidation.
Step 2: Use a bulk Google index checker to verify the index status of 10,000 random non-indexed URLs. Cross-check with a custom crawl that reads the meta robots and canonical tags. This surfaced 3,400 URLs with a self-referencing canonical but a 'noindex' directive, a direct configuration error.
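Step 3: Fix the 3,400 canonical/noindex conflicts. Submit the 52,000 thin URLs for removal via the Removals Tool after implementing a proper noindex on the server side; the tool only hides URLs temporarily, so the noindex has to stay in place. Within 30 days, the indexed count rose to 38,000. The remaining 10,000 non-indexed URLs were paginated pages with duplicate product grids. Those required a rel=next/prev restructure, though Google has said it no longer uses those attributes as an indexing signal, so the pagination itself needs distinct content and clear internal links.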
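The canonical/noindex conflict from Step 2 is straightforward to detect in a custom crawl. The following is a minimal sketch using only the standard library; `HeadSignals` and `has_conflict` are hypothetical names, the HTML is assumed to be already fetched, and an `X-Robots-Tag` response header (which can also carry noindex) is not inspected here.

```python
from html.parser import HTMLParser
from typing import Optional


class HeadSignals(HTMLParser):
    """Collects the canonical URL and any noindex directive from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href")
        elif tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in a.get("content", "").lower():
                self.noindex = True


def has_conflict(url: str, html: str) -> bool:
    """True when a page declares itself canonical and is also marked noindex."""
    signals = HeadSignals()
    signals.feed(html)
    self_canonical = (
        signals.canonical is not None
        and signals.canonical.rstrip("/") == url.rstrip("/")
    )
    return signals.noindex and self_canonical


page = """
<html><head>
  <link rel="canonical" href="https://example.com/guide/">
  <meta name="robots" content="noindex, follow">
</head><body>...</body></html>
"""
print(has_conflict("https://example.com/guide", page))  # True
```

The trailing-slash normalization is deliberately crude. A production crawl would also need to resolve relative canonical URLs and compare scheme and host consistently.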
Check server logs for crawl frequency vs response codes. 5xx errors block indexing entirely.
Run GSC Index Coverage report. Look for 'Crawled - currently not indexed' vs 'Discovered - currently not indexed'.
Use URL Inspection on a sample of non-indexed URLs. Note the exact reason: 'Page with redirect', 'Soft 404', 'Excluded by noindex tag'.
Verify robots.txt does not block JS/CSS files. Google cannot render content without them, leading to empty pages that fail indexing.
Audit internal linking. Orphaned pages (no internal links) can be crawled via sitemap but often get low priority for indexing.
Check canonical tags. A page with a canonical pointing elsewhere will not be indexed as the primary version.
Measure content word count and uniqueness. Pages under 300 words that closely resemble other pages on the site (for example, a high TF-IDF cosine similarity to a sibling page) are prime candidates for non-indexation.
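The last check in the list above, word count and uniqueness, can be approximated without external libraries. This is a rough sketch under stated assumptions: the visible text for each URL has already been extracted, the 0.8 similarity cutoff is an arbitrary illustration (only the 300-word figure comes from the checklist), and all function names are hypothetical. The pairwise comparison is quadratic, so it suits samples rather than a full 100K-URL site.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build one sparse TF-IDF vector per URL, using a smoothed IDF."""
    tokens = {url: tokenize(text) for url, text in docs.items()}
    n = len(docs)
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for url, toks in tokens.items():
        tf = Counter(toks)
        vectors[url] = {
            term: (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_thin_or_duplicate(docs, min_words=300, max_similarity=0.8):
    """Return {url: (word_count, closest_similarity)} for pages worth reviewing."""
    vectors = tfidf_vectors(docs)
    flagged = {}
    for url, text in docs.items():
        words = len(tokenize(text))
        closest = max(
            (cosine(vectors[url], v) for other, v in vectors.items() if other != url),
            default=0.0,
        )
        if words < min_words or closest > max_similarity:
            flagged[url] = (words, round(closest, 2))
    return flagged


docs = {
    "/shoes?color=red": "red shoes size ten buy now",
    "/shoes?colour=red": "red shoes size ten buy now",
    "/guide": "guide to crawl budget server logs and rendering pipeline",
}
# min_words is lowered only so the tiny sample texts are not all flagged as thin.
print(sorted(flag_thin_or_duplicate(docs, min_words=5)))
```

The two parameter variants are flagged because their text is identical, while the guide passes both checks.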
Google caches crawled content but queues the indexing decision separately. Common causes: thin content, duplicate content, low page authority, or a slow indexing pipeline for new domains. Check the URL Inspection tool for the specific 'Crawled - currently not indexed' reason. If the content is unique and substantial, request indexing via the tool.
Export your URL list (up to 100K rows) and use a <a href="https://medium.com/@alexa.sam2026/mass-verification-without-gsc-how-a-bulk-google-index-checker-handles-100-000-urls-9ca89519c1d3">bulk Google index checker</a> that simulates the 'site:' search operator or runs a browser-based check. Many tools cap at 10K URLs per run; look for one that handles 100K in under 15 minutes. Cross-reference the output with your GSC report to find discrepancies.
Agencies often misread 'Discovered - currently not indexed' as a content-quality problem. It is a crawl-side issue: Google knows the URL exists but has not crawled it yet, so the page has not been evaluated at all. Another mistake: using only GSC data without log analysis. Server logs reveal crawl frequency, response times, and redirect chains that GSC does not show. A third error: not filtering out parameter URLs before analyzing index coverage.
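As a starting point for the log analysis mentioned above, the sketch below counts Googlebot requests by status class. It assumes access logs in the common "combined" format; the regular expression, `googlebot_status_counts`, and the sample lines are illustrative. The user-agent string can be spoofed, so a real audit would also verify the requesting IPs (for example with a reverse DNS lookup).

```python
import re
from collections import Counter
from typing import Iterable

# Matches the request, status, size, referrer, and user-agent fields of a combined-format line.
LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(lines: Iterable[str]) -> Counter:
    """Count requests whose user agent mentions Googlebot, grouped as 2xx/3xx/4xx/5xx."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("status")[0] + "xx"] += 1
    return counts


sample = [
    '66.249.66.1 - - [10/Jan/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Jan/2024:10:00:05 +0000] "GET /b HTTP/1.1" 503 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Jan/2024:10:00:09 +0000] "GET /a HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

print(dict(googlebot_status_counts(sample)))  # {'2xx': 1, '5xx': 1}
```

A rising share of 5xx or 3xx responses in this breakdown points to a crawl-side problem that the GSC coverage report alone would not explain.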
In practice, no. An indexed page has almost always been crawled at least once; the narrow exception is a URL blocked by robots.txt, which Google can index from links alone without seeing its content. Once a page is indexed, Google may not recrawl it for weeks or months. You can have an indexed page that has not been crawled in 90 days, and that is normal. The page stays in the index until a recrawl triggers a reassessment. For fresh content, request a recrawl via URL Inspection.
Crawl budget limits how many URLs Googlebot requests in a given period. On a site with 500K URLs, if the budget is 50 crawls per day, it takes 10,000 days to crawl everything once. Indexing then adds another delay. Prioritize high-value pages (category, product detail) in your sitemap. Keep in mind that noindex alone does not save crawl budget, because Googlebot must still fetch a page to see the tag. To conserve budget for index-worthy URLs, stop linking to thin filter pages or block them in robots.txt.
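The arithmetic behind that estimate is simple enough to keep in a helper. This is a back-of-the-envelope sketch: `days_to_full_crawl` is a hypothetical name, and it assumes a constant daily crawl rate with no recrawls, which real crawl behavior does not guarantee.

```python
import math


def days_to_full_crawl(total_urls: int, crawls_per_day: int) -> int:
    """Days needed to fetch every URL once at a fixed daily crawl rate."""
    if crawls_per_day <= 0:
        raise ValueError("crawls_per_day must be positive")
    return math.ceil(total_urls / crawls_per_day)


# The example from the text: 500K URLs at 50 crawls per day.
print(days_to_full_crawl(500_000, 50))  # 10000
```

Running the same function on the URL count left after pruning shows how much removing low-value URLs shortens the cycle.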
Return a 410 (Gone) status instead of 404. Google treats 410 as a stronger signal to remove from the index quickly, often within 24-48 hours. Also, remove internal links pointing to that URL. If you cannot change the server response, use the Google Removals Tool. For bulk removals, combine the server-side 410 with a prefix-based removal request in GSC's Removals Tool; the Indexing API is not meant for this on general web pages.
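How the 410 is configured depends on the stack. As a neutral illustration, here is a minimal WSGI application that answers 410 for a set of retired paths. `RETIRED_PATHS` and `app` are hypothetical names, and a real site would more likely set this in the web server or CMS configuration.

```python
# Hypothetical list of paths that have been permanently removed.
RETIRED_PATHS = {"/old-product", "/discontinued-guide"}


def app(environ, start_response):
    """Minimal WSGI app: 410 Gone for retired paths, 200 OK otherwise."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path in RETIRED_PATHS:
        start_response("410 Gone", headers)
        return [b"This page has been permanently removed.\n"]
    start_response("200 OK", headers)
    return [b"OK\n"]
```

For a local check, the app can be served with `wsgiref.simple_server.make_server("", 8000, app)` and requested with any HTTP client to confirm the status line.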
The Indexing API is designed for job posting and live stream URLs, not general web pages. For bulk validation of standard URLs, use a third-party <a href="https://medium.com/@alexa.sam2026/the-pragmatic-index-checker-tool-for-seo-agencies-4a92f9722c5d">Google Index checker</a> that simulates the 'site:' search operator or uses a headless browser. The API has a daily quota (200 URLs per day for most projects), so it is not practical for 100K checks.
Google considers a page a soft 404 when the content is too thin, irrelevant, or provides no value relative to the query. A product page out of stock with 'This item is no longer available' and no alternatives is a classic example. Fix: redirect to a relevant category page or add substantial related content. A 200 status means the server delivered a page; Google decides if that page is worthy of the index.
Day 1: Submit a clean sitemap with only 5K high-priority pages. Day 3: Monitor crawl stats and server errors. Fix any 5xx issues immediately. Day 7: Submit the full sitemap. Use a bulk index checker after 14 days to compare crawled vs indexed counts. Expect only 30-40% to index in the first month. Consolidate thin pages before the second crawl wave.