Every crawl error is a leak in your indexation pipeline. This guide walks you through the exact diagnosis and fix steps for 404s, soft 404s, DNS failures, and server errors. No fluff, just the tactics that work.
Googlebot is polite but not patient. When it hits a 404, a soft 404, a DNS timeout, or a 5xx server error on a URL it expects to find, it doesn't just move on. It flags that URL as problematic. Enough flags, and the entire section or domain loses crawl budget and trust. The core bottleneck is not error volume. It is the pattern behind the errors. A single misconfigured wildcard pattern in your robots.txt (robots.txt supports `*` and `$` wildcards, not full regular expressions) can block 50,000 product pages from being crawled. A cheap DNS provider that times out at peak hours can kill 600 URLs in one crawl cycle. We treat each error type as a distinct failure mode, and we fix them differently.
In practice, when you open Google Search Console's 'Pages' report, the raw numbers lie. You might see 1,200 '404 not found' entries, but only 80 of those are actual broken internal links. The rest are often crawled URL parameters, pagination copies, or old AMP variants. A common situation we see: a site mails a newsletter with a tracking parameter appended to a URL that was already 301-redirected. GSC logs that as a 'crawl error' on the final URL. The real fix is not a redirect change; it is removing the bad newsletter URLs from the sitemap and neutralizing the tracking parameter with a canonical tag or a robots.txt rule (GSC's own URL Parameters tool was retired in 2022). That is the kind of nuance this guide exists for.
| Error Type | Likely Root Cause | First Action (within 1 hour) | Hidden Failure Mode |
|---|---|---|---|
| 404 (Not Found): internal link points to a deleted page | Content removed without redirect; old sitemap entries; broken navigation links | Run a Screaming Frog crawl on the 404 list. For high-value URLs: place a 301 redirect to the closest topical equivalent. For low-value URLs: let them 404 but remove from sitemap. | Redirecting all 404s to the homepage causes soft-404 signals. Google sees a mismatch between the requested content and the landing page. Only redirect when the replacement page covers the same topic. |
| Soft 404: page returns 200 but content is empty or thin | Search results with no results; category pages with zero products; paginated pages with only one item; login-walled pages returning 200 to Googlebot | Check the actual HTTP response header for the URL. If content is truly empty, change the response to 404 or 410. If it is a search page, add noindex and remove from sitemap. | Ecommerce sites often have 50,000+ soft 404s from 'no results' search pages. Fixing the template to return 404 instead of 200 can instantly clear 90% of these errors. But watch out: if Google expects those URLs from a sitemap, the 404 will create new errors. Remove them from the sitemap first. |
| DNS error: Googlebot could not resolve the hostname | DNS provider outage; misconfigured A/AAAA records; TTL too high (caching stale records); CDN origin IP changed | Check DNS propagation with a tool like DNSChecker. Verify your nameserver is responding: `dig example.com NS`. If using a CDN, check the origin server IP is correct in your CDN dashboard. | A single DNS failure during a Googlebot crawl cycle can cause up to 15% crawl drop for the next 24 hours, even after the DNS is fixed. Reason: Google caches DNS failures for a few hours. To accelerate recovery, resubmit the sitemap via GSC after the fix. |
| Server error (5xx): Googlebot received 500, 502, or 503 | Web server overload; PHP worker pool exhaustion; database connection pool starvation; WAF blocking Googlebot IP range | Check server logs for the exact error code at the time of the crawl. If 503: likely traffic spike or rate limiting. Add Googlebot IP ranges to a whitelist in your WAF. If 502: backend service (e.g., PHP-FPM, Node, Gunicorn) crashed. Restart the service. | The most dangerous server error pattern is intermittent 503s. They do not show up as a massive spike in GSC but quietly reduce crawl frequency over weeks. Set up a cron job to hit your own URLs every 5 minutes and log the HTTP status. If you see 503s at 2% rate, you have a problem. |
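
The intermittent-503 check from the last table row can be sketched in Python. This is a minimal sketch using only the standard library, not a finished monitor: the function names are hypothetical, and how you store the logged statuses between cron runs is left open. `fetch_status` is what a cron job would call every 5 minutes; `has_503_problem` applies the 2% threshold to whatever statuses you have logged.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or 0 if no HTTP response arrived.

    Note: urlopen follows redirects, so this reports the final status.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as exceptions
    except OSError:
        return 0  # DNS failure, refused connection, or timeout


def error_rate(statuses, code=503):
    """Share of logged statuses equal to `code` (0.0 for an empty log)."""
    if not statuses:
        return 0.0
    return Counter(statuses)[code] / len(statuses)


def has_503_problem(statuses, threshold=0.02):
    """True when the 503 share reaches the threshold (2% by default)."""
    return error_rate(statuses) >= threshold
```

Feeding it a day of logged statuses, for example `has_503_problem(day_log)`, gives a yes/no signal you can alert on long before the pattern becomes visible in GSC.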
Verify the error URL is actually requested by Googlebot. Use the URL Inspection tool in GSC. If the error is from a redirected URL, the real problem is upstream.
Check if the URL has been canonicalized elsewhere. A soft 404 often hides behind a canonical tag that points to a different URL.
Look at the referring page. Is the broken link in a footer, a blogroll, or a dynamically generated breadcrumb? Fix the source, not just the destination.
For DNS and server errors: are they global or isolated to your Googlebot crawl? Check your CDN logs for the specific user-agent. If normal traffic works but Googlebot gets errors, your WAF or rate-limiter is blocking it.
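
Because the user-agent string can be spoofed, it helps to confirm that a blocked request really came from Googlebot before changing WAF rules. The sketch below checks a client IP against a list of CIDR ranges. It assumes the ranges are supplied as a dict with a `prefixes` list holding `ipv4Prefix` / `ipv6Prefix` keys, which is how we understand Google's published Googlebot ranges file to be laid out; verify against the current file. The sample ranges are illustrative only and must be replaced with the downloaded list.

```python
import ipaddress

# Illustrative sample only; load the real, current list published by Google.
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}


def load_networks(ranges_doc):
    """Turn the ranges document into a list of ip_network objects."""
    networks = []
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_listed_ip(ip, networks):
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)
```

Run `is_listed_ip` over the client IPs of requests that received 403 or 5xx in your CDN logs: hits inside the ranges point to a WAF or rate-limiter rule, while misses are likely impostors.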
Download the full error list from the Pages report. Filter by error type. Do not use the legacy Crawl Errors report.
Run a bulk HTTP status check on the list. Remove false positives (URLs that now return 200).
Group by error type and page template. Look for systemic patterns: all errors are from one sitemap, one template, or one parameter.
For 404s: redirect or restore. For soft 404s: change status code or add noindex. For DNS/5xx: fix infrastructure or whitelist Googlebot.
Resubmit fixed URLs via GSC or API. Check the report again after 7 days to confirm the error count dropped.
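
The grouping step in this workflow can be sketched as follows. This is one possible approach, assuming the export has been reduced to `(error_type, url)` pairs; the "signature" used here (first path segment plus query parameter names) is our own rough stand-in for a page template, not something GSC provides.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def url_signature(url):
    """Reduce a URL to (first path segment, sorted query parameter names)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    first = "/" + segments[0] if segments else "/"
    params = tuple(sorted(parse_qs(parts.query, keep_blank_values=True)))
    return first, params


def group_errors(rows):
    """Count (error_type, signature) pairs from (error_type, url) rows."""
    return Counter((error_type, url_signature(url)) for error_type, url in rows)
```

Calling `group_errors(rows).most_common(10)` surfaces the largest clusters first, which is usually where a single template or parameter is responsible.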
We inherited a site with 1,230 'soft 404' errors in GSC. The initial reaction was to 301-redirect everything to the homepage. That would have been a disaster.
Step 1: We exported the list and ran a custom script that checked each URL's HTTP status and HTML body. Result: 610 URLs were actually returning 200 but with zero products (empty result pages). The remaining 620 were returning 200 with thin content (one product on a category page).
Step 2: We modified the ecommerce platform template. For empty result pages, we changed the HTTP status code to 410 (Gone). For thin category pages, we added a <meta name='robots' content='noindex, follow'> tag and removed them from the XML sitemap.
Step 3: We resubmitted the 610 fixed URLs through the Indexing API. 14 days later, the soft 404 count dropped to 45. Those remaining were from a third-party review page that we had to fix manually.
The key insight: about half of the errors (610 of 1,230) came from a single template. Fixing the template fixed 610 errors in one deployment.
DNS errors are the most dangerous because they affect entire domains, not individual URLs. A 30-minute DNS outage can cause Google to deprioritize crawling your site for days. The fix is rarely simple: check your DNS provider's SLA, configure secondary nameservers, and set TTL to 300 seconds or lower for critical records. For server errors, the most common mistake is assuming a 503 means 'server too busy'. Often, it is a WAF rule that blocks Googlebot's IP range. Check your firewall logs for the 'Googlebot' user-agent. If you see 403 or 503 for that user-agent, whitelist the entire Googlebot IP range (published by Google).
For a deeper understanding of how search engines interpret these signals, the Moz SEO learning center offers a solid foundation on crawl budget and server response codes. It is worth revisiting even for experienced practitioners.
Export the 404 list from GSC, run it through a bulk checker to confirm the current status, then use a regex or URL pattern to write server-level redirect rules. If the error pattern is consistent (e.g., all URLs contain '/old-blog/'), a single .htaccess or nginx rule can fix hundreds of URLs in seconds. Do not redirect all 404s to the homepage.
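
Before writing the server rule, it can help to dry-run the pattern against the exported list. The sketch below is a minimal example of that idea; the `/blog/` destination and the rule itself are hypothetical, chosen only to illustrate the '/old-blog/' case.

```python
import re

# Hypothetical rule: everything under /old-blog/ moved to /blog/ with the same slug.
RULES = [
    (re.compile(r"^/old-blog/(?P<slug>.+)$"), r"/blog/\g<slug>"),
]


def redirect_target(path, rules=RULES):
    """Return the new path for the first matching rule, or None if no rule applies."""
    for pattern, replacement in rules:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return None


def build_redirect_map(paths, rules=RULES):
    """Map each matching old path to its target; unmatched paths stay as 404s."""
    mapping = {}
    for path in paths:
        target = redirect_target(path, rules)
        if target is not None:
            mapping[path] = target
    return mapping
```

Paths that come back unmatched are deliberately left alone, which keeps the homepage-redirect trap out of the process. Once the map looks right, translate the same pattern into your .htaccess or nginx configuration.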
GSC labels pages as soft 404 when they return 200 but have little or no content. The most common source is internal search result pages with no results. To find them, export the soft 404 CSV, then grep for 'search', 'query', 'q=', or 's=' in the URL. Check the HTML body for the phrase 'no results' or similar. Then fix the template to return 404 for empty results.
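
The grep-and-check routine can be expressed as a small Python filter. This is a sketch under assumptions: the parameter names mirror the ones listed above, and the empty-result phrases are guesses at template wording that you should adjust to your own site.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_PARAMS = {"q", "s", "query", "search"}
# Assumed template wording; replace with the phrases your platform renders.
EMPTY_PHRASES = ("no results", "0 results", "nothing found")


def looks_like_search_url(url):
    """True if the URL carries a search parameter or a /search path."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    return bool(params & SEARCH_PARAMS) or "/search" in parts.path.lower()


def is_empty_result_page(status, html):
    """True for a 200 response whose body contains an empty-results phrase."""
    body = html.lower()
    return status == 200 and any(phrase in body for phrase in EMPTY_PHRASES)
```

URLs that pass both checks are the ones where the template should stop returning 200.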
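You can resubmit through the Indexing API, but with limits. The default quota is 200 publish requests per day per project, and a single batch request can combine up to 100 calls. Google officially supports the API only for pages with JobPosting or BroadcastEvent markup, so for other page types rely on sitemap resubmission and the URL Inspection tool. For bulk fixes, batch your URLs and use the API with exponential backoff. Do not use it for low-value pages; it is best for high-priority content. A practical alternative is to use a bulk index checker tool that verifies status without consuming your GSC quota, as described in related resources.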
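
Batching with exponential backoff can be sketched independently of any particular API client. In the example below, `send` is a hypothetical callable standing in for whatever function submits one batch, and the batch size is left to the caller so it can match the applicable limit. A production client should retry only on rate-limit and server errors rather than on every exception.

```python
import random
import time


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_backoff(send, batch, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(batch); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries):
        try:
            return send(batch)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The jitter term spreads retries out so that several workers hitting the same limit do not all retry at the same moment.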
A 404 tells Google the page does not exist. Google removes it from the index after a few crawls. A soft 404 is a page that returns 200 but has no useful content. Google sees a mismatch and may keep the URL in a 'crawled but not indexed' state indefinitely. Soft 404s are more dangerous because they waste crawl budget and dilute index quality. Fix them by returning 404 or improving the content.
When Googlebot receives errors while normal visitors do not, it is almost always a WAF or rate-limiting issue. Find Googlebot's IP ranges (published by Google) and add them to a whitelist. Also check your CDN's Web Application Firewall logs. If you use Cloudflare, ensure 'I'm Under Attack' mode is off. For persistent DNS issues, switch to a premium DNS provider with a 100% uptime SLA and low TTL (60-300 seconds).
The most common diagnostic mistake is assuming the error is on your server when it is actually on your CDN or reverse proxy. A 502 or 504 often means your origin server is fine, but the CDN node could not get a valid response from it or timed out waiting. Check your CDN logs for the specific request that failed. Also, many server errors are intermittent and only happen during peak traffic. Set up a monitoring tool that checks your URLs every minute from multiple locations.
Do not redirect every 404 to the homepage. That is the fastest way to generate soft 404 errors. Google sees a redirect from '/obsolete-product' to '/homepage' and notices the content does not match. It then treats the redirected URL as a soft 404. Only redirect when the target page is topically equivalent. For example, redirect '/old-iphone-case' to '/new-iphone-case', not to the homepage.
GSC refreshes the Pages report every 24-48 hours for most sites. However, the data is sampled and may lag by up to 3 days for large sites. Do not make decisions based on a single day of data. Look at the 7-day or 28-day trend. If you fix an error today, wait at least 7 days to confirm the fix in GSC.
Agencies often use custom scripts (Python + requests library) or SaaS tools that check HTTP status codes in parallel. For a workflow that handles 100,000 URLs, see the bulk Google index checker approach linked in this article. The key is to respect rate limits (10-20 requests per second per IP) and handle redirects properly. Do not follow redirects blindly; record the final status and the redirect chain length.
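
A redirect-aware status check can be sketched with the standard library alone (the same idea carries over to `requests` with redirects disabled). This is a simplified, sequential sketch: `head`, `trace`, and `check_all` are hypothetical names, and the pacing is crude, pausing once per URL rather than per request, so long redirect chains add extra requests.

```python
import http.client
import time
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head(url, timeout=10):
    """Send one HEAD request without following redirects; return (status, Location or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace(url, fetch=head, max_hops=10):
    """Follow redirects manually; return (final_status, chain_of_urls)."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            return status, chain
        chain.append(urljoin(chain[-1], location))
    return None, chain  # gave up: too many hops or a redirect loop


def check_all(urls, fetch=head, per_second=10, sleep=time.sleep):
    """Trace each URL in turn, pausing 1/per_second seconds between URLs."""
    results = {}
    for url in urls:
        results[url] = trace(url, fetch)
        sleep(1 / per_second)
    return results
```

The chain length is `len(chain) - 1`, and a final status of `None` marks a loop or an overly long chain that deserves manual review. For 100,000 URLs you would parallelize this across workers while keeping the per-IP request rate within the limits above.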