article-extractor

Extract full article text and metadata from web pages.

What article-extractor Does

Article Extractor is a Claude skill that extracts full article text and metadata from web pages, removing clutter such as ads, navigation elements, and sidebar content. It is designed for anyone who needs to process web content at scale, whether you’re building research databases, feeding content into AI workflows, or archiving articles for later analysis. Used with Claude Code, the skill supports automated extraction pipelines that preserve article structure while discarding noise.

How to Install

  1. Open the article-extractor repository on GitHub
  2. Clone or download the skill files to your local environment
  3. Copy the skill directory into your Claude Code skills folder
  4. Make sure your environment has network access, since the skill fetches web pages
  5. Test with a known article URL to confirm that text extraction and metadata parsing work

Use Cases

  • Automated Content Curation: Extract articles from news sites, blogs, and publications to build curated content feeds without manual copying
  • Research Database Building: Scrape academic papers, industry reports, and reference materials while preserving metadata like publication date and author
  • AI Training Data Preparation: Extract clean article text to prepare datasets for fine-tuning models or feeding into RAG (Retrieval-Augmented Generation) systems
  • Content Migration: Move articles from one CMS to another by extracting full text and metadata in a standardized format
  • Accessibility Improvements: Convert web articles into clean text for text-to-speech tools or accessible document formats

How It Works

Article Extractor combines DOM parsing with heuristics to identify and isolate the main article content on a webpage. When you provide a URL, the skill fetches the page, analyzes its structure, and detects article containers based on common HTML patterns, element density, and text coherence. It distinguishes actual content from boilerplate by evaluating factors such as paragraph length, link density, and semantic markup.
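
To make the link-density idea concrete, here is a minimal sketch using only Python's standard library. It is not the skill's actual implementation: the class name `BlockScorer`, the `BLOCK_TAGS` set, and the scoring formula are assumptions chosen for illustration. It covers only text length and link density, and leaves out paragraph length and semantic markup.

```python
from html.parser import HTMLParser

# Hypothetical set of container tags to score; a real extractor may use more.
BLOCK_TAGS = {"div", "article", "section", "main", "nav", "aside", "header", "footer"}


class BlockScorer(HTMLParser):
    """Records text length and link-text length for each container block."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open containers, innermost last
        self.blocks = []   # finished containers
        self.in_link = 0   # depth of open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text_len": 0, "link_len": 0})
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            block = self.stack[-1]  # text is credited to the innermost container
            block["text_len"] += n
            if self.in_link:
                block["link_len"] += n


def score(block):
    """More text and a lower share of link text give a higher score."""
    if block["text_len"] == 0:
        return 0.0
    link_density = block["link_len"] / block["text_len"]
    return block["text_len"] * (1.0 - link_density)


def best_block(html):
    """Return the highest-scoring container, or None if there are none."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=score, default=None)
```

A navigation bar made entirely of links scores zero under this formula, while a container holding mostly plain paragraphs scores close to its text length, which is why link density is a useful boilerplate signal.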

The extraction process preserves important metadata including the article title, publication date, author information, and article URL. The skill cleans extracted text by removing inline ads, scripts, and formatting artifacts while maintaining paragraph structure. Metadata extraction relies on parsing common meta tags (Open Graph, Dublin Core, JSON-LD structured data) as well as heuristic detection when standard tags aren’t present.
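
The tag-based part of metadata extraction can be sketched as follows. This is an illustration under stated assumptions, not the skill's code: `MetaCollector`, `extract_metadata`, the lookup order, and the returned field names are all hypothetical. It reads Open Graph and named meta tags plus top-level JSON-LD objects, and it does not handle JSON-LD `@graph` wrappers or the heuristic fallback.

```python
import json
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tag values and parsed JSON-LD objects."""

    def __init__(self):
        super().__init__()
        self.meta = {}       # lower-cased property/name -> content
        self.json_ld = []    # parsed JSON-LD objects
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = a.get("property") or a.get("name")
            if key and a.get("content") is not None:
                self.meta.setdefault(key.lower(), a["content"])
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # ignore malformed JSON-LD
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_metadata(html):
    """Prefer meta tags, then fall back to the first JSON-LD object."""
    c = MetaCollector()
    c.feed(html)
    c.close()
    ld = next((o for o in c.json_ld if isinstance(o, dict)), {})
    author = ld.get("author")
    if isinstance(author, list) and author:
        author = author[0]
    if isinstance(author, dict):
        author = author.get("name")
    return {
        "title": c.meta.get("og:title") or ld.get("headline"),
        "author": c.meta.get("author") or c.meta.get("dc.creator") or author,
        "published": c.meta.get("article:published_time") or ld.get("datePublished"),
        "url": c.meta.get("og:url") or ld.get("url"),
    }
```

Fields that no source provides come back as `None`, which is the point at which a heuristic fallback (for example, scanning for a byline near the title) would take over.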

The output provides both the cleaned article text and a metadata object containing extracted fields. This structured format makes it easy to feed results into downstream processes—storing in databases, indexing for search, or passing to other Claude Code skills for further analysis or transformation.
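
As one example of a downstream step, the sketch below stores a result in SQLite. The record shape is an assumption: the source only says the output contains cleaned text and a metadata object, so `ExtractedArticle`, its field names, and the table layout are hypothetical.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ExtractedArticle:
    """Hypothetical container mirroring the text-plus-metadata output."""
    text: str
    metadata: dict = field(default_factory=dict)


def save_article(conn, article):
    """Insert or update one article, keyed by its URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, text TEXT, metadata TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
        (
            article.metadata.get("url"),
            article.metadata.get("title"),
            article.text,
            json.dumps(article.metadata, ensure_ascii=False),  # keep all fields
        ),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_article(
        conn,
        ExtractedArticle(
            text="Body of the article.",
            metadata={"url": "https://example.com/post", "title": "Sample"},
        ),
    )
    print(conn.execute("SELECT url, title FROM articles").fetchall())
```

Keeping the full metadata object as a JSON column means fields that vary from site to site are not lost, while the URL and title stay directly queryable.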

Pros and Cons

Pros:

  • Seamlessly integrated with Claude Code for easy workflow automation
  • Extracts both text and structured metadata from diverse websites
  • Language-agnostic—works with articles in any language
  • No API keys or external service dependencies required
  • Intelligent content detection removes ads and navigation clutter automatically
  • Structured output format feeds easily into downstream processing steps

Cons:

  • Cannot extract content from pages requiring authentication or JavaScript rendering
  • Accuracy varies on sites with highly custom or unusual HTML structures
  • No image or embedded media extraction—text-only focus
  • Relies on heuristics for metadata when standard tags are missing
  • May require testing and tweaking for specialized or niche websites
  • No built-in rate limiting or politeness controls for large-scale scraping

Related Skills

  • text-summarizer: Condense extracted articles into brief summaries for quick review
  • content-classifier: Categorize extracted articles by topic, sentiment, or industry automatically
  • markdown-converter: Convert extracted HTML article text into clean Markdown format for documentation or note-taking
  • web-crawler: Discover article URLs at scale across multiple sites before passing them to article-extractor
  • metadata-enricher: Enhance extracted metadata with additional fields like reading time, word count, or topic tags
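
When URLs are discovered at scale and then handed to article-extractor, the lack of built-in rate limiting noted under Cons means the calling code has to pace requests itself. The following is a minimal per-host throttle, offered as a sketch: `HostThrottle` and the two-second default are assumptions, not part of the skill.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = {}        # host -> time of the most recent request

    def wait(self, url):
        """Sleep if this host was contacted too recently; return the delay used."""
        host = urlparse(url).netloc
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (self._clock() - last))
        if delay:
            self._sleep(delay)
        self._last[host] = self._clock()
        return delay


if __name__ == "__main__":
    throttle = HostThrottle(min_interval=1.0)
    for url in ["https://example.com/a", "https://example.com/b"]:
        waited = throttle.wait(url)
        print(f"{url}: waited {waited:.1f}s before fetching")
```

Call `wait(url)` immediately before each extraction. Requests to different hosts are not delayed against each other, so a mixed list of sites still moves quickly. Checking each site's robots.txt (for example with the standard library's `urllib.robotparser`) is a separate step that this sketch does not cover.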

Alternatives

  • Readability Libraries (Mozilla Readability, Trafilatura): Open-source libraries for article extraction; Mozilla Readability is written in JavaScript and Trafilatura in Python. Good for standalone use but require more setup than a Claude skill.
  • Diffbot Article API: Commercial service with high accuracy and built-in NLP. Better for production workloads but requires API keys and has per-request costs.
  • Newspaper3k: Python library focused on news article extraction. Simpler than article-extractor but less integrated with Claude workflows and requires manual dependency management.

Glossary

Key terms

DOM Parsing
Analyzing the Document Object Model (HTML structure) of a webpage to identify and extract specific elements. Article-extractor uses DOM parsing to find the main content container and separate it from navigation and ads.
Metadata
Structured information about content, such as title, author, publication date, and description. Article-extractor extracts metadata from HTML tags and structured data formats to provide rich context alongside article text.
Heuristic Detection
Rule-based algorithms that make educated guesses about content structure based on patterns and signals. Article-extractor uses heuristics like paragraph density and link ratios to identify main article content when standard markup is absent.
Open Graph Tags
Meta tags that define how content appears when shared on social media, including title, description, image, and URL. Article-extractor parses these tags to extract metadata reliably.
Boilerplate Content
Repetitive HTML elements that appear on every page, like navigation menus, headers, footers, and sidebars. Article-extractor removes boilerplate to isolate the unique article content.

FAQ

Frequently Asked Questions

How do I install article-extractor for Claude Code?

Clone the skill from the GitHub repository into your Claude Code skills directory, then test it with a sample URL to verify proper installation. No complex dependencies are required—the skill works directly with Claude Code's environment.

What metadata does article-extractor capture?

The skill extracts title, publication date, author name, article URL, and description/summary when available. It parses Open Graph tags, JSON-LD structured data, and common meta tags, with fallback heuristics for pages without standard metadata.

Can article-extractor handle paywalled or login-protected content?

No, the skill can only extract publicly accessible content. It cannot bypass paywalls, authentication systems, or content that requires JavaScript rendering to load. Use it for freely available articles and public web pages.

How does article-extractor handle images and embedded media?

The skill focuses on text extraction and metadata. Images and embedded media (videos, iframes) are not extracted. If you need media URLs, you may need to parse those separately or use additional tools alongside article-extractor.
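
If you do need media URLs, one way to parse them separately is a small pass over the page's HTML. The sketch below uses only the standard library. `MediaCollector` and `collect_media` are hypothetical names, and it reads only the `src` attribute, so lazy-loaded images and `srcset` variants are missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaCollector(HTMLParser):
    """Collects absolute URLs from the src attribute of media tags."""

    MEDIA_TAGS = {"img", "video", "audio", "source", "iframe"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []  # list of (tag, absolute_url) in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.MEDIA_TAGS:
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.urls.append((tag, urljoin(self.base_url, src)))


def collect_media(html, base_url):
    parser = MediaCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

The resulting list can be stored alongside the skill's metadata object so that text and media references stay associated with the same article URL.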

What's the difference between article-extractor and web scraping tools?

Article-extractor is specifically optimized for news articles and blog posts—it identifies and isolates main content intelligently. General web scrapers extract all HTML without distinguishing content from navigation or ads. Article-extractor is more surgical and produces cleaner results for published articles.

Can I extract articles in languages other than English?

Yes. Text extraction and metadata parsing are language-agnostic: the HTML structure analysis doesn't depend on language, so the skill can extract articles from international sites in any language.

How do I integrate article-extractor into a larger workflow?

Use it within Claude Code to build multi-step pipelines. Extract articles, then pass the cleaned text and metadata to other skills for summarization, classification, translation, or storage. The structured output format makes it easy to chain operations.
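
The data flow of such a chain can be sketched in plain Python. In Claude Code the skills are invoked by Claude rather than called as functions, so this only illustrates how a text-plus-metadata record might move through successive steps. The record shape and the step functions `add_word_count` and `add_summary` are hypothetical stand-ins.

```python
def run_pipeline(record, steps):
    """Pass a record through each step in order and return the final result."""
    for step in steps:
        record = step(record)
    return record


def add_word_count(record):
    """Stand-in for a metadata enrichment step; returns a new record."""
    meta = dict(record["metadata"], word_count=len(record["text"].split()))
    return {**record, "metadata": meta}


def add_summary(record, max_sentences=1):
    """Stand-in for a summarization step: keeps the first sentence(s)."""
    sentences = [s.strip() for s in record["text"].split(".") if s.strip()]
    summary = ". ".join(sentences[:max_sentences])
    if summary:
        summary += "."
    return {**record, "summary": summary}


if __name__ == "__main__":
    extracted = {
        "text": "First sentence here. Second one.",
        "metadata": {"title": "Sample"},
    }
    print(run_pipeline(extracted, [add_word_count, add_summary]))
```

Each step returns a new record instead of modifying its input, which keeps the original extraction available if a later step fails and needs to be retried.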

What happens if the skill can't identify article content?

If the page structure doesn't match typical article patterns, the skill returns the full page text or a partial extraction. For highly custom site layouts, accuracy may be lower. Testing with sample URLs from your target sites helps identify problematic patterns early.
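
Because a failed extraction can come back as either the whole page or a fragment, a simple sanity check on the output helps flag results for review. The function below is a sketch: `looks_like_fallback` and both thresholds are assumptions to tune against sample URLs from your target sites, and it presumes you also have the page's full visible text for comparison.

```python
def looks_like_fallback(extracted_text, page_text, min_chars=500, max_ratio=0.9):
    """Return True when an extraction is suspiciously short or suspiciously complete.

    extracted_text: text returned for the article
    page_text: full visible text of the page, used as the baseline
    """
    extracted_len = len(extracted_text.strip())
    page_len = len(page_text.strip())
    if extracted_len < min_chars:
        return True  # too short: probably a partial extraction
    if page_len and extracted_len / page_len > max_ratio:
        return True  # nearly the whole page: boilerplate was probably not removed
    return False
```

Flagged URLs can be collected into a review list, which makes problematic site layouts visible before a large batch is processed.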
