Frequently Asked Questions

How do I install article-extractor for Claude Code?

Clone the skill from its GitHub repository into your Claude Code skills directory, then run it against a sample URL to confirm the installation works. The skill runs directly in Claude Code's environment and needs no API keys or external services.

What metadata does article-extractor capture?

The skill extracts title, publication date, author name, article URL, and description/summary when available. It parses Open Graph tags, JSON-LD structured data, and common meta tags, with fallback heuristics for pages without standard metadata.

Can article-extractor handle paywalled or login-protected content?

No. The skill only extracts publicly accessible content. It cannot bypass paywalls or authentication systems, and it cannot read content that only appears after JavaScript rendering. Use it for freely available articles and public web pages.

How does article-extractor handle images and embedded media?

The skill focuses on text extraction and metadata. Images and embedded media (videos, iframes) are not extracted. If you need media URLs, you may need to parse those separately or use additional tools alongside article-extractor.
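If media URLs are needed, a separate pass over the page HTML is one option. The sketch below is not part of article-extractor; it assumes you already have the raw HTML and the page URL, and the names `MediaCollector` and `collect_media_urls` are hypothetical. It uses only Python's standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaCollector(HTMLParser):
    """Collects src URLs from <img>, <video>, <source>, and <iframe> tags."""

    MEDIA_TAGS = {"img", "video", "source", "iframe"}

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.MEDIA_TAGS:
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.urls.append(urljoin(self.base_url, src))


def collect_media_urls(html: str, base_url: str) -> list:
    """Returns media URLs in document order (hypothetical helper)."""
    parser = MediaCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Lazy-loaded images that keep their URL in attributes such as `data-src` would need extra handling that this sketch leaves out.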

What's the difference between article-extractor and web scraping tools?

Article-extractor is optimized for news articles and blog posts: it identifies the main content and isolates it from navigation, ads, and other boilerplate. General-purpose web scrapers extract HTML without making that distinction, so for published articles article-extractor produces cleaner results.

Can I extract articles in languages other than English?

Yes. Article-extractor is language-agnostic for both text extraction and metadata parsing. Its HTML structure analysis doesn't depend on the language of the text, so it can extract articles from sites in any language.

How do I integrate article-extractor into a larger workflow?

Use it within Claude Code to build multi-step pipelines. Extract articles, then pass the cleaned text and metadata to other skills for summarization, classification, translation, or storage. The structured output format makes it easy to chain operations.
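As a rough illustration of chaining, the sketch below treats an extracted article as a plain dictionary and passes it through a list of steps. The record shape and the step functions (`add_word_count`, `add_preview`) are assumptions made for this example; the skill's actual output format may differ, and the preview step is only a stand-in for a real summarization skill.

```python
from typing import Callable

# Hypothetical record shape: {"title": ..., "text": ...}.
Article = dict
Step = Callable[[Article], Article]


def add_word_count(article: Article) -> Article:
    """Adds a word_count field without mutating the input."""
    return {**article, "word_count": len(article["text"].split())}


def add_preview(article: Article) -> Article:
    """Stand-in for a summarization step: keeps the first sentence."""
    first_sentence = article["text"].split(". ")[0].rstrip(".") + "."
    return {**article, "preview": first_sentence}


def run_pipeline(article: Article, steps: list) -> Article:
    """Applies each step in order and returns the final record."""
    for step in steps:
        article = step(article)
    return article


extracted = {"title": "Example", "text": "First point. Second point here."}
result = run_pipeline(extracted, [add_word_count, add_preview])
```

Because each step returns a new dictionary, steps can be reordered or dropped without affecting the others.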

What happens if the skill can't identify article content?

If the page structure doesn't match typical article patterns, the skill returns the full page text or a partial extraction. For highly custom site layouts, accuracy may be lower. Testing with sample URLs from your target sites helps identify problematic patterns early.
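One way to spot these cases while testing sample URLs is a simple sanity check on each result. The sketch below is an assumption-laden example, not something the skill provides: `check_extraction` is a hypothetical helper, and the thresholds (150 words, 90% of the page text) are arbitrary starting points to tune per site.

```python
def check_extraction(extracted_text: str, page_text: str,
                     min_words: int = 150, max_ratio: float = 0.9) -> list:
    """Returns a list of warnings; an empty list means nothing looked off."""
    warnings = []
    words = len(extracted_text.split())
    if words < min_words:
        # Very short output often means a partial extraction.
        warnings.append(f"only {words} words extracted")
    if page_text and len(extracted_text) / len(page_text) > max_ratio:
        # Output close to the full page suggests boilerplate was not removed.
        warnings.append("extraction is nearly the whole page")
    return warnings
```

Running this over a handful of URLs from each target site gives an early list of layouts that need closer inspection.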

article-extractor | Claude Skill

What article-extractor Does

Article Extractor is a specialized Claude skill that automatically extracts full article text and metadata from web pages, removing clutter like ads, navigation elements, and sidebar content. It’s designed for anyone who needs to process web content at scale—whether you’re building research databases, feeding content into AI workflows, or archiving articles for later analysis. The skill works seamlessly with Claude Code to enable automated content extraction pipelines that preserve article structure while eliminating noise.

How to Install

Open the article-extractor repository on GitHub
Clone or download the skill files to your local environment
Copy the skill directory into your Claude Code skills folder
Make sure your environment has network access so the skill can fetch web pages
Verify the installation by running the skill on a known article URL and confirming that the text and metadata are extracted correctly

Use Cases

Automated Content Curation: Extract articles from news sites, blogs, and publications to build curated content feeds without manual copying
Research Database Building: Scrape academic papers, industry reports, and reference materials while preserving metadata like publication date and author
AI Training Data Preparation: Extract clean article text to prepare datasets for fine-tuning models or feeding into RAG (Retrieval-Augmented Generation) systems
Content Migration: Move articles from one CMS to another by extracting full text and metadata in a standardized format
Accessibility Improvements: Convert web articles into clean text for text-to-speech tools or accessible document formats

How It Works

Article Extractor uses intelligent DOM parsing combined with heuristic algorithms to identify and isolate the main article content from a webpage. When you provide a URL, the skill fetches the page, analyzes its structure, and applies algorithms that detect article containers based on common HTML patterns, element density, and text coherence. It distinguishes between actual content and boilerplate elements by evaluating factors like paragraph length, link density, and semantic markup.
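To make the link-density idea concrete, here is a minimal sketch of that one heuristic. It is not the skill's actual algorithm: the scoring formula and the names `link_density`, `score_block`, and `pick_main_block` are assumptions for illustration, and candidate blocks are passed in as HTML fragments rather than discovered in the DOM.

```python
from html.parser import HTMLParser


class _TextStats(HTMLParser):
    """Counts visible characters and how many of them sit inside <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth:
            self.link_chars += n


def _stats(fragment: str) -> _TextStats:
    stats = _TextStats()
    stats.feed(fragment)
    stats.close()
    return stats


def link_density(fragment: str) -> float:
    """Fraction of text that is link text; empty blocks count as all links."""
    stats = _stats(fragment)
    return stats.link_chars / stats.total_chars if stats.total_chars else 1.0


def score_block(fragment: str) -> float:
    """Rewards long text and penalizes blocks that are mostly links."""
    return _stats(fragment).total_chars * (1.0 - link_density(fragment))


def pick_main_block(fragments: list) -> str:
    """Returns the highest-scoring candidate block."""
    return max(fragments, key=score_block)
```

A navigation menu scores near zero because almost all of its text is link text, while a body paragraph with an occasional link keeps most of its score.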

The extraction process preserves important metadata including the article title, publication date, author information, and article URL. The skill cleans extracted text by removing inline ads, scripts, and formatting artifacts while maintaining paragraph structure. Metadata extraction relies on parsing common meta tags (Open Graph, Dublin Core, JSON-LD structured data) as well as heuristic detection when standard tags aren’t present.
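The tag-then-fallback order described above might look roughly like the following. This is a sketch under stated assumptions, not the skill's implementation: the precedence between sources, the output field names, and the helper `extract_metadata` are all hypothetical, and only a few common tags are checked.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Gathers <meta> tags, JSON-LD objects, and the <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}        # lowercased property/name -> content
        self.json_ld = []     # decoded JSON-LD objects
        self.title_tag = ""
        self._capture = None  # "title" or "jsonld" while inside that element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta.setdefault(key.lower(), attrs["content"])
        elif tag == "title":
            self._capture, self._buffer = "title", []
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._capture, self._buffer = "jsonld", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._capture == "title":
            self.title_tag = "".join(self._buffer).strip()
            self._capture = None
        elif tag == "script" and self._capture == "jsonld":
            try:
                data = json.loads("".join(self._buffer))
                if isinstance(data, dict):
                    self.json_ld.append(data)
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD is ignored rather than fatal.
            self._capture = None


def extract_metadata(html: str) -> dict:
    """Prefers Open Graph, then JSON-LD, then plain tags (assumed order)."""
    p = MetadataParser()
    p.feed(html)
    p.close()
    ld = p.json_ld[0] if p.json_ld else {}
    author = ld.get("author")
    if isinstance(author, dict):
        author = author.get("name")
    if not isinstance(author, str):
        author = None
    return {
        "title": p.meta.get("og:title") or ld.get("headline") or p.title_tag or None,
        "author": p.meta.get("author") or p.meta.get("dc.creator") or author,
        "published": p.meta.get("article:published_time") or ld.get("datePublished"),
        "description": p.meta.get("og:description") or p.meta.get("description"),
    }
```

Fields with no source come back as `None`, which keeps the caller's handling of missing metadata explicit.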

The output provides both the cleaned article text and a metadata object containing extracted fields. This structured format makes it easy to feed results into downstream processes—storing in databases, indexing for search, or passing to other Claude Code skills for further analysis or transformation.
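For storage or indexing, one simple approach is to serialize each result as a single JSON line. The `ExtractionResult` class below is a hypothetical container invented for this sketch; the skill's real output shape is not documented here and may use different field names.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExtractionResult:
    """Hypothetical container: cleaned text plus a metadata object."""

    text: str
    metadata: dict = field(default_factory=dict)

    def to_json_line(self) -> str:
        """Serializes to one line of JSON, keeping non-ASCII text readable."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


result = ExtractionResult(
    text="Cleaned article body.",
    metadata={"title": "Example", "author": None},
)
line = result.to_json_line()
```

One record per line makes it straightforward to append results to a file and load them later into a database or search index.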

Pros and Cons

Pros:

Seamlessly integrated with Claude Code for easy workflow automation
Extracts both text and structured metadata from diverse websites
Language-agnostic—works with articles in any language
No API keys or external service dependencies required
Intelligent content detection removes ads and navigation clutter automatically
Structured output format feeds easily into downstream processing steps

Cons:

Cannot extract content from pages requiring authentication or JavaScript rendering
Accuracy varies on sites with highly custom or unusual HTML structures
No image or embedded media extraction—text-only focus
Relies on heuristics for metadata when standard tags are missing
May require testing and tweaking for specialized or niche websites
No built-in rate limiting or politeness controls for large-scale scraping
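Since politeness controls are not built in, large batches need a throttle in the calling code. The sketch below shows one possible per-host delay; `HostThrottle` is a hypothetical helper, and the two-second default is an arbitrary assumption rather than a recommended value.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 2.0,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Blocks until the host may be contacted again; returns seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` before each extraction spaces out requests to the same site while leaving requests to different hosts unaffected. The clock and sleep functions are injectable so the behavior can be tested without real delays.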

Related Skills

text-summarizer: Condense extracted articles into brief summaries for quick review
content-classifier: Categorize extracted articles by topic, sentiment, or industry automatically
markdown-converter: Convert extracted HTML article text into clean Markdown format for documentation or note-taking
web-crawler: Discover article URLs at scale across multiple sites before passing them to article-extractor
metadata-enricher: Enhance extracted metadata with additional fields like reading time, word count, or topic tags

Alternatives

Readability Libraries (Mozilla Readability, Trafilatura): Open-source libraries for article extraction; Mozilla Readability is written in JavaScript and Trafilatura in Python. Good for standalone use but require more setup than a Claude skill.
Diffbot Article API: Commercial service with high accuracy and built-in NLP. Better for production workloads but requires API keys and has per-request costs.
Newspaper3k: Python library focused on news article extraction. Simpler than article-extractor but less integrated with Claude workflows and requires manual dependency management.
