Word Frequency Counter Guide: Analyze Text Patterns & Density (2026)
Quick Answer
A word frequency counter tallies how many times each unique word appears in a text. It's used in SEO to measure keyword density (target: 1–2% for primary keywords), in linguistics for corpus analysis, in writing to catch overused words, and in NLP as a foundational step for text classification and sentiment analysis.
How Word Frequency Analysis Works
At its core, word frequency analysis is a count. The algorithm reads every token (word) in a text, strips punctuation, optionally lowercases everything, and increments a counter for each unique word. The output is a frequency table — often sorted descending so the most-used words appear first.
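The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer; the function name `word_frequencies` and the simple letters-and-apostrophes token pattern are assumptions for the example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and tally each unique word."""
    tokens = re.findall(r"[a-z']+", text.lower())  # strips punctuation as a side effect
    return Counter(tokens)

freqs = word_frequencies("The cat sat on the mat. The cat slept.")
top = freqs.most_common(2)  # frequency table sorted descending
```

`Counter.most_common()` gives exactly the sorted-descending frequency table the paragraph describes: here the top entries are `("the", 3)` and `("cat", 2)`.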
The Term Frequency Formula
Raw count alone isn't always useful. Normalized term frequency (TF) expresses how often a word appears relative to the total number of words:
TF = (Number of times word appears) ÷ (Total words in document)
A 1,000-word article where “climate” appears 15 times has a TF of 0.015, or 1.5%. That's useful for SEO keyword density analysis because it's length-independent — you can compare documents of different sizes.
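The TF formula is straightforward to compute directly. A minimal sketch (the helper name `term_frequency` is illustrative), reproducing the article's example of “climate” appearing 15 times in 1,000 words:

```python
import re

def term_frequency(word, text):
    """Normalized TF: occurrences of `word` divided by total words, case-insensitive."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens.count(word.lower()) / len(tokens) if tokens else 0.0

doc = "climate " * 15 + "filler " * 985          # 1,000-word toy document
tf = term_frequency("climate", doc)              # 15 / 1000 = 0.015, i.e. 1.5%
```

Because TF is a ratio rather than a raw count, the same 1.5% figure is directly comparable across documents of any length.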
Stop Word Removal
Without filtering, the top words in almost any English text are: the, a, of, and, to, in, is, that. These stop words are grammatically necessary but semantically empty. Most word frequency tools let you toggle stop word removal so the results surface meaningful content words instead.
The NLTK library for Python ships with a default English stop word list of 179 words. Researchers and SEO tools often expand this list for specific domains.
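Stop word filtering is a set-membership check on top of the basic counter. The sketch below hardcodes a small illustrative subset rather than shipping NLTK's full 179-word list; in practice you would load `nltk.corpus.stopwords` or a domain-specific list.

```python
import re
from collections import Counter

# Illustrative subset only; NLTK's default English list has 179 entries.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it", "on"}

def content_word_frequencies(text):
    """Frequency table with stop words filtered out, surfacing content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```

Running this on “The rain in Spain falls on the plain” returns counts for “rain,” “spain,” “falls,” and “plain” while “the,” “in,” and “on” are dropped.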
Stemming and Lemmatization
“Run,” “runs,” and “running” are the same concept. Stemming reduces words to their root form by stripping suffixes (“running” → “run”). Lemmatization does the same but uses a dictionary lookup to return the canonical base form, handling irregular verbs correctly (“are” → “be”). NLP pipelines almost always apply one of these steps before frequency analysis to avoid inflated unique-word counts.
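To make the difference concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy, not a real algorithm: production code would use NLTK's `PorterStemmer` for stemming or `WordNetLemmatizer` for dictionary-based lemmatization, which this sketch does not attempt.

```python
def toy_stem(word):
    """Naive suffix stripper for illustration; real stemmers apply many more rules."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        if len(w) >= 2 and w[-1] == w[-2]:  # collapse doubled consonant: "runn" -> "run"
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    return w
```

All three inflections collapse to one root, so the frequency table counts one concept instead of three: `toy_stem` maps “run,” “runs,” and “running” to “run.” A lemmatizer's dictionary lookup is what handles irregulars like “are” → “be,” which no suffix rule can.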
Keyword Density for SEO
Keyword density measures what percentage of your total word count is occupied by a specific keyword. It's one of the oldest on-page SEO signals, and while modern search algorithms are far more sophisticated, it remains a useful sanity check.
The 1–2% Target
Google Search Central explicitly discourages optimizing for a specific keyword density and states that content should be written naturally for users. In practice, most SEO professionals target 1–2% density for a primary keyword. For a 1,000-word article, that's 10–20 appearances — enough to establish clear topical relevance without triggering spam filters.
According to Moz's On-Page SEO research, keyword placement matters more than raw density. A keyword appearing in the title tag, the first 100 words, at least one H2, and naturally throughout the body signals relevance more reliably than cramming it into every paragraph.
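The density check itself is the TF formula expressed as a percentage. A minimal sketch (the name `keyword_density` is illustrative, and it handles single-word keywords only, not phrases):

```python
import re

def keyword_density(text, keyword):
    """Percentage of total words occupied by `keyword` (single words, case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 100 * tokens.count(keyword.lower()) / len(tokens)

doc = "climate " * 15 + "filler " * 985   # 15 uses in 1,000 words
density = keyword_density(doc, "climate")  # 1.5 — inside the 1–2% target band
```

A result above roughly 4% is the rewrite-for-naturalness threshold discussed under keyword stuffing below.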
TF-IDF: Beyond Raw Frequency
Search engines don't just count how often a keyword appears in your page — they compare it to how common that word is across billions of pages. TF-IDF (Term Frequency–Inverse Document Frequency) captures this:
TF-IDF = TF × log(Total documents ÷ Documents containing the word)
Words that appear frequently in your document but rarely across the web score high — those are your distinguishing terms. Generic words like “important” or “information” appear everywhere and score near zero regardless of how often you use them.
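The TF-IDF formula above can be computed directly over a small corpus. This sketch assumes documents are pre-tokenized lists; the function name `tf_idf` is illustrative.

```python
import math

def tf_idf(word, doc, corpus):
    """TF-IDF = (count/len) * log(N / docs containing word); doc and corpus entries are token lists."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

docs = [["the", "cat"], ["the", "dog"], ["the", "climate", "report"]]
# "the" appears in every document, so its IDF is log(3/3) = 0 — score is zero
# "climate" appears in one of three documents, so it scores high for that document
```

This is exactly the behavior described above: ubiquitous words score zero regardless of how often they appear, while document-distinctive words score high.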
Keyword Stuffing Penalties
Google's spam policies explicitly list “keyword stuffing” as a violation — defined as loading a page with keywords in an unnatural way, including in hidden text or repeated unnecessarily. Pages caught stuffing keywords risk manual action or algorithmic demotion. If your frequency analysis shows a single keyword above 4%, rewrite those passages for naturalness.
Top Use Cases for Word Frequency Analysis
SEO Content Audits
Paste competitor content into a word frequency counter to reverse-engineer their topical coverage. The top 20–30 non-stop content words reveal which subtopics they emphasize — giving you a content gap framework. Compare your own article's frequency output against a top-ranking competitor to spot missing terms.
Academic and Plagiarism Detection
Plagiarism detection systems like Turnitin use frequency-weighted fingerprinting to identify copied passages. Unusual word frequency patterns — particularly rare technical terms appearing at identical rates across two documents — are a strong signal of copied text. According to Turnitin's 2024 Academic Integrity Report, over 22 million papers submitted annually show some degree of similarity.
Competitive Content Analysis
Analyzing the word frequency distribution across multiple top-ranking pages for a target keyword reveals the semantic field Google associates with that topic. If the top 10 results all frequently use “photosynthesis,” “chlorophyll,” and “light absorption” when answering a biology question, those co-occurring terms are signals of topical authority.
Readability and Style Analysis
High repetition of abstract words (“thing,” “aspect,” “factor”) in frequency output often correlates with vague, low-readability writing. Concrete nouns and specific verbs score higher in readability frameworks like the Flesch–Kincaid scale. A frequency audit that surfaces filler words is a fast diagnostic.
NLP Preprocessing
Frequency analysis is foundational to virtually every NLP task. Bag-of-words models, naïve Bayes classifiers, and early versions of topic modeling (LDA) are all built on word frequency distributions. Even transformer models like BERT, which use contextual embeddings, rely on frequency information in their vocabulary construction (byte-pair encoding merges the most frequent symbol pairs).
Word Frequency in Famous Literature and Language
Some of the most striking findings in computational linguistics come from applying frequency analysis at scale.
Shakespeare's Vocabulary
Analyses of the complete works of Shakespeare — roughly 884,000 words — show that “the” appears approximately 27,000 times, making it by far the most frequent word. But what makes Shakespeare remarkable is vocabulary breadth: he used an estimated 31,534 unique words, according to computational literary studies, compared to the average educated adult's active vocabulary of about 20,000 words.
Most Common English Words
The Oxford English Corpus — a collection of over 2 billion words from a wide range of text types — identifies the top 25 most frequent words in English as exclusively function words: the, be, to, of, and, a, in, that, have, it, for, not, on, with, he, as, you, do, at, this, but, his, by, from, they. Together these 25 words account for roughly one-third of all written English.
Zipf's Law
In 1935, linguist George Kingsley Zipf observed a striking power-law regularity: in any large text corpus, the most frequent word occurs approximately twice as often as the second most frequent, three times as often as the third, and so on. This relationship — now called Zipf's Law — holds across languages, programming languages, city populations, and even the frequency of musical notes in compositions. The law implies that a small number of words do an enormous share of communicative work, while the vast majority of vocabulary items are used rarely.
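The rank-frequency relationship is simple enough to write down directly. A sketch of the expected counts under Zipf's Law, using the Shakespeare figure from above (“the” at roughly 27,000 occurrences) as the rank-1 anchor:

```python
def zipf_curve(top_frequency, n_ranks):
    """Expected counts under Zipf's Law: the rank-r word occurs ~top_frequency / r times."""
    return [top_frequency / r for r in range(1, n_ranks + 1)]

expected = zipf_curve(27000, 4)  # [27000.0, 13500.0, 9000.0, 6750.0]
```

Comparing a real corpus's sorted frequency table against this curve is a quick check of how closely it follows the law.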
Google Ngram Viewer
Google Ngram Viewer tracks word frequency across over 5 million digitized books spanning 1500–2019, representing roughly 4% of all books ever printed. It reveals cultural shifts through language: the word “algorithm” was virtually absent before 1960 and rose sharply after 1990. “Coronavirus” shows a vertical spike in 2020 data. This kind of diachronic frequency analysis is a window into history.
Using Word Frequency to Improve Your Writing
Frequency data is one of the most actionable feedback loops a writer can use. Here's how to apply it.
Finding Overused Words
Paste a draft into a word frequency counter with stop words removed. Scan the top 30 results. Any content word appearing disproportionately often — especially adjectives and adverbs like “really,” “very,” “great,” or “important” — is a revision target. Replace repetitions with synonyms, restructure sentences to eliminate filler, or cut the word entirely. Stephen King's advice in On Writing applies here: “the road to hell is paved with adverbs.”
Measuring Vocabulary Diversity
The type-token ratio (TTR) measures how varied your vocabulary is:
TTR = Unique words (types) ÷ Total words (tokens)
A TTR of 1.0 means every word is unique; a TTR near 0 means the text repeats a handful of words over and over. In practice, TTR decreases as text gets longer because repetition is inevitable. For shorter samples (300–500 words), published literary fiction typically scores 0.55–0.70. Journalistic prose tends toward 0.45–0.60. Academic writing often lands lower due to technical terminology that must be repeated precisely.
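The TTR formula translates directly into code. A minimal sketch (the name `type_token_ratio` is illustrative):

```python
import re

def type_token_ratio(text):
    """Unique words (types) divided by total words (tokens), case-insensitive."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

ttr = type_token_ratio("the cat sat on the mat")  # 5 types / 6 tokens ≈ 0.83
```

Because TTR is length-sensitive, compare samples of similar size — a 300–500-word excerpt, as noted above — rather than whole documents of different lengths.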
Improving Consistency in Technical Documentation
In software documentation, inconsistent terminology is a major usability problem. Word frequency analysis across multiple documentation pages quickly surfaces inconsistency: if “login,” “log in,” and “sign in” all appear with similar frequency, that's a style guide failure. The most frequent form is usually the one to standardize on — or pick the one that matches your product's UI and update the rest.
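A terminology audit like the one described can be automated with a phrase-aware count of each variant. This is a sketch under the assumption that variants are known in advance; the name `variant_counts` is illustrative.

```python
import re

def variant_counts(text, variants):
    """Count each terminology variant (phrase-aware, whole-word, case-insensitive)."""
    low = text.lower()
    return {v: len(re.findall(r"\b" + re.escape(v.lower()) + r"\b", low))
            for v in variants}

counts = variant_counts("Click Login, then log in again.",
                        ["login", "log in", "sign in"])
# {"login": 1, "log in": 1, "sign in": 0} — two variants in active use is the red flag
```

Run this across all documentation pages at once: if two or more variants show nonzero counts, the style guide has a gap to close.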
Validating Content Coverage
Run frequency analysis on your finished article and compare the top content words against your target keyword cluster. If your article is supposed to cover “strength training for beginners” but “reps,” “sets,” “progressive overload,” or “rest” don't appear in the top 30, the content may lack depth. Frequency output works as a content completeness checklist.
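The completeness check can be scripted as a set difference between your keyword cluster and the article's top content words. A sketch (the name `coverage_gaps` is illustrative, and it handles single-word terms only — a phrase like “progressive overload” would need separate handling):

```python
import re
from collections import Counter

def coverage_gaps(text, cluster, top_n=30):
    """Return cluster terms missing from the text's top-N most frequent words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    top = {w for w, _ in Counter(tokens).most_common(top_n)}
    return [term for term in cluster if term.lower() not in top]

gaps = coverage_gaps("reps reps sets rest warmup", ["reps", "sets", "tempo"])
# ["tempo"] — a subtopic the draft never develops
```

An empty return list means every target term made the top-N cut; anything returned is a candidate section to expand.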
Analyze your text's word frequency instantly
Use our free Word Frequency Counter → Also useful: Word Counter and Readability Checker
Frequently Asked Questions
What is word frequency analysis?
Word frequency analysis is the process of counting how many times each unique word appears in a text. The result is a ranked list showing which words are used most often. It's applied in SEO to measure keyword density, in linguistics for corpus research, in writing to detect overused words, and in NLP as a preprocessing step for machine learning models.
What is a good keyword density for SEO?
Google Search Central guidelines recommend focusing on writing naturally for readers rather than hitting a specific percentage. Most SEO practitioners target 1–2% keyword density for primary keywords. Exceeding 3–4% risks triggering spam filters. For a 1,000-word article, that means your main keyword should appear roughly 10–20 times, including in the title, first paragraph, a heading, and naturally throughout the body.
What is Zipf's Law in word frequency?
Zipf's Law states that in any large body of text, the most frequent word appears approximately twice as often as the second most frequent word, three times as often as the third, and so on. This power-law distribution was first described by linguist George Kingsley Zipf in 1935. It holds across virtually every natural language and even in many code bases and music catalogs.
What are stop words and should I remove them?
Stop words are extremely common words — “the,” “a,” “is,” “of,” “and” — that appear in nearly every sentence. For SEO keyword density analysis, removing stop words gives you a cleaner picture of meaningful content words. For NLP tasks like sentiment analysis or topic modeling, stop words are routinely filtered out. For writing diversity analysis, however, you may want to keep them to get an accurate type-token ratio.
What is TF-IDF and how does it differ from raw word frequency?
TF-IDF (Term Frequency–Inverse Document Frequency) weights a word's importance by how often it appears in a specific document versus how common it is across a large collection of documents. A word like “the” has high term frequency but near-zero TF-IDF because it appears everywhere. TF-IDF surfaces words that are distinctive to a given document — exactly what search engines use to understand topical relevance. Raw word frequency is simpler and still useful for single-document analysis.
How do I use word frequency to improve my writing?
Run your draft through a word frequency counter and look at the top 20–30 content words. If non-structural words like “very,” “really,” “just,” or a specific noun appear far more often than expected, that's a signal to vary your word choice. Also calculate your type-token ratio (unique words ÷ total words): a ratio below 0.4 on a 500-word sample suggests low vocabulary diversity. Most published authors land between 0.5 and 0.7.