How much does Common Crawl cost?

Common Crawl is free. The dataset is published as open data, and anyone can use it for academic, educational, or commercial work, subject to its terms of use.

There are no paid tiers and no per-GB or per-crawl fees for the data itself. The only costs are your own: the compute, storage, and bandwidth you spend downloading and processing it.

Why do I need a large corpus of text? To train a good model to recognize text in your own database. This is typically done with word embeddings, vector representations that map words into a common space so that words used in similar contexts end up close together. The analogy is with images, which look the same regardless of their context: each word gets a single vector wherever it appears. In machine learning, the text you learn from is called training data, and you need a lot of it to build a model.
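To make the idea of a common vector space concrete, here is a minimal sketch that compares words by the cosine similarity of their vectors. The three-dimensional vectors are invented for illustration and are not taken from any trained model; real embeddings usually have hundreds of dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "crawl": [0.9, 0.1, 0.0],
    "spider": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


related = cosine_similarity(toy_vectors["crawl"], toy_vectors["spider"])
unrelated = cosine_similarity(toy_vectors["crawl"], toy_vectors["invoice"])

print(f"crawl vs spider:  {related:.3f}")
print(f"crawl vs invoice: {unrelated:.3f}")
```

With these made-up values, the first similarity comes out much higher than the second, which is the behavior a trained embedding is meant to show for words that share contexts.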

What kind of models do you train on Common Crawl? We train all our models on top of word2vec word embeddings. This means that the vocabulary is drawn from the text of Common Crawl. During training, each word occurrence is paired with a context window of five words on each side, and the embedding is learned from which words tend to appear together in those windows. The resulting word vectors turn each word in a sentence into a numeric representation that reflects the contexts it was seen in. To see how these word embeddings work, check out our demo. We compare our models to BERT and GPT-2.
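The context window can be sketched as follows. This is only an illustration of how (word, context word) training pairs could be generated with a window of five words on each side; the function name and the sample sentence are made up, and this is not our training code or the word2vec implementation itself.

```python
def context_pairs(tokens, window=5):
    """Return (center, context) pairs for each token and every neighbor
    that lies within `window` positions on either side of it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sample sentence, split on whitespace for simplicity.
sentence = "common crawl publishes web crawl data every month".split()

pairs = context_pairs(sentence, window=5)
print(len(pairs), "training pairs")
print(pairs[:3])
```

Words near the start or end of a sentence simply get a shorter window, which is why the number of pairs per word varies.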

How often does Common Crawl get updated? Common Crawl is updated every month, so the data stays reasonably fresh. We're constantly adding more text to make it better for researchers. We're also working on ways to update Common Crawl automatically.

What are some of the datasets in Common Crawl? Common Crawl currently contains over 5.5 billion records and 9.7 billion unique words. It covers a variety of content: news articles, web pages, Wikipedia pages, e-books, scientific papers, product reviews, user manuals, and so on.

Does Common Crawl include PDFs?

Can anyone shed light on whether Common Crawl includes PDFs, or whether it excludes them? Thanks. Amar.

This was a recent discussion on Reddit. Please quote the thread rather than the individual post so that the mods don't get confused.

For some reason, the PDFs themselves are not available in Common Crawl, although they are hosted in a variety of other formats. The Common Crawl site has no documentation on what to do with the files. We have done some work on organizing documents into collections, and they should end up there eventually.

This post is prompted in part by an older post about downloading Common Crawl. A quick search for "pdf download" led us to a thread asking a question that its authors could have answered themselves had they read the earlier post. Perhaps in future there should be a notice at the top of the page saying "hey, we just did some work on getting the files included", rather than the misleading "what do I do next?" answer. Perhaps the page should also sit a bit lower in the hierarchy. It's not as if Common Crawl has nothing useful for people to do, and many people go on to ask questions like this one. It's a small price to pay for more exposure for Common Crawl, since its dataset is such an important tool.

On 2012-02-13 00:10, jdw wrote: There is a lot to be discovered here and the collection is growing quickly. One thing to keep in mind is that many of the files are in the early days of research and there are not many curators for them yet. In addition, the pdfs have been collected in an unusual way (rather than being in a structured zip file) which means that some of the work in building the system requires further processing. Therefore it may take quite a while before most of the information is made accessible to researchers.
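One way to settle the question empirically is to look at the MIME type recorded for each capture in a crawl's URL index. The sketch below assumes index records arrive as one JSON object per line with `url` and `mime` fields; that layout is an assumption to check against the current index documentation, and the sample records are made up.

```python
import json


def pdf_captures(index_lines):
    """Return the URLs of captures whose recorded MIME type is PDF.

    `index_lines` is any iterable of JSON strings, one record per line.
    The `url` and `mime` field names are assumed, not guaranteed.
    """
    hits = []
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("mime") == "application/pdf":
            hits.append(record["url"])
    return hits


# Made-up sample records for illustration.
sample = [
    '{"url": "https://example.com/report.pdf", "mime": "application/pdf"}',
    '{"url": "https://example.com/index.html", "mime": "text/html"}',
]

print(pdf_captures(sample))
```

Note that a capture being listed as a PDF says nothing about whether the stored file is complete, so any hit would still need to be downloaded and checked.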

What does Common Crawl include?

All of Common Crawl's data is publicly available. It includes text and metadata from millions of web sources: news sites, books, academic and scientific papers, patent applications, video pages, blogs, and more. The records come from Common Crawl's own crawl of the public web rather than from other corpora, so sites such as Wikipedia and the Open Library appear insofar as the crawler fetched their pages.

Why is this so important? We believe that access to this data will benefit all kinds of researchers and developers, including:

Developers of search engines, social media sites, and other search services, who use our data to build tools that can find and understand text in all of these sources.

Our research partners who need to search across a wide variety of documents. For example, a team at UC Berkeley has been working on search engines for the deep web (i.e., the large body of content that standard search engines do not index).

Businesses and organizations who want to better understand how people are using their products and services. For example, one of our projects is to help businesses improve their SEO by understanding which articles on the web are most relevant to their target audience.

Researchers who study the web and search. For example, Common Crawl is a primary resource for studying search engines, language models, information retrieval, and related areas.

Why can't we just create our own datasets? It is incredibly challenging and time-consuming to collect such a large dataset ourselves, especially since it comes from many different sources and has many different formats. We would have to develop and maintain the infrastructure to capture, index, and preserve this data. To make it easier for others to do this work, we have created Common Crawl to make our dataset available as a service.

How can I contribute to Common Crawl? Contributing to Common Crawl is easy and free. Just send us a pull request on GitHub. We will take care of everything else.

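How do I find and download Common Crawl? Common Crawl is a dataset, not a search engine, so there is no free-text query box. Instead, you look up a URL or URL pattern in a crawl's URL index, which tells you which archive file holds each capture, and then download that data over HTTPS or from Amazon S3.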
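As a rough sketch of a programmatic lookup, the helpers below build a URL index query and the byte-range header for fetching a single capture out of a larger archive file. The index host, the `-index` path suffix, the crawl ID `CC-MAIN-2024-10`, and the `offset` and `length` values are all assumptions used for illustration; check them against the current Common Crawl documentation before relying on them. No network request is made here.

```python
from urllib.parse import urlencode

# Assumed index host; verify against the current documentation.
INDEX_HOST = "https://index.commoncrawl.org"


def index_query_url(crawl_id, url_pattern):
    """Build a query URL asking the index for captures of `url_pattern`."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def range_header(offset, length):
    """HTTP Range header selecting exactly `length` bytes from `offset`."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Example crawl ID and pattern, chosen only for illustration.
print(index_query_url("CC-MAIN-2024-10", "example.com/*"))

# Hypothetical offset and length, as an index record might report them.
print(range_header(offset=1000, length=500))
```

The range request matters because archive files are large: fetching only the bytes for one capture avoids downloading the whole file.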

Does Common Crawl include Wikipedia?

This is a question I've had on my mind since I started working with Common Crawl.

They have many sources of content in their index and some of them are really good, but I'd like to make sure that they don't accidentally include Wikipedia.

Here's why: Wikipedia, as we all know, has an agenda, so when Common Crawl shows its data to journalists, it gets attacked by Wikipedia's community for bias. What's more, the way Common Crawl releases its data sets and explains how it processes them doesn't give a sense of impartiality. There is a strong bias towards English-language content in the index.

Here's the issue. For example, "commoncrawl-2016-05-14T200224-0000" was created on April 26th, 2025 at 02:20:01 GMT. According to their own blog post, "The data is being released in weekly batches over the coming months, allowing you to monitor changes over time."

But they're currently releasing data sets at the end of each month. The problem with this is that these are mostly incomplete. It can be weeks before some sites update themselves, or start uploading content, or even start updating their articles. For example:

The data point "commoncrawl-2016-05-14T200224-0004" was created on May 3rd, 2025 at 02:52:48 GMT. That means that 4 days ago, it was still using Wikipedia's article on "United States House of Representatives". This article started updating in 2025.

If you look at this entry, you'll see it hasn't been updated since then, or at least not as recently as the latest data set. That makes sense, of course: it's a really old version. The problem is that the latest versions of this page are not being included, so they are never indexed for the data set. Because the statistics rely on this data, it will appear that Wikipedia has 0% change from 2025 to 2025.

This is just one example, but it's easy to find similar ones. Even if Wikipedia isn't included in the data, the statistics still show a 0% change from 2025 to 2025.
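When checking how old an entry is, it can help to read the timestamp embedded in the data set name and compare it with the stated creation date. The sketch below assumes the names follow the pattern `commoncrawl-YYYY-MM-DDTHHMMSS-NNNN`; that pattern is inferred only from the two names quoted in this post and is not a documented format.

```python
from datetime import datetime


def parse_dataset_name(name):
    """Split a name like 'commoncrawl-2016-05-14T200224-0004' into
    (prefix, embedded timestamp, sequence number).

    The naming pattern is an assumption based on the examples in this post.
    """
    prefix, rest = name.split("-", 1)
    stamp, sequence = rest.rsplit("-", 1)
    return prefix, datetime.strptime(stamp, "%Y-%m-%dT%H%M%S"), int(sequence)


prefix, embedded, sequence = parse_dataset_name("commoncrawl-2016-05-14T200224-0004")
print(prefix, embedded.isoformat(), sequence)
```

For the name above, the embedded timestamp reads as 14 May 2016, 20:02:24, which can then be set against whatever creation date is reported for the entry.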