How big is the Common Crawl dataset?
As you may have noticed, the Common Crawl dataset is massive.
A single crawl can run to more than 400 terabytes of data, and the archive as a whole is larger still. To help you understand how big that is, we're giving away a prize in the form of a terabyte hard drive! A terabyte is enough to store the complete works of Shakespeare and Emily Dickinson many times over, or the current text of English Wikipedia, yet it holds only a small fraction of the Common Crawl dataset.
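To put the 400 terabyte figure next to the one terabyte drive, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and an illustrative 1 Gbit/s connection, neither of which comes from the text above.

```python
import math

CRAWL_TB = 400        # size figure quoted above, in decimal terabytes
DRIVE_TB = 1          # capacity of the prize drive
LINK_GBIT_PER_S = 1   # assumed download speed, purely illustrative

# How many prize-sized drives it would take to hold 400 TB.
drives_needed = math.ceil(CRAWL_TB / DRIVE_TB)

# Share of the 400 TB that fits on a single drive.
fraction_on_one_drive = DRIVE_TB / CRAWL_TB

# 1 TB = 8e12 bits; divide by bits per second, then by seconds per day.
download_days = CRAWL_TB * 8e12 / (LINK_GBIT_PER_S * 1e9) / 86_400

print(f"Drives needed: {drives_needed}")
print(f"Share of the data on one drive: {fraction_on_one_drive:.2%}")
print(f"Days to download at {LINK_GBIT_PER_S} Gbit/s: {download_days:.0f}")
```

Under these assumptions a single drive holds a quarter of one percent of the data, and pulling down all 400 TB would take a little over a month of uninterrupted transfer.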
The drive is being provided as a prize to one lucky person, and we're asking you to send us a photo of the drive. In the photo, make sure to note the Common Crawl dataset, the terabyte drive, and your name.
You'll need to be at least 16 years old to win this prize. Good luck!
What can you use the Common Crawl data for?
The Common Crawl dataset is primarily intended for academic research. It is probably one of the largest open datasets available to academic researchers. Across all of its crawls it contains petabytes of data that can be used to study how the web has changed over time. If you're interested in this kind of research, make sure to check out the Common Crawl dataset page.
You can also use the Common Crawl dataset to build tools that take advantage of the data. These might include search engines built on Lucene or Solr, web crawlers, spam filters, content extraction tools like R2Text, document embedding models such as Doc2Vec, information retrieval systems, text analytics, and analytics systems that learn from the data. You don't have to use the Common Crawl dataset for research, either: there are many other ways to put it to work.
How do I access Common Crawl?
We recommend downloading the Common Crawl dataset via the Datacow download button on our downloads page.
If you don't have an account with datacow.org yet, register first. Once you're logged in, you can download either the 1TB 'raw' archive file or a compressed 100MB version of the Common Crawl. The compressed file can be decompressed with 7-Zip if necessary.
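Individual records can also be fetched programmatically rather than through a download button. The following is a minimal sketch, not the procedure described above: it assumes the public HTTPS endpoint at data.commoncrawl.org and an index record carrying `filename`, `offset` and `length` fields, and the sample record in it is made up for illustration.

```python
import gzip
import json
import urllib.request

# Assumption: public HTTPS endpoint serving the crawl files.
DATA_ENDPOINT = "https://data.commoncrawl.org/"


def record_request(index_line: str) -> tuple[str, dict[str, str]]:
    """Turn one JSON index record into a URL and an HTTP Range header."""
    record = json.loads(index_line)
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # HTTP Range end is inclusive
    url = DATA_ENDPOINT + record["filename"]
    return url, {"Range": f"bytes={start}-{end}"}


def fetch_record(index_line: str) -> bytes:
    """Download and decompress a single record (needs network access)."""
    url, headers = record_request(index_line)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())


# Illustrative record only: the filename, offset and length are invented.
example = json.dumps({
    "url": "https://example.com/",
    "filename": "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz",
    "offset": "1000",
    "length": "500",
})
url, headers = record_request(example)
print(url)
print(headers["Range"])
```

Requesting only the byte range of one record avoids downloading a whole archive file. `fetch_record` is defined but not called here, since the sample record does not point at a real file.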
To explore the data from the compressed file in more detail, there are two starting points:
To view the raw data files in the Common Crawl, click on the 'Downloads' tab and select either 'raw' or 'compressed' from the drop-down.
To browse and query the data in the Common Crawl, click on the 'Explore' tab, enter a corpus identifier (either an example text or the corpus's unique identifier), and select a dataset. In the dataset drop-down, select either 'de-identified' or 'full crawl'.
To create an example corpus and explore the data in it, click on the 'Create corpus' button.
How do I use a corpus?
To start working with a corpus, click on the 'Explore' tab, enter an example text or its unique identifier, and select the desired corpus. The corpora available to you will appear below the example text.
To create a new corpus, click on the 'Create corpus' button. Corpora can be filtered by status (new, active, updated, inactive, etc.) and by size. A status of 'New' means the corpus has not yet been indexed, 'Active' means the corpus is indexed and ready for use, and 'Inactive' means the corpus is awaiting active status. Clicking on a status will show more information. For further details on active status and what it entails, click on the 'active status' link in the Status column.
You can also filter corpora by the languages they are available in, and whether they are indexed in English or another language.
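The filtering described above can be pictured as a simple predicate over corpus records. This is a minimal sketch: the `Corpus` class, its fields, the `filter_corpora` helper and the sample entries are all hypothetical and are not taken from any actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical corpus record, for illustration only."""
    name: str
    status: str                 # e.g. "new", "active", "inactive"
    size_mb: int
    languages: set[str] = field(default_factory=set)


def filter_corpora(corpora, status=None, max_size_mb=None, language=None):
    """Return the corpora matching every criterion that was supplied."""
    selected = []
    for corpus in corpora:
        if status is not None and corpus.status != status:
            continue
        if max_size_mb is not None and corpus.size_mb > max_size_mb:
            continue
        if language is not None and language not in corpus.languages:
            continue
        selected.append(corpus)
    return selected


# Made-up sample data.
corpora = [
    Corpus("news-sample", "active", 120, {"en"}),
    Corpus("forum-sample", "new", 40, {"en", "de"}),
    Corpus("blog-sample", "inactive", 300, {"fr"}),
]

# Corpora that are indexed, ready for use, and available in English.
ready = filter_corpora(corpora, status="active", language="en")
print([c.name for c in ready])
```

Criteria left as `None` are ignored, so the same helper covers filtering by status alone, by size alone, or by any combination of the three.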