In this blog post, we will look at freely available datasets for Machine Learning that can be used for learning, analysis and prediction.
Machine Learning is one of the biggest game-changers in the technological world. The technology has huge potential and it is finding its place in the applications and services we use in today’s world. Machine Learning depends on algorithms, and these algorithms need data in order to find patterns and make predictions. In short, data is the bread and butter of these algorithms. Let us start with a few things to consider before using datasets for machine learning tasks.
A few things to consider for a dataset
The quality of the dataset directly affects the accuracy of the predictions made by the Machine Learning model. How you judge quality depends on a number of factors, and these vary from project to project and with what your application is trying to achieve. Here are a few common factors to consider before searching for a dataset. There are many others as well.
- Metadata: This describes the structure of the dataset and provides important information such as which data types are used, how the data is arranged and how to interpret it.
- Availability: This is another important aspect to consider, as it defines whether the data will be available on a timely basis and how frequently it is updated.
- Accuracy: This factor indicates whether the values in the dataset are authentic, as given by the source, so that using them will not cause any ambiguity. The data should be consistent with the metadata and maintain its content integrity.
- Source: The source of the dataset is another important criterion to look at, since it tells you how reliable and consistent the dataset is and whether it will be available when you need it.
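The metadata and accuracy factors above can be checked with a quick look at the data before you commit to a dataset. Here is a minimal sketch, assuming the dataset is a CSV file. The sample data and the helper names `infer_type` and `profile_csv` are made up for illustration, and the type guessing is deliberately simple. It reports a guessed type and the number of missing values for each column, which you can then compare with what the dataset's metadata claims.

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """age,income,city
34,52000,Austin
,61000,Boston
29,,Austin
41,48000,
"""


def infer_type(values):
    """Guess a column type from its non-empty values: int, float or str."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def profile_csv(text):
    """Return {column: {"type": ..., "missing": ...}} for CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        # Treat absent or whitespace-only cells as missing.
        values = [(row[col] or "").strip() for row in rows]
        report[col] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
        }
    return report


report = profile_csv(SAMPLE_CSV)
for column, stats in report.items():
    print(column, stats)
```

For a real file, you could read its contents and pass the text to `profile_csv`. A column that the metadata describes as numeric but that shows up here as `str`, or one with many missing values, is a sign to look more closely before using the dataset.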
Datasets for Machine Learning
Kaggle: This is one of the best sources for finding datasets for learning purposes. Kaggle contains a variety of real-life datasets in many different formats and sizes, submitted by its members. The good part of Kaggle is that each dataset comes with discussions and tasks, so you can submit your own solution and study the solutions provided by other members. The analyses published by data scientists are also available.
Google Dataset Search: If you would like to find your dataset using a search engine, then Google Dataset Search is the best place to look. The search engine has millions of datasets already indexed, and best of all, you can apply filters to the search to find the type of dataset you are interested in. You can look for table-based, text-based or image-based datasets.
World Bank Data Catalog: The World Bank publishes various datasets covering population demographics, a wide range of economic data for countries, and development indicators from across the world.
GitHub Awesome Public Datasets: This is another great place to find a huge, categorized list of high-quality datasets on GitHub. The list brings together datasets collected from various blogs and from answers and responses provided by users. For me, this is one of the best places to start looking for datasets.
UCI Machine Learning Repository: This is a repository that maintains over 100 datasets as a service for the machine learning community. The repository contains datasets like Anonymous Microsoft Web Data, Census Income, Badges, Car Evaluation, etc.
VisualData: This website contains more than 400 datasets related to Computer Vision research. The site contains interesting datasets such as the Oktoberfest Food dataset for detecting food, the 3DPeople Dataset for detecting dressed humans, and Deeper Forensics, a large dataset for real-world face forgery detection.
Amazon Review Dataset: The dataset contains 233 million Amazon customer reviews, which makes it a great source for customer sentiment analysis. The data is well categorized by product type, such as Automotive, Books, Amazon Fashion, etc. The dataset is also available in smaller subsets for experimentation, containing only the ratings.
Berkeley DeepDrive: If you are interested in researching autonomous driving, then this site is the best stop for you. It contains over 100,000 driving videos and over 1,100 hours of driving experience recorded at different times of the day, covering various day and night conditions.
This is the list I follow to find datasets for my machine learning projects. I hope you found this post helpful. Thanks for visiting, cheers!