Data: Collect, organise, clean, check
Having planned the data collection, learners gather the data in the Data stage of PPDAC. If they are using a publicly available dataset, the task is to download it and store it on a local computer for analysis. Some datasets can be explored interactively online without downloading (e.g. Gapminder). If the learners are gathering a fresh dataset, they need to decide in advance how to structure the data so that they can use it in the Analysis stage.
Learners should know how to:
- Collect data methodically and store it in a logical structure;
- Set up digital tools to assist in data collection where appropriate (e.g. online survey tools or sensors);
- Organise and structure data digitally;
- Check data for accuracy;
- Clean and prepare data for analysis or processing.
Learning to structure information according to multiple attributes is a core aspect of computer science (further information available at http://teachcs.scot/), but it can first be learned through simple everyday tasks. In earlier levels, learners may use simple data collection methods in the real world, such as collecting, categorising, stacking and counting physical objects. They will gradually progress to using written (e.g. tally marks) or digital representations of items within categories (e.g. using spreadsheets or databases). They will learn about different types of data (e.g. numbers, categories, ranked data such as 1st, 2nd, 3rd), and how items can be sorted using different attributes (e.g. books on a shelf could be organised alphabetically by author name, or by ISBN, genre or cover colour).
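For learners who have progressed to digital representations, the idea of sorting the same items by different attributes can be shown in a few lines of Python. This is only a minimal sketch: the book records, the attribute names and the `sort_by` helper are all made up for illustration.

```python
# Hypothetical book records: each dictionary is one item,
# and each key is one attribute of that item.
books = [
    {"author": "Clark", "genre": "science", "cover": "blue", "rank": 2},
    {"author": "Adams", "genre": "fiction", "cover": "red", "rank": 3},
    {"author": "Baker", "genre": "fiction", "cover": "green", "rank": 1},
]


def sort_by(items, attribute):
    """Return a new list of items ordered by the chosen attribute."""
    return sorted(items, key=lambda item: item[attribute])


# The same three books end up in a different order for each attribute.
by_author = [book["author"] for book in sort_by(books, "author")]
by_cover = [book["author"] for book in sort_by(books, "cover")]
by_rank = [book["author"] for book in sort_by(books, "rank")]

print(by_author)
print(by_cover)
print(by_rank)
```

Text attributes such as author and cover colour are sorted alphabetically, while the numeric rank is sorted from smallest to largest, which gives a natural opening to discuss different types of data.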
If the learners have collected the data themselves, they should check it for accuracy to make sure that they have not introduced copying errors or typos which might cause misleading results later. As most analysis software requires the data to be in a particular format, it may be necessary to clean and prepare the data, for example by removing stray whitespace characters or inserting commas between values. Data cleaning is necessary but tedious and time-consuming, so it may be helpful to use one of our pre-prepared datasets to save time.
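The cleaning and checking described above can be sketched in Python as follows. The example is illustrative rather than a recommended workflow: the raw lines, the column names and the `clean_line` helper are invented, and the approach assumes that no individual value contains a space.

```python
import csv
import io

# Hypothetical hand-typed measurements with uneven spacing.
# One height contains a typo: the letter "O" instead of the digit zero.
raw_lines = [
    "  name   height_cm  ",
    "Ava    152",
    " Ben  14O ",
    "Cara 149  ",
]


def clean_line(line):
    """Remove extra whitespace and put a comma between values."""
    return ",".join(line.split())


# Clean: every line becomes comma-separated with no stray spaces.
cleaned = [clean_line(line) for line in raw_lines]

# Read the cleaned text as CSV, using the first line as column headers.
rows = list(csv.DictReader(io.StringIO("\n".join(cleaned))))

# Check: flag any row whose height is not made up only of digits.
suspect = [row["name"] for row in rows if not row["height_cm"].isdigit()]

print(cleaned)
print(suspect)
```

Here the check only flags the suspicious entry; deciding whether to correct it, re-measure or discard it is left to the learners, which keeps the judgement about accuracy in their hands.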