2: Male
2: 36–50
3: 51–65
Note that we have used 9, 99, and 999 as missing value codes. Such codes are common, but it is essential when the data is analyzed that they be explicitly excluded from any summary statistics.
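As a minimal sketch of what excluding missing value codes looks like in practice, the following Python fragment compares a mean computed with and without the code 999. The field names, the records, and the assignment of codes to fields are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical mapping of each variable to its missing-value code(s).
MISSING_CODES = {"sex": {9}, "age_group": {9}, "weight_kg": {999}}


def valid_values(records, field):
    """Return the values of `field` with missing-value codes removed."""
    missing = MISSING_CODES.get(field, set())
    return [r[field] for r in records if r[field] not in missing]


# Three made-up records; the second has a missing weight, the third a missing sex.
records = [
    {"sex": 1, "age_group": 2, "weight_kg": 70},
    {"sex": 2, "age_group": 3, "weight_kg": 999},
    {"sex": 9, "age_group": 2, "weight_kg": 80},
]

# The naive mean treats 999 as a real weight and is badly inflated.
naive_mean = mean(r["weight_kg"] for r in records)

# The clean mean uses only the two genuine weights.
clean_mean = mean(valid_values(records, "weight_kg"))

print("naive:", naive_mean, "clean:", clean_mean)
```

Most data analysis programs let you declare missing value codes once so that they are excluded automatically; the point of the sketch is only that the exclusion has to be stated somewhere.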
30.3 Selecting a Data Storage Program
Although specific programs are likely to become ever more complex, and simpler programs may eventually become available, the basic options (a spreadsheet, a relational database, and a data analysis program) are likely to endure. Here are some suggestions on how to choose an appropriate program.
First of all, find out what software and support are available to you from your institution. Software can be expensive, but many institutions have IT groups that provide software at low or no cost. Some grants will support software purchases.
Select the least complicated type of software that will serve your purpose. It may be appealing to use a highly regarded package, such as SAS®, but that may have two disadvantages. First, there may be a steep learning curve that will take up time you should be spending on other aspects of the study. Second, the more complicated the program, the more likely it is to require some programming to run. A more complicated program also has more rules you must obey and can be more frustrating when things do not work.
Unless you know the program very well, make sure there is someone who can help you in case you have a problem. This could be another member of your research group, a helpful colleague, or, if you are lucky, an IT group at your institution. Such groups may provide free or low-cost services, or may be run on a full cost-recovery basis, making them similar to other consulting resources.
SAS®, the Statistical Analysis System, is a very well-known and widely used system for data analysis. It has many features, including data management features, a structured query language, and up-to-date statistical methods. It is expensive, but some institutions have a site license that makes it available to researchers at a nominal annual cost. However, it takes substantial time to learn and become proficient in its use. Moreover, the latest statistical procedures may be far more than you need for your study, and are likely to require a great deal of specialist knowledge to use appropriately. R (www.r-project.org) is a free statistical package used by many statisticians engaged in statistical research, but it also requires a substantial time commitment to become proficient. However, there are many publications on how to use both these programs for basic data analysis, which may be all you need.
Remember that most data analysis programs will compute whatever statistics you ask for, but that does not mean that those are the right statistics to describe your data or test your hypotheses.
A 4-point rating scale is used to assess severity of disease, with 0 meaning no disease, 1 used for mild disease, 2 for moderate disease, and 3 meaning severe disease. Any software program will happily calculate a mean severity from the numeric values. However, many different distributions can give the same mean score. A mean score of 1.500 could arise if 50% of the group have mild disease (coded as 1) and 50% have moderate disease (coded as 2). Another possibility is that 75% of the group have mild disease (coded as 1) and 25% have severe disease (coded as 3). In the most extreme case, 50% of the group have a score of 0 (no disease) and 50% have a score of 3 (severe disease), and the mean would still be 1.500. Calculating the frequency of the codes is far more meaningful. In the last case, the distribution shows that half the group do not have the disease at all.
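The arithmetic in this example can be checked with a short Python sketch. It builds the three distributions described above for a hypothetical group of 100 patients (the group size and the labels are assumptions made for illustration) and prints the mean alongside the percentage of patients at each severity code.

```python
from collections import Counter
from statistics import mean

# Three hypothetical groups of 100 patients, coded 0 (none) to 3 (severe).
groups = {
    "mild/moderate": [1] * 50 + [2] * 50,
    "mild/severe": [1] * 75 + [3] * 25,
    "none/severe": [0] * 50 + [3] * 50,
}


def frequency_table(scores):
    """Return the percentage of scores at each severity code 0 to 3."""
    counts = Counter(scores)
    n = len(scores)
    return {code: 100 * counts.get(code, 0) / n for code in range(4)}


for name, scores in groups.items():
    # Every group has the same mean, but the frequency tables differ.
    print(name, "mean =", mean(scores), "percent by code =", frequency_table(scores))
```

All three groups report a mean of 1.5, yet only the frequency table reveals that the third group is split between patients with no disease and patients with severe disease.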
30.4 Methods of Data Capture
If the data is in written format, you need to have it entered into the computer, which, if there is a lot of data, may require the help of data entry staff. If the data is structured, as in an intake form with clearly marked fields, you can enter data directly from the form. Often, when existing documents are being used rather than special forms for a study, you do not want to convert all of the data on a document to a computer file. You can either mark up the document showing the fields you want to keep or, if that would be unacceptably messy or hard to follow, manually transfer the data you need to a new form with a clear format. This will reduce errors in entering the data into a computer but may introduce errors in copying.
Data entry may be by key entry, scanned documents, direct transfer from measuring equipment to a data file, or direct entry by the participant or interviewer. Key entry is the most susceptible to error. Scanning equipment has become more accurate over time but is still susceptible to error, particularly if some responses are handwritten. Often software must be adapted for an application, and the adapted software must be validated before being put into use, usually by applying it to a small test data set designed to include some tricky cases. Direct transfer of a data file is usually technically error-free, but the data often must still be reviewed for content errors, such as assay errors. There may also be transmission errors, although technical procedures can be used to identify when these occur.
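One common technical procedure for detecting transmission errors is to compare a checksum of the file computed before sending with one computed after receipt. The text does not name a specific procedure, so the following Python sketch is only an illustration under that assumption; the file contents and function names are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def transfer_is_intact(received: bytes, expected_digest: str) -> bool:
    """True if the received bytes match the digest computed by the sender."""
    return sha256_of(received) == expected_digest


# Hypothetical file contents as sent by the measuring equipment.
sent = b"id,assay\n1,4.2\n2,5.1\n"
digest = sha256_of(sent)  # the sender transmits this alongside the file

# Simulate a transmission error that alters one value.
corrupted = sent.replace(b"5.1", b"5.7")

print("intact copy passes:", transfer_is_intact(sent, digest))
print("corrupted copy passes:", transfer_is_intact(corrupted, digest))
```

A matching checksum shows only that the file arrived unchanged. It says nothing about content errors, such as a faulty assay, which still require review.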
Since the advent of portable computers and tablets, data collected with questionnaires or interviews can be directly entered into a computer file. Some studies also collect data on smartphone apps, and we expect that this will become more common in the future. It is essential to have the software verify that the data values are acceptable (for example, within a given range of numbers) as they are entered and notify the person entering the information if there is an error. It is obviously an advantage for accuracy to have any errors or inconsistencies identified immediately so that they can be corrected on the spot. This is particularly important if the data is being directly entered by a participant, as there will be no source document that can be consulted if there are inconsistencies in the data entered. There may be a disadvantage, however, if the flow of the interview would be interrupted by error messages or if the messages remind the individual being interviewed that their answers are being recorded. If the data entry process is sufficiently awkward or time-consuming, the participant may drop out of the study, which is the last thing you want.
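A range check at the moment of entry might look like the following Python sketch. The field names and limits are hypothetical; a real study would take them from its own data dictionary.

```python
# Hypothetical fields and their acceptable ranges (inclusive).
ALLOWED_RANGES = {"age_years": (18, 110), "severity": (0, 3)}


def check_entry(field, raw_value):
    """Validate one typed-in value.

    Returns (value, None) if the entry is acceptable, or (None, message)
    with a message that can be shown to the person entering the data.
    """
    low, high = ALLOWED_RANGES[field]
    try:
        value = int(raw_value)
    except (ValueError, TypeError):
        return None, f"{field}: '{raw_value}' is not a whole number"
    if not low <= value <= high:
        return None, f"{field}: {value} is outside the range {low} to {high}"
    return value, None


# Example: an acceptable entry, an out-of-range entry, and a mistyped entry.
for raw in ["2", "7", "two"]:
    print(raw, "->", check_entry("severity", raw))
```

Returning a message rather than raising an error leaves the data entry screen free to decide how and when to interrupt the person, which matters when error messages might disturb the flow of an interview.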
In a cross-sectional study of sexual behaviors in men who have sex with men, a smartphone app was used to collect data anonymously. Most of the questions were simple yes or no questions, but there were also questions about number of partners and frequency of activity. If a person reported no partners in the previous week, the questions about activities in the previous week were skipped. If a person reported partners but answered 0 to the frequency of each specific activity, then the option “other activities” appeared as a fill-in field to try to obtain some information. Individuals who entered a number of partners but reported no activities at all were sent a query after the initial data was saved. The query pointed out that the data was not consistent and asked whether they wished to revise any of their answers.
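The skip pattern and consistency query in this example can be expressed as a small decision function. The Python sketch below is an illustration only, not the app's actual logic; the step names, argument names, and activity labels are hypothetical.

```python
def next_step(partners, activity_counts=None, other_activities=None):
    """Return the next step of the questionnaire (hypothetical step names).

    partners          number of partners reported for the previous week
    activity_counts   dict of activity -> frequency, or None if not yet asked
    other_activities  free-text answer, or None if not yet asked
    """
    if partners == 0:
        # No partners reported: the activity questions are skipped.
        return "skip_activity_questions"
    if activity_counts is None:
        return "ask_activity_questions"
    if any(count > 0 for count in activity_counts.values()):
        return "done"
    if other_activities is None:
        # Partners reported but every specific activity is 0.
        return "ask_other_activities"
    if other_activities.strip() == "":
        # Still no activity at all: save first, then query the inconsistency.
        return "save_then_query_inconsistency"
    return "done"


# Example: partners reported, all frequencies 0, and the fill-in left blank.
print(next_step(2, {"activity_a": 0, "activity_b": 0}, ""))
```

Saving the data before raising the query, as the study did, means that a participant who ignores the query still contributes the answers already given.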
The approach to data collection needs to be determined before data is collected, as changes during the course of the study may affect data quality and reliability. For this reason, we recommend that any study include a small pilot phase to test all the procedures before data collection from participants begins.