Data Maintenance Programs

Data Maintenance Programs
Prev	Chapter 8. The Gombemi-MI Custom Programs	Next

These are the programs that are used in the entry and maintenance of the Gombe-MI Master tables. Their use should be fully documented elsewhere in manuals describing how the system is used when performing specific tasks. The summary written here provides a statement of purpose and a mention of all updated data. The operation and behavior of the programs supports the table and program characteristics documented in this manual. For more information on the actual capabilities of the programs see the documentation in the programs' source code.

Some programs are designed to upload data in “batch” -- each run of a program uploads a single file containing multiple lines of data, each of which is then inserted into the database as a row of data in one (or more^[68]) database tables. Other programs provide utility functions.

The upload programs take as input a file of data arranged in tabular format. The file is expected to contain plain text, with rows on separate lines and columns separated by a single tab character. This data structure can be produced by exporting data from a spreadsheet as tab delimited text.

The programs upload the data into the database in an all-or-nothing fashion. Ether all the data in a uploaded file is inserted into the database or, should any error occur, none of it is. After an error each program continues to process the uploaded file so as to catch additional errors. However the input line containing the erroneous data is ignored so the “trial”^[69] insertion into the database of the subsequent lines in the uploaded file may result in spurious errors due to the “missing” data. It is left to the operator to distinguish the real errors from the false positives.

Tip

When in doubt simply correct the errors that are clearly problems, notably the first error reported, and re-run the program.

Tip

For reasons of security most browsers will remove pathnames from forms. Should a program which imports data into Gombe-MI from a file find an error in the data, rather than re-enter the pathname of the file to be uploaded simply press the browser's "reload" button. This (usually, depending on the browser) redoes the upload using the previously entered file name -- but with the new, now-corrected, data content.

Each time any of the Gombe-MI web programs successfully uploads a file into a database Gombe-MI remembers the name of the file and the database. The Gombe-MI programs raise an error to warn a user who has upload a file of the same name into the same database twice in a row. Re-uploading a third time overrides the warning, allowing the re-upload to complete.

For more information on whether data is required to present, as well as other required characteristics of the data values, see the documentation of the specific column into which the data is stored.

UPLOAD_BEHAVIORS: View to upload follow data

The UPLOAD_BEHAVIORS view takes the place of a custom web-based upload program. The Upload program can be used to insert data into this view and thence into the various follow related tables.

Upload: Upload Into Any Table or View

Upload uploads data into any table or view. Its primary purpose is to report as many data errors as possible during a single upload.^[70]

Table and Column Names

Tip

The name of the table uploaded into may be qualified with a schema name to upload data into tables or views that are not in the gombemi schema.

Both the table name, entered into the web browser, and the column names, appearing as column headings in the uploaded file, are taken exactly as written. Capital characters, spaces, or other unusual characters are taken exactly as written. This is in contrast to typical occurrences of table and column names within PostgreSQL SQL statements, where such identifiers are normalized and automatically converted to lower case. The Upload program can therefore insert into tables with unusual names or columns, at the cost of being persnickety about table and column names.

The one exception to the above rule is that when the Upload program is told to trim leading and trailing spaces from the data in each uploaded column. In this case leading and trailing spaces are also trimmed from the column names appearing in the first line of the uploaded file.

Trim Leading and Trailing Spaces

The upload program has a checkbox which determines whether or not it removes (“trims”) leading and trailing spaces from each uploaded datum -- from each column of each row in the uploaded file.

`NULL` Values

There are 2 ways to upload NULL data values. The easiest is to omit the column. Columns without some other default value will be given NULL values. The second is to check the checkbox labeled "Upload NULL Values" and supply a input value for NULL. Data values that match the given NULL representation will then be given a NULL value in the database.

The default NULL representation is the empty string, no data at all. When this representation is used (and the "Upload NULL Values" checkbox is checked) data that are omitted in the input file becomes NULL when uploaded into the database.

Tip

\N is sometimes used to represent the NULL value.

Caution

A space (or multiple spaces) may be chosen as the NULL representation. This can be difficult to discern while operating the program.

Upload Data Input Format

The uploaded data must be in either CSV or tab delimited format. When in tab delimited format the first line of the input file must contain the column names, each separated by a tab. The remaining lines of the file contain the data to be uploaded. Each line is a row of data, each column is separated from its neighbor with a tab character.

Caution

There is no standard for the CSV format, it is defined by the operation of the programs which read and write the format. At the time of this writing the program attempts to read CSV formatted data as produced by Excel, which is probably described more or less as follows: the column delimiter must be a comma, the character used to quote textual data, if any quote character is used, must be the double quote character, and escaping must be done by doubling the double quotes. At the time of this writing this format can be obtained from Excel by exporting in CSV format with a comma delimiter.

Attempting to upload data which contains the ASCII 0 character or spreadsheets with individual cells that contain multiple lines of text is likely to fail in unpredictable ways or in ways which are difficult to detect.

A line need not contain as many data elements as there are column names given in the first line. All unspecified data elements will be given a blank value, the empty string, just as if, with tab delimited data, the tabs occurred but no data were specified.^[71]

A line must not have more data elements than there are column names given in the first line.

Upload_GC: Upload Group Composition Data

The Upload_GC program uploads group composition data into the database. The program creates rows in the RAW_GROUPS and RAW_GROUPMEMBERS tables and the database itself then creates the necessary rows in GROUPS and GROUPMEMBERS.

Caution

Upload_GC calls mom_in_gc(), infants_in_gc(), and add_social_partners() on the uploaded follows. Use of any other means to add group composition data to the database or modify group composition data can result in failure to meet the assurances provided by the aforementioned functions.

One way to correctly approach such “manual” group composition alterations is to alter the group composition in the database, delete all the RAW_GROUPMEMBERS rows having an Origin value of A, and then re-run mom_in_gc(), infants_in_gc(), and add_social_partners() on the altered follows.

There must be no existent information on an uploaded follow's group composition in the database.^[72]

Upload_GC Data Input Format

The uploaded data must be in either CSV or Tab delimited format (each column separated from the next by a tab character). Leading and trailing spaces are removed from the data contained in each column before any other processing is performed.

The overall layout of the uploaded data is similar to that of the data sheet collected in the field, described in the Gombe Chimp Database Handbook. The first line of the file contains column headings and each of the remaining lines contains the group composition recorded for a given time interval of a specified follow.

All of the group composition information related to a follow must be in contiguous lines of the uploaded file. There must be no group composition information of a different follow interspersed among the lines which comprise any follow's group composition data.

The first line of the file, as mentioned, contains the column headings. Column headings may appear in any order. There are 2 kinds of column headings, those to which the program assigns specific uses and those which represent ChimpIDs.

All non-ChimpID columns, unless otherwise noted, are required. Character case is significant, unless otherwise noted, both in column headings and elsewhere. Although the non-ChimpID columns listed below are shown in mixed-case^[73] in the uploaded file all of these column headings must be lower case.

Upload_GC non-ChimpID Columns

Column	Description	Purpose
Researcher	The PEOPLE.Descr value identifying the field researcher who collected the data. The comparison is done in a case-insensitive fashion.	One of the columns which must match (the FIELDFOLLOWS.Collector value) when determining the follow to which the group composition belongs.
Date	The date of the follow. The date may be formatted in any manner acceptable to PostgreSQL.	One of the columns which must match (the FOLLOWS.Date value) when determining the follow to which the group composition belongs.
Mother	A ChimpID of the mother in the follow.	One of the columns which must match (the FOLLOWPARTS.ChimpID value in the row identifying the chimp who is the mother in the follow) when determining the follow to which the group composition belongs. The supplied ChimpID must match exactly the value recorded in the FOLLOWPARTS.ChimpID of the row identifying the chimp who is the mother in the follow. Other ChimpIDs which the mother may have are not allowed.
Page	The page number written on the field data collection sheet recording the group composition.	One of the columns which must match (Sheets.Page) when determining the follow to which the group composition belongs. The uploaded value must be blank when the page numbers on the sheets were not entered into the database -- when the Sheets.Page value of the follow is `NULL`.
Time_On_Sheet	The time of the group comp as it was written on the data entry sheet.	The value entered into this column is ignored. The presence of this column in the uploaded file is optional.
Time_Converted_Mil	The time of the group composition, written in military time form.	The time value used in the RAW_GROUPS. Time column, and consequently the value used to relate the group composition to the intervals during which behavioral data was collected. Although defined by protocol to be military time the program will accept a time value formatted in any manner acceptable to PostgreSQL.
Time_Type	The time encoding scheme used when the data was collected in the field.	The value entered into this column is ignored.
Entered	Whether or not the group comp information was entered into the MS Access version of the database.	The value entered into this column is ignored. The presence of this column in the uploaded file is optional.
GCStatus	The status code assigned to the group composition information collected on the follow, per field researcher who collected the information.	The value to record in the FIELDFOLLOWS.GCStatus column. All of the uploaded GCStatus values for a given follow for a given field researcher must match.
NYA	A column for recording baboon presence.	Unlike regular ChimpID columns this column may never contain information on paracallosal swelling.
	An entirely empty column.	For purposes of visual formatting the uploaded file may, but is not required to, contain (at most) a single column that contains no data what so ever.

All column headings not listed above are expected to be valid ChimpID values. The entries in the rows for the ChimpID columns must contain values found in GC_TRANS.Mark. When the mark indicates that the individual identified in the column heading is present a row is created in the RAW_GROUPMEMBERS table.^[74]

Tip

The GC_TRANS table should contain a row having a Mark value which is the empty string (a blank value) because the group composition sheet collected in the field uses a blank to indicate absence.

Note

A ChimpID column heading is not tested for validity unless at least one of marks contained in the column indicates presence in the group.

At least one ChimpID column must exist in the uploaded file.

Logout: Logout From Gombe-MI Custom Programs

The logout program logs the user out of the Gombe-MI web based collection of programs.

Note

Gombe-MI consists of many programs, only some of which are web based programs written specifically for Gombe-MI. The logout program only controls access to those programs written specifically for Gombe-MI, other “off the shelf” programs (notably phpPgAdmin) have their own logout mechanisms.

Note

Logout from Gombe-MI is automatic after a period of inactivity.^[75]

Wwwdiff: World Wide Web based Difference program

The wwwdiff program compares two text files. It can be found on the Gobemi-MI Web site.

Among other uses, this program provides a useful data validation mechanism. To validate data, have two different individuals enter the data and compare the results with the wwwdiff program. It is unlikely that both individuals will make identical errors and so almost all data transcription errors should be caught using this method.

Tip

The program uploads the two files to be compared. For security reasons most web browsers will always clear the names of the uploaded files once they have been uploaded. This makes it difficult to repeatedly upload the same or similar files, as when re-comparing two files after correcting errors. The situation is not as bad as it might sound because browsers will often provide a “browse” button and keep track of the directory last accessed, removing the need to re-navigate to the location of the data files. But it is still awkward to have to repeatedly point and click.

One solution is to use the browser's “reload” button. This will repeat the upload and comparison of the two files, but using the new, corrected, file content. A second, less desirable possibility is to have the have the pathnames of the files handy in a text document and cut and paste them as needed. A third possibility might be to use the browser's “back” button, but browsers will often clear the file upload information in this case in the same fashion they would with password information.

The wwwdiff program provides 4 comparison methods:

Tabular by Word: Useful where the data consists of individual words aligned as rows and columns of data. Compares the files contents on a word-by-word basis and displays the entire content of both files as a table with differences marked.

Caution

When one file contains whitespace^[76] that is not in the other, this comparison method shows extra cells in the output. Words are separated by whitespace but because there is only extra whitespace the cells are empty. Thus, when one file contains more whitespace than the other those rows will contain more columns. Normally the data in the extra cells would be color coded to inform as to whence they came, but because the cells are empty there is nothing to color code. The operator must compare the files by hand to determine which file contains the extra whitespace.
By Word: Useful where the data consists of words and there are relatively few changes or where paragraphs have been refilled and words have moved from line to line. Compares the file contents on a word-by-word basis displaying the entire content of both files as plain text.
By Line: Useful when there is a large amount of textual data. Compares the file contents on a line-by-line basis and displays only a small amount of context surrounding those portions of the text which differ between the files.
By File: Useful when comparing non-text files. Reports the location (by line and byte offset from the beginning of the files) of the first difference found.

^[68]Uploading into a single row of a view can update multiple tables, and programs designed to handle specialized input data formats may update arbitrary portions of the database as needed.

^[69]Once an error occurs no changes will be committed to the database.

^[70]Many other upload facilities will abort on the first error.

Upload also provides an assured way to upload into views. At the time of this writing most PostgreSQL front ends are unable to import data into views, although this is in the process of being addressed.

^[71]This may or may not result in a NULL value in the database, depending on how the program is invoked.

^[72]This is due to the need to call the aforementioned functions. Uploading a single follow's group composition in multiple uploads would result in incorrect group composition inference.

^[73]For legibility and for consistency in style with the manner in which other column names are written in this document.

^[74]The database then, in turn, creates a GROUPMEMBERS row.

^[75]Currently 1 hour.

^[76]Spaces, tabs, and whatever other characters that can't been seen.