Senin, 03 Desember 2012

Hot Deck Imputation for Missing Values

BY ALBERT ANTHONY D. GAVINO

Hot Deck Imputation

This method sorts respondents and non-respondents into a number of imputation subsets according to a user-specified set of covariates. An imputation subset comprises cases with the same values as those of the user-specified covariates. Missing values are then replaced with values taken from matching respondents (i.e. respondents that are similar with respect to the covariates).
If there is more than one matching respondent for any particular non-respondent, the user has two choices:
  1. The first respondent’s value as counted from the missing entry downwards within the imputation subset is used to impute. The reason for this is that the first respondent’s value may be closer in time to the case that has the missing value. For example, if cases are entered according to the order in which they occur, there may possibly be some type of time effect in some studies.
  2. A respondent’s value is randomly selected from within the imputation subset. If a matching respondent does not exist in the initial imputation class, the subset will be collapsed by one level starting with the last variable that was selected as a sort variable, or until a match can be found. Note that if no matching respondent is found, even after all of the sort variables have been collapsed, three options are available:
Re-specify new sort variables
  • The user can specify up to five sort variables.
Perform random overall imputation
  • Where the missing value will be replaced with a value randomly selected from the observed values in that variable.
Do not impute the missing value
  • SOLAS will not impute any missing values for which no matching respondent is found.

Kamis, 22 November 2012

When your text data type exceeds 255 characters

BY ALBERT ANTHONY D. GAVINO

A TEXT data type is a data type that establishes a field that can store text, or a combination of text and numbers (alphanumeric), but whose numbers are not used for calculations. Examples are phone numbers or e-mail addresses. When you create your tables through the Access user interface, you have two basic text types: TEXT and MEMO. But when you use SQL statements such as CREATE TABLE or ALTER TABLE to create your tables, you have many other synonyms of the TEXT and MEMO types to choose from.

In general, text fields can be up to 255 characters, while memo fields can be up to 65,535 characters, but if the memo field does not contain any binary data, then its only limit is the maximum size of the database (approximately 2.14GB or 1,070,000,000 Unicode characters). In addition, unused portions of text fields are not reserved in memory.

Personal Take:


When importing a text data type that exceeds more than 255 character, choose the MEMO type field option which has about 65, 535 characters, but this will also affect the size of your database and the speed of every query processed.


Minggu, 18 November 2012

Structural Equation Modeling (SEM) Endogenous variable

ENDOGENOUS VARIABLE

a : caused by factors inside the organism or system
b : produced or synthesized within the organism or system


this one was based on research taken from 
http://www.tcrecord.org/Content.asp?ContentId=16924

DATA ANALYSIS

The proposed conceptual model was tested by using structural equation modeling (SEM) analysis. Taking into consideration measurement errors that are often inherent in survey data, SEM defines latent, multidimensional constructs (e.g., parents’ expectations, students’ motivational beliefs) and observed variables (e.g., baccalaureate expectations) while testing the theoretical links and their directions among the key variables in the study. Rooted in psychology, psychometrics, and econometrics, SEM is an appropriate approach to testing whether an a prioristructural model fits the observed data grounded in relevant literature and theory (Kaplan, 2000). By and large a confirmatory (i.e., hypothesis testing) technique, SEM consists of two parts: (1) a measurement model based on a confirmatory factor analysis that connects latent variables to observed indicator variables; and (2) a structural model that uses path analysis that accounts for relationships among endogenous, exogenous, and latent variables according to the a priori theoretical framework (Kaplan, 2000; Schreiber Stage, King, Nora, & Barlow, 2006). Figure 2 describes the SEM diagram based on the theoretical model in Figure 1.






Rabu, 14 November 2012

Appending your table in MS Access


Overview

An append query selects records from one or more data sources and copies the selected records to an existing table.
For example, suppose that you acquire a database that contains a table of potential new customers, and that you already have a table in your existing database that stores that kind of data. You'd like to store the data in one place, so you decide to copy it from the new database into your existing table. To avoid entering the new data manually, you can use an append query to copy the records.

Benefits of using an append query

By using a query to copy data, you can:
  • Append multiple records in one pass    If you copy data manually, you usually have to perform multiple copy/paste operations. By using a query, you select all the data at once, and then copy it.
  • Review your selection before you copy it    You can view your selection in Datasheet view and can make adjustments to your selection as needed before you copy the data. This can be particularly handy if your query includes criteria or expressions, and you need several tries to get it just right. You cannot undo an append query. If you make a mistake, you must either restore your database from a backup or correct your error, either manually or by using a delete query.
  • Use criteria to refine your selection    For example, you might want to only append records of customers who live in your city.
  • Append records when some of the fields in the data sources don't exist in the destination table     For example, suppose that your existing customer table has eleven fields, and the new table that you want to copy from only has nine of those eleven fields. You can use an append query to copy the data from the nine fields that match and leave the other two fields blank.

Basic steps of an append query

The process of creating an append query follows these basic steps:
  • Create a select query    You start by selecting the data that you want to copy. You can adjust your select query as needed, and run it as many times as you want to make sure you are selecting the data that you want to copy.
  • Convert the select query to an append query    After your selection is ready, you change the query type to Append.
  • Choose the destination fields for each column in the append query    In some cases, Access automatically chooses the destination fields for you. You can adjust the destination fields, or choose them if Access did not.
  • Preview and run the query to append the records    Before you append the records, you can switch to Datasheet view for a preview of the appended records.
 IMPORTANT   You cannot undo an append query. Consider backing up your database or the destination table. For more information about backing up a database, see the article Back up a database.

Kamis, 08 November 2012

Using the Auto Classifier in IBM Data Modeler

BY ALBERT ANTHONY D. GAVINO

The IBM Data Modeler, the Auto Classifier under the Models tab is useful when you are not sure which Predictive Model to use, IBM Data Modeler gives you this benefit to make things easier and efficient, after which you can choose a specific model such as CHAID or C&R Tree and Data Modeler will do all the work for you, take note of the Gold Icons, these are your Models created from your data sets and your data filters, each model has its own accuracy depending on the nature of your data.


Tip: Always save your stream and your data file in one folder, this will help in preventing from errors in happening such as lost file stream or lost database, Do not put your files on your desktop and your stream file in another folder as this will cause problems in the future.

Have Fun with Data Modeling!

Rabu, 07 November 2012

SQL Statement Cheat Sheet

there are a bunch of SQL statements upon your disposal

SQL Statement
Syntax
AND / OR
SELECT column_name(s)
FROM table_name
WHERE condition
AND|OR condition
ALTER TABLE
ALTER TABLE table_name
ADD column_name datatype
or
ALTER TABLE table_name
DROP COLUMN column_name
AS (alias)
SELECT column_name AS column_alias
FROM table_name
or
SELECT column_name
FROM table_name  AS table_alias
BETWEEN
SELECT column_name(s)
FROM table_name
WHERE column_name
BETWEEN value1 AND value2
CREATE DATABASE
CREATE DATABASE database_name
CREATE TABLE
CREATE TABLE table_name
(
column_name1 data_type,
column_name2 data_type,
column_name2 data_type,
...
)
CREATE INDEX
CREATE INDEX index_name
ON table_name (column_name)
or
CREATE UNIQUE INDEX index_name
ON table_name (column_name)
CREATE VIEW
CREATE VIEW view_name AS
SELECT column_name(s)
FROM table_name
WHERE condition
DELETE (used to remove one or more rows in a table)
DELETE FROM table_name
WHERE some_column=some_value
or
DELETE FROM table_name
(Note: Deletes the entire table!!)

DELETE * FROM table_name
(Note: Deletes the entire table!!)
DROP DATABASE
DROP DATABASE database_name
DROP INDEX
DROP INDEX table_name.index_name (SQL Server)
DROP INDEX index_name ON table_name (MS Access)
DROP INDEX index_name (DB2/Oracle)
ALTER TABLE table_name
DROP INDEX index_name (MySQL)
DROP TABLE
DROP TABLE table_name
GROUP BY
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name
HAVING
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name
HAVING aggregate_function(column_name) operator value
IN
SELECT column_name(s)
FROM table_name
WHERE column_name
IN (value1,value2,..)
INSERT INTO
(used to add single or multiple rows to a table)
INSERT INTO table_name
VALUES (value1, value2, value3,....)
or
INSERT INTO table_name
(column1, column2, column3,...)
VALUES (value1, value2, value3,....)
INNER JOIN
SELECT column_name(s)
FROM table_name1
INNER JOIN table_name2
ON table_name1.column_name=table_name2.column_name
LEFT JOIN
SELECT column_name(s)
FROM table_name1
LEFT JOIN table_name2
ON table_name1.column_name=table_name2.column_name
RIGHT JOIN
SELECT column_name(s)
FROM table_name1
RIGHT JOIN table_name2
ON table_name1.column_name=table_name2.column_name
FULL JOIN
SELECT column_name(s)
FROM table_name1
FULL JOIN table_name2
ON table_name1.column_name=table_name2.column_name
LIKE
SELECT column_name(s)
FROM table_name
WHERE column_name LIKE pattern
ORDER BY
SELECT column_name(s)
FROM table_name
ORDER BY column_name [ASC|DESC]
SELECT
SELECT column_name(s)
FROM table_name
SELECT *
SELECT *
FROM table_name
SELECT DISTINCT
SELECT DISTINCT column_name(s)
FROM table_name
SELECT INTO
SELECT *
INTO new_table_name [IN externaldatabase]
FROM old_table_name
or
SELECT column_name(s)
INTO new_table_name [IN externaldatabase]
FROM old_table_name
SELECT TOP
SELECT TOP number|percent column_name(s)
FROM table_name
TRUNCATE TABLE
TRUNCATE TABLE table_name
UNION
SELECT column_name(s) FROM table_name1
UNION
SELECT column_name(s) FROM table_name2
UNION ALL
SELECT column_name(s) FROM table_name1
UNION ALL
SELECT column_name(s) FROM table_name2
UPDATE
(used to modify a single column in one or more rows in a table)
UPDATE table_name
SET column1=value, column2=value,...
WHERE some_column=some_value
WHERE
SELECT column_name(s)
FROM table_name
WHERE column_name operator value

USE Master, Create Database, create a table

USE Master;
GO
CREATE DATABASE Sales;
GO
USE Sales;
GO
CREATE TABLE Contacts (
   ContactsID int NOT NULL,
   Lastname varchar(50) NOT NULL,
   Firstname varchar(50) NOT NULL,
   Phone varchar(30) NULL
 )

Adding data rows unto your table

INSERT INTO Contacts
         (ContactsID, Lastname, Firstname, Phone)
VALUES (101, 'Gavino', 'Albert', '5245505')
                (102, 'Amora', 'Johnny', 'NULL')
SELECT * FROM Contacts;





Selasa, 30 Oktober 2012

Creating an Instance of SQL Server

BY ALBERT ANTHONY D. GAVINO

Security noteSecurity Note
By default, only members of the local administrators group can start, stop, pause, r
resume or restart a service.
To grant non-administrators the ability to manage services,
 see How to grant users rights to manage services in Windows Server 2003.
 (The process is similar on other versions of Windows.)

To start the default instance of SQL Server

  1. On the Start menu, point to All Programs, point to Microsoft SQL Server 2008 R2, point to Configuration Tools, and then click SQL Server Configuration Manager.
  2. In SQL Server Configuration Manager, in the left pane, click SQL Server Services.
  3. In the details pane, right-click SQL Server (MSSQLServer), and then click Start.
    A green arrow on the icon next to the server name and on the toolbar indicates that the server started successfully.
  4. Click OK to close SQL Server Configuration Manager.

To start a named instance of SQL Server

  1. On the Start menu, point to All Programs, point to Microsoft SQL Server 2008 R2, point to Configuration Tools, and then click SQL Server Configuration Manager.
  2. In SQL Server Configuration Manager, in the left pane, click SQL Server.
  3. In the details pane, right-click the named instance of SQL Server, and then click Start.
    A green arrow on the icon next to the server name and on the toolbar indicates that the server started successfully.
  4. Click OK to close SQL Server Configuration Manager.

To start an instance of SQL Server with startup options

  1. On the Start menu, point to All Programs, point to Microsoft SQL Server 2008 R2, point to Configuration Tools, and then click SQL Server Configuration Manager.
  2. In SQL Server Configuration Manager, in the left pane, click SQL Server.
  3. In the details pane, right-click the instance of SQL Server, and then click Properties.
  4. In the SQL Server (<instancename>) Properties dialog box, click the Advanced tab, and then click Startup Parameters.
  5. At the end of the original text, in the Value column, type the startup parameters you want, and then click OK. Separate parameters with a semi-colon, for example, –c;-m.
  6. Stop and restart SQL Server for the parameters to take effect.
    NoteNote
    SQL Server running on a cluster is best managed by using Cluster Administrator.
    For more information, see Using SQL Server Tools with Failover Clustering.

Rabu, 24 Oktober 2012

Benefits of SQL Server


Microsoft SQL Server is a robust and scalable solution all of your business application needs. Whether you're using it for data centralization for an Access front-end, a primary data source for your web-based software system, or for managing content on your business website; SQL Server is ideal for your application.

Learn more about the benefits of using SQL Server for your data storage needs:
  • Server Roles
  • Database Roles
  • Scalability
  • Centralized Data
  • Data Recovery
  • Security Benefits

Server Roles

Server roles provide levels of access to server operations and tasks.
Server roles are predefined and grant the permission for the whole server. These roles are not database specific and cannot be customized. The table below provides a listing and explanation for each type of server role.
sysadminAble to do anything in SQL Server
serveradminAble to modify SQL Server settings and shut down SQL Server
setupadminAble to install replication and control extended stored procedures
securityadminAble to control server logins and create database permissions
processadminAble to control SQL Server processes
dbcreatorAble to create and modify databases
diskadminAble to manage disk files
bulkadminAble to execute bulk insert statements

Database Roles

Database roles provide the assignment of a set of database-specific permissions to an individual or a group of users. Database roles can be assigned to Windows Authenticated logins or SQL Server Authenticated logins. Roles that are assigned to Windows Authenticated logins can be assigned to NT users and NT groups. Roles can also be built with a hierarchical model.
Database roles are database specific. SQL Server provides three types of roles:
  1. Predefined database roles
  2. User-defined database roles
  3. Implicit roles

Predefined Database Roles

Predefined database roles are standard SQL Server database roles. Each database in SQL Server has these roles. Predefined database roles make it easy to delegate responsibility.
Predefined database roles are database specific and cannot be customized. The table below provides a listing and explanation for each type of database role.
db_ownerHas complete access to all objects within the database, can drop and re-create objects, and has the capability to assign object permissions to other users. It can modify database settings and perform database maintenance tasks. This role encompasses all functionality listed in the other predefined database roles.
db_accessadminControls access to the database by adding or removing Windows Authentication users and SQL Server users.
db_datareaderHas complete access to SELECT data from any table in the database. This role does not grant INSERT, DELETE, or UPDATE permissions on any table in the database.
db_datawriterCan perform INSERT, DELETE, or UDPATE statements on any table in the database. This role does not grant SELECT permission on any table in the database.
db_ddladminHas the capability to create, modify, and drop objects in the database.
db_securityadminPerforms security management within the database. This role manages statement and object permissions and roles within the database.
db_backupoperatorHas the capability to back up the database.
db_denydatareaderDenies SELECT permission on all tables in the database. However, this role does allow users to modify existing table schemas. It does not allow them to create or drop existing tables.
db_denydatawriterDenies data modification statements (INSERT, DELETE, or UPDATE) from being performed against any tables in the database.
publicEvery database user is a member of the public role. A user automatically becomes part of the public role when she is permitted access to the database.

Scalability

Computer systems typically grow over time. Maybe you have a need for 30 concurrent users in your application now, but you have plans to grow and expand. SQL Server is built to be scalable and is able take your application from 30 users to thousands of users, quickly and easily.

Table size can be an issue with other database servers, but not SQL Server. The table size is only limited by the amount of space that you have on your physical server. This means that 90% of SQL Server users can store all of their data within the same instance.

This database server gives you so much flexibility that you can have multiple applications using the same SQL Server. Imagine your web applications and desktop applications using the same data, so that people in the office and employees remotely could simultaneously update data just as if they were sitting next to each other. When you choose SQL Server, you choose flexibility for your data-driven applications.

Centralized Data

Having a centralized place to store data is a great benefit to SQL Server users. Centralization is a primary SQL Server benefit and means that everyone is using the same data source. As a result there is no need to merge information together in order to receive an accurate version of a record.
With centralized data, every time that you retrieve a record, you will be confident that you have the latest information.

Data Recovery

One of the SQL Server benefits is the presence of built-in tools to manage data and database backups.  Imagine if you had a Microsoft Access database and it was deleted.  There is no good way to get your information back.  With SQL Server, you have the ability to schedule regular backups of the database.  These back-ups could be stored in a separate location for disaster recovery.
There are two kinds of backups available in Microsoft SQL Server:
  • Full Backup
  • Differential Backup

Full Backup

A full backup stores a complete copy of the database on any supported backup media.  You do not have to take the database offline to perform a full backup, however these backups consume a large amount of resources and users may notice that response times are a little slower.  Full backups can be scheduled for off peek times to minimize the impact that performance degradation would have on the end users.

Differential Backup

There are situations where you may not want to use as many resources as a full backup requires, but feel that it is important to have a backup at a certain time.  This is when you may consider a differential backup.  This backup uses less resources since it backs up information that has changed or been added since the last full backup.  **It is very important to note that a differential backup alone will not be enough to restore the database, since it only contains data that was modified since the last full backup.

Security Benefits

There is an old saying that states that knowledge is power.  The ability to securely store and retrieve data is one of the pillars of the IT world.  With SQL Server there are several security options to ensure that your data is stored safely.
SQL Server supports two security modes:
  1. Windows Authentication
    • Uses the credentials of the person logged into the Windows machine.
  2. SQL Server and Windows Authentication (Mixed Authentication)
    • Uses both Windows Authentication and SQL Server Authentication.
If you are using Mixed Authentication, please note that sa (which means system administrator) by default, does not contain passwords.  The sa username is very powerful and gives the user the right to manipulate all databases on SQL Server.  After the installation of SQL Server, please make sure that changing the sa password is at the top of your list of things to do.
If you are using Mixed Authentication it is good practice to create a username and password for applications that would restrict access to only the databases that the application uses.
There are two types of permissions in SQL Server:
  • Server Roles
  • Database Roles

Disadvantages of using MS Access, why choose MS SQL


While there are many advantages to choosing Microsoft Access as your database management system, there are also a few disadvantages that you might want to consider.

Concurrent Users

Even though Microsoft has stated that Access is able to support 255 concurrent users, it is a more practical choice to select Access when the database will be used by only 15 to 20 simultaneous users. If more than 20 users try to gain access to the database, performance and response time may begin to suffer.

Multiple Windows Operating Systems

Whenever various Windows operating systems like Windows 98, Windows ME, Windows 2000, Windows XP, or Windows Vista are used to access the same database, the database may experience corruption issues. It is recommended that all Access users are operating with the identical operating system.

Novice User – Design and Development Limitations

A user who is new to software development may find that they are a bit overwhelmed when using Microsoft Access. In order to maximize the potential of Access, a thorough understanding of relational database design along with VBA coding is highly recommended. As with any software development tool, the ease of screen use, database performance, automation of data input, and report quality relies on the skill level and experience of the developer.

Sensitive Data Needs

If you have sensitive data needs like storing health and medical records, financial information, academic records or employment data, for example, you will require more extensive database security than Access can offer. One suggestion would be to use SQL Server as the database engine with an Access front-end. We can help. Please contact us to discuss your needs. We have a solution that will work for you.

Internet Limitations

Microsoft Access does have a few limitations. Access was not designed to be used with applications that are web based since its forms and reports are only functional in a Windows environment and are not compatible with an internet browser like Internet Explorer. The Access Jet database that is used by Microsoft Access is a file based system and does not contain the server features that are available in SQL Server, Oracle or other ODBC compliant databases. Access databases are more suited for web based solutions when the number of users is small or when the users are just viewing data rather than editing or deleting.

Remote Access

In the past, record locking activities in Access caused an increase in response time which resulted in slow performance on a VPN (Virtual Private Network) or WAN (Wide Area Network) when accessing the Access Jet database. However, this problem has been eliminated as a result of quicker broadband connections. It is important to note that performance can also be enhanced if a continuous connection is maintained to the back end database throughout the session rather than opening and closing the sessions every time a table is requested. If you find that Access database performance over a VPN or WAN connection continues to suffer, then perhaps Microsoft Terminal Services is an effective alternative. We have found that Access databases linked to a SQL Server or Access Data Projects work quite well over VPN and WAN connection. Please contact us and let us know how we can help.

Access on a Mac

Unfortunately, Microsoft Access is not designed to operate on Apple Macintosh computers. Since Access is commonly used in many businesses that use PC’s, Microsoft does not have any future plans to expand the database to be functional on a Mac.

Reference:
http://www.sql-programmers.com/home.aspx

Selasa, 23 Oktober 2012

Export your tables into SQL Server Express


Export Microsoft Access (v2003) tables into
SQL Server Express Edition (v2005)

People want to migrate their tables from the measly MS Access to a much powerful SQL Server Express Edition.
  • Open Microsoft SQL Server Management Studio Express
  • Create a database for the import, if necessary
  • Open the Microsoft Access database that will be imported into SQL
  • Switch to “Tables” in Microsoft Access
  • Right click on the Table to be exported
  • Select “Export”
  • In the ‘Save Type As…’ field select “ODBC Databases()”
  • Name the Table and Press “OK”
  • Press the “New…” button
  • Select “SQL Server” and Press “Next>”
  • Enter a [name] for the Data Source then Press “Next>”
  • Press “Finish”
  • Enter a Description if necessary
  • Select the SQL Server to connect to and Press “Next>”
  • Select the correct Authentication settings and Press “Next>”
  • Check “Change the default database to:” and Select the correct Database then Press “Next>”
  • Press “Finish”
  • Press “Test Data Source…” to very the information entered was correct
  • If test completed successfully then Press “OK”
  • In “SQL Server Login” dialog box, enter correct authentication information and Press “OK”
  • Export should occur without any further dialog boxes or prompts
  • Switch back to the Microsoft SQL Server Management Studio Express
  • Navigate to the database, refresh the console and verify the table was created/imported
Sample Video of creating a table in MS SQL server


Data Warehousing, first step in Data Mining.

Data Warehousing is one of the important aspects in data mining, without a data warehouse, you will get lost with all your data, Imagine thousands of excel files that you will be merging without any primary key, this will really make you feel lost, Always structure your data in such a way that they can be linked together like a star schema or snow flake schema, make sure that your data lies in one platform, it can be in MS SQL Server, Oracle or a large scale database server.

Once all of these are in place, then that is the time you will be able to do data mining, up to the last minute detail. This type of process is also helpful in creating your Digital Dashboard.


Kamis, 18 Oktober 2012

Creating Metrics for your Digital Dashboard

by Albert Anthony D. Gavino



Most businesses think or Educational Institutions think its pretty easy to make a digital dashboard, when in fact its not, the first thing you may want to consider is your available data, now if your data is in MS SQL server or in one intact database and assuming most of its records are clean and quantitative in nature, which means they consist of dates, binary codes (1,0), income levels in one rate then they are good to go, However if your database is in rotten or dirty condition, doing a data mining is not a good idea.

Furthermore, you may want to consult with your Board of Directors what kind of metrics they want, is it a metric of INCOME, what is the target Income for this year, these things are important in establishing your dashboard, which in turn will become valuable reports real-time. Its the same scenario when your buying or selling a certain stock, the data is in real time.

for example:
Number of Actual Students for SY 2011-2012: 5,000 (is your actual data)Target Metric of Students for SY 2011-2012: 6,000 (is the metric set by your BOT)
Also consider the power of your hardware in your company or institution, this way, when users query reports on your dashboard it will be speedy.

Digital Dashboard/ Business Intelligence Software

  • Qlikview
  • Cognos Insight
  • Oracle Business Intelligence
  • Dundas
  • Microsoft

Online Applications:

One of the famous free apps you can find on the web for your dashboard solutions for a website is your friendly Google Analytics apps, Google analytics is free and tags your customer base globally and regionally, most of its stats are based on IP addresses and even be used for marketing such as Google Ad Words.

to get a google analytics account you will need a gmail account and just sign up to this URL, http://www.google.com/analytics/ Voila! you know have a digital dashboard for your web site.

to set your own metrics, Google Analytics allows you to set them on your goals and objectives.

Who uses Business Intelligence and Predictive Analytics Software?

  • Banks who approve house loans, credit loans
  • Credit card companies who approve credit card applications
  • Education institutions who would want to screen for scholarship applications



Using Crosstabs to validate or quality check data


One of the useful things in SPSS is to use Crosstabs to validate data, you can also put category and sub-category as what we do in pivot tables in excel, to do this just click on Analyze>Descriptive Statistics>CrossTabs (SPSS version 15.0)




You can put variables on row vs columns to compare the dataset(s) for quality assurance.

Categories and Sub-Categories, Solution to cleaning "string" and "text" fields

by Albert Anthony D. Gavino

Region or Geographic Location Categories and Sub-Categories


Database Administrators have problems cleaning "strings" and "texts", to prevent this from happening you could always resort to putting categories and sub-categories, for example think of a solution for ZIP-CODES, it can be broken further into COUNTRY, REGION, CITY.

In my case I live in the Philippines, so its good to have LUZVISMIN which are 3 categories namely Luzon, Visayas and Mindanao and then Region has 17 regions including NCR, and lastly the difficult part would be CITY such as Metro Manila, Cavite and Quezon City for example. This gives you a clear idea that your variable ZIP code needs to be connected to at least three other fields, so when creating your ZIP code table, include these 3 additional variables to make life easier, hence no need to clean your "STRINGS" or "TEXT FILES" for other variables that you can't categorize, make sure you have a category such as "OTHERS", rule of thumb of what I do is limit your "OTHERS" category to at least 10 percent. 

for Geographic Region
here, your category is structured as: 
Category Design: COUNTRY > LUZVISMIN > Region > City
ex. Philippines > Luzon > NCR > Manila

Geographic Tip: 
Add Country: ABROAD, to make way for Foreign Students/ Applicants

University or College Categories or Sub-Categories

the same thing can be used for a School or University say for example:
Category Design: University > College > Degree Program 

Ex. De La Salle University > College of Engineering > B.S. in Mechanical Engineering

Personal Tip:
Don't use text boxes in your web app forms, if you want your Database to be clean, instead USE Drop Down Boxes and Radio buttons, this also follows for Survey Forms. As much as possible do not use text fields, Use them only for qualitative purposes such as Comments and Feedback, but do not use them for categories as this will be a big problem to your database cleaning.

Philippine Regions

Luzon
  1. NCR
  2. CAR
  3. Ilocos
  4. Cagayan Valley
  5. Central Luzon
  6. CALABARZON
  7. MIMAROPA
  8. Bicol
Visayas
  1. Western
  2. Central
  3. Eastern
Mindanao 
  1. Zamboanga
  2. Northern Mindanao
  3. Davao
  4. SOCCSKSARGEN
  5. Caraga
  6. ARMM