Monday, April 6, 2009


There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions. Given the importance of early detection of software defects, let's first review some general goals of testing an ETL application:

* Data completeness: Ensures that all expected data is loaded.

* Data transformation: Ensures that all data is transformed correctly according to business rules and/or design specifications.

* Data quality: Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.

* Performance and scalability: Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.

* Integration testing: Ensures that the ETL process functions well with other upstream and downstream processes.

* User-acceptance testing: Ensures the solution meets users' current expectations and anticipates their future expectations.

* Regression testing: Ensures existing functionality remains intact each time a new release of code is completed.

Data Completeness

One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:

* Comparing record counts between source data, data loaded to the warehouse and rejected records.

* Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.

* Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.

* Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30), make sure to test it with 30 characters.

* Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
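The first checks in the list above can be sketched as a small automated routine. This is a sketch under illustrative assumptions: rows are dicts, and the key field and string(30) "name" field are invented for the example, not from the article.

```python
# Minimal completeness checks: record-count reconciliation, key coverage,
# and truncation detection. All field names are illustrative.
def completeness_checks(source_rows, target_rows, rejected_rows, key_field):
    problems = []
    # 1. Record counts must reconcile: source = loaded + rejected.
    if len(source_rows) != len(target_rows) + len(rejected_rows):
        problems.append("record counts do not reconcile")
    # 2. Unique key values: every source key must be either loaded or rejected.
    source_keys = {r[key_field] for r in source_rows}
    accounted = ({r[key_field] for r in target_rows}
                 | {r[key_field] for r in rejected_rows})
    if source_keys - accounted:
        problems.append(f"keys lost in load: {sorted(source_keys - accounted)}")
    # 3. Truncation: the full contents of each field must survive the load,
    #    so a 30-character test value should arrive with all 30 characters.
    target_by_key = {r[key_field]: r for r in target_rows}
    for src in source_rows:
        tgt = target_by_key.get(src[key_field])
        if tgt is not None and len(tgt["name"]) < len(src["name"]):
            problems.append(f"truncation on key {src[key_field]}")
    return problems  # an empty list means the checks pass
```

A run with a deliberately truncated field should return exactly that one finding, which makes the routine easy to wire into a regression suite later.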

Data Transformation

Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and "stare and compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:

* Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing.

* Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process of populating data sets with the scenario spreadsheet to allow for flexibility because scenarios will change.

* Utilize data profiling results to compare range and distribution of values in each field between source and target data.

* Validate correct processing of ETL-generated fields such as surrogate keys.

* Validate that data types in the warehouse are as specified in the design and/or the data model.

* Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?

* Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.
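Two of the techniques above, scenario-driven validation and referential-integrity checking, can be sketched as follows. The transformation rule (upper-casing a status code) and all field names are invented for illustration, not taken from the article.

```python
# Scenario-driven validation of a transformation, plus an orphan check.
def transform(record):
    # Illustrative business rule under test: status codes are upper-cased.
    return {**record, "status": record["status"].upper()}

def run_scenarios(scenarios):
    # scenarios: (input record, expected output) pairs, as captured in the
    # scenario spreadsheet agreed with the business customer.
    return [(inp, expected, transform(inp))
            for inp, expected in scenarios
            if transform(inp) != expected]

def orphaned_children(child_rows, parent_rows, fk, pk):
    # Referential-integrity scenario: child rows whose foreign key has no
    # matching row in the parent table.
    parent_keys = {r[pk] for r in parent_rows}
    return [r for r in child_rows if r[fk] not in parent_keys]
```

Because the scenarios live in data rather than in code, the spreadsheet can change without the test harness changing, which is the flexibility the bullet above asks for.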

Data Quality

For the purposes of this discussion, data quality is defined as "how the ETL system handles data rejection, substitution, correction and notification without modifying data." To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example:

* Reject the record if a certain decimal field has nonnumeric data.

* Substitute null if a certain decimal field has nonnumeric data.

* Validate and correct the state field if necessary based on the ZIP code.

* Compare product code to values in a lookup table, and if there is no match load anyway but report to users.

Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what's loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database for users to view.
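The example rules above can be sketched as a single routine. Note that rules 1 and 2 are alternative treatments of the same condition; rule 1 (reject) is implemented here. The lookup tables and field names are illustrative assumptions.

```python
# Illustrative data-quality rules: reject nonnumeric decimals, correct the
# state from the ZIP code, and load-but-report unknown product codes.
ZIP_TO_STATE = {"10001": "NY"}        # illustrative ZIP-to-state lookup
VALID_PRODUCTS = {"P100", "P200"}     # illustrative product lookup table

def apply_quality_rules(record, rejects, report):
    out = dict(record)
    # Rule 1: reject the record if the decimal field has nonnumeric data.
    try:
        out["amount"] = float(out["amount"])
    except (TypeError, ValueError):
        rejects.append(record)
        return None
    # Rule 3: validate and correct the state field based on the ZIP code.
    expected = ZIP_TO_STATE.get(out.get("zip"))
    if expected and out.get("state") != expected:
        out["state"] = expected
    # Rule 4: unknown product code -> load anyway, but report it to users.
    if out.get("product") not in VALID_PRODUCTS:
        report.append(out["product"])
    return out
```

The `rejects` and `report` lists stand in for the reject file and the data quality report that, as noted above, are the only place users ever see what was done with invalid data.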

Performance and Scalability

As the volume of data in a data warehouse grows, ETL load times can be expected to increase, and performance of queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL design. The aim of performance testing is to point out any potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:

* Load the database with peak expected production volumes to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.

* Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to point out any areas of weakness.

* Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.

* Perform simple and multiple join queries to validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.
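The component-by-component comparison above can be sketched like this. The step names and the 2x slack threshold are illustrative choices, not prescribed values.

```python
# Time each ETL component, then compare small-volume and peak-volume runs
# to flag components whose runtime grows much faster than the data volume.
import time

def time_components(steps):
    # steps: list of (name, callable) ETL components; returns seconds each.
    timings = {}
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

def scaling_suspects(small_run, large_run, volume_ratio, slack=2.0):
    # A component whose time grows much faster than the data volume is a
    # scalability suspect (e.g., a repeated file read or hidden O(n^2) step).
    return sorted(name for name in small_run
                  if large_run[name] > small_run[name] * volume_ratio * slack)
```

Comparing per-component timings this way localizes the area of weakness instead of just reporting that the overall load missed its window.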

Integration Testing

Typically, system testing only includes testing within the ETL application. The endpoints for system testing are the input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break and focus on touchpoints between applications rather than within one application. Consider how process failures at each step would be handled and how data would be recovered or deleted if necessary.

Most issues found during integration testing are either data-related or the result of false assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment. As always, don't forget the importance of good communication between the testing and design teams of all systems involved. To help bridge this communication gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong in production. Run the overall process from end to end in the same order and with the same dependencies as in production. Integration testing should be a combined effort and not the responsibility solely of the team testing the ETL application.
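The field-randomization step mentioned above can be sketched as follows. The field names, salt, and token scheme are illustrative assumptions; the one design point worth noting is that a deterministic hash keeps joins working across tables.

```python
# Mask sensitive fields before copying production data to a test
# environment. Deterministic hashing hides the value but preserves
# joinability: the same input value maps to the same token everywhere.
import hashlib

def mask_record(record, sensitive_fields, salt="integration-test"):
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256((salt + str(masked[field])).encode("utf-8"))
            masked[field] = digest.hexdigest()[:12]  # short opaque token
    return masked
```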

User-Acceptance Testing

The main reason for building a data warehouse application is to make data available to business users. Users know the data best, and their participation in the testing effort is a key component to the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables, not the mechanics of how the ETL application works. Consider the following strategies:

* Use data that is either from production or as near to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.

* Test database views comparing view contents to what is expected. It is important that users sign off and clearly understand how the views are created.

* Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.

* Consider how the users would require the data loaded during UAT and negotiate how often the data will be refreshed.

Regression Testing

Regression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream systems changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.
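The store-and-compare strategy above reduces to a simple set difference; this is a sketch, with an illustrative row layout, that ignores row order on purpose so a resorted load does not register as a regression.

```python
# Diff a new release's result set against the stored baseline from a
# known-good run. Empty sets on both sides mean the release reproduces
# the baseline.
def regression_diff(baseline_rows, new_rows):
    old = set(map(tuple, baseline_rows))
    new = set(map(tuple, new_rows))
    return {"missing": old - new, "unexpected": new - old}
```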

Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.


1) Schedule variance = (Actual time taken - Planned time) / Planned time * 100

2) Effort variance = (Actual effort - Planned effort) / Planned effort * 100

3) Defect unearthing efficiency for pass 1 = 100 * STRs found in pass 1 / (STRs found in pass 1 + STRs found in pass 2 but already existing in pass 1)

Thus defect unearthing efficiency can be found for all passes.

Also, defect unearthing efficiency in terms of field STRs can be found as follows:

= 100 * Total STRs found in Polaris for a release / (Total STRs found in Polaris for a release + Field STRs for that release)

4) Test case efficiency = (Total STRs - STRs not mapped) / Total STRs * 100

5) Test case coverage = (Total test cases - STRs that cannot be mapped to test cases) / Total test cases * 100




E.g.: Non-Commented Source Statements (NCSS) per engineer per month

NCSS per Person Month

NCSS per Function Point

NCSS can also be replaced by KLOC (Kilo Lines of Code)


Thursday, February 5, 2009

Testing for Error Code

A common error encountered during testing is HTTP 404 Not Found.
Often this error code comes with many details about the web server and other components.
For Example:

Not Found

The requested URL /page.html was not found on this server.

Apache/2.2.3 (Unix) mod_ssl/2.2.3 OpenSSL/0.9.7g DAV/2 PHP/5.1.2 Server at localhost Port 80

This error message can be generated by requesting a non-existent URL.

After the common message showing that the page was not found, there is information about the web server version, the OS, modules and other products used.

This information about the OS and the applications can be very important during a penetration test, but web server errors aren't the only ones useful in a security analysis.

So, we'll analyze the next occurrence, which shows abnormal behaviour:

Microsoft OLE DB Provider for ODBC Drivers (0x80004005)

[DBNETLIB][ConnectionOpen(Connect())] - SQL server does not exist or access denied

What's happened?

We'll proceed step by step!

For example, 80004005 is a generic IIS error code indicating that it is not possible to access a database.

In many cases this code is followed by the version of the database, so with this information the pentester can plan an appropriate strategy for the security test.

Microsoft OLE DB Provider for ODBC Drivers error '80004005'

[Microsoft][ODBC Access 97 ODBC driver Driver]General error Unable to open registry key 'DriverId'

The first example shows a connection error message returned by a SQL Server database: the database server linked to the application is down, or the credentials don't allow access.

But that isn't the only information we gain; in this way we have also discovered the kind of operating system.

In this case we could verify whether the web application permits changing the values of the variables used to connect to the database.

In the second case we see a generic error in the same situation (we know the database version) but with a different error message and a different database server.

But in the end...It's the same thing!

And now, a practical example: a security test on a web application that loses the link with the database server, either because of badly written code (the next error message is caused by the application failing to resolve the database server name, or by a modified variable value) or because of other network problems.

For example, suppose we have a database administration web portal that connects to the db server after a log-on phase in order to run queries, create tables and modify database fields.

Well, during the POST of credentials for the log-on phase we meet this message, which reveals the presence of a MySQL database server:

Microsoft OLE DB Provider for ODBC Drivers (0x80004005)

[MySQL][ODBC 3.51 Driver]Unknown MySQL server host

If we see in the HTML code of the log-on page a hidden field with a database IP, we can try to change this value in the URL to the address of another database (our own database, for example).

Another example: knowing which database server serves a web application, we can take advantage of this information to carry out a SQL injection for that kind of database, or a persistent XSS test.

Information gathering on web applications with server-side technology is quite difficult, so the information discovered can be useful for the correct execution of an attempted exploit and for reducing false positives.

Black Box testing and example


telnet <host> 80

GET /<non-existent page> HTTP/1.1


HTTP/1.1 404 Not Found

Date: Sat, 04 Nov 2006 15:26:48 GMT

Server: Apache/2.2.3 (Unix) mod_ssl/2.2.3 OpenSSL/0.9.7g

Content-Length: 310

Connection: close

Content-Type: text/html; charset=iso-8859-1
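The Server banner in a response like the one above can be pulled out programmatically instead of being read off the telnet session by eye; a minimal sketch:

```python
# Extract the Server header from a raw HTTP response (the banner used for
# web server fingerprinting).
def server_banner(raw_response):
    # Headers end at the first blank line.
    head = raw_response.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n"):
        if line.lower().startswith("server:"):
            return line.split(":", 1)[1].strip()
    return None  # server chose not to advertise a banner
```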


1. network problems

2. bad configuration of the database host address


Microsoft OLE DB Provider for ODBC Drivers (0x80004005)

[MySQL][ODBC 3.51 Driver]Unknown MySQL server host


1. Authentication Failed

2. Credentials not inserted


Firewall version used for authentication

Error 407

FW-1 at <firewall name>: Unauthorized to access the document.

Authorization is needed for FW-1.

The authentication required by FW-1 is: unknown.

Reason for failure of last attempt: no user