data1 = """id,col1_string,col2_string,col3_int,col4_float,col5_string
010,g1,g1-duck,1,1.0,r1
110,g1,g1-dog,2,2.1,r1
210,g2,g2-duck,3,3.2,r2
310,g2,g2-dog,4,4.3,r2
410,g1,g1-goat,5,5.4,r1
510,,g1-dog,6,6.5,r1
610,g2,g2-dog,7,7.6,r2
710,g2,g2-duck,8,8.7,r2
810,g1,g1-dog,9,9.8,r1
811,g1,g1-duck,10,10.9,r1
812,g1,g1-duck,11,11.0,r1"""
data2 = """id,col1_string,col2_string,col3_int,col4_float,col6_string
010,g1,g1-duck,1,1.0,z1
110,g2,g2-duck,2,2.1,z2
210,g2,g2-duck,4,3.3,z2
310,g1,g1-dog,3,4.5,z1
410,g1,g1-dog,5,5.9,z1
510,g1,g1-dog,6,6.5,z2
610,g2,g2-goat,7,,z2
710,g2,g2-duck,8,8.7,z1
810,g1,g1-dog,9,9.8,z1
911,g2,g2-dog,10,10.91,z2
912,g2,g2-dog,11,11.01,z2
922,g1,g1-duck,12,11.1,z2
"""
df1 = pandas.read_csv(StringIO(data1))
df2 = pandas.read_csv(StringIO(data2))DatacompyReport
Datacompy
DatacompyReport
DatacompyReport (compare_instance)
A class to modify the output of a datacompy instance and add reporting functionality
| Details | |
|---|---|
| compare_instance | a datacompy.core.Compare instance |
DatacompyReport.prepare_compare_results
DatacompyReport.prepare_compare_results ()
Reformats the Datacompy output to display all column differences found for matches between the two DataFrames based on the keys compare.join_columns
DatacompyReport.create_excel_report
DatacompyReport.create_excel_report (write_file:str, use_names_in_headers:bool=True)
Create Excel workbook with nicely formatted output comparing two DataFrames
| Type | Default | Details | |
|---|---|---|---|
| write_file | str | full path + filename where the Excel file is to be created | |
| use_names_in_headers | bool | True | if True, replaces df1 and df2 with Comapre.df1_name, Comapre.df2_name |
| Returns | None | Excel workbook written at write_file location |
Create Test Data
Create a Datacompy Compare Instance to Compare df1 and df2
compare = datacompy.Compare(
df1,
df2,
join_columns=['id', 'col1_string'], # Key columns, can be single or list
abs_tol=0, # Optional, defaults to 0
rel_tol=0, # Optional, defaults to 0
df1_name='Base', # Optional, defaults to 'df1'
df2_name='Compare' # Optional, defaults to 'df2'
)Create a New DataCompyReport Instance
report = DatacompyReport(compare)
report datacompy_report instance:
Comparing: Base and Compare
The .compare_instance attribute is the datacompy.core.Compare object. datacompy.core.Compare attributes can be accessed directly.
for item in dir(report.compare_instance):
if item[0] != '_':
print(f'report.compare_instance.{item}')report.compare_instance.abs_tol
report.compare_instance.all_columns_match
report.compare_instance.all_mismatch
report.compare_instance.all_rows_overlap
report.compare_instance.cast_column_names_lower
report.compare_instance.column_stats
report.compare_instance.count_matching_rows
report.compare_instance.df1
report.compare_instance.df1_name
report.compare_instance.df1_unq_columns
report.compare_instance.df1_unq_rows
report.compare_instance.df2
report.compare_instance.df2_name
report.compare_instance.df2_unq_columns
report.compare_instance.df2_unq_rows
report.compare_instance.ignore_case
report.compare_instance.ignore_spaces
report.compare_instance.intersect_columns
report.compare_instance.intersect_rows
report.compare_instance.intersect_rows_match
report.compare_instance.join_columns
report.compare_instance.matches
report.compare_instance.on_index
report.compare_instance.rel_tol
report.compare_instance.report
report.compare_instance.sample_mismatch
report.compare_instance.subset
View Modified Matched Differences Report
Datacompy provides most of the methods and attributes necessary to create a very useful comparison of two DataFrames. For some particular use cases though it is helpful to represent the data where matches on key columns are found but at least one difference exists (by value and/or by type) in a column they both share. Further, there is no added value reporting values in columns where the two DataFrames match for this use case so matching values are blanked out.
To create such a table: report.prepare_compare_results()
report.prepare_compare_results()| col2_string_df1 | col2_string_df2 | col3_int_df1 | col3_int_df2 | col4_float_df1 | col4_float_df2 | ||
|---|---|---|---|---|---|---|---|
| id | col1_string | ||||||
| 210 | g2 | 3.0 | 4.0 | 3.2 | 3.3 | ||
| 410 | g1 | g1-goat | g1-dog | 5.4 | 5.9 | ||
| 610 | g2 | g2-dog | g2-goat | 7.6 | NaN |
Create Excel Report
To create an Excel report, use report.create_excel_report('comparison_report_test_data.xlsx').
The workbook will contain multiple sheets:
- Row Summary - overall comparison of DataFrame1 and DataFrame2 by row
- Column Summary - overall comparison of DataFrame1 and DataFrame2 by row
- Columns Stats Summary - overall comparison of column metadata for DataFrame1 and DataFrame2
- DataFrame1 Observations Only - all rows (by key as defined by
self.compare_instance.join_columns) found in DataFrame 1 but not DataFrame 2 - DataFrame2 Observations Only - all rows (by key as defined by
self.compare_instance.join_columns) found in DataFrame 2 but not DataFrame 1 - Matched Differences - all rows (by key as defined by
self.compare_instance.join_columns) found in both DataFrame1 and DataFrame2 having at least one difference in a non-key field. The Matched Differences DataFrame replaces all fields where the DataFrames match with blanks by usingreport.prepare_compare_results()
report.create_excel_report('comparison_report_test_data.xlsx')Excel Workbook has been created