| 1 |
CONTENTS |
|---|
| 2 |
PATIENT MATCHING ATTRIBUTE - Describes the format of the PersonAttributeType the module uses |
|---|
| 3 |
|
|---|
| 4 |
STRING COMPARATORS - Lists the different comparators available when matching |
|---|
| 5 |
|
|---|
| 6 |
CONFIGURATION FILE - Describes requirements of the configuration file |
|---|
| 7 |
|
|---|
| 8 |
CONFIGURATION FILE TAGS - Explains the elements in the configuration file |
|---|
| 9 |
|
|---|
| 10 |
|
|---|
| 11 |
PATIENT MATCHING ATTRIBUTE |
|---|
| 12 |
The module prefers to use a special matching PersonAttributeType named "Other Matching Information". Its value is a list of demographic-value pairs in the form "<demographic1>:<value1>;<demographic2>:<value2>; . . . ". If a demographic has no value, it can either be left out of the string or be given an empty string as its value. |
|---|
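As an illustration of the format above, the following is a minimal sketch of how such a string could be split into demographic-value pairs. It is not the module's own parser. The function name is hypothetical, the sample values are made up, and the choices to trim whitespace and to split on the first ":" only are assumptions.

```python
def parse_matching_attribute(value):
    """Parse "<demographic>:<value>;..." into a dict (illustrative sketch).

    A demographic with an empty value maps to an empty string; a demographic
    that is absent from the string does not appear in the result.
    """
    result = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ";" or stray whitespace
        # Assumption: split on the first ":" only, so a value may contain ":"
        name, _, val = pair.partition(":")
        result[name.strip()] = val.strip()
    return result


# Labels taken from the configuration excerpt; the values are invented.
parsed = parse_matching_attribute("mrn:12345;ln:Smith;yb:1970;zip:")
print(parsed)
```

Here "zip" is present with an empty value, which the format treats the same as leaving it out.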
| 13 |
|
|---|
| 14 |
If there is no person attribute of that type, the module falls back to whatever basic information it can find. Currently this fallback is very minimal and would not produce good matches. |
|---|
| 15 |
|
|---|
| 16 |
STRING COMPARATORS |
|---|
| 17 |
The string comparators that can be used are: |
|---|
| 18 |
Exact Match - case-sensitive comparison of the whole string; similarity is either 0 or 1 |
|---|
| 19 |
Levenshtein - Levenshtein edit distance / longest string length |
|---|
| 20 |
Longest Common Substring - Regenstrief algorithm; converts the strings to a case-insensitive form before comparison |
|---|
| 21 |
Jaro Winkler - |
|---|
| 22 |
|
|---|
| 23 |
The implementations for Levenshtein and Jaro Winkler comparators come from the Simmetrics library at http://www.dcs.shef.ac.uk/~sam/simmetrics.html. The threshold for Jaro Winkler |
|---|
| 24 |
and Longest Common Substring is a score of 0.8. The threshold for Levenshtein is 0.7. |
|---|
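To make the scoring concrete, here is a minimal sketch of the Exact Match and Levenshtein comparators. It is not the Simmetrics implementation. It assumes the Levenshtein similarity is 1 minus (edit distance / longest string length), so that higher scores mean closer strings, and that a score equal to the threshold counts as agreement. All function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / longest string length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def exact_match_similarity(a, b):
    """Case-sensitive comparison of the whole string: 1.0 or 0.0."""
    return 1.0 if a == b else 0.0


LEVENSHTEIN_THRESHOLD = 0.7  # threshold stated in the text above


def levenshtein_agrees(a, b):
    # Assumption: a score equal to the threshold counts as agreement.
    return levenshtein_similarity(a, b) >= LEVENSHTEIN_THRESHOLD
```

Under these assumptions, "smith" and "smyth" differ by one edit out of five characters and agree, while "kitten" and "sitting" differ by three edits out of seven and do not.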
| 25 |
|
|---|
| 26 |
|
|---|
| 27 |
CONFIGURATION FILE |
|---|
| 28 |
The default configuration file is "link_config.xml" in the current working directory. For an OpenMRS module, this is the Tomcat directory, such as "C:\Program Files\Apache Software Foundation\Tomcat 6.0" on Windows. If the link table is in a database other than Postgres or MySQL, the JDBC driver needs to be on the classpath when the program is run. |
|---|
| 29 |
|
|---|
| 30 |
An excerpt of a valid configuration file is: |
|---|
| 31 |
<?xml version="1.0" encoding="UTF-8" ?> |
|---|
| 32 |
<Session> |
|---|
| 33 |
<datasource name="link_test" type="DataBase" access="<JDBC driver>,<database URL>,<user>,<passwd>" id="3"> |
|---|
| 34 |
<column include_position="0" column_id="mrn" label="mrn" type="string"/> |
|---|
| 35 |
<column include_position="1" column_id="ln" label="ln" type="string"/> |
|---|
| 36 |
. . . |
|---|
| 37 |
<column include_position="17" column_id="openmrs_id" label="openmrs_id" type="string"/> |
|---|
| 38 |
</datasource> |
|---|
| 39 |
<analysis type="scaleweight"> |
|---|
| 40 |
<init>JDBCdriver,databaseURL,user,passwd</init> |
|---|
| 41 |
</analysis> |
|---|
| 42 |
<run estimate="true" name="conversion"> |
|---|
| 43 |
<row name="yb"> |
|---|
| 44 |
<BlockOrder>1</BlockOrder> |
|---|
| 45 |
<BlckChars>40</BlckChars> |
|---|
| 46 |
<Include>false</Include> |
|---|
| 47 |
<TAgreement>0.9</TAgreement> |
|---|
| 48 |
<NonAgreement>0.1</NonAgreement> |
|---|
| 49 |
<ScaleWeight lookup="TopN" N="100.0" buffer="500">true</ScaleWeight> |
|---|
| 50 |
<Algorithm>Exact Match</Algorithm> |
|---|
| 51 |
</row> |
|---|
| 52 |
. . . |
|---|
| 53 |
<row name="zip"> |
|---|
| 54 |
<BlockOrder>null</BlockOrder> |
|---|
| 55 |
<BlckChars>40</BlckChars> |
|---|
| 56 |
<Include>true</Include> |
|---|
| 57 |
<TAgreement>0.9</TAgreement> |
|---|
| 58 |
<NonAgreement>0.1</NonAgreement> |
|---|
| 59 |
<ScaleWeight>null</ScaleWeight> |
|---|
| 60 |
<Algorithm>Exact Match</Algorithm> |
|---|
| 61 |
</row> |
|---|
| 62 |
</run> |
|---|
| 63 |
</Session> |
|---|
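The following sketch shows one way to inspect a configuration of this shape with Python's standard library. It is not part of the module. The sample document is a shortened, hypothetical version of the excerpt above: the access value uses placeholder text, only two fields are listed, and the function name is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample; the connection details are placeholders.
SAMPLE_CONFIG = """<Session>
  <datasource name="link_test" type="DataBase" access="driver,url,user,passwd" id="3">
    <column include_position="0" column_id="yb" label="yb" type="string"/>
    <column include_position="1" column_id="zip" label="zip" type="string"/>
  </datasource>
  <run estimate="true" name="conversion">
    <row name="yb">
      <BlockOrder>1</BlockOrder>
      <BlckChars>40</BlckChars>
      <Include>false</Include>
      <Algorithm>Exact Match</Algorithm>
    </row>
    <row name="zip">
      <BlockOrder>null</BlockOrder>
      <BlckChars>40</BlckChars>
      <Include>true</Include>
      <Algorithm>Exact Match</Algorithm>
    </row>
  </run>
</Session>"""


def summarize_run(xml_text):
    """Return (blocking field names in block order, [(field, algorithm)] compared)."""
    root = ET.fromstring(xml_text)
    blocking, compared = [], []
    for row in root.iter("row"):
        name = row.get("name")
        order = row.findtext("BlockOrder", default="null").strip()
        if order != "null":  # "null" marks a non-blocking field
            blocking.append((int(order), name))
        if row.findtext("Include", default="false").strip() == "true":
            compared.append((name, row.findtext("Algorithm", default="").strip()))
    blocking.sort()  # order the blocking fields by their BlockOrder number
    return [name for _, name in blocking], compared


blocking_fields, compared_fields = summarize_run(SAMPLE_CONFIG)
print(blocking_fields, compared_fields)
```

In this sample, "yb" is the only blocking field and "zip" is the only field compared between records.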
| 64 |
|
|---|
| 65 |
|
|---|
| 66 |
CONFIGURATION FILE TAGS |
|---|
| 67 |
The elements and attributes of the XML configuration file are: |
|---|
| 68 |
Session the root element |
|---|
| 69 |
Datasource a source of Record objects |
|---|
| 70 |
Name for file sources, gives the path; for databases, gives the table name |
|---|
| 71 |
Type type of datasource: CharDelimFile, DataBase, Vector |
|---|
| 72 |
Access how to access the datasource. For a character-delimited file, it is the delimiter. For a database, it is a String holding connection information |
|---|
| 73 |
ID - a numeric unique identifier for the data source |
|---|
| 74 |
Column one column of fields in the datasource |
|---|
| 75 |
Include_position if the column is part of the analysis, its position in the order. Zero-indexed |
|---|
| 76 |
Column_id name of the column. For a character-delimited file, it is an index. For a database table, it is the column name |
|---|
| 77 |
Label the name used by the linkage program and in the run section. It should be one of the demographics that appear in the matching person attribute |
|---|
| 78 |
Type either string or numeric; used in sorting and comparisons |
|---|
| 79 |
Run a set of link options to use with the datasources |
|---|
| 80 |
Estimate whether to use EM (expectation maximization) to modify values |
|---|
| 81 |
Name a label for this configuration |
|---|
| 82 |
Row the options for a field in the Record |
|---|
| 83 |
Name the name of the field, must match the label in the Datasource element |
|---|
| 84 |
BlockOrder if the field is a blocking field, then uniquely number this starting with 1 |
|---|
| 85 |
BlckChars the number of characters to block on if the field is a blocking field |
|---|
| 86 |
Include indicates if the field will be compared between records |
|---|
| 87 |
TAgreement the true agreement value |
|---|
| 88 |
NonAgreement the non agreement value |
|---|
| 89 |
ScaleWeight - true for enabling weight scaling, null for disabling |
|---|
| 90 |
lookup - Determines the tokens that will be loaded to the lookup table. Possible values are: TopN, TopNPercent, AboveN, BelowN, BottomNPercent, BottomN |
|---|
| 91 |
N - Defines the size of the lookup table. Must be a decimal number; use a value between 0.0 and 1.0 for percentages |
|---|
| 92 |
buffer - Number of records that will be stored in memory during analysis (no need to exceed the number of unique tokens) |
|---|
| 93 |
Algorithm the comparator to use for this field. Options are Exact Match, LEV, LCS, and JWC |
|---|
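Some of the rules above can be checked mechanically before a run. The sketch below is a hypothetical validator, not something the module provides. It works on plain dictionaries rather than the XML itself, and it assumes that "uniquely number this starting with 1" means the blocking fields are numbered 1, 2, 3 and so on without gaps.

```python
# Comparator names accepted by the Algorithm element, as listed above.
ALGORITHMS = {"Exact Match", "LEV", "LCS", "JWC"}


def validate_rows(labels, rows):
    """Return a list of problems found; an empty list means the rows pass.

    labels: set of column labels from the Datasource element.
    rows:   list of dicts with keys "name", "block_order" (int or None)
            and "algorithm". This layout is an illustrative assumption.
    """
    problems = []
    orders = []
    for row in rows:
        if row["name"] not in labels:
            problems.append(f"row {row['name']!r} has no matching column label")
        if row["algorithm"] not in ALGORITHMS:
            problems.append(f"row {row['name']!r} uses unknown algorithm {row['algorithm']!r}")
        if row["block_order"] is not None:
            orders.append(row["block_order"])
    # Assumption: blocking fields are numbered 1..k with no gaps or repeats.
    if sorted(orders) != list(range(1, len(orders) + 1)):
        problems.append("BlockOrder values must be unique and numbered from 1")
    return problems


good = [
    {"name": "yb", "block_order": 1, "algorithm": "Exact Match"},
    {"name": "zip", "block_order": None, "algorithm": "Exact Match"},
]
print(validate_rows({"yb", "zip"}, good))
```

A row whose name is missing from the column labels, or two rows sharing the same BlockOrder, would each add an entry to the returned list.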