May 8, 2008

Patient Matching, The Second Step

Posted in OpenMRS, Summer of Code at 5:54 am by nribeka

I had another discussion last Tuesday with Shaun Grannis and James Egg, and I think it went really well. This time I didn’t ask too many silly questions hehe … We clarified some more of what we want to do with my first project. There are a couple of issues that we focused on, such as how to propagate the u values from the random sampling result to the EM analysis process.

After doing some digging, I found out that the u value is saved in the non-agreement property of the MatchingConfigRow object. At the end of the random sampling calculation, the result is assigned to this non-agreement property. So now we already have the u values from the random sampling process. But how do we propagate these u values to the EM analysis process? Dig some more then …

Well, apparently the EM analysis also takes a MatchingConfig object as its parameter, and that object contains all of the MatchingConfigRow objects above. So now we need to tell the EM analysis process to use these values when the user picks random sampling. We need to add a switch that lets the EM analysis know which values to use: some default values or the values from the random sampling process.
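Here is a minimal sketch of what such a switch could look like. Row and Config are simplified stand-ins for MatchingConfigRow and MatchingConfig, and the useRandomSampling flag, the initialU method and the DEFAULT_U value are all assumptions made up for illustration, not the actual Record Linker code.

```java
import java.util.ArrayList;
import java.util.List;

public class USwitchSketch {
    // Placeholder default u value (assumption, not Record Linker's real default).
    static final double DEFAULT_U = 0.1;

    // Hypothetical stand-in for MatchingConfigRow: one row per column.
    static class Row {
        final String name;
        double nonAgreement; // the u value written by random sampling

        Row(String name, double nonAgreement) {
            this.name = name;
            this.nonAgreement = nonAgreement;
        }
    }

    // Hypothetical stand-in for MatchingConfig: holds the rows plus the switch.
    static class Config {
        final List<Row> rows = new ArrayList<>();
        boolean useRandomSampling; // the proposed switch (assumed name)
    }

    // Picks the u values the EM analysis would start from.
    static double[] initialU(Config config) {
        double[] u = new double[config.rows.size()];
        for (int i = 0; i < u.length; i++) {
            u[i] = config.useRandomSampling
                    ? config.rows.get(i).nonAgreement
                    : DEFAULT_U;
        }
        return u;
    }

    public static void main(String[] args) {
        Config config = new Config();
        config.rows.add(new Row("last_name", 0.02));
        config.useRandomSampling = true;
        System.out.println(initialU(config)[0]);
    }
}
```

Keeping the switch on the shared config object means both processes can stay unaware of each other; the EM side only reads whatever the config holds.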

Another thing that we discussed on the phone was connecting this process to the Record Linker GUI. Arghhh, I’m not good at GUI programming. I just don’t have the artistic sense to create a good GUI. But I have to give it a shot hehe …

Some terms explained:

  • Record Linker is the name of the program that I will work on. One of the capabilities of the program is combining records from different sources using statistical analysis of those records.
  • MatchingConfig is an object that stores the parameters used for analyzing those records. There are lots of parameters that need to be defined, for example where to get the records, what fields can be found in the records, etc.
  • MatchingConfigRow is an object that stores the options for matching one column in the records, for example the algorithm that will be used for the matching process. A MatchingConfig object contains a series of MatchingConfigRow objects, since a single record contains many columns.
  • The random sampling and EM analysis processes take this MatchingConfig object as their parameter. The MatchingConfig is shared by the two processes to propagate the result from random sampling to the EM analysis.

Some facts that I learned:

  • When the records come from a file, there are a few steps that need to be done before the file can be analyzed. The file is chopped so that it only includes the fields that will be used in the analysis process. After the file is chopped, it is sorted on the blocking fields using the operating system’s built-in sort function.
  • Let’s keep some facts for the upcoming posts hehe …
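To make the chop and sort step above a bit more concrete, here is a rough sketch. Note that it sorts in memory with Java instead of calling the operating system’s sort, so it only illustrates the idea and is not how Record Linker does it. The method name chopAndSort, the pipe delimiter and the sample columns are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

public class ChopAndSortSketch {
    // Keeps only the columns listed in keep, then sorts the chopped lines on
    // one blocking column. blockingPos is an index into the kept columns.
    static List<String> chopAndSort(List<String> lines, int[] keep,
                                    int blockingPos, String delimiter) {
        List<String[]> chopped = new ArrayList<>();
        for (String line : lines) {
            // Quote the delimiter so characters like "|" are taken literally.
            String[] cols = line.split(Pattern.quote(delimiter), -1);
            String[] kept = new String[keep.length];
            for (int i = 0; i < keep.length; i++) {
                kept[i] = cols[keep[i]];
            }
            chopped.add(kept);
        }
        // In-memory sort (swapped in for the OS sort mentioned above).
        chopped.sort(Comparator.comparing((String[] row) -> row[blockingPos]));

        List<String> result = new ArrayList<>();
        for (String[] row : chopped) {
            result.add(String.join(delimiter, row));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical layout: id | last name | first name | birth year
        List<String> lines = Arrays.asList(
                "1|Smith|John|1970",
                "2|Adams|Mary|1980");
        // Keep last name and birth year, block on last name.
        System.out.println(chopAndSort(lines, new int[] {1, 3}, 0, "|"));
    }
}
```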

Any questions? I hope I didn’t miss anything …

April 30, 2008

Patient Matching, The First Step

Posted in OpenMRS, Summer of Code at 6:50 am by nribeka

My first phone discussion about my project with my mentors, Shaun Grannis and James Egg, went well. Shaun and James explained the project to me in detail and I think the project is really interesting. I asked a couple of stupid questions that were not related to the project though, sorry for that Shaun and James hehe …

My first project is to implement a fully functional random sample analyzer that calculates the rate of random agreement among corresponding pairs of records between two data sources. This rate will replace the u rate (the field agreement rate among pairs that are truly non-matched) that comes from the Expectation Maximization analyzer. To get a better overview of the linkage process and the rationale behind it, you should read this publication about record linkage. If you want to know more about the Expectation Maximization algorithm, you can read the wiki or some other journals and publications.

The process for generating the u value for each column is as follows:

  • Generate two arrays of Record, each with the maximum sampling size
  • Take one Record from each array at a time and do the following:
    • For each demographic field in the Record, compare the two values using the selected String matching algorithm (Jaro-Winkler, Levenshtein, Longest Common Substring or Exact Match)
    • If the values from both Records match each other, increment the match count of the current demographic field.
  • Repeat the above process until all records have been paired and examined
  • Calculate the u value for each demographic field (its match count divided by the number of pairs examined) and set the new u value on the MatchingConfig object.
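The steps above can be sketched roughly like this. It is only an illustration under a few assumptions: records are plain maps instead of Record objects, the comparison is the Exact Match option only, and the names estimateU and record are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RandomSampleSketch {
    // Builds a tiny record; a stand-in for the real Record object.
    static Map<String, String> record(String first, String last, String sex) {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("first", first);
        r.put("last", last);
        r.put("sex", sex);
        return r;
    }

    // Pairs the i-th record of each sample, counts exact agreements per field,
    // and returns agreements divided by pairs as the u estimate for each field.
    static Map<String, Double> estimateU(List<Map<String, String>> sampleA,
                                         List<Map<String, String>> sampleB,
                                         List<String> fields) {
        int pairs = Math.min(sampleA.size(), sampleB.size());
        Map<String, Integer> agree = new LinkedHashMap<>();
        for (String field : fields) {
            agree.put(field, 0);
        }
        for (int i = 0; i < pairs; i++) {
            for (String field : fields) {
                String a = sampleA.get(i).get(field);
                String b = sampleB.get(i).get(field);
                if (a != null && a.equals(b)) { // Exact Match only
                    agree.merge(field, 1, Integer::sum);
                }
            }
        }
        Map<String, Double> u = new LinkedHashMap<>();
        for (String field : fields) {
            u.put(field, pairs == 0 ? 0.0 : (double) agree.get(field) / pairs);
        }
        return u;
    }

    public static void main(String[] args) {
        List<Map<String, String>> a = new ArrayList<>();
        List<Map<String, String>> b = new ArrayList<>();
        a.add(record("John", "Smith", "M"));
        a.add(record("Mary", "Jones", "F"));
        b.add(record("John", "Brown", "M"));
        b.add(record("Mark", "Jones", "M"));
        System.out.println(estimateU(a, b, Arrays.asList("first", "last", "sex")));
    }
}
```

In the real analyzer the resulting values would be written back to the config so the EM analysis can pick them up; here they are simply returned as a map.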

I still need to dig more into the first step and see how each data source is read and converted into Record objects. What do you think about the above process? Did I miss anything?