[index] [prev] [next] [options] [help]

eprints_tech messages

[EP-tech] this is what we did to make EPrints fully utf8 aware

From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Mon, 2 Jun 2008 16:05:49 +0200


Threading:      • This Message
             Re: [EP-tech] this is what we did to make EPrints fully utf8 aware from tdb01r AT ecs.soton.ac.uk
             Re: [EP-tech] this is what we did to make EPrints fully utf8 aware from roman.chyla AT gmail.com
             Re: [EP-tech] this is what we did to make EPrints fully utf8 aware from tajoli AT cilea.it
             Re: [EP-tech] this is what we did to make EPrints fully utf8 aware from roman.chyla AT gmail.com
             Re: [EP-tech] this is what we did to make EPrints fully utf8 aware from tajoli AT cilea.it
             Re: [EP-tech] this is what we did to make EPrints fully utf8 aware from roman.chyla AT gmail.com

*** 
http://www.eprints.org/tech.php/id/%3CEMEW-k51F5p85fb6162b8e75a1eb5ae2b72a68fdb81-ea0115e90806020705n652cadfg40060fc94d1042f6%40mail.gmail.com%3E
*** EPrints community wiki - http://wiki.eprints.org/

Good day to all,

I am behind documenting changes that I have done to Eprints in order
to fully benefit and make it compliant with UTF8 encoding.
It also seems I have lost my password and the system does not know my
e-mail address (could somebody please reset my account for the wiki?)

there is now a fully functional EPrints installations that handles
pretty decently all utf8 characters - if you think, that your
installation does that too, then please read on.

an 
example:http://dlib.lib.cas.cz/cgi/search/simple?q=%C4%8Cefel%C3%ADn&_action_search=Search&_order=bytitle&basic_srchtype=ALL&_satisfyall=ALL
(note the uppercase and lowercace differences and the fact you could
not search for uppercased forms in the default EPrints, indexing.pl is
not used for authors' names)

I believe there is a basic misunderstanding about unicode string
object inside EPrints. The operations like uc,lc and also all the
regular expression cannot work properly because the object is not
marked as utf8 string, you will find example in the indexing.pl below
- for more information, please see:
http://ahinea.com/en/tech/perl-unicode-struggle.html. And in many
places, the utf8() is called that creates the unicode string (and that
is something different than utf8 string, there is no utf8 object in
Perl). I would recommend fixing unicode at the input points, not
inside of the code at dozens of places


Here is what I had to do:

1. convert the database tables
2. fix several places inside EPrints (changes are actually very easy,
everything in the config files, included for your convenience, it will
be very very easy to use it- honestly, in my opinion, no system
administrator should ignore this problem)

I am afraid that this sollution works well, but it is a mere hack.
EPrints should look more closely at the code from the perspective of
non-latin countries and characters. From the perspective of users of
different parts of the world, of the whole world - I am confident,
that in the current state, it can't work for many of them.

Hope that helps someone,

  roman


1.DATABASE CONVERSION
--------------------------------------------------
#dump schema of the database
mysqldump --no-data --set-charset -u root -p<password> <db_name> 
> schema.sql

#dump the data, it will be actually utf8 encoded, don't be fooled be
the charset latin1 bit
mysqldump --no-create-info --skip-set-charset -u root -p<yourpassword>
--default-character-set=latin1 <db_name> > data.sql

#open the schema.sql in an editor and:
- replace all occurences of CHARSET=latin1 for CHARSET=utf8
- also change the dafault NULL charset for columns (see
http://bugs.mysql.com/bug.php?id=23073)
-- search for "varchar(255)" and replace "with varchar(255) 
CHARACTER SET utf8 "

#set the utf encoding for the data

in linux you can do: echo 'SET NAMES utf8;' | cat - data.sql > datautf.sql

#now load the edited db schema  (this will recreate the database, AND
DESTROY ALL THE DATA!!! - make sure you have them in datautf.sql)
mysql <db_name> -u root -p < schema.sql

#load the data
mysql <db_name> -u root -p < datautf.sql



2. then, in order to deal with utf8 strings, you will need Sub::Override
--------------------------------------------------
I have created my own module and placed it inside
perl_lib/FixIncredibleSoftware.pm (see below)

then in  session.pl, you need this

$c->{session_init} = sub
{
  my( $session, $offline ) =  AT _;
  $session->get_database->do("SET NAMES utf8");
  require FixIncredibleSoftware;
};

this will affect only your repository, will leave other intact, will
not change EPrints codebase (less headaches for you) and it works for
command-line programs too

The FixIncredibleSoftware.pm looks like this (perl programmers can see
that I am basically dealing with utf8 strings and my strings are
locale specific - the locale on my system is cs_CZ.UTF8, if you want
your EPrints installation to deal with your own alphabet, change your
locale environment)

package FixIncredibleSoftware;
use Sub::Override;
use vars qw /$OVERRIDE/;
use Unicode::String qw(utf8 );
$OVERRIDE = Sub::Override->new(  ); # assumes Foo::foo


$OVERRIDE->replace('EPrints::Index::add', \&add);
$OVERRIDE->replace('EPrints::MetaField::Name::get_index_codes_basic',
\&get_index_codes_basic);

print STDERR "FixIncredibleSoftware loaded\n";

#I just don't like this
$EPrints::Index::FREETEXT_CHAR_MAPPING = { };

sub add
{
	my( $session, $dataset, $objectid, $fieldid, $value ) =  AT _;

	my $field = $dataset->get_field( $fieldid );

	my( $codes, $grepcodes, $ignored ) = $field->get_index_codes(
$session, $value );

	my %done = ();

	my $keyfield = $dataset->get_key_field();

	my $indextable = $dataset->get_sql_index_table_name();
	my $rindextable = $dataset->get_sql_rindex_table_name();


	my $rv = 1;
	my $dbh = $session->get_database->{dbh};

	foreach my $code (  AT {$codes} )
	{
		next if $done{$code};
		$done{$code} = 1;
		my $sql;
		my $fieldword =
EPrints::Database::prep_value($field->get_sql_name().":$code");
		my $sth;
		$sql = sprintf ("SELECT max(pos) FROM %s where fieldword=%s",
$indextable, $dbh->quote($fieldword));
		$sth=$session->get_database->prepare( $sql );
		$rv = $rv && $session->get_database->execute( $sth, $sql );
		return 0 unless $rv;
		my ( $n ) = $sth->fetchrow_array;
		$sth->finish;
		my $insert = 0;
		if( !defined $n )
		{
			$n = 0;
			$insert = 1;
		}
		else
		{
			$sql = sprintf ("SELECT ids FROM $indextable WHERE fieldword=%s AND
pos=%d", $dbh->quote($fieldword), $n);
			$sth=$session->get_database->prepare( $sql );
			$rv = $rv && $session->get_database->execute( $sth, $sql );
			my( $ids ) = $sth->fetchrow_array;
			$sth->finish;
			my(  AT list ) = split( ":",$ids );
			# don't forget the first and last are empty!
			if( (scalar  AT list)-2 < 128 )
			{
				$sql = sprintf ("UPDATE $indextable SET ids=%s WHERE fieldword=%s
AND pos=%d", $dbh->quote("$ids$objectid:"), 
$dbh->quote($fieldword),
$n);
				$rv = $rv && $session->get_database->do( $sql );
				return 0 unless $rv;
			}
			else
			{
				++$n;
				$insert = 1;
			}
		}
		if( $insert )
		{
			$sql = sprintf ("INSERT INTO $indextable (fieldword,pos,ids )
VALUES (%s,%d,%s)", $dbh->quote($fieldword), $n,
$dbh->quote(":$objectid:"));
			$rv = $rv && $session->get_database->do( $sql );
			return 0 unless $rv;
		}
		$sql = sprintf ("INSERT INTO $rindextable
(field,word,".$keyfield->get_sql_name()." ) VALUES
('".$field->get_sql_name."',%s,%s)", $code, $objectid);
		$rv = $rv && $sth;
		return 0 unless $rv;

	}

	my $name = $field->get_name;
  my $tbl_name = $dataset->get_sql_grep_table_name;
  #print STDERR $tbl_name, "\n";

  my $sql_insert = $dbh->prepare("INSERT INTO `$tbl_name` VALUES 
(?,?,?)");
	foreach my $grepcode (  AT {$grepcodes} )
	{
		$sql_insert->execute( $objectid, $name, $grepcode );
	}
}

sub get_index_codes_basic
{
  my( $self, $session, $value ) =  AT _;
  return( [], [], [] ) unless( EPrints::Utils::is_set( $value ) );

  #this is unicode object (do they know that utf8 is inside?)
  my $f = &EPrints::Index::apply_mapping( $session, $value->{family} );
  my $g = &EPrints::Index::apply_mapping( $session, $value->{given} );

  #print STDERR "utf8() returns unicode::string object, not utf8! " .
Dumper($f);
  #print STDERR "So every operation on the string lc, uc, substr
ignores the locale\n";
  #print STDERR "And all the regular expressions like \\U \\L [a-zA-Z]
are plainly wrong\n";

  # Add a space before all capitals to break
  # up initials. Will screw up names with capital
  # letters in the middle of words. But that's
  # pretty rare.
  my $utf8_string = Encode::decode('UTF-8', $g->utf8, 0);
  my $len_g = length($utf8_string);
  my $new_g = utf8( "" );
  for(my $i = 0; $i<$len_g; ++$i )
  {
    my $s = substr( $utf8_string, $i, 1 );
    if( $s eq "\U$s" )
    {
      $new_g .= ' ';
    }
    $new_g .= $s;
  }

  my $code = '';
  my  AT r = ();
  foreach( EPrints::Index::split_words( $session, $f ) )
  {
    next if( $_ eq "" );
    my $word = lc Encode::decode('UTF-8', $_->utf8, 0);
    push  AT r, lc $word if length($word) > 2;
    $code.= "[$word]";
  }
  $code.= "-";
  foreach( EPrints::Index::split_words( $session, $new_g ) )
  {
    next if( $_ eq "" );
    my $word = lc Encode::decode('UTF-8', $_->utf8, 0);
    push  AT r, $word if length($word) > 2;
    $code.= "[$word]";
  }
  #print STDERR "--get_index_codes_basic returns:\n";
  #print STDERR Dumper( \ AT r, [$code], [] );
  #print STDERR $code, "\n";
  return( \ AT r, [$code], [] );
}

1;


I also needed to change cfg.d/indexing.pl

$c->{extract_words} = sub
{
	my( $session, $text ) =  AT _;

	# Acronym processing only works on uppercase non accented
	# latin letters. If you don't want this processing comment
	# out the next few lines.

	# Normalise acronyms eg.
	# The F.B.I. is like M.I.5.
	# becomes
	# The FBI  is like MI5
	# These are rather expensive to run, so are being commented out
	# by default.
	#my $a;
	#$text =~ s#[A-Z0-9]\.([A-Z0-9]\.)+#$a=$&;$a=~s/\.//g;$a#ge;
	# Remove hyphens from acronyms
	#$text=~ s#[A-Z]-[A-Z](-[A-Z])*#$a=$&;$a=~s/-//g;$a#ge;

	# Process string.
	# First we apply the char_mappings.
	my( $i, $len ),
	my $utext = utf8( "$text" ); # just in case it wasn't already.
	$len = $utext->length;
	my $buffer = utf8( "" );
	for($i = 0; $i<$len; ++$i )
	{
		my $s = $utext->substr( $i, 1 );
		# $s is now char number $i
		if( defined $EPrints::Index::FREETEXT_CHAR_MAPPING->{$s} )
		{
			$s = $EPrints::Index::FREETEXT_CHAR_MAPPING->{$s};
		}
		$buffer.=$s;
	}

	my  AT words =EPrints::Index::split_words( $session, $buffer );

	# Iterate over every word (bits divided by seperator chars)
	# We use hashes rather than arrays at this point to make
	# sure we only get each word once, not once for each occurance.
	my %good = ();
	my %bad = ();
	my $word;
	foreach $word (  AT words )
	{	
		# skip if this is nothing but whitespace;
    $word = Encode::decode('UTF-8', $word->utf8, 0);
		next if ($word =~ /^\s*$/);

		# calculate the length of this word
		my $wordlen = length $word;

		# $ok indicates if we should index this word or not

		# First approximation is if this word is over or equal
		# to the minimum size set in SiteInfo.
		my $ok = $wordlen >= $c->{indexing}->{freetext_min_word_size};
	
		# If this word is at least 2 chars long and all capitals
		# it is assumed to be an acronym and thus should be indexed.
		if( $word =~ m/^[A-Z][A-Z0-9]+$/ )
		{
			$ok=1;
		}

		# Consult list of "never words". Words which should never
		# be indexed.	
		if( $c->{indexing}->{freetext_stop_words}->{lc $word} )
		{
			$ok = 0;
		}
		# Consult list of "always words". Words which should always
		# be indexed.	
		if( $c->{indexing}->{freetext_always_words}->{lc $word} )
		{
			$ok = 1;
		}
	
		# Add this word to the good list or the bad list
		# as appropriate.	
		unless( $ok )
		{
			$bad{$word}++;
			next;
		}

		# Only "bad" words are used in display to the
		# user. Good words can be normalised even further.

		# non-acronyms (ie not all UPPERCASE words) have
		# a trailing 's' removed. Thus in searches the
		# word "chair" will match "chairs" and vice-versa.
		# This isn't perfect "mose" will match "moses" and
		# "nappy" still won't match "nappies" but it's a
		# reasonable attempt.
		$word =~ s/s$//;

		# If any of the characters are lowercase then lower
		# case the entire word so "Mesh" becomes "mesh" but
		# "HTTP" remains "HTTP".
		#if( $word =~ m/[a-z]/ )
		#{
		#	$word = lc $word;
		#}
    $word =~ s/\W//go;
    $word = lc $word;
		$good{$word}++;
	}
	# convert hash keys to arrays and return references
	# to these arrays.
	my(  AT g ) = keys %good;
	my(  AT b ) = keys %bad;
	return( \ AT g , \ AT b );
};


[index] [prev] [next] [options] [help]