Searchworkings.org feed

Finite State Automata in Lucene
Mike McCandless http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=467982 2012-05-15T10:51:25Z 2012-05-15T10:46:47Z
Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene. You can view the slides...
Mike McCandless 2012-05-15T10:46:47Z

Spatial Solr Plugin 1.0-RC4
Chris Male http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=23956 2012-05-15T09:21:49Z 2011-06-03T12:50:17Z
I am pleased to announce the latest release of our Spatial Solr Plugin, v1.0-RC4. This release is backwards compatible with RC3 and contains the following changes: PDF documentation has been improved to remove inconsistencies in request parameter and source code package names; SpatialFilter now includes hashCode and equals implementations, facilitating storage of the filter in...
Chris Male 2011-06-03T12:50:17Z

Lucene's TokenStreams are actually graphs!
Mike McCandless http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=463129 2012-05-03T14:47:33Z 2012-05-03T14:45:52Z
Lucene's TokenStream class produces the sequence of tokens to be indexed for a document's fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, ...
Mike McCandless 2012-05-03T14:45:52Z

Lucene has two Google Summer of Code students!
Mike McCandless http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=463110 2012-05-03T14:44:43Z 2012-05-03T14:43:12Z
I'm happy to announce that two Lucene Google Summer of Code projects were accepted for this summer! The first project (LUCENE-3312), proposed by Nikola Tanković, will separate StorableField out of IndexableField, and also fix the longstanding...
Mike McCandless 2012-05-03T14:43:12Z

Indexing your Samba/Windows network shares using Solr
Martijn van Groningen http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=24058 2012-04-24T10:54:19Z 2011-06-03T12:50:21Z
Many of JTeam's clients want to search the content of their existing network shares as part of their Enterprise Search infrastructure. Over the last couple of years, more and more people have been switching to Apache Lucene / Solr as their preferred open source search solution. However, many still have the misconception that it is not possible to index the content of other enterprise content systems, like Microsoft Sharepoint and Samba / Windows...
Martijn van Groningen 2011-06-03T12:50:21Z

There's More Lucene in Solr than You Think!
sejal korenromp 2012-04-19T13:48:50Z 2012-04-19T13:03:03Z
There is an interesting blog post about Lucene & Solr consultancy and training services and how these two technologies are perceived by different companies and their technical teams. The blog post highlights how many Solr users do not realize the importance of understanding the concepts of Lucene, and it provides some interesting examples too.
sejal korenromp 2012-04-19T13:03:03Z

On Schemas and Lucene
Chris Male http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=450623 2012-04-17T08:04:36Z 2012-04-04T08:35:14Z
One of the very first things users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types, which define, amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they...
Chris Male 2012-04-04T08:35:14Z

Apache Solr and Lucene 3.6.0 released
Chris Male 2012-04-17T07:53:31Z 2012-04-16T04:30:34Z
The Lucene PMC is pleased to announce the release of Apache Solr and Lucene 3.6.0. As this may be the last release from the 3.x line of releases, it is highly recommended that users upgrade. The releases include the new Kuromoji morphological analysis framework for Japanese, improvements to suggester implementations and query-time joining, a new SolrJ client connector based on Apache HttpComponents, and a wide variety of bug fixes. Solr and Lucene 3.6.0 can be downloaded from here and here...
Chris Male 2012-04-16T04:30:34Z

Faceting & result grouping
Martijn van Groningen http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=446861 2012-04-10T18:43:26Z 2012-03-27T11:10:47Z
Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents together with a common property and places these documents under a group. These groups are used as the hits in the search result. Result grouping and faceting are usually used together, and the results are often misunderstood. The main reason is that when using grouping, people...
Martijn van Groningen 2012-03-27T11:10:47Z

Lucene is participating in GSoC 2012
Luca Cavanna 2012-04-04T10:48:33Z 2012-04-04T08:16:37Z
The Google Summer of Code is a global program that offers students stipends to write code for open source projects. They work with the open source community to identify and fund exciting projects for the upcoming summer. As in previous years, Apache Lucene will be involved, so it's once again a great chance for students to participate in exciting open source projects. The application deadline is drawing near! If you are a student, don't miss out on this opportunity and have a look at the Lucene...
Luca Cavanna 2012-04-04T08:16:37Z

Result grouping made easier
Martijn van Groningen http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=446409 2012-04-03T09:45:18Z 2012-03-26T07:59:52Z
Lucene has had result grouping for a while now, as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors. As a Lucene user you need to use several of these Collectors in searches. However, these Collectors have many constructor arguments, so grouping can become quite cumbersome to use in pure Lucene apps. The example below illustrates this. Result grouping using the...
Martijn van Groningen 2012-03-26T07:59:52Z

Lucene Versions - Stable, Development, 3.x and 4.0
Chris Male http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=446076 2012-04-02T04:34:03Z 2012-03-25T05:11:30Z
With Solr and Lucene 3.6 soon becoming the last featureful 3.x release and the release of 4.0 slowly drawing near, I thought it might be useful just to recap what all the various versions mean to you, the user, and why two very different versions are soon going to be made available. A Brief History of Time. Prior to Solr and Lucene 3.1 and the merger of the development of both projects, each was developed along a single path. This meant that all development was done on...
Chris Male 2012-03-25T05:11:30Z

Solr will check on startup for index locks as of 3.6
Luca Cavanna 2012-03-23T14:35:45Z 2012-03-23T14:35:45Z
As of version 3.6, Solr will check on startup whether the index is locked. While the unlockOnStartup option allows the index to be unlocked automatically when it is locked, the SOLR-3156 issue was about checking on startup in order to raise an error and prevent the web application from starting in that case. In fact, if you don't use the unlockOnStartup option, you don't know that the index is locked until someone tries to add a document to the index.
Thanks to this improvement, which has been committed, you'll...
Luca Cavanna 2012-03-23T14:35:45Z

Challenges in maintaining a high performance search engine written in Java
Simon Willnauer http://www.searchworkings.org/login?p_p_id=58&p_p_lifecycle=0&p_p_mode=view&_58_redirect=http%3A%2F%2Fwww.searchworkings.org%2Fdownload%2F-%2Fcontent%2Fpremium-download%2F444870&p_p_state=normal 2012-03-21T14:10:46Z 2012-03-21T14:10:46Z
Over the last decade Apache Lucene has become the de facto standard in open source search technology. Thousands of applications, from Twitter-scale web services to computers playing Jeopardy, rely on Lucene, a rock-solid, scalable and fast information-retrieval library written entirely in Java. Maintaining and improving such a popular software library reveals tough challenges in testing, API design, data structures, concurrency and optimizations. This talk presents the most demanding technical...
Simon Willnauer 2012-03-21T14:10:46Z

Document Frequency Limited MultiTermQuerys
Chris Male http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=443956 2012-03-19T16:28:06Z 2012-03-19T08:52:57Z
If you've ever looked at user generated data such as tweets, forum comments or even SMS text messages, you'll have noticed that there are many variations in the spelling of words. In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes. Querying this kind of data is difficult, since matching only the traditional spelling of a word can lead to many valid results being missed. One way to...
Chris Male 2012-03-19T08:52:57Z

New index statistics in Lucene 4.0
Mike McCandless http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=440120 2012-03-15T10:16:58Z 2012-03-15T10:09:40Z
In the past, Lucene recorded only the bare minimum of aggregate index statistics necessary to support its hard-wired classic vector space scoring model.
Fortunately, this situation is wildly improved in trunk (to be 4.0), where we ...
Mike McCandless 2012-03-15T10:09:40Z

Using your Lucene index as input to your Mahout job - Part I
Frank Scholten http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=425600 2012-03-05T19:37:06Z 2012-02-17T10:45:12Z
This blog post shows you how to use an upcoming Mahout feature, the lucene2seq program (https://issues.apache.org/jira/browse/MAHOUT-944). This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and a MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog post I demonstrate how to use...
Frank Scholten 2012-02-17T10:45:12Z

Transactional Lucene
Mike McCandless http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=435353 2012-03-04T20:49:23Z 2012-03-04T20:45:57Z
Many users don't appreciate the transactional semantics of Lucene's APIs and how this can be useful in search applications. For starters, Lucene implements ACID properties: Atomicity: when you make changes (adding, removing documents) in an IndexWriter session, and then commit, either all (if the commit succeeds) or none (if the commit fails) of your changes will be visible, never...
Mike McCandless 2012-03-04T20:45:57Z

Different ways to make auto suggestions with Solr
Luca Cavanna http://www.searchworkings.org/c/blogs/find_entry?noSuchEntryRedirect=null&entryId=420845 2012-02-15T16:21:41Z 2012-02-06T15:03:38Z
Nowadays almost every website has a full-text search box with an auto-suggestion feature that helps users find what they are looking for by typing as few characters as possible. The example below shows what this feature looks like in Google.
It progressively suggests how to complete the current word and/or phrase, and corrects typos. That's a meaningful example which contains multi-term suggestions depending on the most popular queries, combined...
Luca Cavanna 2012-02-06T15:03:38Z

Query time join will be included in Lucene 3.6!
Martijn van Groningen 2012-02-08T13:51:33Z 2012-02-08T08:41:15Z
Yesterday query time joining was added to the stable 3x branch. This means that query time joining will be available in the join contrib when Lucene 3.6 is released. The query time joining in the stable 3x branch isn't quite the same as what is committed to trunk (Lucene 4.0). The version that is in trunk is about 3 times faster and supports joining on fields that have multiple values per document. Nonetheless the query time joining that will be included in Lucene 3.6 will...
Martijn van Groningen 2012-02-08T08:41:15Z