Anyone using JSON here?

Sentinel

Bionic Poster
I've seen Perl being used to build corporate-scale applications with GUIs and the usual backend stuff (in particular an asset management application). Very impressive. I've heard Perl doesn't scale very well, but it can be done. Is it pleasant? That's another story. :D

There was once a story that Hotmail was written in Perl. By now they would have rewritten it, but for several years they could not. If you are talking about scalability on the web, then I believe one was supposed to use mod_perl instead of loading the Perl interpreter with each request.

Anyway, I had an idea about those massive IMDB files. You know how they come in the form of:

Hitchcock <TAB> movie1
<TAB> movie2
<TAB> movie3
<blank line>


Why don't I just join the lines in the director, actor and actress files (after removing all the TV show entries)?

Hitchcock <SEP> movie1 <SEP> movie2 <SEP> ...

This way, if I "grep" for Hitchcock I get all the movies in one row, and I can then split them on the separator.
Or if I want all the actors for a movie, I can grep the actors file for the movie:

grep "Singin' in the Rain" actors.list | cut -d'<SEP>' -f1

No complex programming involved.
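
A minimal sketch of the join described above, written in Python rather than shell. It assumes the layout shown in the post (a name, a tab, a title; continuation lines that start with a tab; a blank line between people) and uses `|` as a hypothetical stand-in for <SEP>. The real list files also carry headers and extra fields that this does not handle.

```python
def join_records(lines, sep="|"):
    """Collapse 'name<TAB>title' blocks into one 'name|title|title' row each."""
    rows = []
    current = None  # [name, title, title, ...] for the person being read
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current person's block.
            if current:
                rows.append(sep.join(current))
            current = None
            continue
        name, _, title = line.partition("\t")
        if name:
            # A non-empty first field starts a new person.
            if current:
                rows.append(sep.join(current))
            current = [name.strip(), title.strip()]
        elif current is not None:
            # A leading tab means another title for the same person.
            current.append(title.strip())
    if current:
        rows.append(sep.join(current))
    return rows


sample = [
    "Hitchcock\tmovie1\n",
    "\tmovie2\n",
    "\tmovie3\n",
    "\n",
    "Kelly\tmovie2\n",
    "\tmovie4\n",
]
rows = join_records(sample)
# Everyone credited on movie2: the equivalent of the grep plus cut -f1 idea.
people = [r.split("|")[0] for r in rows if "movie2" in r.split("|")[1:]]
```

Matching on the split fields rather than on the whole row avoids a hit when the search text happens to appear inside a person's name.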
 
The scalability comment was not about web scalability. It was about project scalability. I've often heard from people who have programmed in Perl that anything beyond around 1,000 lines becomes unmanageable. Python has a much better reputation for scalability, for example.

Senti, are you thinking about performing these hard-coded joins on just a subset of the IMDB files?

Because if you are going to grep on a large file, I don't think it's going to be practical. It would take too long to run "queries" on such large files with grep, no?
 

Sentinel

Bionic Poster
Oh, yes, Perl codebases become unmanageable unless everyone follows some guidelines. E.g. the Bugzilla codebase was very good. I made a lot of customizations to it and added some modules too.

I am cutting out a lot of crap from the IMDB files. I've already deleted all TV shows from the 1 GB actors file (it is down to 284 MB). I also don't need every Tom, Dick and Suresh who has ever acted in a single movie, or all kinds of chaps with names like "Too $hort" and "40 glocc". I am deleting everyone whose name does not start with A-Z, and all actors with only one movie.
This way I cut it down drastically. (Do you know there are about 10 or 12 Gene Kellys in the actors file? LOL)

Once I convert the files to the format I mentioned, I can manipulate them easily and put them into a database, or else just once in a while do a search for the movies of an actor. The "grep" search will be a rare one; it's not like I am sitting all day long doing it. I have already inserted the JSON files into the database and that is what I will primarily use. As for the other huge IMDB files, I will see if I really need them.
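
The two pruning rules above (drop names that do not start with A-Z, drop anyone with a single movie) could look roughly like this in Python. This is only a sketch: it assumes the one-row-per-person format from earlier in the thread, with `|` as a hypothetical separator and made-up sample rows.

```python
import re

STARTS_WITH_CAPITAL = re.compile(r"[A-Z]")


def keep_row(row, sep="|"):
    """Keep a 'name|title|title...' row only if it passes both filters."""
    name, *titles = row.split(sep)
    if not STARTS_WITH_CAPITAL.match(name):
        return False  # drops names such as "40 glocc"
    return len(titles) > 1  # drops people with a single movie


rows = [
    "Kelly, Gene|movie1|movie2",
    "40 glocc|movie3|movie4",
    "Somebody, One|movie5",
]
kept = [r for r in rows if keep_row(r)]
```

Note that a first-letter test alone would still keep a name like "Too $hort", since it starts with a capital T; catching that would need a stricter pattern.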
 
@Rusty Shackleford

Does a "wireless optical" mouse imply it is bluetooth?
I was searching for a bluetooth mouse on Amazon for my Mac laptop and only one says BT; the others just say wireless. The one that says BT is more than 2x the cost of the others, so I'm wondering.

Check this link:

http://www.amazon.in/s/ref=nb_sb_ss_i_2_10?url=search-alias=computers&field-keywords=bluetooth+mouse&sprefix=bluetooth+,computers,323
Actually, it doesn't!

You need to look at the specifications and make sure it's a bluetooth device.

For example, if you go to Amazon and search "bluetooth mouse" or "bluetooth keyboard", it will give you a bunch of results for "wireless" devices. Those devices are normally not bluetooth, but use a proprietary protocol (I think Microsoft and Logitech do that). So they would be completely worthless if you want bluetooth.

Those devices normally also have a USB dongle (the receiver), whereas real bluetooth devices rarely do, since most hosts have built-in bluetooth, and when they don't, bluetooth USB adapters are readily available.

This irks me quite a bit. I've been looking for a bluetooth keyboard with a built-in trackpad or trackpoint for a while now, but I've only found the "wireless" kind. I think bluetooth is preferable for many reasons (works in multiple OSs, and doesn't need a proprietary receiver, for one).

Some Chinese guy was selling a USB/PS2 to bluetooth hub, but it was clunky and expensive.
 
I've probably used your code, Senti. I used Bugzilla heavily in a job 5 years ago. Filed a Priority 0 bug once. It was fun! :D

Yeah, if you can just put all the records into your SQLite database, that's probably the best and fastest thing to do. Quick question: Do you use some kind of graphical frontend to make queries against the database, or do you just use the command line?

12 Gene Kellys? Moses supposes that's a bit too much, no? ;)
 

Sentinel

Bionic Poster
Usually I would just write some wrapper script that takes a title (or part of one), the name of an actor, etc. and spits out the result by sending the appropriate query to SQLite.

So I just have to say: movies.sh -a kelly
or movies.sh -t 'window' to search titles. Additional options will give me brief or detailed listings, etc.
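
For illustration, here is a rough Python equivalent of such a wrapper. The `-a` and `-t` options come from the post; the `credits(actor, title)` table, the `--db` option and the `movies.db` file name are all assumptions, since the thread never shows the actual schema.

```python
import argparse
import sqlite3


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Search a movie database")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-a", "--actor", help="part of an actor's name")
    group.add_argument("-t", "--title", help="part of a title")
    parser.add_argument("--db", default="movies.db", help="SQLite file (hypothetical)")
    return parser.parse_args(argv)


def build_query(args):
    """Turn the parsed options into a parameterised SQL query."""
    if args.actor:
        return (
            "SELECT actor, title FROM credits WHERE actor LIKE ? ORDER BY title",
            (f"%{args.actor}%",),
        )
    return (
        "SELECT DISTINCT title FROM credits WHERE title LIKE ? ORDER BY title",
        (f"%{args.title}%",),
    )


def search(conn, args):
    sql, params = build_query(args)
    return conn.execute(sql, params).fetchall()


def main(argv=None):
    args = parse_args(argv)
    conn = sqlite3.connect(args.db)
    try:
        for row in search(conn, args):
            print("\t".join(row))
    finally:
        conn.close()
```

Calling `main(["-a", "kelly"])` would correspond to `movies.sh -a kelly`. SQLite's `LIKE` ignores case for ASCII letters by default, so a lowercase search term still matches capitalised names.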

When I was using Java I used to make GUI utilities with Swing and JDBC, which was excellent at exposing database metadata across a variety of databases. When I chucked Java and went into Ruby land, I could never find a decent, complete GUI toolkit. Each of them was either obsolete with a new (not backward-compatible) version coming out, or too new with hardly any features. I was also wary of getting into wxWindows, and Qt had licensing issues. There was nothing comparable to JDBC in terms of exposing metadata either. So I had to dump my ideas of porting my Swing tools to Ruby, and I have not been able to do GUI stuff since 2004. I used to love making a UI, even though I am primarily a command-line guy.

I am hoping to get back to it now and do some GUIs, perhaps with Java 8 and FX or whatever is the current thing.
 

Sentinel

Bionic Poster
This is precisely the response my bro just sent me. I want to free a USB port since I sometimes connect a second drive; I have 3 drives with movies and one is always connected. I have just 2 USB ports. My mouse uses one, and my mouse is heavily misbehaving.

My local amazon.in has just this one Logitech BT mouse. Is it okay?

http://www.amazon.in/Logitech-Bluet...id=1446614769&sr=1-1&keywords=bluetooth+mouse
 
Yes, that sounds like a good bluetooth mouse, and they say it works on Mac too.

If you find a bluetooth keyboard with an integrated trackpoint or trackpad, please let me know! :)
 

Sentinel

Bionic Poster
Haha, even I would love that. I moved to a BT keyboard which doesn't have extended keys, and the F1...F1n are the same as the Mac controls. I was used to having them separate on my old USB keyboard. The only issue is that it does not have a trackpad or trackpoint, which would be so nice.
 
I feel your pain.

Hey, look what I found on ****! It looks like a newer, more attractive bluetooth adapter for keyboards!

http://www.****.com/itm/Nulaxy-Blue...846540?hash=item33a5e90f4c:g:issAAOSwLqFV9d9B


The **** is eBay.
 

Sentinel

Bionic Poster
The IMDB data set from the FTP site is screwed. It seems all accented O's and U's are coming through as "?". So most foreign names, esp. Japanese ones, are messed up.
 

Sentinel

Bionic Poster
How are you opening that file? Can you try to change the character set and see if that helps?

I tried opening it using TextEdit.

Now I am trying MS Word, which is hanging. One file it refused to open, saying it's larger than 32 MB.
On another it is hanging since it keeps trying to paginate. I am trying to get to the end, where all the names have "?" at the start, but MS Word does not recognize Command+Opt+DownArrow for going to the end of the file. I think Word has stopped responding; I will have to kill it.
 
You can always try opening it in a browser too.
 

Sentinel

Bionic Poster
@Rusty Shackleford

I have figured out the character set encoding issue with the files I downloaded from IMDB.
My terminal setting of UTF-8 is fine. The download is fine. It is the gunzip that I used to open the gz file that corrupts the accented characters. If I double-click on the file, then OSX correctly uncompresses it, and I can see the accented characters in Terminal.

I have tried to google and figure out how gunzip can maintain the character set (the files were encoded using ISO 8859-1) but have found nothing. I wonder how you would do this on Linux. "man gunzip" does not say anything. Stack Overflow etc. also don't seem to have anything. How did you unzip the files?



Ref:
ftp://ftp.fu-berlin.de/pub/misc/movies/database/

http://www.imdb.com/help/search?domain=helpdesk_faq&index=2&file=akas&ref_=hlp_sr_1 (it talks of "9. Character Set Variants").
 
Senti, I just uncompressed the movies.list.gz file using gunzip 1.6 (gunzip -k movies.list.gz) and then I opened the file in vim and it shows all those characters just fine.

I'll upload my uncompressed file somewhere and you can download it and see if that's the problem. It's 170 MB. Are you OK with that? Let me know and I'll upload to Google Drive.
 

Sentinel

Bionic Poster
My gunzip says "Apple gzip 2".

I am fine now, no need to upload anything. I was going fine in vim and even using grep, and then suddenly the files once again began showing "?" characters.

Anyway, I have understood what to do. After downloading, it is not enough to just unzip using OSX's archive utility. I have to convert the file from iso-8859-1 to utf-8. Then it opens fine in vim and I can also use grep etc.

Do one thing. Check the encoding of the file by doing:

file -I movies.list

Does it say iso-8859-1? Or utf-8?
What vim version do you have?

But all my various utilities were showing crap, not just `vim`; even `less` and others were showing '?'s, and I refuse to change my LC_CTYPE etc. (I did make a half-hearted attempt).
 

Sentinel

Bionic Poster
Okay, it was NOT the gunzip that messed up the files.
It is just my encoding settings. I only need to run iconv on the files to change them to UTF-8, then all is fine.

I am surprised you are able to `vim` the files and see the accented chars.
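
For anyone without iconv handy, the same re-encoding can be sketched in Python. This assumes, as established in the thread, that the downloaded file is ISO 8859-1; the function names are made up for illustration.

```python
def latin1_to_utf8(src_path, dst_path):
    """Re-encode a text file from ISO 8859-1 to UTF-8, line by line."""
    with open(src_path, "r", encoding="iso-8859-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


def is_valid_utf8(path):
    """Return True if the whole file decodes cleanly as UTF-8."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            for _ in f:
                pass
        return True
    except UnicodeDecodeError:
        return False
```

Running `is_valid_utf8` before and after the conversion is a quick way to confirm that the "?" characters came from the encoding rather than from the download or the unzip step. A pure-ASCII file passes the check either way, so it only tells you something when accented characters are present.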
 
Deleted member 23235

Guest
Skimmed through this a bit... I think the gist is you want to ETL a moviesdb file and upload it into a database, to be able to query on various meta information?
My $0.02 approach (and what I do/have done a lot of) is to use perl/python, convert the input file into some uniform-ish json format, and upload into ElasticSearch...
I personally prefer perl (more experience), but python (I've written some maintenance scripts) is more OO, and looks closer to C++ (my primary lang)... and perl also has memory limitations, if you have large (>2GB) files to parse.
Then you should be able to query on a lot of different keywords, and keyIds.

bash to do ETL? Strikes me as a MacGyver exercise in how to solve a problem with the bare minimum of tools...
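
A small sketch of the "convert to uniform-ish JSON" step in Python. Elasticsearch's bulk API takes newline-delimited JSON, with an action line followed by the document itself, and the body is POSTed to the `_bulk` endpoint. The `movies` index name, the `name`/`titles` fields and the `|`-separated input rows are assumptions for illustration, not anything confirmed in this thread.

```python
import json


def to_bulk_ndjson(rows, index="movies", sep="|"):
    """Build an Elasticsearch bulk request body from 'name|title|...' rows."""
    out = []
    for row in rows:
        name, *titles = row.split(sep)
        # Action line: index the document that follows into the given index.
        out.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, keeping accented characters as-is.
        out.append(json.dumps({"name": name, "titles": titles}, ensure_ascii=False))
    # The bulk format is newline-delimited and must end with a newline.
    return "\n".join(out) + "\n"


body = to_bulk_ndjson(["Hitchcock|movie1|movie2"])
```

Sending the body is left out on purpose; any HTTP client would do, and for files of this size it would make sense to send a few thousand rows per request rather than everything at once.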
 
Yes, it is iso8859-1.

I am using vim 7.4. But that might not tell much, as when vim is compiled you can use a lot of flags for it to behave differently. You get this information from running vim --version. I personally don't know what flags would affect this.

But yeah, it looks like it's not the file, but the way your tools are opening it.
 

Sentinel

Bionic Poster
My $0.02 approach (and what I do/have done a lot of) is to use perl/python, convert the input file into some uniform-ish json format, and upload into ElasticSearch...

Thanks, that is interesting. I already have a lot of data in JSON format. I just looked up the wiki entry and it says it uses JSON. Their own website was not so helpful, but I will look into this.

Okay, found some easy tutorials to help me in ...
http://joelabrahamsson.com/elasticsearch-101/
 

Sentinel

Bionic Poster
Found some interesting utilities.

"q", which allows you to run SQL queries on flat files and CSV files. But it is slow (written in Python) compared to C programs like sed. (brew install q).

"look" and "sgrep" (sorted grep) for binary searching sorted files.

Finally, I got csvkit to install. I guess my OS was outdated and so was my Xcode. After updating both, "pip" worked fine.
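
To show the idea behind "look" and "sgrep", here is a rough Python sketch of a prefix search over sorted lines using binary search. Unlike the real tools it works on a list already loaded into memory rather than seeking within the file on disk, and the sample rows are made up.

```python
from bisect import bisect_left


def look(sorted_lines, prefix):
    """Return every line that starts with prefix, using binary search."""
    # First position whose line is >= prefix; all matches follow it directly.
    i = bisect_left(sorted_lines, prefix)
    matches = []
    while i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        matches.append(sorted_lines[i])
        i += 1
    return matches


lines = sorted([
    "Stewart, James|movie1",
    "Kelly, Grace|movie1|movie5",
    "Hitchcock|movie1|movie2",
    "Kelly, Gene|movie3|movie4",
])
found = look(lines, "Kelly")
```

The lookup only needs about log2(n) comparisons to find the first match, which is why it beats a plain grep on big files, but it only works when searching on the leading field the file is sorted by.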
 